Linear Algebra and Application in Linear Regression

Matrix Operations and Linear Regression

I. Basic Matrix Differentiations

Generally, when we take the derivative of a matrix expression with respect to a vector (using the denominator layout), a transpose is applied relative to the original form. These rules can be checked numerically, as in the sketch after the list.

  • $$(AB)^T=B^TA^T$$

  • $$(ABC)^T=C^TB^TA^T$$

  • $$\frac{\partial (Ax)}{\partial x}=A^T\text{, } \frac{\partial (x^TA)}{\partial x}=A$$

  • $$\frac{\partial(b^TAx)}{\partial x} = (b^TA)^T = A^T b$$

  • $$\frac{\partial(x^TAx)}{\partial x} = (A+A^T)x = 2Ax \text{ , if A is symmetric.}$$
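Below is a minimal numerical check (an arbitrary NumPy example, not from the original notes) of the last rule, $\frac{\partial(x^TAx)}{\partial x} = (A+A^T)x$, using central finite differences:

```python
import numpy as np

# Numerical check (arbitrary example) of d(x^T A x)/dx = (A + A^T) x
# using central finite differences in the denominator layout.
rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))   # generic, non-symmetric matrix
x = rng.normal(size=n)

def f(v):
    return v @ A @ v          # scalar x^T A x

eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
analytic = (A + A.T) @ x

print(np.allclose(numeric, analytic, atol=1e-5))  # expected: True
```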

II. Linear Regression

$$y=X\beta+\epsilon$$

$$\min_{\beta} \text{ } \epsilon^T \epsilon $$

Where $\epsilon^T \epsilon = (y-X\beta)^T(y-X\beta)$.

Regard the objective $f(\beta)$ as a function of $\beta$, where $X$ and $y$ are both known:

$$f(\beta) = (y-X\beta)^T(y-X\beta)$$

$$ = (y^T-\beta^T X^T)(y-X\beta)$$

$$ = y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta$$

Apply rules above:

$$\frac{\partial f}{\partial \beta} = 0-X^Ty-(y^TX)^T + 2X^T X \beta$$

$$=-2 X^T y + 2X^T X \beta$$

Set it to $0$:

$$\frac{\partial f}{\partial \beta} = 0 \Rightarrow X^T X \beta = X^T y $$

$$\hat{\beta} = (X^T X)^{-1}X^T y$$
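As a quick sanity check, here is a minimal NumPy sketch (the synthetic data setup is an assumption for illustration) that solves the normal equations $X^T X \beta = X^T y$ and compares the result with NumPy's built-in least-squares solver:

```python
import numpy as np

# Minimal OLS sketch (synthetic data assumed for illustration):
# solve the normal equations X^T X beta = X^T y.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solving the linear system is numerically preferable to forming (X^T X)^{-1}.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # expected: True
```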

III. Ridge Regression

Instead of minimizing only $\epsilon^T \epsilon = (y-X\beta)^T(y-X\beta)$, we add a penalty on the magnitude of the coefficients:

$$\min_{\beta} \text{ } (y-X\beta)^T(y-X\beta)+\lambda \beta^T \beta$$

Similarly,

$$f(\beta) = y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta + \lambda \beta^T \beta$$

$$\frac{\partial f}{\partial \beta} = 0 - X^T y - (y^T X)^T + 2(X^T X)\beta + 2\lambda \beta$$

$$=-2X^Ty + 2(X^T X + \lambda I)\beta$$

$$\frac{\partial f}{\partial \beta} = 0 \Rightarrow \hat{\beta} = (X^TX + \lambda I)^{-1}X^T y$$
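A minimal NumPy sketch of the ridge closed form, assuming a synthetic dataset and a hypothetical value of $\lambda$:

```python
import numpy as np

# Minimal ridge sketch (synthetic data and a hypothetical lambda):
# beta_hat = (X^T X + lambda * I)^{-1} X^T y.
rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

lam = 1.0  # regularization strength (assumed value)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```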

IV. SVD

For any $n \times p$ matrix $X$, the SVD of $X$ is:

$$X= UDV^T$$

Where $U$ and $V$ are orthogonal matrices and $D$ is a diagonal matrix.

$$X^T X = (VD^TU^T)(UDV^T) = VD^TU^TUDV^T$$

Since $U$ is orthogonal: $U^T=U^{-1}$,

$$X^T X =VD^TU^{-1}UDV^T = VD^TDV^T$$

Since $D$ is diagonal:

$$X^T X = VD^2V^T$$

Hence, the columns of $V$ are the eigenvectors of $X^T X$, and $D^2$ is the diagonal matrix of the corresponding eigenvalues:

$$V = [v_1 | v_2| \dots |v_p] \text{ , } $$

$$D^2=
\begin{bmatrix}
\lambda_1 & 0 & 0\\
.. & .. & ..\\
0 & 0 & \lambda_p
\end{bmatrix}
$$

Where the $v_i$ are the eigenvectors of $X^T X$ and the $\lambda_i$ are the corresponding eigenvalues (so the singular values of $X$ are $\sqrt{\lambda_i}$).
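A minimal NumPy sketch (with an arbitrary random matrix) verifying that the squared singular values of $X$ equal the eigenvalues of $X^T X$:

```python
import numpy as np

# Minimal sketch (arbitrary random X): the squared singular values of X
# are the eigenvalues of X^T X, and V holds the corresponding eigenvectors.
rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # d: singular values (descending)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)        # eigenvalues (ascending)

print(np.allclose(d**2, eigvals[::-1]))  # expected: True
```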

V. LU Decomposition and Cholesky Decomposition

1. LU Decomposition $A=LU$:

$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix} =
\begin{bmatrix}
L_{11} & 0 & 0 \\
L_{21} & L_{22} & 0 \\
L_{31} & L_{32} & L_{33}
\end{bmatrix}
\begin{bmatrix}
U_{11} & U_{12} & U_{13} \\
0 & U_{22} & U_{23} \\
0 & 0 & U_{33}
\end{bmatrix}
$$
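A minimal sketch using SciPy's LU routine; note that `scipy.linalg.lu` performs partial pivoting, so it actually factors $A = PLU$ with a permutation matrix $P$ (pivoting is not shown in the plain $A=LU$ form above), and the example matrix is an arbitrary assumption:

```python
import numpy as np
from scipy.linalg import lu

# Minimal LU sketch. Note: scipy.linalg.lu uses partial pivoting,
# so it returns A = P @ L @ U with a permutation matrix P.
A = np.array([[4.0, 3.0, 2.0],
              [6.0, 3.0, 1.0],
              [8.0, 4.0, 5.0]])

P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))  # expected: True
print(L)  # lower triangular (unit diagonal)
print(U)  # upper triangular
```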

2. Cholesky Decomposition $A=LL^T$:

$$
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{bmatrix} =
\begin{bmatrix}
L_{11} & 0 & 0 \\
L_{21} & L_{22} & 0 \\
L_{31} & L_{32} & L_{33}
\end{bmatrix}
\begin{bmatrix}
L_{11} & L_{21} & L_{31} \\
0 & L_{22} & L_{32} \\
0 & 0 & L_{33}
\end{bmatrix}
$$

We have a general formula for the entries of the Cholesky factor $L$:

1) For elements on the diagonal:

$$L_{kk}=\sqrt{a_{kk}-\sum_{j=1}^{k-1}L_{kj}^2}$$

For example,

$$L_{33}=\sqrt{a_{33}-L_{31}^2-L_{32}^2}$$

2) For elements below the diagonal:

$$L_{ik}=\frac{1}{L_{kk}}(a_{ik}-\sum_{j=1}^{k-1}L_{ij}L_{kj})$$

For example,

$$L_{32}=\frac{1}{L_{22}}(a_{32}- L_{31}L_{21} )$$
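The two formulas can be turned into a short implementation directly. Below is a minimal sketch (the test matrix is an arbitrary symmetric positive-definite example) compared against `np.linalg.cholesky`:

```python
import numpy as np

# Minimal sketch implementing the two Cholesky formulas above directly;
# the test matrix is an arbitrary symmetric positive-definite example.
def cholesky(A):
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for k in range(n):
        # Diagonal entry: L_kk = sqrt(a_kk - sum_j L_kj^2)
        L[k, k] = np.sqrt(A[k, k] - np.sum(L[k, :k] ** 2))
        for i in range(k + 1, n):
            # Below-diagonal entry: L_ik = (a_ik - sum_j L_ij L_kj) / L_kk
            L[i, k] = (A[i, k] - np.sum(L[i, :k] * L[k, :k])) / L[k, k]
    return L

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])

L = cholesky(A)
print(np.allclose(L @ L.T, A))                # expected: True
print(np.allclose(L, np.linalg.cholesky(A)))  # expected: True
```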