\chapter{Linear Model}

\section{Simple Linear Regression}
\[
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
\]
\[
\Y = \X \beta + \varepsilon.
\]
\[
\begin{pmatrix}
Y_1 \\
Y_2 \\
\vdots \\
Y_n
\end{pmatrix}
=
\begin{pmatrix}
1 & X_1 \\
1 & X_2 \\
\vdots & \vdots \\
1 & X_n
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\
\beta_1
\end{pmatrix}
+
\begin{pmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_n
\end{pmatrix}
\]
\paragraph*{Assumptions}
\begin{enumerate}[label={\color{primary}{($A_\arabic*$)}}]
\item $\varepsilon_i$ are independent;
\item $\varepsilon_i$ are identically distributed;
\item $\varepsilon_i \sim \Norm(0, \sigma^2)$, with the same variance $\sigma^2$ for all $i$ (homoscedasticity).
\end{enumerate}
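As an illustration, here is a minimal sketch (in Python with NumPy, which is not part of the course material; the numerical values are hypothetical) that simulates data satisfying assumptions $(A_1)$--$(A_3)$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, chosen only for this illustration
beta0, beta1, sigma = 1.0, 2.0, 0.5
n = 100

X = rng.uniform(0, 10, size=n)         # explanatory variable X_i
eps = rng.normal(0.0, sigma, size=n)   # (A1)-(A3): i.i.d. N(0, sigma^2) errors
Y = beta0 + beta1 * X + eps            # Y_i = beta_0 + beta_1 X_i + eps_i
\end{verbatim}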
\section{Generalized Linear Model}
\[
g(\EE(Y)) = \X \beta
\]
with the link function $g$ being, for instance:
\begin{itemize}
\item Logistic regression: $g(v) = \log\left(\frac{v}{1 - v}\right)$, e.g.\ for Boolean (binary) responses;
\item Poisson regression: $g(v) = \log(v)$, e.g.\ for count (discrete) variables.
\end{itemize}
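To make the role of the link function concrete, here is a minimal sketch (Python with NumPy assumed; the function names are ours) of the two links above and of their inverses, which map the linear predictor $\X \beta$ back to the scale of $\EE(Y)$.
\begin{verbatim}
import numpy as np

def logit(v):
    """Logistic link g(v) = log(v / (1 - v)), for v in (0, 1)."""
    return np.log(v / (1.0 - v))

def inv_logit(eta):
    """Inverse logistic link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))

def log_link(v):
    """Poisson link g(v) = log(v), for v > 0."""
    return np.log(v)

def inv_log(eta):
    """Inverse Poisson link: maps the linear predictor to a positive mean."""
    return np.exp(eta)

# g(E(Y)) = X beta  <=>  E(Y) = g^{-1}(X beta)
eta = np.array([-1.0, 0.0, 2.0])   # example values of the linear predictor
print(inv_logit(eta))              # probabilities in (0, 1)
print(inv_log(eta))                # positive Poisson means
\end{verbatim}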
\subsection{Penalized Regression}
When the number of explanatory variables is large, in particular when $p \gg n$ (where $p$ is the number of explanatory variables and $n$ the number of observations), the parameters cannot be estimated by ordinary least squares, since $\X^T \X$ is not invertible.
In order to estimate the parameters, we can add penalty terms to the least squares criterion: Lasso regression, Elastic Net, etc.
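For instance, under a common parameterization (given here only as an illustration, not as the course's notation), the Lasso and Elastic Net estimators minimize, over $\beta$, penalized criteria of the form
\[
\norm{\Y - \X \beta}^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\quad \text{(Lasso)},
\qquad
\norm{\Y - \X \beta}^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
\quad \text{(Elastic Net)},
\]
with tuning parameters $\lambda \geq 0$ and $\alpha \in [0, 1]$; the penalty shrinks the coefficients and keeps the problem well posed even when $p > n$.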
\subsection{Statistical Analysis Workflow}
\begin{enumerate}[label={\bfseries \color{primary} Step \arabic*.}]
\item Graphical representation;
\item ...
\end{enumerate}
The model
\[
\Y = \X \beta + \varepsilon,
\]
can equivalently be written (here with $n = 4$ observations and two explanatory variables) as
\[
\begin{pmatrix}
y_1 \\
y_2 \\
y_3 \\
y_4
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} \\
1 & x_{21} & x_{22} \\
1 & x_{31} & x_{32} \\
1 & x_{41} & x_{42}
\end{pmatrix}
\begin{pmatrix}
\beta_0 \\
\beta_1 \\
\beta_2
\end{pmatrix}
+
\begin{pmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\varepsilon_3 \\
\varepsilon_4
\end{pmatrix}.
\]
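As a short sketch (Python with NumPy assumed; the numerical values are hypothetical), such a design matrix, with its leading column of ones for the intercept, can be built as follows.
\begin{verbatim}
import numpy as np

# Hypothetical values of the two explanatory variables for the n = 4 observations
x1 = np.array([0.5, 1.2, 2.3, 3.1])
x2 = np.array([1.0, 0.7, 0.4, 0.9])

# Design matrix: a column of ones (intercept), then the two covariates
X = np.column_stack([np.ones(4), x1, x2])
print(X.shape)   # (4, 3): n = 4 rows, one column per coefficient beta_0, beta_1, beta_2
\end{verbatim}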
\section{Parameter Estimation}
\subsection{Simple Linear Regression}
\subsection{General Case}
If $\X^T \X$ is invertible, the OLS estimator is
\begin{equation}
\hat{\beta} = (\X^T \X)^{-1} \X^T \Y.
\end{equation}
\subsection{Ordinary Least Squares Algorithm}
We want to minimize the distance between $\X \beta$ and $\Y$:
\[
\min_{\beta} \norm{\Y - \X \beta}^2
\]
(See \autoref{ch:elements-of-linear-algebra}).
Let $V$ denote the subspace of $\RR[n]$ spanned by the columns of $\X$.
\begin{align*}
\Rightarrow & \X \hat{\beta} = \mathrm{proj}^{V} \Y \qquad \text{where $\hat{\beta}$ is the estimator of $\beta$} \\
\Rightarrow & \forall v \in V,\, v^T \Y = v^T \mathrm{proj}^{V}(\Y) \\
\Rightarrow & \text{for each column $\X_i$ of $\X$:} \quad \X_i^T \Y = \X_i^T \X \hat{\beta} \\
\Rightarrow & \X^T \Y = \X^T \X \hat{\beta} \\
\Rightarrow & {\color{gray} (\X^T \X)^{-1}} \X^T \Y = {\color{gray} (\X^T \X)^{-1}} (\X^T \X) \hat{\beta} \\
\Rightarrow & \hat{\beta} = (\X^T \X)^{-1} \X^T \Y
\end{align*}
This formula comes from the orthogonal projection of $\Y$ onto the subspace $V$ spanned by the explanatory variables $\X$:
$\X \hat{\beta}$ is the closest point to $\Y$ in the subspace generated by $\X$.
If $H$ is the projection matrix onto the subspace generated by $\X$, then $H \Y$ is the projection of $\Y$ onto this subspace, which corresponds to $\X \hat{\beta}$.
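A minimal numerical sketch of this formula (Python with NumPy assumed; the data are simulated only for the illustration):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Simulated design matrix (intercept + two covariates) and response
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])                 # hypothetical true coefficients
Y = X @ beta + rng.normal(scale=0.3, size=n)

# OLS estimator: beta_hat = (X^T X)^{-1} X^T Y, via the normal equations
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                                   # close to (1.0, 2.0, -0.5)

# Numerically preferable equivalent
beta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
\end{verbatim}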
\section{Sum of squares}
$\Y - \X \hat{\beta} \perp \X \hat{\beta} - \bar{\Y} \One$ if $\One \in V$, so
\[
\underbrace{\norm{\Y - \bar{\Y} \One}^2}_{\text{Total SS}} = \underbrace{\norm{\Y - \X \hat{\beta}}^2}_{\text{Residual SS}} + \underbrace{\norm{\X \hat{\beta} - \bar{\Y} \One}^2}_{\text{Explained SS}}
\]
\section{Coefficient of Determination: \texorpdfstring{$R^2$}{R\textsuperscript{2}}}
\begin{definition}[$R^2$]
\[
0 \leq R^2 = \frac{\norm{\X \hat{\beta} - \bar{\Y} \One}^2}{\norm{\Y - \bar{\Y} \One}^2} = 1 - \frac{\norm{\Y - \X \hat{\beta}}^2}{\norm{\Y - \bar{\Y} \One}^2} \leq 1
\]
$R^2$ is the proportion of the variation of $\Y$ explained by the model.
\end{definition}
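A short numerical check of the sum-of-squares decomposition and of the two expressions of $R^2$ (Python with NumPy assumed; simulated data):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # model with an intercept (1 in V)
Y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
Y_bar = Y.mean()

tss = np.sum((Y - Y_bar) ** 2)        # total sum of squares
rss = np.sum((Y - Y_hat) ** 2)        # residual sum of squares
ess = np.sum((Y_hat - Y_bar) ** 2)    # explained sum of squares

print(np.isclose(tss, rss + ess))     # True: the decomposition holds
print(ess / tss, 1 - rss / tss)       # the two expressions of R^2 agree
\end{verbatim}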
\begin{figure}
\centering
\includestandalone{figures/schemes/orthogonal_projection}
\caption{Orthogonal projection of $\Y$ onto the plane spanned by $\X$. $\color{blue} a$ corresponds to $\norm{\X \hat{\beta} - \bar{\Y} \One}^2$, $\color{blue} b$ corresponds to $\norm{\hat{\varepsilon}}^2 = \norm{\Y - \X \hat{\beta}}^2$ and $\color{blue} c$ corresponds to $\norm{\Y - \bar{\Y} \One}^2$.}
\label{fig:scheme-orthogonal-projection}
\end{figure}
\begin{figure}
\centering
\includestandalone{figures/schemes/ordinary_least_squares}
\caption{Ordinary least squares and regression line with simulated data.}
\label{fig:ordinary-least-squares}
\end{figure}
\begin{definition}[Model dimension]
Let $\M$ be a model.
The dimension of $\M$ is the dimension of the subspace generated by $\X$, that is, the number of coefficients in the $\beta$ vector.
\textit{Nb.} The dimension of the model is not the total number of its parameters, as $\sigma^2$ is also a model parameter but is not counted in the dimension.
\end{definition}
\section{Gaussian vectors}
\begin{definition}[Normal distribution]
A real random variable $X$ follows a normal distribution $\Norm(m, \sigma^2)$ if it admits the density
\[
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right).
\]
\end{definition}
\begin{definition}[Gaussian vector]
A random vector $\Y \in \RR[n]$ is a Gaussian vector if every linear combination of its components follows a (univariate) normal distribution.
\end{definition}
\begin{property}
A Gaussian vector $\Y$ is characterized by its mean vector $m = \EE(\Y) = (m_1, \ldots, m_n)^T$, where $m_i = \EE(Y_i)$, and its variance-covariance matrix $\Sigma$; we write
\[
\Y \sim \Norm_n(m, \Sigma),
\]
where
\[
\Sigma = \EE\left[(\Y - m)(\Y - m)^T\right].
\]
\end{property}
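A quick numerical illustration (Python with NumPy assumed; the mean and covariance values are arbitrary): sampling a Gaussian vector and recovering $m$ and $\Sigma$ empirically.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

m = np.array([0.0, 1.0])                      # mean vector
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])                # variance-covariance matrix

# Rows of Y are independent draws of a Gaussian vector N_2(m, Sigma)
Y = rng.multivariate_normal(m, Sigma, size=100_000)

print(Y.mean(axis=0))                         # close to m
print(np.cov(Y, rowvar=False))                # close to Sigma
\end{verbatim}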
\begin{remark}
\[
\Cov(Y_i, Y_i) = \Var(Y_i)
\]
\end{remark}
\begin{definition}[Covariance]
\[
\Cov(Y_i, Y_j) = \EE\left((Y_i - \EE(Y_i))(Y_j - \EE(Y_j))\right)
\]
\end{definition}
When two variables are strongly related, the covariance is large in absolute value.
If two variables $X, Y$ are independent, then $\Cov(X, Y) = 0$ (the converse does not hold in general).
\begin{definition}[Correlation coefficient]
\[
\Cor(Y_i, Y_j) = \frac{\Cov(Y_i, Y_j)}{\sqrt{\Var(Y_i) \Var(Y_j)}} = \frac{\EE\left((Y_i - \EE(Y_i))(Y_j - \EE(Y_j))\right)}{\sqrt{\EE\left((Y_i - \EE(Y_i))^2\right) \EE\left((Y_j - \EE(Y_j))^2\right)}}
\]
\end{definition}
The covariance is very sensitive to the scale of the variables: if we measure a distance in millimetres, the covariance is larger than if the same distance is expressed in metres. The correlation coefficient, which is a normalized covariance, is therefore useful to compare values across scales.
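A small numerical check of this scale sensitivity (Python with NumPy assumed; simulated data): converting metres to millimetres inflates the covariance but leaves the correlation unchanged.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

x_m = rng.normal(size=n)                  # a distance measured in metres
y = 2.0 * x_m + rng.normal(size=n)        # a variable related to x
x_mm = 1_000.0 * x_m                      # the same distance in millimetres

print(np.cov(x_m, y)[0, 1], np.cov(x_mm, y)[0, 1])            # covariance scales by 1000
print(np.corrcoef(x_m, y)[0, 1], np.corrcoef(x_mm, y)[0, 1])  # correlation is unchanged
\end{verbatim}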
\begin{remark}
\begin{align*}
\Cov(Y_i, Y_i) & = \EE((Y_i - \EE(Y_i)) (Y_i - \EE(Y_i))) \\
& = \EE((Y_i - \EE(Y_i))^2) \\
& = \Var(Y_i)
\end{align*}
\end{remark}
\begin{equation}
\Sigma = \begin{pNiceMatrix}
\VVar(Y_1) & & & & \\
& \Ddots & & & \\
& \Cov(Y_i, Y_j) & \VVar(Y_i) & & \\
& & & \Ddots & \\
& & & & \VVar(Y_n)
\end{pNiceMatrix}
\end{equation}
\begin{definition}[Identity matrix]
\[
\mathcal{I}_n = \begin{pNiceMatrix}
1 & 0 & 0 \\
0 & \Ddots & 0 \\
0 & 0 & 1
\end{pNiceMatrix}
\]
\end{definition}
\begin{theorem}[Cochran Theorem (Consequence)]
Let $\mathbf{Z}$ be a Gaussian vector: $\mathbf{Z} \sim \Norm_n(0_n, I_n)$.
\begin{itemize}
\item If $V_1, V_2$ are orthogonal subspaces of $\RR[n]$ with dimensions $n_1, n_2$ such that
\[
\RR[n] = V_1 \overset{\perp}{\oplus} V_2,
\]
\item and if $Z_1, Z_2$ are the orthogonal projections of $\mathbf{Z}$ on $V_1$ and $V_2$, i.e.\ $Z_1 = \Pi_{V_1}(\mathbf{Z}) = \Pi_1 \mathbf{Z}$ and $Z_2 = \Pi_{V_2}(\mathbf{Z}) = \Pi_2 \mathbf{Z}$,
\item then $Z_1$ and $Z_2$ are independent, with $\norm{Z_1}^2 \sim \chi^2_{n_1}$ and $\norm{Z_2}^2 \sim \chi^2_{n_2}$
(\textcolor{red}{see the slides}).
\end{itemize}
\end{theorem}
\begin{definition}[$\chi^2$ distribution]
If $X_1, \ldots, X_n$ are i.i.d.\ $\sim \Norm(0, 1)$, then
\[
X_1^2 + \ldots + X_n^2 \sim \chi_n^2
\]
\end{definition}
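A quick empirical check (Python with NumPy assumed): the sum of $n$ squared independent $\Norm(0, 1)$ variables has mean $n$ and variance $2n$, as expected for a $\chi^2_n$ distribution.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
n = 5

Z = rng.normal(size=(100_000, n))    # many draws of (X_1, ..., X_n), i.i.d. N(0, 1)
S = (Z ** 2).sum(axis=1)             # X_1^2 + ... + X_n^2 for each draw

print(S.mean(), S.var())             # close to n = 5 and 2n = 10, the chi^2_n moments
\end{verbatim}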
\subsection{Estimator's properties}
The orthogonal projection matrix onto $V$, the subspace spanned by the columns of $\X$, is
\[
\Pi_V = \X (\X^T \X)^{-1} \X^T,
\]
so
\begin{align*}
\hat{m} = \X \hat{\beta} & = \X (\X^T \X)^{-1} \X^T \Y \\
& = \Pi_V \Y.
\end{align*}
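As a sketch (Python with NumPy assumed; simulated data), $\Pi_V$ can be computed explicitly and checked to be symmetric and idempotent, with $\Pi_V \Y = \X \hat{\beta}$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

Pi_V = X @ np.linalg.inv(X.T @ X) @ X.T        # projection matrix onto V = span(X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(Pi_V, Pi_V.T))               # symmetric
print(np.allclose(Pi_V @ Pi_V, Pi_V))          # idempotent: projecting twice changes nothing
print(np.allclose(Pi_V @ Y, X @ beta_hat))     # Pi_V Y equals the fitted values m_hat
\end{verbatim}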
According to the Cochran theorem, we can deduce that the estimator of the predicted value, $\hat{m}$, is independent of $\hat{\sigma}^2$.
All the sums of squares follow $\chi^2$ distributions:
\[
...
\]
\begin{property}
\end{property}
\subsection{Estimators' consistency}
If $q < n$,
\begin{itemize}
\item $\hat{\sigma}^2 \overunderset{\PP}{n \to \infty}{\longrightarrow} \sigma^{*2}$;
\item if $(\X^T \X)^{-1}$ ...
\item ...
\end{itemize}
We can derive statistical tests from these properties.
\section{Statistical tests}
\subsection{Student $t$-test}
\[
\frac{\hat{\theta} - \theta}{\sqrt{\frac{\widehat{\VVar}(\hat{\theta})}{n}}} \underset{H_0}{\sim} t
\]
where