\chapter{Regularized regressions}
Let $\Y$ be a vector of $n$ observations and $\X$ a design matrix of dimension $n \times (p+1)$.
Suppose the true model is
\[
    \Y = \X^{m^{*}} \beta^{m^{*}} + \varepsilon^{m^{*}} = \X^{*} \beta^{*} + \varepsilon^{*},
\]
where $m^{*}$ is the number of true predictors, that is, the number of predictors with non-zero coefficients.

If $p$ is large compared to $n$:
\begin{itemize}
    \item $\hat{\beta} = (\X^{T}\X)^{-1} \X^{T} \Y$ is not defined, since $\X^{T}\X$ is not invertible (see the sketch below).
\end{itemize}

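A quick numerical illustration of this point, as a minimal sketch (the data are simulated purely for illustration; \texttt{numpy} is assumed to be available):

\begin{verbatim}
# With n = 20 observations and p = 50 predictors, the (p+1) x (p+1)
# matrix X^T X has rank at most n, so it cannot be inverted.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # n x (p+1) design

gram = X.T @ X                      # (p+1) x (p+1)
print(np.linalg.matrix_rank(gram))  # at most n = 20 < p + 1 = 51
# Inverting gram is not meaningful here: it is rank deficient.
\end{verbatim}
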
\section{Ridge regression}
Instead of minimizing only the mean squared error, we minimize the following regularized criterion:
\[
    \hat{\beta}^{\text{ridge}}_{\lambda} = \argmin_{\beta \in \RR[p]} \norm{Y - X \beta}^{2} + \lambda \sum_{j=1}^{p} \beta_{j}^{2},
\]
where $\lambda$ is used to calibrate the strength of the regularization. The penalty favors solutions with small parameter values, and
\[
    \sum_{j=1}^{p} \beta_{j}^{2} = \norm{\beta}_{2}^{2}
\]
is the classical squared Euclidean norm of the vector $\beta$.

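Note that, unlike the ordinary least squares estimator, the ridge estimator always exists in closed form. If, for simplicity, the intercept is penalized together with the other coefficients, the criterion is minimized by
\[
    \hat{\beta}^{\text{ridge}}_{\lambda} = (\X^{T}\X + \lambda I)^{-1} \X^{T} \Y,
\]
which is well defined for every $\lambda > 0$, because $\X^{T}\X + \lambda I$ is positive definite even when $\X^{T}\X$ is singular.
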
\section{Cross-validation}
\subsection{Leave-one-out \textit{jackknife}}
\begin{example}
Let $\M_{0}$ be the model $Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \beta_{2} X_{2i} + \beta_{3} X_{3i} + \varepsilon_{i}$, fitted on $n = 5$ observations.
In matrix form, the model reads:
\[
    \begin{pmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \\ y_{5} \end{pmatrix}
    = \beta_{0}
    + \beta_{1} \begin{pmatrix} x_{11} \\ x_{12} \\ x_{13} \\ x_{14} \\ x_{15} \end{pmatrix}
    + \beta_{2} \begin{pmatrix} x_{21} \\ x_{22} \\ x_{23} \\ x_{24} \\ x_{25} \end{pmatrix}
    + \beta_{3} \begin{pmatrix} x_{31} \\ x_{32} \\ x_{33} \\ x_{34} \\ x_{35} \end{pmatrix}
\]

Each row below corresponds to one fit: the observations marked $\times$ are used for estimation, and the one marked with a dot is left out for validation.

\def\x{$\times$}
\begin{tabular}{ccccc}
    \toprule
    1 & 2 & 3 & 4 & 5 \\
    \midrule
    . & \x & \x & \x & \x \\
    \x & . & \x & \x & \x \\
    \x & \x & . & \x & \x \\
    \x & \x & \x & . & \x \\
    \x & \x & \x & \x & . \\
    \bottomrule
\end{tabular}
\end{example}

For each candidate value of $\lambda$, we fit the model on each of the $n$ datasets obtained by leaving one observation out, and evaluate the prediction error on the held-out observation.

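A minimal sketch of this procedure in Python (the data \texttt{X} and \texttt{y} are assumed to be given, and the intercept is penalized together with the other coefficients for simplicity):

\begin{verbatim}
# Leave-one-out selection of lambda for ridge regression (sketch).
# X is the n x (p+1) design matrix (first column of ones), y the response.
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate with all coefficients penalized.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def loo_error(X, y, lam):
    # Mean squared prediction error over the n leave-one-out splits.
    n = X.shape[0]
    errors = []
    for i in range(n):
        keep = np.arange(n) != i            # drop observation i
        beta = ridge_fit(X[keep], y[keep], lam)
        errors.append((y[i] - X[i] @ beta) ** 2)
    return np.mean(errors)

lambdas = np.logspace(-3, 3, 50)            # arbitrary grid of candidates
best_lam = min(lambdas, key=lambda lam: loo_error(X, y, lam))
\end{verbatim}
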
\subsection{K-fold cross-validation}

We split the data into $K$ subsets: each subset is left out in turn and used for validation, so we obtain as many tables as subsets.

We choose $\lambda$ such that the estimated generalization error is the smallest.

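A minimal sketch with scikit-learn (the data \texttt{X} and \texttt{y} are assumed to be given; note that scikit-learn calls the regularization parameter \texttt{alpha} and, by default, fits an unpenalized intercept):

\begin{verbatim}
# K-fold selection of lambda for ridge regression (sketch).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

lambdas = np.logspace(-3, 3, 50)             # arbitrary grid of candidates
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = [cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                          scoring="neg_mean_squared_error").mean()
          for lam in lambdas]
best_lam = lambdas[int(np.argmax(scores))]   # smallest estimated error
\end{verbatim}
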
\section{Lasso regression}

The difference from Ridge regression lies in the penalty:
\[
    \hat{\beta}_{\lambda}^{\text{lasso}} = \argmin_{\beta \in \RR[p]} \norm{Y-X\beta}^{2} + \lambda \sum_{j=1}^{p} \abs{\beta_{j}},
\]
where $\sum_{j=1}^{p} \abs{\beta_{j}} = \norm{\beta}_{1}$ is the $\ell_{1}$ norm of the vector $\beta$.

Instead of all the parameters growing smoothly as the penalty is relaxed, as in Ridge regression, the parameters enter the model one after the other as $\lambda$ decreases, and some of them are set exactly to $0$. Lasso regression can therefore be used to perform variable selection (see the sketch below).

We can use the same methods (K-fold and leave-one-out) to select the value of $\lambda$.

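A minimal sketch of this selection effect (the data \texttt{X} and \texttt{y} are assumed to be given; \texttt{LassoCV} selects the penalty, called \texttt{alpha} in scikit-learn, by K-fold cross-validation):

\begin{verbatim}
# Lasso as variable selection: coefficients that are exactly zero drop out.
import numpy as np
from sklearn.linear_model import LassoCV

model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of the kept predictors
print("chosen penalty:", model.alpha_)
print("selected predictors:", selected)
\end{verbatim}
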
\section{Elastic Net}

A combination of the Ridge and Lasso penalties:
\[
    \hat{\beta}_{\lambda_{1},\lambda_{2}}^{\text{en}} = \argmin_{\beta \in \RR[p]} \norm{Y-X\beta}^{2} + \lambda_{1} \norm{\beta}_{1} + \lambda_{2} \norm{\beta}_{2}^{2}
\]

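A minimal sketch with scikit-learn (the data \texttt{X} and \texttt{y} are assumed to be given; note that scikit-learn parameterizes the elastic net through a total penalty strength \texttt{alpha} and a mixing weight \texttt{l1\_ratio} rather than through $(\lambda_{1}, \lambda_{2})$ directly):

\begin{verbatim}
# Elastic net with both tuning parameters chosen by K-fold cross-validation.
from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print("chosen strength:", model.alpha_)
print("chosen mixing weight:", model.l1_ratio_)
\end{verbatim}
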
\begin{remark}
In the case of Lasso, Elastic Net, or Ridge regression, we can no longer perform statistical tests on the parameters.
\end{remark}