When the number of explanatory variables is large, in particular when it exceeds the number of observations, i.e. $p \gg n$ (with $p$ the number of explanatory variables and $n$ the number of observations), we cannot estimate the parameters.
In order to estimate the parameters, we can add penalties (additional terms) to the estimation criterion, as in the sketch below.
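As a minimal illustration of a penalized estimator (my own sketch, not part of the course; the ridge penalty, the value of \texttt{lam} and the simulated data are assumptions), the parameters can still be estimated when $p \gg n$ by adding an $\ell_2$ penalty to the least-squares criterion:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # p >> n: ordinary least squares is not identifiable
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2., -1., 3., 0.5, -2.]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0                           # penalty weight (illustrative value)
# Ridge estimator: beta_hat = (X'X + lam * I)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_hat[:5])                 # penalized estimates of the first coefficients
\end{verbatim}
Here the penalty makes $\X^T\X + \lambda I_p$ invertible even though $\X^T\X$ is singular when $p > n$.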
If $H = \X(\X^T\X)^{-1}\X^T$ is the projection matrix onto the subspace generated by the columns of $\X$, then $H\Y$ is the projection of $\Y$ on this subspace, which corresponds to $\X\hat{\beta}$.
\caption{Orthogonal projection of $\Y$ on the plane generated by the columns of $\X$. $\color{blue}a$ corresponds to $\norm{\X\hat{\beta}-\bar{\Y}}^2$, $\color{blue}b$ corresponds to $\norm{\hat{\varepsilon}}^2=\norm{\Y-\X\hat{\beta}}^2$ and $\color{blue}c$ corresponds to $\norm{\Y-\bar{\Y}}^2$.}
Covariance is very sensitive to the scale of the variables. For instance, if we measure a distance in millimeters, the covariance is larger than if the same distance is expressed in meters. The correlation coefficient, which is a sort of normalized covariance, is therefore useful to compare values.
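A quick numerical check of this scale effect (illustrative sketch with simulated data; \texttt{numpy} and the variable names are assumptions): converting a distance from meters to millimeters multiplies the covariance by 1000, while the correlation is unchanged.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
dist_m = rng.normal(10, 2, size=100)           # distance in meters
weight = 3 * dist_m + rng.normal(size=100)     # a variable related to the distance

dist_mm = dist_m * 1000                        # same distance in millimeters
print(np.cov(dist_m, weight)[0, 1])            # covariance with meters
print(np.cov(dist_mm, weight)[0, 1])           # 1000 times larger with millimeters
print(np.corrcoef(dist_m, weight)[0, 1])       # correlation: identical...
print(np.corrcoef(dist_mm, weight)[0, 1])      # ...whatever the unit
\end{verbatim}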
Let $\mathbf{Z}$ be a Gaussian vector: $\mathbf{Z}\sim\Norm_n(0_n, I_n)$.
\begin{itemize}
\item If $V_1, V_2$ are orthogonal subspaces of $\RR[n]$ with dimensions $n_1, n_2$ such that
\[
\RR[n] = V_1 \overset{\perp}{\oplus} V_2.
\]
\item If $Z_1, Z_2$ are orthogonal of $\mathbf{Z}$ on $V_1$ and $V_2$ i.e. $Z_1=\Pi_{V_1}(\mathbf{Z})=\Pi_1\Y$ and $Z_2=\Pi_{V_2}(\mathbf{Z})=\Pi_2\Y$...
Why can't we use the following model to test each of the parameter values (here for $X_2$)?
\[
Y_i = \theta_0 + \theta_1 X_{2i} + \varepsilon_i
\]
We cannot use such a model: we would probably run into a confounding factor. Even if we are only interested in the relationship between $X_2$ and $Y$, we have to fit the whole model.
\begin{example}[Confounding parameter]
Let $Y$ be a variable related to lung cancer. Let $X_1$ be the smoking status, and $X_2$ the variable `alcohol' (for instance the quantity of alcohol drunk per week).
If we only fit the model $\M: Y_i =\theta_0+\theta_1 X_{2i}+\varepsilon_i$, we could conclude that there is a relationship between alcohol and lung cancer, because alcohol consumption and smoking are strongly related. If we had fitted the model $\M: Y_i =\theta_0+\theta_1 X_{1i}+\theta_2 X_{2i}+\varepsilon_i$, we might well have found no significant relationship between $X_2$ and $Y$.
\end{example}
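This confounding effect can be reproduced by simulation. The sketch below is illustrative (the use of \texttt{statsmodels}, the coefficients and the continuous coding of the variables are assumptions): the response depends only on $X_1$, yet the marginal model wrongly finds $X_2$ significant, while the full model does not.
\begin{verbatim}
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)                          # "smoking" (continuous for simplicity)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)    # "alcohol", strongly correlated with x1
y = 2.0 * x1 + rng.normal(size=n)                # the response depends on x1 only

# Marginal model Y ~ X2: X2 looks (wrongly) associated with Y
res_marginal = sm.OLS(y, sm.add_constant(x2)).fit()
print(res_marginal.pvalues)

# Full model Y ~ X1 + X2: X2 is no longer significant
res_full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(res_full.pvalues)
\end{verbatim}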
\begin{definition}[Student law]
Let $X$ and $Y$ be two independent random variables ($X \indep Y$) such that $X \sim\Norm(0, 1)$ and $Y \sim\chi_n^2$; then
\[
T = \frac{X}{\sqrt{Y/n}}
\]
follows a Student distribution with $n$ degrees of freedom.
\end{definition}
The statistic of the test comparing a full model (with $q$ parameters) to a nested model (with $q' < q$ parameters) is
\[
F = \frac{EMS}{RMS}\underset{H_0}{\sim}\Fish(q-q'; n-q).
\]
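As a numerical illustration (my own sketch on simulated data, not taken from the course; the variable names and the use of \texttt{numpy}/\texttt{scipy} are assumptions), the $F$ statistic comparing two nested models can be computed from their residual sums of squares:
\begin{verbatim}
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only the first predictor matters

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on [1, X]."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    return np.sum((y - Xd @ beta) ** 2), Xd.shape[1]

rss_full, q = rss(X, y)                    # full model, q parameters
rss_red, q_red = rss(X[:, :1], y)          # reduced (nested) model, q' parameters

ems = (rss_red - rss_full) / (q - q_red)   # explained mean square
rms = rss_full / (n - q)                   # residual mean square
F = ems / rms
p_value = stats.f.sf(F, q - q_red, n - q)  # upper tail of Fisher(q - q', n - q)
print(F, p_value)
\end{verbatim}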
\section{Model validity}
Assumptions:
\begin{itemize}
\item$\X$ is a full rank matrix;
\item Residuals (errors) are i.i.d.: $\varepsilon\sim\Norm_n(0_n, \sigma^2 I_n)$;
\end{itemize}
We also have to look for influential variables.
\subsection{$\X$ is full rank}
To check that the rank of the matrix is $p+1$, we can compute the eigenvalues of the correlation matrix of $\X$. If there is a perfect linear relationship between two variables (two columns of $\X$), one of the eigenvalues is null. In practice, we never get an exactly null eigenvalue, so we consider the condition index $\kappa=\frac{\lambda_1}{\lambda_p}$, the ratio between the largest and the smallest eigenvalues, with $\lambda_1\geq\lambda_2\geq\ldots\geq\lambda_p$ the eigenvalues.
If all eigenvalues are different from 0, $\X^T \X$ can be inverted, but if some of them are close to 0 the variance of the estimated parameters will be large, so the estimation of the parameters will not be reliable.
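A minimal sketch of the condition index (illustrative, on simulated predictors; the names and the use of \texttt{numpy} are assumptions): the eigenvalues of the correlation matrix of the predictors are computed and the ratio of the largest to the smallest is reported.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.01 * rng.normal(size=n)                 # almost collinear with x1
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)                 # correlation matrix of the predictors
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]   # lambda_1 >= ... >= lambda_p
kappa = eigvals[0] / eigvals[-1]                    # condition index
print(eigvals, kappa)                               # a large kappa signals multicollinearity
\end{verbatim}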
\paragraph{Variance Inflation Factor}
Perform a regression of each of the predictors against the other predictors.
If there is a strong linear relationship between one predictor and the others, the coefficient of determination $R^2_j$ of this regression (the proportion of variance explained by the model) will be high, which means that there is strong multicollinearity among the predictors.
We do this for all predictors; for predictor $j = 1, \ldots, p$, the variance inflation factor is:
\[
VIF_j = \frac{1}{1-R^2_j}.
\]
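The $VIF_j$ can be computed directly from the auxiliary regressions described above. The sketch below is illustrative (pure \texttt{numpy}, simulated predictors; the helper name \texttt{vif} is an assumption):
\begin{verbatim}
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (predictors only, no intercept)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y_j = X[:, j]                                                 # regress predictor j ...
        X_j = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # ... on the others
        beta = np.linalg.lstsq(X_j, y_j, rcond=None)[0]
        resid = y_j - X_j @ beta
        r2_j = 1 - resid.var() / y_j.var()                            # R^2_j of the auxiliary fit
        vifs[j] = 1.0 / (1.0 - r2_j)
    return vifs

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)       # strongly related to x1
print(vif(np.column_stack([x1, x2, x3])))  # large VIF for the collinear columns
\end{verbatim}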
\subparagraph*{Rule}
If $VIF_j > 10$ (or, depending on the reference, $VIF_j > 100$), we conclude that there is multicollinearity.
In case of multicollinearity, we have to remove the variables one by one until there is no longer multicollinearity.
Variables have to be removed based on statistical results and through discussion with experimenters.
\subsection{Residuals analysis}
\paragraph*{Assumption}
\[
\varepsilon\sim\Norm_n(0_n, \sigma^2 I_n)
\]
\paragraph{Normality of the residuals} If the $\varepsilon_i$ ($i=1, \ldots, n$) could be observed, we could build a QQ-plot of $\varepsilon_i /\sigma$ against the quantiles of $\Norm(0, 1)$.
Only the residual errors $\hat{e}_i$ can be observed:
Let $e_i^*$ ($i=1, \ldots, n$) be the studentized residuals, considered as estimators of the $\varepsilon_i$.
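As an illustration (my own sketch on simulated data; here the internally studentized residuals $\hat{e}_i/(\hat{\sigma}\sqrt{1-h_{ii}})$ are used, which is an assumption about the exact definition adopted in the course), the studentized residuals can be compared to the quantiles of $\Norm(0, 1)$ as in a QQ-plot:
\begin{verbatim}
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat (projection) matrix
resid = y - H @ y                               # fitted residuals e_hat
sigma2_hat = resid @ resid / (n - X.shape[1])   # unbiased estimate of sigma^2
stud = resid / np.sqrt(sigma2_hat * (1 - np.diag(H)))   # studentized residuals

# QQ-plot coordinates: sorted studentized residuals against N(0, 1) quantiles
theo = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
print(np.column_stack([theo, np.sort(stud)])[:5])       # should lie close to the y = x line
\end{verbatim}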