The 100th anniversary of the Moore-Penrose inverse and its role in statistics and machine learning
All men are equal, but not all matrices have inverses. For instance, rectangular matrices do not have inverses, and neither do square matrices without full rank. The matrix rights activists among mathematicians (i.e., E. H. Moore, 1920; Arne Bjerhammar, 1951; and Roger Penrose, 1955) thus stood out and spoke for these computationally unfavored matrices. Thanks to their continual efforts, every matrix finally got an inverse, dubbed the Moore-Penrose (pseudo) inverse. These previously unfavored matrices have since contributed to academia and revolutionized statistics and machine learning. To mark its 100th anniversary, let me talk, in this post, about the Moore-Penrose inverse and its applications.
The traditional inverse
For a matrix $A_{n\times n}$, if there exists a matrix $A^{-1}$ such that [ AA^{-1} = A^{-1}A = I, ] where $I$ is the identity matrix, then $A^{-1}$ is called the inverse of $A$.
This concept is taught in university and is what helps us solve linear systems. That is, for an invertible matrix $A$, if [ Ax = b, ] then we have [ x = A^{-1}b. ] Nonetheless, not all matrices are invertible; when $A$ is not invertible, the system may have multiple solutions or none at all.
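To make this concrete, here is a minimal NumPy sketch (the matrix and right-hand side are made up purely for illustration) that solves an invertible system both with the explicit inverse and with a solver:

```python
import numpy as np

# A made-up invertible 2x2 system, for illustration only.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_inv = np.linalg.inv(A) @ b     # x = A^{-1} b
x_solve = np.linalg.solve(A, b)  # preferred in practice: no explicit inverse

print(np.allclose(x_inv, x_solve))  # True
```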
The Moore-Penrose inverse
The existence of an inverse brings us great convenience, and its absence is a pain in the eye. Indeed, sometimes we desperately want an inverse, just like some of us desperately want a boyfriend or a girlfriend. To help those lonely matrices, E. H. Moore (1920) generalized the concept of the inverse and hence helped every matrix find its match, dubbed the Moore-Penrose inverse.
Definition
For a rectangular matrix $A_{m \times n}$, its Moore-Penrose inverse is defined as a matrix $A^+_ {n \times m}$ satisfying all of the following four criteria:
- $AA^+A=A$
- $A^+AA^+=A^+$
- $(AA^+)^T=AA^+$
- $(A^+A)^T=A^+A$
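Before moving on, here is a quick numerical sanity check, a minimal sketch using a randomly generated matrix (chosen purely for illustration), that `np.linalg.pinv` indeed satisfies all four criteria:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # an arbitrary rectangular matrix
A_pinv = np.linalg.pinv(A)

print(np.allclose(A @ A_pinv @ A, A))            # A A^+ A = A
print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))  # A^+ A A^+ = A^+
print(np.allclose((A @ A_pinv).T, A @ A_pinv))   # (A A^+)^T = A A^+
print(np.allclose((A_pinv @ A).T, A_pinv @ A))   # (A^+ A)^T = A^+ A
```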
Examples
The above equations look frightening, so let us see some examples first. Obviously, an inverse is itself a Moore-Penrose inverse. Another less trivial example takes the form $(A^TA)^{-1}A^T$. Does it look familiar? Yes, this is the operator that maps $y$ to the least-squares solution. Indeed, when $A$ has full column rank and $m>n$ (i.e., $A$ is tall and thin), we have [ A^+ = (A^TA)^{-1}A^T. ]
The next time you are asked in an interview about the least-squares problem [ \min_{\beta} \lVert y - X\beta \rVert_2^2, ] you can show off and write [ \hat{\beta} = X^+ y, ] leaving the interviewer with a troubled look. Also, remember: true men do not give explanations.
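If you would rather verify than show off, here is a minimal sketch (random full-column-rank data, made up for illustration) comparing the pseudo-inverse solution with the normal-equation formula and with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))   # tall and thin; full column rank almost surely
y = rng.standard_normal(50)

beta_pinv = np.linalg.pinv(X) @ y                 # X^+ y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y    # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_pinv, beta_normal))  # True
print(np.allclose(beta_pinv, beta_lstsq))   # True
```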
Existence
You might question the existence of the Moore-Penrose inverse, just like some heartbroken young ladies desert the fairy tale and no longer believe in the existence of true love.
To prove that I am not a charlatan, let me humbly prove the existence of the Moore-Penrose inverse by constructing it. Let $A=U \Sigma V^T$ be its singular value decomposition (yes, the SVD always exists); then [ A^+=V\Sigma^+U^T. ] For a rectangular diagonal matrix such as $\Sigma$, we get its Moore-Penrose inverse by taking the reciprocal of each nonzero element on the diagonal, leaving the zeros in place, and then transposing the matrix.
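The construction translates almost verbatim into a few lines of NumPy. This is a bare-bones sketch (the tolerance for treating a singular value as zero is an arbitrary choice) rather than a production-grade implementation:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose inverse built directly from the SVD A = U S V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_plus = np.zeros_like(s)
    s_plus[s > tol] = 1.0 / s[s > tol]   # reciprocal of nonzero singular values
    return Vt.T @ np.diag(s_plus) @ U.T  # A^+ = V S^+ U^T

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))  # True
```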
Uniqueness
So true love exists, but is it unique? Is the Moore-Penrose inverse unique? Yes, it is: there is precisely one matrix $A^+$ satisfying all four conditions of the definition.
Moreover, the Moore-Penrose inverse of the Moore-Penrose inverse is the original matrix. That is, [ (A^+)^+ = A. ] You are the true love of your true love; the one you love loves you back. No love triangles or polygamy here.
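A one-line numerical check of this involution property, with a random matrix chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))

# Taking the Moore-Penrose inverse twice recovers the original matrix.
print(np.allclose(np.linalg.pinv(np.linalg.pinv(A)), A))  # True
```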
By the way, the uniqueness comes from imposing the last two conditions on top of the first two. A matrix satisfying only the first condition is known as a generalized inverse; if it also satisfies the second condition, it is called a generalized reflexive inverse. Neither is unique in general.
Applications of the Moore-Penrose inverse
E. H. Moore built a utopia and made matrices fall in love again. In this section, let us see the power of love by visiting some applications in statistics and machine learning.
Linear least squares
As previously mentioned, the least-squares estimate is $\hat{\beta} = X^+ y$.
Minimum norm solution to a linear system
In the previous case, $X$ is a tall and thin matrix (i.e., $m>n$), whereas in this case $X$ is a short and fat matrix (i.e., $m<n$). Needless to say, the linear system $X\beta=y$ is under-determined. In this case, the solution $X^+y$ is the one with the minimum Euclidean norm. Also, see Bartlett et al. (2019)’s paper Benign Overfitting in Linear Regression for recent developments.
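Here is a small sketch (a random under-determined system, made up for illustration) that compares the pseudo-inverse solution with another particular solution and confirms it has the smaller Euclidean norm:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 8))   # short and fat: m < n, under-determined
y = rng.standard_normal(3)

beta_pinv = np.linalg.pinv(X) @ y   # minimum-norm solution

# Another particular solution: use only the first three coordinates.
beta_sub = np.zeros(8)
beta_sub[:3] = np.linalg.solve(X[:, :3], y)

print(np.allclose(X @ beta_pinv, y), np.allclose(X @ beta_sub, y))  # True True
print(np.linalg.norm(beta_pinv) <= np.linalg.norm(beta_sub))        # True
```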
Obtain all solutions of a linear system
If the linear system $X\beta=y$ has any solutions, they are all given by [ \beta = X^+y + (I-X^+X)w ] for an arbitrary vector $w$.
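A quick check of this formula, with a random system and random choices of $w$ (all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 8))
y = rng.standard_normal(3)
X_pinv = np.linalg.pinv(X)

# Every choice of w yields a valid solution; w = 0 gives the minimum-norm one.
for _ in range(3):
    w = rng.standard_normal(8)
    beta = X_pinv @ y + (np.eye(8) - X_pinv @ X) @ w
    print(np.allclose(X @ beta, y))   # True on every iteration
```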
Chi-squared test
We have all been taught in statistics class how to use Student’s t-test for the mean of a single random variable, such as the height of a population. However, sometimes we are asked to test the means of several random variables (i.e., a random vector) in one fell swoop. For instance, economists may want to test the averages of GDP, the inflation rate, and the unemployment rate all at once. In this case, we need the Chi-squared test, where we construct a statistic of the form $x^TAx$. Indeed, this $A$ is taken to be the Moore-Penrose inverse of the covariance matrix.
Proof. Let $x$ be a multivariate Gaussian random vector with zero mean vector and covariance matrix $\Sigma$. That is, $\mathbb{E}[x]=0$ and $\mathbb{E}[xx^T]=\Sigma$.
Obviously, $\Sigma$ is a positive semi-definite matrix, so it is orthogonally diagonalizable. That is, there exists an orthogonal matrix $P$ such that [ \Sigma = P^T D P = P^T \sqrt{D}\sqrt{D} P, ] where $D$ is diagonal with nonnegative entries and $\sqrt{D}$ is the matrix obtained by taking the elementwise square root of $D$.
To prove that $x^T \Sigma^+ x$ follows a $\chi^2$ distribution, we first show that it can be expressed as the squared norm of a multivariate Gaussian random vector. Indeed, [ x^T \Sigma^+ x = x^T P^T \sqrt{D}^+ \sqrt{D}^+ P x = (\sqrt{D}^+ P x)^T (\sqrt{D}^+ P x), ] where the first equality holds because, by the SVD construction of the Moore-Penrose inverse, $\Sigma^+ = P^T D^+ P$ and $D^+ = \sqrt{D}^+ \sqrt{D}^+$. Since a linear transform of a Gaussian vector is still a Gaussian vector, $\sqrt{D}^+ P x$ is a Gaussian vector, and it retains the zero mean.
Then, we need to show that this Gaussian random vector has an idempotent covariance matrix. \[ \begin{align} \mathbb{E}[(\sqrt{D}^+ P x) ( x^T P^T \sqrt{D}^+)] & = \sqrt{D}^+ P \mathbb{E}[xx^T] P^T \sqrt{D}^+ \newline & = \sqrt{D}^+ P \Sigma P^T \sqrt{D}^+ \newline & = \sqrt{D}^+ D \sqrt{D}^+. \end{align} \] The last term is a diagonal matrix with entries 0’s or 1’s, and it is hence idempotent. Moreover, the corresponding $\chi^2$ distribution has rank($\Sigma$) degrees of freedom.
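To close the argument with a numerical illustration, here is a hedged Monte Carlo sketch (the rank-deficient covariance matrix and the sample size are arbitrary choices): the statistic $x^T\Sigma^+x$ should have mean rank($\Sigma$) and variance 2 rank($\Sigma$), matching a $\chi^2$ distribution with rank($\Sigma$) degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(6)

# A rank-deficient covariance matrix: Sigma = B B^T with rank 2.
B = rng.standard_normal((4, 2))
Sigma = B @ B.T
rank = np.linalg.matrix_rank(Sigma)   # 2

Sigma_pinv = np.linalg.pinv(Sigma)
x = rng.multivariate_normal(np.zeros(4), Sigma, size=200_000)
stat = np.einsum("ij,jk,ik->i", x, Sigma_pinv, x)   # x^T Sigma^+ x per sample

# A chi-squared(k) distribution has mean k and variance 2k.
print(rank, stat.mean(), stat.var())   # mean approx 2, variance approx 4
```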
Summary
In this post, I introduced the Moore-Penrose inverse, its existence and uniqueness, as well as some examples. I also showed four applications in statistics and machine learning:
- Least squares
- Minimum norm solution to a linear system
- Obtain all solutions of a linear system
- Chi-squared test