Variable selection in high dimensional models can improve the interpretability and performance. One way to achieve sparsity is through penalized minimization problems with bounding of the $L_1$ norm of some parameter entity. This concept has received huge attention within fields such as statistical learning, data mining, signal processing.

SPCA *(Sparse PCA)*

In sparse principal component analysis the aim is to estimate a PCA-like model where sparsity is induced on the model parameters; scores or loadings. Sparsity in both modes is (here) refered to as coclustering.

SPCA, with sparsity constraint on loadings, can be formulated as:

$$ \mathrm{min}(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2) $$

subject to

$$ \left|\mathbf{p}_i\right|_1 \leq c, for \ i=1,...,k $$

where $\mathbf{X} (n \times p) $ is the data matrix, $\mathbf{T} (n \times k) $ is the score matrix and $\mathbf{P} = [\mathbf{p_1} \ldots \mathbf{p_k}]$ $(p \times k)$ is the loading matrix. $k$ is the number of components.$\left|\cdot\right|_F$ is the *Frobenius *norm (sum of the squared elements).

This can, for each component, be formulated as a single object function (*Lagrangian*):

$$ \mathrm{min}(\left|\mathbf{X} - \mathbf{t}\mathbf{p}^T\right|_F^2 + \lambda\left|\mathbf{p}\right|_1) $$

Choosing an appropriate value for $\lambda$ (or $c$) leads to a sparse solution, i.e. several of the loadings are exactly zero.

*OBS: *There is not an easy translation from $c$ to $\lambda$, and for estimation of more than one component, one has to choose whether to fix $c$ or $\lambda$.

An algorithm for estimation of a SPCA model can be downloaded here. Details are described in the section *Algorithms *below.

12 samples of jam corresponding to a full factorial design of four locations and three harvest times are evaluated on six instrumental parameters and 13 sensory attributes. Data can be downloaded here.

In the figure below results from PCA (stems) and SPCA (bars) are compared.

**SPCA/PCA figure **Score plot (A and B) with color reflecting *harvest time *(component 1) and *location *(component 2). Loading plot (C and D).

Interpreting the loadings (C and C) reveals that the SPCA deselects some of the low influencing variables, and thereby leeds to more intepretable components.

Rasmussen MA and Bro R (2012) A tutorial on the LASSO approach to sparse modelling, *Chemometrics and Intelligent Laboratory Systems*, 119,21:31

Shen, H. and Huang, J.Z. (2008) Sparse principal component analysis via regularized low rank matrix approximation, Journal of multivariate analysis, 99,6:1015-1034

Solving a constrained optimization problems like:

$$ \mathrm{min}(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2) s.t. \left|\mathbf{p}_i\right|_1 \leq c, for \ i=1,...,k $$

Is a nested convex problem. This means that for fixed loadings ($\mathbf{P}$), calculating the scores ($\mathbf{T}$) is a convex problem. Estimation of loadings with fixed scores ($\mathbf{T}$) is a componentwise convex problem. This advocates for an iterative alternating procedure. In case of nested components the procedure looks something like this.

**Input: **$\mathbf{X}$ - data, $k$ - number of components, $\lambda$ - penalty from the *Lagrangian *formulation.

**For ***j=1:k*

**Initialize **with ordinary $1$ component PCA solution $\mathbf{X} = \mathbf{t}\mathbf{p}^T$

**For ***i=1,...,*

- Estimate $\mathbf{t}$ when $\mathbf{p}$ is known: $\mathbf{t} = \mathbf{X}\mathbf{p}$
- Estimate $\mathbf{p}$ when $\mathbf{t}$ is known: By soft thresholding

**Repeat **1-2 until convergence

**Deflate: **$\mathbf{X} = \mathbf{X} - \mathbf{t}\mathbf{p}^T$

**Repeat **upto *k *components

Due to similarities with the Non linear Iterative PArtial Least Squares (NIPALS) algortihm, and the fact that softthresholding works on the least squares estimates, a name could be Non linear Iterative PAtial Shrunken Least Squares (NIPASLS).

Calculation of components based on the current residuals, do not directly aim at solving $ \mathrm{min}(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2) s.t. \left|\mathbf{p}_i\right|_1 \leq c, for \ i=1,...,k $, as the components in spca are not orthogonal. A more direct approach where the entire set of components are estimated simultanously by iterating between scores and loadings can hence be considered. This goes as follows:

**Input: **$\mathbf{X}$ - data, $k$ - number of components, $\lambda$ - penalty from the *Lagrangian *formulation.

**Initialize **with ordinary $k$ component PCA solution

**For ***i=1,...,*

- Estimate $\mathbf{T}$ when $\mathbf{P}$ is known: $\mathbf{T} = \mathbf{X}\mathbf{P}^+$
- Estimate $\mathbf{P}$ when $\mathbf{T}$ is known: By componentwise soft thresholding of least squares estimates.

**Repeat **1-2 until convergence

$\mathbf{P}^+$ is the pseudo inverse of $\mathbf{P}$

Due to similarities with the Alternating Least Squares (ALS) algortihm, a name could be Alternating Shrunken Least Squares (ASLS).

As none of these problems are convex, a solution might be a local minimum. In the algorithm spca.m for Matlab both options (ASLS and NIPASLS) are available

Estimation of $\mathbf{p}$ when $\mathbf{t}$ is known correspond to solving: $ \mathrm{min}(\frac{1}{2}\left|\mathbf{X} - \mathbf{t}\mathbf{p}^T\right|_F^2 + \lambda\left|\mathbf{p}\right|_1) $ with respect to $\mathbf{p}$. Seeting $\mathbf{p_{LS}} = (\mathbf{t}^T\mathbf{t})^{-1}\mathbf{t}^T\mathbf{X} $ to the least squares solution, the solution is a soft thresholded version of $\mathbf{p_{LS}}$ by $\lambda$:

$$ p_j = S(p_{LSj},\lambda) $$

Below is exemplified how this operator works.

Author: Morten Arendt Rasmussen, mortenr@life.ku.dk.