Sparsity

Variable selection in high-dimensional models can improve both interpretability and performance. One way to achieve sparsity is through penalized minimization problems in which the $L_1$ norm of some set of parameters is bounded. This concept has received considerable attention in fields such as statistical learning, data mining, and signal processing.

 


SPCA (Sparse PCA)

In sparse principal component analysis the aim is to estimate a PCA-like model in which sparsity is induced on the model parameters: the scores or the loadings. Sparsity in both modes is here referred to as co-clustering.

SPCA, with a sparsity constraint on the loadings, can be formulated as:

$$ \mathrm{min}(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2) $$

subject to

$$ \left|\mathbf{p}_i\right|_1 \leq c, \quad \mathrm{for} \ i=1,\ldots,k $$

where $\mathbf{X}$ $(n \times p)$ is the data matrix, $\mathbf{T}$ $(n \times k)$ is the score matrix and $\mathbf{P} = [\mathbf{p}_1 \ldots \mathbf{p}_k]$ $(p \times k)$ is the loading matrix, with $k$ the number of components. $\left|\cdot\right|_F$ is the Frobenius norm (the square root of the sum of the squared elements).

This can, for each component, be formulated as a single objective function (Lagrangian form):

$$ \mathrm{min}(\left|\mathbf{X} - \mathbf{t}\mathbf{p}^T\right|_F^2 + \lambda\left|\mathbf{p}\right|_1) $$

Choosing an appropriate value for $\lambda$ (or $c$) leads to a sparse solution, i.e. several of the loadings are exactly zero.

Note: there is no simple translation from $c$ to $\lambda$, and for estimation of more than one component one has to choose whether to fix $c$ or $\lambda$.

 

Download algorithm

An algorithm for estimation of an SPCA model can be downloaded here. Details are described in the section Algorithmic details below.

 

Example

Twelve samples of jam, corresponding to a full factorial design of four locations and three harvest times, are evaluated on six instrumental parameters and 13 sensory attributes. Data can be downloaded here.

In the figure below, results from PCA (stems) and SPCA (bars) are compared.

Figure: Score plots (A and B), with color reflecting harvest time (component 1) and location (component 2), and loading plots (C and D) for PCA and SPCA.

 

Interpreting the loadings (C and D) reveals that SPCA deselects some of the variables with low influence, and thereby leads to more interpretable components.

 


 


Algorithmic details

Solving a constrained optimization problem like:

$$ \mathrm{min}\left(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2\right) \quad \mathrm{s.t.} \quad \left|\mathbf{p}_i\right|_1 \leq c, \ \mathrm{for} \ i=1,\ldots,k $$

is a nested convex problem. This means that for fixed loadings ($\mathbf{P}$), calculating the scores ($\mathbf{T}$) is a convex problem, and estimating the loadings for fixed scores ($\mathbf{T}$) is a componentwise convex problem. This advocates an iterative, alternating procedure. In the case of nested components (estimated one at a time, with deflation) the procedure looks like this:

 

Input: $\mathbf{X}$ - data, $k$ - number of components, $\lambda$ - penalty from the Lagrangian formulation.

For j = 1, ..., k

Initialize with the ordinary one-component PCA solution $\mathbf{X} \approx \mathbf{t}\mathbf{p}^T$

For i = 1, 2, ...

  1. Estimate $\mathbf{t}$ when $\mathbf{p}$ is known: $\mathbf{t} = \mathbf{X}\mathbf{p}$
  2. Estimate $\mathbf{p}$ when $\mathbf{t}$ is known: by soft thresholding of the least squares estimate

Repeat 1-2 until convergence

Deflate: $\mathbf{X} = \mathbf{X} - \mathbf{t}\mathbf{p}^T$

Repeat until all $k$ components are estimated

 

Due to its similarity to the Nonlinear Iterative PArtial Least Squares (NIPALS) algorithm, and the fact that the soft thresholding works on the least squares estimates, a name could be Nonlinear Iterative PArtial Shrunken Least Squares (NIPASLS).
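As a rough illustration of this deflation-based procedure, a minimal Matlab sketch could look as follows. This is not the downloadable spca.m; the function name, the convergence criterion and the normalization of the loading vector (used here to fix the scale indeterminacy) are assumptions of the sketch.

    % Minimal sketch of the deflation-based (NIPASLS-type) procedure.
    % X: n x p (centered) data matrix, k: number of components,
    % lambda: L1 penalty from the Lagrangian formulation.
    function [T, P] = nipasls_sketch(X, k, lambda)
    [n, p] = size(X);
    T = zeros(n, k);
    P = zeros(p, k);
    for j = 1:k
        % Initialize with the ordinary one-component PCA (SVD) solution
        [u, s, v] = svds(X, 1);
        t = u * s;
        pj = v;
        for i = 1:500
            p_old = pj;
            % 1. Estimate t when p is known
            t = X * pj;
            % 2. Estimate p when t is known: soft thresholding of the
            %    least squares estimate
            p_ls = X' * t / (t' * t);
            pj = sign(p_ls) .* max(abs(p_ls) - lambda, 0);
            if norm(pj) > 0
                pj = pj / norm(pj);  % assumed normalization (scale fix)
            end
            if norm(pj - p_old) < 1e-8
                break
            end
        end
        t = X * pj;
        T(:, j) = t;
        P(:, j) = pj;
        % Deflate before estimating the next component
        X = X - t * pj';
    end
    end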

Calculating the components based on the current residuals does not directly aim at solving $ \mathrm{min}\left(\left|\mathbf{X} - \mathbf{T}\mathbf{P}^T\right|_F^2\right) \ \mathrm{s.t.} \ \left|\mathbf{p}_i\right|_1 \leq c, \ \mathrm{for} \ i=1,\ldots,k $, as the components in SPCA are not orthogonal. A more direct approach, where the entire set of components is estimated simultaneously by iterating between scores and loadings, can hence be considered. This goes as follows:

 

Input: $\mathbf{X}$ - data, $k$ - number of components, $\lambda$ - penalty from the Lagrangian formulation.

Initialize with the ordinary $k$-component PCA solution

For i = 1, 2, ...

  1. Estimate $\mathbf{T}$ when $\mathbf{P}$ is known: $\mathbf{T} = \mathbf{X}(\mathbf{P}^T)^+$
  2. Estimate $\mathbf{P}$ when $\mathbf{T}$ is known: by componentwise soft thresholding of the least squares estimates.

Repeat 1-2 until convergence

 

$(\mathbf{P}^T)^+$ is the Moore-Penrose pseudoinverse of $\mathbf{P}^T$, so that $\mathbf{T}$ is the least squares estimate of the scores for the given loadings.

 

Due to its similarity to the Alternating Least Squares (ALS) algorithm, a name could be Alternating Shrunken Least Squares (ASLS).
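Again as a rough illustration only (not the downloadable spca.m), a minimal Matlab sketch of this simultaneous procedure could look as follows; the function name, the convergence criterion and the column-wise normalization of the loadings are assumptions of the sketch.

    % Minimal sketch of the simultaneous (ASLS-type) procedure.
    % X: n x p (centered) data matrix, k: number of components,
    % lambda: L1 penalty from the Lagrangian formulation.
    function [T, P] = asls_sketch(X, k, lambda)
    % Initialize with the ordinary k-component PCA (SVD) solution
    [U, S, V] = svds(X, k);
    T = U * S;
    P = V;
    for i = 1:500
        P_old = P;
        % 1. Estimate T when P is known: least squares via the pseudoinverse
        T = X * pinv(P');
        % 2. Estimate P when T is known: componentwise soft thresholding
        %    of the least squares estimates
        P_ls = X' * T / (T' * T);
        P = sign(P_ls) .* max(abs(P_ls) - lambda, 0);
        % Assumed column-wise normalization to fix the scale indeterminacy
        for j = 1:k
            if norm(P(:, j)) > 0
                P(:, j) = P(:, j) / norm(P(:, j));
            end
        end
        if norm(P - P_old, 'fro') < 1e-8
            break
        end
    end
    T = X * pinv(P');
    end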

 

As none of these problems is convex, a solution might be a local minimum. In the Matlab algorithm spca.m, both options (ASLS and NIPASLS) are available.

 

Soft thresholding

Estimation of $\mathbf{p}$ when $\mathbf{t}$ is known corresponds to solving $ \mathrm{min}(\frac{1}{2}\left|\mathbf{X} - \mathbf{t}\mathbf{p}^T\right|_F^2 + \lambda\left|\mathbf{p}\right|_1) $ with respect to $\mathbf{p}$. Setting $\mathbf{p}_{LS}^T = (\mathbf{t}^T\mathbf{t})^{-1}\mathbf{t}^T\mathbf{X}$ as the least squares solution, the solution is a soft thresholded version of $\mathbf{p}_{LS}$ by $\lambda$:

$$ p_j = S(p_{LS,j},\lambda), \quad \mathrm{where} \ S(x,\lambda) = \mathrm{sign}(x)\max(|x|-\lambda, 0) $$

Below, it is exemplified how this operator works.
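As a small numerical illustration in Matlab (the numbers are made up for the example), coefficients with magnitude below $\lambda$ are set exactly to zero, while the remaining ones are shrunk towards zero by $\lambda$:

    % Soft thresholding operator: S(x, lambda) = sign(x) .* max(|x| - lambda, 0)
    soft = @(x, lambda) sign(x) .* max(abs(x) - lambda, 0);

    p_ls = [2.3, -0.4, 0.1, -1.7, 0.6];  % hypothetical least squares loadings
    lambda = 0.5;
    soft(p_ls, lambda)
    % ans = 1.8  0  0  -1.2  0.1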


Author: Morten Arendt Rasmussen, mortenr@life.ku.dk.