Interactive introduction to multi-way analysis in MATLAB
Next Chapter: Advanced PARAFAC modeling Previous Chapter: Handling multi-way data First Chapter: Contents

BASIC PARAFAC MODELING

 Contents Data used: claus.mat contains fluorescence excitation emission data from five samples containing tryptophan, phenylalanine, and tyrosine.  Purpose: Learning to fit and interpret a PARAFAC model  Information: R. Bro, PARAFAC: Tutorial & applications. Chemom. Intell. Lab. Syst., 1997, 38, 149-171.  Prerequisites: Be sure to understand the basics of handling multi-way arrays in MATLAB (Chapter 1), and have a vague idea about what PARAFAC is.

1. Fit a PARAFAC model

Fluorescence data of dilute samples behave approximately according to a PARAFAC model (A mathematical model of fluorescence excitation-emission data). For these data it is known that a three-component PARAFAC model should be adequate since there are three analytes in the samples and each can ideally be described by one PARAFAC component.

Task: Fit a three-component PARAFAC model and investigate the model. To fit the PARAFAC model simply type

> [Factors] = parafac(X,DimX,3);

> [A,B,C] = fac2let(Factors,DimX,3); As can be seen from the plots the PARAFAC model is unique. So what does unique mean? It means that the scores and loadings can not be rotated. In PCA you can rotate the scores and loadings and still the model will fit the data just as well as before. Hence, there is no way that you can estimate for example pure spectra from PCA because there is an infinity of different scores and loadings that all give the same fit. In PARAFAC this is not so. The only thing that can be varied is the order and scale of the components. There is no way to say, from the decomposition whether component one is rightfully the first, second or fifth component. So the order of the components is unidentified. There is also no way to define the scale. This is intrinsic to bi- and multilinear models. If X = tp', then we can equally well let t be twice the original t if we just let p be half the original p.

What is the implication of these two unidentifiabilities? That there is no intrinsic order means that the components must be identified after fitting. There is no way to guarantee that, say tyrosine, will come out as component one (in second-order calibration tricks can be used to automatically identify which factor is the analyte of interest if a pure sample is included. If the scores of that sample are forced to be zero for all but one factor, then naturally this factor must be the one pertaining to the analyte of interest). That the scale is unidentified means that we can only determine spectra, concentrations etc. up to a scaling. In the above example we can agree that the second and third mode loadings are pure emission and excitation spectra. There the scores (amounts of the latent variables) must be concentrations, since the latent variables in this case are real chemical variables. So we can find the concentrations in the samples from PARAFAC. But what is the scale? mM, M?? Because of the scale indeterminacy there is no way to tell. This is natural. We can not measure the absorption of a fluid and tell the concentration from that. We need a reference providing the scale. So if the concentration is known in one sample we can scale the corresponding column so that this score equals the known concentration. Then the remaining scores of that factor will be estimates of the concentrations in the other samples.

Task: Evaluate if the scores provide reasonable estimates of the concentrations. Assume that you know the concentrations in, say, the fifth sample. If the model is unique this should provide sufficient information to estimate the concentrations in the first four samples. Use the reference concentrations (given in the matrix y and described in the file readme) to scale the scores such that estimates of the concentrations of tryptophan is obtained using the known concentration of tryptophan in the fifth sample (y(5,1)). Do the same for tyrosine and phenylalanine. This is the basis for the so-called second-order calibration.

2. Determining the proper number of components

It is difficult to decide the best rank, i.e., the column-dimension of the loading matrices, of a PARAFAC model. However, it is by far easier than for two-way problems because of the uniqueness properties of PARAFAC. In general the same tools are available as are used for two-way analysis: Looking at histograms of residuals, assessing the interpretability of the solution, comparing residual variation to the intrinsic noise-level and using resampling techniques such as cross-validation. But there are also additional very powerful tools that we will discuss after the first exercises

Task: Investigate the fits of a one-, two- and three-component PARAFAC model of the data. Try also to estimate a four-component model. Note that an increased number of iterations are mostly necessary when fitting the four-component model. This is a typical sign of either too many components being extracted or an ill-defined problem (loadings are very correlated).

An important difference between two-way PCA and the multi-way PARAFAC model is that the PARAFAC model is not nested. Hence the parameters of a three-component model do not equal the parameters of a two-component model plus one additional component. The reason for this is that the components are not required to be orthogonal; hence independent. Therefore, every model has to be calculated specifically with all its components.

In estimating the model note some of the options available in the PARAFAC m-file (type help parafac): The input options makes it possible to, e.g., have the interim loadings being plotted during the iterations (options(3) = 2), or only allow for, say, 100 iterations (options(6) = 100).

Additionally to the above-mentioned tools, one may use the split-half analysis for determining the proper number of components. If two sample sets are available that should describe the same common variation, then it holds that for the proper number of components they should give the same loading vectors up to scaling and permutation (more on this). Thus, if this is not the case, it is a clear indication that too many components are being extracted.

A new approach called the core consistency diagnostic (CORCONDIA) is very helpful in determining the right number of components. The details will not be given here (but here in a pdf-file) but the principle is as follows. For a given number of components fit the PARAFAC model to the data. Use the thereby found solution to calculate the so-called core consistency. If the PARAFAC model is valid then the core consistency is close to 100%. If the data can not approximately be described by a trilinear model or too many components are used the core consistency will be close to zero (or even negative). If the consistency is in-between (around 50%) the model is unstable in which case imposing valid constraints may help stabilize the model. In practice the core consistency levels off slowly for an increasing number of components and then sharply when the correct number of components is exceeded. The number of components corresponding to the last high consistency value should be chosen.

Task: Use different tools for assessing the proper number of components for the fluorescence data. For models of dimension one to five

• Look at residuals (mesh plots and alike. Check the size of the residuals compared to the data and the known error)
• Look at scree-plots (sum-of-squared residuals versus the number of components)
• Use the core consistency diagnostic (corcond)
How to do it

The multi-way toolbox also provides an m-file called pftest for checking the number of components (type help pftest). The approach adopted there is more or less similar to the one mentioned here, but additionally each model corresponding to a given number of components is fitted several times. This provides means for assessing if the model converges to the same minimum every time. This is useful also in determining the proper number of components, since if too many components are used the number of local minima typically increases dramatically.

3. Checking convergence of the algorithm

The algorithm for fitting the PARAFAC model is a so-called alternating least squares algorithm. It is iterative and stops when the relative difference in fit between two successive iterations is below a certain limit. For most types of data this limit can be set to 10-6 (default in the algorithm) which will ensure that the model is correct and that not too many iterations are used. For some data the model is very difficult to fit and a lower convergence criterion may therefore be needed. Note that one need not check convergence for the N-PLS and Tucker algorithms as those are very efficient. To assess convergence the following list may be used
• Fit the models several times using random initialization (why?) by setting Options(2)=2;.
• If all models have the same fit (i.e. loss function value) the models have converged (this is common)
• If all but a small fraction of the fitted models have the same (and best) fit, the model have converged and the few models with lower fit may be discarded as accidental local minima.
• If all models have different fit values the model is difficult to fit (maybe too many components) and the convergence criterion has to be lowered.
• If the models converge to a few different but distinct fit-values, i.e. there are several models with the same fit values, then there are multiple local minima, which is a tricky situation. Likely, it is possible to circumvent this either by using some additional constraints (e.g., non-negativity) or otherwise slightly re-specifying the model.
Task: Fit a three-component model several times and see if the estimated model converges. If so, the fits should be identical every time. This demonstrates how to assess convergence. Do the same for a four- or a five-component model. Likely you will see that the fit differs from model to model if you are using a random initialization. Lower the convergence criterion to circumvent this. You can use pftest for testing convergence.

4. Summary

In this chapter it has been shown how to fit the PARAFAC model and how to find the appropriate number of components. Fitting the PARAFAC model is simply done by typing

> [Factors] = parafac(X,DimX,Fac);

where Fac is the number of components. The parameters of the model are held in the vector Factors and may be converted into ordinary scores and loadings by means of  fac2let. In case of a three-way model this would read

> [A,B,C] = fac2let(Factors,DimX,Fac);

The scores and loadings of a PARAFAC model may be assessed and used just as in ordinary two-way analysis (e.g., use scores for building regression models, plot loadings and make bi-plots for interpretation etc.)

It has also been discussed that using the file pftest one may very quickly get a good indication of the appropriate number of components to use.

Some important aspects that differ from ordinary two-way analysis have been discussed:

• PARAFAC models are not nested; hence one may not fit models by augmenting an existing model by adding an extra component
• Unlike many other decomposition models often used, the PARAFAC algorithm can occasionally have difficulties in finding the true least squares model. This is mostly the case when too many components are extracted or when the model is otherwise mis-specified
• The PARAFAC model is unique, which means that it may be used e.g., for resolving spectra. However, a unique PARAFAC model does not explicitly imply that the parameters are estimates of the true underlying parameters. Only in the case where these exist and the model is correctly specified can that be the case. Hence, in the above example with fluorescence data, even a two-component model is unique, but it does not reflect pure spectra since there are three chemical analytes in the samples.