Amino acids fluorescence data
There is no particular analytical problem to solve here, but these simple fluorescence data are well suited for illustrating different aspects of the trilinear PARAFAC model. They can be used for second-order calibration, for working with systematically missing data, for imposing constraints, etc. The samples were generated and measured by Claus A. Andersson (KVL, DK).
Get the data
The data are available in zipped MATLAB 4.2 format. Download the data and type load data in MATLAB. If you use the data, we would appreciate it if you reported the results to us as a courtesy for the work involved in producing and preparing the data. You may also want to cite the data by referring to
Bro, R, PARAFAC: Tutorial and applications, Chemometrics and Intelligent Laboratory Systems, 1997, 38, 149-171
The data have also been described in
- Bro, R, Multi-way Analysis in the Food Industry. Models, Algorithms, and Applications. 1998. Ph.D. Thesis, University of Amsterdam (NL) & Royal Veterinary and Agricultural University (DK).
- Kiers, H.A.L. (1998) A three-step algorithm for Candecomp/Parafac analysis of large data sets with multicollinearity, Journal of Chemometrics, 12, 155-171.
This data set consists of five simple laboratory-made samples. Each sample contains different amounts of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water. The samples were measured by fluorescence (excitation 250-300 nm, emission 250-450 nm, 1 nm intervals) on a PE LS50B spectrofluorometer with an excitation slit-width of 2.5 nm, an emission slit-width of 10 nm and a scan-speed of 1500 nm/s. The array to be decomposed is hence 5 × 51 × 201 (samples × excitation wavelengths × emission wavelengths). Figure 1 shows the measurements of one of the samples. Ideally these data should be describable with three PARAFAC components, because each individual amino acid gives a rank-one contribution to the data.
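As a rough illustration of why three rank-one contributions lead to a three-component trilinear model, the following sketch builds a synthetic array of the same size as the data. The Gaussian band shapes, their centres and widths, and the random concentrations are invented for illustration only; they are not the measured spectra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the text: 5 samples, 51 excitation and 201 emission wavelengths.
n_samples, n_comp = 5, 3
ex_wl = np.arange(250, 301)  # 250..300 nm in 1 nm steps
em_wl = np.arange(250, 451)  # 250..450 nm in 1 nm steps


def gaussian(axis, center, width):
    """Hypothetical smooth band shape, not a measured spectrum."""
    return np.exp(-0.5 * ((axis - center) / width) ** 2)


# Made-up loadings, one column per analyte.
A = rng.uniform(0.1, 1.0, size=(n_samples, n_comp))                      # concentrations
B = np.column_stack([gaussian(ex_wl, c, 8.0) for c in (258, 275, 280)])   # excitation
C = np.column_stack([gaussian(em_wl, c, 20.0) for c in (282, 303, 350)])  # emission

# Trilinear (PARAFAC) model: x_ijk = sum_f a_if * b_jf * c_kf
X = np.einsum("if,jf,kf->ijk", A, B, C)

print(X.shape)
```

Each term of the sum is the outer product of one concentration vector, one excitation profile and one emission profile, which is what is meant by a rank-one contribution.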
Figure 1. Fluorescence landscape of a sample containing only phenylalanine.
In Figure 2 and Figure 3 the normalized loadings of a three- and a four-component model are shown. It is readily seen that the three loadings of the three-component model are also found in the four-component model. These three loadings resemble the pure spectra of tryptophan, tyrosine and phenylalanine. The fourth component does not resemble any of the analytes and in fact does not seem to reflect chemical information. The reason for the presence of this fourth and quite distinct component must be that non-linearities or scatter effects cause some additional systematic variation.
Figure 2. Loading vectors resulting from fitting a three-component PARAFAC model to amino acid data.
Figure 3. Loading vectors resulting from fitting a four-component model to amino acid data. The fourth, suspicious component is shown with a thicker line.
In fact, these data have been investigated several times, always using three components. Even when used for second-order calibration, the use of three components has given satisfactory results. This is so because the fourth component has a very low variance. The variance of this fourth component is only 0.03%, compared to 50.7, 25.5, and 16.2% for the three ‘chemical’ components. Therefore the bulk variation is not affected significantly by the fourth component, and this is also why traditional tools based on residuals have difficulty detecting it.
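One way to obtain per-component percentages of this kind is to compare the sum of squares of each rank-one contribution with the total sum of squares of the array. The sketch below assumes that definition; the text does not state how the percentages were computed, and the function name is hypothetical. Because PARAFAC components are generally not orthogonal, such percentages need not add up to the total explained variation.

```python
import numpy as np


def component_percentages(X, A, B, C):
    """Percent of the total sum of squares of X carried by each rank-one
    component a_f o b_f o c_f (assumed definition, see the lead-in)."""
    total = np.sum(X ** 2)
    # The sum of squares of an outer product factorizes over the modes.
    per_component = (
        np.sum(A ** 2, axis=0) * np.sum(B ** 2, axis=0) * np.sum(C ** 2, axis=0)
    )
    return 100.0 * per_component / total


# Tiny made-up example with two components that do not overlap in the sample mode.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
B = np.ones((4, 2))
C = np.ones((5, 2))
X = np.einsum("if,jf,kf->ijk", A, B, C)

pct = component_percentages(X, A, B, C)
print(pct)
```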
As an explanation for this finding, it is important to notice in Figure 1 the Rayleigh scatter in the left part, which is not multilinear in nature. It is situated around the diagonal where emission and excitation wavelengths coincide. In addition to the Rayleigh scatter, the emission below the excitation wavelength does not vary according to the multilinear model, since the emission intensity there is zero (up to the noise) regardless of excitation. In fact, the emission mode loading of the fourth ‘spurious’ component resembles the Rayleigh scatter. To avoid such spurious results, the lower part of the data (emission below excitation wavelength) as well as the part corresponding to Rayleigh scatter should not be fitted by the model. Rather, these elements must be set to missing values in the three-way array in order not to bias the model.
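Setting these elements to missing can be sketched as follows, using NaN as the missing-value marker. The function name and the 10 nm width of the Rayleigh band are assumptions made for illustration; the text does not say which width was actually used.

```python
import numpy as np

ex_wl = np.arange(250, 301)  # excitation wavelengths, nm
em_wl = np.arange(250, 451)  # emission wavelengths, nm


def mask_scatter(X, ex_wl, em_wl, rayleigh_width=10.0):
    """Return a copy of X (samples x excitation x emission) with NaN wherever
    emission lies below excitation or inside the assumed Rayleigh band."""
    # Emission minus excitation wavelength for every (excitation, emission) pair.
    diff = em_wl[None, :] - ex_wl[:, None]
    # diff < 0 is the region below excitation; 0 <= diff < rayleigh_width is
    # the part of the scatter band above the diagonal.
    missing = diff < rayleigh_width
    X_masked = X.astype(float, copy=True)
    X_masked[:, missing] = np.nan
    return X_masked, missing


# Placeholder array standing in for the measured 5 x 51 x 201 data.
X = np.ones((5, ex_wl.size, em_wl.size))
X_masked, missing = mask_scatter(X, ex_wl, em_wl)
print(missing.mean())  # fraction of each landscape treated as missing
```

A PARAFAC algorithm that supports missing data would then fit only the remaining elements.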
Figure 4. Core consistency plot of a three (left) and a four (right) component PARAFAC model of the amino acid data with missing entries.
When all appropriate elements of the array have been set to missing, the values in Table 1 are obtained. Clearly, the core consistency diagnostic (CORCONDIA) now correctly identifies that there are three trilinear components in the data. In Figure 4 the core consistency plots are shown. It is easily seen that for the three-component model the Tucker3 core elements have values close to the target, whereas for the four-component model the values of the Tucker3 core deviate strongly from it. One element that ideally should be one is close to zero, and some elements that ideally should be zero are actually close to one.
Table 1. Amino acid fluorescence data. Results from fitting one- to six-component PARAFAC models to amino acid data with missing values.
In Table 1 it is seen that a six-component solution has a quite high core consistency, but this is preceded by two very low values. Hence, the clear choice here is three components.
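The core consistency used above can be sketched in a few lines for complete data. The sketch computes the least-squares Tucker3 core for fixed PARAFAC loadings and compares it with the ideal superdiagonal core of ones. It assumes loadings scaled so that the model is the plain sum of outer products, and it does not handle missing values, so it is not the procedure used to produce Table 1. The function name is hypothetical.

```python
import numpy as np


def core_consistency(X, A, B, C):
    """Core consistency in percent for a PARAFAC model X ~ sum_f a_f o b_f o c_f.
    Minimal sketch for complete data (no missing values)."""
    F = A.shape[1]
    # Least-squares Tucker3 core for fixed loadings:
    # vec(G) = (pinv(A) kron pinv(B) kron pinv(C)) vec(X)
    G = np.einsum(
        "pi,qj,rk,ijk->pqr",
        np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X,
    )
    # Ideal core: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((F, F, F))
    idx = np.arange(F)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / F)


rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))
B = rng.normal(size=(6, 2))
C = rng.normal(size=(7, 2))

# Perfectly trilinear data: the core equals the target.
X_trilinear = np.einsum("if,jf,kf->ijk", A, B, C)
cc_trilinear = core_consistency(X_trilinear, A, B, C)

# Data with one extra interaction between components (an off-superdiagonal core element).
G0 = np.zeros((2, 2, 2))
G0[0, 0, 0] = G0[1, 1, 1] = 1.0
G0[0, 1, 1] = 1.0
X_tucker = np.einsum("pqr,ip,jq,kr->ijk", G0, A, B, C)
cc_tucker = core_consistency(X_tucker, A, B, C)

print(cc_trilinear, cc_tucker)
```

A value near 100 indicates that the loadings describe the data in a trilinear fashion, while interactions between components pull the value down, as in the second case.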