The following page is originally based on a white paper by Rasmus Bro and Neal Gallagher from Eigenvector, Inc.

While principal component analysis is useful for analysis of two-way data, parallel factor analysis (PARAFAC) is the method of choice for three-way data. For example, two-way data might consist of *J* measurements (e.g., spectral absorbance at different wavelengths) on *I* objects (or samples) with the data collected into a matrix **X**. In contrast, EEM data consists of measurements of fluorescence at *K* excitation wavelengths and *J* emission wavelengths for each of *I* samples with the data collected into a three-way “data cube” or “box” ** X** of size

One way to write the PARAFAC model is

**X**_{k} = **AD**_{k}**B**^{T} + **E**_{k}, *k* = 1,..,*K* (1)

where **X**_{k} is the matrix containing the emission spectra of all the *I *samples at excitation wavelength *k*. For a model with *F* factors, the matrix **A** is *I* by *F* and corresponds to loadings in the sample mode. This is what we normally call the score matrix. The matrix **B **is *J* by *F* and holds the loadings in the emission mode, and **D**_{k} is an *F* by *F* diagonal matrix. Its elements are the *k*^{th} row of the *K* by *F* loading matrix **C** which contains the excitation loadings. Hence, at any given excitation wavelength *k*, the model of the measured emission spectra, **X**_{k}, is given by the same scores, **A**, and the same emission loadings, **B**, only each component is weighted by specific excitation loadings as defined in the diagonal matrix **D**_{k}. The matrix **E**_{k} is the model residuals for the *k*^{th} matrix.

The real ‘trick’ in PARAFAC is that, unlike PCA, PARAFAC is what is called unique. Hence, if the data follows a PARAFAC model, then the emission loadings are not just abstract orthogonal emission profiles as would be the case in PCA. Instead, the loadings are actually estimates of the real emission spectra of the real fluorophores.

A Perkin-Elmer LS50 B fluorescence spectrometer was used to measure fluorescence landscapes using excitation wavelengths between 200-350 nm with 5 nm intervals. The emission wavelength range was 200-750 nm. Excitation and emission monochromator slit widths were set to 5 nm, respectively. Scan speed was 1500 nm/min. Measurements were made on mixtures of four flourophores (hydroquinone, tryptophan, phenylalanine and dopa) at known concentrations (ppm M) giving a data set that was 27 samples by 121 emission wavelengths by 24 excitation wavelengths [The example uses the dorritdata set.^{2.}This dataset can either be found in a reduced size in PLS_Toolbox or at http://www.models.life.ku.dk/dorrit]. Here we will take the one from the models web-site and work from that. The data is held as a dataset object (http://eigenvector.com/software/dataset.htm) which is just a useful way to make sure that all wavelengths etc. are kept together with the actual measurements.

Always keep the structure of EEM data as samples by emissions by excitations.

With that convention, many of the MATLAB tools that are developed will be easier to use. In the following we will relay heavily on PLS_Toolbox (http://eigenvector.com/software/pls_toolbox.htm) functions, but everything can also be done using the tools available freely from here (http://www.models.life.ku.dk/algorithms,with additional online courses available as well http://www.models.life.ku.dk/~courses/parafac/chap0contents.htm).

Figure 1 shows an EEM landscape for a single sample corresponding to DOPA at 55 ppm. Equation 1 represents a tri-linear model, and ideally all sources contributing to the measured signal would follow this model. The two major peaks in Figure 1 likely have a response that is fairly close to tri-linear however, the Rayleigh scattering indicated by the arrow (corresponding to the line where excitation equals emission) does not adhere to the tri-linear model.

*Figure 1. Landscape as plotted using plotgui(EEM) and selecting View/Samples and Plottype/EnhancedSurface. Could also be achieved by mesh(squeeze(EEM.data(1,:,:))).*

Additionally, the top of the large peak is saturated and will not follow the PARAFAC model. As a result, these regions are treated as ‘missing data’ by replacing their entries with NaN (Not-a-Number). Regions where emission is greater than excitation, and strictly not included in the Rayleigh scatter, are set to zero as expected from the physics of the EEM measurement. This is done using the FLUCUT function:

Xnew = flucut(X,20,[20 20],NaN,NaN,0,0);

In this example, entries below emission = excitation - 20 nm were set to zero, while a band of 20 nm above and below emission = excitation were set to NaN (missing). The result for DOPA at 55 ppm is shown in Figure 2. Note that the secondary Rayleigh scattering was not accounted for in this example. Figure 3 shows that the contributions in mode 1 for factor 2 (column 2 of ) correspond to the known DOPA concentration. The excitation () and emission () profiles are shown for the four factors in Figures 4 and 5 respectively.

The tri-linear PARAFAC model can be easily applied to EEM data to provide interpretable results. However, Rayleigh scattering and signal saturation must be accounted for to avoid artifacts in the analysis.

[1] Smilde, A., Bro, R., Geladi, P., “Multi–way Analysis with Applications in the Chemical Sciences,” John Wiley & Sons, New York, NY (2004).

[2] J. Riu, R. Bro “Jack-knife estimation of standard errors and outlier detection in PARAFAC models,” *Chemometr. Intell. Lab.* 2003; **65**(1): 35–49.