Multi-block Methods for Exploratory Data Mining in Food Technology

 

Frans van den Berg1), Vibeke Povlsen1), Anette Thybo2) and Rasmus Bro1)

 

1) The Royal Veterinary and Agricultural University, Dept. of Dairy and Food Science, Food Technology Core, Denmark

2) Danish Institute of Agricultural Sciences, Dept. of Horticulture, Denmark

 

Modern researchers have no problem collecting huge amounts of data! With the help of computers, electronics and hyphenated instrumentation, the number of measured variables in almost every field of science grows at enormous speed. Sophisticated mathematical and statistical methods are developed to handle such vast amounts of data. Unfortunately, these methods are not used as frequently as one would expect given the abundance of large-dataset problems.

In the field of chemistry these new developments are primarily organized in the discipline called chemometrics. Well-known techniques such as Principal Component Analysis (PCA) and Partial Least Squares Regression (PLSR) are designed to handle the correlation between series of measured variables, revealing the important, underlying (so-called latent) phenomena in multivariate data tables [1]. Probably the most familiar application of these methods is multivariate calibration for Near Infrared (NIR) spectrometers. The highly correlated absorbance values at different wavelengths can be used to predict, e.g., the protein content of barley samples [2]. For hyphenated (2D) techniques like multi-wavelength fluorescence emission-excitation spectroscopy and gas chromatography-mass spectrometry (GC-MS), new methods like Parallel Factor Analysis (PARAFAC), Tucker models and multilinear PLSR have been developed [3]. These techniques are designed to decompose higher-order data tables (e.g. cubes), again to reveal the underlying, latent phenomena for the purpose of data analysis and prediction.

All these data-analytical methods have in common that they are highly graphical. Next to important figures of merit like model accuracy and precision, they are developed to represent the model diagnostics in the form of plots. Some plots show the position of one sample compared to all others (e.g. for outlier detection). Other plots are used to evaluate the relative importance of a variable in a multivariate data table.

In this paper we discuss a group of tools that handle multiple blocks of data collected on the same set of samples, so-called multi-block models. They can be considered extensions of ‘single-block’ PCA and PLSR (see footnote a). Their use can be beneficial when analyzing large datasets where measurements are organized in conceptually meaningful blocks. An example of such a ‘natural blocking’ could be data from different instrumental techniques (NIR, GC, physical/rheological parameters, etc.) used on the same set of samples. The first approach for handling this many variables would be to put everything in one big data table and analyze the entire block. This can, however, significantly blur the final results. Multi-block models strive to maintain the natural ordering in the data. They try to explain the relation between the different blocks and the relative contribution of each block to the model.

Multi-block models are considered ‘data mining’ tools in that they can give a (graphical) overview of large amounts of data, with the aim of improving the knowledge on the subject under study by notably reducing the complexity. Multi-block models are considered ‘exploratory’ in that they are suitable for initial investigations of the data-mountain, to e.g. intelligently reduce this amount of data and find a more dedicated mathematical model from the reduced dataset.

 

 

Multi-block models

 

In this section we will give an introduction to two specific methods: multi-block PCA (developed under the name Consensus-PCA) and multi-block PLSR (officially known as MB-PLSR with deflation on the super scores) [4]. The reader should be aware that many more multi-block methods, dedicated to special data-analytical problems, can be found in the literature.

To assist in the explanation of multi-block algorithms, a brief description of the PCA and PLSR algorithms will be given first. We start with a descriptor data table X. In this table every row is formed by one sample (object), while the columns are formed by the measurements (variables). If we measure e.g. NIR spectra on a set of samples, we get a data matrix X of size samples × wavelengths (= objects × variables). The first PCA principal component finds a rank-one bilinear model that explains the maximum amount of variance in the original data matrix X (see Box 1a). The variance not captured by the first factor can be subjected to a second analysis step that tries to model the most variance in these residuals. Three sources of information become available from the PCA model: i) how much of the total variance in X is modeled by successive factors (‘how important is the last factor extracted’); ii) the object score values ti for every factor (‘what is the role of individual samples compared to others for this factor’); iii) the variable loadings pi (‘what part do different variables play in this factor’). We also get information on what has not been captured by the factor. To come back to our spectroscopy example: if the first factor explains 80% of the total variance in X, and the object score values in t1 show a clear separation into two clusters, we know that there are likely to be two types of samples in our dataset. We can use the loadings vector p1 to identify NIR absorbance peaks that cause the two clusters to differ. PCA is used to study the full data table in a model of reduced complexity, formed by a small number of scores and loadings. Any experienced NIR user can tell that deriving conclusions from raw NIR data tables can be very hard!
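The factor-by-factor extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this work: it assumes a NIPALS-style iteration, and the function name `pca_nipals`, the tolerance and the synthetic two-cluster "spectra" are all hypothetical.

```python
import numpy as np

def pca_nipals(X, n_factors, tol=1e-10, max_iter=1000):
    """Extract PCA factors one at a time, deflating the data after each factor."""
    E = X - X.mean(axis=0)                 # column mean centering
    total_ss = np.sum(E ** 2)              # total sum of squares to be explained
    scores, loadings, explained = [], [], []
    for _ in range(n_factors):
        # Start from the column with the largest remaining sum of squares.
        t = E[:, np.argmax(np.sum(E ** 2, axis=0))].copy()
        for _it in range(max_iter):
            p = E.T @ t / (t @ t)          # loadings: regress columns on scores
            p /= np.linalg.norm(p)         # normalize loadings to length one
            t_new = E @ p                  # scores: project objects on loadings
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)             # residuals go to the next factor
        scores.append(t)
        loadings.append(p)
        explained.append(100.0 * (t @ t) / total_ss)
    return np.column_stack(scores), np.column_stack(loadings), np.array(explained)

# Synthetic example (assumed data): two sample types that differ in one peak.
rng = np.random.default_rng(0)
group = np.repeat([0.0, 1.0], 10)
peak = np.exp(-0.5 * ((np.arange(50) - 20) / 4.0) ** 2)
X = np.outer(group, peak) + 0.02 * rng.normal(size=(20, 50))

T, P, expl = pca_nipals(X, n_factors=2)
print("percent variance explained per factor:", np.round(expl, 2))
print("variable with the largest loading on factor 1:", np.argmax(np.abs(P[:, 0])))
```

In this toy dataset the first score vector separates the two sample types, and the first loading vector points to the region of the synthetic peak, mirroring the three diagnostics listed in the text.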

In PLSR we are looking for a regression relation between a descriptor block X and a response block Y (Box 1b). In our NIR example this could e.g. be the concentrations of some components of interest, determined by laboratory reference methods. The first PLSR factor (‘Latent Variable’) builds two bilinear models (one for X and one for Y) that are optimized to simultaneously explain the maximum amount of variance in X and predict the response variables in Y as well as possible. The same diagnostics as in PCA – percentage explained in X and Y, score and loading values for both models – are available in PLSR.
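For a single response variable the PLSR factors can be computed without iteration. The sketch below assumes the common NIPALS formulation of PLS1; the function names `pls1_fit` and `pls1_predict` and the demonstration data are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """PLS regression for one response: returns coefficients and centering values."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean          # centered descriptor block and response
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = E.T @ f                        # weights: covariance of variables with y
        w /= np.linalg.norm(w)
        t = E @ w                          # object scores for this latent variable
        tt = t @ t
        p = E.T @ t / tt                   # X-loadings
        c = f @ t / tt                     # regression of y on the scores
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.column_stack(W), np.column_stack(P), np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)    # coefficients in terms of the original X
    return b, x_mean, y_mean

def pls1_predict(X_new, b, x_mean, y_mean):
    """Predict the response for new objects."""
    return (X_new - x_mean) @ b + y_mean

# Assumed demonstration data: a noise-free linear relation in four variables.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(15, 4))
beta_demo = np.array([1.0, -2.0, 0.5, 3.0])
y_demo = X_demo @ beta_demo + 4.0

for a in (1, 2, 4):
    b, xm, ym = pls1_fit(X_demo, y_demo, a)
    rss = np.sum((y_demo - pls1_predict(X_demo, b, xm, ym)) ** 2)
    print(a, "factor(s), residual sum of squares:", rss)
```

With as many factors as there are independent variables, the PLS1 solution coincides with ordinary least squares, which gives a simple check on the sketch.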

As stated in the introduction, multi-block methods have the ‘restriction’ that different blocks have to have one mode in common. This is usually the sample mode: a series of experiments, divided into meaningful blocks, is run on the same set of objects. Next, there are two conceptual viewpoints on handling these data blocks by multi-block methods. The first one is to consider each data block as a separate source of information, where the task of the multi-block model is to express the common structure for the objects. This object ‘consensus’ is formulated in a so-called super level, an additional top layer, combining information from all X-blocks on the lower data level. The alternative view on multi-block modeling is as follows: we have a large number of measurements, and we want to use all of them in an analysis or regression problem. Thus, we form one large data table to do the computations. However, we know that there are distinct groups of variables and we want to keep track of these separate blocks. At the super level we have the augmented data block. One level lower we have the individual blocks. We can actually go one level deeper, by looking at the individual variables in these blocks (just like regular PCA or PLSR).

In MB-PCA (Box 1c) we get the following information: i) percentages of explained variance for the augmented block (‘how important is the common factor over all blocks’); ii) super object score values ts (‘what is the influence of an object, seen over all blocks’); iii) super block-weights ws (‘how important is a block for this factor’). Besides these three tools we have diagnostics on the block level similar to regular PCA.
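One factor of such a consensus model can be sketched as follows. The code follows the general structure of the consensus-PCA iteration discussed in [4], but the normalization choices, the function name `mbpca_factor` and the synthetic blocks are assumptions made for illustration. The blocks are taken to be column centered and block scaled beforehand.

```python
import numpy as np

def mbpca_factor(blocks, tol=1e-10, max_iter=1000):
    """One consensus-PCA factor for a list of centered blocks sharing their rows."""
    X = np.hstack(blocks)
    t_s = X[:, np.argmax(np.sum(X ** 2, axis=0))].copy()   # starting super score
    for _ in range(max_iter):
        block_scores = []
        for Xb in blocks:
            p_b = Xb.T @ t_s / (t_s @ t_s)   # block loadings from the super score
            p_b /= np.linalg.norm(p_b)
            block_scores.append(Xb @ p_b)    # block scores
        T = np.column_stack(block_scores)    # super block: one column per block
        w_s = T.T @ t_s / (t_s @ t_s)        # super block-weights
        w_s /= np.linalg.norm(w_s)           # normalized to length one
        t_new = T @ w_s                      # updated super score
        converged = np.linalg.norm(t_new - t_s) < tol * np.linalg.norm(t_new)
        t_s = t_new
        if converged:
            break
    residuals, explained = [], []
    for Xb in blocks:                        # deflate every block on the super score
        p_b = Xb.T @ t_s / (t_s @ t_s)
        Eb = Xb - np.outer(t_s, p_b)
        residuals.append(Eb)
        explained.append(100.0 * (1.0 - np.sum(Eb ** 2) / np.sum(Xb ** 2)))
    return t_s, w_s, T, residuals, np.array(explained)

# Assumed demonstration data: two blocks driven by one common latent factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=12)
X1 = np.outer(latent, rng.normal(size=8)) + 0.1 * rng.normal(size=(12, 8))
X2 = np.outer(latent, rng.normal(size=5)) + 0.1 * rng.normal(size=(12, 5))
blocks = [X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)]

t_s, w_s, T_blocks, residuals, expl = mbpca_factor(blocks)
print("super block-weights:", np.round(w_s, 3))
print("percent explained per block:", np.round(expl, 1))
```

With these particular normalizations the super score coincides, up to sign, with the first PCA score of the augmented data table, which illustrates the second viewpoint given above: one large table in which the block structure is tracked.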

In MB-PLSR a regression model is found between the response block Y and a super descriptor block T, which itself is a function of the original descriptor X-blocks (Box 1d). The same diagnostics as in MB-PCA and regular PLSR are available, again with the additional restriction that score values and loadings are optimized for object consensus at the super level.
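A compact sketch of this scheme for a single response is given below. It follows the outline of MB-PLSR with deflation on the super scores as discussed in [4]; the function names `mbpls_fit` and `mbpls_predict`, the dictionary layout and the test data are hypothetical, and the blocks and the response are assumed to be centered (and block scaled) beforehand.

```python
import numpy as np

def mbpls_fit(blocks, y, n_factors):
    """MB-PLSR sketch for one response, deflating X-blocks and y on the super scores."""
    E = [Xb.copy() for Xb in blocks]
    f = y.copy()
    factors = []
    for _ in range(n_factors):
        block_w, block_t = [], []
        for Eb in E:
            w_b = Eb.T @ f                   # block weights from the response
            w_b /= np.linalg.norm(w_b)
            block_w.append(w_b)
            block_t.append(Eb @ w_b)         # block scores
        T = np.column_stack(block_t)         # super descriptor block
        w_s = T.T @ f                        # super weights, one per block
        w_s /= np.linalg.norm(w_s)           # normalized to length one
        t_s = T @ w_s                        # super score
        tt = t_s @ t_s
        q = f @ t_s / tt                     # regression of y on the super score
        block_p = [Eb.T @ t_s / tt for Eb in E]
        E = [Eb - np.outer(t_s, p_b) for Eb, p_b in zip(E, block_p)]
        f = f - q * t_s
        factors.append({"w_b": block_w, "w_s": w_s, "p_b": block_p, "q": q, "t_s": t_s})
    return factors

def mbpls_predict(new_blocks, factors):
    """Predict the (centered) response for objects measured in the same blocks."""
    E = [Xb.copy() for Xb in new_blocks]
    y_hat = np.zeros(E[0].shape[0])
    for fac in factors:
        T = np.column_stack([Eb @ w_b for Eb, w_b in zip(E, fac["w_b"])])
        t_s = T @ fac["w_s"]
        y_hat += fac["q"] * t_s
        E = [Eb - np.outer(t_s, p_b) for Eb, p_b in zip(E, fac["p_b"])]
    return y_hat

# Assumed demonstration data: the response depends on both blocks.
rng = np.random.default_rng(3)
X1 = rng.normal(size=(15, 3))
X2 = rng.normal(size=(15, 2))
blocks = [X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)]
y = blocks[0] @ np.array([1.0, -1.0, 2.0]) + blocks[1] @ np.array([0.5, 3.0])

factors = mbpls_fit(blocks, y, n_factors=2)
for a, fac in enumerate(factors, start=1):
    print("factor", a, "super weights:", np.round(fac["w_s"], 3))
```

Because the super weights are non-negative and have length one, a value close to one marks a block that dominates the factor, which is how they are read in the Results section.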

It is important to emphasize that different blocks should be weighted before they can be used in one model (similar to the weighting of individual variables in regular PCA and PLSR). If the variance in one block is much larger than all others, this block will dominate the solution, and the conclusions can be misleading.
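The effect of block weighting can be illustrated with a small sketch. It assumes that block scaling means dividing a column-centered block by the square root of its total variance, so that every block contributes the same total variance to the model; other conventions exist, and the function name `block_scale` and the example blocks are hypothetical.

```python
import numpy as np

def block_scale(X):
    """Column-center a block and scale it to a total variance of one."""
    Xc = X - X.mean(axis=0)
    total_var = np.sum(Xc.var(axis=0, ddof=1))   # sum of the column variances
    return Xc / np.sqrt(total_var)

# Assumed example: two blocks measured on very different numerical scales.
rng = np.random.default_rng(4)
X_big = rng.normal(scale=100.0, size=(23, 40))
X_small = rng.normal(scale=0.1, size=(23, 40))

for name, Xb in (("large-scale block", X_big), ("small-scale block", X_small)):
    before = np.sum(Xb.var(axis=0, ddof=1))
    after = np.sum(block_scale(Xb).var(axis=0, ddof=1))
    print(name, "total variance before:", before, "after:", after)
```

Without this step the first block would carry almost all of the variance in the augmented table and would dominate any factor extracted from it.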

 

 

Experimental

 

The theory explained in the previous section will now be used in an example from the agricultural industry. Five different potato varieties were harvested in September 1999 and analyzed in November 1999 and May 2000. The yields were sorted by salt weighting in two or three dry matter intervals, resulting in thirteen and ten different so-called bins for the two storage times. From these bins tubers were selected for laboratory analysis and sensory evaluation. The lab measurements consist of uniaxial compression curves on raw and cooked potato material (Figure 1). In this technique a small potato sample is compressed at constant velocity under well-controlled conditions. The force resistance of the potato material – a function of chemical and physical composition of the tuber, expected to be related to consumer experience of the product – is recorded. These experiments are repeated for ten tubers from each variety. The uniaxial compression curves – averaged over ten replicates to reduce the natural variation in the bins – form the predictor blocks (the blocks raw X1 and cooked X2) in our multi-block models.

The response block is formed by a sensory evaluation of the same potato bins. A trained sensory panel of ten assessors evaluated the cooked tubers on a number of attributes. In this paper we will use two of them: Cohesiveness and Mealiness. The panelists scored these attributes on a scale from zero to fifteen, and the average scores are used as Y-blocks.

In total we have 23 objects, two X-blocks of compression ‘spectra’, and Y-blocks with Cohesiveness or Mealiness sensory scores. The two X-blocks were scaled to have block unit variance, and all three data tables were column mean centered before modeling. The correct model complexity was determined from a so-called leave-one-sample-out cross validation. In this procedure one potato sample is taken from the calibration set, an MB-PLSR model is built from the remaining 22 samples, and the response value for the removed sample is predicted. This procedure is repeated for all 23 samples in the set. The overall prediction error is determined as the Root Mean Squared Error of Prediction (RMSEP), which usually shows a minimum for the optimal model complexity.
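The cross-validation loop itself is independent of the regression method. In the sketch below a principal component regression stands in for the MB-PLSR model purely to keep the example short; the function names, the synthetic data and the maximum number of factors are all assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_factors):
    """Stand-in model: principal component regression with n_factors components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    a = n_factors
    b = Vt[:a].T @ ((U[:, :a].T @ (y - y_mean)) / s[:a])
    return b, x_mean, y_mean

def pcr_predict(X_new, model):
    b, x_mean, y_mean = model
    return (X_new - x_mean) @ b + y_mean

def loo_rmsep(X, y, fit, predict, max_factors):
    """Leave-one-sample-out RMSEP for models with 1..max_factors factors."""
    n = len(y)
    rmsep = []
    for a in range(1, max_factors + 1):
        press = 0.0
        for i in range(n):
            keep = np.arange(n) != i             # calibration set without sample i
            model = fit(X[keep], y[keep], a)     # centering is redone on these samples
            y_hat = predict(X[i:i + 1], model)[0]
            press += (y[i] - y_hat) ** 2         # squared prediction error
        rmsep.append(np.sqrt(press / n))
    return np.array(rmsep)

# Assumed data: 23 objects, two latent factors, the response follows the smaller one.
rng = np.random.default_rng(5)
l1 = 3.0 * rng.normal(size=23)
l2 = rng.normal(size=23)
p1 = np.ones(6) / np.sqrt(6.0)
p2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]) / np.sqrt(6.0)
X = np.outer(l1, p1) + np.outer(l2, p2) + 0.01 * rng.normal(size=(23, 6))
y = 2.0 * l2 + 0.05 * rng.normal(size=23)

rmsep = loo_rmsep(X, y, pcr_fit, pcr_predict, max_factors=4)
print("RMSEP for 1 to 4 factors:", np.round(rmsep, 3))
```

Plotting RMSEP against the number of factors gives the kind of curve used to choose the model complexity; in this constructed example the error drops sharply once the second factor is included.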

 

 

Results

 

The potato sample data are subjected to MB-PLSR modeling. Figure 2 shows the primary information we get for the two models. Cohesiveness: from the cross validation RMSEP curve (Fig. 2c) we see that a two-factor model is optimal. The super weights ws are normalized to length one in the MB-PLSR algorithm. This means that a high value (close to one) indicates that a block is important in this factor, while a low value means the block has little involvement. The super weights ws for Cohesiveness (Fig. 2b) show that both blocks are of approximately equal importance, with a slight preference for the cooked sample compression curves (green bar). From the percentages of variance explained we learn that only a small part of the X1-block (raw samples) information is used in the first factor (Fig. 2a).

The Mealiness RMSEP values also indicate a two-factor model. From ws we learn that the first factor mostly depends on the X1-block while the second factor is dominated by the X2-block (cooked samples). The explained variances start out high for both blocks, hence the relatively low model complexities.

Figure 3 shows the predicted versus reference values from cross validation for the two sensory attribute models. The predictive performance for Cohesiveness (squared correlation coefficient R² = 0.77) and Mealiness (R² = 0.81) is considered acceptable (Fig. 3a), in line with other sensory attribute modeling experiments [5]. From the X-block loadings pXi we see that for the raw potato sample compression curves the information extracted is approximately the same (the sign is undetermined for loading vectors), with a switch in the factor order (Fig. 3b). For the cooked potato sample curves Mealiness shows somewhat more skewed loading vectors (Fig. 3c).

In Figure 4 we plot the super-level scores on factor 2 against those on factor 1. The figure shows the effect of varieties and dry matter on Mealiness and Cohesiveness, with clusters according to these design variables. The two regression models show approximately the same clustering, with the observation that factors one and two have switched places between the Cohesiveness and Mealiness results.

 

 

Conclusions

 

In this paper we tried to familiarize the reader with the concept of multi-block models. The general idea of these methods is to get a comprehensible (graphical) overview of a large amount of information, while maintaining the natural order in the data (block structure). The alternative to multi-block methods would be to analyze the individual blocks and try to reach an overall conclusion from the separate observations.

In the theory and experimental sections of this paper we showed an example of the simplest multi-block situation: two laboratory predictor blocks for the prediction of the sensory attributes Cohesiveness and Mealiness in potato samples. These data are part of a larger study where five X-blocks – both physical and chemical – with eight different sensory attributes are available.

The methods presented in this paper are extensions of the well-known bilinear modeling methods PCA and PLSR. There are however many more multi-block methods available from literature, sometimes highly dedicated to specific analysis problems. E.g. in the realm of chemical engineering, process measurements and settings at different points in time are used as separate blocks in the algorithms [6], while in sensory texture studies the panel participants can be seen as ‘blocks’ with the product under evaluation as common denominator [7].

 

 

Acknowledgement

 

The work presented in this paper is part of the Advanced Quality Monitoring (AQM) project, a joint framework of KVL (LMC), DJF and DFU.

 


 

References

 

[1] H.Martens and M.Martens ‘Multivariate Analysis of Quality – an Introduction’ Wiley(2001)

[2] J.S.Shenk and M.O.Westerhaus ‘Population structuring of near-infrared spectra and modified partial least-squares regression’ Crop Science no.6, 31(1991)1548-1555

[3] C.A.Andersson and R.Bro ‘The N-way Toolbox for Matlab’ Chemometrics and Intelligent Laboratory Systems 52(2000)1-4

[4] J.A.Westerhuis, Th.Kourti and J.F.MacGregor ‘Analysis of Multiblock and hierarchical PCA and PLS Models’ Journal of Chemometrics 12(1998)301-321

[5] A.K.Thybo, I.E.Bechmann, M.Martens and S.B.Engelsen ‘Prediction of Sensory Texture of Cooked Potatoes using Uniaxial Compression, Near Infrared Spectroscopy and Low Field 1H NMR Spectroscopy’ Lebensmittel-Wissenschaft und -Technologie 33(2000)103-111

[6] A.K.Smilde, J.A.Westerhuis and R.Boqué ‘Multiway multiblock component and covariates regression models’ Journal of Chemometrics 14(2000)301-331

[7] G.M. Arnold and A.A.Williams ‘The use of Generalised Procrustes Techniques in Sensory Analysis’ in J.R.Piggott ‘Statistical Procedures in Food Research’ 1986

 




a Historically this is not entirely correct. PLSR was designed as a general (‘multi-block’) method and later evolved into the X-Y block regression method popular today.