Interactive introduction to multi-way analysis in MATLAB
Previous Chapter: Preprocessing First Chapter: Contents

MISSING DATA

Missing values must be treated with care in any model. Simply setting them to zero is sometimes suggested, but this is a dangerous approach: the missing elements might just as well be set to 1237 or any other value, as there is nothing special about zero. Another approach is to impute the missing elements from an ANOVA model or something similar. While better than setting the elements to zero, this is still not a good approach. In two-way PCA, and in any-way PLS estimated through NIPALS-like algorithms, the approach normally advocated in chemometrics is simply to skip the missing elements in the appropriate inner products of the algorithm. This approach has been shown to work well for a small amount of randomly missing data, but also to be problematic in some cases.
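The skipping of missing elements in the inner products can be sketched for a single PCA component as follows. This is a minimal NumPy illustration (not code from any toolbox; the function name is invented here); NaN marks missing entries, and each update is a least-squares step over the observed entries only:

```python
import numpy as np

def nipals_1pc_missing(X, n_iter=500, tol=1e-10):
    """One PCA component by NIPALS, skipping missing (NaN) entries
    in the inner products of the algorithm."""
    mask = ~np.isnan(X)          # True where data are present
    Xf = np.where(mask, X, 0.0)  # zeros act only as placeholders in sums
    t = Xf[:, 0].copy()          # initial score vector
    for _ in range(n_iter):
        # loadings: per column, regress the observed entries on t
        p = (Xf.T @ t) / (mask.T @ (t * t))
        p /= np.linalg.norm(p)
        # scores: per row, regress the observed entries on p
        t_new = (Xf @ p) / (mask @ (p * p))
        if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
            t = t_new
            break
        t = t_new
    return t, p
```

Because the sums in the numerators and denominators run over observed entries only, the missing elements never influence the estimated scores and loadings.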

A better way to handle missing data follows from the idea that the model should be estimated by optimizing the loss function over the non-missing data only. This is a more sensible way of handling randomly missing data. The loss function for any model of incomplete data can thus be stated as

min ||W.*(X - M)||^2

(in MATLAB notation, with .* denoting element-wise multiplication), where X is a matrix containing the data and M the model (both unfolded). The structure (and constraints) of M are given by the specific model being estimated. The matrix W contains weights that are one for an existing element and zero for a missing element. If weighted regression is desired, W is changed accordingly, keeping the zero elements at zero. The natural way to estimate the model with missing data is thus by a weighted regression approach.
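The weighted loss itself is simple to evaluate. A minimal NumPy sketch (the function name is invented here for illustration; zero-weight entries contribute nothing to the fit):

```python
import numpy as np

def weighted_loss(X, M, W):
    """||W.*(X - M)||^2 with W_ij = 0 for missing (NaN) elements,
    so missing entries drop out of the loss entirely."""
    R = W * (np.nan_to_num(X) - M)   # NaNs are zeroed, then zero-weighted
    return float(np.sum(R * R))
```

With a 0/1 weight matrix this reduces to the sum of squared residuals over the observed elements; nonuniform weights for weighted regression fit the same expression.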

Another approach to handling incomplete data is to impute the missing data iteratively during the estimation of the model parameters. The missing data are initially replaced with either sensible or random values. A standard algorithm is then used for estimating the model parameters using all the data. After each iteration the model of X is calculated, and the missing elements are replaced with the model estimates. The iterations and replacements continue until the estimates of the missing elements no longer change and the overall convergence criterion is fulfilled. It is easy to see that, when the algorithm has converged, the elements replacing the missing elements will have zero residual.
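This iterative imputation scheme can be sketched as follows, here using a truncated SVD as the "standard algorithm" fitted to the completed data (a minimal NumPy illustration with invented names, not the tutorial's own code):

```python
import numpy as np

def impute_fit(X, rank=1, n_iter=500, tol=1e-9):
    """Iterative imputation: fill missing (NaN) entries, fit a standard
    model (truncated SVD) to the completed data, refill the missing
    entries from the model, and repeat until the estimates stabilize."""
    mask = ~np.isnan(X)
    Xc = np.where(mask, X, np.nanmean(X))   # sensible initial replacement
    M = Xc
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # current model of X
        Xnew = np.where(mask, X, M)                # replace missing by model
        if np.linalg.norm(Xnew - Xc) < tol * np.linalg.norm(Xc):
            Xc = Xnew
            break
        Xc = Xnew
    return M, Xc
```

At convergence the completed data equal the model exactly at the missing positions, i.e. those elements have zero residual, as stated above.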

How, then, do these two approaches compare? Henk Kiers has shown that they give identical results, which can also be seen by considering data imputation more closely. As the residuals corresponding to missing elements are zero at convergence, those elements do not influence the parameters of the model, which is equivalent to giving them zero weight in the loss function. Algorithmically, however, there are differences. Consider two competing algorithms for estimating a model of data with missing elements: one where the parameters are updated by weighted least squares regression with zero weights for the missing elements, and one where ordinary least squares regression and data imputation are used. Direct weighted least squares regression is computationally more costly per iteration than ordinary least squares regression and will therefore slow down the algorithm. Iterative data imputation, on the other hand, often requires more iterations because of the imputation (typically 30-100% more). It is difficult to say which method is preferable, as this depends on the implementation, the size of the data, and the available computing resources. Data imputation has the advantage of being easy to implement, also for problems that are otherwise difficult to estimate under a weighted loss function.


The N-way tutorial