Multi-way Analysis in the Food Industry
Models, Algorithms, and Applications
This monograph was originally written as a Ph.D. thesis (see end of file for
original Dutch information printed in the thesis at this page)
MULTI-WAY ANALYSIS IN THE FOOD INDUSTRY
Models, Algorithms & Applications
Rasmus Bro
Chemometrics Group, Food Technology
Department of Dairy and Food Science
Royal Veterinary and Agricultural University
Denmark
Abstract
This thesis describes some of the recent developments in multi-way
analysis in the field of chemometrics. Originally, the primary purpose of this
work was to test the adequacy of multi-way models in areas related to the
food industry. However, during the course of this work, it became obvious
that basic research is still called for. Hence, a fair part of the thesis
describes methodological developments related to multi-way analysis.
A multi-way calibration model inspired by partial least squares
regression is described and applied (N-PLS). Different methods for speeding
up algorithms for constrained and unconstrained multi-way models are
developed (compression, fast non-negativity constrained least squares
regression). Several new constrained least squares regression methods of
practical importance are developed (unimodality constrained regression,
smoothness constrained regression, the concept of approximate constrained
regression). Several models developed in psychometrics that have
never been applied to real-world problems are shown to be suitable in
different chemical settings. The PARAFAC2 model is suitable for modeling
data with factors that shift. This is relevant, for example, for handling
retention time shifts in chromatography. The PARATUCK2 model is shown
and Agricultural University, Denmark). His enthusiasm and general
knowledge are overwhelming, and the extent to which he inspires everyone
in his vicinity is simply amazing. Without Lars Munck none of my work
would have been possible. His many years of industrial and scientific work,
combined with his critical view of science, provide a stimulating
environment for the interdisciplinary work in the Chemometrics Group.
Specifically, he has shown me the importance of narrowing the gap between
technology/industry on one side and science on the other. While industry
is typically looking for solutions to real and complicated problems, science
is often more interested in generalizing idealized problems of little practical
use. Chemometrics and exploratory analysis enable a fruitful exchange of
problems, solutions and suggestions between the two different areas.
Secondly, I am most indebted to Professor Age Smilde (University of
Amsterdam, The Netherlands) for the kindness and wit he has offered
during the past years. Without knowing me he agreed that I could work at
his laboratory for two months in 1995. This stay formed the basis for most
of my insight into multi-way analysis, and as such he is the reason for this
thesis. Many e-mails, meetings, beers, and letters from and with Age
Smilde have enabled me to grasp, refine and develop my ideas and those
of others. While Lars Munck has provided me with an understanding of the
phenomenological problems in science and industry and the importance of
exploratory analysis, Age Smilde has provided me with the tools that enable
me to deal with these problems.
Many other people have contributed significantly to the work presented
in this thesis. It is difficult to rank such help, so I have chosen to present
these people alphabetically.
Claus Andersson (Royal Veterinary and Agricultural University,
Denmark), Sijmen de Jong (Unilever, The Netherlands), Paul Geladi
(University of Umeå, Sweden), Richard Harshman (University of Western
Ontario, Canada), Peter Henriksen (Royal Veterinary and Agricultural
BACKGROUND
1.1 INTRODUCTION
1.2 MULTI-WAY ANALYSIS
1.3 HOW TO READ THIS THESIS
2. MULTI-WAY DATA
2.1 INTRODUCTION
2.2 UNFOLDING
2.3 RANK OF MULTI-WAY ARRAYS
3. MULTI-WAY MODELS
3.1 INTRODUCTION
Structure
Constraints
Uniqueness
Sequential and non-sequential models
3.2 THE KHATRI-RAO PRODUCT
4. ALGORITHMS
4.1 INTRODUCTION
4.2 ALTERNATING LEAST SQUARES
4.3 PARAFAC
Initializing PARAFAC
Using the PARAFAC model on new data
Extending the PARAFAC model to higher orders
4.4 PARAFAC2
Initializing PARAFAC2
Using the PARAFAC2 model on new data
Extending the PARAFAC2 model to higher orders
4.5 PARATUCK2
Initializing PARATUCK2
Using the PARATUCK2 model on new data
Extending the PARATUCK2 model to higher orders
4.6 TUCKER MODELS
Initializing Tucker3
Using the Tucker model on new data
Extending the Tucker models to higher orders
4.7 MULTILINEAR PARTIAL LEAST SQUARES REGRESSION
Residual analysis
Cross-validation
Core consistency diagnostic
5.5 CHECKING CONVERGENCE
5.6 DEGENERACY
5.7 ASSESSING UNIQUENESS
5.8 INFLUENCE & RESIDUAL ANALYSIS
Residuals
Model parameters
5.9 ASSESSING ROBUSTNESS
5.10 FREQUENT PROBLEMS AND QUESTIONS
5.11 SUMMARY
6. CONSTRAINTS
6.1 INTRODUCTION
Definition of constraints
Extent of constraints
Uniqueness from constraints
6.2 CONSTRAINTS
7. APPLICATIONS
7.1 INTRODUCTION
Exploratory analysis
Curve resolution
Calibration
Analysis of variance
7.2 SENSORY ANALYSIS OF BREAD
Problem
Data
Noise reduction
Interpretation
Prediction
Conclusion
7.3 COMPARING REGRESSION MODELS (AMINO-N)
Problem
Data
Results
Conclusion
7.4 RANK-DEFICIENT SPECTRAL FIA DATA
Problem
Data
Structural model
Uniqueness of basic FIA model
8.2 DISCUSSION AND FUTURE WORK
APPENDIX
APPENDIX A: MATLAB FILES
APPENDIX B: RELEVANT PAPERS BY THE AUTHOR
BIBLIOGRAPHY
INDEX
LIST OF FIGURES
Figure 1. Graphical representation of three-way array
Figure 2. Definition of row, column, tube, and layer
Figure 3. Unfolding of three-way array
Figure 4. Two-component PARAFAC model
Figure 5. Uniqueness of fluorescence excitation-emission model
Figure 6. Cross-product array for PARAFAC2
Figure 7. The PARATUCK2 model
Figure 8. Score plot of rank-deficient fluorescence data
Figure 9. Comparing PARAFAC and PARATUCK2 scores
Figure 10. Scaling and centering conventions
Figure 11. Core consistency – amino acid data
Figure 12. Core consistency – bread data
Figure 13. Core consistency – sugar data
Figure 14. Different approaches for handling missing data
Figure 15. Smoothing time series data
Box 3. A generic ALS algorithm
Box 4. Structure of decomposition models
Box 5. PARAFAC algorithm
Box 6. PARAFAC2 algorithm
Box 7. Tucker3 algorithm
Box 8. Tri-PLS1 algorithm
Box 9. Tri-PLS2 algorithm
Box 10. Exact compression
Box 11. Non-negativity and weights in compressed spaces
Box 12. Effect of centering
Box 13. Effect of scaling
Box 14. Second-order advantage example
Box 15. ALS for row-wise and column-wise estimation
Box 16. NNLS algorithm
Box 17. Monotone regression algorithm
Box 18. Rationale for using PARAFAC for fluorescence data
Box 19. Aspects of GEMANOVA
Box 20. Alternative derivation of FIA model
Box 21. Avoiding local minima
Box 22. Non-negativity for fluorescence data
ABBREVIATIONS
ALS Alternating least squares
ANOVA Analysis of variance
CANDECOMP Canonical decomposition
DTD Direct trilinear decomposition
FIA Flow injection analysis
FNNLS Fast non-negativity-constrained least squares regression
GEMANOVA General multiplicative ANOVA
any constraints of a model; i.e., no parameters should
be negative if non-negativity is required
Fit: Indicates how well the model of the data describes the data. It can be
given as the percentage of variation explained or equivalently the
sum-of-squares of the errors in the model. Mostly equivalent to the
function value of the loss function
Latent variable: Factor
Layer: A submatrix of a three-way array (see Figure 2)
Loading vector: Part of factor referring to a specific (variable-) mode.
If no distinction is made between variables and objects, all parts of a
factor referring to a specific mode are called loading vectors
Loss function: The function defining the optimization or goodness
criterion of a model. Also called objective function
Mode: A matrix has two modes: the row mode and the column mode, hence the
mode is the basic entity building an array. A three-way array thus has
three modes
Model: An approximation of a set of data. Here specifically based on a
structural model, additional constraints and a loss function
Order: The order of an array is the number of modes; hence a matrix is a
second-order array, and a three-way array a third-order array
Profile: Column of a loading or score matrix. Also called loading or score
vector
Rank: The minimum number of PARAFAC components necessary to describe an
array. For a two-way array
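$\mathbf{x}$)) which is the minimum function value of $f(\mathbf{x})$.
$\cos(\mathbf{x}, \mathbf{y})$: The cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$
$\operatorname{cov}(\mathbf{x}, \mathbf{y})$: Covariance of the elements in $\mathbf{x}$ and $\mathbf{y}$
$\operatorname{diag}(\mathbf{X})$: Vector holding the diagonal of $\mathbf{X}$
$\max(\mathbf{x})$: The maximum element of $\mathbf{x}$
min($\mathbf{x}$
,F) Singular value decomposition. The matrix $\mathbf{U}$ will be the first F left
singular vectors of $\mathbf{X}$, and $\mathbf{V}$ the right singular vectors. The diagonal
matrix $\mathbf{S}$ holds the first F singular values in its diagonal
$\operatorname{tr}\,\mathbf{X}$: The trace of $\mathbf{X}$, i.e., the sum of the diagonal elements of $\mathbf{X}$
$\operatorname{vec}\,\mathbf{X}$: The term $\operatorname{vec}\,\mathbf{X}$ is the vector obtained by stringing out
(unfolding) $\mathbf{X}$ column-wise to a column vector (Henderson & Searle 1981).
If $\mathbf{X}$ = [
ij
$\mathbf{X} \otimes \mathbf{Y}$: The Kronecker tensor product of $\mathbf{X}$ and $\mathbf{Y}$ where $\mathbf{X}$ is of
size I × J is defined
$\mathbf{X} \odot \mathbf{Y}$: The Khatri-Rao product (page 20). The matrices $\mathbf{X}$ and $\mathbf{Y}$ must
have the same number of columns. Then $\mathbf{X} \odot \mathbf{Y} = [\mathbf{x}_1 \otimes \mathbf{y}_1$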
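The three operators defined above (vec, the Kronecker product and the Khatri-Rao product) can be illustrated with a minimal sketch. It is written in plain Python on nested lists for illustration only; the function names `vec`, `kron` and `khatri_rao` are hypothetical helpers and are not taken from the MATLAB files of the thesis.

```python
from typing import List

Matrix = List[List[float]]  # row-major storage: matrix[i][j]


def vec(x: Matrix) -> List[float]:
    """String out the columns of x, one after the other, into one vector."""
    n_rows, n_cols = len(x), len(x[0])
    return [x[i][j] for j in range(n_cols) for i in range(n_rows)]


def kron(x: Matrix, y: Matrix) -> Matrix:
    """Kronecker product: each element x[i][j] is replaced by the block x[i][j] * y."""
    return [
        [x_ij * y_kl for x_ij in x_row for y_kl in y_row]
        for x_row in x
        for y_row in y
    ]


def khatri_rao(x: Matrix, y: Matrix) -> Matrix:
    """Column-wise Kronecker product; x and y must have the same number of columns."""
    n_cols = len(x[0])
    if len(y[0]) != n_cols:
        raise ValueError("x and y must have the same number of columns")
    # Column f of the result is the Kronecker product of column f of x
    # with column f of y.
    return [
        [x_row[f] * y_row[f] for f in range(n_cols)]
        for x_row in x
        for y_row in y
    ]


X = [[1, 2], [3, 4]]
Y = [[5, 6], [7, 8]]
print(vec(X))            # [1, 3, 2, 4]
print(kron(X, Y)[0])     # [5, 6, 10, 12]
print(khatri_rao(X, Y))  # [[5, 12], [7, 16], [15, 24], [21, 32]]
```

In this small example the first column of the Khatri-Rao product, [5, 7, 15, 21], equals the Kronecker product of the first columns of X and Y, which is what the column-wise definition requires.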
ACKGROUND
1.1 INTRODUCTION
The subject of this thesis is multi-way analysis. The problems described
mostly stem from the food industry. This is not coincidental, as the data
analytical problems arising in the food area can be complex. The problems
include process analysis, analytical chemistry, sensory analysis,
econometrics, logistics, etc. The nature of the data arising from
these areas can be very different, which tends to complicate the data
analysis. The analytical problems are often further complicated by biological
and ecological variations. Hence, in dealing with data analysis in the food
area, it is important to have access to a diverse set of methodologies in
order to cope with the problems in a sensible way.
The data analytical techniques covered in this thesis are also applicable
in many other areas, as evidenced by the many papers on applications in
other areas that are emerging in the literature.
1.2 MULTI-WAY ANALYSIS
In standard multivariate data analysis, data are arranged in a two-way
structure: a table or a matrix. A typical example is a table in which each row
corresponds to a sample and each column to the absorbance at a particular
wavelength. The two-way structure explicitly implies that for every sample
the absorbance is determined at every wavelength and vice versa. Thus,
the data can be indexed by two indices: one defining the sample number
and one defining the wavelength number. This arrangement is closely
connected to the techniques subsequently used for analysis of the data
(principal component analysis, etc.). However, for a wide variety of data a
more appropriate structure would be a three-way table or an array. An
example could be a situation where for every sample the fluorescence
emission is determined at several wavelengths for several different
analysis is one such data analytical development.
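The difference between the two-way and three-way arrangements discussed in this section can be made concrete with a small sketch. It assumes a fluorescence-type array indexed by sample, emission wavelength and excitation wavelength (the excitation mode is an assumption made for illustration), and it shows one common way of unfolding such an array into a matrix by placing the slices for each excitation wavelength side by side. The function names and the unfolding order are illustrative choices, not necessarily the convention used later in the thesis.

```python
from typing import List

Array3 = List[List[List[float]]]  # x[i][j][k]: sample i, emission j, excitation k


def make_example_array(n_samples: int, n_emission: int, n_excitation: int) -> Array3:
    """Build a toy array whose values encode their own indices (100*i + 10*j + k)."""
    return [
        [[100 * i + 10 * j + k for k in range(n_excitation)] for j in range(n_emission)]
        for i in range(n_samples)
    ]


def unfold(x: Array3) -> List[List[float]]:
    """Rearrange an I x J x K array into an I x (J*K) matrix.

    The K slices of size I x J are placed next to each other, so the
    element x[i][j][k] ends up in row i, column k*J + j.
    """
    n_j = len(x[0])
    n_k = len(x[0][0])
    return [[sample[j][k] for k in range(n_k) for j in range(n_j)] for sample in x]


x = make_example_array(n_samples=2, n_emission=3, n_excitation=2)
print(x[1][2][0])    # three indices are needed to address one element: 120
for row in unfold(x):
    print(row)
# [0, 10, 20, 1, 11, 21]
# [100, 110, 120, 101, 111, 121]
```

After unfolding, each row still corresponds to one sample, but the columns mix two modes, so the matrix no longer shows which columns share an emission or an excitation wavelength.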
• Some multi-way model structures are unique. No additional constraints,
like orthogonality, are necessary to identify the model. This implicitly
means that it is possible to calibrate for analytes in samples of unknown
constitution, i.e., to estimate the concentration of analytes in a sample
where unknown interferents are present. This fact has been known and
investigated for quite some time in chemometrics through methods
like generalized rank annihilation, direct trilinear decomposition, etc.
However, from psychometrics and ongoing collaborative research
between the areas of psychometrics and chemometrics, it is known that
the methods used hitherto only hint at the potential of using
uniqueness for calibration purposes.
• Another aspect of uniqueness is what can be termed computer
chromatography. In analogy to ordinary chromatography, it is possible in
some cases to separate the constituents of a set of samples
mathematically, thereby alleviating the need for chromatography and cutting
down the consumption of chemicals and time. Curve resolution has been
extensively studied in chemometrics, but has seldom taken advantage
of the multi-way methodology. Attempts are now in progress to
merge ideas from these two areas.
• While uniqueness as a concept has long been the driving force for the
use of multi-way methods, it is also fruitful to simply view the multi-way
models as natural structural bases for certain types of data, e.g., in
sensory analysis, spectral analysis, etc. The mere fact that the models
are appropriate as a structural basis for the data, implies that using
of the work presented here deals with how to develop robust and fast
algorithms for expressing common knowledge (e.g., non-negativity of
absorbance and concentrations, unimodality of chromatographic profiles)
and how to incorporate such restrictions into larger optimization algorithms.
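To give an impression of what expressing such knowledge as a constraint can look like, the following minimal sketch solves the simplest possible non-negativity constrained least squares problem: a single regressor with one coefficient that must not be negative. This is only an illustration under that simplifying assumption. It is not the NNLS or FNNLS algorithm treated later in the thesis, and the function name `nonneg_ls_single` is hypothetical.

```python
from typing import Sequence


def nonneg_ls_single(x: Sequence[float], y: Sequence[float]) -> float:
    """Minimize sum((y - x*b)**2) over b >= 0 for a single regressor x.

    The loss is a convex parabola in b, so the constrained minimum is the
    unconstrained least squares estimate when that estimate is non-negative,
    and zero otherwise.
    """
    xtx = sum(v * v for v in x)
    if xtx == 0:
        raise ValueError("x must contain at least one non-zero element")
    b_unconstrained = sum(xi * yi for xi, yi in zip(x, y)) / xtx
    return max(0.0, b_unconstrained)


# A pure-component spectrum and a measured spectrum (toy numbers)
spectrum = [1.0, 2.0, 3.0]
print(nonneg_ls_single(spectrum, [2.0, 4.0, 6.0]))     # 2.0
print(nonneg_ls_single(spectrum, [-1.0, -2.0, -3.0]))  # 0.0
```

With several regressors the problem no longer has such a closed form, because setting one coefficient to zero changes the best values of the others. That is why dedicated algorithms are needed when such restrictions are built into larger optimization algorithms.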
1.3 HOW TO READ THIS THESIS
This thesis can be considered an introduction to, or tutorial on, advanced
multi-way analysis. The reader should be familiar with ordinary two-way
multivariate analysis, linear algebra, and basic statistical aspects in order
to fully appreciate the thesis. The organization of the thesis is as follows:
Chapter 1: Introduction