Chapter 7
A Multimodal Approach to Image
Data Mining and Concept Discovery
7.1 Introduction
This chapter gives an example of multimedia data mining by addressing the
automatic image annotation problem and its application to multimodal image
data mining and retrieval. Specifically, we propose a probabilistic semantic
model in which the visual features and the textual words are connected via a
hidden layer that constitutes the semantic concepts to be discovered, so as to
explicitly exploit the synergy between the two modalities. The association
of visual features and textual words is determined in a Bayesian framework
such that the confidence of the association can be provided. Extensive
evaluations of a prototype system based on the model are reported on a
large-scale, visually and semantically diverse image collection crawled from
the Web. In the proposed probabilistic model,
a hidden concept layer which connects the visual features and the word layer
is discovered by fitting a generative model to the training images and anno-
tation words. An Expectation-Maximization (EM) based iterative learning
procedure is developed to determine the conditional probabilities of the vi-
sual features and the textual words given a hidden concept class. Based on
the discovered hidden concept layer and the corresponding conditional prob-
abilities, the image annotation and the text-to-image retrieval are performed
using the Bayesian framework. The evaluations of the prototype system on
17,000 images and 7,736 automatically extracted annotation words from the
crawled Web pages for multimodal image data mining and retrieval have in-
dicated that the model and the framework are superior to a state-of-the-art
peer system in the literature.
The rest of the chapter is organized as follows: Section 7.2 introduces the
motivation for this work and outlines its main contributions.
Section 7.3 discusses the related work on image annotation and multimodal
image mining and retrieval. In Section 7.4 the proposed probabilistic seman-
querying modalities. Users can query an image database either by imagery,
by a collateral information modality (e.g., text), or by any combination.
In this chapter, we propose a probabilistic semantic model and the cor-
responding learning procedure to address the problem of automatic image
annotation and show its application to multimodal image data mining and
retrieval. Specifically, we use the proposed probabilistic semantic model to
explicitly exploit the synergy between the different modalities of the imagery
and the collateral information. In this work, we focus on only one collateral
modality: text. The model may be generalized to incorporate other
collateral modalities. Consequently, the synergy here is explicitly represented
as a hidden layer between the imagery and the text modalities. This hid-
den layer constitutes the concepts to be discovered through a probabilistic
framework such that the confidence of the association can be provided. An
Expectation-Maximization (EM) based iterative learning procedure is devel-
oped to determine the conditional probabilities of the visual features and the
© 2009 by Taylor & Francis Group, LLC
words given a hidden concept class. Based on the discovered hidden concept
layer and the corresponding conditional probabilities, the image-to-text and
text-to-image retrievals are performed in a Bayesian framework.
In recent image data mining and retrieval literature, COREL data have
been extensively used to evaluate the performance [14, 70, 75, 136]. It has
been argued [217] that the COREL data are much easier to annotate and
retrieve due to their small number of concepts and small variations of the
visual content. In addition, the relatively small number (1,000 to 5,000) of
training and test images typically used in the literature makes the
problem easier still and the evaluation less convincing. In order to truly
capture the difficulties in real scenarios such as Web image data mining and
retrieval and to demonstrate the robustness and the promise of the proposed
model and the framework in these challenging applications, we have evaluated
translation models. The models are correspondence extensions to Hofmann et
al.'s hierarchical clustering aspect model [102, 103, 101], and incorporate
multi-modality information. The models consider image annotation as a process
of translation from “visual language” to text and collect the co-occurrence
information by the estimation of the translation probabilities. The
correspondence between blobs and words is learned by using statistical
translation models.
As noted by the authors [14], the performance of the models is strongly af-
fected by the quality of image segmentation. More sophisticated graphical
models, such as Latent Dirichlet Allocation (LDA) [22] and correspondence
LDA, have also been applied to the image annotation problem recently [21].
Specific reviews on using the graphical models for multimedia data mining
including image annotation are given in Section 3.6.
Another way to address automatic image annotation is to apply classifica-
tion approaches. The classification approaches treat each annotated word (or
each semantic category) as an independent class and create a different image
classification model for every word (or category). One representative work
of these approaches is the automatic linguistic indexing of pictures (ALIPS)
[136]. In ALIPS, the training image set is assumed well classified and each
category is modeled by using 2D multi-resolution hidden Markov models. The
image annotation is based on the nearest-neighbor classification and word oc-
currence counting, while the correspondence between the visual content and
the annotation words is not exploited. In addition, the assumption made in
ALIPS that the annotation words are semantically exclusive does not hold in
practice.
Recently, relevance language models [75] have been successfully applied to
automatic image annotation. The essential idea is to first find annotated
images that are similar to a test image and then use the words shared by the
annotations of the similar images to annotate the test image. One model in
this category is the Multiple-Bernoulli Relevance Model (MBRM) [75], which
is based on the Continuous-space Relevance Model (CRM) [134]. In MBRM,
[...] $f_i$, $i \in [1, N]$, denotes the visual feature vector of images in
the training database, where $N$ is the size of the image database. $w_j$,
$j \in [1, M]$, denotes the distinct textual words in the training annotation
word set, where $M$ is the size of the annotation vocabulary in the training
database.
In the probabilistic model, we assume the visual features of images in the
database, $f_i = [f_i^1, f_i^2, \ldots, f_i^L]$, $i \in [1, N]$, are known
i.i.d. samples from an unknown distribution. The dimension of the visual
feature is $L$. We also assume that the specific visual feature-annotation
word pairs $(f_i, w_j)$, $i \in [1, N]$, $j \in [1, M]$, are known i.i.d.
samples from an unknown distribution. Furthermore, we assume that these
samples are associated with an unobserved semantic concept variable
$z \in Z = \{z_1, \ldots, z_K\}$ [...] $(f_i, w_j)$ are conditionally
independent given the respective hidden concept $z_k$,
$$P(f_i, w_j \mid z_k) = p_F(f_i \mid z_k)\, P_V(w_j \mid z_k) \tag{7.1}$$
The visual feature and word distribution are treated as a randomized data
generation process, described as follows:
• Choose a concept with probability $P_Z(z_k)$
[...] is discarded. The graphical representation of this model is depicted in
Figure 7.1.
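The data generation process described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: after the concept is chosen, the sketch assumes a feature is drawn from $p_F(\cdot \mid z_k)$ (taken to be Gaussian here) and a word from $P_V(\cdot \mid z_k)$, as Equation 7.1 suggests. The function name `sample_pair` and all parameter values are hypothetical toy choices.

```python
import numpy as np

def sample_pair(P_Z, means, covs, P_V, rng):
    """Draw one (feature, word index) pair; the concept index is discarded."""
    k = rng.choice(len(P_Z), p=P_Z)                 # choose a concept z_k
    f = rng.multivariate_normal(means[k], covs[k])  # feature ~ p_F(.|z_k), assumed Gaussian
    j = rng.choice(P_V.shape[1], p=P_V[k])          # word ~ P_V(.|z_k)
    return f, j

# Hypothetical toy parameters: K = 2 concepts, L = 2 feature dimensions, M = 3 words.
rng = np.random.default_rng(0)
P_Z = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2), np.eye(2)])
P_V = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.2, 0.7]])

f, j = sample_pair(P_Z, means, covs, P_V, rng)
```

Only the pair `(f, j)` is returned; the concept index `k` stays internal to the function, mirroring the fact that the concept variable is unobserved.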
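Translating this process into a joint probability model results in the
expression
$$P(f_i, w_j) = P(w_j)\, P(f_i \mid w_j) = P(w_j) \sum_{k=1}^{K} P_F(f_i \mid z_k)\, P(z_k \mid w_j) \tag{7.2}$$
[...]
$$P(f_i, w_j) = \sum_{k=1}^{K} P_Z(z_k)\, P_F(f_i \mid z_k)\, P_V(w_j \mid z_k) \tag{7.3}$$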
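As a small numerical illustration, the joint probability can be evaluated as a sum over concepts. The sketch assumes Equation 7.3 has the mixture form $\sum_k P_Z(z_k)\, p_F(f_i \mid z_k)\, P_V(w_j \mid z_k)$, which is what marginalizing Equation 7.1 over the concepts gives. The array layout and the name `joint_probability` are choices made for this example, and the numbers are made up.

```python
import numpy as np

def joint_probability(P_Z, pF, P_V):
    """Return an (N, M) array with entry [i, j] = sum_k P_Z[k] * pF[i, k] * P_V[k, j].

    P_Z: (K,) concept priors
    pF:  (N, K) with pF[i, k] = p_F(f_i | z_k)
    P_V: (K, M) with P_V[k, j] = P_V(w_j | z_k)
    """
    return np.einsum("k,ik,kj->ij", P_Z, pF, P_V)

# Toy values (hypothetical): N = 2 images, K = 2 concepts, M = 2 words.
P_Z = np.array([0.5, 0.5])
pF = np.array([[0.2, 0.1],
               [0.0, 0.4]])
P_V = np.array([[1.0, 0.0],
                [0.5, 0.5]])
joint = joint_probability(P_Z, pF, P_V)
```

Because `pF` holds density values rather than probabilities, the entries of `joint` are densities in the feature argument and need not sum to one over images.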
The mixture of Gaussians [60] is assumed for the feature-concept conditional
probability $P_F(\cdot \mid Z)$. In other words, the visual features are
generated from $K$ Gaussian distributions, each one corresponding to a $z_k$.
For a specific semantic concept variable $z_k$, the conditional pdf of visual
feature $f_i$ is
$$p_F(f_i \mid z_k) = \frac{1}{(2\pi)^{L/2}\, |\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(f_i - \mu_k)^T \Sigma_k^{-1} (f_i - \mu_k)\right) \tag{7.4}$$
[...] probabilities $P_V(\cdot \mid Z)$, i.e., $P_V(w_j \mid z_k)$ for
$k \in [1, K]$, are estimated through fitting the probabilistic model to the
training set.
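The Gaussian conditional pdf assumed for each concept can be evaluated as in the sketch below. It implements the standard multivariate Gaussian density with a mean vector and covariance matrix per concept. The names `gaussian_pdf`, `mu`, and `sigma` are introduced here for illustration only.

```python
import numpy as np

def gaussian_pdf(f, mu, sigma):
    """Multivariate Gaussian density at f with mean mu (L,) and covariance sigma (L, L)."""
    L = mu.shape[0]
    diff = f - mu
    # Solve sigma @ x = diff rather than forming the explicit inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (L / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

# Density of a standard 2-D Gaussian at its mean.
value = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
```

For high-dimensional features the density can underflow, so a practical implementation would more likely work with log-densities; the direct form is kept here to stay close to the formula.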
Following the likelihood principle, one determines $P_F(f_i \mid z_k)$ by the
maximization of the log-likelihood function
$$\log \prod_{i=1}^{N} p_F(f_i \mid Z)^{u_i} = \sum_{i=1}^{N} u_i \log\left(\sum_{k=1}^{K} P_Z(z_k)\, p_F(f_i \mid z_k)\right) \tag{7.5}$$
[...] $P_V(w_j \mid z_k)$ can be determined by the maximization of the
log-likelihood function
$$\mathcal{L} = \log P(F, V) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(w_j^i) \log P(f_i, w_j) \tag{7.6}$$
where $n(w_j^i)$ denotes the weight of annotation word $w_j$, i.e., the
occurrence frequency, for image $f_i$.
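The weighted log-likelihood of Equation 7.6 is a sum of log joint probabilities, each weighted by how often the word occurs for the image. A minimal sketch follows, assuming the weights and the joint values are already available as (N, M) arrays; the name `log_likelihood` and the toy numbers are hypothetical.

```python
import numpy as np

def log_likelihood(n, joint):
    """Sum over i, j of n[i, j] * log(joint[i, j]).

    n:     (N, M) word weights n(w_j^i), zero where word j does not annotate image i
    joint: (N, M) values of P(f_i, w_j)
    """
    mask = n > 0  # pairs with zero weight contribute nothing and are skipped
    return float(np.sum(n[mask] * np.log(joint[mask])))

# Toy values: image 0 carries word 0 once, image 1 carries word 1 twice.
n = np.array([[1.0, 0.0],
              [0.0, 2.0]])
joint = np.array([[0.5, 0.1],
                  [0.1, 0.25]])
ll = log_likelihood(n, joint)
```

Masking the zero-weight pairs avoids evaluating the logarithm at entries that do not enter the sum.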
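[...]
$$p(z_k \mid f_i) = \frac{P_Z(z_k)\, p_F(f_i \mid z_k)}{\sum_{t=1}^{K} P_Z(z_t)\, p_F(f_i \mid z_t)} \tag{7.7}$$
$$P(z_k \mid f_i, w_j) = \frac{P_Z(z_k)\, p_F(f_i \mid z_k)\, P_V(w_j \mid z_k)}{\sum_{t=1}^{K} P_Z(z_t)\, p_F(f_i \mid z_t)\, P_V(w_j \mid z_t)} \tag{7.8}$$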
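The two posteriors of Equations 7.7 and 7.8 can be computed together in vectorized form. The sketch below reads both equations as Bayes' rule applied to the mixture, with the normalization running over the $K$ concepts. The array layout and the name `e_step` are assumptions of this example.

```python
import numpy as np

def e_step(P_Z, pF, P_V):
    """Return p(z_k | f_i) as an (N, K) array and P(z_k | f_i, w_j) as an (N, M, K) array.

    P_Z: (K,), pF: (N, K) with pF[i, k] = p_F(f_i | z_k), P_V: (K, M).
    """
    w = pF * P_Z                                   # (N, K): P_Z(z_k) p_F(f_i|z_k)
    post_f = w / w.sum(axis=1, keepdims=True)      # normalize over concepts
    wj = w[:, None, :] * P_V.T[None, :, :]         # (N, M, K): extra factor P_V(w_j|z_k)
    post_fw = wj / wj.sum(axis=2, keepdims=True)   # normalize over concepts
    return post_f, post_fw

# Toy values (hypothetical): N = 2, K = 2, M = 2.
P_Z = np.array([0.5, 0.5])
pF = np.array([[0.2, 0.1],
               [0.1, 0.4]])
P_V = np.array([[0.9, 0.1],
                [0.5, 0.5]])
post_f, post_fw = e_step(P_Z, pF, P_V)
```

Each row of `post_f` and each `post_fw[i, j, :]` sums to one. The sketch does not guard against a zero normalizer, which would need handling on real data.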
The expectation of the complete-data likelihood $\log P(F, V, Z)$ for the
estimated $P(Z \mid F, V)$ derived from Equation 7.8 is
$$\sum_{(i,j)=1}^{K} \sum_{i=1}^{N} \sum_{j=1}^{M} n(w_j^i) \log \big[ P_Z(z_{i,j})\, p_F \cdots$$
[...] $z_{i,j}$ is the concept variable associated with the feature-word pair
$(f_i, w_j)$. In other words, $(f_i, w_j)$ belongs to concept $z_t$ where
$t = (i, j)$.
Similarly, the expectation of the likelihood $\log P(F, Z)$ for the estimated
$P(Z \mid F)$ derived from Equation 7.7 is
$$\sum_{k=1}^{K} \sum_{i=1}^{N} \log \big( P_Z(z_k)\, p_F(f_i \mid z_k \cdots$$
[...]
$$\sum_{k=1}^{K} P_Z(z_k) = 1, \qquad \sum_{k=1}^{K} P(z_k \mid f_i, w_j) = 1 \tag{7.11}$$
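for any $f_i$, $w_j$, and $z_l$, the parameters are determined as
$$\mu_k = \frac{\sum_{i=1}^{N} u_i\, f_i\, p(z_k \mid f_i)}{\sum_{s=1}^{N} u_s\, p(z_k \mid f_s)} \tag{7.12}$$
$$\Sigma_k = \frac{\sum_{i=1}^{N} u_i\, p(z_k \mid f_i)\, (f_i - \mu_k)(f_i - \mu_k)^T}{\sum_{s=1}^{N} u_s\, p(z_k \mid f_s)} \tag{7.13}$$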
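$$P_Z(z_k) = \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} n(w_j^i)\, P(z_k \mid f_i, w_j)}{\sum_{j=1}^{M} \sum_{i=1}^{N} n(w_j^i)} \tag{7.14}$$
$$P_V(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(w_j^i)\, P(z_k \mid f_i, w_j)}{\sum_{u=1}^{M} \sum_{v=1}^{N} n(w_u^v)\, P(z_k \mid f_v, w_u)} \tag{7.15}$$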
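The parameter updates can be sketched as one function that takes the posteriors from Equations 7.7 and 7.8 and returns new parameters. This is a hedged illustration, not the authors' code. It assumes the updates take the usual weighted forms: posterior-weighted means and covariances for the Gaussians, and normalized posterior-weighted word counts for $P_Z$ and $P_V$. It also treats $u_i$ as a per-image weight, set in the toy data to the number of annotation words of the image, which is an assumption. The name `m_step` and the toy arrays are hypothetical.

```python
import numpy as np

def m_step(F, n, u, post_f, post_fw):
    """One parameter update under the assumed weighted forms.

    F: (N, L) features; n: (N, M) word weights n(w_j^i); u: (N,) image weights;
    post_f: (N, K) with p(z_k | f_i); post_fw: (N, M, K) with P(z_k | f_i, w_j).
    """
    r = u[:, None] * post_f                       # (N, K): u_i p(z_k|f_i)
    denom = r.sum(axis=0)                         # (K,)
    means = (r.T @ F) / denom[:, None]            # (K, L) weighted means
    diff = F[:, None, :] - means[None, :, :]      # (N, K, L)
    covs = np.einsum("nk,nkl,nkm->klm", r, diff, diff) / denom[:, None, None]
    weighted = n[:, :, None] * post_fw            # (N, M, K)
    P_Z = weighted.sum(axis=(0, 1)) / n.sum()     # concept priors
    P_V = weighted.sum(axis=0).T                  # (K, M), summed over images
    P_V = P_V / P_V.sum(axis=1, keepdims=True)    # normalize over words
    return means, covs, P_Z, P_V

# Toy data (hypothetical): two well-separated groups with hard assignments.
F = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
n = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
u = n.sum(axis=1)
post_f = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
post_fw = np.repeat(post_f[:, None, :], 2, axis=1)  # same posterior for every word
means, covs, P_Z, P_V = m_step(F, n, u, post_f, post_fw)
```

With hard assignments the update reduces to per-group sample means and covariances, which makes the toy result easy to check by hand. Alternating such an update with the posterior computation gives one EM iteration.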
Alternating Equations 7.7 and 7.8 with Equations 7.12–7.15 defines a
procedure that converges to a local maximum of the expectations in
Equations 7.9 and 7.10.