ngan, meier, chai - advanced video coding principles and techniques - Pdf 14

Advanced Video Coding:
Principles and Techniques
Series Editor: J. Biemond, Delft University of Technology, The Netherlands
Volume 1: Three-Dimensional Object Recognition Systems (edited by A.K. Jain and P.J. Flynn)
Volume 2: VLSI Implementations for Image Communications (edited by P. Pirsch)
Volume 3: Digital Moving Pictures - Coding and Transmission on ATM Networks (J.-P. Leduc)
Volume 4: Motion Analysis for Image Sequence Coding (G. Tziritas and C. Labit)
Volume 5: Wavelets in Image Communication (edited by M. Barlaud)
Volume 6: Subband Compression of Images: Principles and Examples (T.A. Ramstad, S.O. Aase and J.H. Husøy)
Volume 7: Advanced Video Coding: Principles and Techniques (K.N. Ngan, T. Meier and D. Chai)
ADVANCES IN IMAGE COMMUNICATION 7
Advanced Video Coding:
Principles and Techniques
King N. Ngan, Thomas Meier and Douglas Chai
University of Western Australia,
Dept. of Electrical and Electronic Engineering,
Visual Communications Research Group,
Nedlands, Western Australia 6907

Address permissions requests to: Elsevier Science Rights & Permissions Department, at the mail, fax and e-mail addresses noted above.
Notice
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability,
negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein.
Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made.
First edition 1999
Library of Congress Cataloging in Publication Data
A catalog record from the Library of Congress has been applied for.
ISBN: 0444 82667 X
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
Printed in The Netherlands.
To Nerissa, Xixiang, Simin, Siqi
To Elena
To June
Preface
The rapid advancement in computer and telecommunication technologies is
affecting every aspect of our daily lives. It is changing the way we interact
with each other and the way we conduct business, and it has a profound impact
on the environment in which we live. Increasingly, the boundaries between
computer, telecommunication and entertainment are blurring as the three
industries become more integrated with each other. Nowadays, one no longer
uses the computer solely as a computing tool, but often as a console for
video games and movies, and increasingly as a telecommunication terminal for
fax, voice or videoconferencing. Similarly, the traditional telephone network
now supports a diverse range of applications such as video-on-demand,
videoconferencing, Internet, etc.
One of the main driving forces behind the explosion in information traffic
across the globe is the ability to move large chunks of data over the exist-
ing telecommunication infrastructure. This is made possible largely due to

have to evolve to integrate and code the multimedia content. The concept
of video as a sequence of rectangular frames displayed in time is outdated
since video nowadays can be captured in different locations and composed as
a composite scene. Furthermore, video can be mixed with graphics and
animation to form a new video, and so on. The new paradigm is to view video
content as an audiovisual object which, as an entity, can be coded,
manipulated and composed in whatever way an application requires.
MPEG-4 is the emerging standard for the coding of multimedia content.
It defines a syntax for a set of content-based functionalities, namely,
content-based interactivity, compression and universal access. However, it
does not specify how the video content is to be generated. The process of
video generation is difficult and under active research. One simple way is to
capture the visual objects separately, as is done in TV weather reports,
where the weather reporter stands in front of a weather map captured
separately and then composed together with the reporter. The problem is that
this is not always possible, as in the case of outdoor live broadcasts.
Therefore, automatic segmentation has to be employed to generate the visual
content in real-time for encoding. Visual content is segmented as a
semantically meaningful object known as a video object plane. The video
object plane is then tracked, making use of the temporal correlation between
frames, so that its location is known in subsequent frames. Encoding can then
be carried out using MPEG-4.
This book addresses the more advanced topics in video coding not included
in most of the video coding books in the market. The focus of the

region.
Chapter 3 describes the foreground/background (F/B) coding scheme
where the facial region (the foreground) is coded with more bits than the
background region. The objective is to achieve an improvement in the
perceptual quality of the region of interest, i.e., the face, in the encoded
image. The F/B coding algorithm is integrated into the H.261 coder with
full compatibility, and into the H.263 coder with slight modifications of
its syntax. Rate control in the foreground and background regions is also
investigated using the concept of joint bit assignment. Lastly, the MPEG-4
coding standard in the context of foreground/background coding scheme is
studied.
As mentioned above, multimedia content can contain synthetic objects
or objects which can be represented by synthetic models. One such model
is the 3-D wire-frame model (WFM) consisting of 500 triangles commonly
used to model human head and body. Model-based coding is the technique
used to code the synthetic wire-frame models. Chapter 4 describes the pro-
cedure involved in model-based coding for a human head. In model-based
coding, the most difficult problem is the automatic location of the object
in the image. The object location is crucial for accurate fitting of the 3-D
WFM onto the physical object to be coded. The techniques employed for
automatic facial feature contours extraction are active contours (or snakes)
for face profile and eyebrow extraction, and deformable templates for eye
and mouth extraction. For synthesis of the facial image sequence, head mo-
tion parameters and facial expression parameters need to be estimated. At
the decoder, the facial image sequence is synthesized using the facial
structure deformation method, which deforms the structure of the 3-D WFM to
simulate facial expressions. Facial expressions can be represented by 44
action units and the deformation of the WFM is done through the movement
of vertices according to the deformation rules defined by the action units.
Facial texture is then updated to improve the quality of the synthesized

Video Verification Model 11. This section gives a succinct explanation of
the various techniques employed in the coding of natural images and video
including shape coding, motion estimation and compensation, prediction,
texture coding, scalable coding, sprite coding and still image coding. The
following section gives an overview of the coding of synthetic objects. The
approach adopted here is similar to that described in Chapter 4. In order
to handle video transmission in error-prone environment such as the mobile
channels, MPEG-4 has incorporated error resilience functionality into the
standard. The last section of the chapter describes the error resilient tech-
niques used in MPEG-4 for video transmission over mobile communication
networks.
King N. Ngan
Thomas Meier
Douglas Chai
June 1999
Acknowledgments
The authors would like to thank Professor K. Aizawa of the University of
Tokyo, Japan, for the use of the "Makeface" 3-D wireframe synthesis software
package, from which some of the images in Chapter 4 were obtained.
Table of Contents
Preface
Acknowledgments
1 Image and Video Segmentation
1.1 Bayesian Inference and MRFs
1.1.1 MAP Estimation

2.2.1 Shape Analysis
2.2.2 Motion Analysis
2.2.3 Statistical Analysis
2.2.4 Color Analysis
2.3 Applications
2.3.1 Coding Area of Interest with Better Quality
2.3.2 Content-based Representation and MPEG-4
2.3.3 3D Human Face Model Fitting
2.3.4 Image Enhancement
2.3.5 Face Recognition, Classification and Identification
2.3.6 Face Tracking
2.3.7 Facial Expression Study
2.3.8 Multimedia Database Indexing
2.4 Modeling of Human Skin Color
2.4.1 Color Space
2.4.2 Limitations of Color Segmentation
2.5 Skin Color Map Approach
2.5.1 Face Segmentation Algorithm
2.5.2 Stage One - Color Segmentation
2.5.3 Stage Two - Density Regularization
2.5.4 Stage Three - Luminance Regularization
2.5.5 Stage Four - Geometric Correction
2.5.6 Stage Five - Contour Extraction
2.5.7 Experimental Results
References
3 Foreground/Background Coding
3.1 Introduction
3.2 Related Works
3.3 Foreground and Background Regions
3.4 Content-based Bit Allocation

Properties
4.4 WFM Fitting and Adaptation
4.4.1 Head Model Adjustment
4.4.2 Eye Model Adjustment
4.4.3 Eyebrow Model Adjustment
4.4.4 Mouth Model Adjustment
4.5 Analysis of Facial Image Sequences
4.5.1 Estimation of Head Motion Parameters
4.5.2 Estimation of Facial Expression Parameters
4.5.3 High Precision Estimation by Iteration
4.6 Synthesis of Facial Image Sequences
4.6.1 Facial Structure Deformation Method
4.7 Update of 3-D Facial Model
4.7.1 Update of Texture Information
4.7.2 Update of Depth Information
4.7.3 Transmission Bit Rates
References
5 VOP Extraction and Tracking
5.1 Video Object Plane Extraction Techniques
5.2 Outline of VOP Extraction Algorithm
5.3 Version I: Morphological Motion Filtering
5.3.1 Global Motion Estimation
5.3.2 Object Motion Detection Using Morphological Motion Filtering
5.3.3 Model Initialization
5.3.4 Object Tracking Using the Hausdorff Distance
5.3.5 Model Update
5.3.6 VOP Extraction

6.6.3 Shape Coding
6.6.4 Motion Estimation and Compensation
6.6.5 Texture Coding
6.6.6 Prediction and Coding of B-VOPs
6.6.7 Generalized Scalable Coding
6.6.8 Sprite Coding
6.6.9 Still Image Texture Coding
6.7 Coding of Synthetic Objects
6.7.1 Facial Animation
6.7.2 Body Animation
6.7.3 2-D Animated Meshes
6.8 Error Resilience
6.8.1 Resynchronization
6.8.2 Data Recovery
6.8.3 Error Concealment
6.8.4 Modes of Operation
6.8.5 Error Resilience Encoding Tools
References
Index
Chapter 1
Image and Video Segmentation
Segmentation plays a crucial role in second-generation image and video
coding schemes, as well as in content-based video coding. It is one of the
most difficult tasks in image processing, and it often determines the eventual
success or failure of a system.
Broadly speaking, segmentation seeks to subdivide images into regions of
similar attributes. Some of the most fundamental attributes are luminance,
color, and optical flow. They result in a so-called low-level segmentation,

sequence, VOP segmentation is generally far more difficult than low-level
segmentation. Furthermore, VOP extraction for content-based interactivity
functionalities is an unforgiving task. Even small errors in the contour can
render a VOP useless for such applications.
This chapter starts with a review of Bayesian inference and Markov
random fields (MRFs), which will be needed throughout this chapter. A
brief discussion of edge detection is given in Section 1.2, and Section 1.3
deals with low-level still image segmentation. The remaining three sections
are devoted to video segmentation. First, an introduction to motion and
motion estimation is given in Sections 1.4 and 1.5, before video segmentation
techniques are examined in Sections 1.6 and 5.1. For a review of VOP
segmentation algorithms, we refer the reader to Chapter 5.
1.1 Bayesian Inference and Markov Random Fields
Bayesian inference is among the most popular and powerful tools in image
processing and computer vision [13, 14, 15]. The basis of Bayesian
techniques is the famous inversion formula
$$P(X \mid O) = \frac{P(O \mid X)\,P(X)}{P(O)}. \tag{1.1}$$
Although equation (1.1) is trivial to derive using the axioms of probability
theory, it represents a major concept. To understand this better, let X
denote an unknown parameter and O an observation that provides some
information about X. In the context of decision making, X and O are
sometimes referred to as hypothesis and evidence, respectively. $P(X \mid O)$
can now be viewed as the posterior probability of the unknown parameter X,
given the observation O. The inversion formula (1.1) enables us to express
$P(X \mid O)$

does not depend on X. Hence, we can write
$$P(X \mid O) \propto P(O \mid X)\,P(X). \tag{1.2}$$
For the purpose of a simplified notation, it is often more convenient to
minimize the negative logarithm of $P(X \mid O)$ instead of maximizing
$P(X \mid O)$ directly. However, this has no effect on the outcome of the
estimation. The MAP estimate of X is now given by
$$X_{MAP} = \arg\max_X \{P(O \mid X)\,P(X)\} = \arg\min_X \{-\log P(O \mid X) - \log P(X)\}. \tag{1.3}$$
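The equivalence of the two forms in (1.3) can be checked numerically on a toy problem. The following is only a minimal sketch under stated assumptions: X is taken to be a binary label, O a single gray level, and the prior values, class means and standard deviation are invented for illustration. None of these names or numbers come from the text.

```python
import math

# Hypothetical toy model (all numbers are illustrative assumptions):
# X is a binary label (0 = background, 1 = object), O is one gray level.
PRIOR = {0: 0.7, 1: 0.3}        # assumed prior P(X)
MEANS = {0: 50.0, 1: 120.0}     # assumed class means of the gray level
SIGMA = 30.0                    # assumed common standard deviation


def likelihood(o, x):
    """Gaussian P(O = o | X = x) under the assumed class model."""
    z = (o - MEANS[x]) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))


def map_by_product(o):
    """First form of (1.3): arg max over X of P(O|X) P(X)."""
    return max(PRIOR, key=lambda x: likelihood(o, x) * PRIOR[x])


def map_by_neg_log(o):
    """Second form of (1.3): arg min over X of -log P(O|X) - log P(X)."""
    return min(PRIOR, key=lambda x: -math.log(likelihood(o, x)) - math.log(PRIOR[x]))


def posterior(o):
    """Posterior P(X|O) from (1.1); P(O) acts only as a normalizing constant."""
    joint = {x: likelihood(o, x) * PRIOR[x] for x in PRIOR}
    p_o = sum(joint.values())
    return {x: joint[x] / p_o for x in joint}
```

Because the logarithm is monotonic and P(O) does not depend on X, `map_by_product` and `map_by_neg_log` select the same label for any observation, and that label is also the one with the largest value in `posterior`.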
From (1.3) it can be seen that the knowledge of two probability functions
is required. The prior probability $P(X)$ contains the information that is
available a priori, that is, it describes our prior expectation on X before
knowing O. While it is often possible to determine $P(X)$ from theoretical or
experimental knowledge, subjective experience sometimes plays an important
role. As we will see later, Gibbs distributions are by far the most popular
choice for $P(X)$ in image processing, which means that X is assumed to be a
sample of a Markov random field (MRF).
The conditional probability

that is available from knowledge and experience and the information brought
in by the observation O [16].
Estimation problems are frequently encountered in image processing and
computer vision. Applications include image and video segmentation [16,
17, 18, 19], where O represents an image or a video sequence and X is the
segmentation label field to be estimated. In image restoration [20, 21, 22], X
is the unknown original image we would like to recover and O the degraded
image. Bayesian inference is also popular in motion estimation [23, 24, 25,
26], with X denoting the unknown optical flow field and O containing two
or more frames of a video sequence. In all these examples, the unknown
parameter X is modeled by a random field.
1.1.2 Markov Random Fields (MRFs)
Without doubt the most important statistical signal models in image
processing and computer vision are based on Markov processes [27, 20, 28, 29].
Due to their ability to represent the spatial continuity that is inherent in
natural images, they have been successfully applied in various applications
to determine the prior distribution $P(X)$. Examples of such Markov random
fields include region processes or label fields in segmentation
problems [16, 17, 18, 30], models for texture or image intensity [20, 21, 30, 31],
and optical flow fields [23, 26].
First, some definitions will be introduced with focus on discrete 2-D
random fields. We denote by $L = \{(i,j) \mid 1 \le i \le M,\ 1 \le j \le N\}$ a
finite $M \times N$ rectangular lattice of sites or pixels. A neighborhood
system $\mathcal{N}$ is then defined as any collection of subsets
$\mathcal{N}_{i,j}$ of $L$,
$$\mathcal{N} = \{\mathcal{N}_{i,j} \mid (i,j) \in L \text{ and } \mathcal{N}_{i,j} \subset L\}, \tag{1.4}$$
such that for any pixel

$= P(X(i,j) \mid X(k,l),\ (k,l) \in \mathcal{N}_{i,j})$
2) $P(X = x) > 0$ for all $x \in \Omega$    (1.7)
for every $(i,j) \in L$.
The first condition is the well-known Markovian property. It restricts
the statistical dependency of pixel (i, j) to its neighbors and thereby signif-
icantly reduces the complexity of the model. It is interesting to notice that
this condition is satisfied by any random field defined on a finite lattice if
the neighborhood is chosen large enough [29]. Such a neighborhood system
would, however, not benefit from a reduction in complexity like, for exam-
ple, a second-order system. The second condition in (1.7), the so-called
positivity condition, requires all realizations $x \in \Omega$ of the MRF to have
positive probabilities. It is not always included into the definition of MRFs,
but it must be satisfied for the Hammersley-Clifford theorem below.
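To make the notion of a neighborhood system on the lattice L concrete, the sketch below builds the neighbor set of a site for a first-order and a second-order system. It assumes the common convention that a first-order system contains the four horizontally and vertically adjacent sites and a second-order system the eight surrounding sites; the function name and its parameters are illustrative and not taken from the text.

```python
def neighbors(i, j, M, N, order=2):
    """Return the neighbor set of site (i, j) on a 1-based M x N lattice.

    Assumed convention: order=1 gives the 4 nearest sites,
    order=2 gives the 8 surrounding sites. Sites outside the
    lattice are dropped, so border pixels have fewer neighbors.
    """
    if order == 1:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif order == 2:
        offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                   if (di, dj) != (0, 0)]
    else:
        raise ValueError("only first- and second-order systems are sketched here")
    return {(i + di, j + dj) for di, dj in offsets
            if 1 <= i + di <= M and 1 <= j + dj <= N}
```

With this construction a site is never its own neighbor and the relation is symmetric: if (k, l) is a neighbor of (i, j), then (i, j) is a neighbor of (k, l). The Markovian property then says that, given the values on `neighbors(i, j, M, N)`, the value at (i, j) is independent of the rest of the field.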
The definition (1.7) is not directly suitable to specify an MRF, but
fortunately the Hammersley-Clifford theorem [27] greatly simplifies the
specification. It states that a random field X is an MRF if and only if
$P(X)$ can be written as a Gibbs distribution¹. That is,
$$P(X = x) = \frac{1}{Z} \exp\left(-\frac{1}{T}\,U(x)\right), \quad \forall x \in \Omega. \tag{1.8}$$
The Gibbs distribution was first used in physics and statistical mechanics.
Best known is the Ising Model, which was proposed to model the magnetic

longing to C. It follows that the energy function $U(x)$, and therefore the

¹ Sometimes called a Boltzmann-Gibbs distribution [32].
