Báo cáo khoa học: "A Grammatical Approach to Understanding Textual Tables using Two-Dimensional SCFGs" - Pdf 11

Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 905–912,
Sydney, July 2006.
c
2006 Association for Computational Linguistics
A Grammatical Approach to Understanding Textual Tables using
Two-Dimensional SCFGs
Dekai WU
1
Ken Wing Kuen LEE
Human Language Technology Center
HKUST
Department of Computer Science and Engineering
University of Science and Technology
Clear Water Bay, Hong Kong
{dekai,cswkl}@cs.ust.hk
Abstract
We present an elegant and extensible
model that is capable of providing seman-
tic interpretations for an unusually wide
range of textual tables in documents. Un-
like the few existing table analysis mod-
els, which largely rely on relatively ad hoc
heuristics, our linguistically-oriented ap-
proach is systematic and grammar based,
which allows our model (1) to be concise
and yet (2) recognize a wider range of data
models than others, and (3) disambiguate
to a significantly finer extent the under-
lying semantic interpretation of the table
in terms of data models drawn from rela-
tion database theory. To accomplish this,

many ways equivalent to the overlay of prosody on
speech Just as prosody undoubtedly contributes
to the meaning of utterances, so too does a text’s
graphical presentation contribute to its meaning.
However few natural language understanding
systems use graphical presentational features to
aid interpretation ” (Power et al., 2003).
Nowhere is this more evident than in the wide-
spread use of tables in real-world, unsimplified
text documents. Tables have a comparable or
greater complexity as other elements of text. Un-
fortunately, in mainstream NLP it is not uncom-
mon for tables to be regarded as a somehow “de-
generate” form of text, unworthy of the same de-
gree of attention as the rest of the text. But as
we will discuss, the degree of ambiguity in ta-
ble understanding is at least as great as for many
sense and attachment problems. Many of the same
mechanisms used for understanding linear text are
also required for table understanding. The same
division of surface syntax and underlying seman-
tics is found.
Indeed, to perceive the limitations of existing
table understanding models, we may distinguish
several very different levels of table analysis tasks.
In table classification, the table is classified into
one of several coarse categories (in the extreme
case, some models simply predict whether the pur-
pose of the table is for page layout versus tabular
data). In table synactic recognition, the surface

dation toward a linguistic approach by first shift-
ing to a two-dimensional CFG framework. This
permits us to construct a grammar where all the
rules are meaningfully discriminative, such that—
unlike existing table understanding models—any
analysis of a table includes a full parse tree that
assigns precise data model labels to all its regions
(including nested subregions) thereby specifying
the logical relations between the table’s elements.
Additionally, probabilities on the production rules
support thresholding (or ranking) of the alternative
candidate table interpretation hypotheses.
As with many natural language phenomena, a
full model of disambiguation must ultimately inte-
grate lexical semantics. However, in this research
step we focus on the question of how much seman-
tic interpretation can be performed on the basis of
other features, in the absence of a lexical or on-
tological model. Just as syntax and morphology
and prosody alone already permit much recogni-
tion and disambiguation of semantic roles and ar-
gument structure to be done for sentence, the same
can be done for tables. At the same time, we be-
lieve future integration of lexical semantics will be
facilitated by the grammatical framework of our
model.
One way to think about this is that we wish to
Table 1: Example “Martian” table (see text).
Pbje Kwe Zxc Amc
Hoer 15 - 18 17 - 20 19 - 23

ble. As can be seen from the restored original
English version of the same example in Table 2,
the most likely interpretation was predicted even
without access to specific lexical knowledge. We
aim to show that a fairly useful baseline level of
semantic interpretation accuracy can already be
achieved, even with relatively little lexical and on-
tological knowledge.
We model these alternative hypotheses for the
interpretation of ambiguous tables as competing
parses. Just as with ordinary parsing and seman-
tic interpretation, the reader often builds multiple
competing interpretations of the same table.
Note that many previous models do not even
distinguish between the alternative possible inter-
pretations in the Martian example. Existing mod-
906
els such as Hurst (2000) and Yang (2002) inter-
pret tables with the same structural layout simply
by assigning them same data model, which stops
short of recognizing that it is necessary to rank
multiple competing interpretations that entail dif-
ferent sets of logical relations.
In contrast, our proposed model is capable of
producing multiple competing parses indicating
different semantic interpretations of tables having
the same structural layout, by selecting specific
data models for the table and its subregions.
2 Data Models for Specifying Semantic
Interpretations

and ambiguous (e.g., Yang (2002), Yang and Luk
(2002), Wang et al. (2000), Yoshida et al. (2001)).
A table is a logical view of a collection of inter-
related items usually presented as a row-column
structure such that the reader’s ability to access
and compare information can be enhanced, as also
noted by Wang (1996). From a database manage-
Table 2: Example from Table 1 in its original ver-
sion, with the English words restored.
Date Thu Fri Sat
Temp 15 - 18 17 - 20 19 - 23
RH 85 - 95% 70 - 90% 75 - 95%
Weather Cool Cool Cloudy
ment system perspective, each table can be con-
sidered as a (tiny) database. Like a program, the
reader accesses the data. As a result, we consider
that every table must correspond to a data model,
and this model determines how the reader extracts
information from the table.
Each data model has a schema which, as we
shall see below, may or may not surface (partially
or completely) as a subset of cells in the table that
describe attributes. Recognizing the data models
of a table correctly therefore also implies that both
attribute-value pairings and table structures have
been recognized.
At the top level, we categorize the data models
into three broad types:
• Flat model: A table is interpreted as a
database table in non-1NF normal relational

column as the schema (Date, Temp, RH, Weather)
and other columns as data records.
The flat model is closest to the 1-dimensional
table approach used by the majority of previous
models, but our approach designates the flat model
as a semantic representation, in contrast to the
previous models which see 1-dimensional tables
merely as a syntactic surface form (e.g., Yang
(2002), Yang and Luk (2002), Wang et al. (2000),
Yoshida et al. (2001)). While such previous mod-
els only recognize tables that are physically laid
out in this form, our approach clearly delineates an
explicit separation of syntax and semantics, which
provides greater flexibility allowing any table to be
interpreted as a flat model, regardless of its surface
form (though the flat model interpretation is more
common for some surface forms than others).
As an example showing that any kind of
table can be categorized as flat model, consider
Table 6. Even such a table can be semantically
interpreted as a flat model because related at-
tributes can join together to form a composite
attribute, though humans would less naturally
choose this semantic interpretation. Certainly
there are hierarchical relationship between
attributes; for example, Ass1 is a subtype of
Assignments. However, it is also valid to consider
the attributes along a hierarchical path as one
composite attribute. For example, “Mark -> As-
signments -> Ass1” becomes the single attribute

in previous models. A dimensional model refers
to a table, such as the table in Table 4, that resem-
bles a view of collection of data stored in multi-
dimensional data model. A multidimensional data
cube, as described in the database literature (e.g.,
Han and Kamber (2000), Chaudhuri and Dayal
(1997)), consists of a set of numeric measures
(though in fact the data need not be numeric), each
of which is determined by a set of dimensions.
Each dimension is described by a set of attributes.
For example, Table 5 can be semantically inter-
preted using the multidimensional data model de-
picted in Figure 1. Likewise, the cross-tabular ta-
ble in Table 4 can also be semantically interpreted
using the same multidimensional data model in
Figure 1. The value of the first three columns in
Table 5 are the dimension attributes and the rev-
enue values are the measures.
In contrast, among previous models, Yang
(2002) produces a semantically incorrect recogni-
tion of a multidimensional table that inappropri-
ately presents the attributes in hierarchical struc-
ture. Yang and Luk (2002) and Wang et al.
(2000) only recognize the simplest 2-dimensional
case and apparently cannot handle 3 or more di-
mensions. Yoshida et al. (2001) only handle 1-
dimensional cases.
A dimensional model is an inappropriate inter-
pretation for non-cross-tabular tables, such as Ta-
ble 2 and Table 3. A dimensional model is also not

845
1078
2001
2002
Figure 1: Multidimensional data cube corre-
sponding to Tables 4 and 5.
to belong to different dimensions because it is in-
correct to determine the score by both “Assign-
ments” and “Midterm”. Syntactically, the texts
in the last attribute row of Table 6 are all unique;
however, the last attribute row of the table in Ta-
ble 4 is a repeating sequence of (”Phone”, ”Com-
puter”). Therefore, to a non-English reader, an
English cross-tabular table which possess repeat-
ing sequences in the attribute rows is likely to be
semantically interpreted as a dimensional model,
while a cross-tabular table which does not have
this property is likely to be interpreted as a nested
Table 6: Example table of grades.
Mark
Assignments Examinations
Ass1 Ass2 Ass3 Midterm Final Grade
Winter 85 80 75 60 75 75
Spring 80 65 75 60 70 70
Fall 80 85 75 55 80 75
Winter 85 80 70 70 75 75
Spring 80 80 70 70 75 75
Fall 75 70 65 60 80 70
Year Team
1991

being simpler to implement, because it requires
only a single integrated parsing process to com-
plete the entire table analysis. This includes clas-
sifying the functions of each cell (as attribute or
value), pairing attributes and values, and identify-
ing the structure and the data model of a table. In
contrast, previous works require several stages to
complete the entire analysis, introducing complex
909
problems that are difficult to resolve, such as pre-
mature commitment to incorrect early-stage deci-
sions.
To our knowledge Wang et al. (2000) is the only
textual table analysis model that uses a grammar
to describe table structures. However, in that case,
only a simple template matching analyzer is used.
Their grammar notation is unable to show both
physical structure and the semantics of a table at
the same time in a hierarchical manner. In con-
trast, information such as “a data block contains
three rows of data cell” can be stored in the parse
tree constructed by our parsing model.
Outside of the table understanding literature,
there exists a different 2D parsing technique called
PLEX (Feder, 1971), (Costagliola et al., 1994)
which allows an object to have finite sets of attach-
ing points. PLEX is used to generate 2D diagrams
such as molecular structures, circuit diagrams and
flow charts in a grammatical way. However, we
consider it too complex and computationally ex-

parse tree structure, just as in the case of ordinary
1D grammars.
The operators “–>” and “|->” control the gen-
eration direction. In term of table analysis, a non-
terminal represents a matrix of tokens and a termi-
nal represents a single token. Sub-matrices gen-
erated by a horizontal rule will have same height
but not necessarily same width; similarly, sub-
matrices generated by a vertical rule will have
same width but not necessarily same height. In
other words, a matrix is partitioned into two halves
by the binary production rule.
Probabilities are placed on each rule, as in ordi-
nary 1D SCFGs. They are used to eliminate parses
falling below a threshold, which also helps to re-
duce the time complexity in practice.
Parsing with two-dimensional grammars can be
conceptualized most easily via parse tree exam-
ples. Figure 2 shows a complete parse tree for
parsing the table in Table 7 into a flat model. Fig-
ure 3 is a portion of a parse tree for parsing the
table in Table 8 into a nested model, while Figure
4 is a portion of parse trees for parsing Table 7 into
a dimensional model. The following is the gram-
mar fragment that gives the parse tree as Figure
2:
T1-1H |-> FlatModel
FlatModel |-> FlatSchema Records
FlatSchema > CompositeAttribute FlatSchema
FlatSchema > CompositeAttribute

VA
Composite
Attribute |->
Attribute
P
FlatSchema >
Composite
Attribute |->
Attribute |->
VA
Composite
Attribute |->
Attribute
C
FlatSchema >
Composite
Attribute |->
Attribute |->
VB
Composite
Attribute |->
Attribute |->
P
FlatSchema >
Composite
Attribute |->
Attribute |->
VB
Composite
Attribute |->

Base |->
VA
Final >
Attribute |->
P
Final >
Attribute |->
C
NestedAttribute |->
Base |->
VB
Final >
Attribute |->
X
Final >
Attribute |->
Y
1
Figure 3: A partial parse for a nested model.
DimensionalModelSchema |->
Dimension >
DimAttribute |->
DimAttribute |->
VA
Dimension >
Dim
Attribute |->
P
Dimension
>

11 12 13 14
21 22 23 24
1
Table 9: Example table showing a floor legend.
6 School of Business & Management
5 Department of Biochemistry
4 Classrooms 4202 - 4205
3 Department of Computer Science
3 Department of Mathematics
For the blind evaluation, a human annotator in-
dependently manually annotated a randomly cho-
sen sample of 45 tables from the collection. All ta-
bles in the evaluation sample were previously un-
seen test cases, never inspected prior to the con-
struction of the two-dimensional grammar.
Each tokenized table was tagged by the human
judge with a list of types T relevant to the table.
The relevance is defined as follows: a data model
is relevant to a table if and only if the human
would agree that such a data model would natu-
rally be hypothesized as an interpretation for that
table (analogously to the way that word senses are
manually annotated for WSD evaluations). Each
type is a tuple of the form (R, O, S), where R is
the relevant data model, O is the reading orienta-
tion of R, and S is a boolean saying if a schema
(i.e. attributes) exist in the table. Thus, Table 2
would be tagged as {(flat, vertical, true)} while
the table in Table 4 would be tagged as {(flat, hor-
izontal, true), (flat, vertical, true), (dimensional, ,

and less demanding) criteria of number of correct
attribute-value pairings. Such an evaluation ap-
proach gives unduly high weight to large repetitive
tables, and neglects structural errors in the analysis
of the table. In contrast, our approach gives equal
weight to all tables regardless of how many entries
they contain, requires semantically valid structural
analyses, and yet still accepts any parse that yields
the correct attribute-value pairings (since the tag-
ging of the test set includes all legitimate types
when there are multiple valid alternatives).
The fact that precision was lower than recall is
due to the fact that many tables were wrongly in-
terpreted as tables without schema or in wrong ori-
entations. The current grammar has difficulty dis-
tinguishing attributes from values. Significant im-
provement can be obtained by using constraints to
limit the number of incorrect parses, a strategy we
are currently implementing.
6 Conclusion
We have introduced a framework to support a
more linguistically-oriented approach to finer in-
terpretation of tables, using two-dimensional sto-
chastic CFGs with Viterbi parsing to find appro-
priate semantic interpretations of textual tables in
terms of different data models. This approach
yields a concise model that at the same time fa-
cilitates broader coverage than existing models,
and is more easily scalable and maintainable. We
also introduce a cleaner and richer data model to

World Wide Web Conference, Hawaii, USA, 2002.
K. Lari and S. J. Young. The estimation of stochastic context-free grammars
using the inside-outside algorithm. Computer Speech and Language, 4:35–
36, 1990.
Richard Power, Donia Scott, and Nadjet Bouayad-Agha. Document structure.
Computational Linguistics, 29(4):211–260, Dec 2003.
Donia Scott. Layout in NLP: The case for document structure (invited talk).
In 41st Annual Meeting of the Association for Computational Linguistics
(ACL-2003), Aug 2003.
Abraham Silberschatz, Henry F. Korth, and S. Sudarshan. Database System
Concepts. McGraw-Hill, 4th edition, 2002.
H. L. Wang, S. H. Wu, I. C. Wang, C. L. Sung, W. L. Hsu, and W. K. Shih.
Semantic search on internet tabular information extraction for answering
queries. In CIKM ’00: Proceedings of the ninth international conference
on Information and knowledge management, pages 243–249, New York,
NY, USA, 2000. ACM Press.
Xinxin Wang. Tabular Abstraction, Editing, and Formatting. PhD thesis, The
University of Waterloo, Waterloo, Ontario, Canada, 1996.
Yingchen Yang and Wo-Shun Luk. A framework for web table mining. In
WIDM ’02: Proceedings of the 4th international workshop on Web infor-
mation and data management, pages 36–42, New York, NY, USA, 2002.
ACM Press.
Yingchen Yang. Web table mining and database discovery. Master’s thesis,
Simon Fraser University, August 2002.
M. Yoshida, K. Torisawa, and J. Tsujii. A method to integrate tables of the
world wide web, 2001.
912


Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

Music ♫

Copyright: Tài liệu đại học © DMCA.com Protection Status