282
O. Shmueli, G. Mihaila, and S. Padmanabhan
(Step-Predicate Consistency) If maps a Predicate to node in and
maps the first step of the RLP of the PredExp of the Predicate to node in
then is an edge in
(Comparison Mapping) maps the ‘Op Value’ part of a predicate
‘self::node() Op Value’ to the node it maps the self:: step and the condi-
tion ‘Op Value’ is satisfied by this node.
5.
6.
Similarly to an expression mapping, a matching map is also a function from
expression components into tree nodes, this time nodes of T rather than A
matching map satisfies the same six conditions listed above
3
. As we shall see, the
existence of an expression mapping function from to implies the existence
of a matching map from to T.
Let T be a tagged tree, an image data tree of T and a normalized
expression. Suppose there exists an element occurrence elem, in computed
by on as evidenced by an expression mapping from to We associate
with a matching map such that if maps a component of to a node
in then maps to the node in T such that is an image of Because
of the way images of tagged trees are defined, if maps to and to
and is an edge in then is an edge in T. Furthermore,
the node tests that hold true in must be “true” in T, that is Node Names
are the same and if text() holds on the node then the corresponding T node
“manufactures” text. We also call a matching of and T.
Observe that there may be a number of distinct matching maps from the
same normalized expression to T. Perhaps the simplest example is a tagged
tree T having a root and two children tagged with ‘A’. Let be Root/child::A,
we can match the step child::A with either of the two A children of the root
returned formulas, the ones corresponding to Predicate* sequences on the major
steps of are attached to tree nodes (in lines 20 and 25), the others are sim-
ply returned to the caller with no side effect. The information about whether
the algorithm is currently processing a major step is encoded in the Boolean
parameter isMajor. Let be the original formula labeling
in T; in
case of a data annotation, the formula is (and True if
no formula originally labels in T). Initially, the algorithm is invoked as follows:
At this point let us review the effect of ASSOCIATEF. Consider a matching
map for normalized expression Let be the major
steps in the major path of i.e., those steps not embedded within any [...]. Let
be the respective nodes of T to which maps Algo-
rithm ASSOCIATEF attaches formulas that will specify the generation
of image data for node only if all predicates applied at are guaranteed
to hold. In other words, image data generation for nodes along the major path
of is filtered. Note, that this filtering holds only for this matching. We shall
therefore need to carefully superimpose rewritings so as not to undo the filtering
effect.
Example 2. Let T be the tagged tree in Figure 4 and expression
In this case, there
is only one possible matching of to T. The call
will recursively call ASSOCIATEF on the node polist, then po, and then on the
predicate at node po. The call on the predicate expression will return the formula
4
which will be conjoined with
the original binding annotation formula Orders at node po, thus effectively
limiting the data tree generation to purchase orders of company “ABC”.
4.2
Superimposing Rewritings
Algorithm ASSOCIATEF handles one matching of an expression Let
crosstalk. Essentially, it ensures that each disjunct in will
be true only for bindings that are relevant to For example, in subsection 1.1
this solution was used in obtaining the final superimposed tree.
We now define the concept of a qual-tagged trees. Consider a variable, say
bound by a formula Define to be from which all atoms of the form
are eliminated. For example, suppose
Then Consider a node in tagged tree T
with ancestor variables If we replace with
the meaning of the tagged tree remains unchanged. This is
because, intuitively, the restrictions added are guaranteed to hold at that point.
A tree in which this replacement is applied to all nodes is called a qual-tagged tree.
5
Tagged Tree Optimizations
5.1
Node Marking and Mapping Tree Pruning
Consider a tagged tree T and a normalized expression Each node in T which
is in the range of some matching map
is marked as visited. Each such node
that is the last one to be visited along the major path of the corresponding path
expression and all its descendants are marked as end nodes. Nodes that are
not marked at all are useless for No image data generated for such nodes will
ever be explored by in an image data tree If for all client queries ec, for
all normalized expressions for ec, node is useless for then node may be
deleted from T.
One may wonder whether we can further prune the tagged tree; consider the
following example.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
286
O. Shmueli, G. Mihaila, and S. Padmanabhan
Example 3. Let T be a chain-like tagged tree, with a root, a root child A tagged
Push up steps may be performed in various parts of the tree In fact, if
we proceed bottom-up, handling a dependent node only once all its dependent
children nodes (if any) were handled, the result will be a tagged tree in which
each generated image data, for a dependent node, can potentially be used by the
expression mapping implied by (on the resulting data tree). In other words,
image data that certainly cannot lead to generation of image data that may be
used by the expression mapping are not produced.
The benefit of pushing lies in the further filtering of image data node genera-
tion. The cost of extensive filtering is that (a) the resulting formulas are complex
(they need eventually be expressed in SQL), and (b) some computations are re-
peated (that is, a condition is ensured, perhaps as part of a disjunction, and
then rechecked for a filtering effect further down the tree). Therefore, there is
a tradeoff; less “useless” image data is generated, but the computation is more
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Query-Customized Rewriting and Deployment of DB-to-XML Mappings
287
extensive. Further work is needed to quantify this tradeoff so as to decide when
is it profitable to apply push up steps.
We have explained formula push up as applied to a tagged tree that reflects
a single matching. Recall that such trees, for various queries and matching map-
pings, may be superimposed. The superimposition result reflects the push up
steps of the various matching mappings as well. An alternative approach is to
first superimpose and then do push up steps. For example, view a node as es-
sential if it is essential for any of the relevant mappings. Clearly, the resulting
filtering will be less strict (analogous to the tuple “cross-talk” phenomenon”).
5.3
Minimum Data Generation
For a tree T let denote the number of nodes in T. For tree denotes
that may be obtained from T by deleting some sub-trees of T.
Given a data tree and a query Q, we would like to compute a minimum
In the first experiment, we examine the effect of tagged tree pruning for
XPath queries without predicates. Thus, we simulated a client workloads
(we use the
abbreviated XPath syntax for readability). For each XPath expression in this
workload, the ASSOCIATEF algorithm found a single matching in the original
tagged tree and, after eliminating all the unmarked nodes, it resulted in a tagged
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
288
O. Shmueli, G. Mihaila, and S. Padmanabhan
Fig
.
5. Data Tree Reductions (no predicates)
Fig
.
6
.
Data Tree Reduction (with predicates)
Fig
.
7. Data Tree Reduction (no orderlines)
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Query-Customized Rewriting and Deployment of DB-to-XML Mappings
289
Fig
.
8. Data Tree Reduction (predicates and structure)
In the third experiment (see Figure 7), we simulated a client that wants
to mine order data, but does not need the actual details of the ordered
items
In this case, the
imal. Our algorithms can identify such cases and advise the data administrator
that full mapping materialization is preferable.
7
Conclusions and Future Work
We considered the problem of rewriting an XML mapping, defined by the tagged
tree mechanism, into one or more modified mappings. We have laid a firm foun-
dation for rewriting based on a set of client queries, and suggested various tech-
niques for optimization - at the tagged tree level as well as at the data tree level.
We confirmed the usefulness of our approach with realistic experimentation.
The main application we consider is XML data deployment to clients requir-
ing various portions of the mapping-defined data. Based on the queries in which
a client is interested, we rewrite the mapping to generate image data that is
relevant for the client. This image data can then be shipped to the client that
may apply to the data an ordinary Xpath processor. The main benefit is in re-
ducing the amount of shipped data and potentially of query processing time at
the client. In all cases, we ship sufficient amount of data to correctly support
the queries at the client. We also considered various techniques to further limit
the amount of shipped data. We have conducted experiments to validate the
usefulness of our shipped data reduction ideas. The experiments confirm that in
reasonable applications, data reduction is indeed significant (60-90%)
The following topics are worthy of further investigation: developing a formula-
label aware Xpath processor; quantifying a cost model for tagged trees, and rules
for choosing transformations, for general query processing; and extending rewrit-
ing techniques to a full-fledged query language (for example, a useful fragment
of Xquery).
References
M. J. Carey, J. Kiernan, J. Shanmugasundaram, E. J. Shekita, and S. N.
Subramanian. XPERANTO: Middleware for publishing object-relational
data as XML documents. In VLDB 2000, pages 646–648. Morgan Kauf-
mann, 2000.
www.tpc.org/tpcw.
[XP] XPath. www.w3.org/TR/xpath.
[XQ] XQuery. www.w3c.org/XML/Query.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
LexEQUAL: Supporting Multiscript Matching
i
n
Database Systems*
A
.
Kumaran and Jayant R. Haritsa
Departmen
t
of Computer Science and Automation
India
n
Institute of Science, Bangalore 560012, INDIA
{kumaran
,
haritsa}@csa.iisc.ernet.i
n
Abstract. To effectively support today’s global economy, database systems need
to store and manipulate text data in multiple languages simultaneously. Current
database systems do support the storage and management of multilingual data,
but are not capable of querying or matching text data across different scripts. As
a first step towards addressing this lacuna, we propose here a new query oper-
ator called LexEQUAL, which supports multiscript matching of proper names.
The operator is implemented by first transforming matches in multiscript text
space into matches in the equivalent phoneme space, and then using standard
approximate matching techniques to compare these phoneme strings. The algo-
Springer-Verlag Berlin Heidelberg 2004
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
LexEQUAL: Supporting Multiscript Matching in Database Systems
293
Fig. 1. Multilingual Books.com
Fig. 2. SQL: 1999 Multiscript Query
corpora is generic or proper names [16]. To illustrate the need for the LexEQUAL
operator, consider Books.com, a hypothetical e-Business that sells books in different
languages, with a sample product catalog as shown in Figure 1
1
.
In this environment, an SQL: 1999 compliant query to retrieve all works of an author
(say,
Nehru
), across multiple languages (say, in
English
,
Hindi
,
Tamil
and
Greek
)
would have to be written as shown in Figure 2. This query suffers from a variety of
limitations: Firstly, the user has to specify the search string
Nehru
in all the languages in
which she is interested. This not only entails the user to have access to lexical resources,
such as fonts and multilingual editors, in each of these languages to input the query, but
also requires the user to be proficient enough in all these languages, to provide all close
proach has its roots in the classical Soundex algorithm [11], and has been previously used
successfully in monolingual environments by the information retrieval community [28].
The transformation of a text string to its equivalent phonemic string representation can
be obtained using common linguistic resources and can be represented in the canoni-
cal IPA format [9]. Since the phoneme sets of two languages are seldom identical, the
compariso
n
of phonemic strings is inherently fuzzy, unlike the standard uniscript lexico-
graphic comparisons, making it only possible to produce a likely, but not perfect, set of
answers with respect to the user’s intentions. For example, the record with English name
Nero
in Figure 1, could appear in the output of the query shown in Figure 3, based on
threshold value setting. Also, in theory, the answer set may not include all the answers
that would be output by the equivalent (if feasible) SQL: 1999 query. However, we expect
that this limitation would be virtually eliminated in practice by employing high-quality
Text-to-Phoneme converters.
Our phoneme space matches are implemented using standard approximate string
matching techniques. We have evaluated the matching performance of the
LexEQUAL
operator on a real multiscript dataset and our experiments demonstrate that it is possible to
simultaneously achieve good recall and precision by appropriate algorithmic parameter
settings. Specifically, a recall of over 95 percent and precision of over 85 percent were
obtained for this dataset.
Apart from output quality, an equally important issue is the run-time of the
LexE-
QUAL operator. To assess this quantitatively, we evaluated our first implementation of
the LexEQUAL operator as a User-Defined Function (UDF) on a commercial database
system. This straightforward implementation turned out to be extremely slow – however,
we were able to largely address this inefficiency by utilizing one of Q-Gram filters [6]
or Phoneme Indexing [27] techniques that inexpensively weed out a large number of
etc.
)
which are assumed not to have any semantic value to the user, other than their
vocalization. That is, we assume that when a name is queried for, the primary intention
of the user is in retrieving all names that match aurally, in the specified target languages.
Though restricted to proper names, multiscript matching gains importance in light of the
fac
t
that a fifth of normal text corpora is generic or proper names [16].
A sample multiscript selection query was shown earlier in Figure 3. The LexEQUAL
operator may also be used for equi-join on multiscript attributes, as shown in the query
in Figure 5, where all authors who have published in multiple languages are retrieved.
The multiscript matching we have outlined here is applicable to many user domains,
especially with regard to e-Commerce and e-Governance applications, web search en-
gines, digital libraries and multilingual data warehouses. A real-life e-Governance ap-
plication that requires a join based on the phonetic equivalence of multiscript data is
outlined in [12].
2.1
Linguistic Issues
We hasten to add that multiscript matching of proper names is, not surprisingly given the
diversity of natural languages, fraught with a variety of linguistic pitfalls, accentuated
by the attribute level processing in the database context. While simple lexicographic and
accent variations may be handled easily as described in [14], issues such as language-
dependent vocalizations and context-dependent vocalizations, discussed below, appear
harder to resolve – we hope to address these issues in our future work.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
296
A. Kumaran and J.R. Haritsa
Language-dependent Vocalizations.
A
of collation sequences (to correctly sort and index the text data) for individual languages,
but comparison across collations is binary.
To the best of our knowledge, none of the commercial and open-source database
systems currently support multiscript matching. Further, with regard to the specialized
techniques proposed for the
LexEQUAL operator, their support is as follows:
Approximate Matching. Approximate matching is not supported by any of the com-
mercial or open-source databases. However, all commercial database systems allow
User-defined Functions (UDF) that may be used to add new functionality to the
server. A major drawback with such addition is that UDF-based queries are not
easily amenable to optimization by the query optimizer.
Phonetic Matching. Most database systems allow matching text strings using pseudo-
phonetic Soundex algorithm originally defined in [11], primarily for Latin-based
scripts.
In summary, while current databases are effective and efficient for monolingual
data (that is, within a collation sequence), they do not currently support processing
multilingual strings across languages in any unified manner.
2.3
Related Research
To our knowledge, the problem of matching multiscript strings has not been addressed
previously in the database research literature. Our use of a phonetic matching scheme for
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
LexEQUAL: Supporting Multiscript Matching in Database Systems
297
multiscript strings is inspired by the successful use of this technique in the mono-script
context by the information retrieval and pharmaceutical communities. Specifically, in
[23] and [28], the authors present their experience in phonetic matching of uniscript
text strings, and provide measures on correctness of matches with a suite of techniques.
Phonemic searches have also been employed in pharmaceutical systems such as [15],
where the goal is to find look-alike sound-alike (LASA) drug names.
In this section, we first present the strategy that we propose for matching multilingual
strings, and then detail our multiscript matching algorithm.
3.1
Multiscript Matching Strategy
Our view of ontology of text data storage in database systems is shown in Figure 6.
The semantics of what gets stored is outlined in the top part of the figure, and how the
information gets stored in the database systems is provided by the bottom part of the
figure. The important point to note is that a proper name, which is being stored currently
as a character string (shown by the dashed line) may be complemented with a phoneme
string (shown by the dotted line), that can be derived on demand, using standard linguistic
resources, such as Text-To-Phoneme (TTP) converters.
As mentioned earlier, we assume that when a name is queried for, the primary inten-
tion of the user is in retrieving all names that match aurally, irrespective of the language.
Our strategy is to capture this intention by matching the equivalent phonemic strings of
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
298
A. Kumaran and J.R. Haritsa
Fig
.
6. Ontology for Text Data
Fig
.
7. Architecture
the multilingual strings. Such phoneme strings represent a normalized form of proper
names across languages, thus providing a means of comparison. Further, when the text
data is stored in multiple scripts, this may be the only means for comparing them. We
propose complementing and enhancing the standard lexicographic equality operator of
database systems with a matching operator that may be used for approximate matching
of the equivalent phonemic strings. Approximate matching is needed due to the inherent
fuzzy nature of the representation and due to the fact that phoneme sets of different
function takes a multilingual string in a given language and returns
its phonetic representation in IPA alphabet. Such transformation may be easily imple-
Fig
.
8. The LexEQUAL Algorithm
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
300
A. Kumaran and J.R. Haritsa
mented by integrating standard TTP systems that are capable of producing phonetically
equivalent strings.
The
editdistance
function
[7]
takes
two
strings
and
returns
the
edit distance between them. A dynamic programming algorithm is implemented for this
computation, due to, as mentioned earlier, the flexibility that it offers in experimenting
with different cost functions.
Match Threshold Parameter. A user-settable parameter, Match Threshold, as a fraction
between
0
and
1
, is an additional input for the phonetic matching. This parameter spec-
ifies the user tolerance for approximate matching:
4.1
Data Set
With regard to the datasets to be used in our experiments, we had two choices: experiment
with multilingual lexicons and verify the match quality by manual relevance judgement,
or alternatively, experiment with tagged multilingual lexicons (that is, those in which
the expected matches are marked beforehand) and verify the quality mechanically. We
chose to take the second approach, but because no tagged lexicons of multiscript names
were readily available
2
, we created our own lexicon from existing monolingual ones, as
described below.
2
Bi-lingual dictionaries mark semantically, and
not
phonetically, similar words.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
LexEQUAL: Supporting Multiscript Matching in Database Systems
301
Fig. 9. Phonemic Representation of Test Data
We selected proper names from three different sources so as to cover common names
in English and Indic domains. The first set consists of randomly picked names from the
Bangalore Telephone Directory, covering most frequently used Indian names. The sec-
ond set consists of randomly picked names from the San Francisco Physicians Directory,
covering most common American first and last names. The third set consisting of generic
names representing Places, Objects and Chemicals, was picked from the Oxford English
Dictionary. Together the set yielded about 800 names in English, covering three distinct
name domains. Each of the names was hand converted to two Indic scripts – Tamil and
Hindi. As the Indic languages are phonetic in nature, conversion is fairly straight forward,
barring variations due to the mismatch of phoneme sets between English and the Indic
languages. All phonetically equivalent names (but in different scripts) were manually