632
N. Karayannidis, T. Sellis, and Y. Kouvaras
Fig. 7.
Impact of cube dimensionality increase to the CUBE File size
We used synthetic data sets that were produced with an OLAP data generator that
we have developed. Our aim was to create data sets with a realistic number of
dimensions and hierarchy levels. In Table 1, we present the hierarchy configuration
for each dimension used in the experimental data sets. The shortest hierarchy consists
of 2 levels, while the longest consists of 10 levels. We tried each data set to consist of
a good mixture of hierarchy lengths. Table 2 shows the data set configuration for each
series of experiments. In order to evaluate the adaptation to sparse data spaces, we
created cubes that were very sparse. Therefore the number of input tuples was kept
from a small to a moderate level. To simulate the cube data distribution, for each cube
we created ten hyper-rectangular regions as data point containers. These regions are
defined randomly at the most detailed level of the cube and not by combination of
hierarchy values (although this would be more realistic), in order not to favor the
CUBE File particularly, due to the hierarchical chunking. We then filled each region
with data points uniformly spread and tried to maintain the same number of data
points in each region.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
CUBE File: A File Structure for Hierarchically Clustered OLAP Cubes
633
Fig. 8.
Size ratio between the UB-tree and the CUBE File for increasing dimensionality
Fig. 9.
Size scalability in the number of input tuples (i.e., stored data points)
4.2
Structure Experiments
Fig. 7 shows the size of the CUBE File as the dimensionality of the cube increases.
The vertical axe is in logarithmic scale. We see the cube data space size (i.e., the
product of the dimension grain-level cardinalities) “exploding” exponentially as the
since for 9 dimensions and 100,000 data points the data space has become extremely
sparse As we noted above, our implementation does not apply
any compression to the directory chunks. Therefore, it is reasonable that for such
extremely sparse data spaces the overhead from these chunks becomes significant,
since a single data point might trigger the allocation of all the cells in the parent
nodes. An implementation that would incorporate the compression of directory
chunks as well would eliminate this effect substantially.
Fig. 9 depicts the size of the CUBE File as the number of cube data points (i.e.,
input tuples) scales up, while the cube dimensionality remains constant (five
dimensions with a good mixture of hierarchy lengths – see Table 1). In the same
graph we show the corresponding size of the UB-tree/MHC and the size of the root-
bucket. The CUBE File maintains a lower storage cost for all tuple cardinalities.
Moreover, the UB-tree size increases in a faster rate making the difference of the two
larger as the number of tuples increases. The root-bucket size is substantially lower
than the CUBE File and demonstrates an almost constant behaviour. Note that in our
implementation we store the whole root-directory in the root-bucket and thus the
whole root-directory is kept in main memory during query evaluation. Thus the graph
also shows that the root-directory size becomes very fast negligible compared to the
CUBE File size as the number of data points increase. Indeed, for cubes containing
more than 1 million tuples the root-directory size is below 5% of the CUBE File size,
although the directory chunks are stored uncompressed in our current implementation.
Hence it is feasible to keep the whole root-directory in main memory.
4.3
Query Experiments
For the query experiments we ran a total of 5,234 HPP queries both on the CUBE File
and the UB-tree/MHC. These queries were classified in three classes: (a) 1,593 prefix
queries, (b) 1,806 prefix range queries and (c) 1,835 prefix multi-range queries. A
prefix query is one in which we access the data points by a specific chunk-id prefix.
For example the following prefix query is represented by the shown chunk
expression, which denotes the restriction on each hierarchy of a 3-dimensional cube
Root bucket number of buckets: 295
Fig. 11, shows the I/O ratio between the UB-tree and the CUBE File for all three
classes of queries for the hot-cache case. This ratio is calculated from the total number
of I/Os for all queries of the same maximum depth of restrictions for each data
structure. As increases, essentially the cube selectivity decreases (i.e., less data
points are returned in the result set). We see that the UB-tree performs more I/Os for
all depths and for all query classes. For small-depth restrictions where the selection
rectangles are very large the CUBE File performs 3 times less I/Os than the UB-tree.
Moreover, for more restrictive queries the CUBE file is multiple times better
achieving up to 37 times less I/Os. An explanation for this is that the smaller the
selection hyper-rectangle the greater becomes the percentage of UB-tree leaf-pages
containing very few (or even none) of the qualifying data points in the total number of
accessed pages. Thus more I/Os are required on the whole, in order to evaluate the
restriction, and for large-depth restrictions the UB-tree performs even worse, because
essentially it fails to cluster the data with respect to the more detailed hierarchy levels.
This behaviour was also observed in [7], where for queries with small cube
selectivities the UB-tree performance was worse and the hierarchical clustering effect
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
636
N. Karayannidis, T. Sellis, and Y. Kouvaras
reduced. We believe this is due to the way data are clustered into z-regions (i.e., disk
pages) along the z-curve [1]. In contrast, the hierarchical chunking applied in the
CUBE File, creates groups of data (i.e., chunks) that belong in the same “hierarchical
family” even for the most detailed levels. This, in combination with the chunk-to-
bucket allocation that guarantees that hierarchical families will be clustered together,
results in better hierarchical clustering of the cube even for the most detailed levels of
the hierarchies.
Fig. 10.
Size ratio between the UB-tree and the CUBE File for increasing tuple cardinality
Fig. 11.
structure. It explicitly supports dimension hierarchies, enabling fast access to cube
data via a directory of chunks formed exactly from the hierarchies. It clusters data
with respect to the dimension hierarchies resulting in reduced I/O cost for query
evaluation. It imposes a low storage overhead basically for two reasons: (a) it adapts
perfectly to the extensive sparseness of the cube, not allocating space for empty
regions, and (b) it does not need to store the dimension values along with the
measures of the cube, due to its location-based access mechanism of cells. These two
result in a significant compression of the data space. Moreover this compression can
increase even further, if compression of intermediate nodes is employed. Finally, it
achieves a high space utilization filling the buckets to capacity. We have verified the
aforementioned performance aspects of the CUBE File by running an extensive set of
experiments and we have also shown that the CUBE File outperforms UB-Tree/MHC,
the most effective method proposed up to now for hierarchically clustering the cube,
in terms of storage cost and number of disk I/Os. Furthermore, the CUBE File fits
perfectly to the processing framework for ad hoc OLAP queries over hierarchically
clustered fact tables (i.e., cubes) proposed in our previous work [7]. In addition, it
supports directly the effective hierarchical pre-grouping transformation [13, 19],
since it uses hierarchically encoded surrogate keys. Finally, it can be used as a
physical base for implementing a chunk-based caching scheme [3].
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
638
N. Karayannidis, T. Sellis, and Y. Kouvaras
Acknowledgements. We wish to thank Transaction Software GmbH for providing us
Transbase Hypercube to run the UB-tree/MHC experiments. This work has been
partially funded by the European Union’s Information Society Technologies
Programme (IST) under project EDITH (IST-1999-20722).
References
1.
2.
3.
1997.
R. Pieringer et al: Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping
in Star Join Processing. ICDE 2003.
F. Ramsak et al: Integrating the UB-Tree into a Database System Kernel. VLDB 2000.
S. Sarawagi: Indexing OLAP Data. Data Engineering Bulletin 20(1): 36-43 (1997).
Y. Sismanis, A. Deligiannakis, N. Roussopoulos, Y. Kotidis: Dwarf: shrinking the
PetaCube. SIGMOD 2002.
S. Sarawagi, M. Stonebraker: Efficient Organization of Large Multidimensional Arrays.
ICDE 1994
The Transbase Hypercube® relational database system ().
Aris Tsois, Timos Sellis: The Generalized Pre-Grouping Transformation: Aggregate-
Query Optimization in the Presence of Dependencies. VLDB 2003.
Roger Weber, Hans-Jörg Schek, Stephen Blott: A Quantitative Analysis and Performance
Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Efficient Schema-Based Revalidation of XML
Mukund Raghavachari
1
and Oded Shmueli
W
e
examine both the situation where an XML document is modified
befor
e
it is to be revalidated and the situation where it is unmodified.
1
Introduction
The ability to validate XML documents with respect to an XML Schema [21]
or DTD is central to XML’s emergence as a key technology for application
integration. As XML data flow between applications, the conformance of the
data to either a DTD or an XML schema provides applications with a guarantee
that a common vocabulary is used and that structural and integrity constraints
are met. In manipulating XML data, it is sometimes necessary to validate data
with respect to more than one schema. For example, as a schema evolves over
time, XML data known to conform to older versions of the schema may need to
be verified with respect to the new schema. An intra-company schema used by
a business might differ slightly from a standard, external schema and XML data
valid with respect to one may need to be checked for conformance to the other.
The validation of an XML document that conforms to one schema with re-
spect to another schema is analogous to the cast operator in programming lan-
guages. It is useful, at times, to access data of one type as if it were associated
with a different type. For example, XQuery [20] supports a validate operator
which converts a value of one type into an instance of another type. The type
safety of this conversion cannot always be guaranteed statically. At runtime,
E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 639–657, 2004.
© Springer-Verlag Berlin Heidelberg 2004
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
640
M. Raghavachari and O. Shmueli
Our solution to this problem has different characteristics, as will be described.
The scenario we consider is that a source schema A and a target schema B
are provided and may be preprocessed statically. At runtime, documents valid
according to schema A are verified with respect to schema B. In the modification
case, inserts, updates, and deletes are performed to a document before it is
verified with respect to B. Our approach takes advantage of similarities (and
differences) between the schemas A and B to avoid validating portions of a
document if possible. Consider the two XML Schema element declarations for
purchaseOrder shown in Figure 1. The only difference between the two is that
whereas the
billTo element is optional in the schema of Figure 1a, it is required
in the schema of Figure 1b. Not all XML documents valid with respect to the first
schema are valid with respect to the second — only those with a billTo element
would be valid. Given a document valid according to the schema of Figure 1a,
an ideal validator would only check the presence of a billTo element and ignore
the validation of the other components (they are guaranteed to be valid).
This paper focuses on the validation of XML documents with respect to
the structural constraints of XML schemas. We present algorithms for schema
cast validation with and without modifications that avoid traversing subtrees of
an XML document where possible. We also provide an optimal algorithm for
revalidating strings known to conform to a deterministic finite state automaton
according to another deterministic finite state automaton; this algorithm is used
to revalidate content model of elements. The fact that the content models of
XML Schema types are deterministic [6] can be used to show that our algorithm
for XML Schema cast validation is optimal as well. We describe our algorithms in
terms of an abstraction of XML Schemas, abstract XML schemas, which model
the structural constraints of XML schema. In our experiments, our algorithms
achieve 30-95% performance improvement over Xerces 2.4.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Efficient Schema-Based Revalidation of XML
ments (typed according to specialized DTDs). Their algorithm keeps data struc-
tures that encode validation computations with document tree nodes and utilizes
these structures to revalidate a document. Barbosa et al. [3] present an algorithm
that also encodes validation computations within tree nodes. They take advan-
tage of the 1-unambiguity of content models of DTDs and XML Schemas [6],
and structural properties of a restricted set of DTDs, to revalidate documents.
Our algorithm is designed for the case where schemas can be preprocessed, but
the documents to be revalidated are not available a priori to be preprocessed.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
642
M. Raghavachari and O. Shmueli
Example
s
include message brokers, programming language and query compilers,
etc
.
In these situations, techniques that preprocess the document and store state
informatio
n
at each node could incur unacceptable memory and computing over-
head
,
especially if the number of updates is small with respect to the document,
o
r
the size of the document is large. Moreover, our algorithm handles the case
wher
e
the document must be revalidated with respect to a different schema.
Kan
d
with types, then by applying the subsumption mapping to these type
annotations
,
one obtains an annotation for the subsuming schema. Our solution
i
s
more general in that we do not require either schema to be subsumed by the
other
,
but do handle the case where this occurs. Furthermore, we do not require
typ
e
annotations on nodes. Finally, we consider the notion of disjoint types in
additio
n
to subsumption in the revalidation of documents.
On
e
approach to handling XML and XML Schema has been to express them
i
n
terms of formal models such as tree automata. For example, Lee et al. describe
ho
w
XML Schema may be represented in terms of deterministic tree grammars
with one lookahead [15]. The formalism for XML Schema and the algorithms in
thes
e
paper are a more direct solution to the problem, which obviates some of
polynomia
l
rather than exponential in complexity.
3
XML Schema and DTD Conformance
I
n
this section, we present the algorithm for revalidation of documents accord-
in
g
to XML Schemas. We first define our abstractions for XML documents,
ordered
labeled trees, and for XML Schema, abstract XML Schema. Abstract
XM
L
Schema captures precisely the structural constraints of XML Schema.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Efficient Schema-Based Revalidation of XML
643
Ordered Labeled Trees.
We
abstract
XML
documents
as
ordered
labeled
trees,
where an ordered labeled tree over a finite alphabet is a pair where
is an ordered tree consisting of a finite set of nodes, N, and a set of
the XML Schema constraint that if two children of an element have the
same label, they must be assigned the same type.
is a partial function which states which element labels can occur
as the root element of a valid tree according to the schema, and the type
this root element is assigned.
Consider the XML Schema fragment of Figure 1a. The function maps global
element declarations to their appropriate types, that is,
Table 1 shows the type declaration for POType1 in our formalism.
Abstract XML Schemas do not explicitly represent atomic types, such as
xsd:integer.
For simplicity of exposition, we have assumed that all XML Schema
atomic and simple types are represented by a single simple type. Handling atomic
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
644
M. Raghavachari and O. Shmueli
and simple types, restrictions on these types and relationships between the values
denoted by these types is a straightforward extension. We do not address the
identity constraints (such as key and keyref constraints) of XML Schema in
this paper. This is an area of future work. Other features of XML Schema such
as substitution groups, subtyping, and namespaces can be integrated into our
model. A discussion of these issues is beyond the scope of the paper.
We define the validity of an ordered, labeled tree with respect to an abstract
XML Schema as follows:
Definition 1. The set of ordered labeled trees that are valid with respect to a
type is defined in terms of the least solution to a set of equations, one for each
of the form ( are nodes):
An ordered labeled tree,
T
, is valid with respect to a schema
if is defined and If is a complex type, and
body
of the foreach loop will not be executed.
A DTD can be viewed as an abstract XML Schema, where
each is assigned a unique type irrespective of the context in which it
is used. In other words, for all there exists such that for all
is either not defined or If
is defined, then as well.
3.1
Algorithm Overview
Given two abstract XML Schemas, and and
an ordered labeled tree, T, that is valid according to S, our algorithm validates
T with respect to S and in parallel. Suppose that during the validation of T
with respect to we wish to validate a subtree of T, with respect to a type
Let be the type assigned to during the validation of T with respect to
S
. If one can assert that every ordered labeled tree that is valid according to
is also valid according to then one can immediately deduce the validity of
according to Conversely, if no ordered labeled tree that is valid according
to is also valid according to then one can stop the validation immediately
since will not be valid according to
We use subsumed type and disjoint type relationships to avoid traversals of
subtrees of T where possible:
Definition
2. A
type
is
subsumed
by a
type
denoted
is defined,
As mentioned before, for exposition reasons, we have chosen to merge all simple
types into one common simple type. It is straightforward to extend the definition
above so that the various XML Schema atomic and simple types, and their
derivations are used to bootstrap the definition of the subsumption relationship.
Also, observe that is a finite relation since there are finitely many types.
The following theorem states that the relation captures precisely the
notion of subsumption defined earlier:
Theorem
1. if and
only
if
We now present an algorithm for computing the relation. The algorithm
starts with a subset of and refines it successively until is obtained.
1.
2.
3.
4.
Let such that implies that both and are
simple types, or both of them are complex types.
For if remove from
For each if there exists and and
remove from
Repeat Step 3 until no more tuples can be removed from the relation
i.
ii.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Efficient Schema-Based Revalidation of XML
647
Computing the
hand, if the document can be determined to be invalid with respect to
immediately. Pseudocode for incremental validation of the document is provided
below. Again, constructstring is a utility method (not shown) that creates a
string from the labels of the root nodes of a sequence of trees (returning if the
sequence is empty). We can efficiently verify the content model of with respect
to by using techniques for finite automata schema cast validation, as
will be described in the Section 4.
i.
ii.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
648
M. Raghavachari and O. Shmueli
3.3
Schema Cast Validation with Modifications
Given an ordered, labeled tree, T, that is valid with respect to an abstract XML
Schema S, and a sequence of insertions and deletions of nodes, and modifications
of element tags, we discuss how the tree may be validated efficiently with respect
to a new abstract XML Schema The updates permitted are the following:
1.
2.
3.
Modify the label of a specified node with a new label.
Insert a new leaf node before, or after, or as the first child of a node.
Delete a specified leaf node.
Given a sequence of updates, we perform the updates on T, and at each step,
we encode the modifications on T to obtain by extending with special
element tags of the form where A node in with label
represents the modification of the element tag in T with the element tag
in Similarly, a node in with label represents a newly inserted node
with tag and a label denotes a node deleted from T. Nodes that have not
If modified is false, we can run the algorithm described in the previous
subsection on this subtree. Since the subtree is unchanged and we know
that when checked with respect to S, we can treat the vali-
dation of as an instance of the schema cast validation problem (without
modifications) described in Section 3.2.
Otherwise, if we do not need to validate the subtree with
respect to any since that subtree has been deleted.
Otherwise, if since the label denotes that is a newly in-
serted subtree, we have no knowledge of its validity with respect to any
other schema. Therefore, we must validate the whole subtree explicitly.
Otherwise, if or
since elements may have been added or deleted from the original content
model of the node, we must ensure that the content of is valid with
respect to If is a simple type, the content model must satisfy (1) of
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Efficient Schema-Based Revalidation of XML
649
Definition 1. Otherwise, if one must check that fit
into the content model of as specified by In verifying the content
model, we check whether where
is defined as:
is defined analogously. If the content model check succeeds, and
is also a complex type, then we continue recursively validating
with respect to from S and from
(note that if is we do not have to validate that since it
has been deleted in If is not a complex type, we must validate each
explicitly.
3.4
DTDs
Since the type of an element in an XML Schema may depend on the context in
is defined. We use where to denote that
maps
to
For string and state denotes the state
reached by operating on one symbol at a time. A string
is accepted by a finite
state automaton if is rejected by the automaton if is
not accepted by it.
The language accepted (or recognized) by a finite automaton
denoted
is the set of strings accepted by We also define as
Note that for a finite state automaton if a string
is in and then is in
We shall drop the subscript from when the automaton is clear from the
context.
A state is a dead state if either:
In other words, either the state is not reachable from the start state or no
final state is reachable from it. We can identify all dead states in a finite state
automaton in time linear in the size of the automaton via a simple graph search.
Intersection Automata. Given two automata, and
one can derive an intersection automaton such that
accepts exactly the language The intersection automaton evaluates
a string on both and in parallel and accepts only if both would. Formally,
where and
where Note that if and are
deterministic, is deterministic as well.
Immediate Decision Automata. We introduce immediate decision automata
as modified finite state automata that accept or reject strings as early as pos-
sible. Immediate decision automata can accept or reject a string when certain
conditions are met, without scanning the entire string. Formally, an immediate
in Without loss of generality, we assume that
Our method for the efficient validation of a string in
with respect to relies on evaluating on and in parallel. Assume that after
parsing a prefix of we are in a state in and a state
in Then, we can:
1.
2.
Accept immediately if because is guaranteed
to be in (since accepts ), which implies that will be in
By definition of will accept
Reject immediately if Then, is guaranteed
not to be in and therefore, will not accept
We construct an immediate decision automaton, from the intersection
automaton of and with and based on the two conditions above:
Definition 7. Let be the intersection automaton
derived
from two finite state automata and The derived immediate decision automa-
ton is where:
Theorem 3. For all accepts if and only if
The determination of the members of can be done efficiently for deter-
ministic finite state automata. The following proposition is useful to this end.
Proposition 1. For any state, if and only if
there exist states and such that and if _
then
We now present an alternative, equivalent definition of
Definition 8. For all if there exist states
such that and if then
is a dead state }.
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.