6
THE FRACTAL STRUCTURE OF DATA REFERENCE
3. MODEL DEFINITION
Eventually, this book will present abundant statistical summaries of data
reference patterns. As a starting point, however, let us begin with a single,
observed pattern of access to a single item of data. Figure 1.1 presents one
such pattern, among hundreds of thousands, observed in a large, production
database environment running under
OS/390. In Figure 1.1, the horizontal axis
is a time line, upon which most of the requests are marked. When a group of
requests are too closely spaced to distinguish along the time line, the second
and subsequent requests are displaced vertically to give a “zoom” of each one’s
interarrival time with respect to the request before it.
A careful examination of Figure 1.1 makes it clear that the arrivals are driven
by processes operating at several distinct time scales. For example, episodes
occur repeatedly in which the interarrival time is a matter of a few milliseconds;
such “bursts” are separated, in turn, by interarrival times of many seconds or
tens of seconds. Finally, the entire sequence is widely separated from any other
reference to the data.
If we now examine the structure of database software, in an effort to account
for data reuse at a variety of time scales, we find that we need not look far. For
example, data reuse may occur due to repeated requests in the same subroutine,
different routines called to process the same transaction, or multiple transac
-
tions needed to carry out some overall task at the user level. The explicitly
hierarchical structure of most software provides a simple and compelling ex
-
Figure 1.1. Pattern of requests to an individual track. The vertical axis acts as a “zoom”, to
separate groups of references that are too closely spaced to distinguish along a single time line.
Hierarchical Reuse Model 7
planation for the apparent presence of multiple time scales in reference patterns
references.
Clearly, a hypothesis of this form must be constructed with some lower
limit on the time scale δ
0
; otherwise, we are in danger of dividing by zero. A
lower limit of this kind ends up applying to most self
-
similar models of real
phenomena [12]. For the applications pursued in this book, the lower limit
appears to be much less than any of the time scales of interest (some fraction
of one second). Thus, we will not bother trying to quantify the lower limit
but simply note that there is one. In the remainder of the book, we shall avoid
continually repeating the caveat that a lower limit exists to the applicable time
scale; instead, the reader should take it for granted that this caveat applies.
By good fortune, the type of statistical self
-
similarity that is based upon an
invariant distribution of the form (1.2) is well understood. Indeed, Mandelbrot
has shown that a random variable U which satisfies the conditions stated in
the hierarchical reuse hypothesis must belong to the heavy
-
tailed, also called
hyperbolic, family of distributions. This means that the asymptotic behavior
8
of U must tend toward that of a power law:
THE FRACTAL STRUCTURE OF DATA REFERENCE
(1 .3)
where a > 0 and 8 > 0 are constants that depend upon the specific random
variable being examined. Distributions having this form, first studied by Italian
economist/sociologist Vilfredo Pareto (1848- 1923) and French mathematician
constant. We shall call this quantity the single
-
reference residency time, or τ
(recalling the earlier discussion about time scales, τ ≥ τ
min
> 0, where τ
min
is some fraction of one second). It then follows that a request to a given track
can be serviced out of cache memory if and only if the time since the previous
reference to the track is no longer than τ. By applying this criterion of time
-
in
-
cache to distinguish hits and misses, any statement about the distribution of
interarrival times must also be a statement about miss ratios. In particular, (1.3)
is mirrored by the corresponding result:
(1.4)
Actually, we can conclude even more than this, by considering how subsets
of the stored data must share the use of the cache. The situation is analogous
to that of a crowded road from one town to another, with one lane for each
Hierarchical Reuse Model 9
direction of traffic. Just as it must take all types of cars about the same amount
of time to complete the journey between towns, it must take tracks containing
all types of data about the same amount of time to get from the top to the bottom
ofthe
LRU list. Thus, ifwe wishto apply the hierarchical reuse model to some
identified application specifically (say, application i), we may write
(1.5)
where m
i
VM operating system makes an I/O request,
the system intercepts the request and passes it to disk storage. This scheme
reflects
VM’
s
design philosophy, in which VM is intended to provide a layer
of services upon which other operating systems can run as guests. With the
exception of an optional
VM facility called minidisk cache, not used in the
environments presented by Figure 1.2, a
VM host system does not retain the
results of previous
I/O requests for potential use in servicing future I/O. This
makes data collected on
VM systems (other than those which use minidisk
cache) particularly useful as a test of (1.3), since there is no processor cache to
complicate the interpretation of the results. The more complex results obtained
in
OS/390 environments, where large file buffer areas have been set aside in
processor memory, are considered in Subsection 5.1.
Figure 1.2 presents the distribution of interarrival times for the user and
system data pools at each surveyed installation. Note that the plot is presented
in log
-
log format (and also that the “up” direction corresponds to improving
miss ratio values). If (1.3) were exact rather than approximate, then this
presentation of the data should result in a variety of straight lines; the slope
10
of each line (in the chart’s “up” direction) would be the value of θ for the
corresponding data pool.