ALGORITHMS phần 6 - Pdf 21

21. Parsing
Several fundamental algorithms have been developed to recognize legal
computer programs and to
decomI:ose
their structure into a form suitable
for further processing. This operation, called parsing, has application beyond
computer science, since it is directly related to the study of the structure
of language in general. For example, parsing plays an important role in sys-
tems which try to “understand” natural (human) languages and in systems
for translating from one language to another. One particular case of inter-
est is translating from a “high-level” co.nputer language like Pascal (suitable
for human use) to a “low-level” assembly or machine language (suitable for
machine execution). A program for doing such a translation is called a com-
piler.
Two general approaches are used for parsing. Top-down methods look
for a legal program by first looking for parts of a legal program, then looking
for parts of parts, etc. until the pieces are small enough to match the input
directly. Bottom-up methods put pieces of the input together in a structured
way making bigger and bigger pieces until a legal program is constructed.
In general, top-down methods are recursive, bottom-up methods are iterative;
top-down methods are thought to be
easier
to implement, bottom-up methods
are thought to be more efficient.
A full treatment of the issues involved in parser and compiler construction
would clearly be beyond the scope of
thi>,
book. However, by building a simple
“compiler” to complete the pattern-mats:hing algorithm of the previous chap-
ter, we will be able to consider some of’ the fundamental concepts involved.
First we’ll construct a top-down parser for a simple language for describing

This grammar describes regular expressions like those that we used in the last
chapter, such as (l+Ol)*(O+l) or (A*B+AC)D. Each line in the grammar is
called a production or replacement rule. The productions consist of terminal
symbols
(,
), + and
*
which are the symbols used in the language being
described
(‘91,”
a special symbol, stands for any letter or digit); nonterminal
symbols (expression), (term), and (factor) which are internal to the grammar;
and
metasymbols I:=
and ( which are used to describe the meaning of the
productions. The ::= symbol, which may be read
2s
a,” defines the left-hand
side of the production in terms of the right-hand side; and the 1 symbol, which
may be read as
“or”
indicates alternative choices. The various productions,
though expressed in this concise symbolic notation, correspond in a simple
way to an intuitive description of the grammar. For example, the second
production in the example grammar might be read “a (term) is a (factor)
or a (factor) followed by a (term).” One nonterminal symbol, in this case
(expreswon),
is distinguished in the sense that a string of terminal symbols is
in the language described by the grammar if and only if there is some way to
use the productions to derive that string from the distinguished nonterminal

pressions apply directly to the complex job of compiling and executing Pascal
272
CHAPTER 21
programs. For example, the following grammar describes a very small subset
of Pascal, arithmetic expressions involving addition and multiplication.
(expression) ::= (term) 1 (term) + (expression)
(term) ::= (factor) 1 (factor)* (term)
(factor) ::= ((expression)) )
21
Again,
w
is a special symbol which stands for any letter, but in this grammar
the letters are likely to represent variables with numeric values. Examples of
legal strings for this grammar are A+(B*C) and (A+B*C)*D*(A+(B+C)).
As we have defined things, some strings are perfectly legal both as arith-
metic expressions and as regular expressions. For example, A*(B+C) might
mean “add B to C and multiply the result by
A”
or “take any number of A’s
followed by either B or C.” This points out the obvious fact that checking
whether a string is legally formed is one thing, but understanding what it
means is quite another. We’ll return to this issue after we’ve seen how to
parse a string to check whether or not it is described by some grammar.
Each regular expression is itself an example of a context-free grammar:
any language which can be described by a regular expression can also be
described by a context-free grammar. The converse is not true: for example,
the concept of “balancing” parentheses can’t be captured with regular ex-
pressions.
Other types of grammars can describe languages which can’t be
described by context-free grammars. For example, context-sensitive grammars

which
is not used in the grammar) set j to 1, and call expression. If this results in
j being set to M+1, then the regular ex
3ression
is in the language described
by the grammar. Otherwise, we’ll see below how various error conditions are
handled.
The first thing that expression does is call term, which has a slightly more
complicated implementation:
procedure term ;
begin
fact
x-;
if
(1:
b]=‘(
‘)
or letter(ptj]) then term;
end
A direct translation from the grammar would simply have term call factor
and then term. This obviously won’t work because it leaves no way to
exit from term: this program would
go
into an infinite recursive loop if
called. (Such loops have particularly unpleasant effects in many systems.)
The implementation above gets around this by first checking the input to
decide whether term should be called. l’he first thing that term does is call
factor, which is the only one of the proc:dures that could detect a mismatch
in the input. From the grammar, we know that when factor is called, the
current input character must be either

’
then j : =j+
1
else error
end
else if letter(plj]) then
j:=j+l
else error;
if pb]=‘*‘then
j:=j+l;
end
;
Another error condition occurs when a
“)”
is missing.
These procedures are obviously recursive; in fact they are so intertwined
that they can’t be compiled in Pascal without using the forward construct
to get around the rule that a procedure can’t be used without first being
declared.
The parse tree for a given string gives the recursive cal! structure during
parsing. The reader may wish to refer to the tree above and trace through
the operation of the above three procedures when p contains (A*B+AC)D and
expression is called with
j=1.
This makes the origin of the “top-down” name
obvious. Such parsers are also often called recursive descent parsers because
they move down the parse tree recursively.
The top-down approach won’t work for all possible context-free gram-
mars. For example, if we had the production (expression) ::=
v

term, we used lookahead to avoid such a loop; in this case the proper way to
get around the problem is to switch the grammar to say (term)+(expression).
The occurrence of a nonterminal as the first thing on the right hand side of
a replacement rule for itself is called left recursion. Actually, the problem
is more subtle, because the left recursion can arise indirectly: for example
if we were to have the productions (expression) ::= (term) and (term) ::=
v
1 (expression) + (term). Recursive descent parsers won’t work for such
grammars: they have to be transformed to equivalent grammars without left
recursion, or some other parsing method has to be used. In general, there
is an intimate and very widely studied connection between parsers and the
grammars they recognize. The choice of a parsing technique is often dictated
by the characteristics of the grammar to be parsed.
Bottom- Up Parsing
Though there are several recursive calls in the programs above, it is an in-
structive exercise to remove the recursion systematically. Recall from Chapter
9 (where we removed the recursion from Quicksort) that each procedure call
can be replaced by a stack push and each procedure return by a stack pop,
mimicking what the Pascal system does to implement recursion. A reason
for doing this is that many of the calls which seem recursive are not truly
recursive. When a procedure call is the last action of a procedure, then a
simple
goto
can be used. This turns expression and term into simple loops,
which can be incorporated together and combined with factor to produce a
single procedure with one true recursive call (the call to expression within
factor).
This view leads directly to a quite simple way to check whether regular
expressions are legal. Once all the procedure calls are removed, we see that
each terminal symbol is simply scanned as it is encountered. The only real

The main difficulty in building a shift-reduce parser is deciding when to
shift and when to reduce. This can be a complicated decision, depending
on the grammar. Various types of shift-reduce parsers have been studied in
great detail, an extensive literature has been developed on them, and they are
quite often preferred over recursive descent parsers because they tend to be
slightly more efficient and significantly more flexible. Certainly we don’t have
space here to do justice to this field, and we’ll forgo even the details of an
implementation for our example.
Compilers
A compiler may be thought of as a program which translates from one lan-
guage to another. For example, a Pascal compiler translates programs from
the Pascal language into the machine language of some particular computer.
We’ll illustrate one way that this might be done by continuing with our
regular-expression pattern-matching example, where we wish to translate
from the language of regular expressions to a “language” for pattern-matching
machines, the ch,
nextl,
and next2 arrays of the match program of the pre-
vious chapter.
Essentially, the translation process is “one-to-one”: for each character in
the pattern (with the exception of parentheses) we want to produce a state
for the pattern-matching machine (an entry in each of the arrays). The trick
is to keep track of the information necessary to fill in the next1 and next2
arrays. To do so, we’ll convert each of the procedures in our recursive descent
parser into functions which create pattern-matching machines. Each function
will add new states as necessary onto the end of the ch, nextl, and next2
arrays, and return the index of the initial state of the machine created (the
final state will always be the last entry in the arrays).
For example, the function given below for the (expression) production
creates the “or” states for the pattern matching machine.

state, state);
end ;
end
;
This function uses a procedure setstate which simply sets the ch, nextl, and
next2 array entries indexed by the first argument to the values given in the
second, third, and fourth arguments, respectively. The index state keeps track
of the “current” state in the machine being built. Each time a new state is
created, state is simply incremented. Thus, the state indices for the machine
corresponding to a particular procedure call range between the value of state
on entry and the value of state on exit.
The final state index is the value
of state on exit. (We don’t actually “create” the final state by incrementing
state before exiting, since this makes it easy to “merge” the final state with
later initial states, as we’ll see below.)
With this convention, it is easy to check (beware of the recursive call!)
that the above program implements the rule for composing two machines with
the “or” operation as diagramed in the previous chapter. First the machine
for the first part of the expression is built (recursively), then two new null
states are added and the second part of the expression built. The first null
state (with index t2 1) is the final state of the machine of the first part of
the expression which is made into a “no-op” state to skip to the final state for
the machine for the second part of the expression, as required. The second
null state (with index t2) is the initial state, so its index is the return value
for expression and its next1 and next2 entries are made to point to the initial
states of the two expressions. Note carefully that these are constructed in the
opposite order than one might expect, because the value of state for the no-op
state is not known until the recursive call to expression has been made.
The function for (term) first builds the machine for a (factor) then, if
necessary, merges the final state of that machine with the initial state of the

begin
tl :=state;
if
plj]=‘(‘then
begin
j:=j+l;
t2:=expression;
if p b] =
‘)

’
then j := j+
1
else error
end
else if letter(pb]) then
begin
setstate(state,plj], state+l, 0);
t2:=state;
j:=j+l;
state:=state+I
end
else error;
if
p[j]<>‘*‘then
factor:=t2 else
begin
setstate(state,
’

i:=l
to N-l do
if
match(i)>=i
then writeln(i);
This program will print out all character positions in a text string
a[l
. N]
where a pattern
p[l

.M]
leads to a match.
Compiler-Compilers
The program for general regular expresr:ion pattern matching that we have
developed in this and the previous chapter is efficient and quite useful. A
version of this program with a few added capabilities (for handling
“don’t-
care” characters and other amenities) is likely to be among the most heavily
used utilities on many computer systems.
It is interesting (some might say confusing) to reflect on this algorithm
from a more philosophical point of view. In this chapter, we have considered
parsers for unraveling the structure of regular expressions, based on a formal
description of regular expressions using a context-free grammar. Put another
way, we used the context-free gramma]’ to specify a particular “pattern”:
sequences of characters with legally
balz.nced
parentheses. The parser then
checks to see if the pattern occurs in the input (but only considers a match
legal if it covers the entire input string). Thus parsers, which check that an

Parser generators and compiler-compilers are available for general use in
many computing environments, and are quite useful tools which can be used
to produce efficient and reliable parsers and compilers with a relatively small
amount of effort. On the other hand, top-down recursive descent parsers of the
type considered here are quite serviceable for simple grammars which arise in
many applications. Thus, as with many of the algorithms we have considered,
we have a straightforward method which can be used for applications where
a great deal of implementation effort might not be justified, and several ad-
vanced methods which can lead to significant performance improvements for
large-scale applications. Of course, in this case, this is significantly understat-
ing the point: we’ve only scratched the surface of this extensively researched
PARSING
281
Exercises
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
How does the recursive descent parser find an error in a regular expression
such as (A+B)*BC+ which is incomplete?
Give the parse tree for the regular expression ((A+B)+(Ct-D)*)*.
Extend the arithmetic expression grammar to include exponentiation, div
and mod.
Give a context-free grammar to

posite orientation: methods designed primarily to reduce space consumption
without using up too much time. Ironically, the techniques that we’ll examine
to save space are “coding” methods from information theory which were de-
veloped to minimize the amount of information necessary in communications
systems and therefore originally intended to save time (not space).
In general, most files stored on computer systems have a great deal of
redundancy. The methods we will examine save space by taking advantage
of the fact that most files have a relatively low “information content.” File
compression techniques are often used for text files (in which certain charac-
ters appear much more often than others), “raster” files for encoding pictures
(which can have large homogeneous areas), and files for the digital repre-
sentation of sound and other analog signals (which can have large repeated
patterns).
We’ll look at an elementary algorithm for the problem (which is still quite
useful) and an advanced “optimal” method. The amount of space saved by
these methods will vary depending on characteristics of the file. Savings of
20% to 50% are typical for text files, and savings of 50% to 90% might be
achieved for binary files. For some types of files, for example files consisting
of random bits, little can be gained. In fact, it is interesting to note that any
general-purpose compression method must make some files longer (otherwise
we could continually apply the method to produce an arbitrarily small file).
On one hand, one might argue that file compression techniques are less
important than they once were because the cost of computer storage devices
has dropped dramatically and far more storage is available to the typical user
than in the past. On the other hand, it can be argued that file compression
283
CHAPTER 22
techniques are more important than ever because, since so much storage is in
use, the savings they make possible are greater. Compression techniques are
also appropriate for storage devices which allow extremely high-speed access

lying on its side, which is
representative of the type of information that might have to be processed by a
text formatting system (such as the one used to print this book); at the right
is a list of numbers which might be used to store the letter in a compressed
form.
FILE COMPRESSION
000000000000000000000000000011111111111111000000000
000000000000000000000000001111111111111111110000000
000000000000000000000001111111111111111111111110000
000000000000000000000011111111111111111111111111000
000000000000000000001111111111111111111111111111110
0000000000000000000111111100000000000000~0001111111
000000000000000000011111000000000000000000000011111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000011100000000000000000000000000111
000000000000000000001111000000000000000000000001110
000000000000000000000011100000000000000000000111000
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
011111111111111111111111111111111111111111111111111
0!1111111111111111111111111111111111111111111111111
011000000000000000000000000000000000000000000000011
285
28 14 9
26 18 7
23 24 4
22 26 3

How can we make some letters represent digits and others represent
parts of the string to be encoded? One solution is to use some character
which is likely to appear rarely in the text as a so-called escape character.
Each appearance of that character signals that the next two letters form a
(count,character) pair, with counts represented by having the ith letter of
the alphabet represent the number i. Thus our example string would be
represented as follows with Q as the escape character:
286
CHAPTER 22
QDABBBAAQEBQHCDABCBAAAQDBCCCD
The combination of the escape character, the count, and the one copy
of the repeated character is called an escape sequence. Note that it’s not
worthwhile to encode runs less than four characters long since at least three
characters are required to encode any run.
But what if the escape character itself happens to occur in the input?
We can’t afford to simply ignore this possibility, because it might be difficult
to ensure that any particular character can’t occur. (For example, someone
might try to encode a string that has already been encoded.) One solution to
this problem is to use an escape sequence with a count of zero to represent the
escape character. Thus, in our example, the space character could represent
zero, and the escape sequence “Q(space)” would be used to represent any
occurrence of Q in the input. It is interesting to note that files which contain
Q are the only files which are made longer by this compression method. If a
file which has already been compressed is compressed again, it grows by at
least the number of characters equal to the number of escape sequences used.
Very long runs can be encoded with multiple escape sequences. For
example, a run of 51 A’s would be encoded as QZAQYA using the conventions
above. If many very long runs are expected, it would be worthwhile to reserve
more than one character to encode the counts.
In practice, it is advisable to make both the compression and expansion

how it is created. Suppose that we wish to encode the string
“A
SIMPLE
STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS.”
Encoding it in our standard compact binary code with the five-bit binary
representation of
i
representing the ith letter of the alphabet (0 for blank)
gives the following bit sequence:
000010000010011010010110110000011000010100000
100111010010010010010111000111000001010001111
000000001000101000000010101110000110111100100
001010010000000101011001101001011100011100000
000010000001101010010111001001011010000101100
000000111010101011010001000101100100000001111
001100000000010010011010010011
To “decode” this message, simply read off five bits at a time and convert
according to the binary encoding defined above. In this standard code, the
C, which appears only once, requires the same number of bits as the I, which
appears six times. The Huffman code achieves economy in space by encoding
frequently used characters with as few bits as possible so that the total number
of bits used for the message is minimized.
The first step is to count the frequency of each character within the
message to be encoded. The following code fills an array count[0 26] with the
frequency counts for a message in a character array a[l M]. (This program
uses the index procedure described in Chapter 19 to keep the frequency count
for the ith letter of the alphabet in count[i], with count[0] used for blanks.)
for i:=O to 26 do count [i]
:=O;
for

single tree:
FILE COMPRESSION
289
c!lb
1 1
c P
Note that nodes with low frequencies end up far down in the tree and nodes
with high frequencies end up near the root of the tree. The numbers labeling
the external (square) nodes in this tree are the frequency counts, while the
number labeling each internal (round) node is the sum of the labels of its
two sons. The small number above each node in this tree is the index into
the count array where the label is stored, for reference when examining the
program which constructs the tree below. (The labels for the internal nodes
will be stored in count[27 51] in an order determined by the dynamics of the
construction.) Thus, for example, the 5 in the leftmost external node (the
frequency count for N) is stored in count [14], the 6 in the next external node
(the frequency count for I) is stored in count
[9],
and the 11 in the father of
these two is stored in count[33], etc.
It turns out that this structural description of the frequencies in the form
of a tree is exactly what is needed to create an efficient encoding. Before
looking at this encoding, let’s look at the code for constructing the tree.
The general process involves removing the smallest from a set of unordered
elements, so we’ll use the pqdownheap procedure from Chapter 11 to build and
maintain an indirect heap on the frequency values. Since we’re interested in
small values first, we’ll assume that the sense of the inequalities in pqdownheap
has been reversed. One advantage of using indirection is that it is easy to
ignore zero frequency counts. The following table shows the heap constructed
for our example:

begin
N:=N+I;
heap[N] :=i end;
for k:=N downto 1 do pqdownheap(k);
As mentioned above, this assumes that the sense of the inequalities in the
pqdownheap code has been reversed.
Now, the use of this procedure to construct the tree as above is straightfor-
ward: we take the two smallest elements off the heap, add them and put the
result back into the heap. At each step we create one new count, and decrease
the size of the heap by one. This process creates N-l new counts, one for
each of the internal nodes of the tree being created, as in the following code:
repeat
t:=heap[l];
heap[l]:=heap[N];
N:=N-1;
pqdownheap(l);
count[26+N]:=count[heap[I]]+count[t];
dad[t]:=26+N; dad[heap[l]]:=-26-N;
heap[l]:=26+N;
pqdownheap(1);
until N=
1;
dad[26+N] :=O;
The first two lines of this loop are actually pqremove; the size of the heap is
decreased by one. Then a new internal node is “created” with index 26+Nand
given a value equal to the sum of the value at the root and value just removed.
Then this node is put at the root, which raises its priority, necessitating
another call on pqdownheap to restore order in the heap. The tree itself is
represented with an array of “father” links: dad[t] is the index of the father
of the node whose weight is in count

if count[k]=O then
begin code[k] :=O; len[k] :=O end
else
begin
i:=O;
j:=l;
t:=dad[k];
x:=0;
repeat
if t<O then begin x:=x+j;
t:= t
end;
t:=dad[t];
j:=j+j;
i:=i+I
until
t=O;
code[k]
:=x;
len[k]
:=i;
end ;
Finally, we can use these computed representations of the code to encode the
message:
for
j:=l
to M do
for i:=Ien[index(ab])] downto
1
do

the message is enough to offset the cost, or in situations where the coding trie
can be precomputed and used for a large number of messages. For example, a
trie based on the frequencies of occurrence of letters in the English language
could be used for text documents. For that matter, a trie based on the
frequency of occurrence of characters in Pascal programs could be used for
encoding programs (for example,
“;”
is likely to be near the top of such a
trie). A Huffman encoding algorithm saves about 23% when run on the text
for this chapter.
As before, for truly random files, even this clever encoding scheme won’t
work because each character will occur approximately the same number of
times, which will lead to a fully balanced coding tree and an equal number of
bits per letter in the code.
I

Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

ALGORITHMS phần 6 - Pdf 21

Tài liệu, ebook tham khảo khác

Học thêm