DocBook: The Definitive Guide

Chapter 3. Parsing DocBook Documents
A key feature of SGML and XML markup is that you can validate it. The
DocBook DTD is a precise description of valid nesting, the order of
elements, and their content. All DocBook documents must conform to this
description or they are not DocBook documents (by definition).
A validating parser is a program that can read the DTD and a particular
document and determine whether the exact nesting and order of elements in
the document is valid according to the DTD.
If you are not using a structured editor that can enforce the markup as you
type, validation with an external parser is a particularly important step in the
document creation process. You cannot expect to get rational results from
subsequent processing (such as document publishing) if your documents are
not valid.
The most popular free SGML parser is SP by James Clark, available at
http://www.jclark.com/.
SP includes nsgmls, a fast command-line parser. In the world of free
validating XML parsers, IBM AlphaWorks's xml4j and James Clark's xp are
popular choices.

Not all XML parsers are validating, and although a non-validating parser
may have many uses, it cannot ensure that your documents are valid
according to the DTD.
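
The difference can be illustrated with a short Python sketch. It assumes Python's standard xml.etree.ElementTree module, which checks only well-formedness, and a tiny made-up internal DTD subset (not the real DocBook DTD). The document below breaks its own DTD by using an undeclared element, yet the non-validating parser accepts it.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset with a tiny, made-up grammar (NOT the real DocBook DTD).
DOCUMENT = """<!DOCTYPE chapter [
  <!ELEMENT chapter (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<chapter><title>Test Chapter</title>
<paar>This element is not declared in the DTD.</paar>
</chapter>"""


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML.

    ElementTree does not validate against the DTD, so this says nothing
    about whether the nesting and order of elements are valid.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Well-formed, although <paar> is not declared anywhere in the DTD.
    print(is_well_formed(DOCUMENT))
    # Mismatched tags are a well-formedness error, so this one is rejected.
    print(is_well_formed("<chapter><para></chapter>"))
```

A validating parser given the same document would be expected to report the undeclared element.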
3.1. Validating Your Documents
The exact way in which the parser is executed varies according to the parser
in use, naturally. For information about your particular parser, consult the
documentation that came with it.
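
As one illustration, validation with nsgmls could be wrapped in a small Python helper like the sketch below. It uses only the -s, -v, and -c options that appear in this chapter's nsgmls examples. The function names are hypothetical, and the sketch assumes that an nsgmls executable is on the PATH and that a non-zero exit status signals errors; check your parser's documentation before relying on either.

```python
import subprocess
from typing import List


def build_nsgmls_command(catalog: str, document: str) -> List[str]:
    """Build an nsgmls command line from a catalog path and a document path."""
    # -s suppresses the normal output, -v prints the version number,
    # and -c names the catalog file to use.
    return ["nsgmls", "-s", "-v", "-c", catalog, document]


def validate(catalog: str, document: str) -> bool:
    """Run nsgmls and report whether it finished without errors.

    Assumption: nsgmls is installed and exits with a non-zero status
    when it reports errors.
    """
    result = subprocess.run(
        build_nsgmls_command(catalog, document),
        capture_output=True,
        text=True,
    )
    # Pass any messages from the parser through to the caller's terminal.
    print(result.stderr, end="")
    return result.returncode == 0
```

Keeping the command construction separate from the call makes it easy to log or adjust the exact invocation for a different parser.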
3.1.1. Using nsgmls
The nsgmls command from SP is a validating SGML parser. The options
used in the example below suppress the normal output (-s), except for error
messages, print the version number (-v), and specify the catalog file that

calstblx.dtd: 22, 0: Warning: Entity name, "secur",
already defined. This declaration will be ignored.
calstblx.dtd: 44, 48: Warning: Entity name,
"tbl.table.name", already defined. This declaration
will be ignored.
calstblx.dtd: 47, 78: Warning: Entity name,
"tbl.table.mdl", already defined. This declaration
will be ignored.
calstblx.dtd: 64, 80: Warning: Entity name,
"tbl.entry.mdl", already defined. This declaration
will be ignored.
(3)<!DOCTYPE chapter PUBLIC "-//Norman Walsh//DTD
DocBk XML V3.1.4//EN"
"n:/share/sgml/Norman_Walsh/db31xml/db3xml.dtd">
<chapter><title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>

(1)

You can ignore the warning message about the duplicate character
entity inodot. Both the ISO AMS Ordinary Math character entities
and the ISO Latin 2 character entities define the inodot entity.

3.2.1. DTD Cannot Be Found
The telltale sign that SP could not find the DTD, or some module of the
DTD, is the error message: "cannot generate system identifier for public text
…". Generally, the errors that occur after this are spurious; if SP couldn't
find some part of the DTD, it's likely to think that everything is wrong.
Careful examination of the following document will show that we've
introduced a simple typographic error into the public identifier (the word
"DocBook" is misspelled with a lowercase "b"):
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD Docbook
V3.1//EN">
<chapter><title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
<para>
<emphasis role=bold>This</emphasis> paragraph
contains
<emphasis>some <emphasis>emphasized</emphasis>
text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is

element "PARA" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:10:15:E: there is no attribute "ROLE"
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:10:19:E: element "EMPHASIS" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:11:9:E: element "EMPHASIS" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:11:24:E: element "EMPHASIS" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:12:18:E: element "SUPERSCRIPT" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:13:16:E: element "SUBSCRIPT" undefined
m:\jade\nsgmls.exe:examples\errs\nodtd.sgm:15:5:E: element "PARA" undefined
Other things to look for, if you haven't misspelled the public identifier, are
typos in the catalog or failure to specify a catalog that resolves the public
identifier that can't be found.
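
The lookup that fails here can be sketched in Python. This is a simplified model, not SP's actual catalog handling: it understands only PUBLIC entries with both identifiers quoted, and the system identifier path in the sample catalog is made up for illustration. It shows why a lowercase "b" is enough to make the public identifier unresolvable.

```python
import re
from typing import Dict, Optional

# Matches entries of the form: PUBLIC "public identifier" "system identifier"
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')

# Hypothetical catalog text; the system identifier path is an assumption.
CATALOG_TEXT = '''
PUBLIC "-//OASIS//DTD DocBook V3.1//EN" "docbook/3.1/docbook.dtd"
'''


def normalize(public_id: str) -> str:
    """Collapse runs of whitespace so line-wrapped identifiers compare equal."""
    return " ".join(public_id.split())


def load_catalog(text: str) -> Dict[str, str]:
    """Map each normalized public identifier to its system identifier."""
    return {normalize(pub): sysid for pub, sysid in _PUBLIC_ENTRY.findall(text)}


def resolve(catalog: Dict[str, str], public_id: str) -> Optional[str]:
    """Return the system identifier, or None if the catalog has no match."""
    # The comparison is case-sensitive: "Docbook" does not match "DocBook".
    return catalog.get(normalize(public_id))


if __name__ == "__main__":
    catalog = load_catalog(CATALOG_TEXT)
    print(resolve(catalog, "-//OASIS//DTD DocBook V3.1//EN"))
    print(resolve(catalog, "-//OASIS//DTD Docbook V3.1//EN"))
```

The second lookup returns None, which corresponds to the situation in which SP reports that it cannot generate a system identifier.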
3.2.2. ISO Entity Set Missing
A missing entity set is another example of either a misspelled public
identifier, or a missing catalog or catalog entry.
In this case, there's nothing wrong with the document, but the catalog that's
been specified is missing the public identifiers for the ISO entity sets:
[n:\dbtdg]nsgmls -sv -c examples\errs\cat2 examples\simple.sgm
m:\jade\nsgmls.exe:I: SP version "1.3.2"
m:\jade\nsgmls.exe:n:/share/sgml/docbook/3.1/dbcent.mod:53:65:W: cannot generate system identifier for public text "ISO 8879:1986//ENTITIES Added Math Symbols:Arrow Relations//EN"
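
Messages like these follow a regular shape (source, line, column, an optional severity letter, then the text), so they can be picked apart by a script. The Python sketch below is inferred only from the sample output shown in this chapter, not from SP's documentation, and the names Message, parse_message, and is_unresolved_identifier are hypothetical. It expects each message on a single line, without the line wrapping introduced by printing.

```python
import re
from typing import NamedTuple, Optional


class Message(NamedTuple):
    source: str              # program name and file name, as printed
    line: int
    column: int
    severity: Optional[str]  # "E", "W", ... or None for continuation notes
    text: str


# The source part contains colons of its own (drive letters), so match it
# lazily up to the first ":line:column:" pair.
_MESSAGE = re.compile(
    r"^(?P<source>.+?):(?P<line>\d+):(?P<column>\d+):"
    r"(?:(?P<severity>[A-Z]):)?\s*(?P<text>.*)$"
)


def parse_message(line: str) -> Optional[Message]:
    """Parse one message line; return None if it has no line and column."""
    match = _MESSAGE.match(line)
    if match is None:
        return None
    return Message(
        match["source"],
        int(match["line"]),
        int(match["column"]),
        match["severity"],
        match["text"],
    )


def is_unresolved_identifier(message: Message) -> bool:
    """True for the telltale sign that part of the DTD could not be found."""
    return message.text.startswith("cannot generate system identifier")


if __name__ == "__main__":
    sample = (
        r"m:\jade\nsgmls.exe:n:/share/sgml/docbook/3.1/dbcent.mod:53:65:W: "
        'cannot generate system identifier for public text '
        '"ISO 8879:1986//ENTITIES Added Math Symbols:Arrow Relations//EN"'
    )
    parsed = parse_message(sample)
    print(parsed.line, parsed.column, parsed.severity)
    print(is_unresolved_identifier(parsed))
```

A script that sees an unresolved-identifier message can stop there and point at the catalog, since the messages that follow it are usually spurious.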


3.2.3. Character Data Not Allowed Here
Out-of-context character data is frequently caused by a missing start tag, but
sometimes it's just the result of typing in the wrong place!
<!DOCTYPE chapter PUBLIC "-//Davenport//DTD DocBook
V3.0//EN">
<chapter><title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
You can't put character data here.
<para>
<emphasis role=bold>This</emphasis> paragraph
contains
<emphasis>some <emphasis>emphasized</emphasis>
text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is

<emphasis role=bold>This</emphasis> paragraph
contains
<emphasis>some <emphasis>emphasized</emphasis>
text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
[n:\documents\books\dbtdg]nsgmls -sv -c \share\sgml\catalog examples\errs\misspell.sgm
m:\jade\nsgmls.exe:I: SP version "1.3.2"
m:\jade\nsgmls.exe:examples\errs\misspell.sgm:9:5:E: element "PAAR" undefined
m:\jade\nsgmls.exe:examples\errs\misspell.sgm:14:6:E: end tag for element "PARA" which is not open
m:\jade\nsgmls.exe:examples\errs\misspell.sgm:21:9:E: end tag for "PAAR" omitted, but OMITTAG NO was specified
m:\jade\nsgmls.exe:examples\errs\misspell.sgm:9:0: start tag was here

in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
[n:\dbtdg]nsgmls -sv -c \share\sgml\catalog examples\errs\misspell2.sgm
m:\jade\nsgmls.exe:I: SP version "1.3.2"
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:2:35:E: end tag for element "TITEL" which is not open
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:3:5:E: document type does not allow element "PARA" here; missing one of "FOOTNOTE", "MSGTEXT" start-tag
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:9:5:E: document type does not allow element "PARA" here; missing one of "FOOTNOTE", "MSGTEXT" start-tag
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:15:5:E: document type does not allow element "PARA" here; missing one of "FOOTNOTE", "MSGTEXT" start-tag
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:21:9:E: end tag for "TITLE" omitted, but OMITTAG NO was specified
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:2:9: start tag was here
m:\jade\nsgmls.exe:examples\errs\misspell2.sgm:21:9:E: end tag for "CHAPTER" which is not finished
These are pretty easy to spot as well, but look at how confused the parser
became. From the parser's point of view, failure to close the open Title

every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
</chapter>
[n:\dbtdg]nsgmls -sv -c \share\sgml\catalog examples\errs\badstarttag.sgm
m:\jade\nsgmls.exe:I: SP version "1.3.2"
m:\jade\nsgmls.exe:examples\errs\badstarttag.sgm:9:12:E: document type does not allow element "TITLE" here; missing one of "CALLOUTLIST", "SEGMENTEDLIST", "VARIABLELIST", "CAUTION", "IMPORTANT", "NOTE", "TIP", "WARNING", "BLOCKQUOTE", "EQUATION", "EXAMPLE", "FIGURE", "TABLE" start-tag
In this example, we probably wanted a FormalPara, so that we could
have a title on the paragraph. But note that the parser didn't suggest this
alternative. The parser only tries to add additional elements, rather than
rename elements that it's already seen.
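
Because the parser does not propose renames, a small helper can do so for names it reports as undefined. The Python sketch below is only an illustration: it uses the standard difflib module to find close matches, and the list of element names is a small hand-picked subset rather than the full DTD. The names KNOWN_ELEMENTS and suggest are hypothetical.

```python
import difflib
from typing import List, Sequence

# Illustrative subset of DocBook element names, upper-cased the way
# nsgmls prints them in its messages. The real DTD defines many more.
KNOWN_ELEMENTS = [
    "CHAPTER",
    "TITLE",
    "PARA",
    "FORMALPARA",
    "EMPHASIS",
    "SUPERSCRIPT",
    "SUBSCRIPT",
]


def suggest(name: str, known: Sequence[str] = KNOWN_ELEMENTS) -> List[str]:
    """Return up to three known element names that resemble the given name."""
    return difflib.get_close_matches(name.upper(), known, n=3, cutoff=0.6)


if __name__ == "__main__":
    # Names taken from "undefined" or "not open" messages in nsgmls output.
    for undefined in ("PAAR", "TITEL"):
        print(undefined, "->", suggest(undefined))
```

A similarity cutoff of 0.6 is difflib's default; a stricter value gives fewer but more plausible suggestions.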
3.2.7. Missing End Tag
Leaving out an end tag is a lot like an out-of-context start tag. In fact, they're
really the same error. The problem is never caused by the missing end tag
per se, rather it's caused by the fact that something following it is now out of
context.
<!DOCTYPE chapter PUBLIC "-//Davenport//DTD DocBook
V3.0//EN">
<chapter><title>Test Chapter</title>

E: document type does not allow element "PARA"
here; missing one of "FOOTNOTE", "MSGTEXT",
"CAUTION", "IMPORTANT", "NOTE", "TIP", "WARNING",
"BLOCKQUOTE", "INFORMALEXAMPLE" start-tag
m:\jade\nsgmls.exe:examples\errs\noendtag.sgm:20:9:E: end tag for "PARA" omitted, but OMITTAG NO was specified
m:\jade\nsgmls.exe:examples\errs\noendtag.sgm:9:0: start tag was here
In this case, the parser figured out that the best thing it could do is end the
paragraph.
3.2.8. Bad Entity Reference
If you spell an entity name wrong, the parser will catch it.
<!DOCTYPE chapter PUBLIC "-//Davenport//DTD DocBook
V3.0//EN">
<chapter><title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
<para>
There's no entity called &xyzzy; defined in this
document.
</para>
<para>

characters.
<!DOCTYPE chapter PUBLIC "-//Davenport//DTD DocBook
V3.0//EN">
<chapter><title>Test Chapter</title>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test
chapter. It is
unremarkable in every regard. This is a paragraph
in the test
chapter. It is unremarkable in every regard.
</para>
<para>
The DocBook declaration in use doesn't allow 8 bit
characters
like “this”.
</para>
<para>
<emphasis role=bold>This</emphasis> paragraph
contains
<emphasis>some <emphasis>emphasized</emphasis>
text</emphasis>
and a <superscript>super</superscript>script
and a <subscript>sub</subscript>script.
</para>
<para>
This is a paragraph in the test chapter. It is
unremarkable in
every regard. This is a paragraph in the test

