Valid XML Document and DTD
Introducing the Valid XML Document and the DTD

In the last section, we reviewed the process of creating a "well-formed" XML document. As you saw, there are many rules you must follow to ensure that your XML document is well-formed. But even when you write well-formed XML documents, you're not quite out of the woods! Making your document well-formed is only half the battle. You must also make sure that the document is valid.

A valid document is, by definition, a well-formed XML document. But validity goes one step further. A valid XML document is also a conforming SGML document, and as such, can be read and interpreted as one. To pass the validity test, an XML document must conform to the specifications defined by a Document Type Definition (DTD).

You can think of the DTD as defining the overall structure and syntax of the document. The DTD is in fact the meat of the "meta-markup" concept: it defines the grammar and vocabulary of a markup language. In short, the DTD specifies everything a parser needs to know in order to interpret a well-formed XML document. This "specification" can be as simple as listing all the valid components (such as elements, attributes, and entities) that an XML document may contain, or as complex as specifying relationships between those components (such as "element X must contain either element Y or element Z but never both").

For example, do you remember our CONTACT XML document from previous sections? A CONTACT DTD might specify that every CONTACT has an

To help you get a feel for the difference between well-formed XML and valid XML, consider the following well-formed English:

As you can see, all the words and punctuation represent well-formed elements of English. However, unless you are into absurdist poetry, the words and punctuation are virtually meaningless and difficult to interpret (especially by a computer). To be valid English, the words must conform to a standard grammatical structure. For example,

The quick brown fox jumped over the lazy dog.

In the case of the markup languages defined by XML, the DTD provides the grammatical structure that brings order to the elements of the language. To specify grammatical rules, DTDs use a set of regular-expression-like patterns that are matched against the XML document to determine whether or not it is valid. Matching is done conservatively, so anything not specifically allowed by the DTD is forbidden.

Okay, enough about what DTDs are. Let's look at how you'll build them.

The Prolog and the Body

As we mentioned earlier, all documents are made up of a prolog and a body. The document prolog contains the XML Declaration and the document body contains the actual marked-up document. Recall from previous sections that we had developed a CONTACTS XML document that looked something like the following:
<!--Beginning of prolog-->
What we did not say earlier was that the prolog also holds the DTD.

The Basic DTD

The simplest usage of a DTD involves adding the DTD directly to the prolog portion of your XML document, just after the XML declaration. The skeleton (not quite valid) of a DTD looks something like the following:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
In this case we declare a document with a root element called CONTACTS:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<!DOCTYPE CONTACTS [
]>
<CONTACTS>
</CONTACTS>
Element Type Declarations (ETDs)

As we mentioned parenthetically, the above DTD is "not quite valid". It really only says that the parser should expect a document with a root element named CONTACTS. It does not say anything about the contents or structure of that document. However, to be valid, a document's DTD must specify every detail of its structure!

To specify the structure, we must populate the DTD with Element Type Declarations (ETDs). ETDs specify the names of elements and whether or not those elements may have any children. Elements may have several types of children, ranging from none, to plain parsed character data, to other elements, to other elements with their own children, to any of the above. ETDs follow the generic syntax of
<!ELEMENT ELEMENT_NAME CHILDREN_NAMES>
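As a rough sketch of that range of children, the CHILDREN_NAMES slot might be filled in ways like the ones below. The element names DIVIDER, NOTE_TEXT, and WRAPPER are hypothetical and appear only for illustration; EMPTY, #PCDATA, and ANY are keywords defined by XML 1.0.

```xml
<!-- No children at all -->
<!ELEMENT DIVIDER EMPTY>
<!-- Only parsed character data -->
<!ELEMENT NOTE_TEXT (#PCDATA)>
<!-- Exactly one child element named NOTE_TEXT -->
<!ELEMENT WRAPPER (NOTE_TEXT)>
<!-- Any mix of character data and declared elements -->
<!ELEMENT CONTACTS ANY>
```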
In the case of our CONTACTS element we might see something like the following:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS ANY>
]>
<CONTACTS>
</CONTACTS>
In this case, the DTD defines an XML document containing a single root element named CONTACTS (don't forget XML is case sensitive) that may contain ANY (case sensitive) type of child, including parsed character data or other elements. Note however, that though CONTACTS "could" contain other elements, no element other than CONTACTS is actually allowed by the DTD since no other elements are defined. All elements in an XML document must be defined in the DTD. Thus, the following XML, though well-formed, is invalid!
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS ANY>
]>
<CONTACTS>
<CONTACT>
<NAME>Roger Kaplan</NAME>
</CONTACT>
</CONTACTS>
NOTE: Unlike elements, parsed character data within an "ANY" declaration does not need to be defined. Thus, the following XML document would be valid:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS ANY>
]>
<CONTACTS>
Here is some plain parsed character data.
</CONTACTS>
For the earlier document, with its CONTACT and NAME elements, to be valid, you must also define the CONTACT and NAME elements:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS ANY>
<!ELEMENT CONTACT (NAME)>
<!ELEMENT NAME (#PCDATA)>
]>
<CONTACTS>
<CONTACT>
<NAME>Roger Kaplan</NAME>
</CONTACT>
</CONTACTS>
In this case, we see that we have defined an XML document with a single root element named CONTACTS. A CONTACTS element may contain anything, a CONTACT must contain exactly one NAME, and a NAME contains parsed character data.

NOTE: It is bad form to use the ANY keyword for any element other than the root element. Generally, you should try to be as conservative as the DTD itself: think in terms of everything being denied except what you specifically allow.

Also, note that the order in which you specify ETDs does not matter. Thus,
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT CONTACTS ANY>
<!ELEMENT CONTACT (NAME)>
would work just as well as
<!ELEMENT CONTACTS ANY>
<!ELEMENT CONTACT (NAME)>
<!ELEMENT NAME (#PCDATA)>
Finally, note that you may not declare the same element more than once with different definitions, such as:
<!ELEMENT CONTACTS ANY>
<!ELEMENT CONTACT (NAME)>
<!ELEMENT CONTACT (EMAIL)>
<!ELEMENT NAME (#PCDATA)>
The double definition of CONTACT is not allowed.
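One way to resolve the clash, sketched here on the assumption that a CONTACT is meant to hold both a NAME and an EMAIL, is to fold the two definitions into a single declaration that lists both children in order. The EMAIL declaration and the sample address are illustrative additions, not part of the original CONTACTS document.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE CONTACTS [
<!ELEMENT CONTACTS ANY>
<!-- One declaration for CONTACT: a NAME followed by an EMAIL -->
<!ELEMENT CONTACT (NAME, EMAIL)>
<!ELEMENT NAME (#PCDATA)>
<!-- EMAIL is assumed for this sketch -->
<!ELEMENT EMAIL (#PCDATA)>
]>
<CONTACTS>
<CONTACT>
<NAME>Roger Kaplan</NAME>
<EMAIL>roger@example.com</EMAIL>
</CONTACT>
</CONTACTS>
```

With this declaration, each element name is declared exactly once, and every CONTACT must contain a NAME followed by an EMAIL.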
The ANY and #PCDATA keywords are pretty straightforward. And in this case, the definition of the

NOTE: Element names should begin with either a letter, an underscore (_), or a colon (:), followed by some combination of letters, numbers, periods (.), colons, underscores, or hyphens (-), but no white space. In addition, no names should begin with any form of "xml". It is also a good idea not to use a colon as the first character in a name even though it is legal, since doing so could be confusing. Further, though the XML 1.0 standard allows names of any length, actual XML processors may limit the length of markup names.

However, as we mentioned before, the regular expression functionality offered through DTDs allows you to get very flexible with the definition/declaration of elements and their children. Let's take a look...

By Selena Sol at eXtropia