XML: The Well-formed Document

The Well-Formed Document

"HTML 4.0 has about three hundred different tags. Most of these have half a dozen possible attributes for several thousand variations. Because XML is more powerful than HTML, you may think XML would have even more tags, but you'd be wrong....XML predefines almost no tags." - Elliotte Rusty Harold

As we have said before, XML is a tool used to generate markup languages in general rather than a specific markup language. Thus, rather than pre-defining a set of tags, XML defines a methodology for tag creation. Once defined, tags are mixed with plain text in order to form an "XML document".

It is worth mentioning that the word "document" can be a little misleading because although XML markup can certainly be contained in a file, (as the word document would imply), it can also be sent as a data stream, a database result set, or be dynamically-generated by one application and sent to another. More correctly, an XML document can be thought of as a "data object", but for simplicity, document will work just fine.

However, though you are free to be as innovative as you want with the tag sets you create, you must follow the XML rules for creating tags exactly. For an XML processor to understand how to process a document, the document must follow the XML standard. Specifically, the document must be "well-formed". If the document is not well-formed, the processor will stop and report a "fatal error".
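
To see what such a fatal error looks like in practice, here is a minimal sketch using the ElementTree parser from Python's standard library. The helper name is_well_formed is my own invention for this illustration, and other XML processors will word their complaints differently.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text, False if it hits a fatal error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # ElementTree reports well-formedness errors as ParseError.
        return False

print(is_well_formed("<NAME>Selena Sol</NAME>"))  # start tag and end tag match
print(is_well_formed("<NAME>Selena Sol"))         # the end tag is missing
```

The second call fails because the processor stops as soon as it runs out of input without having seen the end tag.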

Well-formedness has an exact meaning in XML. Specifically, a well-formed document adheres to the syntax rules of the XML 1.0 specification, which cover both the physical and the logical structure of the document.

Why get so caught up in syntax? Well, the creators of XML had a tough problem to solve. They had to create a system in which documents could be created that could be read either by people or by machines. Writing a language for people is one thing...people can figure their way through ambiguity. Machines, on the other hand, can only work if the rules are clear and the rules are followed. Making your XML document well-formed means that it meets the minimum requirement of being readable by the dumbest of computers.

XML Document Structure

Physically, documents are composed of a set of "entities" that are identified by unique names (except the document entity, which we will discuss later). Every document begins with a "root" or document entity. All other entities are optional.

So as not to confuse you, I want to mention that you have not seen any named entities in previous examples. We have only needed the "document entity", which you don't need to define explicitly because XML gives it to you for free. We'll look at entities in greater detail later.

What is important for the moment is that you understand that entities can be thought of as aliases for larger pieces of content. That is, a single entity name can take the place of a whole lot of text. As in any computer aliasing scheme, entity references cut down the amount of typing you have to do: any time you need that bunch of text, you simply use the alias name and the processor will expand the contents of the alias for you.
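
As a small preview of the aliasing idea, the sketch below declares an entity in an internal DTD subset and lets Python's standard ElementTree parser expand it. The entity name "sig" and the NOTE element are hypothetical names chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE's internal subset declares a hypothetical entity named "sig".
document = """<!DOCTYPE NOTE [
  <!ENTITY sig "Selena Sol at eXtropia">
]>
<NOTE>Written by &sig;</NOTE>"""

note = ET.fromstring(document)

# The parser has already replaced &sig; with the text it stands for.
print(note.text)
```

The application never sees the alias itself, only the expanded text.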

Alongside their physical structure, XML documents have a logical structure as well. Logically, documents are composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.

Data Versus Markup

All XML documents may be understood in terms of the data they contain and the markup that describes that data. Data is typically "character data" (letters, numbers, punctuation...anything within the boundaries of valid Unicode), though a document can also point to binary data held outside the document itself. Markup includes tags, comments, processing instructions, document type declarations, references, and so on.

The simplest example of character data and markup would be something like the following:

<NAME>Selena Sol</NAME>

In this case, the <NAME> and </NAME> tags make up the markup and "Selena Sol" is the character data. As you can imagine, there are few rules that govern your data (content) other than which characters are allowed. On the other hand, there are many rules that define how you must code your markup.
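
If you want to see the split between markup and data from a program's point of view, here is a minimal sketch using Python's standard ElementTree parser. The variable name is arbitrary.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring("<NAME>Selena Sol</NAME>")

print(name.tag)   # comes from the markup: the name used in the tags
print(name.text)  # the character data found between the tags
```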

In the rest of this section, we will outline what requirements you must satisfy in order to write well-formed XML.

The XML Declaration

To begin an XML document, it is a good idea to include the XML declaration as the very first line of the document. I say "good idea" because, though the XML declaration is optional, it is suggested by the W3C specification.

Essentially, the XML declaration looks and works much like a processing instruction: it notifies the processing agent that the following document has been marked up as an XML document. It will look something like the following:

<?xml version = "1.0"?>

We'll talk more about the gory details of processing instructions later, but we can at least explain how the XML declaration works.

Processing instructions, and the XML declaration along with them, begin with <? and end with ?>. Following the initial <?, you will find the name, which in this case is "xml".

The XML declaration requires that you specify a "version" attribute and allows you to specify optional "encoding" and "standalone" attributes, in that order.

In its full regalia, the XML declaration might look like the following:

<?xml version = "1.0" encoding = "UTF-8" standalone = "yes"?>

The Version Attribute

As we said before, if you do decide to use the optional XML declaration, you must define the "version" attribute. As of this writing, the current version of XML is 1.0. Note that if you include the optional attributes, "version" must be specified first.
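
The sketch below checks the "version" rule with Python's standard ElementTree parser. The helper name parses is made up for this example, and the documents are passed as bytes so that the parser reads the declaration exactly as written.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the document is accepted, False on a fatal error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

with_version = b'<?xml version = "1.0"?><NAME>Selena Sol</NAME>'
no_version = b'<?xml encoding = "UTF-8"?><NAME>Selena Sol</NAME>'
version_last = b'<?xml encoding = "UTF-8" version = "1.0"?><NAME>Selena Sol</NAME>'

print(parses(with_version))   # version present and first
print(parses(no_version))     # version left out
print(parses(version_last))   # version present, but not first
```

Only the first declaration is accepted. The other two are rejected before the parser even reaches the NAME element.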

The Standalone Attribute

The "standalone" attribute specifies whether the document relies on markup declarations that are defined in a separate document. Thus, if "standalone" is set to "yes", the document does not depend on any markup declarations in external DTDs. Setting it to "no" leaves the issue open: the document may or may not use external DTDs.

The Encoding Attribute

All XML parsers must support the UTF-8 and UTF-16 encodings of Unicode (plain ASCII text is already valid UTF-8). However, XML parsers may support a larger set. You'll rarely need to work with this, so I'll simply refer you to section 4.3.3 of the XML Specification, where you can read more about encoding declarations.
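
For the curious, here is a minimal sketch that feeds the same element to Python's standard ElementTree parser twice, once as UTF-8 bytes and once as UTF-16 bytes. The variable names are arbitrary, and Python's "utf-16" codec adds a byte order mark at the front of the encoded text.

```python
import xml.etree.ElementTree as ET

body = "<NAME>Selena Sol</NAME>"

# Same document, two encodings; the declaration names the encoding in use.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
utf16_doc = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

for raw in (utf8_doc, utf16_doc):
    # The byte counts differ, but the parsed character data is identical.
    print(len(raw), ET.fromstring(raw).text)
```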

Elements

Once you have written your XML declaration, you are ready to begin coding your XML document. To do so, you should understand the concept of elements.

Elements are the basic unit of XML content. Syntactically, an element consists of a start tag, an end tag, and everything in between. For example, consider the following element:

<NAME>Frank Lee</NAME>

All XML documents must have exactly one root element to be well-formed. The root element, also often called the document element, must follow the prolog (the XML declaration plus the document type declaration) and must encompass the entire document.

Generally, you are supposed to match the root element name to the name in the document type declaration. For example, this declaration

<!DOCTYPE Instrument PUBLIC "-//NASA//Instrument Markup Language 0.2//EN" "http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">

implies that "Instrument" is my root element. (A processor that only checks well-formedness does not enforce this rule, but a validating processor will.)

XML defines the text between the start and end tags to be "character data" and the text within the tags to be "markup".
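
The root element rule is easy to try out. In this minimal sketch, which uses Python's standard ElementTree parser, the second document has two top-level elements and is rejected. The variable names are my own.

```python
import xml.etree.ElementTree as ET

one_root = "<CONTACT><NAME>Frank Lee</NAME></CONTACT>"
two_roots = "<NAME>Frank Lee</NAME><NAME>Selena Sol</NAME>"

# A single root element that encloses everything else parses fine.
root = ET.fromstring(one_root)
print(root.tag, [child.tag for child in root])

# A second top-level element is a fatal error.
try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError as error:
    two_roots_ok = False
    print("fatal error:", error)
```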

Character Data

Character data may be any legal (Unicode) character with the exception of "<" and "&". The "<" character is reserved for the start of a tag, and the "&" character is reserved for the start of an entity reference.

XML also provides five predefined entity references that you can use so as not to create any doubt whether you are specifying character data or markup. Specifically, XML provides the following entity references:

Character    Entity Reference
>            &gt;
<            &lt;
&            &amp;
"            &quot;
'            &apos;

Obviously, the &lt; entity reference is useful for character data. The other entity references can be used within markup in cases in which there could be confusion, such as:

<STATEMENT VALUE = "She said, "Don't go there!"">

Which should be written as:

<STATEMENT VALUE = "She said, &quot;Don&apos;t go there!&quot;">
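
You rarely need to type these references by hand. As a minimal sketch, Python's standard xml.sax.saxutils.escape function handles &, < and > on its own, and the extra mapping passed here asks it to replace the two quote characters as well. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

statement = 'She said, "Don\'t go there!"'

# escape() always replaces &, < and >; the mapping adds both quote characters.
escaped = escape(statement, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# Round trip: the parser turns the references back into plain characters.
element = ET.fromstring('<STATEMENT VALUE = "' + escaped + '"/>')
print(element.get("VALUE"))
```

The second line printed is the original sentence, quotes and all.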

Tags

By and large, tags make up the majority of XML markup. A tag is pretty much anything between a < sign and a > sign that is not inside a comment or a CDATA section (we'll discuss these in a bit). In short, it looks pretty much the same as an HTML tag.

The rules governing tags are a little more complex than those governing character data. Let's take a look at them....

Gimme Something to Work With

For one, every well-formed XML document must have at least one element!

Watch Your Case

Also, you must take care to keep case consistent within a tag set. XML is case sensitive, so the tags <HELLO> and <hello> are not equivalent as they would be in HTML.

End Your Tags Right

Further, besides being spelled and capitalized the same way as their start tag counterparts, end tags must include an initial forward slash "/". Thus in most cases, a start tag of <HELLO> should be closed with a </HELLO>.

I say "in most cases" because in certain circumstances you can bypass the end tag. Specifically, if you need to use a tag that has no content, you may use a single start tag with a trailing forward slash, such as: <HR/>
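
Both rules can be checked in a few lines. This is a minimal sketch with Python's standard ElementTree parser; the helper name parses is made up for the example.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the document is accepted, False on a fatal error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses("<HELLO>hi</HELLO>"))  # end tag matches the start tag
print(parses("<HELLO>hi</hello>"))  # same letters, different case

# The short form and the long form describe the same empty element.
hr_short = ET.fromstring("<HR/>")
hr_long = ET.fromstring("<HR></HR>")
print(hr_short.tag == hr_long.tag, hr_short.text, hr_long.text)
```

The mismatched pair is rejected, while <HR/> and <HR></HR> both come back as an HR element with no content.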

Nest Properly

Also, note that XML elements may contain other elements but the nesting of elements must be correct. Thus the following example is wrong:

<CONTACT>
<NAME>Frank Lee
<EMAIL>flee@flee.com
</CONTACT></NAME></EMAIL>

Instead, it should be:

<CONTACT>
<NAME>Frank Lee</NAME>
<EMAIL>flee@flee.com</EMAIL>
</CONTACT>
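
To confirm the difference, the sketch below runs both versions through Python's standard ElementTree parser. The variable names are my own, and the exact wording of the error will vary from one processor to another.

```python
import xml.etree.ElementTree as ET

badly_nested = """<CONTACT>
<NAME>Frank Lee
<EMAIL>flee@flee.com
</CONTACT></NAME></EMAIL>"""

properly_nested = """<CONTACT>
<NAME>Frank Lee</NAME>
<EMAIL>flee@flee.com</EMAIL>
</CONTACT>"""

# The first version closes CONTACT while NAME and EMAIL are still open.
try:
    ET.fromstring(badly_nested)
    nesting_error = None
except ET.ParseError as error:
    nesting_error = error
    print("fatal error:", error)

# The second version parses, and each child keeps its own character data.
contact = ET.fromstring(properly_nested)
print({child.tag: child.text for child in contact})
```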

Name Your Tags Legally

Tag names must begin with either a letter, an underscore (_) or a colon (:), followed by any combination of letters, numbers, periods (.), colons, underscores, or hyphens (-), but no white space. The one exception is that no tag name should begin with any form of "xml", which is reserved. It is also a good idea not to use a colon as the first character of a tag name, even though it is legal, because a leading colon could be confusing.

Further, though the XML 1.0 standard specifies names of any length, actual XML processors may limit the length of markup names.
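
As a rough sketch, the naming rule can be approximated with a regular expression. Note the assumptions: this pattern only covers ASCII letters and digits, whereas the specification also allows many other Unicode letters, and the names NAME_PATTERN and is_legal_name are invented for this example.

```python
import re

# ASCII-only approximation: a letter, underscore or colon first, then any mix
# of letters, digits, periods, colons, underscores and hyphens.
NAME_PATTERN = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")

def is_legal_name(name):
    """Approximate check of a tag name against the rule described above."""
    if name.lower().startswith("xml"):
        return False  # names beginning with any form of "xml" are reserved
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ["ROOM_SIZE", "_note", "2ND_FLOOR", "ROOM SIZE", "XmlData"]:
    print(candidate, is_legal_name(candidate))
```

Of the five candidates, only ROOM_SIZE and _note pass. The others start with a digit, contain white space, or begin with a form of "xml".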

Define Valid Attributes

Finally, tags may specify any number of supporting attributes. An attribute, whose name must not appear more than once in any one tag, is a name/value pair joined by an equals sign (=) in which the value is delimited by quotation marks, such as:

<SHOE STYLE = "SPECTATOR" COLORING = "BLACK_AND_WHITE">

Unlike HTML, XML specifies that values MUST be delimited with quotation marks (either single or double).

In this case, STYLE and COLORING are attributes of the SHOE tag and "SPECTATOR" is the value of the STYLE attribute and "BLACK_AND_WHITE" is the value of the COLORING attribute.

Attribute names follow the same conventions as tag names (valid characters, case sensitivity, etc). Values, on the other hand, may include white space and punctuation, and may include entity references when necessary.
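
Here is a minimal sketch of the attribute rules in action, again using Python's standard ElementTree parser. The helper name parses and the "LOAFER" value are made up for the example, and I have closed the SHOE tag with a trailing slash so that each snippet is a complete document.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Return True if the document is accepted, False on a fatal error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

quoted = '<SHOE STYLE = "SPECTATOR" COLORING = "BLACK_AND_WHITE"/>'
unquoted = '<SHOE STYLE = SPECTATOR/>'
duplicate = '<SHOE STYLE = "SPECTATOR" STYLE = "LOAFER"/>'

print(parses(quoted))     # two different attributes, both quoted
print(parses(unquoted))   # the value is missing its quotation marks
print(parses(duplicate))  # STYLE appears twice in the same tag
```

Only the first tag is accepted.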

Note that attribute values are not typed. That is, they are all considered to be strings. Thus if you were to process the tag

<ROOM_SIZE RADIUS = "10" DEPTH = "13">

you would have to convert "10" and "13" to their numeric values outside of the XML environment.
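
In Python, for instance, that conversion might look like the sketch below. It uses the standard ElementTree parser, the variable names are my own, and I have closed the tag with a trailing slash so that it forms a complete document.

```python
import xml.etree.ElementTree as ET

room = ET.fromstring('<ROOM_SIZE RADIUS = "10" DEPTH = "13"/>')

# Every attribute value arrives as a string, whatever it looks like.
print(room.attrib)

# The conversion to numbers is the application's job, not the parser's.
radius = int(room.get("RADIUS"))
depth = int(room.get("DEPTH"))
print(radius + depth)
```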



By Selena Sol at eXtropia