Introduction to XML for Web Developers (Part 1 of 8)

What is a Markup Language?

If you have decided to learn about XML, you are probably already quite familiar with the concepts behind HTML (HyperText Markup Language). So let's start from there.

HTML, as its name implies, is a markup language. As such, it is used to mark up text. But what exactly does it mean to mark up text?

Abstractly, marking up text is a methodology for encoding data with information about itself. Examples of markups (encoded data) are ubiquitous in the real world.

For example, back when you were slogging through high school, you probably used a bright yellow highlighter pen to highlight sentences in your schoolbooks (or at least you knew someone who did!). You did so because you thought that the highlighted sentences would be useful to review around exam time and you wanted a quick way to skim through the important points. Just like you, thousands of kids around the world did the exact same thing for the exact same reason.

By highlighting certain bits of text, you were effectively "marking-up" the data. Essentially, you specified that certain sentences (data) were important by marking them in yellow. These sentences became encoded with the fact that they were important.

And what's more, since everyone followed the same standard of marking up, you could easily pick up a used textbook and get a good idea of its core points just from reading the highlighted sections.

There are two crucial points to take away from this example. For markups to transmit useful information about data to a pool of users...

  1. a standard must be in place to define what a valid markup is - In the example above, a markup is defined as a bit of yellow ink atop text. In HTML, a markup is a tag.
  2. a standard must be in place to define what a markup means - In the example above, a yellow highlight means the highlighted text represents an important point. In HTML, each tag communicates its own layout or formatting meaning.

Markups are also ubiquitous in the world of computers. They are used by word processors to specify formatting and layout, by communications programs to express the meaning of data sent over the wires, by database applications that must associate meaning and relationships with the data they serve, and by multimedia processing programs which must express meta-data about images or sound.

As data is sent through dumb computers and programs, it is essential that the data carries with it information necessary to communicate what the data means and/or what the receiver should do with that data.

Data with no context is meaningless just as an unhighlighted book is bad news around exam time!

HTML is one of the more famous computer markup systems. HTML defines a set of tags that associate formatting rules with bits of text. Documents which have been marked up (which contain plain text as well as the tags that specify the rules for formatting that text) are read by an HTML processing application (a web browser for example) that knows how to display the text according to the rules.

For example, the <B> tag specifies a rule which instructs an HTML processing application to bold a specific bit of text. Similarly, the <CENTER> tag instructs the HTML processing application to center the text.

Thus <CENTER><B>BOLD</B></CENTER> would be displayed by an HTML processing application as

BOLD

You might imagine a client contact list which could look like the following bit of HTML code:

<UL>
<LI>Gunther Birznieks
<UL>
<LI>Client ID: 001
<LI>Company: Bob's Fish Store
<LI>Email: gunther@bobsfishstore.com
<LI>Phone: 662-9999
<LI>Street Address: 1234 4th St.
<LI>City: New York
<LI>State: New York
<LI>Zip: 10024
</UL>
<LI>Susan Czigany
<UL>
<LI>Client ID: 002
<LI>Company: Netscape
<LI>Email: susan@eudora.org
<LI>Phone: 555-1234
<LI>Street Address: 9876 Hazen Blvd.
<LI>City: San Jose
<LI>State: California
<LI>Zip: 90034
</UL>
</UL>

The above HTML-encoded data would be displayed by an HTML processing application as:

  • Gunther Birznieks
    • Client ID: 001
    • Company: Bob's Fish Store
    • Email: gunther@bobsfishstore.com
    • Phone: 662-9999
    • Street Address: 1234 4th St.
    • City: New York
    • State: New York
    • Zip: 10024
  • Susan Czigany
    • Client ID: 002
    • Company: Netscape
    • Email: susan@eudora.org
    • Phone: 555-1234
    • Street Address: 9876 Hazen Blvd.
    • City: San Jose
    • State: California
    • Zip: 90034

What is XML?

Like HTML, XML (also known as Extensible Markup Language) is a markup language which relies on the concept of rule-specifying tags and the use of a tag-processing application that knows how to deal with the tags.

"The correct title of this specification, and the correct full name of XML, is "Extensible Markup Language". "eXtensible Markup Language" is just a spelling error. However, the abbreviation "XML" is not only correct but, appearing as it does in the title of the specification, an official name of the Extensible Markup Language.

The name and abbreviation were invented by James Clark; other options under consideration had included MGML, (Minimal Generalized Markup Language), MAGMA (Minimal Architecture For Generalized Markup Applications), and SLIM (Structured Language for Internet Markup)" - Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.

However, XML is far more powerful than HTML.

This is because of the "X". XML is "eXtensible". Specifically, rather than providing a set of pre-defined tags, as in the case of HTML, XML specifies the standards with which you can define your own markup languages with their own sets of tags. XML is a meta-markup language which allows you to define an infinite number of markup languages based upon the standards defined by XML.

"The design goals for XML are:

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance."
- Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.

Let's consider a very simple example. Let's create a new markup language called SCLML (Selena's Client List Markup Language). This language will define tags to represent contact people and information about contact people.

The set of tags will be simple. However, they will be expressive. Unlike <UL> and <LI>, the SCLML tags can be understood immediately just by reading the document.

<CONTACT>
<NAME>Gunther Birznieks</NAME>
<ID>001</ID>
<COMPANY>Bob's Fish Store</COMPANY>
<EMAIL>gunther@bobsfishstore.com</EMAIL>
<PHONE>662-9999</PHONE>
<STREET>1234 4th St.</STREET>
<CITY>New York</CITY>
<STATE>New York</STATE>
<ZIP>10024</ZIP>
</CONTACT>

<CONTACT>
<NAME>Susan Czigany</NAME>
<ID>002</ID>
<COMPANY>Netscape</COMPANY>
<EMAIL>susan@eudora.org</EMAIL>
<PHONE>555-1234</PHONE>
<STREET>9876 Hazen Blvd.</STREET>
<CITY>San Jose</CITY>
<STATE>California</STATE>
<ZIP>90034</ZIP>
</CONTACT>

Note that the use of XML is not limited to text markup. The very extensibility of XML means that it could just as easily be applied to sound markup or image markup. A tag such as <EMPHASIZE> might be displayed textually as being bold but audibly as a louder voice!

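What you see above is the body of a very simple "XML document" (a complete, well-formed document would wrap both contacts in a single enclosing root element). As you can see, it looks pretty similar to an HTML document.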
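To get a feel for why descriptive tags matter to a program, here is a minimal Python sketch that reads one SCLML contact with the standard library's xml.etree.ElementTree module. The variable names are assumptions made for illustration, and the record is shortened from the listing above.

```python
import xml.etree.ElementTree as ET

# One contact, shortened from the SCLML listing above.
SCLML = """<CONTACT>
<NAME>Gunther Birznieks</NAME>
<ID>001</ID>
<COMPANY>Bob's Fish Store</COMPANY>
<EMAIL>gunther@bobsfishstore.com</EMAIL>
<PHONE>662-9999</PHONE>
</CONTACT>"""

# Parse the text into an element tree; the result is the <CONTACT> element.
contact = ET.fromstring(SCLML)

# Each child tag names the piece of data it carries,
# so the tag itself can serve as the lookup key.
record = {child.tag: child.text for child in contact}

print(record["NAME"], record["EMAIL"])
```

Because each tag names its own content, the program can ask for the EMAIL or the PHONE directly instead of guessing which list item holds which value.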

But don't forget, as we said before, it is not enough to simply encode (markup) the data. For the data to be decoded by someone or something else, the encoding markup languages must follow standard rules including:

  1. The syntax for marking up
  2. The meaning behind the markup

In other words, a processing application must know what a valid markup is (perhaps a tag) and what to do with it if it is valid. After all, how would Netscape know what to do with the above document? What in the world is a <PHONE> tag? Is it a legal tag? How should it be displayed? Our markup language must somehow communicate the syntax of the markup so that the processing application will know what to do with it.

In XML, the definition of a valid markup is handled by a Document Type Definition (DTD) which communicates the structure of the markup language. The DTD specifies what it means to be a valid tag (the syntax for marking up).

We'll discuss the details of DTDs later. For now, just get comfortable with the idea of a DTD as a separate component to the equation.
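Until then, the idea of "checking a document against a list of legal tags" can be approximated in a few lines of Python. This is not DTD validation: the sketch swaps in a hand-written set of tag names, and both ALLOWED_CHILDREN and unknown_tags are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for a DTD: the tag names used inside <CONTACT> in the SCLML example.
ALLOWED_CHILDREN = {
    "NAME", "ID", "COMPANY", "EMAIL", "PHONE",
    "STREET", "CITY", "STATE", "ZIP",
}

def unknown_tags(contact_xml):
    """Return the tag names in one contact record that SCLML does not define."""
    root = ET.fromstring(contact_xml)
    problems = []
    # The outer element must be <CONTACT>.
    if root.tag != "CONTACT":
        problems.append(root.tag)
    # Every direct child must be one of the known tags.
    problems.extend(
        child.tag for child in root if child.tag not in ALLOWED_CHILDREN
    )
    return problems

print(unknown_tags("<CONTACT><NAME>A</NAME><PHONE>1</PHONE></CONTACT>"))
print(unknown_tags("<CONTACT><NAME>A</NAME><FAX>1</FAX></CONTACT>"))
```

A real DTD expresses much more than a flat list of names (ordering and nesting, for example), but the role is the same: it tells the processing application which markup is legal.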

Yet we must communicate the meaning of the markup as well as its syntax.

To specify what valid tags mean, XML documents are also associated with "style sheets" which provide GUI instructions for a processing application like a web browser. A style sheet, the details of which we will discuss later, might specify display instructions such as:

  1. Anytime you see a <CONTACT> tag, display it using a <UL> tag. Similarly, </CONTACT> tags should be converted to </UL> tags.
  2. All <NAME> tags should be replaced with <LI> tags, and </NAME> tags should be ignored.
  3. All <EMAIL> tags should be replaced with <LI> tags, and </EMAIL> tags should be ignored.
etc.
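The substitution rules above can be sketched as a small Python function. This is only an illustration of the idea, not how a real style sheet language works; the RULES table and the render function are hypothetical names, and only the three tags named in the rules are mapped.

```python
import html
import xml.etree.ElementTree as ET

# Each SCLML tag maps to the HTML that replaces its opening and closing tag.
# An empty closing string means the closing tag is simply ignored.
RULES = {
    "CONTACT": ("<UL>", "</UL>"),
    "NAME": ("<LI>", ""),
    "EMAIL": ("<LI>", ""),
}

def render(element):
    """Translate one SCLML element (and its children) into HTML text."""
    # Tags without a rule contribute their text with no HTML markup.
    open_html, close_html = RULES.get(element.tag, ("", ""))
    text = html.escape((element.text or "").strip(), quote=False)
    parts = [open_html + text]
    parts.extend(render(child) for child in element)
    if close_html:
        parts.append(close_html)
    return "\n".join(parts)

contact = ET.fromstring(
    "<CONTACT><NAME>Susan Czigany</NAME>"
    "<EMAIL>susan@eudora.org</EMAIL></CONTACT>"
)
print(render(contact))
```

Keeping the rules in a table separate from the data mirrors the split described here: the SCLML document holds the data, and the style sheet decides how it is shown.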

In this example, the style sheet utilizes the functionality of HTML to define the formatting of SCLML. But if the XML document were being processed by a program other than a web browser, the HTML translation step might be bypassed.

Processing applications combine the logic of the style sheet, the DTD, and the data of the SCLML document and display it according to the rules and the data.

But wait, isn't this quite complex? Now instead of a single HTML document which defines the data and the rules to display the data, we have an SCLML document, a DTD, AND a style sheet. That's three pieces as opposed to just one.

Further, we need a processing agent that can do the work of putting the DTD, style sheet, and SCLML document together. Remember, web browsers are made to read a specific markup language (like HTML), not any markup language. That means we have three documents to pull together plus one processing program to write or buy. What a mess.

"A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application." - Extensible Markup Language (XML) 1.0 Specs, The Annotated Version.
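The division of labor in that quote can be sketched with Python's standard xml.sax module: the parser plays the part of the XML processor, and a handler class plays the part of the application. The NameCollector class and the <CLIENTS> wrapper element are assumptions added for this illustration; the wrapper is there because a parser expects a single outermost element.

```python
import xml.sax

# Two shortened contacts inside a hypothetical <CLIENTS> root element.
DOCUMENT = b"""<CLIENTS>
<CONTACT><NAME>Gunther Birznieks</NAME><ID>001</ID></CONTACT>
<CONTACT><NAME>Susan Czigany</NAME><ID>002</ID></CONTACT>
</CLIENTS>"""

class NameCollector(xml.sax.ContentHandler):
    """The 'application': it only cares about the text inside <NAME> tags."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "NAME":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it until </NAME>.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "NAME":
            self.names.append("".join(self._buffer))
            self._in_name = False

handler = NameCollector()
# The 'XML processor': it reads the markup and reports what it finds.
xml.sax.parseString(DOCUMENT, handler)
print(handler.names)
```

The handler never touches angle brackets itself. It is told about elements and text, which is the access to "content and structure" that the specification describes.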

However, though there are a few more hurdles to jump in order to use XML, there are several reasons why all this is worth it. Let's take a look at them...


Publication Date: Tuesday 19th August, 2003
Author: Selena Sol
