Introduction to XML for Web Developers (Part 2 of 8)
Advantages of XML: Breaking the Tag Monopoly
The first benefit of XML is that because you are writing your own markup language, you are not restricted to a limited set of tags defined by proprietary vendors.
Rather than waiting for standards bodies to adopt tag set enhancements (a process which can take quite some time), or for browser companies to adopt each other's standards (yeah right!), with XML, you can create your own set of tags at your own pace.
Of course, not only are you free to develop at your own pace, but you are free to develop tools that meet your needs exactly.
By defining your own tags, you create the markup language in terms of your specific problem set! Rather than relying on a generic set of tags which suits everyone's needs adequately, XML allows every person/organization to build their own tag library which suits their needs perfectly.
"From the earliest days of the Web, we've been using essentially the same set of tags in our documents....There's a significant benefit to a fixed tag set with fixed semantics: portability. However, HTML is very confining. Web designers want more control over presentation. Enter XML" - Norman Walsh
That is, though the majority of web designers do not need tags to format musical notation, medical formulas, or architectural specifications, musicians, doctors and architects might.
XML allows each specific industry to develop its own tag sets to meet its unique needs without forcing everyone's browser to incorporate the functionality of zillions of tag sets, and without forcing developers to settle for a one-size-fits-all tag set that is too generic to be useful.
Check out these customized XML-based languages:
- Chemical Markup Language: A Simple introduction to Structured Documents by Peter Murray-Rust
- HTML-Math Mathematical Markup Language Working Draft by Robert R. Miner, Patrick D.F. Ion
Advantages of XML: Moving Beyond Format
However cool the idea of escaping the limitations of a basic tag set (like HTML) sounds, it isn't even close to the best thing about XML.
The real power of XML comes from the fact that with XML, not only can you define your own set of tags, but the rules specified by those tags need not be limited to formatting rules. XML allows you to define all sorts of tags with all sorts of rules, such as tags representing business rules or tags representing data description or data relationships.
Consider again the case of the contact list in SCLML. Using standard HTML, a developer might use something like the following:
<UL>
  <LI>Gunther Birznieks
    <UL>
      <LI>Client ID: 001
      <LI>Company: Bob's Fish Store
      <LI>Email: email@example.com
      <LI>Phone: 662-9999
      <LI>Street Address: 1234 4th St.
      <LI>City: New York
      <LI>State: New York
      <LI>Zip: 10024
    </UL>
  <LI>Susan Czigany
    <UL>
      <LI>Client ID: 002
      <LI>Company: Netscape
      <LI>Email: firstname.lastname@example.org
      <LI>Phone: 555-1234
      <LI>Street Address: 9876 Hazen Blvd.
      <LI>City: San Jose
      <LI>State: California
      <LI>Zip: 90034
    </UL>
</UL>
While this may be an acceptable way to store and display your data, it is hardly the most efficient or powerful. As you are probably aware, there are many potential problems associated with marking up your data using HTML. Three particularly serious problems come to mind:
- The GUI is embedded in the data. What happens if you decide that you like a table-based presentation better than a list-based presentation? In order to change to a table-based presentation, you must recode all your HTML! This could mean editing many pages.
- Searching for information in the data is tough. How would you get a quick list of only the clients in California? Certainly, some type of script would be necessary. But how would that script work? It would probably have to search through the file word for word looking for the string "California". And even if it found matches, it would have no way of knowing that California might have a relationship to "New York" - that they are both states. Forget about the relationships between pieces of data which are crucial to power searching.
- The data is tied to the logic and language of HTML. What happens if you want to present your data in a Java applet? Well, unfortunately, your Java applet would have to parse through the HTML document, stripping out tags and reformatting the data. Non-HTML processing applications should not be burdened with extraneous work.
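To make the searching problem concrete, here is a minimal sketch in Python of what a search script is reduced to when the data lives in HTML list items. It is only an illustration: the HTML is abbreviated from the example above, and the variable names are made up for this sketch.

```python
# Sketch only: a plain-text search over the HTML contact list.
# The HTML is abbreviated from the example above; the variable names are
# illustrative assumptions, not part of the original tutorial.
html = """<UL>
<LI>Gunther Birznieks
<UL>
<LI>Company: Bob's Fish Store
<LI>City: New York
<LI>State: New York
</UL>
<LI>Susan Czigany
<UL>
<LI>Company: Netscape
<LI>City: San Jose
<LI>State: California
</UL>
</UL>"""

# All the script can do is look for lines containing the string it cares about.
matches = [line for line in html.splitlines() if "New York" in line]

# Both the City line and the State line match: the script cannot tell a
# city from a state, and it has no idea which client the lines belong to.
print(matches)
```

The markup says only "this is a list item", so any notion of "state" or "client" has to be guessed from the text itself.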
With XML, these problems and similar problems are solved. In XML, the same page would look like the following:
<CLIENT>
  <NAME>Gunther Birznieks</NAME>
  <ID>001</ID>
  <COMPANY>Bob's Fish Store</COMPANY>
  <EMAIL>email@example.com</EMAIL>
  <PHONE>662-9999</PHONE>
  <STREET>1234 4th St.</STREET>
  <CITY>New York</CITY>
  <STATE>New York</STATE>
  <ZIP>10024</ZIP>
</CLIENT>
<CLIENT>
  <NAME>Susan Czigany</NAME>
  <ID>002</ID>
  <COMPANY>Netscape</COMPANY>
  <EMAIL>firstname.lastname@example.org</EMAIL>
  <PHONE>555-1234</PHONE>
  <STREET>9876 Hazen Blvd.</STREET>
  <CITY>San Jose</CITY>
  <STATE>California</STATE>
  <ZIP>90034</ZIP>
</CLIENT>
As you can see, custom tags are used to bring meaning to the data being displayed. When stored this way, data becomes extremely portable because it carries with it its description rather than its display. Display is "extracted" from the data and, as we will see later, incorporated into a "style sheet".
Let's consider some of the benefits.
- With XML, the GUI is extracted. Thus, changes to display do not require futzing with the data. Instead, a separate style sheet will specify a table display or a list display.
- Searching the data is easy and efficient. Search engines can simply parse the description-bearing tags rather than muddling in the data. Tags provide the search engines with the intelligence they lack.
- Complex relationships like trees and inheritance can be communicated.
- The code is much more legible to a person coming into the environment with no prior knowledge. In the above example, it is obvious that <ID>002</ID> represents an ID whereas <LI>002 might not. XML is self-describing.
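As a rough illustration of the searching benefit, the sketch below uses Python's standard xml.etree.ElementTree module to list the clients in California. It rests on a few assumptions: the records are abbreviated, and the enclosing <CLIENTS> element is added here only because a parser needs a single root element, which the fragment above does not show.

```python
import xml.etree.ElementTree as ET

# The two CLIENT records from the example, abbreviated. The enclosing
# <CLIENTS> element is an assumption added for this sketch: a parser needs
# a single root element, and the fragment above does not show one.
xml_text = """<CLIENTS>
  <CLIENT>
    <NAME>Gunther Birznieks</NAME>
    <CITY>New York</CITY>
    <STATE>New York</STATE>
  </CLIENT>
  <CLIENT>
    <NAME>Susan Czigany</NAME>
    <CITY>San Jose</CITY>
    <STATE>California</STATE>
  </CLIENT>
</CLIENTS>"""

root = ET.fromstring(xml_text)

# Ask for clients by the meaning of a field (its tag), not by raw text.
in_california = [
    client.findtext("NAME")
    for client in root.findall("CLIENT")
    if client.findtext("STATE") == "California"
]
print(in_california)
```

Because the query names the STATE element explicitly, a client whose city happens to be called "California" would not be matched by mistake.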
Disadvantages of XML
However awesome XML is, there are some drawbacks which have hindered it from gaining widespread use since its inception. Let's look at the biggest drawback: the lack of adequate processing applications.
For one, XML requires a processing application. That is, the nice thing about HTML was that you knew that if you wrote an HTML document, anyone, anywhere in the world, could read your document using Netscape. Well, with XML documents, that is not yet the case. There are no XML browsers on the market yet (although the latest version of IE does a pretty good job of incorporating XSL and XML documents provided HTML is the output).
Thus, XML documents must either be converted into HTML before distribution or be converted to HTML on the fly by middleware. Barring translation, developers must code their own processing applications.
The most common tactic used now is to write parsing routines in DHTML, Java, or server-side Perl that parse an XML document, apply the formatting rules specified by the style sheet, and "convert" it all to HTML.
"While it's true that browser support is limited, IE 5 and Netscape 5 are expected to fully support XML. Also, W3C's Amaya browser supports it today, as does the JUMBO browser that was created for the Chemical Markup Language.
XML isn't about display -- it's about structure. This has implications that make the browser question secondary. So the whole issue of what is to be displayed and by what means is intentionally left to other applications. You can target the same XML (with different XSL) for different devices (standard web browser, palm pilot, printer, etc.). You should not get the impression that XML is useless until browsers support it. This is definitely not true -- we are using it at NASA in ways where no browser plays any role." - Ken Sall
However, this takes some magic, and the amount of work necessary even to print "hello world" is sometimes enough to dissuade developers from adopting the technology.
Nevertheless, parsing algorithms and tools continue to improve over time as more and more people see the long-term benefits of migrating their data to XML. The backend part of XML will continue to become simpler and simpler. Already, Internet Explorer and Netscape provide a decent number of built-in XML parsing tools.
History of XML
XML emerged as a way to overcome the shortcomings of its two predecessors, SGML and HTML, which were both very successful markup languages but were both flawed in certain ways.
SGML, the international standard for marking up data, has been used since the 80s. SGML is an extremely powerful and extensible tool for semantic markup which is particularly useful for cataloging and indexing data. Like XML, SGML can be used to create an infinite number of markup languages and has a host of other resources as well.
However, SGML is pretty darn complex, especially for the everyday uses of the web. Not only that, but SGML is pretty expensive. Adding SGML capability to a word processor could double or triple the price. Finally, the commercial browsers made it pretty clear that they did not intend to ever support SGML.
HTML, on the other hand, was free, simple and widely supported. HTML was originally designed at CERN around 1990 to provide a very simple version of SGML which could be used by "regular" people. As everyone knows, HTML spread like wildfire.
Unfortunately, HTML had serious defects that we discussed earlier.
So in 1996, discussions began which focused on how to define a markup language with the power and extensibility of SGML but with the simplicity of HTML. The World Wide Web Consortium (W3C) decided to sponsor a group of SGML gurus including Jon Bosak from Sun.
Essentially, Bosak and his team did to SGML what the Java team had done to C++. All of the non-essential, unused, cryptic parts of SGML were sliced away. What remained was a lean, mean marking up machine: XML. The specification of XML (written mostly by Tim Bray and C.M. Sperberg-McQueen) was only 26 pages as opposed to the 500+ pages of the SGML specification! Nevertheless, all the useful things which could be done by SGML could also be done with XML.
Over the next few years, XML evolved, drawing from the work of its sponsors and the work of developers solving similar problems, such as Peter Murray-Rust, who had been working on CML (Chemical Markup Language), and the consortium of folks working on MathML. By mid-1997 the eXtensible Linking Language (XLL) project was underway, and by the summer of 1997, Microsoft had launched the Channel Definition Format (CDF) as one of the first real-world applications of XML.
Finally, in 1998, the W3C approved Version 1.0 of the XML specification and a new language was born.
Okay, you are probably beginning to get a little bit dizzy with all of this theoretical stuff. If you are like me, by now you are quite ready to sink your teeth into the meat of XML.
So to conclude this section, we will run through a very simple XML example so that you can see how it all fits together. We'll keep it simple, of course. In fact, we will be sloppy and not include a DTD (for simplicity). But in the next few sections we will give you all the tools to start doing more advanced work.
Let's return to our contact document:
<?xml version = "1.0" encoding="UTF-8" standalone = "yes"?>
<DOCUMENT>
  <CONTACT>
    <NAME>Gunther Birznieks</NAME>
    <EMAIL>email@example.com</EMAIL>
    <PHONE>662-9999</PHONE>
  </CONTACT>
  <CONTACT>
    <NAME>Susan Czigany</NAME>
    <EMAIL>firstname.lastname@example.org</EMAIL>
    <PHONE>555-1234</PHONE>
  </CONTACT>
</DOCUMENT>
You could copy this text using your favorite word processor and save it as plain text, naming it something like contacts.xml or test.xml.
Notice that the first line is something we have not seen before. This line is the XML declaration, which is written with the same syntax as a processing instruction. We will talk much more about processing instructions and their attributes later. For now, just know that an XML document should begin with this line, much like HTML documents begin with <HTML>.
Other than that, you see a set of opening and closing tags with data (together, the tags and data are called XML Elements).
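To see how a program views these elements, here is a small sketch using Python's standard xml.etree.ElementTree module. It is an illustration under assumptions rather than part of the tutorial's toolchain: the variable names are made up, and the "broken" string is a deliberately damaged copy of one contact with its closing </NAME> tag left out.

```python
import xml.etree.ElementTree as ET

# The contact document from above, held in a string for this sketch.
contacts_xml = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DOCUMENT>
  <CONTACT>
    <NAME>Gunther Birznieks</NAME>
    <EMAIL>email@example.com</EMAIL>
    <PHONE>662-9999</PHONE>
  </CONTACT>
  <CONTACT>
    <NAME>Susan Czigany</NAME>
    <EMAIL>firstname.lastname@example.org</EMAIL>
    <PHONE>555-1234</PHONE>
  </CONTACT>
</DOCUMENT>"""

# Parse from bytes so the parser honors the declared UTF-8 encoding.
root = ET.fromstring(contacts_xml.encode("utf-8"))

# Each CONTACT element becomes a mapping from tag name to its text data.
records = [
    {child.tag: child.text for child in contact}
    for contact in root.findall("CONTACT")
]
for record in records:
    print(record)

# A deliberately broken copy: the closing </NAME> tag is missing.
broken = "<CONTACT><NAME>Gunther Birznieks<EMAIL>email@example.com</EMAIL></CONTACT>"
try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError as err:
    # The parser refuses the whole document instead of guessing.
    well_formed = False
    print("rejected:", err)
```

The first part shows that opening tag, data, and closing tag travel together as one element; the second shows that a parser stops with an error when that pairing is broken.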
Okay, the next thing you will need to do is associate stylistic meanings with the tags so that a browser can display the document. As we said before, since XML allows you to create your own sets of tags, you must also create your own style guidelines which a browser can use to interpret your tags (which it has never seen before).
Because they are extracted from the data, style sheets can be shared by any number of XML documents. Also, they can be written in a number of style languages such as Cascading Style Sheet Language (CSS) or eXtensible Style Language (XSL). In this example, we will use XSL.
Let's take a look at a style sheet in XSL for our contacts.xml document.
"Unlike HTML, XML is VERY precise. If the syntax isn't exactly right, the parser will stop processing it and nothing (except an error message) will be displayed. For example, the processing instruction is absolutely required fro all XML documents. In contrast, most browsers will accept a missing <HTML> tag at the beginning of an HTML document. This is because browsers have built-in "recovery" code to guess what's missing and to recover from invalid HTML. XML parsers, whether embedded in browsers or as standalone processors, are explicitly not allowed to recover. Much like compiling a program, an XML file is either correct, or it's toast. If this seems arbitrary, consider that XML is about transmitting structured data using tags that are usually non-standard. Parsers can't guess what's missing the way they can with HTML." - Ken Sall
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:template pattern = "DOCUMENT">
    <HTML>
      <HEAD>
        <TITLE>Contacts</TITLE>
      </HEAD>
      <BODY>
        <xsl:process-children/>
      </BODY>
    </HTML>
  </xsl:template>
  <xsl:template pattern = "CONTACT">
    <UL>
      <xsl:process-children/>
    </UL>
  </xsl:template>
  <xsl:template pattern = "NAME">
    <LI>
      <xsl:process-children/>
    </LI>
  </xsl:template>
  <xsl:template pattern = "PHONE">
    <LI>
      <xsl:process-children/>
    </LI>
  </xsl:template>
  <xsl:template pattern = "EMAIL">
    <LI>
      <xsl:process-children/>
    </LI>
  </xsl:template>
</xsl:stylesheet>
Putting It All Together
Once you have defined your XML and XSL documents, you can run them through a processor and display them. We talk a lot more about how you do this in later sections. For now, we only show you what the final converted document will look like:
<HTML>
  <HEAD>
    <TITLE>Contacts</TITLE>
  </HEAD>
  <BODY>
    <UL>
      <LI>Gunther Birznieks</LI>
      <LI>email@example.com</LI>
      <LI>662-9999</LI>
    </UL>
    <UL>
      <LI>Susan Czigany</LI>
      <LI>firstname.lastname@example.org</LI>
      <LI>555-1234</LI>
    </UL>
  </BODY>
</HTML>
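If you want a feel for what such a conversion involves before we get to real XSL processors, the Python sketch below produces equivalent HTML by hand. To be clear about the assumptions: it is not an XSL processor and does not read the style sheet; it simply hard-codes the same rules (DOCUMENT becomes the page, each CONTACT becomes a UL, each child becomes an LI), and the function names are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The contact document from above (declaration omitted for brevity).
contacts_xml = """<DOCUMENT>
  <CONTACT>
    <NAME>Gunther Birznieks</NAME>
    <EMAIL>email@example.com</EMAIL>
    <PHONE>662-9999</PHONE>
  </CONTACT>
  <CONTACT>
    <NAME>Susan Czigany</NAME>
    <EMAIL>firstname.lastname@example.org</EMAIL>
    <PHONE>555-1234</PHONE>
  </CONTACT>
</DOCUMENT>"""


def contact_to_html(contact):
    """Hypothetical helper: one CONTACT becomes a UL, each child an LI."""
    items = "".join(f"<LI>{escape(child.text or '')}</LI>" for child in contact)
    return f"<UL>{items}</UL>"


def document_to_html(xml_text):
    """Hypothetical helper: wrap every contact list in a fixed HTML page."""
    root = ET.fromstring(xml_text)
    lists = "".join(contact_to_html(c) for c in root.findall("CONTACT"))
    return (
        "<HTML><HEAD><TITLE>Contacts</TITLE></HEAD>"
        f"<BODY>{lists}</BODY></HTML>"
    )


html = document_to_html(contacts_xml)
print(html)
```

Switching to a table-based display would mean changing only these two small functions, never the data, which is the same separation a style sheet gives you in a more declarative form.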
In the next couple of sections we will delve more deeply into the syntax of the XML document, the XSL document and the DTD.