XML - What is it?

XML (eXtensible Markup Language) is a markup language developed by the W3C (World Wide Web Consortium) to address certain limitations of HTML (Hypertext Markup Language). XML uses a tag system similar to that of HTML. It is important to understand that XML focuses on the structure and content of a document rather than on its presentation. This enables the same document to be used for many different kinds of presentation, such as web, print, and audio. Whereas HTML has a fixed set of tags that are not content specific, XML allows developers to create their own set of tags specific to a discipline or enterprise. This makes it possible for a group to settle on a set of tags that has meaning within the discipline rather than using the general-purpose tags provided in HTML. HTML documents cannot readily be processed based on their content, whereas content-based processing is an important feature of XML documents.


In his book XML By Example (QUE, 2000), Marchal says, "If I had to summarize XML in one sentence, it would be something like 'a set of standards to exchange and publish information in a structured manner.' The emphasis on structure cannot be underestimated."

XML Books Used for Project

The following is a list of books on XML that were used heavily with this project. Many ideas, products referenced, etc. came from these sources.

Primary Web Link for XML

The primary web link for XML is http://www.xml.org. Almost all of the web sites and software products used in this project were found by beginning at this site.

Differences Between HTML and XML

While HTML is sure to be around for some time, it has several drawbacks that XML has been designed to overcome. Among these are the following:

Tag Set Size

Because HTML's tag set is large, browser software must be complicated in order to handle all of the different types of tags. The same is true of any other software used to process HTML documents.

Tag Set Extensibility

Although HTML has a large set of built-in tags, the tag set is limited in the sense that authors in various disciplines cannot mark up their documents with tags specific to the subjects at hand. Thus an HTML document can express the structure and semantics of the document only in the most general terms, such as paragraphs, lists, and tables.

With XML, on the other hand, the document can be marked up using tags that an author or group of authors defines for their specific needs. In effect, a group can develop its own markup language within XML. For example, there are such languages for marking up mathematics documents, chemistry documents, and multimedia presentations. These subject-specific languages enable the development of software that processes such documents based on the content and structure expected for information within the given discipline.
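
As a rough illustration of the idea, the sketch below marks up a recipe with a small invented vocabulary and reads it back using the ElementTree parser from Python's standard library. The tag names (recipe, ingredient, step) and the attribute names are assumptions made up for this example; they do not come from any published markup language.

```python
import xml.etree.ElementTree as ET

# A hypothetical recipe vocabulary; every tag and attribute name here
# is invented for illustration only.
RECIPE_XML = """\
<recipe>
  <title>Pancakes</title>
  <ingredient quantity="2" unit="cup">flour</ingredient>
  <ingredient quantity="1" unit="cup">milk</ingredient>
  <step>Mix the ingredients.</step>
  <step>Cook on a hot griddle.</step>
</recipe>
"""


def list_ingredients(xml_text):
    """Return (name, quantity, unit) tuples for each <ingredient> element."""
    root = ET.fromstring(xml_text)
    return [
        (item.text, item.get("quantity"), item.get("unit"))
        for item in root.findall("ingredient")
    ]


for name, quantity, unit in list_ingredients(RECIPE_XML):
    print(quantity, unit, name)
```

Because the tags name what each piece of text is, the program can pick out the ingredients directly instead of guessing from generic paragraphs or list items.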

Rigor

For the most part, browsers do not report errors in HTML syntax. Rather, the browser attempts to display the document with the errors present or else ignores the offending parts of the document. This often results in inconsistent presentation across browsers and a large percentage of poorly constructed web pages.

XML documents must adhere to a strict syntax, or else the software parsing the document reports an error. There are actually two levels of verification possible. The first is verification that the document obeys the general syntax rules of XML (such a document is called well-formed). A further possible level is verification that the document satisfies the rules of a group's own markup language within XML (such a document is called valid).
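
The first level of verification can be sketched with Python's standard ElementTree parser, which raises an error when a document breaks the general syntax rules. This is only a minimal sketch: the helper name is_well_formed and the sample documents are invented here, and ElementTree is a non-validating parser, so the second level (checking a document against a group's own rules) would need a separate validating tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text obeys the general syntax rules of XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser refuses to guess at the author's intent.
        return False
    return True


# Hypothetical sample documents.
GOOD = "<note><to>Ann</to><body>Hello</body></note>"
BAD = "<note><to>Ann<body>Hello</body></note>"  # <to> is never closed

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Unlike a browser displaying faulty HTML, the parser rejects the second document outright rather than trying to repair it.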

The fact that XML is a standard markup language that is rigorously enforced enables the development of a wide range of applications for processing and translating XML documents.

Content Vs. Presentation

Much HTML markup relates to the presentation of the document on a web page rather than to its content and structure. One major drawback is that a minor change in the content of the document may require tedious changes in the presentation markup. It also means that the document is limited to the one specific presentation expressed in this markup.

XML focuses on the structure and content of the document. The presentation of the document is left to other mechanisms. These may be style sheets for web presentations (possibly several web looks for different purposes) or translations to other forms for print, audio, or database purposes.

Range of Applications

HTML is primarily for web page presentations. Since HTML does not express much about the document's content, other software cannot process the document based on specific content. For example, HTML would not make clear whether a number was a price, a conversion rate, a temperature, or something else. Similarly, HTML does not express much about the structure of the document, so it is not possible to process the document based on structure.

Since XML focuses on content and structure, it enables processing based on either the structure of the document or its content data. Thus it is relatively easy to convert between database content and XML data, to process the XML document as data (content-based searches), to process the document based on structure (creating indexes, tables of contents, etc.), and to present the data in various formats based on content and application.
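
Both kinds of processing can be sketched in a few lines of Python with the standard ElementTree module. The catalog vocabulary below (catalog, section, item, title, price) and the two function names are assumptions invented for this example. The first function works on content (it knows which numbers are prices), and the second works on structure (it builds a simple table of contents from the sections).

```python
import xml.etree.ElementTree as ET

# A hypothetical catalog; the tag names are invented for illustration.
CATALOG_XML = """\
<catalog>
  <section name="Books">
    <item><title>XML Primer</title><price currency="USD">20.00</price></item>
    <item><title>Markup Basics</title><price currency="USD">15.50</price></item>
  </section>
  <section name="Software">
    <item><title>Tag Editor</title><price currency="USD">49.00</price></item>
  </section>
</catalog>
"""

root = ET.fromstring(CATALOG_XML)


def total_price(catalog):
    """Content-based processing: add up every number marked as a price."""
    return sum(float(price.text) for price in catalog.iter("price"))


def table_of_contents(catalog):
    """Structure-based processing: list the item titles under each section."""
    return [
        (section.get("name"), [t.text for t in section.findall("item/title")])
        for section in catalog.findall("section")
    ]


print(total_price(root))
for name, titles in table_of_contents(root):
    print(name, titles)
```

Neither function would be practical against typical HTML, where the same numbers and titles would appear only as undifferentiated table cells or paragraphs.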

Browser Support

At the time of this writing, HTML is supported by all major browsers. However, browser support for XML is quite limited. Netscape does not support XML at all, but the next version of Netscape should provide extensive support for XML. Currently, Internet Explorer provides moderate support for XML.

One popular procedure for dealing with XML is to use a translator to convert the XML document into HTML for web presentations.
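
A very small translator of this kind might look like the following Python sketch. It is not the tool used in this project; the article vocabulary (article, headline, para) and the function name article_to_html are assumptions made up for the example, and a real translator would typically be driven by a style sheet rather than hard-coded rules.

```python
import xml.etree.ElementTree as ET
from html import escape

# A hypothetical article vocabulary, invented for illustration.
ARTICLE_XML = """\
<article>
  <headline>XML &amp; HTML</headline>
  <para>XML describes content.</para>
  <para>HTML describes presentation.</para>
</article>
"""


def article_to_html(xml_text):
    """Translate the hypothetical <article> vocabulary into an HTML page."""
    root = ET.fromstring(xml_text)
    headline = root.findtext("headline", default="")
    lines = ["<html>", "<body>"]
    # Each content tag is mapped to a presentation tag; text is re-escaped
    # so that characters such as & remain legal in the HTML output.
    lines.append("<h1>" + escape(headline) + "</h1>")
    for para in root.findall("para"):
        lines.append("<p>" + escape(para.text or "") + "</p>")
    lines += ["</body>", "</html>"]
    return "\n".join(lines)


print(article_to_html(ARTICLE_XML))
```

The XML source stays free of presentation markup; changing the look of the page means changing only the translator.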

Editor Support

Currently, there are many excellent web page development packages to facilitate the development of HTML documents for web presentation. In fact, many people develop attractive web pages without knowing much at all about HTML.

At this point, the same kind of development packages are not widely available for XML. In fact, it is not as clear what one should expect along these lines. Since XML markup is not intended for display purposes, there is no obvious way for editor software to show the content. One approach might be to base the display on the structure. Another possibility would be to use something like a style sheet suited to development purposes.