Monday, August 08, 2005

The problem with DTDs and schemas

Many would agree with me that XML is too bulky for its intended purposes. But would you say the same about HTML? Probably not.

And why should that be? Isn't XML a glorified, genericized HTML? Despite what the historians say about both being children of SGML, the fact of the matter is that XML only came into being after the spectacular success of HTML. If you remember the original marketing hype, it was all about how web documents would be XML in the future and how they would be rendered or used differently by other consumers.

One of the important factors in the failure of XML is its reliance on DTDs and schemas. HTML succeeded simply because there was no predefined format for any given web page. Browsers ignored what they did not understand, and did the best they could with the rest. HTML was a language that browsers and servers spoke. There was a syntax, but no schema. The schema came later, long after HTML was refined through multi-year, multi-user testing.

Your typical enterprise application does not have that luxury.

Consider this snippet of XML:


<orders>
  <date>Dec-24</date>
  <order>
    <customer>Joe Schlubbs</customer>
    <customer-address>10 drowning st</customer-address>
    <customer-city>London</customer-city>
    <order-line>
      <quantity>100</quantity>
      <item-number>J-350</item-number>
      <item-name>extra large onions</item-name>
      <price-per-item>12.35</price-per-item>
      <!-- other fields -->
    </order-line>
  </order>
  <order>
    <customer>Joe Schlubbs</customer>
    <customer-address>10 drowning st</customer-address>
    <customer-city>London</customer-city>
    <order-line>
      <quantity>100</quantity>
      <item-number>J-007</item-number>
      <item-name>extra large bullets</item-name>
      <price-per-item>1.99</price-per-item>
    </order-line>
  </order>
</orders>

Let's assume we are writing an application that needs a list of all customers who spent more than a thousand bucks in the last month, so we can mail them some coupons.

An SQL programmer would simply write something like SELECT customer, SUM(price * qty) FROM orders GROUP BY customer HAVING SUM(price * qty) > 1000. The typical XML programmer, on the other hand, has to fetch all orders for the last month and skim through large amounts of text to produce the same result. Depending on the number of tags, close to 90 percent of the input will be useless to the application.
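To make the contrast concrete, here is a rough sketch of what the XML side of that job might look like in Python, using only the standard library. The function name big_spenders and the threshold parameter are made up for illustration, and the sample document is the snippet from above; a real feed would also need filtering by date, which is left out here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from decimal import Decimal

# The sample document from the post, embedded so the sketch is self-contained.
ORDERS_XML = """<orders>
  <date>Dec-24</date>
  <order>
    <customer>Joe Schlubbs</customer>
    <customer-address>10 drowning st</customer-address>
    <customer-city>London</customer-city>
    <order-line>
      <quantity>100</quantity>
      <item-number>J-350</item-number>
      <item-name>extra large onions</item-name>
      <price-per-item>12.35</price-per-item>
    </order-line>
  </order>
  <order>
    <customer>Joe Schlubbs</customer>
    <customer-address>10 drowning st</customer-address>
    <customer-city>London</customer-city>
    <order-line>
      <quantity>100</quantity>
      <item-number>J-007</item-number>
      <item-name>extra large bullets</item-name>
      <price-per-item>1.99</price-per-item>
    </order-line>
  </order>
</orders>"""


def big_spenders(xml_text, threshold=Decimal("1000")):
    """Hypothetical helper: total spend per customer, keeping totals above threshold."""
    totals = defaultdict(Decimal)  # a missing customer starts at Decimal("0")
    root = ET.fromstring(xml_text)
    for order in root.iter("order"):
        customer = order.findtext("customer")
        for line in order.iter("order-line"):
            # Only two fields per order line are needed; every other tag is skipped.
            quantity = int(line.findtext("quantity"))
            price = Decimal(line.findtext("price-per-item"))
            totals[customer] += quantity * price
    return {name: total for name, total in totals.items() if total > threshold}


result = big_spenders(ORDERS_XML)
```

Of all the tags in each order, the loop reads three: customer, quantity and price-per-item. Everything else is parsed and then thrown away, which is the bulk being complained about.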

Could the server simply omit the unnecessary data? Probably not, because the tags are not optional. That's because when they designed the XML, they could not foresee that business would be so good that they would be sending out coupons.

It is simply impossible to foresee all possible future needs, so to really future-proof your schema, you would need to make nearly every tag optional. This pretty much voids the purpose of a schema.

A schema where every tag is optional could still be useful, to describe what tags go where and what they mean, sort of like HTML.
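As a rough illustration of that idea, here is a sketch in Python of a consumer that treats the schema as a description rather than a contract. The KNOWN_ORDER_TAGS table, the read_order function and the loyalty-tier tag are all invented for this example; the point is only that missing tags are tolerated and unknown ones are ignored, the way a browser treats HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical "all optional" schema: it says what each tag means,
# but does not require any of them to be present.
KNOWN_ORDER_TAGS = {
    "customer": "customer name",
    "customer-address": "street address",
    "customer-city": "city",
}


def read_order(order_elem):
    """Collect whichever known tags are present; skip anything unrecognized."""
    fields = {}
    for child in order_elem:
        if child.tag in KNOWN_ORDER_TAGS:
            fields[child.tag] = (child.text or "").strip()
    return fields


# A trimmed order: the address tags are omitted, and a tag this consumer
# has never heard of (loyalty-tier, made up here) has been added.
trimmed = ET.fromstring(
    "<order>"
    "<customer>Joe Schlubbs</customer>"
    "<loyalty-tier>gold</loyalty-tier>"
    "</order>"
)
order = read_order(trimmed)
```

With a reader like this, the server is free to send only the customer and the order lines to the coupon application, and older consumers keep working when new tags show up.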

Or you could just deal with the bulk.
