Thursday, August 04, 2005

Is XQuery Right for Scalable Databases?

's new xml data type enables an alternative approach to accessing and modifying hierarchical—rather than relational—data. The primary data access objective for the xml data type is to return structured information from semi-structured XML content. XQuery 1.0, still a W3C Working Draft, is the XML query language implementation for SQL Server's xml data type and basis of the future second edition of SQL/XML-2003 for ANSI SQL-n. The SQL Server team has increased XQuery's utility by adding proprietary methods to update content of xml columns. You can learn more about Microsoft's XQuery implementation in SQL Server 2005 and download an interactive XQuery VB .NET 2005 Windows form project from my "Exploit Yukon's XML Data Type" Visual Studio Magazine article. The article also describes Microsoft's approach to typing xml columns with XML Schema collections and indexing nodes to improve XQuery performance. My tests against SQL Server 2005 beta versions indicate that Microsoft's XQuery implementation meets the requirements of most XQuery use cases and delivers acceptable peformance with a few simuated clients executing simultaneous queries. Hopefully, I'll be able to set up larger size scalability tests with the November 2005's RTM version . XQuery isn't yet an industry "open standard," but the language can stake a claim to maturity. The first XQuery 1.0 Working Draft was posted by the W3C on 15 February 2001, thus XQuery has had a gestation period almost as long as SQL Server 2005. InfoWorld's Jon Udell has posted an interview with Jonathan Robie, one of the editors of the W3C's Working Draft, that recounts the roots of XQuery in earlier XML query languages, such as Object Query Language (OQL), XQL (XML Query Language), XML-QL, and Quilt. Jon's post includes a link to a brief summary of the Robie interview in InfoWorld's "Innovation is alive and well in 2005" special report. Adding Processing Logic to the Database Engine vs. Scalability In his "Database Requirements in the Age of Scalable Services" keynote address to O'Reilly's MySQL Users Conference, held April 18-25, 2005 in Santa Clara, California, Google's Adam Bosworth stated, "If it takes four years, you've overdone the thing," referring to XQuery 1.0. Adam's XQuery condemnation could be deemed applicable to SQL Server 2005 and even to Microsoft Access 1.0, if you include the development time spent on it's ill-fated predecessor (code-named "Omega.") Adam was in charge of the design and development of Access 1.0 during his career at Microsoft. Adam is a champion of simple, highly-scalable queries for Web searches. He says Google has the "dumbest query language in the world ... just type in words. ... It would be silly for MySQL to support XQuery. ... Centralizing [processing] logic in the database ... is not going to give you scale." The same issue applies to executing .NET 2.0 CLR stored procedures within the SQL Server 2005 engine's process, publishing native Web services that use an engine-provided SOAP stack, or defining triggers and user-defined types with C# or VB .NET code. Scaling up databases by adding processors increases hardware and licensing costs faster than scaling out middle tiers by adding Web or application servers. What's missing at this point are independent (third-party) quantitative cost/performance comparisons for engine-based versus middle-tier processing of business rules and other logical—rather than set-based—processing operations. The results aren't likely to become widely known until large organizations start piloting XQuery information retrieval and in-process CLR database objects for real or simulated production environments. Federated Content vs. Centralized Data Domains Google's requirement is high scalability to achieve fast response times for simple full-text queries that return low-granularity (page-level) result sets from caches. Blog searches require increasing granularity to the item level and, presumably, enabling creation or modification date/time criteria. According to Adam Bosworth, "RSS and ATOM will become the lingua franca of data access. ... ATOM was formed by a consortium of bloggers." He observes that RSS/ATOM queries must operate at the item level and won't support joins because joins don't scale; this results in much lower-tech queries than XQuery. The queries aren't intended for the Semantic Web with the Resource Description Framework (RDF) and its top-down taxonomies and schemas. Adam claims that RDF has failed, as demonstrated by the fact that RSS 1.0 was an RDF grammer, RSS 2.0 was not, and RSS 2.0 "won in a walk." Despite the enormous amount of content in the form of Web pages and the rapid increase in the number and size of Weblogs (federated content domains), it's a safe bet that the majority of content that qualifies as useful information is stored in restricted-access databases (centralized data domains). These databases are primarily relational and very-fine grained; relatively few include searchable full-text XML content or XML documents shredded to relations and indexed attributes (tables and columns). As XML documents or fragments increasingly find their way into columns of the xml data type, the ability to write efficient XQuery expressions that extract and transform fine-grained semi-structured content (i.e., XML Infosets) will become a skill required of professional database developers. However, XQuery execution cost—and thus scalability—will remain an open question until all major RDBMS vendors offer implementations that have at least roughly equivalent feature sets. --rj