XML TypeLift – Generating XML types and parsers from XML Schema

Growth of XML

XML is a widely adopted open standard for data storage and transmission. Its self-descriptive structure and tag-based data representation model have simplified and standardized data processing. XML data is stored as plain text, which makes it simple to share and transport without hardware, software or platform compatibility issues. XML also provides a common syntax for message transfer across heterogeneous communication systems, which allows them to interoperate.

XML is growing in popularity and application as more business and technology organizations adopt it for standardized data representation, storage and exchange. Non-XML data formats are being converted to XML because more XML processing tools and components are available. XML data is highly portable and exchangeable, and it is currently used to represent a relatively large number of information technology data entities in various domains, including multimedia. As a result, XML now accounts for a significant proportion of the data on the internet, and this share is growing.

Chart of growth of XML usage

XML Schema Definition

The XML domain is vast, with a huge number of XML file formats for various needs and applications. While XML's self-descriptive structure allows information to be ingested easily, most application areas require strict validation rules for the logical correctness of XML documents. To get the most out of XML, it is crucial to understand the structure of the data, the nesting of the elements and so on. Without a formal definition of that structure, an XML document can be checked for well-formedness but cannot be validated. This is where XML Schema comes in: it allows an XML document to be validated by defining its building blocks, such as the document structure, the elements and the attributes.

XML Schema, or XML Schema Definition (XSD), is a W3C recommendation for describing and validating the structure and content of an XML document by defining the elements, attributes and data types in the document. XSDs are extensive type descriptions that help standardize the large number of XML formats in use today. They do this by providing definitions for the document structure and elements, together with the rules that a document must follow to be valid. Most of the W3C XML formats have XML Schema Definitions developed and standardized by ISO/IEC. The same applies to office document standards such as Microsoft's Office Open XML, an ECMA standard, and the OpenDocument Format used by LibreOffice, which also come with formal schema definitions.

An XML Schema acts as a language for specifying rules about XML documents: it specifies the elements and attributes in a document, associates elements with value types (e.g. name::string, cost::integer), specifies where elements and attributes can appear in nested and non-nested structures, and provides a formal description of documents. Checking a document’s structure against a Schema is called validating the document, and a document conforming to the Schema is called a valid XML document.

XML and Haskell Programming Language

There are two approaches to programming XML document processing systems in a functional language like Haskell (Wallace and Runciman):

The first uses the tree structure of an XML document as the basis for a library of combinators that perform generic document processing tasks, namely selection, generation and transformation of document trees.
The second uses a type-translation framework that treats XML type definitions such as Schemas and DTDs as programming language data types, and derives functions for manipulating documents as typed values in Haskell.
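The combinator approach can be illustrated with a minimal sketch. The Content type and the filter names (tag, children, txt, andThen) below are simplified stand-ins written for this example; they are loosely modelled on the idea of content filters and are not the definitions of HaXml or any other library.

```haskell
-- A minimal sketch of the combinator approach. Content and the filters
-- below are illustrative assumptions, not a real library's API.

-- A generic document tree: every document has the same Haskell type.
data Content
  = Elem String [Content]   -- element name and its children
  | Text String             -- character data
  deriving (Eq, Show)

-- A filter maps one node to zero or more result nodes.
type Filter = Content -> [Content]

-- Keep the node only if it is an element with the given name.
tag :: String -> Filter
tag n c@(Elem m _) | n == m = [c]
tag _ _                     = []

-- Select the immediate children of an element.
children :: Filter
children (Elem _ cs) = cs
children _           = []

-- Keep only text nodes.
txt :: Filter
txt c@(Text _) = [c]
txt _          = []

-- Sequential composition: apply f, then apply g to every result of f.
andThen :: Filter -> Filter -> Filter
andThen f g = concatMap g . f

-- Hypothetical sample document:
--   <order><name>pen</name><cost>3</cost></order>
sample :: Content
sample =
  Elem "order"
    [ Elem "name" [Text "pen"]
    , Elem "cost" [Text "3"]
    ]

-- Select the text inside every <name> child of the root element.
names :: Filter
names = children `andThen` tag "name" `andThen` children `andThen` txt
```

Because every document shares the single Content type, such filters work on any XML file, but the compiler cannot tell whether a given document actually has a name element. The type-translation approach trades this generality for static guarantees.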

Haskell is a statically typed, purely functional programming language with many packages for manipulating XML:

xml-conduit: This package provides parsing and rendering functions for XML.
HaXml: It is a collection of utilities for parsing, filtering, transforming and generating XML documents using Haskell. It includes a parser and a validator for XML, a stream parser for XML events, combinator libraries and two translators: one from DTD types to Haskell data types and the other from Schema definitions to Haskell data types (both translators come with associated parsers and pretty printers).
hxt (Haskell XML Toolbox): It is a collection of Haskell tools for XML processing. Its core component is a domain-specific language with a set of combinators for processing XML trees. HXT is based on HaXml, but uses a generic data model for representing XML documents, including the DTD subset and the document subset, in Haskell. It contains utilities for parsing, evaluating and validating XML and associated elements.
hexml: It is a fast, DOM-style parser that handles only a subset of XML. It skips entities and does not support <!DOCTYPE-related features.
XsdToHaskell: It is a tool for translating any valid XML Schema definition into equivalent Haskell types, together with SchemaType instances.
xeno: A fast, low-memory use, event-based XML parser in pure Haskell.

xeno – a fast, event-based XML Parser

xeno is a fast, low-memory, event-based XML parser written in pure Haskell. Its features are:

SAX-style parser (triggers events for tags, attributes, etc.)
Fast and low memory usage
Written in pure Haskell
CDATA support as of version 0.2
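To illustrate what event-based (SAX-style) processing means, here is a minimal sketch that folds over a stream of events without ever building a document tree. The Event and Stats types are assumptions made for this example; xeno itself works on ByteString input and its interface is not reproduced here.

```haskell
import Data.List (foldl')

-- Simplified event type, assumed for illustration only.
data Event
  = OpenTag String
  | Attribute String String
  | TextNode String
  | CloseTag String
  deriving (Eq, Show)

-- Running statistics kept while events stream past.
data Stats = Stats
  { elements   :: !Int   -- number of elements seen so far
  , attributes :: !Int   -- number of attributes seen so far
  , maxDepth   :: !Int   -- deepest nesting level reached
  , depth      :: !Int   -- current nesting level
  } deriving (Eq, Show)

-- Update the statistics for a single event. Only the small Stats
-- record is kept in memory, never the whole document.
step :: Stats -> Event -> Stats
step s (OpenTag _) =
  let d = depth s + 1
  in s { elements = elements s + 1, depth = d, maxDepth = max d (maxDepth s) }
step s (Attribute _ _) = s { attributes = attributes s + 1 }
step s (TextNode _)    = s
step s (CloseTag _)    = s { depth = depth s - 1 }

-- Strict left fold over the event stream.
summarize :: [Event] -> Stats
summarize = foldl' step (Stats 0 0 0 0)

-- Hand-written events for: <a id="1"><b>hi</b><c/></a>
sampleEvents :: [Event]
sampleEvents =
  [ OpenTag "a", Attribute "id" "1"
  , OpenTag "b", TextNode "hi", CloseTag "b"
  , OpenTag "c", CloseTag "c"
  , CloseTag "a"
  ]
```

The state carried between events is constant in size, which is why this style of parsing suits very large inputs.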

A parsing benchmark compared xeno with hexml, a Haskell library that uses an XML parser written in C, taking hexml’s performance as the baseline. The test was run on three XML documents of 4KB, 31KB and 211KB.

The Xeno.SAX interface was found to be faster and to use less memory than the DOM versions of other XML parsers implemented in Haskell.

Xeno benchmark

Image reference: http://hackage.haskell.org/package/xeno

XML TypeLift

Validating XML documents against XML Schema Definitions establishes that they are valid, but to analyze these formats in a typed programming language, one also needs to convert the XML data types from the XML Schema into meaningful language data type declarations and to generate parsers for the documents. XML TypeLift is a Haskell tool that converts XML Schema data types into Haskell data types, making this data available to Haskell programs for analysis. Among other advantages, this makes it easy to handle large XML Schemas such as Microsoft's Office Open XML.
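As a rough illustration of this kind of conversion, the sketch below pairs a small schema fragment with one plausible Haskell rendering. Both the schema and the Order type are hypothetical and written for this example; the declarations that XML TypeLift actually generates may differ in naming and in the types chosen.

```haskell
-- A hypothetical schema fragment (for illustration only):
--
--   <xs:element name="order">
--     <xs:complexType>
--       <xs:sequence>
--         <xs:element name="name" type="xs:string"/>
--         <xs:element name="cost" type="xs:integer"/>
--         <xs:element name="note" type="xs:string" minOccurs="0"/>
--         <xs:element name="tag"  type="xs:string"
--                     minOccurs="0" maxOccurs="unbounded"/>
--       </xs:sequence>
--     </xs:complexType>
--   </xs:element>

-- One plausible Haskell rendering of the fragment above (an assumption,
-- not the tool's actual output).
data Order = Order
  { name :: String        -- xs:string, exactly once
  , cost :: Integer       -- xs:integer, exactly once
  , note :: Maybe String  -- minOccurs="0" becomes Maybe
  , tags :: [String]      -- maxOccurs="unbounded" becomes a list
  } deriving (Eq, Show)

-- Ordinary typed code over the translated type. Misspelling a field or
-- treating cost as text would be rejected by the compiler.
totalCost :: [Order] -> Integer
totalCost = sum . map cost

-- Hypothetical sample values.
sampleOrders :: [Order]
sampleOrders =
  [ Order { name = "pen",  cost = 3,  note = Nothing,     tags = [] }
  , Order { name = "book", cost = 12, note = Just "gift", tags = ["paper"] }
  ]
```

Occurrence constraints in the schema map naturally onto Maybe and list types, so optional and repeated elements are visible in the type rather than discovered at run time.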

XML TypeLift is still in development. Once it is released, we expect an efficient tool that performs validation and identifier resolution early in the processing chain. Haskell's static type system requires the type of every expression to be known at compile time, so any type mismatch is reported as a compile-time error, and this advantage carries over to XML document parsing. Because the Haskell types are generated from the XML Schema before the program is compiled, programmers can use the full power of this strong type system when working with the data. Type information is extracted from the schema automatically, so hand-transcription mistakes and unresolved identifiers are caught before the program runs, avoiding costly run-time errors and delays.

XML is a key technology for long-term data storage, both static and dynamic, thanks to evolving data formats and self-descriptive document structures. It also facilitates worldwide standardization of data formats by virtue of the DTDs and Schemas associated with XML. A programming-language-based tool like XML TypeLift is highly useful in such scenarios, since it allows automated XML parsing and generation.

Even though similar parsing solutions exist in other Haskell packages such as HXT, XML TypeLift has additional advantages and more ambitious goals:

Performs online processing of multi-gigabyte inputs without overloading system memory.
Fast, low-memory, event-based parsing using the pure-Haskell xeno XML parsing engine.
Aims to exceed the speed and processing capacity of commonly used data processing languages and libraries: Python lxml, Scala, F#, etc.

During the initial stages, XML TypeLift has focused on generating appropriate data type declarations for the corresponding XML Schema elements. This gives a concise understanding of existing XML documents; the parser generator is yet to be developed. For the project, the code of xeno has also been extended with the FromXML class, which can be used without automatic generation of a parser. This allows a choice between automatically generated and handwritten instances. FromXML is also an interface that allows fast XML parsing to be defined at a high level. The FromXML class handles XML data with Data.ByteString, a fast and memory-efficient implementation of byte vectors using packed Word8 arrays, instead of String. Data.ByteString is suitable for high-performance use, whether the requirement is large data volumes or high speed.
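The sketch below shows what a handwritten instance of such a class might look like. Everything in it is an assumption made for illustration: the Node type is a local stand-in for a parsed DOM node, and the class shape, method name and helper functions are hypothetical and may not match the real FromXML class in the project.

```haskell
import           Data.ByteString       (ByteString)
import qualified Data.ByteString.Char8 as BS

-- Local stand-in for a parsed DOM node (assumption; not Xeno.DOM's type).
data Node = Node
  { nodeName     :: ByteString
  , nodeChildren :: [Node]
  , nodeText     :: ByteString
  } deriving (Eq, Show)

-- Hypothetical shape of the class; the real FromXML may differ.
class FromXML a where
  fromXML :: Node -> Either String a

-- Target type, using ByteString rather than String for text.
data Order = Order
  { name :: ByteString
  , cost :: Integer
  } deriving (Eq, Show)

-- Find the first child element with the given tag name.
child :: ByteString -> Node -> Either String Node
child tag n =
  case filter ((== tag) . nodeName) (nodeChildren n) of
    (c:_) -> Right c
    []    -> Left ("missing element: " ++ BS.unpack tag)

-- Read an integer that must consume the node's whole text.
integer :: Node -> Either String Integer
integer n =
  case BS.readInteger (nodeText n) of
    Just (i, rest) | BS.null rest -> Right i
    _ -> Left ("not an integer: " ++ BS.unpack (nodeText n))

-- A handwritten instance, as an alternative to a generated one.
instance FromXML Order where
  fromXML n = do
    nm <- child (BS.pack "name") n
    c  <- child (BS.pack "cost") n >>= integer
    pure (Order (nodeText nm) c)

-- Hand-built node for: <order><name>pen</name><cost>3</cost></order>
sampleNode :: Node
sampleNode =
  Node (BS.pack "order")
    [ Node (BS.pack "name") [] (BS.pack "pen")
    , Node (BS.pack "cost") [] (BS.pack "3")
    ]
    BS.empty
```

Keeping the class separate from the generator means the same interface serves both generated parsers and hand-tuned ones, so a project can start with handwritten instances and switch later.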

The parser generated by XML TypeLift uses xeno as the lower-level parser that does the actual parsing, while FromXML acts as the high-level interface for XML parsing. The use of xeno and FromXML makes XML TypeLift fast. Two important factors behind this performance are Xeno.DOM, a fast DOM parser and API for XML, and FromXML, which builds on both Xeno.DOM and the high-performance ByteString for XML data handling.