Dec 11, 2010

Java + XML

Started trying to parse some XML in Java. I figured this would be simple, considering that both technologies are so well developed. The catch is that they're so well developed that there are too many choices out there, and it gets confusing. Surprisingly, even though almost all the work is pre-2005, it's hard to find a single succinct introduction that helps you choose which technology to use. There's a lot of documentation out there, and it took me a long time to read through various API specs, other docs, and tutorials before I felt like I had a firm grasp on the world of XML parsing/manipulation in Java. I assume that if you're interested in XML, you already know what it is; if not, there are plenty of primers on that topic. Here's some info on how to work with XML in Java.

First of all,

DTD, Document Type Definition: text file in a standardized format that defines the rules of a specific type of XML file, and the allowed elements, attributes, and structure.

XSD, XML Schema Definition: the equivalent of a DTD, but written in XML itself. It's newer, and it adds data types and namespace support.

Note: you would normally use either a DTD or an XSD for a given document type, not both. DTDs were more common in the past; XSD is what you'll see for newer stuff.
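To make the idea of "rules for a type of XML file" a bit more concrete, here's a rough sketch of checking a document against an XSD with the javax.xml.validation package that ships with the JDK. The schema (a note element holding a single to element) and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdValidationSketch {
    // Hypothetical schema: a <note> element containing exactly one <to> string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='to' type='xs:string'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the XML conforms to the schema above.
    // Note: a broken schema would also end up as "false" in this sketch.
    static boolean isValid(String xml) throws IOException {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no custom ErrorHandler set, validation errors are thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<note><to>Bob</to></note>"));
        System.out.println(isValid("<note><from>Al</from></note>"));
    }
}
```

The second document uses an element the schema doesn't allow, so the validator rejects it.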

DOM, Document Object Model: a parser API (i.e. a bunch of defined classes and interfaces) that represents an XML or HTML document as a tree of Node objects. After parsing, the user can navigate the tree to find and manipulate the data they are looking for. While conceptually simple, working with it is tedious because it's generic. You can navigate to an Element, but you have to use String objects to find the particular element or attribute you are looking for every time. Because the whole document is loaded into memory as a tree, it ends up having high memory usage and is slow on large documents. DOM can be used for XML or HTML, and the standard Java interfaces for it live in the org.w3c.dom package.
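Here's a minimal sketch of what the DOM approach looks like in practice, assuming a made-up document with book elements that carry a title attribute. Notice how the element and attribute are both looked up by String name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSketch {
    // Parses the whole document into a tree, then returns the title
    // attribute of the first <book> element, or null if there is none.
    // The "book" and "title" names are illustrative, not from any real schema.
    static String firstBookTitle(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList books = doc.getElementsByTagName("book");
        if (books.getLength() == 0) {
            return null;
        }
        Element book = (Element) books.item(0);
        return book.getAttribute("title");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book title='Dune'/><book title='Emma'/></library>";
        System.out.println(firstBookTitle(xml));
    }
}
```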

SAX, Simple API for XML: a competing parser API designed for performance. It does not define how a document is represented; that's left to the user. Instead it uses a callback model that calls user-implemented methods when the parser hits the start or end of an element (attributes are handed over along with the element start) or a run of text. The user can represent the document any way they like, from a tree of specific classes for pertinent datatypes, to a linked list or array or hash table if that's more convenient. This may be straightforward for a very simple document, or time consuming if there are many types of elements/attributes. The standard Java interfaces live in the org.xml.sax package.
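For comparison, here's a small sketch of the callback model, again assuming made-up book elements with a title attribute. Nothing is kept in memory except what the handler chooses to keep, which here is just a list of strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Collects the title attribute of every <book> element as the parser
    // streams through the document. Element and attribute names are illustrative.
    static List<String> bookTitles(String xml) throws Exception {
        final List<String> titles = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so match on qName.
                if ("book".equals(qName)) {
                    titles.add(attrs.getValue("title"));
                }
            }
        };

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<library><book title='Dune'/><book title='Emma'/></library>";
        System.out.println(bookTitles(xml));
    }
}
```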

Note: you would normally pick either DOM or SAX for a given job. DOM is heavyweight but allows you to navigate the tree, and is generally less work. SAX is faster, but you have to create your own object model, which may be simple or hard, depending on the complexity of your XML schema. If your XML is complicated, SAX is most likely more work than DOM.

JAXP, Java API for XML Processing (originally "Parsing"): a standard API that lets you interact with XML documents, primarily by parsing, and then transforming from source to destination formats. For example, writing to a file is implemented as a transform. The API is under javax.xml.parsers and javax.xml.transform. It works with both SAX and DOM parsers, and using JAXP hides many of the details of the underlying parser. This is one of the simpler ways to go initially, but look up some samples to help you along, since the API docs are not so intuitive. JAXP also includes ways to transform XML documents using XSLT stylesheets, and to validate your XML against a Schema. I found that using JAXP took much less code for my purposes, which was just to write some structures to XML.
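As an illustration of "writing is a transform", here's a sketch that builds a small DOM tree and serializes it with an identity Transformer (one created without a stylesheet). The library/book structure and the method name are invented for the example; swapping the StringWriter for a File in the StreamResult would write to disk instead.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class JaxpWriteSketch {
    // Builds <library> with one <book title="..."> child per title,
    // then serializes the tree to a String.
    static String libraryXml(String... titles) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .newDocument();

        Element root = doc.createElement("library");
        doc.appendChild(root);
        for (String title : titles) {
            Element book = doc.createElement("book");
            book.setAttribute("title", title);
            root.appendChild(book);
        }

        // A Transformer created with no stylesheet copies source to result as-is.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(libraryXml("Dune", "Emma"));
    }
}
```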

All the APIs described so far will give you generic element and attribute objects, but then you'd check the string inside to figure out what type of element it is, and then look for the appropriate attribute inside. Now it would be much simpler if you actually had Java classes representing the element and attribute types, and simply navigated that. One way to do this is to code up your own Java classes to represent the various element and attribute types, and instantiate them as you go with a SAX parser.
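A rough sketch of that hand-rolled approach, assuming a hypothetical document of person elements with name and age children. The Person class and the handler are invented for illustration; the handler buffers text between tags and creates a Person each time a person element closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxBindingSketch {
    // Hand-written class standing in for one element type.
    static class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    static class PersonHandler extends DefaultHandler {
        final List<Person> people = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String name;
        private int age;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            text.setLength(0); // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                name = text.toString().trim();
            } else if ("age".equals(qName)) {
                age = Integer.parseInt(text.toString().trim());
            } else if ("person".equals(qName)) {
                people.add(new Person(name, age));
            }
        }
    }

    static List<Person> parse(String xml) throws Exception {
        PersonHandler handler = new PersonHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.people;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people><person><name>Ann</name><age>31</age></person></people>";
        for (Person p : parse(xml)) {
            System.out.println(p.name + " " + p.age);
        }
    }
}
```

Every new element type means another branch in the handler and another class to write, which is where the work piles up.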

An alternative is to code up your own Java classes that wrap the DOM classes, so that your own classes perform the appropriate DOM manipulation operations, and expose a much simpler interface to the user. Too bad there's so much manual work involved. You would have thought that with a DTD or Schema, you would have all that information about your XML document... That's where JAXB comes into play.
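To show the kind of manual work I mean, here's a sketch of the wrapper approach, assuming the same made-up library/book structure. LibraryView and BookView are hypothetical names; each one holds on to a DOM node and hides the string lookups behind ordinary getters and setters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWrapperSketch {
    // Wraps a single <book> element; changes go straight to the DOM node.
    static class BookView {
        private final Element element;
        BookView(Element element) { this.element = element; }
        String getTitle() { return element.getAttribute("title"); }
        void setTitle(String title) { element.setAttribute("title", title); }
    }

    // Wraps the whole document and hands out typed views of its books.
    static class LibraryView {
        private final Document doc;
        LibraryView(Document doc) { this.doc = doc; }

        static LibraryView parse(String xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return new LibraryView(doc);
        }

        List<BookView> getBooks() {
            NodeList nodes = doc.getElementsByTagName("book");
            List<BookView> books = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                books.add(new BookView((Element) nodes.item(i)));
            }
            return books;
        }
    }

    public static void main(String[] args) throws Exception {
        LibraryView library =
            LibraryView.parse("<library><book title='Dune'/><book title='Emma'/></library>");
        for (BookView book : library.getBooks()) {
            System.out.println(book.getTitle());
        }
    }
}
```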

JAXB, Java Architecture for XML Binding: JAXB includes an intermediate "compiler" (the xjc tool) that generates Java classes based on an XML Schema, plus a runtime that converts between XML documents and instances of those classes. You can then build these auto-generated classes into your app without the tedium of dealing with DOM or SAX. Juicy!