In this article we’re going to discuss XML parsers and the difference between the two main Parser types: DOM and SAX. As you may have noticed from my other articles, I will first give you a helicopter overview before we go into the details.
Why is this important?
Good question! Let’s start with XML and why it’s important.
XML is everywhere. In our documents, phones, photos, computers, software, hardware, social media, networks, household, refrigerators, coffee, tea, toys, music, television, radio, the weather, cars, bicycles, planes, boats, trains, the gym, governments, food, drinks, sports, leisure and probably much more that I couldn’t think of right now.
Yes. It’s a lot of areas where we can find XML. But what is XML?
XML stands for eXtensible Markup Language. Just like HTML, it is a Markup Language. There is, however, one big difference: HTML describes both the Data and the Layout, whereas XML only describes the Data, not how it should look. Presentation is the task of other systems, such as the Extensible Stylesheet Language (XSL), a family of languages used to transform and render XML documents. If you open an XML document in a web browser, you only see something like the picture above.
The XML shown above lists customers and details about the orders they placed at some company. This kind of XML is used extensively by sales systems.
One of the great advantages of this kind of XML is that it can easily be transferred over HTTP, the same way web pages are transferred from a server anywhere in the world to a web browser anywhere else in the world, with amazing speed!
We are used to it, but think of it this way: you place an online order at some webshop. The shop doesn’t have the item you ordered in stock, so it immediately sends a new order to its supplier, who delivers the product to the webshop, which in turn delivers it to you on time!
All these processes happen without you noticing. You just got what you expected on time and that’s the only thing that matters in your opinion.
These ‘messages’ need to exchange the data contained inside the XML code, so there needs to be a system which can grab the data from the XML. This system is the Parser.
Parsing, syntax analysis or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
Java and XML
Java has had the instruments on board to parse XML (either from files or networks) for a long time: the Java API for XML Processing (JAXP), which provides both a DOM and a SAX parser, has been part of the standard library since Java 1.4.
You could say that Java and XML have been happily married for most of their lives.
The DOM Parser
A popular way to look at XML is as a Tree with branches and leaves.
The DOM parser makes use of this. It reads the whole XML document at once, after which the Nodes and Elements can be accessed with simple methods.
Once the DOM Parser has read the complete XML Document, it has generated a DOM object. If you’re interested you can parse an XML File yourself and have a look at the DOM object in the Debugger. It’s a very interesting Object to see!
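To make the "read everything first, then access it" idea concrete, here is a minimal sketch using the JAXP classes from the standard library. The sample XML and the class name DomParseSketch are made up for this illustration; they are not taken from the picture above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomParseSketch {

    // Hypothetical sample data, invented for this sketch.
    static final String SAMPLE =
          "<customers>"
        + "<customer id=\"1\"><name>Alice</name></customer>"
        + "<customer id=\"2\"><name>Bob</name></customer>"
        + "</customers>";

    // Reads the complete XML text and returns the in-memory DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // The tree is only available after the whole document has been read.
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getElementsByTagName("customer").getLength());
    }
}
```

Setting a breakpoint on the first println is an easy way to inspect the Document object in the debugger.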
Getting the data from the DOM Object
Once you have the DOM Object, its methods are simple. You can navigate through the XML by calling methods such as getElementsByTagName(), getFirstChild(), getNextSibling(), getPreviousSibling(), etc. The data is usually stored in a text Node that is a child of the named Element.
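As a rough sketch of such navigation, the snippet below uses method names from the standard org.w3c.dom API (getElementsByTagName, getFirstChild, getNextSibling, getTextContent). The order document and the helper names textOf and childElementNames are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomNavigationSketch {

    // Hypothetical sample data, invented for this sketch.
    static final String SAMPLE =
        "<order><item>Coffee</item><item>Tea</item><total>7.50</total></order>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Collects the text content of every element with the given tag name.
    static List<String> textOf(Document doc, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    // Walks the direct children of the root element, sibling by sibling.
    static List<String> childElementNames(Document doc) {
        List<String> names = new ArrayList<>();
        Node node = doc.getDocumentElement().getFirstChild();
        while (node != null) {
            // Skip text and comment nodes; only element nodes are of interest here.
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                names.add(node.getNodeName());
            }
            node = node.getNextSibling();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(textOf(doc, "item"));
        System.out.println(childElementNames(doc));
    }
}
```

The element-type check matters in real documents, where the whitespace between tags shows up as text nodes among the siblings.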
Many developers use Static Utility libraries to extract data from the DOM Object. You can find many of these in the Apache Solr Source Code since Solr does almost everything with XML.
When you want to get lists of Nodes from the Data or want to grab the content of a single element, XPath expressions are very useful.
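A small sketch of both uses, based on the javax.xml.xpath package from the standard library. The customer data, the expressions and the helper names single and all are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {

    // Hypothetical sample data, invented for this sketch.
    static final String SAMPLE =
          "<customers>"
        + "<customer id=\"1\"><name>Alice</name></customer>"
        + "<customer id=\"2\"><name>Bob</name></customer>"
        + "</customers>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates the expression and returns the text of the first match.
    static String single(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    // Evaluates the expression and returns the text of every matching node.
    static List<String> all(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(single(doc, "/customers/customer[@id='2']/name"));
        System.out.println(all(doc, "//customer/name"));
    }
}
```

Note that XPath works on top of the DOM here: the Document still has to be parsed completely before an expression can be evaluated.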
The SAX Parser
SAX stands for Simple API for XML. The big difference between SAX and DOM is that DOM reads the complete Document before processing and SAX processes the Document while reading.
SAX works with events. To explain this, consider the following XML:
<XML>
  <a>... Some data...</a>
  <b>... Some data...</b>
  <c>... Some data...</c>
  <d>... Some data...</d>
</XML>
- Before the first element is read, the SAX parser recognizes that it is at the start of the Document and the startDocument event is triggered. The <XML> element itself then fires a startElement event.
- The parser keeps on reading and encounters the <a> element. It fires the startElement event again.
- Now the parser sees the characters in <a> and reports them through the characters event until it encounters </a>, which fires the endElement event. This continues until all elements are read.
- Finally </XML> is encountered, which fires a last endElement event followed by the endDocument event, and the parser is done.
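The sequence of events can be made visible with a small handler that simply records every callback it receives. This is a sketch built on the standard SAXParserFactory and DefaultHandler classes; the class name SaxEventSketch, the events helper and the shortened sample document are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventSketch {

    // Parses the XML text and returns the events in the order they were fired.
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                log.add("startDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                log.add("startElement " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser is allowed to deliver one piece of text in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("characters " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + qName);
            }

            @Override
            public void endDocument() {
                log.add("endDocument");
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return log;
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the document above, with invented data.
        for (String event : events("<XML><a>one</a><b>two</b></XML>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is why this style suits very large Documents.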
Which parser when?
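- Since SAX doesn’t read the whole Document into memory, it’s best for very large Documents.
- DOM is generally more convenient, and faster when you need to access the same data repeatedly or in random order, because it keeps the whole Document in memory. For a single pass over the data, SAX usually needs less time and far less memory.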
Getting the data from the SAX parser
Like DOM this is pretty easy. You can create a JavaBean while parsing and use the getters to get all you want. Here is a nice example.
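Independently of the example referred to above, the idea of filling a JavaBean while parsing can be sketched as follows. The Customer bean, its name and city fields, and the handler are all hypothetical and only serve as an illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxBeanSketch {

    // Hypothetical JavaBean; the fields are invented for this sketch.
    static class Customer {
        private String name;
        private String city;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
    }

    // Builds one Customer per <customer> element while the events come in.
    static class CustomerHandler extends DefaultHandler {
        private final List<Customer> customers = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private Customer current;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            text.setLength(0); // start collecting text for the new element
            if ("customer".equals(qName)) {
                current = new Customer();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is appended, not assigned.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null) {
                return; // not inside a <customer> element
            }
            String value = text.toString().trim();
            switch (qName) {
                case "name": current.setName(value); break;
                case "city": current.setCity(value); break;
                case "customer": customers.add(current); current = null; break;
                default: break;
            }
        }

        List<Customer> getCustomers() { return customers; }
    }

    static List<Customer> parse(String xml) throws Exception {
        CustomerHandler handler = new CustomerHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getCustomers();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<customers>"
                + "<customer><name>Alice</name><city>Utrecht</city></customer>"
                + "<customer><name>Bob</name><city>Delft</city></customer>"
                + "</customers>";
        for (Customer c : parse(xml)) {
            System.out.println(c.getName() + " from " + c.getCity());
        }
    }
}
```

The handler holds only the bean currently being built plus the finished list, so memory use depends on what you decide to keep rather than on the size of the Document.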
Based on this information you should be able to decide which parser to use.