18 Aug 2008

Cocoa Tutorial: libxml and xmlreader

by Marcus Zarra

Let us pretend for a moment that NSXMLDocument was not available to your Cocoa application for some reason. Perhaps you have low memory requirements, perhaps you are running on a slimmed down version of OS X. Whatever the reason, for the purposes of this exercise, NSXMLDocument does not exist.

Let us now assume that we have a requirement to parse an xml document quickly and without loading the entire tree into memory as an object structure. In a situation like this libxml comes in handy. Unfortunately it is quite a bit more complicated than calling alloc and init on NSXMLDocument.

libxml is a C library that is included with all current releases of OS X. With this library we can quickly read in a document, scrape the information we need out of that document and avoid loading the entire tree into memory at once. In addition, libxml (and more specifically xmlReader) does this very quickly, far faster than NSXMLDocument, which is very useful when you have a lower end CPU.

In this project we are going to create a simple application that reads in an xml file containing a list of people, their names and their ages. For the purposes of demonstration we are going to load that data into an array of NSDictionary objects and display it in a standard Cocoa window.

The raw XML file looks like this:

Besides being a C library, another difference in the way that we are going to read in this xml file is that the reading is iterative. For each element in the xml file we will be looping over the reader. Therefore, unlike our old friend NSXMLDocument, we will have to actually walk the file one element at a time and evaluate those elements as we receive them. In the xml file listed above, we would first read in the opening root tag, then the opening person tag, the opening name tag, the first name, etc. until we were done with the file.

To accomplish our goal, I created a new Xcode project which can be downloaded below. In this Xcode project, I added the libxml “framework” to the project by right clicking on the target, clicking on the add button, narrowing the focus to libraries and selecting the libxml2.dylib file.

This will link libxml to our project at compile time. However, unlike an Objective-C framework, we do not get access to its headers for free. Therefore the next step is to add the header to our search path. To do so, open the target’s properties and select the build tab. Narrow its focus to “header” and select the header search path entry.

Double click on this row and add /usr/include/libxml2 to the list of paths. NOTE: Some people have had issues with this path. If you do, you will need to find another copy of these headers in your Xcode installation directory.

Once this is done, we will be able to compile against this library with no errors or warnings. Next, after we add an application delegate and wire it up in our xib file, we need to import libxml in our AppDelegate.h. While we are in here we are going to add a reference for our records array that we will be loading the data into.

Since xmlReader is so quick, our data set is so small and this is a demonstration only, we are going to read the xml file directly as the application starts. BE WARNED: for a production application THIS IS A BAD IDEA™. Fortunately I can get away with a lot in demos. :-)

The first thing we need to do in the -applicationDidFinishLaunching: method is to load the xml data and initialize the xml reader.

In this section, again for demonstration purposes, we are reading the included xml file into an NSData object. Normally we would be retrieving this data from an NSURLConnection or some other external source. Once the data is loaded, we pass the raw bytes off to the C function xmlReaderForMemory(). This function returns a pointer to our xmlReader. This reader is now ready to walk, one way and one time, through our xml file. Now we need to declare some local variables to store our state information while we walk the xml file.

Since this xml file is very simple we do not need to hold onto a lot of state information. Basically when we hit a person tag we will create a new person dictionary and then every time we hit a text node we will store that text node’s value into the dictionary along with the tag name. Finally we will store each of those person dictionaries into an array.

Once our variables are declared and initialized it is time to start walking the xml file. Since we do not know how long it is going to be, we simply loop until we hit the end, so the loop starts with a while(true). The first thing we do inside the loop is tell the reader to advance to the next node with xmlTextReaderRead(). If that returns 0 we are done with the file and break. If it returns 1 the reader has advanced to the next node. A negative value means the parser hit an error, and we should stop then as well. Since the node types are stored as integer values, we can use a switch statement to quickly sort them as we are looping. The xmlTextReaderNodeType() function will return that type integer.

For our purposes, we really only care about two of the node types: opening tags and text values. For a more complex xml file we would watch for more information but fortunately we are working in a controlled environment. The node types we are going to watch for are XML_READER_TYPE_ELEMENT and XML_READER_TYPE_TEXT.

When we hit an element we grab its name, store it temporarily and check to see if it is a person. If it isn’t, we move on to the next node. If it is, we create a new person and add it to our array.

When we hit a text node then we want to stick it in the current person object. Since we previously stored the name of this element (remember, an element and its text value are two separate nodes), we can grab the value of the text and store it in the dictionary.

Any other type of node is ignored: it lands on the default: label of the switch and we continue with the next iteration of the loop.

Since we check for the completion of the file at the beginning of the loop, that is all that we need to do for this example. Once the while loop exits we pass the array to -setRecords: and we are done. Cocoa bindings will take care of the rest for us.


Comments

schwa says:

While forgetting about NSXMLDocument & co you should not be forgetting TouchXML: http://code.google.com/p/touchcode/wiki/TouchXML – an Objective-C XML library that for some unknown reason manages to clone enough NSXMLDocument to be useful.

wadetregaskis says:

Be aware that the version of libxml2 included with Mac OS X to date is 2.6.16, which has a lot of bugs in xmlTextReader – in fact, several parts of it aren’t even implemented at all! None of this is documented, either, sadly, though all the issues I’m aware of are fixed in 2.6.32 (the latest).

If you’re stuck using it – though there are alternatives, if you don’t mind pulling in a 3rd party library – it’s worth pulling down the source for 2.6.16 (ftp://xmlsoft.org/libxml2/libxml2-2.6.16.tar.gz) and carefully checking out each function before you use them, to see if there’s any bits obviously missing. That won’t save you from the crashing bugs, alas.

If you’re only missing the NSXMLDocument class (in your *ahem* hypothetical case), it should be possible to use the NSXMLParser class instead of raw libxml2 for Event-Driven XML handling. It is similarly fast and lightweight.

Of course, NSXMLParser is strict, so will fail on malformed XML.

I personally use neither, instead falling back to the libxml parser instead of reader (xmlDocPtr) since I like to use XPath queries (xmlXPathEvalExpression) to extract my nodes. I’m so heavyweight.

“slimmed down version of OS X.”

OS X is perfect why would you need to slim it down? ;)

klacoste says:

In the 4th paragraph, when talking about the XML data file, there’s a typo in there. I’m pretty sure the XML is not “irritative”. Then again, if that’s what you really meant, I think I’d understand.

PS: Cheers for the blog. I enjoy reading it. Keep the posts coming!

Marcus Zarra says:

Where do you see irritative? Perhaps you meant to type iterative?

While XML itself is not iterative, the method that this article uses for reading it is.

maurj says:

Just recently, I’ve been working “without NSXMLDocument”, although I didn’t realise this till about 2 days of development had passed. Grr. Like yourself, I started looking for an alternative, and ended up working with libxml directly (with great success, I should add). I’m going for a more direct “xmlReadMemory” approach than described here, but the principle is the same. Wish I’d found your article a few days ago.

I thought I should let you know that the same approach described here can be used to work with libxslt (for XSLT transformations) and libtidy (for HTML tidying – e.g to convert HTML into XHTML). As with libxml, you might need to find your own copies of the headers in each case, as they aren’t included by default. And if the local copy of your slimmed-down Mac OS X installation doesn’t have the libxslt and libtidy dylibs to hand, then you might need to borrow the copies from a non-slimmed-down version too.

karlmeier says:

Thanks a lot for this post. It turns out to be way faster than my implementation with NSXMLParser. However I haven’t managed to extend it in a way that the attributes of tags are parsed into the dictionary.

My testing for “case XML_READER_TYPE_ATTRIBUTE:” is never being called and I have issues understanding the libxml2 documentation. As far as I can tell my test should work.

Turns out that nobody has implemented this in objective-c apparently, as far as Google knows anyway. Could you give me a hint on what I’m missing??

Oh and btw, I found that you probably should replace the return in “if (!currentPerson) return;” with a break.

Thanks in advance

Marcus Zarra says:

Have you looked at TouchXML at all? It is also based on libxml2 and may shed some light on the issue for you.

karlmeier says:

Yeah, I have. Hasn’t helped me too much though. It doesn’t use native libxml to extract the attributes and uses DOM. I’m trying to build a simple SAX parser that returns a neat dictionary that I can use. Furthermore, the TouchXML project seems to have vanished from Google code. I managed to find the code on the web, but didn’t have much luck with the documentation for it ;-)

Seems like I have to do a little digging myself then… oh well. Thanks for the tip though.