Processing XML documents with PEAR::XML_Parser
Although the acronym SAX stands for Simple API for XML, it is not so simple that everybody is able to use it right away. PHP has provided a SAX-based parser through the xml_* functions since PHP 3.0.6, but only since PHP5 introduced a wider range of XML APIs have PHP and XML been getting the attention they deserve. This tutorial will show you how to use XML_Parser, an object-oriented wrapper around the native PHP extension, which makes processing XML documents with PHP4 a piece of cake.
SAX is an event-based API. That means that the XML document is traversed character by character, and whenever the parser finds a construct (like an opening or closing tag, a processing instruction or plain character data) it triggers an event. Your PHP script may then catch these events by supplying PHP functions or methods as callbacks.
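With a SAX parser, the following events can be caught: opening tags, closing tags, character data, processing instructions, external entity references, notation declarations and unparsed entity declarations. A default handler receives everything that is not covered by one of the other handlers.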
When processing XML, it is usually sufficient to catch the first three events and ignore the rest. Even so, this leads to a lot of code that is needed in every script that processes XML documents: you need to create a parser resource, register all handlers, open the XML document, parse it chunk by chunk and catch all errors that might occur during parsing. And at that point you haven't implemented a single line of code that actually works with the information in the document.
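To give an impression of this boilerplate, here is a minimal sketch that uses only the native xml_* functions. The function name collectSaxEvents() and the format of the recorded event strings are made up for this illustration, and the closures require PHP 5.3 or later.

```php
<?php
// Collect SAX events from an XML string using only the native xml_* functions.
// collectSaxEvents() and the event strings are illustrative names.
function collectSaxEvents($xml, $chunkSize = 4096)
{
    $events = array();

    // 1. Create the parser and keep tag names as written (no case-folding).
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    // 2. Register handlers for opening tags, closing tags and character data.
    xml_set_element_handler(
        $parser,
        function ($xp, $name, $attribs) use (&$events) {
            $events[] = 'start:' . $name;
        },
        function ($xp, $name) use (&$events) {
            $events[] = 'end:' . $name;
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($xp, $data) use (&$events) {
            if (trim($data) !== '') {
                $events[] = 'cdata:' . trim($data);
            }
        }
    );

    // 3. Feed the document to the parser piece by piece and check for errors.
    $chunks = str_split($xml, $chunkSize);
    $last   = count($chunks) - 1;
    foreach ($chunks as $i => $chunk) {
        if (!xml_parse($parser, $chunk, $i === $last)) {
            throw new RuntimeException(sprintf(
                'XML error: %s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    }

    return $events;
}

print_r(collectSaxEvents('<team><hero>Batman</hero></team>'));
```

All of this is needed before any application logic has been written, which is the part a wrapper class can take over.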
Parsing XML with XML_Parser
The PEAR package XML_Parser takes over these common tasks for you, so the only thing that is left to the developer is implementing the logic of the callbacks that process the different parts of the document.
Before we can actually start processing XML documents, we need a document that we can use for the following examples:
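The original listing is not reproduced here, so the following is a hypothetical reconstruction based on the parser output shown later in this tutorial. The element names team, hero, realname and city come from the text; the attribute names name and abbrev and the exact layout are assumptions. The document is kept in a PHP string so that it can be checked for well-formedness right away.

```php
<?php
// Hypothetical reconstruction of the example document. The attribute names
// "name" and "abbrev" are assumptions derived from the class properties
// mentioned in the text.
$heroesXml = <<<'XML'
<?xml version="1.0"?>
<team name="Justice League of America" abbrev="JLA">
  <hero name="Superman">
    <realname>Clark Kent</realname>
    <city>Metropolis</city>
  </hero>
  <hero name="Batman">
    <realname>Bruce Wayne</realname>
    <city>Gotham City</city>
  </hero>
  <hero name="The Flash">
    <realname>Wally West</realname>
    <city>Keystone City</city>
  </hero>
  <hero name="Aquaman">
    <realname>Arthur Curry</realname>
    <city>Sub Diego</city>
  </hero>
</team>
XML;

// Quick sanity check with the native parser: is the document well formed?
$check = xml_parser_create();
echo xml_parse($check, $heroesXml, true) ? "well formed\n" : "not well formed\n";
```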
Our first task now is to create a PHP script that extracts all information about the heroes in the team "JLA". When working with XML_Parser, you are supposed to create a new class that extends the XML_Parser class and implements the handlers for all events you want to process. The handlers "startHandler()", "endHandler()" and "cdataHandler()" have already been implemented as empty methods in the XML_Parser class and will be registered when you create your new parser instance.
In this class we define some properties to contain the actual information about the superhero team ($name, $abbrev and $heroes), some helper properties used during parsing ($currentHero and $currentTag) as well as the special $folding property. This property, initially defined in XML_Parser, tells the parser whether to enable case-folding (true) or not (false). If case-folding is enabled, all tag and attribute names will be transformed to uppercase. This is useful if you do not care about case-sensitivity in your documents, as it makes processing them less error-prone.
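The effect of case-folding can be observed directly with the native extension, which exposes the switch as the parser option XML_OPTION_CASE_FOLDING. The helper tagNames() below is purely illustrative and not part of XML_Parser; attribute names are prefixed with "@" only to tell them apart in the result.

```php
<?php
// Return the tag and attribute names reported by the native parser,
// with or without case-folding. tagNames() is an illustrative helper.
function tagNames($xml, $folding)
{
    $names  = array();
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $folding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($xp, $name, $attribs) use (&$names) {
            $names[] = $name;
            foreach (array_keys($attribs) as $attribute) {
                $names[] = '@' . $attribute;
            }
        },
        function ($xp, $name) {
            // closing tags are not of interest here
        }
    );
    xml_parse($parser, $xml, true);

    return $names;
}

print_r(tagNames('<Hero name="Batman"/>', true));   // names are upper-cased
print_r(tagNames('<Hero name="Batman"/>', false));  // names are kept as written
```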
Next we implemented a method to handle all opening XML tags, which has to be called startHandler() in order to be registered by XML_Parser. This method needs to accept three parameters: the resource id of the current parser (which you will probably never need), the name of the opening tag and an associative array containing all attributes of the current tag. In this method, we use a switch statement to decide, based on the name of the tag, what needs to be done. If the tag team is found, we store its attributes in the object properties; if hero is found, we store the name of the new hero and create a new entry in the $heroes array. When any other tag is found, we only store the name of the XML tag and continue parsing the document.
Following the startHandler() method, we implemented a matching endHandler() method, which only accepts the parser resource and the name of the closing tag, as closing tags do not contain attributes.
The last method needed for the XML processing is cdataHandler(), which accepts the parser resource and the data found between the tags. You should be aware that this method is not necessarily called once for all data between two tags; it may be called several times, for example whenever a line break is found. In this example, we use very simple character data handling, which relies on the assumption that the character data contains no line breaks. If the data only contains white space, it is discarded; otherwise it is stored in the array of the current hero.
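If character data may contain line breaks, a common way around this limitation is to buffer the data and evaluate it only when the closing tag arrives. The sketch below shows the idea with the native xml_* functions; elementTexts() is an illustrative helper that is not part of XML_Parser, and it simply keeps the last text seen per tag name.

```php
<?php
// Collect the text of every element, buffering character data until the
// closing tag arrives. elementTexts() is an illustrative helper.
function elementTexts($xml)
{
    $texts  = array();
    $buffer = '';

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($xp, $name, $attribs) use (&$buffer) {
            // A new element starts: begin with an empty buffer.
            $buffer = '';
        },
        function ($xp, $name) use (&$texts, &$buffer) {
            // The element is complete: store its text in one piece.
            $text = trim($buffer);
            if ($text !== '') {
                $texts[$name] = $text;
            }
            $buffer = '';
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($xp, $data) use (&$buffer) {
            // May be called several times per element, so only append here.
            $buffer .= $data;
        }
    );
    xml_parse($parser, $xml, true);

    return $texts;
}

print_r(elementTexts(
    "<hero><realname>Bruce\nWayne</realname><city>Gotham City</city></hero>"
));
```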
And last but not least, we implemented a method getHeroes() to access the data that was extracted from the example document. In real life there would surely be more methods to access a single hero or the name of the team, but for our example, one method is sufficient to prove that it is working.
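Put together, the class described above might look roughly like the following sketch. It assumes that the PEAR package XML_Parser is installed and on the include path, that the document uses the attributes name and abbrev (an assumption), and that the handler signatures are exactly those described in the text. If the installed version of XML_Parser declares the attributes parameter of startHandler() by reference, the override has to be declared the same way. A shortened document with a single hero is parsed from a string to keep the sketch self-contained.

```php
<?php
require_once 'XML/Parser.php';

// Sketch of the parser class described in the text. The attribute names
// "name" and "abbrev" are assumptions about the example document.
class Team extends XML_Parser
{
    var $name;
    var $abbrev;
    var $heroes = array();
    var $currentHero;        // name of the hero that is currently being read
    var $currentTag;         // tag whose character data is expected next
    var $folding = false;    // keep tag names as written

    function startHandler($xp, $name, $attribs)
    {
        switch ($name) {
            case 'team':
                $this->name   = $attribs['name'];
                $this->abbrev = $attribs['abbrev'];
                break;
            case 'hero':
                $this->currentHero = $attribs['name'];
                $this->heroes[$this->currentHero] = array();
                break;
            default:
                $this->currentTag = $name;
                break;
        }
    }

    function endHandler($xp, $name)
    {
        $this->currentTag = null;
    }

    function cdataHandler($xp, $data)
    {
        // Ignore white space and data outside of the tags we track.
        if (trim($data) === '' || $this->currentTag === null) {
            return;
        }
        $this->heroes[$this->currentHero][$this->currentTag] = $data;
    }

    function getHeroes()
    {
        return $this->heroes;
    }
}

$team = new Team();
$team->setInputString(
    '<team name="Justice League of America" abbrev="JLA">'
    . '<hero name="Batman"><realname>Bruce Wayne</realname>'
    . '<city>Gotham City</city></hero></team>'
);
$result = $team->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}
print_r($team->getHeroes());
```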
Now that all methods to handle the data have been implemented, let us take a look at how to actually parse the document:
After creating a new instance of our parser, we set the path of the file that contains the XML document using setInputFile() and then only need to call the parse() method to trigger the parsing. This method will either return true if everything went well, or an instance of XML_Parser_Error if the document could not be parsed, e.g. because the XML document was not well formed. If you need to parse XML documents that have been generated on the fly, you may as well use setInputString() to parse a string or setInput() to parse data that will be read directly from a resource, like a network socket. If you run this script, you will get the following output:
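The different input sources and the error check can be sketched as follows, assuming the PEAR package XML_Parser is installed. The helper describeResult() is made up for this illustration, and a temporary file created with tmpfile() stands in for a socket or any other opened resource.

```php
<?php
require_once 'XML/Parser.php';

// Parse whatever input has been set and report the outcome as a string.
// describeResult() is an illustrative helper, not part of XML_Parser.
function describeResult($parser)
{
    $result = $parser->parse();
    if (PEAR::isError($result)) {
        return 'failed: ' . $result->getMessage();
    }
    return 'ok';
}

// 1. A string, for example a document that was generated on the fly.
$parser = new XML_Parser();
$parser->setInputString('<team><hero/></team>');
$fromString = describeResult($parser);

// 2. An already opened resource; tmpfile() stands in for a socket here.
$fp = tmpfile();
fwrite($fp, '<team><hero/></team>');
rewind($fp);
$parser = new XML_Parser();
$parser->setInput($fp);
$fromResource = describeResult($parser);

// 3. A document that is not well formed yields an error object.
$parser = new XML_Parser();
$parser->setInputString('<team><hero></team>');
$fromBroken = describeResult($parser);

echo $fromString, "\n", $fromResource, "\n", $fromBroken, "\n";
```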
Array ( [Superman] => Array ( [realname] => Clark Kent [city] => Metropolis ) [Batman] => Array ( [realname] => Bruce Wayne [city] => Gotham City ) [The Flash] => Array ( [realname] => Wally West [city] => Keystone City ) [Aquaman] => Array ( [realname] => Arthur Curry [city] => Sub Diego ) )
This data can easily be processed by your PHP applications, and you've just successfully parsed your first XML document.
No multiple inheritance
A huge drawback of XML_Parser has always been that you had to extend the XML_Parser class in order to use it. As PHP does not support multiple inheritance, you were stuck if you wanted to use XML_Parser in conjunction with a class that already extended another class. However, this changed with XML_Parser 1.2.0, which introduced the new method setHandlerObj(). It allows you to specify any object whose methods should be used as callbacks. So to free the Team class from the XML_Parser base class, we only need to modify the class definition a little bit:
Besides removing the extends statement, we also removed the $folding property, as it is no longer needed inside the class. Of course the usage of the class needs to be adjusted as well:
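A sketch of both changes, assuming the PEAR package XML_Parser (1.2.0 or later) is installed: the Team class no longer extends anything, and a plain XML_Parser instance is told to use it as the handler object. Setting the public $folding property on the parser object and the attribute name used for the hero are assumptions of this sketch.

```php
<?php
require_once 'XML/Parser.php';

// The handler class is now independent of XML_Parser.
class Team
{
    var $heroes = array();
    var $currentHero;
    var $currentTag;

    function startHandler($xp, $name, $attribs)
    {
        switch ($name) {
            case 'team':
                break;
            case 'hero':
                $this->currentHero = $attribs['name'];
                $this->heroes[$this->currentHero] = array();
                break;
            default:
                $this->currentTag = $name;
                break;
        }
    }

    function endHandler($xp, $name)
    {
        $this->currentTag = null;
    }

    function cdataHandler($xp, $data)
    {
        if (trim($data) === '' || $this->currentTag === null) {
            return;
        }
        $this->heroes[$this->currentHero][$this->currentTag] = $data;
    }

    function getHeroes()
    {
        return $this->heroes;
    }
}

$team   = new Team();
$parser = new XML_Parser();
$parser->folding = false;          // case-folding is now set on the parser
$parser->setHandlerObj($team);     // use the methods of $team as callbacks
$parser->setInputString(
    '<team><hero name="Superman"><realname>Clark Kent</realname>'
    . '<city>Metropolis</city></hero></team>'
);
$result = $parser->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}
print_r($team->getHeroes());
```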
While this leads to a bit more code, it is a lot more flexible in most situations, as you can easily reuse one parser object with different handler objects or the other way round.
Getting rid of the switch
What you may still find annoying is the need for the switch statement inside the start element handler. If this is the case, XML_Parser is able to make your life happier, as it provides a second mode of operation which creates switch-free parsers.
To enable this mode, you'll have to tell XML_Parser which mode to use when creating a new object:
When instantiating XML_Parser without any parameters, it is started in "event" mode, where you define event handlers. The difference between the two modes is that in "event" mode you implement one method to handle all opening tags and one to handle all closing tags, while in the second mode, called "func", you can define a separate method for each opening or closing tag, which will be called depending on the tag name. All you have to do is name the handlers for the opening tags xmltag_TAGNAME() and the handlers for the closing tags xmltag_TAGNAME_() (notice the trailing underscore), where TAGNAME needs to be replaced with the name of the actual tag. The method signatures are identical to those of the handlers in the previous examples.
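As a sketch of this mode, assuming the PEAR package XML_Parser is installed: the mode is passed as the second constructor argument, and the handler object defines one method per tag instead of a switch. The attribute name of the hero tag is an assumption about the example document.

```php
<?php
require_once 'XML/Parser.php';

// Handler class for "func" mode: one method per tag, no switch needed.
class Team
{
    var $heroes = array();
    var $currentHero;
    var $currentTag;

    // Opening hero tag: start a new entry.
    function xmltag_hero($xp, $name, $attribs)
    {
        $this->currentHero = $attribs['name'];
        $this->heroes[$this->currentHero] = array();
    }

    // Opening and closing handlers for the tags that carry character data.
    function xmltag_realname($xp, $name, $attribs)
    {
        $this->currentTag = $name;
    }

    function xmltag_realname_($xp, $name)
    {
        $this->currentTag = null;
    }

    function xmltag_city($xp, $name, $attribs)
    {
        $this->currentTag = $name;
    }

    function xmltag_city_($xp, $name)
    {
        $this->currentTag = null;
    }

    function cdataHandler($xp, $data)
    {
        if (trim($data) === '' || $this->currentTag === null) {
            return;
        }
        $this->heroes[$this->currentHero][$this->currentTag] = $data;
    }
}

$team   = new Team();
$parser = new XML_Parser(null, 'func');   // second argument selects the mode
$parser->folding = false;
$parser->setHandlerObj($team);
$parser->setInputString(
    '<team><hero name="The Flash"><realname>Wally West</realname>'
    . '<city>Keystone City</city></hero></team>'
);
$result = $parser->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}
print_r($team->heroes);
```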
Important: Make sure that you have at least XML_Parser version 1.2.4 installed in order for this example to work.
Working with encodings
Until now, we've only processed XML documents that contained information about American superheroes. But imagine your task was to parse the following XML document:
As you can see in the XML declaration, this document is UTF-8 encoded, as it contains German umlauts. In your script you would probably prefer to get these strings as ISO-8859-1, as this is easier to work with when displaying them to the user.
All you need to change in your code is the part where you are creating the XML_Parser instance to:
The first parameter is the source encoding of the XML document (ISO-8859-1 is assumed if left out), the second parameter is the parsing mode and the last parameter is used to specify the encoding you would like to use in your element and character data handlers. If you skip the last parameter, the encoding of the document is used as target encoding. (PHP's XML functions are able to work with UTF-8, ISO-8859-1 and US-ASCII.) If you set the source encoding to UTF-8 and the target encoding to ISO-8859-1, you will see this output:
Array ( [Captain Überpower] => Array ( [realname] => Unknown [city] => Los Angeles ) [Bizzaro Män] => Array ( [realname] => Unknown [city] => The Twin Cities ) )
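The constructor call described above can be sketched as follows, assuming the PEAR package XML_Parser is installed. The class HeroNames is a hypothetical minimal handler that only collects the name attribute of each hero, and the umlaut is written as an escaped UTF-8 byte sequence so that the sketch does not depend on the encoding of the script file.

```php
<?php
require_once 'XML/Parser.php';

// Hypothetical minimal handler: collects the "name" attribute of each hero.
class HeroNames
{
    var $names = array();

    function startHandler($xp, $name, $attribs)
    {
        if ($name === 'hero') {
            $this->names[] = $attribs['name'];
        }
    }

    function endHandler($xp, $name)
    {
    }
}

// "\xC3\x9C" is the UTF-8 byte sequence of the letter U with umlaut.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<team><hero name=\"Captain \xC3\x9Cberpower\"/></team>";

// Source encoding, parsing mode, target encoding.
$parser  = new XML_Parser('UTF-8', 'event', 'ISO-8859-1');
$parser->folding = false;
$handler = new HeroNames();
$parser->setHandlerObj($handler);
$parser->setInputString($xml);
$result = $parser->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}

// In ISO-8859-1 the umlaut is the single byte 0xDC.
echo bin2hex($handler->names[0]), "\n";
```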
Making it even simpler
Although XML_Parser already does a lot of work for you, parsing XML documents can be even simpler: since version 1.2.0 XML_Parser provides a second class called XML_Parser_Simple. When using this class, XML_Parser will store attributes and character data on a stack internally and only call a handler when the closing tag is found. This way you can handle all information of a tag at once. This handler has to be called handleElement() and needs to accept three parameters: the name of the tag, an associative array containing all attributes and, last but not least, the character data that has been found inside the tag.
To parse our example document, the following code is needed.
In the handleElement() method we implemented a simple switch that checks which element has just been closed and stores its data accordingly.
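A sketch of such a parser, assuming the PEAR package XML_Parser is installed and that XML_Parser_Simple lives in XML/Parser/Simple.php. The handler is registered as a separate object via setHandlerObj() here, which is one possible arrangement and not necessarily the one used in the original listing; the helper property $current is made up for this illustration.

```php
<?php
require_once 'XML/Parser/Simple.php';

// Handler for XML_Parser_Simple: called once per element, on its closing tag.
class Team
{
    var $heroes  = array();
    var $current = array();   // data collected for the hero being read

    function handleElement($name, $attribs, $data)
    {
        switch ($name) {
            case 'realname':
            case 'city':
                // Child elements are closed before their parent hero element.
                $this->current[$name] = $data;
                break;
            case 'hero':
                $this->heroes[$attribs['name']] = $this->current;
                $this->current = array();
                break;
        }
    }
}

$team   = new Team();
$parser = new XML_Parser_Simple();
$parser->folding = false;
$parser->setHandlerObj($team);
$parser->setInputString(
    '<team><hero name="Aquaman"><realname>Arthur Curry</realname>'
    . '<city>Sub Diego</city></hero></team>'
);
$result = $parser->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}
print_r($team->heroes);
```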
Embedding PHP in XML
In the last example, we will use processing instructions to embed PHP code in the XML document, which will be executed and replaced with the result of the PHP code. For this, we need to modify our XML document a little bit:
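If the parser now encounters a processing instruction, it calls the piHandler() method, passing the parser resource, the target of the instruction and the data it contains. In our case the data is PHP code, which can be executed while its output is captured.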
And to return it to the parser as character data, we just need to call the cdataHandler() method manually and pass the captured result. In order to be able to do this, the Team object needs a reference to the actual parser object, and thus we add a new property $parser in the Team object as well as a setter method. After all these changes, the code will be:
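The following sketch shows how these pieces could fit together. It assumes the PEAR package XML_Parser is installed, PHP 5 or later (so that objects are passed as handles), and a processing instruction with the target php; the setter name setParser() and the sample document are made up for this illustration. Since eval() executes whatever code the document contains, this approach is only suitable for documents you fully trust.

```php
<?php
require_once 'XML/Parser/Simple.php';

class Team
{
    var $heroes  = array();
    var $current = array();
    var $parser;              // reference to the parser that drives us

    function setParser($parser)
    {
        $this->parser = $parser;
    }

    // Called for every processing instruction in the document.
    function piHandler($xp, $target, $data)
    {
        if ($target !== 'php') {
            return;
        }
        // Execute the embedded code and capture everything it prints.
        ob_start();
        eval($data);
        $output = ob_get_clean();
        // Hand the output back to the parser as if it were character data.
        $this->parser->cdataHandler($xp, $output);
    }

    function handleElement($name, $attribs, $data)
    {
        switch ($name) {
            case 'realname':
            case 'city':
                $this->current[$name] = $data;
                break;
            case 'hero':
                $this->heroes[$attribs['name']] = $this->current;
                $this->current = array();
                break;
        }
    }
}

$xml = '<team><hero name="Batman"><realname>Bruce Wayne</realname>'
     . '<city><?php echo strtoupper("gotham city"); ?></city></hero></team>';

$team   = new Team();
$parser = new XML_Parser_Simple();
$parser->folding = false;
$parser->setHandlerObj($team);
$team->setParser($parser);
$parser->setInputString($xml);
$result = $parser->parse();
if (PEAR::isError($result)) {
    die($result->getMessage());
}
print_r($team->heroes);
```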
Catching events for entities and notation declarations works in exactly the same way and has been left out of this tutorial.
As you've seen, XML_Parser proves to be an extremely valuable tool when it comes to working with XML documents. If you want to learn more about XML_Parser and the PHP functions it is based on, you will find additional information at http://pear.php.net/manual/en/package.xml.xml-parser.php and http://de3.php.net/manual/en/ref.xml.php.