Issue: 02.2004
PHP 5 Meets XML and the DOM
An Intro to PHP 5's rewritten DOM, XSLT, and XPath extensions
by Adam Trachtenberg
Every Web developer needs a few key skills: processing forms, querying databases, and parsing XML. While PHP makes it easy to solve the first two tasks, its XML support has been uneven. PHP 5 remedies this deficiency with first-rate XML utilities. Of the many new XML features, this piece covers the DOM extension because DOM is the largest and most versatile of the XML specifications. It also shows how to use DOM with XSLT and XPath. Read this article to see the future of XML in PHP 5.

Introduction
The mother-of-all XML parsing APIs is the DOM (Document Object Model) - a W3C standard for interfacing with XML documents in a language-independent and platform-neutral manner. Using the DOM you can read, create, edit, save, and search XML documents using PHP.

PHP 4 has a DOM extension bundled with it, but it's lacking in certain ways: non-standard functions, memory leaks, and unimplemented features. This isn't entirely the fault of the extension's authors. In some cases, PHP 4 just isn't equipped to handle the object-oriented features required by the DOM specification. The DOM extension in PHP 5, in contrast, was written from scratch to comply fully with the DOM specifications. While the extension, like PHP 5, is not yet complete, it's a significant improvement over its PHP 4 cousin. However, it does mean that some of the code in this article doesn't work even under the most recent beta version of PHP 5 -- Beta 3. Therefore, if you're looking to explore the DOM under PHP 5, go to http://snaps.php.net and install the latest PHP 5 Snapshot.

The PHP 5 DOM extension is written on top of libxml2, which is an open source XML parser that's part of the GNOME project. This means you need libxml2 installed in order to use DOM in PHP 5. Luckily, most modern versions of Unix bundle libxml2, but you may need to install or update libxml2 in order to use DOM. Note: As PHP 5 gets closer to going final, there may be a version of PHP 5 that includes libxml2.

This article begins with an introduction to the DOM. After this, it covers reading existing XML documents using DOM and shows how you can create new XML documents. Finally, the article wraps up by showing how the DOM extension integrates with the new PHP 5 XSLT and XPath extensions.

As a demonstration on how to use the DOM in PHP 5, the examples in this article use a catalogue of music albums stored in XML. The piece shows how to use PHP to print out all the artists and albums in the collection. It also shows how new entries can be added to the collection using DOM.

Introducing the DOM
The DOM's greatest strength is its comprehensiveness. No matter what you want to do XML-wise, you can do it with DOM. Other APIs, like SAX and XSLT, are great for reading and transforming existing XML documents. However, only DOM lets you create new documents from scratch and add elements to existing files.

However, the DOM's strengths are also its weaknesses. The DOM is quite large and complex because it supports every XML feature. In fact, the DOM specification is broken down into three different levels. Since implementing the entire DOM specification is a Herculean task, this allows developers to more easily deliver a useful subset of features. Additionally, DOM never assumes anything about your document or what you want to do with it, so even basic tasks can take a few steps. Depending upon your viewpoint, this is either a plus or a minus. Sometimes, as you'll see, this means it's easier to do things in XSLT or XPath.

Conceptually, DOM treats XML documents as trees. When you turn an XML document into a DOM object, DOM turns every part of the document - elements, attributes, pieces of text, etc. - into a node. Then, by using different methods, you can navigate through the tree by visiting a node's children, siblings, and parent. Alternatively, you can retrieve multiple elements at once if they have the same tag name. Here's an example that shows both an XML document and its DOM representation:

<artist id="1">
  <name>The Rolling Stones</name>
</artist>


Figure 1: A DOM representation of an XML document

The name element is a child of the artist element. This is clear from viewing the XML. However, as you can see, the text The Rolling Stones is not part of name. Instead, it's a separate text node in and of itself. This text node is placed as a child of the name node.

Furthermore, to get access to the physical piece of text The Rolling Stones, in contrast to the text object holding the data, you need to access the nodeValue attribute of an object. Therefore, if $name is a variable holding the name node, you can't do print $name. Instead you need to do print $name->firstChild->nodeValue. This tells DOM to grab the text node (which is the first and only child of $name) and pull the actual text from the object. While this may not be how you'd design the DOM, that's what DOM needs to do in order to ensure that every possible variation of XML is available using its methods.
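The relationship between an element, its text node, and the text itself can be sketched in a few lines. This is only an illustration: it assumes the one-artist document from Figure 1 is inlined as a string rather than loaded from a file, and the variable names are made up for the example.

```php
<?php
// Assumption: the Figure 1 document, inlined so the sketch is self-contained.
$doc = new domDocument;
$doc->loadXML('<artist id="1"><name>The Rolling Stones</name></artist>');

$name = $doc->documentElement->firstChild;  // the name element
$text = $name->firstChild;                  // the text node inside name

print $name->nodeName . "\n";               // name
print $text->nodeName . "\n";               // #text
print $text->nodeValue . "\n";              // The Rolling Stones

// Navigation works upward as well as downward.
print $text->parentNode->nodeName . "\n";   // name
print $name->parentNode->nodeName . "\n";   // artist
```

The text node reports its name as #text, which is a handy way to tell text nodes apart from elements while walking a tree.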

Listing 1 shows the music catalogue, saved as music.xml, on which the examples are based:

Listing 1

<music>
  <artist id="1">
    <name>The Rolling Stones</name>
    <albums>
      <title>Exile On Main Street</title>
    </albums>
  </artist>
  <artist id="2">
    <name>Aimee Mann</name>
    <albums>
      <title>I'm With Stupid</title>
      <title>Bachelor No. 2</title>
    </albums>
  </artist>
</music>

Reading XML With the DOM
When you represent an XML document as a DOM object in PHP, it's an instance of the domDocument class. Create a domDocument by instantiating it just like any other class:

$music = new domDocument;

Then you can choose to set any parsing options. By default, the DOM extension follows the XML specification and treats whitespace as meaningful content. That means, for instance, that the whitespace between one element's closing tag and the next element's opening tag is turned into a text node.

For this file, however, whitespace is not important and you don't want it to be considered part of the document. Thankfully, you can automatically eliminate whitespace by setting the preserveWhiteSpace attribute to false, like this:

$music->preserveWhiteSpace = false;

Once you've finished telling DOM how to parse the file, load in an XML document. If your XML is in a file, use the load() method. If it's in a variable, use loadXML(). For instance, since the XML holding the music is stored in music.xml:

$music->load('music.xml');
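To see what the whitespace setting changes, the following sketch loads the same fragment twice with loadXML() and counts the children of the root element. The fragment is a made-up stand-in for part of music.xml, and the variable names are illustrative only.

```php
<?php
// Assumption: a small indented fragment stands in for music.xml.
$xml = "<albums>\n  <title>I'm With Stupid</title>\n  <title>Bachelor No. 2</title>\n</albums>";

// Default behavior: whitespace between tags becomes text nodes.
$keep = new domDocument;
$keep->loadXML($xml);

// The option must be set before the document is loaded.
$strip = new domDocument;
$strip->preserveWhiteSpace = false;
$strip->loadXML($xml);

$keepCount  = $keep->documentElement->childNodes->length;
$stripCount = $strip->documentElement->childNodes->length;

print $keepCount . "\n";   // 5: three whitespace text nodes plus two title elements
print $stripCount . "\n";  // 2: only the title elements remain
```

Because the loops in the listings that follow assume that every child of albums is a title element, they depend on preserveWhiteSpace being false.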

Now you can retrieve the data from the object. The easiest way to gather a group of elements from a DOM object is by using the getElementsByTagName() method. This method returns a node list containing all of the elements below the current node that match your query. For example, the code in Listing 2 demonstrates how to locate and print out the names of all the artists in the collection.

Listing 2

$names = $music->getElementsByTagName('name');
foreach ($names as $name) {
    print $name->firstChild->nodeValue . "\n";
}

The Rolling Stones
Aimee Mann

Since each name element only has one child, the best way to retrieve the element's text is through the firstChild's nodeValue. Remember, in the DOM, nothing is assumed, so the name elements don't hold the text; instead, the text is a child node. Also, once you have a text node, you still need to explicitly request its nodeValue to access the text itself. There's a difference in DOM between a node and its contents.

Things are slightly more complex when an element contains more than one child. For instance, to retrieve all the albums, you could do a getElementsByTagName() request for title. However, this doesn't allow you to distinguish one artist's albums from another's. Solve this problem by requesting albums instead of title, as demonstrated in Listing 3.

Listing 3

$albums = $music->getElementsByTagName('albums');
foreach ($albums as $album) {
    foreach ($album->childNodes as $title) {
        print $title->firstChild->nodeValue . "\n";
    }
}

Exile On Main Street
I'm With Stupid
Bachelor No. 2

In this case, each albums element may have more than one child; therefore, you can't just print firstChild->nodeValue. Instead, loop through the childNodes attribute of the element to ensure you catch all the albums. The childNodes attribute is what's known in DOM as a node list, so, even if there's only one child, you're guaranteed that your foreach will work. (In fact, even if there are no children, childNodes returns an empty node list for you, so the loop won't generate an error.)

As shown in Listing 4, structuring your code this way makes it easy to convert this information into an HTML list with albums divided by artist.

Listing 4

$albums = $music->getElementsByTagName('albums');
foreach ($albums as $album) {
    print "<ul>";
    foreach ($album->childNodes as $title) {
        print "<li>" .
            $title->firstChild->nodeValue .
            "</li>";
    }
    print "</ul>";
}

Alternatively, you can create an HTML table that contains each artist and their albums, as detailed in Listing 5.

Listing 5

$artists = $music->documentElement;
print "<table>\n";
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    $name = $names->item(0)->firstChild->nodeValue;

    $titles = $artist->getElementsByTagName('title');
    foreach ($titles as $title) {
        print "<tr><td>$name</td>";
        print "<td>" .
            $title->firstChild->nodeValue .
            "</td></tr>\n";
    }
}
print "</table>\n";


Figure 2: HTML Table Displaying Album Data

The HTML table code in Listing 5 not only builds on the previous example, but also introduces a few new DOM features.

To begin with, the first line references $music->documentElement. This attribute is a reference to the root of the XML tree. Since every XML document is required to have one and only one root element, DOM can just use documentElement to refer to that spot in the tree and not worry about sibling elements.

In Listing 5, inside the first foreach loop, there are two calls to getElementsByTagName(). This time, instead of calling the method on the original DOM object, they're invoked on the children: $artist. However, that's okay. It's perfectly legal to call getElementsByTagName() on domElements as well as domDocuments. A domElement is a DOM object that represents an XML element instead of an entire document. (Heads up! There's a bug in this method in PHP 5 Beta 3 and earlier.)

Again, since there's only one name per artist element, the text can be accessed directly as $names->item(0)->firstChild->nodeValue. Since getElementsByTagName() returns a node list, the first (and only) element is in position 0. Under the PHP 5 DOM extension, you can access the values of a node list using the item() method.
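A short sketch of how a node list behaves may help here. It assumes a trimmed-down, inlined document with an empty albums element (not part of music.xml) so that the empty case can be shown too; the variable names are illustrative.

```php
<?php
// Assumption: a reduced document with an artist who has no albums yet.
$doc = new domDocument;
$doc->loadXML('<music><artist id="1"><name>The Rolling Stones</name><albums/></artist></music>');

$names = $doc->getElementsByTagName('name');
print $names->length . "\n";                            // 1
print $names->item(0)->firstChild->nodeValue . "\n";    // The Rolling Stones

// Asking for a position that doesn't exist yields null, not an error.
var_dump($names->item(5));                              // NULL

// An element without children still has a (zero-length) node list.
$albums = $doc->getElementsByTagName('albums')->item(0);
$seen = 0;
foreach ($albums->childNodes as $title) {
    $seen++;                                            // never reached
}
print $albums->childNodes->length . "\n";               // 0
```

Checking length, or testing the result of item() against null, is a cheap guard before dereferencing firstChild on a result that might not exist.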

Next, in Listing 5, there's another foreach() that iterates through every album title. This requires another call to getElementsByTagName() to retrieve the albums, but otherwise this code is similar to what's done earlier. The one change is that there's additional code to print the artist name and the new table HTML.

Writing XML
The DOM lets you do more than just iterate through pre-existing XML elements and print out their values. You can also use DOM to append new information to the document. For example, say you just downloaded a new album - Sticky Fingers by The Rolling Stones - and want to add it to the list. It's easy to do this using DOM. The task breaks down into two high-level steps: creating the new information and then appending the data in the correct place within the tree.

To create a new DOM element, create a new instance of the domElement class. Pass the element's name as the first argument. If the element is to contain just a text node, as in this case, pass the text as the second argument:

$newAlbum = new domElement('title',
'Sticky Fingers');

Once that's done, the element needs to be added to the master document. Unfortunately, it can't be placed just anywhere in the tree - it needs to be added as a sibling to the other Rolling Stones album that is already there. This is easier said than done using DOM because DOM doesn't provide any fine-grained querying abilities. Therefore, you again need to loop through the artists, check for the one who has a name child element whose text is The Rolling Stones, and append the new node to that artist's set of albums. This process is demonstrated in Listing 6.

Listing 6

$artists = $music->documentElement;
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    if ('The Rolling Stones' ==
        $names->item(0)->firstChild->nodeValue) {
        // search below the matching artist, not the whole document
        $albums = $artist->getElementsByTagName('albums');
        $albums->item(0)->appendChild($newAlbum);
        break;
    }
}

Most of this code is PHP to search through the DOM object. However, in the middle, the magic happens. The new album belongs as a child of the albums element. When the loop reaches The Rolling Stones, it retrieves that artist's albums element using getElementsByTagName('albums'). Again, by design, there's only one result, so it's located in $albums->item(0). To add a new child to this element, call the appendChild() method and pass the just-created $newAlbum as its argument. Now the album has been added to the tree.

One way to double-check the results is to convert the DOM object back into XML and inspect it yourself, as shown in Listing 7.

Listing 7

// indent elements
$music->formatOutput = true;
print $music->saveXML();

<?xml version="1.0"?>
<music>
  <artist id="1">
    <name>The Rolling Stones</name>
    <albums>
      <title>Exile On Main Street</title>
      <title>Sticky Fingers</title>
    </albums>
  </artist>
  <artist id="2">
    <name>Aimee Mann</name>
    <albums>
      <title>I'm With Stupid</title>
      <title>Bachelor No. 2</title>
    </albums>
  </artist>
</music>
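Printing the XML is useful for inspection, but the updated catalogue usually has to be written back to disk. One way to do that, sketched here under the assumption that the extension's save() method is available in your build, is to write the document to a file and load it back to confirm the round trip. The temporary file path and the one-artist document are made up for the example.

```php
<?php
// Assumption: a one-artist document stands in for the full catalogue.
$doc = new domDocument;
$doc->preserveWhiteSpace = false;
$doc->loadXML('<music><artist id="1"><name>The Rolling Stones</name></artist></music>');
$doc->formatOutput = true;

// Hypothetical location; a real script would write to music.xml instead.
$path = tempnam(sys_get_temp_dir(), 'music');

// save() returns the number of bytes written, or false on failure.
$bytes = $doc->save($path);
if ($bytes === false) {
    print "Could not write $path\n";
} else {
    print "Wrote $bytes bytes\n";
}

// Load the file back to confirm the data survived the round trip.
$copy = new domDocument;
$copy->load($path);
$savedName = $copy->getElementsByTagName('name')->item(0)->firstChild->nodeValue;
print $savedName . "\n";   // The Rolling Stones

unlink($path);             // clean up the temporary file
```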

That example was relatively straightforward because there were only two new nodes to add: the title element and the text node that contains Sticky Fingers. However, it's slightly more complicated to add a new artist to the end of the tree. For instance, to know what to do when your copy of Elvis Presley #1 arrives, see the code in Listing 8. This code is a little more complex because it creates multiple elements, but the concept is the same as before.

Listing 8

$artists = $music->documentElement;
$lastId = $artists->lastChild->getAttribute('id');

$newArtist = $artists->appendChild(
    new domElement('artist'));
$newArtist->setAttribute('id', $lastId + 1);
$newArtist->appendChild(
    new domElement('name', 'Elvis Presley'));

$newAlbums = $newArtist->appendChild(
    new domElement('albums'));
$newAlbums->appendChild(
    new domElement('title', '#1'));

Let's look at the code in Listing 8 in some detail. For starters, artists have a unique ID associated with them because bands sometimes change names, so, to add a new artist to the collection, you need to set its id attribute. In this case, the new ID number is one more than the previous high. Since new artists are added to the end of the list, take the value of the last artist's id attribute and increment it by one. You could loop through $music->documentElement->childNodes to find the last element in that node list, but the fastest way to identify the end element is to use $music->documentElement->lastChild. (Like I said, DOM is comprehensive, so if there's a firstChild, there's bound to be a lastChild.)

The getAttribute() and setAttribute() methods allow you to read and write element attribute values. Therefore, getAttribute('id') grabs the ID value and stores it for later use.

Next, there's a new artist element. Unlike before, there's only one argument, since this element is a container that holds the name and the albums. Earlier, the code to create the new element and append it was divided into two distinct steps. However, you can directly pass the new element to appendChild() without storing it in a temporary variable. That's what happens here.
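The attribute handling can be tried in isolation. The sketch below assumes a stripped-down catalogue with two empty artist elements; the explicit casts are an addition for clarity and are not taken from Listing 8.

```php
<?php
// Assumption: two empty artist elements stand in for the real catalogue.
$doc = new domDocument;
$doc->loadXML('<music><artist id="1"/><artist id="2"/></music>');
$root = $doc->documentElement;

// Attribute values come back as strings.
$lastId = $root->lastChild->getAttribute('id');           // "2"

// appendChild() returns the appended node, so it can be modified afterwards.
$newArtist = $root->appendChild(new domElement('artist'));
$newArtist->setAttribute('id', (string) ((int) $lastId + 1));

$newId = $newArtist->getAttribute('id');
print $newId . "\n";                                       // 3

// Reading an attribute that doesn't exist returns an empty string.
var_dump($newArtist->getAttribute('genre'));               // string(0) ""
var_dump($newArtist->hasAttribute('genre'));               // bool(false)
```

Note that the attribute is set on the node returned by appendChild(), that is, after the element has become part of the document.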

After you've appended the new element, set its ID using $newArtist->setAttribute('id', $lastId + 1). Now you append the name and albums elements to $newArtist. Since the title element lives below albums, you append that node to $newAlbums instead of $newArtist. By the way, using the code shown in Listing 8 with multiple users requires some form of file locking because of concurrency issues, but that's another topic.

Of course, having two separate pieces of code - one to update existing entries and one to add new entries - is cumbersome. It's cleaner to have one function that tries to append the new album to an artist's record, but will create a new entry if the artist isn't already in the system. This only requires a slight code reorganization, as is demonstrated in Listing 9. In the code, the addAlbum() function first loops through the existing entries looking for a match. If it finds the artist, it appends the new album and returns. However, if the loop completes without success, you know you need to create a new entry and the function contains the code to do that, too.

Listing 9

function addAlbum($music, $theArtist, $theAlbum) {
    $artists = $music->documentElement;

    // try to insert new album
    // under existing artist
    foreach ($artists->childNodes as $artist) {
        $names = $artist->getElementsByTagName('name');
        if ($theArtist ==
            $names->item(0)->firstChild->nodeValue) {
            $albums = $artist->getElementsByTagName('albums');
            $albums->item(0)->appendChild(
                new domElement('title', $theAlbum));
            return;
        }
    }

    // new artist, so create a whole new entry
    $lastId = $artists->lastChild->getAttribute('id');

    $newArtist = $artists->appendChild(
        new domElement('artist'));
    $newArtist->setAttribute('id', $lastId + 1);
    $newArtist->appendChild(
        new domElement('name', $theArtist));

    $newAlbums = $newArtist->appendChild(
        new domElement('albums'));
    $newAlbums->appendChild(
        new domElement('title', $theAlbum));
}

addAlbum($music, 'Elvis Presley', '#1');

XSLT
XSLT is a way to transform XML into other formats: HTML, plain text, or even another XML document. PHP 4 had an XSLT extension, but it didn't work together with the DOM extension. This made it difficult to use XSLT with a DOM object because you had to serialize it to a string or file before you could feed it to the XSLT processor. In PHP 5, the XSLT extension takes DOM objects as its input, so it's simple to use the two extensions together. You pass the XSLT processor a DOM object that holds your stylesheet and then you have it transform other DOM objects created from your source XML documents.

We'll now look at how to create the same HTML table that lists all your albums and their artists using XSLT instead of DOM. The first step, as shown in Listing 10, is to create the stylesheet and save it as music.xsl.

Listing 10

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="html"/>

  <xsl:template match="/">
    <table>
      <xsl:apply-templates
          select="music/artist/albums/title"/>
    </table>
  </xsl:template>

  <xsl:template
      match="music/artist/albums/title">
    <tr>
      <td><xsl:value-of select="../../name"/></td>
      <td><xsl:value-of select="."/></td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

This stylesheet wraps a table tag around the rows and then tells the processor to process all the music/artist/albums/title elements. The second template prints out the table row and cell tags and retrieves the necessary data to fill them in. In XSLT, "." (a single dot) is shorthand for the current element and ".." (two dots) means back up one level, much like paths in a Unix shell. Therefore, select="../../name" means go up two levels from the current location, to the artist element, and take its name child. Likewise, select="." means take the current element: title.

With the stylesheet complete, you can now create the XSLT processor, load the stylesheet into it, and do the transformation, as shown in Listing 11.

Listing 11

$xslt = new xsltProcessor;

$xsl = domDocument::load('music.xsl');
$xslt->importStylesheet($xsl);

$xml = domDocument::load('music.xml');
print $xslt->transformToXML($xml);

<table>
<tr>
<td>The Rolling Stones</td>
<td>Exile On Main Street</td>
</tr>
<tr>
<td>Aimee Mann</td>
<td>I'm With Stupid</td>
</tr>
<tr>
<td>Aimee Mann</td>
<td>Bachelor No. 2</td>
</tr>
</table>

Just like domDocument is the class for DOM objects, xsltProcessor is the class for XSLT processor objects. After creating a new instance of the processor, import the stylesheet by passing a DOM object to the importStylesheet() method. This example calls load() statically to create a DOM object and load the XML in one step. Now the $xslt object is ready to process data. The transformToXML() method takes a DOM object and returns a string, which you can print or otherwise use. Alternatively, the transformToDoc() method returns a domDocument and transformToURI() saves the output to disk. Pass the filename as the second argument to transformToURI().
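Later PHP releases no longer allow load() to be called statically, so a more portable variant creates each domDocument first and then loads into it. The sketch below takes that approach. It is an illustration under several assumptions: the stylesheet and catalogue are inlined and much smaller than Listings 1 and 10, the stylesheet emits plain text instead of an HTML table, and the whole thing is skipped when the XSLT extension isn't loaded.

```php
<?php
// Assumption: a reduced stylesheet that prints one artist name per line.
$stylesheet = <<<'XSL'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="music/artist">
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Assumption: a reduced catalogue with names only.
$source = '<music><artist id="1"><name>The Rolling Stones</name></artist>'
        . '<artist id="2"><name>Aimee Mann</name></artist></music>';

$out = null;
if (class_exists('XSLTProcessor')) {
    // Instantiate first, then load: no static calls needed.
    $xsl = new domDocument;
    $xsl->loadXML($stylesheet);

    $xml = new domDocument;
    $xml->loadXML($source);

    $xslt = new xsltProcessor;
    $xslt->importStylesheet($xsl);
    $out = $xslt->transformToXML($xml);
    print $out;   // "The Rolling Stones" and "Aimee Mann", one per line
} else {
    print "The XSLT extension is not available in this build.\n";
}
```

With files on disk, the two loadXML() calls would simply become load('music.xsl') and load('music.xml').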

XPath
XPath is the language used by XSLT and XPointer to specify portions of an XML document. XSLT template select expressions, like ../../name, use XPath.

In PHP 5, you can also use XPath to search DOM objects. This is a great benefit because XPath allows you to craft highly detailed searches, unlike DOM, which only has getElementsByTagName(). This lets you remove loops because you no longer need to iterate over a node list. For example, in addAlbum() (Listing 9), you were forced to loop through every artist checking if their name matched the name of the new artist:

// try to insert new album
// under existing artist
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    if ($theArtist ==
        $names->item(0)->firstChild->nodeValue) {
        $albums = $artist->getElementsByTagName('albums');
        $albums->item(0)->appendChild(
            new domElement('title', $theAlbum));
        return;
    }
}

However, using the XPath extension, the code simplifies considerably:

$xpath = new domXPath($music);
$albums = $xpath->query(
    "/music/artist[name = '$theArtist']/albums");
if ($albums->length) {
    $albums->item(0)->appendChild(
        new domElement('title', $theAlbum));
    return;
}

The new code not only transfers the searching to the XPath extension, but also makes it return the exact node to which you need to append the new element. It's a double win. Start by making a new domXPath object, passing the DOM object to the constructor. Now you can search the object using the domXPath query() method.

The request of /music/artist[name = '$theArtist']/albums means: starting at the root, return all the albums elements that live under artist elements that live under the music element, where the artist element has a child named name whose value is (the value of the PHP variable) $theArtist. While that sentence is a mouthful, it's easy to deconstruct if you consider each / as descending one level into the document and each [ ] as a way to filter those results using a set of boolean tests. As of the time of writing this article, the query() method returns a node list of matching nodes, like getElementsByTagName(). This places the first node in $albums->item(0).
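The predicate syntax is easier to follow with a query you can run on its own. This sketch inlines the catalogue from Listing 1 and uses hard-coded artist names in place of $theArtist; the variable names are illustrative.

```php
<?php
// Assumption: the Listing 1 catalogue, inlined instead of loaded from music.xml.
$catalogue = <<<'XML'
<music>
  <artist id="1"><name>The Rolling Stones</name>
    <albums><title>Exile On Main Street</title></albums></artist>
  <artist id="2"><name>Aimee Mann</name>
    <albums><title>I'm With Stupid</title><title>Bachelor No. 2</title></albums></artist>
</music>
XML;

$music = new domDocument;
$music->loadXML($catalogue);
$xpath = new domXPath($music);

// Each / descends one level; [name = '...'] filters the artist elements.
$titles = $xpath->query("/music/artist[name = 'Aimee Mann']/albums/title");
print $titles->length . "\n";          // 2
foreach ($titles as $title) {
    print $title->nodeValue . "\n";    // I'm With Stupid, then Bachelor No. 2
}

// A search with no matches returns an empty node list.
$none = $xpath->query("/music/artist[name = 'Elvis Presley']/albums");
print $none->length . "\n";            // 0
```

Because the artist name is pasted into the expression between single quotes, a name that itself contains an apostrophe would need extra handling before being used this way.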

With a nodeList object, you check the value of the length attribute to find the number of items in the object. This is similar to calling count() on an array in PHP.

You can also use XPath to find the values of attributes. For instance, to find the value of the id attribute of the final artist, you did:

$artists = $music->documentElement;
$lastId = $artists->lastChild->
getAttribute('id');

In XPath, this is:

$lastIds = $xpath->query(
    '/music/artist[position() = last()]/@id');
$lastId = $lastIds->item(0)->nodeValue;

The query searches for the artist element that is in the last position in the list and then retrieves its id attribute. Placing an @ in front of id tells XPath to take the element's attribute instead of its child. The position() and last() functions are just two of a number of built-in XPath functions that allow you to winnow your search. Unlike getAttribute(), which returns the value of the attribute as a string, the XPath query returns a node list holding a domAttr object. This object maps to an XML attribute, like domElement maps to an element. To get the text out of the object, read its nodeValue property. For more on XPath, check out the specification at http://www.w3.org/TR/xpath.
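When all you want is a single value rather than a node, PHP releases after the one described in this article (5.1 and later) also provide an evaluate() method on domXPath, which returns a typed result where the expression allows it. The sketch below is an assumption-laden illustration: it relies on that later method and uses a reduced, inlined catalogue.

```php
<?php
// Assumption: a reduced catalogue, inlined for the example.
$music = new domDocument;
$music->loadXML(
    '<music>'
  . '<artist id="1"><name>The Rolling Stones</name></artist>'
  . '<artist id="2"><name>Aimee Mann</name></artist>'
  . '</music>');
$xpath = new domXPath($music);

// query() returns a node list holding a domAttr object.
$lastIds = $xpath->query('/music/artist[position() = last()]/@id');
$viaQuery = $lastIds->item(0)->nodeValue;

// evaluate() with the XPath string() function returns the value directly.
$viaEvaluate = $xpath->evaluate('string(/music/artist[last()]/@id)');

// XPath numbers come back as floats.
$artistCount = $xpath->evaluate('count(/music/artist)');

print $viaQuery . "\n";      // 2
print $viaEvaluate . "\n";   // 2
print $artistCount . "\n";   // 2
```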

Conclusion
The new DOM extension in PHP 5 is powerful and implements a larger portion of the specification than the previous version. Additionally, it's compatible with PHP 5's other XML extensions, including XSLT and XPath. These tools combine to create a comprehensive set of XML processing utilities that let you solve all your XML needs. Regardless of your task, the DOM is up to the challenge and when the DOM makes things difficult, you can quickly switch to another extension to solve your problems without a headache. While there are some features that haven't been added to the PHP 5 DOM extension, progress continues every day towards a complete DOM implementation. If you're using DOM and PHP 5, the future is bright.

© 2004 Software & Support Verlag GmbH. Reproduction has to be permitted by the publisher.