Introduction
The mother of all XML parsing APIs is the DOM (Document Object Model), a W3C standard for interfacing with XML documents in a language-independent and platform-neutral manner. Using the DOM, you can read, create, edit, save, and search XML documents using PHP.
PHP 4 has a DOM extension bundled with it, but it's lacking in certain ways:
non-standard functions, memory leaks, and unimplemented features. This isn't entirely
the fault of the extension's authors. In some cases, PHP 4 just isn't equipped
to handle the object-oriented features required by the DOM specification. The
DOM extension in PHP 5, in contrast, was written from scratch to comply fully
with the DOM specifications. While the extension, like PHP 5, is not yet complete,
it's a significant improvement over its PHP 4 cousin. However, it does mean that
some of the code in this article doesn't work even under the most recent beta
version of PHP 5 -- Beta 3. Therefore, if you're looking to explore the DOM under
PHP 5, go to
http://snaps.php.net and install the latest PHP 5 Snapshot.
The PHP 5 DOM extension is written on top of libxml2, which is an open source
XML parser that's part of the GNOME project. This means you need libxml2 installed
in order to use DOM in PHP 5. Luckily, most modern versions of Unix bundle libxml2,
but you may need to install or update libxml2 in order to use DOM.
Note: As PHP 5 gets closer to going final, there may be a version of PHP 5 that includes libxml2.
This article begins with an introduction to the DOM. After this,
it covers reading existing XML documents using DOM and shows how you can create
new XML documents. Finally, the article wraps up by showing how the DOM extension
integrates with the new PHP 5 XSLT and XPath extensions.
As a demonstration on how to use the DOM in PHP 5, the examples in this article use a catalogue of music albums stored in XML. The piece shows how to use PHP to print out all the artists and albums in the collection. It also shows how new entries can be added to the collection using DOM.
Introducing the DOM
The DOM's largest strength is its comprehensiveness. No matter what you want to do XML-wise, you can do it with DOM. Other APIs, like SAX and XSLT, are great for reading and transforming existing XML documents. However, only DOM lets you create new documents from scratch and add elements to existing files.
However, the DOM's strengths are also its weaknesses. The DOM is quite large and complex because it supports every XML feature. In fact, the DOM specification is broken down into three different levels. Since implementing the entire DOM specification is a Herculean task, this allows developers to more easily deliver a useful subset of features.
Additionally, DOM never assumes anything about your document or what you want to do with it, so even basic tasks can take a few steps. Depending upon your viewpoint, this is either a plus or a minus. Sometimes, as you'll see, this means it's easier to do things in XSLT or XPath.
Conceptually, DOM treats XML documents as trees. When you turn an XML document
into a DOM object, DOM turns every part of the document - elements, attributes,
pieces of text, etc. - into a node. Then, by using different methods, you can
navigate through the tree by visiting a node's children, siblings, and parent.
Alternatively, you can retrieve multiple elements at once if they have the same
tag name. Here's an example that shows both an XML document and its DOM representation:
<artist id="1">
<name>The Rolling Stones</name>
</artist>

Figure 1: A DOM representation of an XML document
The name element is a child of the artist element. This is clear from viewing the XML. However, as you can see, the text The Rolling Stones is not part of name. Instead, it's a separate text node in and of itself. This text node is placed as a child of the name node.
Furthermore, to get access to the physical piece of text The Rolling Stones, in contrast to the text object holding the data, you need to access the nodeValue attribute of an object. Therefore, if $name is a variable holding the name node, you can't do print $name. Instead you need to do print $name->firstChild->nodeValue. This tells DOM to grab the text node (which is the first and only child of $name) and pull the actual text from the object. While this may not be how you'd design the DOM, that's what DOM needs to do in order to ensure that every possible variation of XML is available using its methods.
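The distinction between an element and its text node can be sketched as follows. This is a minimal example, assuming a current PHP build with the DOM extension enabled. It parses the fragment from a string with loadXML() instead of a file, and it spells the class DOMDocument, which PHP treats as the same class as domDocument because class names are case-insensitive.

```php
<?php
// Parse the small fragment from Figure 1 out of a string.
$doc = new DOMDocument;
$doc->loadXML('<artist id="1"><name>The Rolling Stones</name></artist>');

// The name element is the first (and only) child of artist.
$name = $doc->documentElement->firstChild;

// The text is a separate node sitting below the name element.
$text = $name->firstChild;

print $name->nodeName . "\n";                                   // name
print ($text->nodeType === XML_TEXT_NODE ? 'text' : 'other') . "\n"; // text
print $text->nodeValue . "\n";                                  // The Rolling Stones
```

The nodeType check is only there to make the point visible: the object holding the string is a node in its own right, not a property of the element.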
Listing 1 shows the XML document, which is saved as
music.xml, upon which the examples are based:
Listing 1
<music>
  <artist id="1">
    <name>The Rolling Stones</name>
    <albums>
      <title>Exile On Main Street</title>
    </albums>
  </artist>
  <artist id="2">
    <name>Aimee Mann</name>
    <albums>
      <title>I'm With Stupid</title>
      <title>Bachelor No. 2</title>
    </albums>
  </artist>
</music>
Reading XML With the DOM
When you represent an XML document as a DOM object in PHP, it's an instance of the
domDocument class. Create a
domDocument by instantiating it just like any other class:
$music = new domDocument;
Then you can choose to set any parsing options. By default, the DOM extension follows the XML specification and treats whitespace as meaningful content. That means, for instance, the whitespace between a closing tag and the next opening tag is turned into a text node.
For this file, however, whitespace is not important and you don't want it to be considered part of the document. Thankfully, you can automatically eliminate whitespace by setting the preserveWhiteSpace attribute to false, like this:
$music->preserveWhiteSpace = false;
Once you've finished telling DOM how to parse the file, load in an XML document. If your XML is in a file, use the
load() method. If it's in a variable, use
loadXML(). For instance, since the XML holding the music is stored in
music.xml:
$music->load('music.xml');
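To see what the whitespace option changes, here is a small sketch that parses the same string twice, once with the default setting and once with preserveWhiteSpace turned off. It uses loadXML() on an inline sample (an assumption made so the example needs no file) and counts the children of the root element.

```php
<?php
$xml = <<<XML
<music>
  <artist id="1">
    <name>The Rolling Stones</name>
  </artist>
</music>
XML;

// Default behaviour: whitespace between tags is kept as text nodes.
$keep = new DOMDocument;
$keep->loadXML($xml);

// The option must be set before loading for it to have any effect.
$strip = new DOMDocument;
$strip->preserveWhiteSpace = false;
$strip->loadXML($xml);

print $keep->documentElement->childNodes->length . "\n";  // 3: text, artist, text
print $strip->documentElement->childNodes->length . "\n"; // 1: artist only
```

With the default setting, code that assumes every child of the root is an artist element would trip over the whitespace text nodes, which is why the listings in this article turn the option off.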
Now you can retrieve the data from the object. The easiest way to gather a group of elements from a DOM object is by using the getElementsByTagName() method. This method returns a node list containing all of the elements below the current node that match your query. For example, the code in Listing 2 demonstrates how to locate and print out the names of all the artists in the collection.
Listing 2
$names = $music->getElementsByTagName('name');
foreach ($names as $name) {
    print $name->firstChild->nodeValue . "\n";
}
The Rolling Stones
Aimee Mann
Since each name element only has one child, the best way to retrieve the element's text is through the firstChild's nodeValue. Remember, in the DOM, nothing is assumed, so the name elements don't hold the text; instead, the text is a child node. Also, once you have a text node, you still need to explicitly request its nodeValue to access the text itself. There's a difference in DOM between the element and its contents.
Things are slightly more complex when an element contains more than one child. For instance, to retrieve all the albums, you could do a getElementsByTagName() request for title. However, this doesn't allow you to distinguish one artist's albums from another's. Solve this problem by substituting albums for title, as demonstrated in Listing 3.
Listing 3
$albums = $music->getElementsByTagName('albums');
foreach ($albums as $album) {
    foreach ($album->childNodes as $title) {
        print $title->firstChild->nodeValue . "\n";
    }
}
Exile On Main Street
I'm With Stupid
Bachelor No. 2
In this case, each albums element may have more than
one child; therefore, you can't just print
firstChild->nodeValue. Instead,
loop through the
childNodes attribute of the element to ensure you catch
all the albums. The
childNodes attribute is what's known in DOM as a node
list, so, even if there's only one child, you're guaranteed that your
foreach
will work. (In fact, even if there are no children,
childNodes returns
an empty node list for you, so the loop won't generate an error.) As shown in
Listing 4, structuring your code this way makes it easy to convert this information
into an HTML list with albums divided by artist.
Listing 4
$albums = $music->getElementsByTagName('albums');
foreach ($albums as $album) {
    print "<ul>";
    foreach ($album->childNodes as $title) {
        print "<li>" . $title->firstChild->nodeValue . "</li>";
    }
    print "</ul>";
}
Alternatively, you can create an HTML table that contains
each artist and their albums, as detailed in Listing 5.
Listing 5
$artists = $music->documentElement;
print "<table>\n";
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    $name = $names->item(0)->firstChild->nodeValue;
    $titles = $artist->getElementsByTagName('title');
    foreach ($titles as $title) {
        print "<tr><td>$name</td>";
        print "<td>" . $title->firstChild->nodeValue . "</td></tr>\n";
    }
}
print "</table>\n";

Figure 2: HTML Table Displaying Album Data
The HTML table code in Listing 5 not only builds
on the previous example, but also introduces a few new DOM features.
First, the top line references $music->documentElement. This attribute is a reference to the root of the XML tree. Since every XML document is required to have one and only one root element, DOM can just use documentElement to refer to that spot in the tree and not worry about sibling elements.
In Listing 5, inside the first
foreach loop, there are two calls to
getElementsByTagName(). This time, instead of calling the method on the original DOM object, they're invoked on the children: $artist. However, that's okay. It's perfectly legal to call
getElementsByTagName() on
domElements as well as
domDocuments. A
domElement is a DOM object that represents an XML element instead of an entire document. (Heads up! There's a bug in this method in PHP 5 Beta 3 and earlier.)
Again, since there's only one name per artist element, the text can be accessed directly as $names->item(0)->firstChild->nodeValue. Since getElementsByTagName() returns a node list, the first (and only) element is in position 0. Under the PHP 5 DOM extension, you can access the values of a node list using the item() method.
Next, in Listing 5, there's another
foreach() that iterates through every album title. This requires another call to
getElementsByTagName() to retrieve the albums, but otherwise this code is similar to what's done earlier. The one change is that there's additional code to print the artist name and the new table HTML.
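The node list behaviour described above can be tried in isolation. The sketch below assumes an inline one-artist sample rather than music.xml, and shows the length attribute alongside item(); it also shows that asking for an index that doesn't exist gives back null, which is worth checking for whenever a match isn't guaranteed.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<artist id="2"><name>Aimee Mann</name></artist>');

$names = $doc->getElementsByTagName('name');
print $names->length . "\n";                          // 1
print $names->item(0)->firstChild->nodeValue . "\n"; // Aimee Mann

// There is no title element in this sample, so item(0) is null.
$missing = $doc->getElementsByTagName('title')->item(0);
print ($missing === null ? 'no title found' : $missing->nodeValue) . "\n";
```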
Writing XML
The DOM lets you do more than just iterate through pre-existing XML elements and print out their values. You can also use DOM to append new information to the document. For example, say you just downloaded a new album, Sticky Fingers by The Rolling Stones, and want to add it to the list. It's easy to do this using DOM. The task breaks down into two high-level steps: creating the new information and then appending the data in the correct place within the tree.
To create a new DOM element, instantiate a new instance of the
domElement class. Pass the element's name as the first value. If the element is to contain just a text node, like in this case, pass the text as the second argument:
$newAlbum = new domElement('title',
'Sticky Fingers');
Once that's done, the element needs to be added to the master document. Unfortunately, it can't be placed just anywhere in the tree: it needs to be added as a sibling to the other Rolling Stones album that is already there. This is easier said than done using DOM because DOM doesn't provide any fine-grained querying abilities.
Therefore, you again need to loop through the artists and check for the one who has a name child element that's
The Rolling Stones and append the new node to that artist's set of albums. This process is demonstrated in Listing 6.
Listing 6
$artists = $music->documentElement;
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    if ('The Rolling Stones' == $names->item(0)->firstChild->nodeValue) {
        // search below the matching artist, not the whole document
        $albums = $artist->getElementsByTagName('albums');
        $albums->item(0)->appendChild($newAlbum);
        break;
    }
}
Most of this code is PHP to search through the DOM object. However, in the middle, the magic happens. The new album belongs as a child of the albums element. When the loop reaches The Rolling Stones, it retrieves that artist's albums element using getElementsByTagName('albums'). Again, by design, there's only one result, so it's located in $albums->item(0).
To add a new child to this element, call the appendChild() method and pass the just created $newAlbum as its argument. Now the album has been added to the tree. One way to double-check the results is to convert the DOM object back into XML and inspect it yourself, as shown in Listing 7.
Listing 7
// indent elements
$music->formatOutput = true;
print $music->saveXML();
<?xml version="1.0"?>
<music>
  <artist id="1">
    <name>The Rolling Stones</name>
    <albums>
      <title>Exile On Main Street</title>
      <title>Sticky Fingers</title>
    </albums>
  </artist>
  <artist id="2">
    <name>Aimee Mann</name>
    <albums>
      <title>I'm With Stupid</title>
      <title>Bachelor No. 2</title>
    </albums>
  </artist>
</music>
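The append step can also be written with the document's own factory methods, createElement() and createTextNode(), which makes the two nodes involved explicit. This is a sketch of an alternative, not the approach used in the listings, and it works on a cut-down inline sample (an assumption) so that it runs on its own.

```php
<?php
$doc = new DOMDocument;
$doc->loadXML('<albums><title>Exile On Main Street</title></albums>');

// Node one: the title element, created by the document it will live in.
$title = $doc->createElement('title');

// Node two: the text node holding the album name.
$title->appendChild($doc->createTextNode('Sticky Fingers'));

// Attach the finished element below albums.
$doc->documentElement->appendChild($title);

// Passing a node to saveXML() serializes just that subtree.
print $doc->saveXML($doc->documentElement) . "\n";
```

Building the text node separately is a little more verbose, but createTextNode() takes care of characters such as & when the document is serialized, which can matter if album names come from user input.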
That example was relatively straightforward because there were only two new nodes to add: the title element and the text node that contains Sticky Fingers. However, it's slightly more complicated to add a new artist to the end of the tree. For instance, to know what to do when your copy of Elvis Presley #1 arrives, see the code in Listing 8. This code is a little more complex because it creates multiple elements, but the concept is the same as before.
Listing 8
$artists = $music->documentElement;
$lastId = $artists->lastChild->getAttribute('id');
$newArtist = $artists->appendChild(new domElement('artist'));
$newArtist->setAttribute('id', $lastId + 1);
$newArtist->appendChild(new domElement('name', 'Elvis Presley'));
$newAlbums = $newArtist->appendChild(new domElement('albums'));
$newAlbums->appendChild(new domElement('title', '#1'));
Let's look at the code in Listing 8 in some detail. For starters, artists have a unique ID associated with them because bands sometimes change names, so, to add a new artist to the collection, you need to set its id attribute. In this case, the new ID number is one more than the previous high. Since new artists are added to the end of the list, take the value of the last artist's ID attribute and increment it by one. You could loop through $music->documentElement->childNodes to find the last element in that list, but the fastest way to identify the end element is to use $music->documentElement->lastChild. (Like I said, DOM is comprehensive, so if there's a firstChild, there's bound to be a lastChild.) The getAttribute() and setAttribute() methods allow you to read and write element attribute values. Therefore, getAttribute('id') grabs the ID value and stores it for later use.
Next, there's a new artist element. Unlike before, there's only one argument, since this element is a container that holds the name and the albums. Earlier, the code to create the new element and append a domElement was divided into two distinct steps. However, you can directly pass the new element to the append method without storing it in a temporary variable. That's what happens here.
After you've appended the new element, set the new ID using $newArtist->setAttribute('id', $lastId + 1). Now you append the name and albums elements to $newArtist. Since the title element lives below albums, you append that node to $newAlbums instead of $newArtist.
By the way, using this code - shown in Listing 8 - with multiple users requires
some form of file locking because of concurrency issues, but that's another topic.
Of course, having two separate pieces of code - one to update existing entries
and one to add new entries - is cumbersome. It's cleaner to have one function
that tries to append the new album to an artist's record, but will create a new
entry if the artist isn't already in the system. This only requires a slight code
reorganization, as is demonstrated in Listing 9. In the code, the
addAlbum()
function first loops through the existing entries looking for a match. If it finds
the artist, it appends the new album and returns. However, if the loop completes
without success, you know you need to create a new entry and the function contains
the code to do that, too.
Listing 9
function addAlbum($music, $theArtist, $theAlbum) {
    $artists = $music->documentElement;

    // try to insert new album
    // under existing artist
    foreach ($artists->childNodes as $artist) {
        $names = $artist->getElementsByTagName('name');
        if ($theArtist == $names->item(0)->firstChild->nodeValue) {
            $albums = $artist->getElementsByTagName('albums');
            $albums->item(0)->appendChild(
                new domElement('title', $theAlbum));
            return;
        }
    }

    // new artist, so create a whole new entry
    $lastId = $artists->lastChild->getAttribute('id');
    $newArtist = $artists->appendChild(new domElement('artist'));
    $newArtist->setAttribute('id', $lastId + 1);
    $newArtist->appendChild(new domElement('name', $theArtist));
    $newAlbums = $newArtist->appendChild(new domElement('albums'));
    $newAlbums->appendChild(new domElement('title', $theAlbum));
}

addAlbum($music, 'Elvis Presley', '#1');
XSLT
XSLT is a way to transform XML into other formats: HTML, plain text, or even another XML document. PHP 4 had an XSLT extension, but it didn't work together with the DOM extension. This made it difficult to use XSLT with a DOM object because you had to serialize it to a string or file before you could feed it to the XSLT processor. In PHP 5, the XSLT extension takes DOM objects as its input, so it's simple to use the two extensions together. You pass the XSLT processor a DOM object that holds your stylesheet and then you have it transform other DOM objects created from your source XML documents.
We'll now look at how to create the same HTML table that lists all your albums and their artists using XSLT instead of DOM. The first step, as shown in Listing 10, is to create the stylesheet and save it as music.xsl.
Listing 10
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <table>
      <xsl:apply-templates select="music/artist/albums/title"/>
    </table>
  </xsl:template>
  <xsl:template match="music/artist/albums/title">
    <tr>
      <td><xsl:value-of select="../../name"/></td>
      <td><xsl:value-of select="."/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>
This stylesheet wraps a table tag around the rows and then tells the program to process all the music/artist/albums/title elements. That template prints out the table rows and data tags and retrieves the necessary data to fill them in. In XSLT, "." (dot or period) is shorthand for the current element and ".." (two dots or periods) means back up one level. (This is standard Unix shell notation.) Therefore, select="../../name" means take the name element that's two levels above the current location. Likewise, select="." means take the current element: title. With the stylesheet complete, you can now create the XSLT processor, load it in, and do the transformation, as shown in Listing 11.
Listing 11
$xslt = new xsltProcessor;
$xsl = domDocument::load('music.xsl');
$xslt->importStylesheet($xsl);
$xml = domDocument::load('music.xml');
print $xslt->transformToXML($xml);
<table>
<tr>
<td>The Rolling Stones</td>
<td>Exile On Main Street</td>
</tr>
<tr>
<td>Aimee Mann</td>
<td>I'm With Stupid</td>
</tr>
<tr>
<td>Aimee Mann</td>
<td>Bachelor No. 2</td>
</tr>
</table>
Just like domDocument is the class for DOM objects, xsltProcessor is the class for XSLT processor objects. After instantiating a new instance of the processor, import the stylesheet by passing a DOM object to the importStylesheet() method. This example uses the static load() constructor to create a DOM object and load the XML in one step. Now the $xslt object is ready to process data. The transformToXML() method takes a DOM object and returns a string, which you can print or otherwise use. Alternatively, the transformToDoc() method returns a domDocument and transformToURI() saves the output to disk. Pass the filename as the second argument to transformToURI().
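For readers on a later PHP release, where calling load() statically as Listing 11 does is no longer accepted, the same flow can be sketched with ordinary instances. The example below is an illustration under a few assumptions: the xsl extension is enabled, the stylesheet and the data are inline strings rather than music.xsl and music.xml, and the stylesheet is a simplified one that prints artist names as plain text.

```php
<?php
// The stylesheet: print each artist's name followed by a newline.
$xsl = new DOMDocument;
$xsl->loadXML(<<<'XSL'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="music/artist">
      <xsl:value-of select="name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL
);

// The source document, also loaded into its own instance.
$xml = new DOMDocument;
$xml->loadXML('<music>'
    . '<artist id="1"><name>The Rolling Stones</name></artist>'
    . '<artist id="2"><name>Aimee Mann</name></artist>'
    . '</music>');

$xslt = new XSLTProcessor;
$xslt->importStylesheet($xsl);   // stylesheet goes in as a DOM object
$result = $xslt->transformToXML($xml);
print $result;
```

The method name transformToXML() is slightly misleading here: it returns whatever the stylesheet's output method produces, which in this sketch is plain text.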
XPath
XPath is the language used by XSLT and XPointer to specify portions of an XML document. XSLT template select expressions, like ../../name, use XPath. In PHP 5, you can also use XPath to search DOM objects. This is a great benefit because XPath allows you to craft highly detailed searches, unlike DOM, which only has getElementsByTagName(). This allows you to remove loops because you no longer need to iterate over a nodeList. For example, in addAlbum() (Listing 9), you were forced to loop through every artist checking if their name matched the name of the new artist:
// try to insert new album
// under existing artist
foreach ($artists->childNodes as $artist) {
    $names = $artist->getElementsByTagName('name');
    if ($theArtist == $names->item(0)->firstChild->nodeValue) {
        $albums = $artist->getElementsByTagName('albums');
        $albums->item(0)->appendChild(
            new domElement('title', $theAlbum));
        return;
    }
}
However, using the XPath extension,
the code simplifies considerably:
$xpath = new domXPath($music);
$albums = $xpath->query("/music/artist[name = '$theArtist']/albums");
if ($albums->length) {
    $albums->item(0)->appendChild(
        new domElement('title', $theAlbum));
    return;
}
The new code not only transfers the searching to the XPath extension, but also makes it return the exact node that you need to append the new element. It's a double win. Start by making a new domXPath object, passing the DOM object to the constructor. Now you can search the object using the domXPath query method. The request of /music/artist[name = '$theArtist']/albums means, starting at the root, return all the albums that live under artists that live under a music element where the artist element has a child named name whose value is (the value of the PHP variable) $theArtist. While that sentence is a mouthful, it's easy to deconstruct if you consider each / as descending one level into the document and each [ ] as a way to filter those results using a set of boolean tests. As of the time of writing this article, the query() method returns a node list of matching nodes, like getElementsByTagName(). This places the first node in $albums->item(0).
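The lookup can be wrapped in a small function so that a miss is easy to detect. The following is a minimal sketch: findAlbums() is a hypothetical helper introduced here for illustration, the data is an inline sample, and the query assumes the artist name contains no single quote (a name such as Guns N' Roses would break this simple string interpolation).

```php
<?php
$music = new DOMDocument;
$music->loadXML('<music><artist id="1"><name>The Rolling Stones</name>'
    . '<albums><title>Exile On Main Street</title></albums></artist></music>');

$xpath = new DOMXPath($music);

// Hypothetical helper: return the albums element for an artist, or null.
function findAlbums(DOMXPath $xpath, $artist) {
    // Assumption: $artist holds no single quote character.
    $nodes = $xpath->query("/music/artist[name = '$artist']/albums");
    return $nodes->length ? $nodes->item(0) : null;
}

// A known artist: append a second title below the albums element found.
$albums = findAlbums($xpath, 'The Rolling Stones');
$albums->appendChild($music->createElement('title', 'Sticky Fingers'));
print $albums->childNodes->length . "\n";      // 2

// An unknown artist: the node list is empty, so the helper returns null.
var_dump(findAlbums($xpath, 'Elvis Presley')); // NULL
```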
With a
nodeList object,
you check the value of the
length attribute to find the number of items
in the object. This is similar to calling
count() on an array in PHP. You
can also use XPath to find the values of attributes. For instance, to find the
value of the
id attribute of the final artist, you did:
$artists = $music->documentElement;
$lastId = $artists->lastChild->
getAttribute('id');
In
XPath, this is:
$lastIds = $xpath->query('/music/artist[position() = last()]/@id');
$lastId = $lastIds->item(0)->nodeValue;
The query searches for the artist element that is in the last position in the list and then retrieves its id attribute. Placing an @ in front of id tells XPath to take the element's attribute instead of its child. The position() and last() functions are just two of a number of built-in XPath functions that allow you to winnow your search. Unlike getAttribute(), which returns the value of the attribute, XPath returns a domAttr object. This object maps to an XML attribute, like domElement maps to an element. To get the text out of the object, read its nodeValue attribute. For more on XPath, check out the specification at http://www.w3.org/TR/xpath.
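As a sketch of the attribute lookup, the example below runs the query against an inline two-artist sample (an assumption) and then shows an alternative: the evaluate() method found in later PHP releases, which can hand back a string directly when the expression is wrapped in XPath's string() function. The [last()] predicate used here is the standard shorthand for [position() = last()].

```php
<?php
$music = new DOMDocument;
$music->loadXML('<music><artist id="1"/><artist id="2"/></music>');
$xpath = new DOMXPath($music);

// query() returns a node list; the single match is an attribute node.
$attr = $xpath->query('/music/artist[last()]/@id')->item(0);
print $attr->nodeName . '=' . $attr->nodeValue . "\n"; // id=2

// evaluate() can return a typed result, here the attribute as a string.
$lastId = $xpath->evaluate('string(/music/artist[last()]/@id)');
print ((int) $lastId + 1) . "\n";                       // 3
```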
Conclusion
The new DOM extension in PHP 5 is powerful and implements a larger portion of
the specification than the previous version. Additionally, it's compatible with
PHP 5's other XML extensions, including XSLT and XPath. These tools combine to
create a comprehensive set of XML processing utilities that let you solve all
your XML needs. Regardless of your task, the DOM is up to the challenge and when
the DOM makes things difficult, you can quickly switch to another extension to
solve your problems without a headache. While there are some features that haven't
been added to the PHP 5 DOM extension, progress continues every day towards a
complete DOM implementation. If you're using DOM and PHP 5, the future is bright.
Links and Literature