PHP: XML data extraction using regular expressions — page one

PHP 4 comes with a set of XML parser functions based on the expat library, but these functions seem best suited to parsing entire XML files all in one go. If you only want to nibble at a specific element in an XML file, it may be simpler just to use regular expressions to grab the bits you want.

This article shows how to use regular expressions to make short work of extracting content from an XML data file. This first page presents a function for grabbing the content of a single element from a piece of XML; page two covers a function for capturing multiple elements with the same name; and page three covers a function for extracting attributes from an XML element.

Important note: These functions have serious limitations that prevent them from extracting the content of elements under certain conditions. See the section on the limitations of using generic regular expressions to parse XML.

Update: it's been a few years since I wrote this article. Anyone using PHP 5.2.3 or later ought to take a look at reading XML feeds using PHP 5 and SimpleXML, which demonstrates an alternative to using regular expressions to extract data from XML.

Download the source code

To use these functions in your own PHP scripts, download the source file in the compressed format of your choice:

Feel free to modify and use these functions for any purpose that is neither illegal nor immoral. If you want to let other people know about the functions, please link to this page, and not directly to the source file.

Matching a unique XML element

Here is the code for a function I've written called value_in which extracts the content of a single element in a piece of XML:

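function value_in($element_name, $xml, $content_only = true) {
    // Nothing to search: this also lets calls to value_in be nested safely.
    if ($xml == false) {
        return false;
    }
    // Escape the name so regex metacharacters in it (such as .) match literally.
    $name = preg_quote($element_name, '#');
    // Opening tag with optional attributes, lazily captured content, then
    // the closing tag. The s modifier lets . match newlines as well.
    $found = preg_match('#<'.$name.'(?:\s+[^>]+)?>(.*?)'.
            '</'.$name.'>#s', $xml, $matches);
    if ($found === 1) {
        if ($content_only) {
            return $matches[1];  // ignore the enclosing tags
        } else {
            return $matches[0];  // return the full pattern match
        }
    }
    // No match found (or the pattern failed): return false.
    return false;
}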

You tell it the name of the element you're interested in, give it the XML you want to extract the data from, and tell it whether it should return only the content of the named element or preserve the enclosing tags of the named element. Then value_in returns the content of the first element it finds within the supplied XML that is an exact match for the given name.
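To make "exact match" concrete, here is a small sketch. The sample string and its element names (book, subtitle, titles, year) are made up for illustration, and the function is restated compactly so that the snippet runs on its own. It suggests that a request for title skips both subtitle and titles, and that attributes on the opening tag do not get in the way:

```php
<?php
// Compact restatement of value_in so this snippet is self-contained.
if (!function_exists('value_in')) {
    function value_in($element_name, $xml, $content_only = true) {
        if ($xml == false) {
            return false;
        }
        // Escape the name so it is matched literally inside the pattern.
        $name = preg_quote($element_name, '#');
        $found = preg_match('#<'.$name.'(?:\s+[^>]+)?>(.*?)'.
                '</'.$name.'>#s', $xml, $matches);
        if ($found != false) {
            return $content_only ? $matches[1] : $matches[0];
        }
        return false;
    }
}

// Hypothetical sample: "title" should not match "subtitle" or "titles".
$sample = '<book><subtitle>A Memoir</subtitle>'
        . '<titles>Ignored</titles>'
        . '<title lang="de">Der Untergang</title></book>';

$content = value_in('title', $sample);         // content only
$element = value_in('title', $sample, false);  // keep the enclosing tags
$missing = value_in('year', $sample);          // no such element: false

var_dump($content, $element, $missing);
```

Note that when the enclosing tags are preserved, the attributes of the opening tag come back with them, because the full pattern match is returned.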

Examples of use

Consider an XML sample:

<movies>
    <movie>
        <title>Der Untergang</title>
        <actor>
            <name>Bruno Ganz</name>
        </actor>
        <actor>
            <name>Alexandra Maria Lara</name>
        </actor>
        <director>
            <name>Oliver Hirschbiegel</name>
        </director>
    </movie>
</movies>

If you had the above XML stored in a variable called $xml then you could extract the value of the title element by calling value_in like this:

$title = value_in('title', $xml);

And then $title should contain the value "Der Untergang". If you wanted to preserve the enclosing title tags, you could call value_in with the optional third parameter set to false:

$title = value_in('title', $xml, false);

so that $title should instead contain the value "<title>Der Untergang</title>".

Remember that value_in extracts the content of the first element whose name matches the provided parameter. So what if you want to extract the director's name from the above XML sample? You can't just call value_in with the parameter 'name' because it will return the content of the first name element in the sample, which would be "Bruno Ganz". Instead, it's easiest to make one call to value_in to select the director element, and then use the value returned as the input for a second call to value_in, this time asking for the content of the name element. Like this:

$director = value_in('director', $xml);
$name = value_in('name', $director);

Or even more compact:

$name = value_in('name', value_in('director', $xml));

Now $name should contain the value "Oliver Hirschbiegel".
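The nested form works even when the outer element is missing, because value_in returns false as soon as it is handed false instead of XML. The following sketch illustrates this with a shortened copy of the sample; the producer element is hypothetical and deliberately absent, and the function is restated so that the snippet runs on its own:

```php
<?php
// Compact restatement of value_in so this snippet is self-contained.
if (!function_exists('value_in')) {
    function value_in($element_name, $xml, $content_only = true) {
        if ($xml == false) {
            return false;
        }
        // Escape the name so it is matched literally inside the pattern.
        $name = preg_quote($element_name, '#');
        $found = preg_match('#<'.$name.'(?:\s+[^>]+)?>(.*?)'.
                '</'.$name.'>#s', $xml, $matches);
        if ($found != false) {
            return $content_only ? $matches[1] : $matches[0];
        }
        return false;
    }
}

$xml = <<<XML
<movies>
    <movie>
        <title>Der Untergang</title>
        <actor>
            <name>Bruno Ganz</name>
        </actor>
        <director>
            <name>Oliver Hirschbiegel</name>
        </director>
    </movie>
</movies>
XML;

// An unqualified request returns the first name element in the document.
$first_name = value_in('name', $xml);

// Narrowing to the director element first selects the right name.
$director_name = value_in('name', value_in('director', $xml));

// There is no producer element, so the inner call returns false
// and the outer call passes that false straight through.
$producer_name = value_in('name', value_in('producer', $xml));

var_dump($first_name, $director_name, $producer_name);
```

Since a missing element yields false rather than an empty string, it is worth comparing the result with === false before using it.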

But how would you select the name of the second actor in the XML example above? The function value_in does not allow this to be done, because there's no way of specifying which actor element you're interested in. In the case of an XML sample that contains more than one element of a given name, you need to proceed to the next page: a PHP function that matches multiple elements with the same name.
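As a rough preview of the idea, and not the function presented on the next page, the same pattern can be handed to preg_match_all to collect every match instead of only the first. The helper name all_values_in_sketch is hypothetical and exists only for this illustration:

```php
<?php
// Hypothetical helper for illustration only; it is NOT the function
// from page two, just the value_in pattern used with preg_match_all.
if (!function_exists('all_values_in_sketch')) {
    function all_values_in_sketch($element_name, $xml) {
        if ($xml == false) {
            return array();
        }
        // Escape the name so it is matched literally inside the pattern.
        $name = preg_quote($element_name, '#');
        preg_match_all('#<'.$name.'(?:\s+[^>]+)?>(.*?)</'.$name.'>#s',
                $xml, $matches);
        // $matches[1] holds the captured content of every matching element.
        return $matches[1];
    }
}

$xml = '<movie>'
     . '<actor><name>Bruno Ganz</name></actor>'
     . '<actor><name>Alexandra Maria Lara</name></actor>'
     . '<director><name>Oliver Hirschbiegel</name></director>'
     . '</movie>';

// Collect the content of both actor elements, in document order.
$actors = all_values_in_sketch('actor', $xml);

// Pick the second actor (index 1) and read the name inside it.
$names = all_values_in_sketch('name', $actors[1]);
$second_actor = $names[0];

var_dump(count($actors), $second_actor);
```

Like value_in, this sketch inherits the limitations of matching XML with a generic regular expression, so it should be treated as a convenience for simple, predictable input.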