PHP: XML data extraction using regular expressions — page three

This is page three of an article about using PHP and regular expressions to extract data from XML files. This page is about extracting the attributes from an element in an XML sample. Page one is about extracting the content of a single element, and page two is about extracting multiple elements into an array.

The source code for the functions described in this article can be downloaded on page one.

Extracting attributes from XML elements

Here is the code for a simple function I've written to extract attributes from a named XML element:

function element_attributes($element_name, $xml) {
    if ($xml == false) {
        return false;
    }
    // Grab the string of attributes inside an element tag.
    $found = preg_match('#<'.preg_quote($element_name, '#').
            '\s+([^>]+(?:"|\'))\s?/?>#',
            $xml, $matches);
    if ($found == 1) {
        $attribute_array = array();
        $attribute_string = $matches[1];
        // Match attribute-name attribute-value pairs.
        $found = preg_match_all(
                '#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#',
                $attribute_string, $matches, PREG_SET_ORDER);
        if ($found != 0) {
            // Create an associative array that matches attribute
            // names to attribute values.
            foreach ($matches as $attribute) {
                $attribute_array[$attribute[1]] =
                        substr($attribute[2], 1, -1);
            }
            return $attribute_array;
        }
    }
    // Attributes either weren't found, or couldn't be extracted
    // by the regular expression.
    return false;
}

This function searches the provided XML sample for the first element of the given name that has attributes, and returns those attributes in an associative array. The array keys are the attribute names, and the array values are the attribute values.

Example of use

Given the following XML sample in a PHP variable called $xml:

<item>
    <target href="http://somedomain.com/" type="text/html"
            category="Home &amp; Leisure" />
    <title>This is the title of this item</title>
</item>

you could call element_attributes like this:

$attribute_array = element_attributes('target', $xml);

Now, if the function hasn't failed and returned false, you should have an associative array that contains the attribute values of the target element. In the returned array, the attribute names are the keys to the array, so you can grab the value of the individual attributes like this:

$href = $attribute_array['href'];
$type = $attribute_array['type'];
$category = $attribute_array['category'];

This is a very handy way of quickly grabbing the attribute values of a lone element. But if several elements share the same name, you will only be able to get the attributes of the first matching element that has attributes, a limitation covered in more detail further down this page.
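Two details are worth handling when reading the returned array: the function may have returned false, and the values come back exactly as written in the XML, so the category value is still "Home &amp; Leisure" rather than "Home & Leisure". The following is a minimal sketch, assuming a hypothetical helper called attribute_value (it is not part of the downloadable code). The array literal stands in for what element_attributes('target', $xml) returns for the sample above.

```php
<?php
// Hypothetical helper, not part of the article's downloadable code.
// Returns the decoded value of one attribute, or $default when the
// lookup failed (false was passed in) or the attribute is absent.
function attribute_value($attribute_array, $name, $default = null) {
    if (!is_array($attribute_array) || !isset($attribute_array[$name])) {
        return $default;
    }
    // Values are returned as written in the XML, so entity references
    // such as &amp; still need decoding.
    return html_entity_decode($attribute_array[$name],
            ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Stand-in for the result of element_attributes('target', $xml).
$attribute_array = array(
    'href' => 'http://somedomain.com/',
    'type' => 'text/html',
    'category' => 'Home &amp; Leisure',
);

$category = attribute_value($attribute_array, 'category');   // "Home & Leisure"
$missing = attribute_value($attribute_array, 'lang', 'en');  // falls back to "en"
$failed = attribute_value(false, 'href', '');                // falls back to ""
```

Keeping the fallback in one place means the calling code never has to index into a value that might be false.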

Limitations of using generic regex to extract XML data

The three pages of this article have offered simple functions for grabbing at the content and attributes of XML elements by using regular expression patterns to match the juicy bits in an XML sample.

Limitations of value_in and element_set

Because of the way the regular expressions in these functions work, there are some XML structures that will break value_in and element_set and cause them to return unexpected results. For instance, neither function can extract the content of an element that contains another element with the same name, as in this example:

<div class="outer">
    <div class="inner">some content</div>
</div>

Calling value_in to extract the content of "div" from the above XML would not return the full content of the outer div element, because the regular expression stops at the closing tag of the inner div element, treating it as the closing tag of the outer div element.
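The problem is easy to reproduce. The sketch below uses value_in_sketch, a hypothetical stand-in written for this illustration on the assumption that value_in does a non-greedy match from an opening tag to the nearest closing tag of the same name. The real function is on page one and may differ in detail.

```php
<?php
// Hypothetical stand-in for value_in; the real function is on page one.
// Assumes a non-greedy match between the opening tag and the nearest
// closing tag with the same name.
function value_in_sketch($element_name, $xml) {
    $name = preg_quote($element_name, '#');
    $found = preg_match(
            '#<'.$name.'(?:\s[^>]*)?>(.*?)</'.$name.'>#s',
            $xml, $matches);
    return ($found === 1) ? $matches[1] : false;
}

$xml = '<div class="outer"><div class="inner">some content</div></div>';
$content = value_in_sketch('div', $xml);

// The match ends at the first </div>, so the captured content is cut
// short: it holds the inner opening tag and its text, but neither
// closing tag.
echo $content, "\n";
```

The captured string is `<div class="inner">some content`, which is neither the content of the outer element nor the content of the inner one.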

Another shortcoming with the regular expression approach is where elements at different levels within a piece of XML share the same name. For instance:

<root>
    <sub-element>
        <name>Alan</name>
    </sub-element>
    <name>Brad</name>
</root>

Because of the way the regular expression in value_in works, there's no way to select the name element that contains "Brad" in the above XML: the pattern cannot match that element without first matching the name element that contains "Alan". Using element_set to extract the name elements would store both of them in an array, but you wouldn't be able to isolate only those name elements that are immediate child elements of root.
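To illustrate the loss of level information, here is a rough sketch using element_set_sketch, a hypothetical stand-in for element_set (the real function is on page two and may return something different, such as whole elements rather than just their content). It assumes the same non-greedy tag-to-tag matching, applied with preg_match_all.

```php
<?php
// Hypothetical stand-in for element_set; the real function is on page two.
// For brevity it returns only the content of each matching element.
function element_set_sketch($element_name, $xml) {
    $name = preg_quote($element_name, '#');
    $found = preg_match_all(
            '#<'.$name.'(?:\s[^>]*)?>(.*?)</'.$name.'>#s',
            $xml, $matches);
    return ($found > 0) ? $matches[1] : false;
}

$xml = '<root>'
     . '<sub-element><name>Alan</name></sub-element>'
     . '<name>Brad</name>'
     . '</root>';

$names = element_set_sketch('name', $xml);

// Both name elements come back in document order, with nothing to say
// which one was a direct child of root.
print_r($names);
```

The result holds "Alan" followed by "Brad", so any code that wants only the direct children of root has to find another way to tell them apart.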

Limitations of element_attributes

The element_attributes function also has a serious limitation: it can only grab the attributes from the first element of the given name that has attributes. If there is more than one element with the same name, such as in this XML sample,

<item>
	<hyperlink />
	<hyperlink/>
	<hyperlink type="text/html" href="http://www.bobulous.org.uk/coding/" />
	<hyperlink type="text/html" href="http://www.bbc.co.uk/"/>
</item>

then the element_attributes function can only return an associative array that contains the attributes for the third hyperlink element, because the first two have no attributes and the fourth hyperlink element cannot be isolated.

If the elements weren't empty elements, then the element_set function could gather up the elements into an array, and then you could call element_attributes on each one. But element_set doesn't match empty elements (because they contain no content), so it wouldn't work in this example. You could create a modified version of element_set that looks for empty elements, but hopefully you won't encounter many XML data formats that make use of empty element siblings with the same name.
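One way to reach the attributes of every same-named element, empty or not, is to match the tags themselves rather than going through element_set. What follows is a minimal sketch, assuming a hypothetical helper called all_element_attributes (not part of the downloadable code) that reuses the name/value pattern from element_attributes. Like the other functions here it is still regex-based, so an attribute value containing a ">" character would cut the tag match short.

```php
<?php
// Hypothetical helper, not part of the downloadable code. Returns one
// associative array of attributes for every opening or empty tag with
// the given name, in document order. A tag without attributes yields
// an empty array.
function all_element_attributes($element_name, $xml) {
    $results = array();
    // Match each tag of this name; group 1 holds whatever follows the name.
    $tag_pattern = '#<'.preg_quote($element_name, '#').'(\s[^>]*)?/?>#';
    if (!preg_match_all($tag_pattern, $xml, $tags, PREG_SET_ORDER)) {
        return $results;
    }
    foreach ($tags as $tag) {
        // Group 1 is absent when the tag has nothing after its name.
        $attribute_string = isset($tag[1]) ? $tag[1] : '';
        $attribute_array = array();
        // Same name/value pattern as element_attributes.
        preg_match_all(
                '#([^\s=]+)\s*=\s*(\'[^<\']*\'|"[^<"]*")#',
                $attribute_string, $pairs, PREG_SET_ORDER);
        foreach ($pairs as $pair) {
            $attribute_array[$pair[1]] = substr($pair[2], 1, -1);
        }
        $results[] = $attribute_array;
    }
    return $results;
}

$xml = '<item>'
     . '<hyperlink />'
     . '<hyperlink/>'
     . '<hyperlink type="text/html" href="http://www.bobulous.org.uk/coding/" />'
     . '<hyperlink type="text/html" href="http://www.bbc.co.uk/"/>'
     . '</item>';

$all = all_element_attributes('hyperlink', $xml);
print_r($all);
```

For this sample the result has four entries: two empty arrays for the attribute-less elements, then one array each for the third and fourth hyperlink elements.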

Summary

These limitations won't be a problem if the XML you're extracting from never nests an element inside another element of the same name, never reuses an element name at different levels, and doesn't rely on empty sibling elements that share a name. But if you're trying to extract content from XHTML files, or from files that feature problem structures like the ones above, you'll have to come up with your own customised regular expressions, or switch to an event-based parser like the one that expat offers.
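For comparison, here is a rough sketch of the event-based route using PHP's XML Parser extension (the xml_parser_create family of functions), on the assumption that this is the kind of expat-style parser meant above and that the extension is installed. The function name attributes_by_event is hypothetical. The parser calls the start-element handler once for every opening or empty tag and passes in the attributes as an array, so same-named siblings are no longer a problem.

```php
<?php
// Hypothetical function for illustration: collects the attributes of
// every element with the given name using PHP's event-based XML parser.
// Returns an array of attribute arrays, or false if the XML fails to parse.
function attributes_by_event($element_name, $xml) {
    $collected = array();
    $parser = xml_parser_create();
    // Keep element and attribute names in their original case
    // (by default the parser folds them to upper case).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // Called at the start of every element.
        function ($parser, $name, $attributes) use (&$collected, $element_name) {
            if ($name === $element_name) {
                $collected[] = $attributes;
            }
        },
        // Called at the end of every element; nothing to do here.
        function ($parser, $name) {
        }
    );
    $ok = xml_parse($parser, $xml, true);
    return ($ok === 1) ? $collected : false;
}

$xml = '<item>'
     . '<hyperlink />'
     . '<hyperlink/>'
     . '<hyperlink type="text/html" href="http://www.bobulous.org.uk/coding/" />'
     . '<hyperlink type="text/html" href="http://www.bbc.co.uk/"/>'
     . '</item>';

$links = attributes_by_event('hyperlink', $xml);
print_r($links);
```

Unlike the regex functions, the parser also decodes entity references in attribute values and reports a failure for XML that isn't well formed, at the cost of a little more setup code.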

Updates

13th January 2012: Thanks to Robert Bradley for pointing out that the source code on this page had an error. The line in the code box above wrongly said $attribute_array[$attribute[1]] = $attribute[3] and it should have been $attribute_array[$attribute[1]] = substr($attribute[2], 1, -1). The downloadable code was fine, so I've no idea how I managed to get the wrong code onto this page.