Read XML feeds using PHP 5 and SimpleXML

Some of my previous articles were about extracting XML using regular expressions. This method was useful in PHP 4 as a quick way to grab pieces of data in well-structured XML documents, but it had its limitations.

For web developers using PHP 5, the SimpleXML extension is a quicker, easier way to access content in data-oriented XML documents [see note below], and it's built into PHP 5 by default. This article shows you how to use SimpleXML to extract the latest headlines from an Atom XML feed and display them on your site. It also explains how you can use an extension called Cache_Lite to save the results, to greatly speed up loading times.

The result

Here are the first five headlines from the Atom XML feed from The Register, my preferred source for technology (and assorted) news stories.

The Register

Current headlines

EFF dinks HP Inc finks in rinky-dink ink stink

Give us back our steenkin' cartridges

The Electronic Freedom Foundation has written to HP Inc demanding it reverse its attempt to prevent any third-party ink cartridges or refilled cartridges from working in its Officejet Pro printers.…

Self-destructing Samsung Galaxy Note 7 recall hits six out of ten

Only one in ten demanding cash, we're told

Just over three weeks after announcing a global Galaxy Note 7 recall, Samsung says six out of ten US and South Korean punters have returned their exploding phablets.…

Microsoft's Azure-in-a-box preview runs on your own hardware

Shame the actual product still doesn't, though

Microsoft is dangling a new Technical Preview of its Azure Stack in front of enterprise customers who want to run an applications and services platform across their on-premise private cloud and Redmond's globe-spanning Azure public cloud.…

Startup iguazio launches NVMe-propelled missile at enterprise…

Wall St-beating PaaS for Big Data firm touts crazy performance claims

iguazio’s Data-as-a-Service Enterprise Data Cloud converges different storage access protocols and use cases behind an access abstraction layer and claims to out-perform Amazon and all-flash filers at lower costs.…

New LITE working group takes up ARMs against the IoT

Another initiative targets developers of smart doorbells and other gizmos

Linaro, the collaborative engineering effort focused around Linux for ARM-based devices, has spawned a new working group to develop open reference platforms for connected products, with an inevitable eye on the Internet of Things (IoT).…

The headlines, story dates and summaries have been extracted using SimpleXML and then wrapped in HTML markup. CSS is used to give the whole thing layout and colour.

Requirements and suitability

The code on this page requires that you are using PHP 5.2.3 or later. Version 5.2.3 was released in May 2007, so you or your web host really should be running at least that version on your server by now. (If not, you should perhaps consider switching to another web host.)

To fetch the XML feed, this code uses cURL. This extension is not compiled/enabled by default, so see the cURL PHP documentation to find out how to add the extension to your PHP build. Note that cURL is not required to use SimpleXML, though, and it's just one way of fetching an XML feed from a third-party site.

While this page uses an Atom XML feed as its data source, SimpleXML can be used to work with other data-oriented XML document types. Note: SimpleXML is not suited to accessing XML documents which contain mixed-content elements, such as XHTML documents where text and elements mingle together like this:

<p>This text mingles with <a href="here.html">this hyperlink
element</a> so SimpleXML won't be able to see the whole p
element as one object.</p>

If you need to access mixed-content XML documents like this, take a look at the more complex XML libraries on offer in PHP 5, such as DOM. If your XML document is very large, you may need to use a parser such as XMLReader instead (because object tree-based models such as SimpleXML have to load the entire XML structure into memory in one go).
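
As a rough illustration of those two alternatives, here is a minimal sketch that reads a mixed-content element with DOM and then walks a small Atom document with XMLReader. The sample strings and variable names are invented for this example, and a real large document would be opened from a file or URL rather than from a string.

```php
<?php
// Mixed content: DOM keeps text nodes and child elements together,
// so the whole p element can be read as one piece of text.
$xhtml = '<p>This text mingles with <a href="here.html">this hyperlink '
       . 'element</a> inside one p element.</p>';
$doc = new DOMDocument();
$doc->loadXML($xhtml);
$p = $doc->documentElement;
$whole_text = $p->textContent;  // text of p and all of its descendants
$link_href = $p->getElementsByTagName('a')->item(0)->getAttribute('href');

// Large documents: XMLReader moves through the document one node at
// a time instead of building the whole tree in memory.
$atom = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Feed</title>'
      . '<entry><title>First</title></entry>'
      . '<entry><title>Second</title></entry></feed>';
$reader = new XMLReader();
$reader->XML($atom);
$entry_titles = array();
while ($reader->read()) {
    // feed is at depth 0, entry at depth 1, so a title at depth 2
    // belongs to an entry rather than to the feed itself.
    if ($reader->nodeType == XMLReader::ELEMENT
            && $reader->localName == 'title'
            && $reader->depth == 2) {
        $entry_titles[] = $reader->readString();
    }
}
$reader->close();
```

The depth check is specific to the shape of this sample document; a different document structure would need a different test.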

Loading the XML feed

If you're working with a file on your own server, then it's probably simplest to use the simplexml_load_file function like this:

$feed = simplexml_load_file($filename);

which will open that file and produce a SimpleXMLElement object tree in one easy command. If this works for you, you can skip down to the section about accessing the SimpleXMLElement object tree, but you should probably start from the section about SimpleXML to see how to capture errors created by libxml.
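
Here is a minimal sketch of loading a local file while checking for failure, since simplexml_load_file returns false if the file cannot be read or parsed. The function name load_feed_file is my own invention for this example, and sending the messages to error_log is just one way of dealing with them.

```php
<?php
// Hypothetical helper (the name is illustrative): returns a
// SimpleXMLElement object tree, or false if loading or parsing fails.
function load_feed_file($filename) {
    // Collect libxml errors internally instead of printing them.
    libxml_use_internal_errors(true);
    libxml_clear_errors();
    $feed = simplexml_load_file($filename);
    if ($feed === false) {
        foreach (libxml_get_errors() as $error) {
            error_log('libxml: ' . trim($error->message));
        }
        libxml_clear_errors();
        return false;
    }
    return $feed;
}
```

You would then call it as $feed = load_feed_file($filename); and check the result for false before using it.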

On the other hand, if you're fetching an XML feed from another site, as I'm doing on this page, the simplexml_load_file function may not work, because most servers are configured to prohibit file-handling commands from accessing URLs (the server setting allow_url_fopen is set to "Off" for security reasons). This is where cURL comes in. If cURL is available on your PHP build, then you can use it to fetch files from URLs on other sites in a few lines of code.

Using cURL to fetch an XML document

At this point it's important to say that you must check that you have permission to use another site's XML feed before you proceed. Many sites provide a page that tells you what you're allowed to do with their feeds, and a person to contact if you have any questions. I contacted the Digital Operations Manager at The Register to check that my intended use of their XML feed was acceptable, and he gave me the green light. But make sure you do have permission from the site which owns a feed (or any other type of content) before you use it for any purpose.

Back to cURL. To fetch the main Atom XML feed from The Register, we can use cURL like this:

$xml_feed_url = 'http://www.theregister.co.uk/headlines.atom';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

After this code has executed, the variable $xml should be a string containing the entire document referenced by the URL (an Atom XML document in this case). However, if cURL encountered a problem, $xml will instead be false, and it's important to check for this and handle the error gracefully if it occurs.
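
One way to do that check is sketched below. The function name fetch_url_or_false, the timeout values and the decision to treat any status other than 200 as a failure are all assumptions made for this example, not part of the code used on this page.

```php
<?php
// Hypothetical helper: returns the response body as a string, or
// false if the request fails. Timeouts are illustrative values.
function fetch_url_or_false($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);  // seconds to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);        // seconds in total
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        return false;
    }
    // Assumption: anything other than "200 OK" is treated as a failure.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status != 200) {
        error_log('Unexpected HTTP status ' . $status . ' for ' . $url);
        return false;
    }
    // The handle is released when $ch goes out of scope.
    return $body;
}
```

Setting timeouts means a slow or unreachable feed server delays your page by a bounded amount rather than leaving it hanging.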

SimpleXML

Once you've got a string variable, $xml, which contains an entire XML document, creating a SimpleXMLElement object can be as simple as this:

$xmlTree = new SimpleXMLElement($xml);

However, it's better to create a function using the following code instead, so that any errors from libxml (upon which SimpleXML relies) are caught rather than dumped out to the page:

function produce_XML_object_tree($raw_XML) {
    libxml_use_internal_errors(true);
    try {
        $xmlTree = new SimpleXMLElement($raw_XML);
    } catch (Exception $e) {
        // Something went wrong.
        $error_message = 'SimpleXMLElement threw an exception.';
        foreach(libxml_get_errors() as $error_line) {
            $error_message .= "\t" . $error_line->message;
        }
        trigger_error($error_message);
        return false;
    }
    return $xmlTree;
}

(Note that the trigger_error function will likely also dump the error messages out to the page unless you define an error handler to do something more suitable with them.)

Now you can call this function, supplying it with the $xml variable that contains the XML document, and it will return a SimpleXMLElement object tree, or false if something goes wrong:

$feed = produce_XML_object_tree($xml);

(remember to check for false and handle the error gracefully).

We've called the object variable $feed because it represents the root element of the XML document, and the root element of an Atom XML feed is called feed.

Accessing the SimpleXMLElement object tree

Assuming nothing went wrong, $feed will now be an object tree which has a structure exactly like the XML document. With this SimpleXMLElement object, it's now trivially easy to access content from the XML document.

According to the Atom Syndication Format, the root element must be a feed element, which contains zero or more entry elements representing news stories. Each entry element must contain a title, id, and updated element, and can optionally contain a summary element holding a brief text of the story.

So for the Atom feed example on this page, the $feed object is the root node, and it contains objects named after the elements which are the immediate children of the feed element in an Atom XML feed, such as the title, id and updated elements.

For example, to get the title of the second entry element, you just access the object like this, where the -> operator is used to point from an object to one of its child objects:

$second_entry_title = $feed->entry[1]->title;

(Note: the keys for these objects start at zero, like in arrays, so entry[1] refers to the second entry, not the first.)

If you want to iterate through all of the entry objects in the feed and print out each entry title, it's as simple as this:

foreach ($feed->entry as $entry) {
    echo '<p>'.$entry->title.'</p>';
}
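
To try these accesses without fetching anything, here is a small self-contained sketch that builds the object tree from an inline Atom document. The feed content is invented for the example. Note the (string) casts: each node is a SimpleXMLElement object, so casting is the safest way to get plain text out of it.

```php
<?php
// A tiny made-up Atom document with two entries.
$xml = '<?xml version="1.0" encoding="utf-8"?>'
     . '<feed xmlns="http://www.w3.org/2005/Atom">'
     . '<title>Example feed</title>'
     . '<id>urn:example:feed</id>'
     . '<updated>2016-09-27T10:00:00Z</updated>'
     . '<entry><title>First story</title><id>urn:example:1</id>'
     . '<updated>2016-09-27T10:00:00Z</updated></entry>'
     . '<entry><title>Second story</title><id>urn:example:2</id>'
     . '<updated>2016-09-27T09:00:00Z</updated></entry>'
     . '</feed>';

$feed = new SimpleXMLElement($xml);

// Children of the root element are properties of $feed.
$feed_title = (string) $feed->title;

// Repeated elements can be indexed from zero, like an array.
$second_entry_title = (string) $feed->entry[1]->title;

// Or iterated, collecting each title as a plain string.
$titles = array();
foreach ($feed->entry as $entry) {
    $titles[] = (string) $entry->title;
}
```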

You can also access attribute values and namespace values using the SimpleXMLElement object. For more examples, see the "Basic usage" page of the SimpleXML documentation on the PHP site.
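
As a brief sketch of both, the following reads attributes using array-style keys and reaches an element in another namespace through the children method. The sample entry and its Dublin Core creator element are made up for the example; your feed may not use that namespace at all.

```php
<?php
$xml = '<feed xmlns="http://www.w3.org/2005/Atom"'
     . ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
     . '<entry><title>First story</title>'
     . '<link rel="alternate" href="http://example.com/first"/>'
     . '<dc:creator>A. Writer</dc:creator></entry></feed>';

$feed = new SimpleXMLElement($xml);
$entry = $feed->entry[0];

// Attributes are read with array-style string keys.
$href = (string) $entry->link['href'];
$rel = (string) $entry->link['rel'];

// Elements in another namespace are not visible through ->name
// directly; ask for the children in that namespace first.
$dc = $entry->children('http://purl.org/dc/elements/1.1/');
$creator = (string) $dc->creator;
```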

Now you've got your SimpleXMLElement object, you're ready to extract content from the XML feed and write it to your page.

Extract and mark up

Check data carefully

First, a warning. As always you, as the web developer, must suspect all third-party data of being potentially dangerous. Whether it's data submitted to your site via a form, or the content of an XML feed fetched from a remote site, you need to process the data as though it could be harmful. I trust the good people at The Register, but if their site was hacked, their XML feed might fall under the control of malicious agents, and it could then contain harmful hyperlinks, for instance.

So I use the following function to process any text extracted from the XML:

function clean_text($text, $length = 0) {
    $html = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    $text = strip_tags($html);
    if ($length > 0 && strlen($text) > $length) {
        $cut_point = strrpos(substr($text, 0, $length), ' ');
        $text = substr($text, 0, $cut_point) . '…';
    }
    $text = htmlentities($text, ENT_QUOTES, 'UTF-8');
    return $text;
}

This function decodes any entities in the text, strips out any HTML tags that this reveals, trims the text back to the last whole word within the maximum length (if one is provided), and then re-encodes the whole text using PHP's htmlentities function. Note that SimpleXML internally uses UTF-8 character encoding, so you must specify 'UTF-8' in the htmlentities and html_entity_decode function calls. If your web page is using a character encoding other than UTF-8, you should use PHP's iconv function to convert the feed content from UTF-8 into the encoding your page uses.
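
If you do need that conversion, a minimal sketch might look like the following. The function name to_page_encoding and the ISO-8859-1 default are assumptions for this example. What iconv does with characters that the target encoding cannot represent depends on the iconv implementation on your server, so treat a false result as "could not convert".

```php
<?php
// Hypothetical helper: converts UTF-8 text to the encoding used by
// the page. Returns false if iconv cannot perform the conversion.
function to_page_encoding($utf8_text, $page_encoding = 'ISO-8859-1') {
    if (strtoupper($page_encoding) == 'UTF-8') {
        return $utf8_text;  // nothing to do
    }
    // @ suppresses the notice iconv raises on characters it cannot
    // convert; the false return value is checked by the caller.
    return @iconv('UTF-8', $page_encoding, $utf8_text);
}

// "café" in UTF-8, written with escapes so that this example does
// not depend on the encoding of the source file.
$latin1 = to_page_encoding("caf\xC3\xA9");
```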

You also want to make sure that URLs extracted from the feed cannot cause trouble, so I use the following function to process those URLs:

function safe_url($raw_url) {
    $url_scheme = parse_url($raw_url, PHP_URL_SCHEME);
    if ($url_scheme == 'http' || $url_scheme == 'https') {
        return htmlspecialchars($raw_url, ENT_QUOTES, 'UTF-8',
                false);
    }
    // parse_url failed, or the scheme was not hypertext-based.
    return false;
}

This function simply checks that the scheme of the URL is "http" or "https", and then escapes the URL using htmlspecialchars with the ENT_QUOTES option. The scheme is checked to avoid URLs that begin "javascript:", because we don't want to allow third-party URLs to execute JavaScript on our site. htmlspecialchars is called with ENT_QUOTES to make sure that the URL does not contain unencoded angle-brackets or quote symbols (which would allow the URL to break out of a href attribute in an a element, and could lead to a lot of trouble). The fourth parameter is false to avoid double-encoding existing entities, such as ampersand entities which may be needed in query strings (this double_encode parameter was added in PHP 5.2.3). Again, you need to specify 'UTF-8' as the encoding for htmlspecialchars, and use iconv if your web page uses a character encoding other than UTF-8.

If the scheme of the URL is not acceptable, then safe_url will return false, and you should check for this in your code.
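
To see what the two building blocks of safe_url do on their own, here is a short sketch using made-up URLs. It only exercises parse_url and htmlspecialchars directly; it is not a replacement for the function above.

```php
<?php
// parse_url reports the scheme, or null when the URL has none.
$scheme_js = parse_url('javascript:alert(1)', PHP_URL_SCHEME);
$scheme_https = parse_url('https://example.com/story?id=1', PHP_URL_SCHEME);
$scheme_none = parse_url('/relative/path', PHP_URL_SCHEME);

// With the fourth parameter set to false, an existing &amp; is left
// alone, while the bare ampersand and the quotes are still encoded.
$raw_url = 'http://example.com/?a=1&amp;b=2&c="x"';
$escaped = htmlspecialchars($raw_url, ENT_QUOTES, 'UTF-8', false);
```

The first URL reports the scheme "javascript" and the relative path reports no scheme at all, so safe_url would reject both of them.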

With these data-scrubbing functions at the ready, we can run through the feed object, extract the feed title and its logo graphic and then iterate through the entry objects to output story headlines and summaries.

The main loop

Which content you extract and how you choose to wrap it in HTML markup will depend on the feed and on how you want to structure and style the end result. But here's the main loop of the code I use to produce the example above, showing how to access data in the SimpleXMLElement object $feed, and how to make use of the clean_text and safe_url functions (and check to see whether the output of safe_url is false).

echo '<dl>';
$entry_count = 0;
// Have to call date_default_timezone_set otherwise the
// date functions generate warning messages every time
// you use them.
date_default_timezone_set('UTC');
foreach ($feed->entry as $entry) {
    $url = safe_url((string) $entry->link['href']);
    $summary = extract_header_and_first_paragraph(
            $entry->summary);
    // If either the item URL is bad, or the summary text did
    // not match the h4 + p pattern expected, then skip this
    // item and hope the next item is in better shape.
    if (!$url || !$summary) {
        continue;  // skip this entry
    }

    // Limit the title to a suitable maximum length, and make
    // sure that no rogue markup can get into the output.
    $safe_title = clean_text($entry->title, 70);
    echo '<div class="feed_story">';  // box dt + dd pairs
    echo '<dt><a href="'.$url.'">'.$safe_title.
            '</a></dt>';
    echo '<dd>';
    echo '<p class="quip"><span class="text">'.
            clean_text($summary['header'], 80).'</span></p>';
    // Attempt to process the "updated" value as a date, and
    // if successful, add a date and time to this entry.
    $date = date_create($entry->updated);
    if ($date != false) {
        echo '<p class="dateLine">'.
        date_format($date, 'jS F \a\t H:i \U\T\CO').
        '</p>';
    }
    echo '<p>'.
            clean_text($summary['text'], 260).'</p>';
    echo '</dd></div>';  // end div.feed_story
    ++$entry_count;
    if ($entry_count >= 5) {
        break;  // stop after first five entries
    }
}
// If all of the entries were skipped or there were none . . .
if ($entry_count < 1) {
    echo '<dt>No headlines</dt>';
    echo '<dd>There may be a problem with the feed, or '.
            'perhaps the feed processing script has a '.
            'fault.</dd>';
    trigger_error('$entry_count was zero, probably due to '.
            'the XML feed content changing or being '.
            'corrupt.');
}
echo '</dl>';

I'm using a definition list (dl element) to structure the headline and summary pairs, and I'm wrapping each pair in a div element with class "feed_story" to make it easier to target each story using CSS selectors, so that a box can be placed around each story.

Note that my code calls a function named extract_header_and_first_paragraph. This function is specific to the feed produced by The Register, and it simply uses a regular expression to grab the h4 element and first p element from the summary content, and returns them in an associative array so that $summary['header'] contains the content of the h4 element, and $summary['text'] contains the content of the first p element.

Remember, this function is specific to the summary nodes from The Register feed, but here it is in case you find it informative:

function extract_header_and_first_paragraph($summary_text) {
    // NOTE: SimpleXML seems to automatically convert entities
    // into their Unicode characters. This feature is not
    // documented in the PHP docs.
    $match_found = preg_match('#<h4>(.+?)</h4>'.
            '.*?<p>(.+?)</p>#is',
            $summary_text, $matches);
    if (!$match_found) {
        return false;
    }
    return array('header' => $matches[1], 'text' => $matches[2]);
}

If the expected h4-followed-by-p structure is not found, then this function returns false, and this has to be checked in the main loop. If this happens, then the current entry is skipped, and the next entry is checked instead. If all of the entry nodes are examined and none of them fit, then $entry_count will be zero, and a brief apology will be shown to the user.

Also note that some XML documents (especially RSS feeds) use CDATA sections to contain HTML markup which is not entity-encoded, or to contain all text content. The libxml library has a LIBXML_NOCDATA option which can be passed to the SimpleXMLElement constructor to cause it to "Merge CDATA as text nodes". This might be useful if your feed contains content within CDATA sections, though I haven't had a need to try it myself.
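
For what it's worth, here is a small sketch of passing that option, using an invented RSS fragment. In this simple case, casting the element to a string gives the same text with or without LIBXML_NOCDATA; as far as I can tell, the option matters more when you dump or convert the whole object tree, so test it against your own feed.

```php
<?php
// A made-up RSS item whose description is wrapped in a CDATA section.
$rss = '<rss version="2.0"><channel><item>'
     . '<title>Story</title>'
     . '<description><![CDATA[<p>Hello & welcome</p>]]></description>'
     . '</item></channel></rss>';

// Second constructor argument: libxml options.
$plain = new SimpleXMLElement($rss);
$merged = new SimpleXMLElement($rss, LIBXML_NOCDATA);

$text_plain = (string) $plain->channel->item->description;
$text_merged = (string) $merged->channel->item->description;
```

Either way, the extracted text contains raw markup, so it still needs to go through something like clean_text before being written to the page.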

Cache_Lite

Because all of this fetching and processing is quite expensive (in terms of time, CPU cycles, data usage, etc.), it makes sense to cache the result so that if another person visits the page within the next few minutes, the result can be retrieved from cache quickly. On my test machine, building the news feed summary the hard way (fetching and processing the XML) was taking between 0.24 and 0.44 seconds, while retrieving a recent copy of the result from a cached file was taking between 0.002 and 0.003 seconds. So building the result took between 80 and 220 times longer than simply retrieving it from cache.
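
If you want to take similar measurements on your own server, a rough sketch using microtime is shown here. The function names are invented, and demo_build simply sleeps for a moment as a stand-in for the real fetching and processing code.

```php
<?php
// Hypothetical helper: returns the seconds taken by one call.
function seconds_taken($callback) {
    $start = microtime(true);
    call_user_func($callback);
    return microtime(true) - $start;
}

// Stand-in for the real work of fetching and processing the feed.
function demo_build() {
    usleep(2000);
}

$elapsed = seconds_taken('demo_build');
printf("Built in %.4f seconds\n", $elapsed);
```

Timings like these vary from run to run, so take several measurements before comparing the cached and uncached cases.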

Using a cache also means that your script isn't asking for the XML document from the third-party site more often than it really needs to, which will avoid your page causing the other site serious bandwidth or traffic usage problems.

Installing Cache_Lite

Luckily there is a PEAR extension called Cache_Lite which makes content caching very simple, once you've got it installed.

Because Cache_Lite is a PEAR extension, you must have PEAR installed before you can install and use Cache_Lite. If you're on a shared hosting package using cPanel, you may find a "PHP PEAR Packages" tool in the cPanel main page. This lets you easily search for and install new extensions, such as Cache_Lite, and update it to the latest version with a single click of an "Update" button. If you don't have a cPanel service, you may have to visit the PEAR website and install PEAR manually. This may not be possible if you're on a shared hosting package, so check with your web host's technical support.

Even if you use cPanel to install the Cache_Lite extension, you might find that it installs to a directory within your personal directory tree on the web server, which means that PHP's include path almost certainly won't know where to find it. So you'll need to add something like the following to your scripts before using Cache_Lite functions:

// Get actual path to the directory above DOCUMENT_ROOT
// Returned value does not end with a slash.
$ops_dir = realpath($_SERVER['DOCUMENT_ROOT'].'/../');

// Ask the include path to search the php folder too
$pear_path = $ops_dir.'/php';
set_include_path(get_include_path() . PATH_SEPARATOR . $pear_path);

This assumes that your PEAR extensions directory is a directory called 'php', found on the level above your public webspace (document root) directory. Obviously you'll need to change this to match the path to PEAR extensions as they're installed on your own server.

Using Cache_Lite in your PHP pages

That's the difficult bit done. Once Cache_Lite is installed, it's incredibly simple to use, and there's no point in my demonstrating, because the Cache_Lite Introduction page on the official site covers it so well. You just wrap the Cache_Lite code around (all of) the PHP code needed to fetch the XML feed, process the feed, and output the HTML result, and Cache_Lite will serve up a cached copy if its cache is fresh enough, or run the feed crunching code if the cache is too old.

A final note about Cache_Lite. You need to decide where cache files will be stored. By default Cache_Lite saves them to the top-level '/tmp/' directory. That's fine if you own your own server, but it might be forbidden (or a security risk) if you're on a shared server, so check with technical support if you're not sure. It's easy enough to create a 'cache-for-Cache_Lite' directory on the level above document root, and then specify this location in the $options array you use to create a Cache_Lite object (where $ops_dir is the same variable defined in the previous block of code, above):

// Specify Cache_Lite options (including cache object lifetime
// in seconds)
$options = array(
    'cacheDir' => $ops_dir.'/cache-for-Cache_Lite/',
    'lifeTime' => 3600
);

Note that the lifeTime value is in seconds, so 3600 is one hour, which means that the cached content will be used to serve any visitors that view the page within one hour of the feed box being built. This means that your script will fetch the XML feed from the third-party site a maximum of twenty-four times per day, no matter how many visitors your page receives. Which is far better than fetching it every time a visitor views the page.
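
To show how the pieces fit together, here is a minimal sketch of the wrap-around pattern, written as a small helper so that the caching logic sits in one place. The names get_cached_or_build, build_feed_html and the cache id 'register_headlines' are invented for this example; the helper relies only on the get and save methods of a Cache_Lite object.

```php
<?php
// Hypothetical helper. $cache is expected to be a Cache_Lite object,
// and $build_callback is any callable that returns the finished HTML
// as a string (or false if the feed could not be processed).
function get_cached_or_build($cache, $cache_id, $build_callback) {
    $html = $cache->get($cache_id);  // false when missing or too old
    if ($html !== false) {
        return $html;
    }
    $html = call_user_func($build_callback);
    // Only store real output, so that a failed fetch is retried on
    // the next page view instead of being cached for an hour.
    if (is_string($html) && $html !== '') {
        $cache->save($html, $cache_id);
    }
    return $html;
}

// Typical wiring, assuming Cache_Lite is on the include path and that
// build_feed_html is your own function returning the HTML string:
//   require_once 'Cache/Lite.php';
//   $cache = new Cache_Lite($options);
//   echo get_cached_or_build($cache, 'register_headlines',
//           'build_feed_html');
```

Having the builder return a string rather than echo it directly makes it easy to skip caching when something goes wrong.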

In summary

I did intend this article to be short, but once again the finished piece is pretty lengthy. Which is misleading, because while SimpleXML and Cache_Lite take a bit of work to get installed and ready, they're both quick and simple to use in your scripts while being very powerful tools. Hopefully this article has enlightened you rather than scared you off altogether, but if you see anything that's incorrect, confusing or unclear, let me know.