Extracting XML data in PHP with SimpleXML

Some of my previous articles were about extracting XML using regular expressions. This method was useful in PHP 4 as a quick way to grab pieces of data in well-structured XML documents, but it had its limitations.

For web developers using PHP 5, the SimpleXML extension is a quicker, easier way to access content in data-oriented XML documents [see note below], and it's built into PHP 5 by default. This article shows you how to use SimpleXML to extract the latest headlines from an Atom XML feed and display them on your site. It also explains how you can use an extension called Cache_Lite to save the results, to greatly speed up loading times.

The result

Here are the first five headlines from the Atom XML feed from The Register, my preferred source for technology (and assorted) news stories.

The Register

Current headlines

Major publishers sue Perplexity AI for scraping without paying

We sell that to OpenAI – how dare you steal it and make stuff up

22nd October at 07:30 UTC+0000

Major US news publishers Dow Jones & Co and NYP Holdings have sued AI search engine startup Perplexity for scraping their content without paying for it.…

Lab-grown human brain cells drive virtual butterfly in simulation

Could organoid-driven computing be the future of AI power?

22nd October at 06:30 UTC+0000

Researchers affiliated with the neuroscience platform FinalSpark have devised a 3D simulation depicting a butterfly that's directed by human brain cells.…

Pixel perfect Ghostpulse malware loader hides inside PNG image files

Miscreants combine it with an equally tricky piece of social engineering

22nd October at 05:30 UTC+0000

The Ghostpulse malware strain now retrieves its main payload via a PNG image file's pixels. This development, security experts say, is "one of the most significant changes" made by the crooks behind it since launching in 2023.…

India, Nvidia, discuss jointly developed AI chip

Current capabilities mean local manufacturing is not likely – but a chip…

22nd October at 04:26 UTC+0000

India's government is reportedly in talks with Nvidia to co-develop AI silicon.…

China Telecom's next 150,000 servers will mostly use local processors

Intel and AMD left scrapping over about a third of the deal, and license fees

22nd October at 03:32 UTC+0000

Most years, China Telecom posts a tender for new servers to help it run the apps it needs to serve its hundreds of millions of customers. This year, its 150,000-plus orders will mostly go to domestic manufacturers who use local tech.…

Feed fetched 2024-10-22 08:09:11 UTC

The headlines, story dates and summaries have been extracted using SimpleXML and then wrapped in HTML markup. CSS is used to give the whole thing layout and colour.

Requirements and suitability

The code on this page requires that you are using PHP 5.2.3 or later. Version 5.2.3 was released in May 2007, so you or your web host really should be running at least that version on your server by now. (If not, you should perhaps consider switching to another web host.)

To fetch the XML feed, this code uses cURL. This extension is not compiled/enabled by default, so see the cURL PHP documentation to find out how to add the extension to your PHP build. Note that cURL is not required to use SimpleXML, though, and it's just one way of fetching an XML feed from a third-party site.

While this page uses an Atom XML feed as its data source, SimpleXML can be used to work with other data-oriented XML document types. Note: SimpleXML is not suited to accessing XML documents which contain mixed-content elements, such as XHTML documents where text and elements mingle together like this:
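As a rough illustration of the problem, the sketch below loads a made-up XHTML fragment (not taken from any real page) in which text and child elements are interleaved. The child elements can still be reached by name, but casting the parent to a string should give back only its own text nodes, so the position of each child within the sentence is lost.

```php
<?php
// Made-up XHTML fragment with mixed content: text and elements interleaved.
$xhtml = '<p>SimpleXML is <em>not</em> ideal for <a href="#top">markup</a> like this.</p>';
$p = simplexml_load_string($xhtml);

// Child elements are reachable by name...
echo $p->em, "\n";
echo $p->a['href'], "\n";

// ...but the parent's string value holds only its own text nodes, so there
// is no way to tell where the em and a elements sat within the sentence.
var_dump((string) $p);
```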

If you need to access mixed-content XML documents like this, take a look at the more complex XML libraries on offer in PHP 5, such as DOM. If your XML document is very large, you may need to use a parser such as XMLReader instead (because object tree-based models such as SimpleXML have to load the entire XML structure into memory in one go).

Loading the XML feed

If you're working with a file on your own server, then it's probably simplest to use the simplexml_load_file function like this:

which will open that file and produce a SimpleXMLElement object tree in one easy command. If this works for you, you can skip down to the section about accessing the SimpleXMLElement object tree, but you should probably start from the section about SimpleXML to see how to capture errors created by libxml.
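A minimal sketch of that call follows. The file contents are invented for the example, and a temporary file stands in for your own XML file so that the snippet runs anywhere; in practice you would pass the path of a real file on your server.

```php
<?php
// A temporary file stands in for a local XML file (hypothetical content).
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, '<feed><title>Example feed</title></feed>');

// simplexml_load_file returns a SimpleXMLElement, or false on failure.
$feed = simplexml_load_file($path);
if ($feed === false) {
    echo "Could not load or parse the file\n";
} else {
    echo $feed->title, "\n";
}

unlink($path); // tidy up the temporary file
```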

On the other hand, if you're fetching an XML feed from another site, as I'm doing on this page, the simplexml_load_file function may not work, because many servers are configured to prohibit file-handling commands from accessing URLs (the PHP setting allow_url_fopen is set to "Off" for security reasons). This is where cURL comes in. If cURL is available on your PHP build, then you can use it to fetch files from URLs on other sites in a few lines of code.

Using cURL to fetch an XML document

At this point it's important to say that you must check that you have permission to use another site's XML feed before you proceed. Many sites provide a page that tells you what you're allowed to do with their feeds, and a person to contact if you have any questions. I contacted the Digital Operations Manager at The Register to check that my intended use of their XML feed was acceptable, and he gave me the green light. But make sure you do have permission from the site which owns a feed (or any other type of content) before you use it for any purpose.

Back to cURL. To fetch the main Atom XML feed from The Register, we can use cURL like this:

After this code has executed, the variable $xml should be a string containing the entire document referenced by the URL (an Atom XML document in this case). However, if cURL encountered a problem, $xml will instead be false, and it's important to check for this and handle the error gracefully if it occurs.
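The fetch described above could be sketched roughly as follows. The function name fetch_feed, the timeout values and the feed URL shown in the comment are assumptions made for this example, so check the feed address against the site's own documentation before relying on it.

```php
<?php
// Hypothetical helper: returns the document at $url as a string,
// or false if cURL could not fetch it.
function fetch_feed($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP errors (400 and above) as failures
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // seconds (assumed value)
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // seconds (assumed value)

    $xml = curl_exec($ch);
    if ($xml === false) {
        // Log the problem rather than showing it to visitors.
        error_log('cURL error: ' . curl_error($ch));
    }
    return $xml; // the handle is released when $ch goes out of scope
}

// Usage (the URL is an assumption, not taken from this article):
// $xml = fetch_feed('https://www.theregister.com/headlines.atom');
// if ($xml === false) { /* show a fallback message and stop */ }
```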

SimpleXML

Once you've got a string variable, $xml, which contains an entire XML document, creating a SimpleXMLElement object can be as simple as this:
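In its simplest form that might look like the sketch below, where a short made-up Atom document stands in for the string fetched from the remote site.

```php
<?php
// Made-up stand-in for the XML string fetched from the remote site.
$xml = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example feed</title></feed>';

// Returns a SimpleXMLElement on success, or false if the XML cannot be parsed.
$feed = simplexml_load_string($xml);

echo $feed->title, "\n";
```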

However, it's better to create a function using the following code instead, so that any errors from libxml (upon which SimpleXML relies) are caught rather than dumped out to the page:
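One way to write such a function is sketched here. The name load_xml_string is hypothetical, and the choice to report each libxml error through trigger_error as an E_USER_WARNING is an assumption; adapt the reporting to whatever your site uses.

```php
<?php
// Hypothetical wrapper: parses $xml and returns a SimpleXMLElement,
// or false if libxml reports that the document cannot be parsed.
function load_xml_string($xml)
{
    // Collect libxml errors internally instead of letting them be emitted.
    $previous = libxml_use_internal_errors(true);

    $tree = simplexml_load_string($xml);
    if ($tree === false) {
        foreach (libxml_get_errors() as $error) {
            trigger_error(
                'libxml: ' . trim($error->message) . ' (line ' . $error->line . ')',
                E_USER_WARNING
            );
        }
    }

    // Clear the error buffer and restore the previous libxml setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $tree;
}

// Usage with a made-up document:
$feed = load_xml_string('<feed><title>Example feed</title></feed>');
if ($feed !== false) {
    echo $feed->title, "\n";
}
```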

(Note that the trigger_error function will likely also dump the error messages out to the page unless you define an error handler to do something more suitable with them.)

Now you can call this function, supplying it with the $xml variable that contains the XML document, and it will return a SimpleXMLElement object tree, or false if something goes wrong:

We've called the object variable $feed because it represents the root element of the XML document, and the root element of an Atom XML feed is called feed.

Accessing the SimpleXMLElement object tree

Assuming nothing went wrong, $feed will now be an object tree which has a structure exactly like the XML document. With this SimpleXMLElement object, it's now trivially easy to access content from the XML document.

According to the Atom Syndication Format (RFC 4287), the root element of an Atom feed document must be a feed element, which contains zero or more entry elements, each representing a news story. Each entry element must contain title, id and updated elements, and can optionally contain a summary element holding a brief description of the story.

So for the Atom feed example on this page, the $feed object is the root node, and it contains objects named after the elements which are the immediate children of the feed element in an Atom XML feed, such as the title, id and updated elements.

For example, to get the title of the second entry element, you just access the object like this, where the -> operator is used to point from an object to one of its child objects:

(Note: the keys for these objects start at zero, like in arrays, so entry[1] refers to the second entry, not the first.)
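Here is a small self-contained sketch of that access pattern. The Atom document is invented for the example; with a real feed, $feed would come from the loading step described earlier.

```php
<?php
// Invented two-entry Atom feed, standing in for the real one.
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <id>urn:example:feed</id>
  <updated>2024-10-22T08:00:00Z</updated>
  <entry>
    <title>First story</title>
    <id>urn:example:1</id>
    <updated>2024-10-22T07:30:00Z</updated>
  </entry>
  <entry>
    <title>Second story</title>
    <id>urn:example:2</id>
    <updated>2024-10-22T06:30:00Z</updated>
  </entry>
</feed>
XML;

$feed = simplexml_load_string($xml);

echo $feed->title, "\n";            // title of the feed itself
echo $feed->entry[1]->title, "\n";  // title of the second entry (index 1)
```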

If you want to iterate through all of the entry objects in the feed and print out each entry title, it's as simple as this:
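A loop along these lines would do it. The feed content is made up for the example, and the titles are collected into a hypothetical $titles array only so that the result is easy to inspect.

```php
<?php
// Invented feed with three entries.
$xml = '<feed xmlns="http://www.w3.org/2005/Atom">'
     . '<entry><title>First story</title></entry>'
     . '<entry><title>Second story</title></entry>'
     . '<entry><title>Third story</title></entry>'
     . '</feed>';
$feed = simplexml_load_string($xml);

$titles = array();
foreach ($feed->entry as $entry) {
    // Cast to string to get the text content rather than an object.
    $titles[] = (string) $entry->title;
    echo $entry->title, "\n";
}
```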

You can also access attribute values and namespace values using the SimpleXMLElement object. For more examples, see the "Basic usage" page of the SimpleXML documentation on the PHP site.
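As a brief sketch of both, the made-up entry below carries a link element with an href attribute and a child element in a second namespace. Attributes are read with array-style keys, and namespaced children are reached through the children method, passing the namespace URI. The dc:creator element is an assumption chosen for illustration; your feed may not contain it.

```php
<?php
$xml = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <entry>
    <title>First story</title>
    <link rel="alternate" href="https://example.com/first"/>
    <dc:creator>A. Writer</dc:creator>
  </entry>
</feed>
XML;

$feed  = simplexml_load_string($xml);
$entry = $feed->entry[0];

// Attribute values use array-style access on the element.
$href = (string) $entry->link['href'];

// Elements in another namespace are reached via children() and the namespace URI.
$creator = (string) $entry->children('http://purl.org/dc/elements/1.1/')->creator;

echo $href, "\n", $creator, "\n";
```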

Now you've got your SimpleXMLElement object, you're ready to extract content from the XML feed and write it to your page.

Extract and markup

Check data carefully

First, a warning. As always you, as the web developer, must suspect all third-party data of being potentially dangerous. Whether it's data submitted to your site via a form, or the content of an XML feed fetched from a remote site, you need to process the data as though it could be harmful. I trust the good people at The Register, but if their site was hacked, their XML feed might fall under the control of malicious agents, and it could then contain harmful hyperlinks, for instance.

To scrub text taken from the feed I use a function called clean_text. It removes any HTML tags encoded as entities within the text, trims the text to a maximum length if one is provided, and then re-encodes the whole text using PHP's htmlentities function. Note that SimpleXML internally uses UTF-8 character encoding, so you must specify 'UTF-8' in the htmlentities and html_entity_decode function calls. If your web page is using a character encoding other than UTF-8, you should use PHP's iconv function to convert the feed content from UTF-8 into the encoding your page uses.
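A minimal sketch of a function with that behaviour follows. The parameter names, the use of strip_tags to remove the decoded tags, and the use of the mbstring functions for UTF-8-safe trimming are all assumptions made for this example rather than a record of the original code.

```php
<?php
// Sketch of a text-scrubbing helper: decode entities, strip any tags that
// appear, optionally trim to $max_length characters, then re-encode.
function clean_text($text, $max_length = 0)
{
    // Turn entity-encoded markup such as &lt;b&gt; back into real tags...
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // ...so that strip_tags can remove them.
    $plain = trim(strip_tags($decoded));

    // Trim on character (not byte) boundaries; assumes mbstring is available.
    if ($max_length > 0 && mb_strlen($plain, 'UTF-8') > $max_length) {
        $plain = mb_substr($plain, 0, $max_length, 'UTF-8');
    }

    // Re-encode everything so the result is safe to place in HTML.
    return htmlentities($plain, ENT_QUOTES, 'UTF-8');
}

echo clean_text('&lt;b&gt;Hi&lt;/b&gt; &amp; bye'), "\n";
```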

You also want to make sure that URLs extracted from the feed cannot cause trouble, so I use the following function to process those URLs:

This function simply checks that the scheme of the URL is "http:" or "https:", and then uses htmlentities with the ENT_QUOTES option. The scheme is checked to avoid URLs that begin "javascript:", because we don't want to allow third-party URLs to execute JavaScript on our site. And htmlentities is called with ENT_QUOTES to make sure that the URL does not contain unencoded angle-brackets or quote symbols (which would allow the URL to break out of a href attribute in an a element, and could lead to a lot of trouble), and the fourth parameter is false to avoid double-encoding existing entities (such as ampersand entities which may be needed in query strings). Again, you need to specify 'UTF-8' as the encoding for htmlentities, and use iconv if your web page uses a character encoding other than UTF-8.

If the scheme of the URL is not acceptable, then safe_url will return false, and you should check for this in your code.
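The checks described above can be sketched as follows. Using a regular expression to test the scheme is an assumption for this example; the description only says that the scheme is checked and that htmlentities is called with ENT_QUOTES, 'UTF-8' and a fourth parameter of false.

```php
<?php
// Sketch of a URL-scrubbing helper: returns an attribute-safe URL,
// or false if the URL does not start with http:// or https://.
function safe_url($raw_url)
{
    // Reject anything that is not plain http or https (e.g. "javascript:").
    if (preg_match('#^https?://#i', $raw_url) !== 1) {
        return false;
    }

    // Encode quotes and angle brackets; false avoids double-encoding
    // entities that are already present, such as &amp; in query strings.
    return htmlentities($raw_url, ENT_QUOTES, 'UTF-8', false);
}

$url = safe_url('http://example.com/?a=1&b=2');
if ($url !== false) {
    echo '<a href="' . $url . '">link</a>', "\n";
}
```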

With these data-scrubbing functions at the ready, we can run through the feed object, extract the feed title and its logo graphic and then iterate through the entry objects to output story headlines and summaries.

The main loop

Which content you extract and how you choose to wrap it in HTML markup will depend on the feed and on how you want to structure and style the end result. But here's the main loop of the code I use to produce the example above, showing how to access data in the SimpleXMLElement object $feed, and how to make use of the clean_text and safe_url functions (and check to see whether the output of safe_url is false).

I'm using a definition list (dl element) to structure the headline and summary pairs, and I'm wrapping each pair in a div element with class "feed_story" to make it easier to target each story using CSS selectors, so that a box can be placed around each story.

Note that my code calls a function named extract_header_and_first_paragraph. This function is specific to the feed produced by The Register, and it simply uses a regular expression to grab the h4 element and first p element from the summary content, and returns them in an associative array so that $summary['header'] contains the content of the h4 element, and $summary['text'] contains the content of the first p element.

Remember, this function is specific to the summary nodes from The Register feed, but here it is in case you find it informative:

If the expected h4-followed-by-p structure is not found, then this function returns false, and this has to be checked in the main loop. If this happens, then the current entry is skipped, and the next entry is checked instead. If all of the entry nodes are examined and none of them fit, then $entry_count will be zero, and a brief apology will be shown to the user.
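To show how these pieces might fit together, here is a simplified sketch of the feed-specific helper and a main loop. It is an approximation under stated assumptions, not the code behind the example at the top of the page: render_feed, its $max_stories parameter, the exact regular expression, the 300-character limit and the apology text are all hypothetical, the feed logo and story dates are left out, and compact stand-ins for clean_text and safe_url are repeated only so that the block runs on its own.

```php
<?php
// Compact stand-ins for the scrubbing helpers described earlier.
function clean_text($text, $max_length = 0)
{
    $plain = trim(strip_tags(html_entity_decode($text, ENT_QUOTES, 'UTF-8')));
    if ($max_length > 0 && mb_strlen($plain, 'UTF-8') > $max_length) {
        $plain = mb_substr($plain, 0, $max_length, 'UTF-8');
    }
    return htmlentities($plain, ENT_QUOTES, 'UTF-8');
}

function safe_url($raw_url)
{
    if (preg_match('#^https?://#i', $raw_url) !== 1) {
        return false;
    }
    return htmlentities($raw_url, ENT_QUOTES, 'UTF-8', false);
}

// Feed-specific: grab the h4 and the first p that follows it, or return false.
function extract_header_and_first_paragraph($summary_html)
{
    $pattern = '#<h4[^>]*>(.*?)</h4>\s*<p[^>]*>(.*?)</p>#is';
    if (preg_match($pattern, $summary_html, $matches) !== 1) {
        return false;
    }
    return array('header' => $matches[1], 'text' => $matches[2]);
}

// Hypothetical main loop: builds a definition list of up to $max_stories stories.
function render_feed(SimpleXMLElement $feed, $max_stories = 5)
{
    $html = '<h2>' . clean_text((string) $feed->title) . "</h2>\n<dl>\n";
    $entry_count = 0;

    foreach ($feed->entry as $entry) {
        if ($entry_count >= $max_stories) {
            break;
        }
        $url     = safe_url((string) $entry->link['href']);
        $summary = extract_header_and_first_paragraph((string) $entry->summary);
        if ($url === false || $summary === false) {
            continue; // skip entries that fail either check
        }
        $html .= '<div class="feed_story">' . "\n"
               . '<dt><a href="' . $url . '">' . clean_text((string) $entry->title) . "</a></dt>\n"
               . '<dd>' . clean_text($summary['header']) . "</dd>\n"
               . '<dd>' . clean_text($summary['text'], 300) . "</dd>\n"
               . "</div>\n";
        $entry_count++;
    }

    $html .= "</dl>\n";
    if ($entry_count === 0) {
        $html .= "<p>Sorry, no stories could be read from the feed.</p>\n";
    }
    return $html;
}
```

Calling echo render_feed($feed) with the $feed object built earlier would then write the list to the page.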

Also note that some XML documents (especially RSS feeds) use CDATA sections to contain HTML markup which is not entity-encoded, or to contain all text content. The libxml library has a LIBXML_NOCDATA option which can be passed to the SimpleXMLElement constructor to cause it to "Merge CDATA as text nodes". This might be useful if your feed contains content within CDATA sections, though I haven't had a need to try it myself.
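For what it's worth, passing the option might look like the sketch below, which uses a made-up RSS-style item. The same flag can be given as the third argument of simplexml_load_string or as the second argument of the SimpleXMLElement constructor. How much difference it makes in practice will depend on how you read the tree, so treat this as a starting point for your own testing.

```php
<?php
// Made-up item whose description is wrapped in a CDATA section.
$xml = '<item><description><![CDATA[<p>Hello</p>]]></description></item>';

// Third argument: libxml options. LIBXML_NOCDATA merges CDATA into text nodes.
$item = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);

// Equivalent, using the constructor directly.
$same = new SimpleXMLElement($xml, LIBXML_NOCDATA);

echo (string) $item->description, "\n";
```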

Cache_Lite

Because all of this fetching and processing is quite expensive (in terms of time, CPU cycles, data usage, etc), it makes sense to cache the result so that if another person visits the page within the next few minutes, the result can be retrieved from cache quickly. On my test machine, building the news feed summary the hard way (fetching and processing the XML) was taking between 0.24 and 0.44 seconds; while retrieving a recent copy of the result from a cached file was taking between 0.002 and 0.003 seconds. So building the result took between 80 and 220 times longer than simply retrieving it from cache.

Using a cache also means that your script isn't asking for the XML document from the third-party site more often than it really needs to, which will avoid your page causing the other site serious bandwidth or traffic usage problems.

Installing Cache_Lite

Luckily there is a PEAR extension called Cache_Lite which makes content caching very simple, once you've got it installed.

Because Cache_Lite is a PEAR extension, you must have PEAR installed before you can install and use Cache_Lite. If you're on a shared hosting package using cPanel, you may find a "PHP PEAR Packages" tool in the cPanel main page. This lets you easily search for and install new extensions, such as Cache_Lite, and update it to the latest version with a single click of an "Update" button. If you don't have a cPanel service, you may have to visit the PEAR website and install PEAR manually. This may not be possible if you're on a shared hosting package, so check with your web host's technical support.

Even if you use cPanel to install the Cache_Lite extension, you might find that it installs to a directory within your personal directory tree on the web server, which means that PHP's include path almost certainly won't know where to find it. So you'll need to add something like the following to your scripts before using Cache_Lite functions:

This assumes that your PEAR extensions directory is a directory called 'php', found on the level above your public webspace (document root) directory. Obviously you'll need to change this to match the path to PEAR extensions as they're installed on your own server.
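A sketch of that include-path change is shown here. It assumes that $ops_dir holds the directory one level above document root and that PEAR lives in a 'php' directory inside it; the variable $pear_dir is introduced only for this example.

```php
<?php
// Directory one level above the public webspace (assumed layout).
$document_root = isset($_SERVER['DOCUMENT_ROOT']) ? $_SERVER['DOCUMENT_ROOT'] : '';
$ops_dir = dirname($document_root);

// Assumed location of the PEAR packages; change to match your server.
$pear_dir = $ops_dir . '/php';

// Append it to the existing include path so Cache/Lite.php can be found.
set_include_path(get_include_path() . PATH_SEPARATOR . $pear_dir);
```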

Using Cache_Lite in your PHP pages

That's the difficult bit done. Once Cache_Lite is installed, it's incredibly simple to use, and there's no point in my demonstrating, because the Cache_Lite Introduction page on the official site covers it so well. You just wrap the Cache_Lite code around (all of) the PHP code needed to fetch the XML feed, process the feed, and output the HTML result, and Cache_Lite will serve up a cached copy if its cache is fresh enough, or run the feed crunching code if the cache is too old.

A final note about Cache_Lite. You need to decide where cache files will be stored. By default Cache_Lite saves them to the top-level '/tmp/' directory, which is fine if you own your own server, but it might be forbidden (or a security risk) if you're on a shared server, so check with technical support if you're not sure. It's easy enough to create a 'cache-files-for-Cache_Lite' directory on the level above document root, and then specify this location in the $options array you use to create a Cache_Lite object (where $ops_dir is the same variable defined in the previous block of code, above):

Note that the lifeTime value is in seconds, so 3600 is one hour, which means that the cached content will be used to serve any visitors that view the page within one hour of the feed box being built. This means that your script will fetch the XML feed from the third-party site a maximum of twenty-four times per day, no matter how many visitors your page receives. Which is far better than fetching it every time a visitor views the page.
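The options described above might be set up roughly like this. The cacheDir and lifeTime keys are Cache_Lite options; the meaning given to $ops_dir, the cache id 'register_feed' and the build_feed_box function named in the comments are assumptions made for the example. The calls that need the PEAR package are left as comments so that the snippet runs even where Cache_Lite is not installed.

```php
<?php
// Directory one level above document root (assumed meaning of $ops_dir).
$document_root = isset($_SERVER['DOCUMENT_ROOT']) ? $_SERVER['DOCUMENT_ROOT'] : '';
$ops_dir = dirname($document_root);

$options = array(
    // Cache_Lite expects the cache directory to end with a slash.
    'cacheDir' => $ops_dir . '/cache-files-for-Cache_Lite/',
    // Seconds before a cached copy is considered stale: 3600 = one hour,
    // which caps remote fetches at 24 per day.
    'lifeTime' => 3600,
);

// With the PEAR package installed and on the include path:
// require_once 'Cache/Lite.php';
// $cache = new Cache_Lite($options);
// $html = $cache->get('register_feed');      // false if missing or stale
// if ($html === false) {
//     $html = build_feed_box();              // hypothetical: fetch, parse, mark up
//     $cache->save($html, 'register_feed');
// }
// echo $html;
```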

In summary

I did intend this article to be short, but once again the finished piece is pretty lengthy. Which is misleading, because while SimpleXML and Cache_Lite take a bit of work to get installed and ready, they're both quick and simple to use in your scripts while being very powerful tools. Hopefully this article has enlightened you rather than scared you off altogether, but if you see anything that's incorrect, confusing or unclear, let me know.
