Getting data from XML feeds using PHP

A how-to guide by Bobulous.

Introduction

[Note: This article has been superseded by the newer article reading XML feeds using SimpleXML.]

In a previous article, I introduced a set of simple functions that grab data from XML files. The functions are a little rough and ready, but with well-structured XML data formats they work very well. This article offers an example of how to use those functions to grab the latest headlines from a real-world RSS feed and then format them into a "latest news" box for your website.

The desired result

Here is the result of fetching an XML feed, extracting the desired data from it, and formatting that data into a block of HTML. Using CSS, you can alter the appearance of the news box radically, choosing colours, sizes, positions and fonts to suit the situation.

World news | The Guardian

Nigerian woman rescued 10 years after kidnap by Boko Haram in Chibok
Lydia Simon, recovered along with three children born in captivity, was one of 276 schoolgirls taken in 2014Nigerian troops have rescued a pregnant woman and her three children 10 years after she was abducted by Boko Haram militants when she was...
[17:53, 18th April GMT]
War, grief and hope: the stories behind the World Press Photo award-winners
Images from Gaza, Ukraine, Madagascar and the US border chosen by global jury from more than 60,000 entries• World Press Photo winners 2024 – in picturesPhotographs documenting the wars in Gaza and Ukraine, migration, family and dementia have...
[12:30, 18th April GMT]
New types of mosquito bed nets could cut malaria risk by up to half, trial finds
Adding another insecticide to the protective netting has proved effective in fight against the disease that killed 600,000 in 2022Two new types of mosquito bed nets have been found to reduce cases of malaria by up to a half, raising hopes of...
[08:00, 18th April GMT]
Lethal heatwave in Sahel worsened by fossil fuel burning, study finds
Deaths from record temperatures in Mali reportedly led to full morgues turning away bodies this monthThe deadly protracted heatwave that filled hospitals and mortuaries in the Sahel region of Africa earlier this month would have been impossible...
[04:01, 18th April GMT]
Europeans care more about elephants than people, says Botswana president
Westerners see elephants as pets, said Mokgweetsi Masisi, whose government threatened to send 30,000 elephants to Germany and the UK to demonstrate their dangersMany Europeans value the lives of elephants more than those of the people who live...
[14:37, 17th April GMT]

Feed fetched 2024-04-19 21:33:32 UTC

Step one: Get the RSS feed

Before you create a script that fetches an XML feed from someone else's site, make sure to check that you have permission to do so. Look for a usage policy in the feeds section of the site, or email the site owner to make sure they're happy with what you intend to do.

For this article, I'm using the latest world news feed from The Guardian, one of the few newspapers I actually trust. The Guardian's XML feeds are in the RSS 2.0 format, but the techniques used in this article could be applied to most XML data formats.

Once you have the URL of the XML feed that you are going to use, you need to have PHP load the contents of the feed into a string variable. Using file_get_contents, you could fetch the XML file like so:

$xml = file_get_contents('https://www.theguardian.com/world/rss');

However, this requires that PHP is set up with allow_url_fopen enabled. Not all web hosts enable this setting, for security reasons. Another way to fetch the XML file into a string is to use the cURL functions (if they are installed on your PHP setup), like this:

// Use cURL to get the RSS feed into a PHP string variable.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,
        'https://www.theguardian.com/world/rss');
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

Either method should give you a string called $xml that contains the contents of the entire XML feed.
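
Neither snippet above checks whether the fetch actually succeeded. As a minimal sketch, the two approaches could be wrapped in one helper that returns false on failure. The function name fetch_feed and the ten-second timeout are assumptions made for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original article): fetch a feed
// into a string, or return false if it cannot be fetched.
function fetch_feed($url, $timeout = 10) {
    if (!is_string($url) || $url === '') {
        return false;
    }
    // Prefer cURL when the extension is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Give up rather than hang if the remote server is slow.
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        // The handle is released when $ch goes out of scope.
        if ($body === false || $status >= 400) {
            return false;
        }
        return $body;
    }
    // Fall back to file_get_contents if URL wrappers are allowed.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return @file_get_contents($url);
    }
    return false;
}
```

With a helper like this, the calling code can test for failure with `$xml === false` and skip the news box entirely rather than trying to parse an empty string.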

Step two: Extract data from the feed

Most XML data files contain a large amount of content. For instance, The Guardian news feed contains channel (feed) title and image information, a couple of dozen news items, each with a title, description, image for the item, link to visitor comments, and the publication date. To keep things simple, I'm only going to extract the channel title and URL, and the title, description and date of each news item.

Starting off with the channel details, we can grab what we want using the value_in function. (You need to include the functions in xml_regex.php, which you can download from my page about extracting XML data with regular expressions.)

// Include the handy XML data extraction functions.
include 'xml_regex.php';
// An RSS 2.0 feed must have a channel title, and it will
// come before the news items. So it's safe to grab the
// first title element and assume that it's the channel
// title.
$channel_title = value_in('title', $xml);
// An RSS 2.0 feed must also have a link element that
// points to the site that the feed came from.
$channel_link = value_in('link', $xml);

Next we want to build an array that contains individual news item elements. This can be done using the element_set function.

// Create an array of item elements from the XML feed.
$news_items = element_set('item', $xml);
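
To make the idea concrete without relying on xml_regex.php, here is a rough sketch using simplified stand-ins. The functions demo_value_in and demo_element_set are assumptions written for this illustration; they are not the real value_in and element_set, whose implementations are not reproduced here. The sample feed is also made up.

```php
<?php
// Simplified stand-ins for illustration only. These are NOT the
// functions from xml_regex.php.
function demo_pattern($name) {
    $n = preg_quote($name, '#');
    // Opening tag (with optional attributes), content, closing tag.
    return '#<' . $n . '(?:\s[^>]*)?>(.*?)</' . $n . '>#s';
}

// Return the content of the first matching element, or false.
function demo_value_in($name, $xml) {
    if (preg_match(demo_pattern($name), $xml, $match) === 1) {
        return $match[1];
    }
    return false;
}

// Return an array holding every matching element in full.
function demo_element_set($name, $xml) {
    preg_match_all(demo_pattern($name), $xml, $matches);
    return $matches[0];
}

// A tiny made-up feed in the RSS 2.0 shape.
$sample = '<rss version="2.0"><channel>'
        . '<title>Example channel</title>'
        . '<link>https://example.com/</link>'
        . '<item><title>First story</title>'
        . '<link>https://example.com/1</link></item>'
        . '<item><title>Second story</title>'
        . '<link>https://example.com/2</link></item>'
        . '</channel></rss>';

// The first title element in the document is the channel title.
$channel_title = demo_value_in('title', $sample);
// Each array entry is one complete item element.
$items = demo_element_set('item', $sample);
// Searching inside a single item finds that item's own title.
$first_item_title = demo_value_in('title', $items[0]);
```

Here $channel_title holds "Example channel", $items holds two strings, and $first_item_title holds "First story", which is why searching each item separately avoids confusing item titles with the channel title.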

Once we've got the array of item elements, we can iterate through the array, one item at a time, and extract just the data that we're interested in from each news item: the URL, title, description and date. Then we can store that data in a new array called $item_array.

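// Start with an empty array so that $item_array exists even
// if the feed contains no items.
$item_array = array();
foreach($news_items as $item) {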
    $title = value_in('title', $item);
    $url = value_in('link', $item);
    $description = value_in('description', $item);
    $timestamp = strtotime(value_in('pubDate', $item));
    $item_array[] = array(
            'title' => $title,
            'url' => $url,
            'description' => $description,
            'timestamp' => $timestamp
    );
}

Now we have $item_array, which is an array that contains news items in the form of associative arrays, so that the title of the first news item should be found in $item_array[0]['title'] and the title of the second news item should be found in $item_array[1]['title'] and so on.
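
The timestamp stored for each item comes from strtotime, which understands the date style used in RSS 2.0 pubDate elements. A small sketch of that conversion, using a made-up pubDate value, looks like this:

```php
<?php
// An example pubDate value in the style used by RSS 2.0 feeds.
$pub_date = 'Thu, 18 Apr 2024 17:53:12 GMT';
$timestamp = strtotime($pub_date);

// In PHP 5.1.0 and later, strtotime returns false when it cannot
// parse its input, as happens with an empty string.
$bad_timestamp = strtotime('');

$formatted = '';
if ($timestamp !== false) {
    // gmdate formats the timestamp in GMT, whatever the server's
    // own time zone is.
    $formatted = gmdate('H:i, jS F T', $timestamp);
}
echo $formatted, "\n"; // 17:53, 18th April GMT
```

Checking for false with a strict comparison matters because a missing or malformed pubDate would otherwise be formatted as a date in 1970.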

Step three: Mark up the data with HTML

With all the data we want stored in one handy array, we can iterate through the array and mark up the data with HTML.

if (!empty($item_array)) {
    // First create a div element as a container for the whole
    // thing. This makes CSS styling easier.
    $html = '<div class="rss_feed_headlines">';
    // Mark up the title of the channel as a hyperlink.
    $html .= '<h2 class="channel_title">'.
            '<a href="'.make_safe($channel_link).'">'.
            make_safe($channel_title).'</a></h2><dl>';
    // Now iterate through the data array, building HTML for
    // each news item.
    $count = 0;
    foreach ($item_array as $item) {
        $html .= '<dt><a href="'.make_safe($item['url']).'">'.
                make_safe($item['title']).'</a></dt>';
        $html .= '<dd>'.make_safe($item['description']);
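        if ($item['timestamp'] !== false) {
            $html .= '<br />' .
                    '<span class="news_date">['.
                    gmdate('H:i, jS F T', $item['timestamp']).
                    ']</span>';
        }
        // Close the description by appending to $html, so that the
        // tag ends up in the right place in the finished markup.
        $html .= '</dd>';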
        // Limit the output to five news items.
        if (++$count == 5) {
            break;
        }
    }
    $html .= '</dl></div>';
    echo $html;
}

The above code creates the HTML for the first five news items, plus the channel title, and wraps it all in a div element with the class "rss_feed_headlines" so that CSS stylesheets can target the news box specifically.

Step four: Consider security implications of external data

I trust the people at The Guardian not to send anything malicious to me via their RSS feeds. But suppose their site was hacked or infected by malware. Then they wouldn't be in control of what their RSS feeds contained. And my script would be fetching a corrupted RSS feed and displaying it on my site, which could allow for cross-site scripting exploits to run riot.

So you need to write code as though the RSS feed is suspect, even if you trust the site that you are getting it from. This means doing what is necessary to keep third-party HTML out of your page. There are many known cross-site scripting techniques, so it can be difficult to know how secure your code is.

To reduce the number of exploits that can survive, my code calls a function named make_safe on everything that is going to be output in HTML.

function make_safe($string) {
    $string = preg_replace('#<!\[CDATA\[.*?\]\]>#s', '', $string);
    $string = strip_tags($string);
    // The next line requires PHP 5.2.3, unfortunately.
    //$string = htmlentities($string, ENT_QUOTES, 'UTF-8', false);
    // Instead, use this set of replacements in older versions of PHP.
    $string = str_replace('<', '&lt;', $string);
    $string = str_replace('>', '&gt;', $string);
    $string = str_replace('(', '&#40;', $string);
    $string = str_replace(')', '&#41;', $string);
    $string = str_replace('"', '&quot;', $string);
    $string = str_replace('\'', '&#039;', $string);
    return $string;
}

This function removes CDATA sections (see the note below), then calls PHP's strip_tags function to remove HTML markup, then tries to convert any remaining dangerous characters into HTML character entities (PHP 5 is better equipped to do this than PHP 4). However, if you see any holes still open to exploitation, tell me about it.
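
One possible hole is the link URL. Escaping characters does not stop a link such as javascript:alert(1) from being placed in an href attribute, because browsers decode character entities inside attribute values. A minimal sketch of an extra check follows; the function name safe_url and the choice to fall back to "#" are assumptions made for this illustration, not part of the original code.

```php
<?php
// Hypothetical helper (not from the original article): only allow
// http and https URLs into an href attribute.
function safe_url($url) {
    $url = trim((string) $url);
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // parse_url gives null when there is no scheme and false when
    // the URL is too malformed to parse.
    if (!is_string($scheme)) {
        return '#';
    }
    $scheme = strtolower($scheme);
    if ($scheme !== 'http' && $scheme !== 'https') {
        return '#';
    }
    // Escape the URL for use inside a quoted HTML attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

$good = safe_url('https://example.com/a?b=1&c=2');
$bad = safe_url('javascript:alert(1)');
```

Listing the schemes that are allowed, rather than the ones that are forbidden, means that unexpected schemes are rejected by default. In this sketch $bad is "#" and $good is the original URL with its ampersand escaped.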

Embedded markup and CDATA sections

Some RSS feeds embed HTML markup in their title and description elements using HTML entities. These are harmless, but they will cause the news item to appear on your page surrounded by HTML tags. There are various ways to remove such unwanted markup, but each feed will require its own adjustments, so I don't offer suggestions here. [See February 2012 update below.]

Another issue is that some RSS feeds use CDATA sections to hide non-text content, such as HTML markup, inside RSS elements. To get rid of these CDATA sections, my code deletes them, contents and all, with the following line:

$string = preg_replace('#<!\[CDATA\[.*?\]\]>#s', '', $string);

The RSS feed for The Guardian doesn't seem to use CDATA sections, but some RSS feeds use CDATA sections to store img elements inside description elements, and some feeds seem to wrap everything in CDATA sections, even plain text. So, as with removing unwanted markup tags, you may need to craft a solution customised to the content of the feed you are using.
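
For a feed that wraps its text in CDATA sections, deleting them would leave nothing to display. One possible customisation, sketched here under the assumption that the wrapped content is worth keeping, is to unwrap each CDATA section and then strip the tags. The function name unwrap_cdata and the sample description are made up for this illustration.

```php
<?php
// Hypothetical alternative to deleting CDATA sections: keep the
// content and discard only the CDATA wrapper.
function unwrap_cdata($string) {
    return preg_replace('#<!\[CDATA\[(.*?)\]\]>#s', '$1', $string);
}

// A made-up description of the kind some feeds produce.
$description = '<![CDATA[<p>Camera <b>review</b> published</p>]]>';

// Deleting the CDATA section leaves an empty string.
$deleted = preg_replace('#<!\[CDATA\[.*?\]\]>#s', '', $description);

// Unwrapping first and then stripping tags keeps the text.
$kept = strip_tags(unwrap_cdata($description));
echo $kept, "\n"; // Camera review published
```

The unwrapped text is still third-party data, so it would need the same escaping as everything else before being output in HTML.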

Other XML feed formats

The code in this guide has been based on parsing an RSS 2.0 feed. Some websites only offer Atom feeds for syndication. The Atom standard uses empty elements for some information, such as its link element. In such a case, you can use the element_attributes function to extract data from XML elements, and then the rest of the code suggested above can remain very similar.
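
As a rough illustration of reading a link from an attribute, the sketch below uses a simplified stand-in. The function demo_attributes is an assumption written for this example; it is not the element_attributes function from xml_regex.php, whose exact behaviour is not reproduced here. The Atom fragment is made up.

```php
<?php
// Simplified stand-in for illustration only. This is NOT the
// element_attributes function from xml_regex.php.
// Returns an associative array of the attributes found on the
// first matching element, or false if the element is not found.
function demo_attributes($name, $xml) {
    $n = preg_quote($name, '#');
    // Capture everything between the element name and the end
    // of the tag, whether or not the element is empty.
    if (preg_match('#<' . $n . '\s+([^>]*?)\s*/?>#s', $xml, $tag) !== 1) {
        return false;
    }
    $attributes = array();
    // Match name="value" or name='value' pairs.
    preg_match_all(
            '#([^\s=]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\')#',
            $tag[1], $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        // Group 3 is only present for single-quoted values.
        $attributes[$pair[1]] = isset($pair[3]) ? $pair[3] : $pair[2];
    }
    return $attributes;
}

// A made-up Atom entry with an empty link element.
$entry = '<entry><title>First story</title>'
        . '<link rel="alternate" href="https://example.com/post/1"/>'
        . '</entry>';

$link = demo_attributes('link', $entry);
$url = ($link !== false && isset($link['href'])) ? $link['href'] : false;
```

Here $url ends up holding the value of the href attribute, which can then be stored in the item array in place of the content of an RSS link element.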

The above methods aren't limited to syndication feeds. You can use the functions to parse data out of XML produced by website APIs. For instance, the Amazon Associates system returns results in XML, and you can use these functions to parse the data you want out of the XML.

Just bear in mind the limitations of using regex for XML parsing, and always consider the security of using data from external sources.

Updates

: For anyone using PHP version 5.2.3 or later, I recommend you take a look at my article about reading XML feeds using SimpleXML, which shows you an easy way of constructing an object tree full of data from an XML document.

: The dpreview.com feed has (for some time now) been placing HTML markup inside its description elements, which is why you might have seen img and p elements in the text of the news feed display at the top of this page recently. To remove these I've modified the make_safe function so that after the first call to strip_tags the resulting string is fed to html_entity_decode to convert entities back to raw characters, and then strip_tags is called again to remove any HTML tags which result. This seems to do the trick, but I've not got a full PHP development platform set up on my machine at the moment, so I've not been able to do any testing. Also, you need at least PHP 4.3.0 to use the html_entity_decode function, so this won't suit some users (though as PHP 4 is no longer in support you really should have moved to PHP 5 by now).
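
The modified function itself is not shown here, so the following is only a reconstruction from the description above. The name make_safe_decoded, the use of htmlspecialchars for the final escaping, and the sample description are all assumptions made for this illustration.

```php
<?php
// Reconstruction for illustration only; this is not the actual
// revised make_safe function.
function make_safe_decoded($string) {
    // Remove CDATA sections, as the original make_safe does.
    $string = preg_replace('#<!\[CDATA\[.*?\]\]>#s', '', $string);
    // First pass: remove any literal HTML tags.
    $string = strip_tags($string);
    // Turn entity-encoded markup back into raw characters...
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    // ...and strip the tags which that decoding has revealed.
    $string = strip_tags($string);
    // Finally escape whatever remains for safe output in HTML.
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}

// A made-up description with markup embedded as entities.
$description = '&lt;p&gt;New camera &lt;b&gt;announced&lt;/b&gt;'
        . ' today&lt;/p&gt;';
$clean = make_safe_decoded($description);
echo $clean, "\n"; // New camera announced today
```

The final escaping step comes after decoding, so any stray angle brackets or quotes that survive both strip_tags calls are still converted to entities before output.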

: The dpreview.com feed which was previously being used is now blocking the cURL requests sent by my web server, and it's not clear why. (Requests from the Firefox web browser and the Akregator feed reader both receive the actual feed XML fine.) Consequently I've stopped requesting the dpreview.com feed and now fetch the world news feed for The Guardian. Because both use the RSS 2.0 format, and because my code extracts little more than the title and description of each news item, it was a simple matter of swapping one RSS feed URL for the other.