A custom PHP handler for 404 errors

The 404 problem

Recently I noticed that some discussion forum and wiki sites are using software that automatically converts user-typed URLs into hyperlinks, and accidentally includes punctuation (such as periods, commas, parentheses, brackets, and so on) in the URL. For instance, one popular forum has a posting which consists of something like the following:

blah blah (which I found at this site: http://www.bobulous.org.uk/misc/Replay-Gain.html) blah blah

The problem is that, because the forum software doesn't require the user to mark up where the URL starts and ends, it has to guess for itself. And it's wrongly guessing that the end-parenthesis symbol ')' is part of the URL.

Users who clicked on incorrect links generated by such clumsy software were being taken to a drab "Error 404, File Not Found" page which told them that the address was invalid. And there's a good chance most users simply clicked "Back" or closed the tab and gave up, possibly cursing my site for the problem.

The options

The number of such sites that were generating incorrect links seemed to be growing, so I wanted to help users to get to the page they were hoping for even if the link they'd clicked was slightly incorrect.

One solution, which I quickly dismissed, was to configure Apache to redirect these unlucky users to the correct page. The problem with this method is that a redirect generates a 3xx HTTP status code such as 301 "Moved Permanently", which suggests that the page used to be at the incorrect URL. That is misleading: the URL is simply incorrect, so a 404 "Not Found" code ought to be generated.

You could also use Apache's rewrite module to simply show the correct page to the user, but that's even worse, because then it looks like you've got multiple pages with the same content, which often causes search engines to penalise all of those pages.

A far better solution is to generate a 404 page that tries to deduce, from the user's invalid request, which actual page it was they were trying to reach, and offer the user a valid hyperlink to that page. That way, a 404 status code is generated as expected, and you help the user to quickly find their way to the page they wanted.

The PHP script

I'm using Apache 2 and PHP 5 on my site, so the 404-handler script on this page is written for PHP 5, but the general idea can very likely be adapted to other server-side languages. Note that you do need to be able to configure Apache to tell it to use this script as a 404 error handler (see the section Apache ErrorDocument configuration below).

This script contains a function that will not only try to correct invalid URLs by removing punctuation (and other characters) after the '.html' extension, it will also change '.htm' to '.html'. If that doesn't result in a valid URL then it will also do a case-insensitive check to see whether the request matches an actual file when upper and lower case letters are considered equivalent. Finally, it will also treat hyphens and underscores as equivalent, as these are two characters which get mixed up quite often.

Here's the script, called PHP-404-handler.php and topped with markup for XHTML 1.0 Strict and UTF-8 (which you should change to suit whatever your pages normally use) and a simple style sheet to give the page some colour and layout:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-type" content="application/xhtml+xml; charset=UTF-8" />
<title>Page not found</title><style type="text/css">
html {background-color: #ffdd00}
body {background-color: white; width: 90%; max-width: 55em; min-width: 740px;
margin: 1em auto; padding: 1em 1em 10em 1em}
</style></head><body>
<?php
// Part one: generates the HTML and text that informs the user.
echo "<h1>Error 404: Page not found</h1>";
if (!isset($_SERVER['HTTP_REFERER'])) {
	echo '<p>The address you just typed into your browser is incorrect.</p>';
} else {
	echo '<p>The link you just clicked points to the wrong place.</p>';
}
$wrong_path = $_SERVER['REQUEST_URI'];
$correct_path = find_valid_path_in_request($wrong_path);
if ($correct_path != false) {
	$full_path = $correct_path;
	echo '<p>You may have been looking for this:</p>';
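	// Escape the output, because HTTP_HOST is supplied by the client.
	echo '<p><strong><a href="'.htmlspecialchars($full_path).'">';
	echo 'http://'.htmlspecialchars($_SERVER['HTTP_HOST'].$full_path).'</a></strong></p>';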
	echo '<p>If not, take ';
} else {
	echo '<p>Take ';
}
echo 'a look at the <a href="/sitemap.html">Sitemap</a> and ';
echo 'you may find the page you were looking for.</p>';

// Part two: the function which attempts to correct an invalid path.
function find_valid_path_in_request($wrong_path) {
	// Stage one: use preg_match to make sure that there is the essence
	// of a valid request and capture the components of the request,
	// only up to and including the '.htm' or '.html' extension.
	if(preg_match("#^((/(?:[a-z]+/)*)([a-z0-9][a-z0-9_-]*\.html?))#i", $wrong_path, $match)) {
		// If the request ends with '.htm' (in any mix of upper and
		// lower case, as the pattern is case-insensitive) then
		// change it to '.html'.
		if (strtolower(substr($match[1], -4)) == '.htm') {
			$match[1] .= 'l';
			$match[3] .= 'l';
		}
		// $match[1] => full match, starts: '/', ends: '.html'.
		// $match[2] => directory component, starts & ends: '/'.
		// $match[3] => filename component, ends: '.html'.
	} else {
		// The request did not even resemble a valid request.
		return false;
	}
	// If the simple corrections above have resulted in a
	// valid request (as confirmed by the is_file function) then return
	// the corrected path.
	if(is_file($_SERVER['DOCUMENT_ROOT'].$match[1])) {
		return $match[1];
	}
	// Stage two: the simple corrections have not produced a valid
	// path so now check to see whether the directory component of
	// the request is at least valid. If so, try using simplification
	// to see whether the filename part of the request matches any of
	// the filenames from that directory (after they've been equally
	// simplified).
	$supplied_directory = $_SERVER['DOCUMENT_ROOT'].$match[2];
	if (is_dir($supplied_directory)) {
		// Directory component of request is a valid directory,
		// so now try simplified matching.
		$ls = scandir($supplied_directory);
		// Our simplification involves treating hyphens and
		// underscores alike, so define an array that will
		// map hyphens into underscores. Then use the strtr
		// function to execute this translation.
		$replacement_array = array('-' => '_');
		$simplified_filename = strtr($match[3], $replacement_array);
		foreach ($ls as $item) {
			// For every filename in this directory (which
			// ends with '.html', skip any that don't),
			// simplify the filename and then compare it to
			// our simplified request filename. If it matches
			// then we return the pre-simplified filename
			// as the match we've been looking for.
			if (substr($item, -5) != '.html') continue;
			if (strnatcasecmp(strtr($item, $replacement_array), $simplified_filename) == 0) {
				return $match[2].$item;
			}
		}
	}
	// No match has been found, one way or another.
	return false;
}
?>
</body></html>

Hopefully the numerous comments make it clear what the script is doing, but below follow a few notes.

Part one of the script just outputs a message to the user. It starts off by checking to see whether an HTTP_REFERER value is set. If so, the script assumes the user has arrived here by clicking a link; if not, the script assumes the user typed the incorrect address into their browser themselves. (Note that HTTP_REFERER is a highly unreliable value, so you should never use it for anything important.) Then the incorrect request (taken from the server variable REQUEST_URI) is passed to the function in part two, and if a result other than false is returned, the result is used to produce a hyperlink to a valid page on the site.

Part two, stage one: The regular expression

Part two, the function find_valid_path_in_request, does the real work. Firstly it uses PHP's preg_match function to check whether or not the request fits this regular expression:

#^((/(?:[a-z]+/)*)([a-z0-9][a-z0-9_-]*\.html?))#i

This (case-insensitive) pattern will only match a string that begins with a forward slash, followed by any number of directory names that consist only of (Latin) alphabetic characters, each followed by a forward slash, and then a filename that must begin with a (Latin) alphanumeric character and can then also contain underscores and hyphens. The filename must end with '.html' or even '.htm', but the request string is not required to end there. If the string matches this pattern then preg_match will capture the parts of the pattern that are surrounded by parentheses and store them in an array called $match. If the request string does not fit this pattern, then the function returns false, and the script gives up trying to find the valid path the user wants.
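To see what the pattern captures, here is a minimal sketch that runs the same regular expression against the broken forum link from the start of this article. The variable names $pattern and $request are only chosen for this illustration and are not part of the script.

```php
<?php
// The same pattern used in find_valid_path_in_request.
$pattern = "#^((/(?:[a-z]+/)*)([a-z0-9][a-z0-9_-]*\.html?))#i";

// A request with a stray end-parenthesis, as produced by the forum software.
$request = '/misc/Replay-Gain.html)';

if (preg_match($pattern, $request, $match)) {
	echo $match[1]."\n"; // /misc/Replay-Gain.html
	echo $match[2]."\n"; // /misc/
	echo $match[3]."\n"; // Replay-Gain.html
}
```

Note that the end-parenthesis is not captured at all, because the pattern stops matching at the extension.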

Also, if the matching part ended with '.htm' instead of '.html', an 'l' is appended to the end (because all of the pages on my site end with '.html'). If the pages on your site actually end with '.htm', it's pretty simple to modify the code to convert '.html' to '.htm' instead.
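For a site whose pages end with '.htm', the reverse correction might look something like the sketch below. The function name use_htm_extension is hypothetical; in the script itself the same change would have to be applied to both $match[1] and $match[3].

```php
<?php
// Hypothetical helper: the reverse of the correction in the script,
// for a site whose pages all end with '.htm'.
function use_htm_extension($path) {
	// Compare case-insensitively, because the pattern's 'i' flag
	// means the request may contain '.HTML' or '.Html'.
	if (strtolower(substr($path, -5)) == '.html') {
		// Drop the final character to leave '.htm'.
		return substr($path, 0, -1);
	}
	return $path;
}

echo use_htm_extension('/misc/Replay-Gain.html')."\n"; // /misc/Replay-Gain.htm
echo use_htm_extension('/misc/Replay-Gain.htm')."\n";  // /misc/Replay-Gain.htm
```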

So now $match[1] will contain as much of the string as matches the pattern, beginning with a forward slash and ending with '.html'; $match[2] will contain just the directory path component of the string, beginning and ending with a forward slash; and $match[3] will contain just the filename component of the string, ending with '.html'. Note that we've now corrected the extension to '.html' and stripped away any rubbish that was found after '.html' in the request. In the hope that this has made the request string valid, we use the PHP function is_file to test whether this request (appended to the server variable DOCUMENT_ROOT) is a valid path. If so, the function returns $match[1] as the valid request.

Important: if your site uses a totally different directory and filename pattern, you'll have to modify the regular expression so that it only matches valid requests for pages from your own site. You will need to understand how to create regular expression patterns to do this, but regular expressions are well worth learning if you are often involved in this sort of challenge.

Stage two: Simplified comparison

If the corrected request did not result in a valid path in stage one, we'll now test the directory component of the request using the PHP function is_dir to see whether that is at least a valid directory. If it's not, then we give up and return false. But if the request does at least contain a valid directory, we can now use the PHP function scandir to get the list of files contained within that directory, and see whether our request filename matches an actual file from the directory after both request and actual filenames have been simplified.

The simplification will convert hyphens to underscores, as sometimes users (and even web crawlers) get these characters mixed up. Also, the comparison will be case insensitive, as many requests come in lower-case when some of the filename should actually be upper-case.

The PHP function strtr will be used with an array which simply maps the hyphen '-' to the underscore '_' to perform the translation needed for the simplification. If you see a lot of requests to your site getting other characters mixed up, you could add other entries to the array. For instance, you might have filenames that contain a tilde '~' character that gets confused for a hyphen, so you could add an entry to the array that maps the tilde to the underscore too.
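As a small sketch of that idea, the array below maps both the hyphen and the tilde to the underscore. The tilde entry is the hypothetical addition; the script itself only maps the hyphen.

```php
<?php
// Map both hyphens and tildes to underscores. The tilde entry is an
// assumed extra; add or remove entries to suit your own filenames.
$replacement_array = array('-' => '_', '~' => '_');

echo strtr('Replay-Gain.html', $replacement_array)."\n"; // Replay_Gain.html
echo strtr('Replay~Gain.html', $replacement_array)."\n"; // Replay_Gain.html
```

Bear in mind that the regular expression in stage one only allows hyphens and underscores in the requested filename, so an extra entry like this helps when the file on disk contains the tilde, not when the request does.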

The foreach loop then just examines each file returned by the scandir function. If the filename does not end with '.html' then it's of no interest to us, so we jump to the next filename using the continue statement. But for each file in the directory which does end with '.html', we do a simplified comparison by translating it using strtr and our replacement array, and then use PHP's strnatcasecmp function to do a case-insensitive comparison with our simplified request filename. If a match is found this way, then the actual filename from the directory (not the simplified version) is our best guess at the file the user was hoping to find, and the function uses this actual filename to return its best guess at the path the user actually needs to find the page they are looking for.
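The loop can be tried out away from the web server with a sketch like the following, which takes the directory listing as an argument instead of calling scandir. The function name find_simplified_match and the sample listing are made up for this illustration.

```php
<?php
// Hypothetical helper: the loop from stage two, but taking the directory
// listing as an argument so it can be tested without a real directory.
function find_simplified_match($filenames, $requested) {
	$replacement_array = array('-' => '_');
	$simplified_request = strtr($requested, $replacement_array);
	foreach ($filenames as $item) {
		// Skip anything that does not end with '.html'.
		if (substr($item, -5) != '.html') continue;
		// Case-insensitive comparison of the simplified names.
		if (strnatcasecmp(strtr($item, $replacement_array), $simplified_request) == 0) {
			// Return the real filename, not the simplified one.
			return $item;
		}
	}
	return false;
}

// A made-up listing of the kind scandir would return.
$listing = array('.', '..', 'index.html', 'Replay-Gain.html', 'style.css');
echo find_simplified_match($listing, 'replay_gain.html')."\n"; // Replay-Gain.html
```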

If none of this has found a match, then the function returns false.

Note: because the scandir function returns all of the contents of the specified directory, listing both files and directories, it is possible that this process will match a directory name instead of a filename. This won't happen on my site, because none of my directory names end with '.html', but if you change the regular expression, or if you do have directories whose names end with '.html', you may need to add extra checks to see whether the match you've found is a file or a directory. It ought not to hurt if a match with a directory is found, however; the hyperlink provided to the user would just point to a directory of the site instead. But think about your own site and make sure that it's not possible to do harm this way.
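One possible form of that extra check is sketched below, using is_file to leave directories out of the listing before any comparison is made. The function name list_html_files is hypothetical, and this is only one way of doing it.

```php
<?php
// Hypothetical helper: list only the regular files in a directory whose
// names end with '.html', leaving out any directories.
function list_html_files($directory) {
	$files = array();
	foreach (scandir($directory) as $item) {
		if (substr($item, -5) != '.html') continue;
		// is_file returns false for directories, so a directory
		// called 'archive.html' would be skipped here.
		if (!is_file(rtrim($directory, '/').'/'.$item)) continue;
		$files[] = $item;
	}
	return $files;
}
```

In the script, the same is_file test could instead be added inside the existing foreach loop, next to the check for the '.html' extension.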

Apache ErrorDocument configuration

To tell Apache to use the PHP script to deal with 404 "file not found" errors, you need to edit Apache's configuration files. If you are the server administrator, and you want this script to handle 404 errors by default, you can edit the httpd.conf file, but be warned that I am not a server admin so I've not been able to test this script in that way.

Most people on shared hosting will only be able to specify per-directory directives in a .htaccess file, which is how I've tested this script. Note that not all web hosting services permit their customers to create and modify .htaccess files, and even if you can do so, you may be limited in which directives you are permitted to specify. So check your hosting service documentation or contact technical support to ask what you're allowed to do.

Assuming you are permitted to make changes, look in the root directory of your own webspace. You want to edit the .htaccess file (or create one) and add the following to it:

ErrorDocument 404 /PHP-404-handler.php

(This assumes you've named the script PHP-404-handler.php and stored it in the root directory of your webspace.)

Now Apache will hand all 404 errors to the PHP script for processing.

If for some reason you do not want 404 errors caused by requests for certain files to be processed by this script, Apache has directives that let you specify which filenames or file types to apply this ErrorDocument directive to. For instance, as this script only searches for matching '.html' files, you may not want this script to handle 404 errors generated by requests for JavaScript files, or for CSS files, or the insidious favicon.ico. In which case you might add something like the following to your Apache config:

# CSS and Javascript files should not be requested directly by users,
# so we don't need a full, descriptive error page generated.
<FilesMatch "\.(css|js)$">
    ErrorDocument 404 "File not found."
</FilesMatch>

# Barely generate anything for damned favicon.ico requests.
<Files favicon.ico>
    ErrorDocument 404 "-"
</Files>

# We want other 404 errors dealt with by our custom PHP script.
ErrorDocument 404 /PHP-404-handler.php

Just make sure you don't use the FilesMatch directive to restrict the PHP script to requests whose file extension is '.html'. A request with trailing punctuation (such as one ending '.html)') does not end with '.html', so it would be skipped, and these are exactly the sort of requests that this script is designed to handle.

Avoiding 404 errors

It's unfortunate when external sites direct users to invalid URLs on your site, but there's no excuse for your own site pointing users to invalid URLs. Make sure you check for incorrect, broken and obsolete links on your site using an automated checker tool.

If you're using Linux, you can use the very handy KLinkStatus which lets you crawl your own site hunting for bad links, so you can correct them. This works particularly quickly if you're using it to check a local development copy of your site, plus you can use it to check sites that aren't visible to the public.

There's also the W3C Link Checker, which is web-based so every webmaster ought to be able to use it to check public-facing sites.

By all accounts, search engines punish pages and sites that contain bad links, so it's worth checking regularly that your site is not playing host to an increasingly decrepit list of obsolete URLs.