URL Design and Management

Contents

  1. Introduction
  2. Readability and Length
  3. Hierarchical Structure
  4. Uniqueness and Mirrors
  5. Permanence and Managing Changes
  6. Dated Material and Archives
  7. Query Strings and Search Results
  8. Further Reading

Introduction

URLs are the addresses of your pages. They define the structure of your site, and express relationships and associations between resources. They're also the means by which other sites will link to your content, and by which you will encourage others to visit specific pages within your site. You should therefore choose your URLs wisely. They should be: short, readable, descriptive, memorable, and permanent. Search engines, other websites linking to your site, and your visitors will all appreciate the care you take. This essay explores some of the best-practice techniques for assigning and managing URLs for a website.

Readability and Length

Make sure your URLs are simple and readable. Use natural-language words wherever possible, and avoid uncommon acronyms and unnatural abbreviations. At the same time, keep things as short as practicable: avoid unnecessary duplication and the inclusion of information which is implied elsewhere in the URL. This applies not only to the file part of the URL, but also to the domain name.

Even in today's digital world, there are plenty of occasions where you will need to tell someone a URL verbally, or where it will appear in print and have to be typed into a computer by hand. Some popular email clients break long URLs across several lines, making them less usable at the receiving end. For these reasons, try to keep URLs below about 70 characters. Anything longer is hard to remember or write down anyway.

For consistency, ease of transcription, and general aesthetics, it's usually a good idea to ensure that all your URLs are entirely lower case. Some specialised cases may require exceptions to this; the main ones that spring to mind are encyclopedia entries and PDF file names, which might both benefit from the use of Camel Case. Also avoid joining words with underscores, hyphens, or dots. Generally the inclusion of such characters is unnecessary; they add to the length of the URL and make spelling it out verbally more difficult. Underscores in particular can be lost in the underlining when a URL is displayed as a link on the web.
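
As a rough illustration, a few of the guidelines above can be checked mechanically. The following Python sketch is only one possible approach; the function name url_style_warnings and the exact wording of the warnings are invented for this example, and the 70-character limit is the approximate figure suggested above rather than a hard rule.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 70  # the approximate limit suggested in the text


def url_style_warnings(url: str) -> list[str]:
    """Return style warnings for a URL; an empty list means no complaints."""
    warnings = []
    if len(url) > MAX_URL_LENGTH:
        warnings.append(f"longer than {MAX_URL_LENGTH} characters")
    # Only the path is checked for case; host names are case-insensitive.
    path = urlsplit(url).path
    if path != path.lower():
        warnings.append("path contains upper-case letters")
    if "_" in path:
        warnings.append("path contains underscores")
    return warnings


if __name__ == "__main__":
    for candidate in ("http://www.example.org/about/contact",
                      "http://www.example.org/About_Us"):
        print(candidate, url_style_warnings(candidate))
```

A check like this could be run over a site map before publication, with exceptions (such as Camel Case encyclopedia entries) whitelisted by hand.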

Real-life examples of duplication and unnecessary length include:

In particular, don't accept the crazy, often query-string-based, URLs offered by some badly thought-out Content Management Systems, such as:

If you're naming files yourself, then it should be easy to fix things. If the URLs are tied to a particular software solution, however, it may only be possible to fix them fully by changing to some saner software. Even so, you may be able to improve things somewhat by using URL rewriting with Apache's mod_rewrite.
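
The idea behind URL rewriting is that the public sees a clean path, while the server quietly translates it into whatever the back-end software expects. The Python sketch below illustrates that translation only; it is not mod_rewrite syntax, and the back-end address /index.php with its section and page parameters is a made-up example of a query-string-based CMS URL.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Public URLs of the form /section/page, lower-case letters and digits only.
CLEAN_PATH = re.compile(r"^/(?P<section>[a-z0-9]+)/(?P<page>[a-z0-9]+)$")


def to_backend_url(public_path: str) -> Optional[str]:
    """Translate a clean public path into the (hypothetical) CMS URL.

    Returns None when the path does not match the expected pattern.
    """
    match = CLEAN_PATH.match(public_path)
    if match is None:
        return None
    return "/index.php?" + urlencode(match.groupdict())


if __name__ == "__main__":
    print(to_backend_url("/products/widgets"))
```

The translation happens internally, so visitors, bookmarks, and search engines only ever see the clean form.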

Hierarchical Structure

If your site has a definite hierarchical structure (and most should, unless they're very small or have an encyclopedia-like arrangement of content) then the URLs you use should reflect this. In fact, the directory levels implied by the use of forward slashes should correspond exactly to the arrangement of sections and sub-sections in your navigational structure. This reinforces the structure and relationships in the mind of the user, and adds to the apparent trustworthiness of your site.

Each directory level should correspond to a section or sub-section of the site. Each level should have a main index page, but its URL should be displayed without the final index.html or similar. (Worse still would be to repeat the section name, but constructions such as /about/about.htm are sadly all too common.) An important corollary of the above is that the main page should be found at the root URL. Not only must the root URL work, but there should be no automatic redirection to /index.php or /html/home.aspx etc. Not only does such redirection look unprofessional, but people will inevitably bookmark the resultant URL, and when you change the back end of your site there's a risk that those bookmarks will stop working.
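
If index file names have already leaked into your links, one way to tidy up is to normalise them back to the bare directory form. The following Python sketch shows the idea; the function name canonical_path and the list of index file names are assumptions for illustration, and should be adjusted to whatever your own server uses.

```python
import posixpath

# File names treated as "the index page of this directory" (assumed list).
INDEX_NAMES = {"index.html", "index.htm", "index.php"}


def canonical_path(path: str) -> str:
    """Strip a trailing index file name, leaving the bare directory URL."""
    head, tail = posixpath.split(path)
    if tail in INDEX_NAMES:
        # Keep exactly one trailing slash on the directory.
        return head if head.endswith("/") else head + "/"
    return path


if __name__ == "__main__":
    for p in ("/about/index.html", "/index.php", "/about/team"):
        print(p, "->", canonical_path(p))
```

The same function could be used when generating internal links, so that the index file name never appears in the first place.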

Uniqueness and Mirrors

You should avoid running mirror sites with the same content existing at two or more different addresses. There are two main reasons for this. First, search engines will typically penalise duplicate content, and/or split your ranking across the multiple addresses. Secondly, there is the 'trust factor' for users: even if different URLs return the same material from the same file, how are users to know this? They may then be left wondering which version is authoritative, and whether one of them will stop working or cease to be updated.

However, this doesn't mean that you can't allow people to access the same content via two or more different addresses. If you have more than one domain name serving the same content, you should decide which is to be the main one (i.e. used in advertising and as your email domain), and set up 'HTTP 301' server-side redirects from all other domains. This way, any existing rankings should be transferred to the main domain, and the links to the other domain(s) will continue to work. It's also worth setting up redirects from the non-www version of your site (assuming you use the www in the main URL; if not, then it's even more important that the www version redirects to the non-www version, as some people will invariably prepend www whatever you tell them).
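
In practice such redirects are normally configured in the web server itself, but the decision being made is simple enough to sketch in a few lines of Python. The host names below are placeholders, and the function name redirect_for is invented for this example.

```python
from typing import Optional, Tuple

CANONICAL_HOST = "www.example.org"  # the one "main" address
# Other names that serve the same content (assumed for illustration).
ALIAS_HOSTS = {"example.org", "example.com", "www.example.com"}


def redirect_for(host: str, path: str) -> Optional[Tuple[int, str]]:
    """Return (301, new_url) for requests to an alias host, else None.

    The path is carried over unchanged so that deep links keep working.
    A port number in the host is not handled in this sketch.
    """
    if host.lower() in ALIAS_HOSTS:
        return 301, f"http://{CANONICAL_HOST}{path}"
    return None


if __name__ == "__main__":
    print(redirect_for("example.org", "/about/"))
    print(redirect_for("www.example.org", "/about/"))
```

Note that the redirect preserves the requested path rather than sending everyone to the home page.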

Permanence and Managing Changes

People bookmark pages, and other sites will link to your content. If these links are to be any use, they need to continue to function months or possibly even years into the future. Therefore, once allocated, a URL should be a permanent address for a particular page or resource. Link rot is a major problem on the web today; please do your best not to add to it.

Think carefully when assigning URLs to resources, so that you can avoid having to make changes in the future. Avoid anything in the URL which reflects the back-end processes used to run the site, as such processes are likely to change in the future. This includes things like .html or .aspx file extensions (you should be able to set your server up so that such extensions are implied when the bare filename is requested), having everything in a /html/ directory, and accessing scripts via an explicit /cgi-bin/ directory.

Where changes are unavoidable, automatic server-side redirects should be set up to take those still using old links to the new address. Such redirects should always be used in preference to other techniques such as JavaScript or 'meta-refresh' redirects, or even a simple 'this page has moved, click here to continue' message. These other methods are not as robust, will not (in general) be picked up by automatic link-checking software, and are not as efficient at transferring any built-up search engine ranking to the new pages.

If material needs to be removed from a site, it's not enough to just unlink the page from the rest of your site; the file needs to be removed. If you're really keen you'll ensure that the correct 'HTTP 410 (gone)' error is returned to user agents requesting such pages, rather than the usual 'HTTP 404 (not found)'.
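
The distinction between moved, removed, and never-existed pages can be summarised in a small lookup. The Python sketch below is a simplified model rather than a real server; the paths in the three tables are invented examples, and the function name respond is an assumption.

```python
from typing import Optional, Tuple

# Old address -> new address (answered with a permanent redirect).
MOVED = {"/html/about.htm": "/about/"}
# Pages deliberately removed from the site.
GONE = {"/offers/summer-sale"}
# Pages that currently exist.
LIVE = {"/", "/about/"}


def respond(path: str) -> Tuple[int, Optional[str]]:
    """Return (status_code, redirect_target) for a requested path."""
    if path in MOVED:
        return 301, MOVED[path]   # moved permanently
    if path in GONE:
        return 410, None          # gone: removed on purpose
    if path in LIVE:
        return 200, None          # OK
    return 404, None              # not found: never existed, as far as we know


if __name__ == "__main__":
    for p in ("/html/about.htm", "/offers/summer-sale", "/about/", "/nope"):
        print(p, respond(p))
```

Keeping the MOVED and GONE tables under version control alongside the site gives you a record of every address you have ever published.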

Dated Material and Archives

With the popularity of weblogs in the last few years, dated material left as posted to form an archive is now very common. For ease of sorting and language independence, it's good practice to base any 'dated' URLs on the international standard date/time format ISO 8601. For dates, this uses the sequence YYYY-MM-DD with a four-digit year, two-digit month, and two-digit day, separated by optional hyphens for clarity. Putting the most significant part first ensures correct chronological order when sorted alphabetically, and the four-digit year signals that this ordering is in use.

To put a date stamp on an individual file, naming it YYYY-MM-DD-foo (where "foo" is the subject/topic) is probably the best approach to maximise readability. When managing a large archive of material, it's usually best to set up a hierarchy of directories corresponding to years, and possibly months and even days, depending on the density of the material. For transparency, you should also usually include a readable text snippet to identify the subject or topic of the file. This leads to URLs of the form http://www.example.org/archive/2006/08/subject-text. The directory levels /archive/2006/08/ and /archive/2006/ should have suitable index pages allowing users to access the months and pages under that particular level.

If there are only a few items per month, and the exact day and month aren't terribly important, then it would be best just to sort by year and omit the month and day levels. Conversely, if there are several items per day, then adding an extra day level would make sense.
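
The naming schemes described above are easy to generate automatically. The following Python sketch shows one way of doing so; the function names slugify, dated_filename, and archive_path are invented for this example, and the rule used to turn a title into a text snippet (lower-case words joined by hyphens) is an assumption based on the subject-text example above.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Reduce a title to lower-case words joined by hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def dated_filename(published: date, title: str) -> str:
    """Name for a single date-stamped file: YYYY-MM-DD-foo."""
    return f"{published.isoformat()}-{slugify(title)}"


def archive_path(published: date, title: str, with_day: bool = False) -> str:
    """Path within a year/month (and optionally day) archive hierarchy."""
    parts = ["archive", f"{published.year:04d}", f"{published.month:02d}"]
    if with_day:
        parts.append(f"{published.day:02d}")
    parts.append(slugify(title))
    return "/" + "/".join(parts)


if __name__ == "__main__":
    d = date(2006, 8, 3)
    print(dated_filename(d, "Subject Text"))
    print(archive_path(d, "Subject Text"))
    print(archive_path(d, "Subject Text", with_day=True))
```

Because the year, month, and day are zero-padded and ordered from most to least significant, a plain alphabetical sort of these names is also a chronological one.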

However, you should avoid using dated forms for things where people will expect the latest version, and older versions are of little interest. For example, if you have an annually revised Safety Handbook, the main links should go to an undated URL. Provide an archive of older versions if you wish, but the main link, which people are likely to bookmark, should be set up to point to the latest version.

Query Strings and Search Results

Only use query strings for real queries; they have no role in selecting pages as part of a poor Content Management System (see above). As with URLs in general, query strings, when used, should be concise and readable where possible.

When using query strings as part of a search process, help users bookmark results pages they may wish to refer back to or send to others by exposing the URL used to generate the results whenever possible. This means using GETs (rather than POSTs), and also avoiding the use of session IDs or cookies to store internal states and previous preferences which influence the results.

In general, the POST method should only be used when the server state will be modified and repeating the request doesn't make sense. Cookies should only be used to store preferences relating to the data input mechanism or the general formatting / display of results. Ensure that the content of the result is wholly determined by the URL.
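
To make this concrete, the Python sketch below builds a bookmarkable search URL in which everything that affects the results appears in the query string. The /search path and the parameter names q and page are assumptions chosen for the example, not a standard.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(terms: str, page: int = 1) -> str:
    """Build a GET URL that fully describes a page of search results."""
    query = {"q": terms}
    if page > 1:
        # Leave out defaults to keep the URL short.
        query["page"] = page
    return "/search?" + urlencode(query)


def parse_search_url(url: str) -> dict:
    """Recover the search parameters from the URL alone."""
    params = parse_qs(urlsplit(url).query)
    return {"q": params["q"][0], "page": int(params.get("page", ["1"])[0])}


if __name__ == "__main__":
    url = search_url("url design", page=2)
    print(url)
    print(parse_search_url(url))
```

Since the parameters can be recovered from the URL alone, the same results page can be bookmarked or sent to someone else without relying on a session or cookie.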

Further Reading