URLs are the addresses of your pages. They define the structure of your site, and express relationships and associations between resources. They're also the means by which other sites will link to your content, and by which you will encourage others to visit specific pages within your site. You should therefore choose your URLs wisely. They should be: short, readable, descriptive, memorable, and permanent. Search engines, other websites linking to your site, and your visitors will all appreciate the care you take. This essay explores some of the best-practice techniques for assigning and managing URLs for a website.
Make sure your URLs are simple and readable. Use normal language words wherever possible, and avoid uncommon acronyms and unnatural abbreviations. At the same time, keep things as short as practicable: avoid unnecessary duplication and the inclusion of information that is implied elsewhere in the URL. This applies not only to the file part of the URL, but also to the domain name.
Even in today's digital world, there are plenty of occasions where you will need to tell someone a URL verbally, or it will appear in print and need to be entered into a computer manually by a user. Some popular email clients break long URLs across several lines, making them less usable at the receiving end. For this reason, try to keep URLs below about 70 characters; anything longer is hard to remember or write down anyway.
For consistency, ease of transcription, and general aesthetics, it's usually a good idea to ensure that all your URLs are entirely lower case. Some specialised cases may require exceptions to this — the main ones that spring to mind are encyclopedia entries and PDF file names, which might both benefit from the use of CamelCase. Also avoid joining words with underscores, hyphens, or dots. Generally the inclusion of such characters is unnecessary; they add to the length of the URL and make spelling it out verbally more difficult. In the case of underscores, they can also be lost in underlines when displayed on the web.
Real-life examples of duplication and unnecessary length include:
http://www.maths.university.ac.uk/research/appliedmathematics/appliednonlinear/phdprojects/
http://www.example.org/about_us/what_we_do.aspx
In particular, don't accept the crazy, often query-string-based, URLs offered by some badly thought-out Content Management Systems, such as:
http://www.example.com/index.php?option=com_content&task=view&id=25&Itemid=1
http://www.county.gov.uk/countycc/usp.nsf/pws/Council+Government+and+Democracy+-+Council+Publications
If you're naming files yourself, then it should be easy to fix things. If the URLs are tied to a particular software solution, however, it may only be possible to fix them properly by changing to some saner software. Even then, you may be able to improve things somewhat by using URL rewriting with Apache's mod_rewrite.
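Where the software can't be replaced, a rewrite rule can at least present cleaner addresses to visitors. As a minimal sketch, assuming Apache with mod_rewrite enabled and the query-string parameters from the CMS example above (the articles path segment is a hypothetical choice):

```
# Hypothetical sketch: serve /articles/25 by internally rewriting it
# to the CMS's query-string form, so visitors only see the clean URL.
RewriteEngine On
RewriteRule ^articles/([0-9]+)/?$ /index.php?option=com_content&task=view&id=$1 [L,QSA]
```

The [QSA] flag preserves any genuine query string the visitor supplies, and [L] stops further rules from firing.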
If your site has a definite hierarchical structure (and most should, unless they're very small or have encyclopedia-like arrangement of content) then the URLs you use should reflect this. In fact, the directory levels implied by the use of forward slashes should correspond exactly to the arrangement of sections and sub-sections in your navigational structure. This reinforces the structure and relationships in the mind of the user, and adds to the apparent trustworthiness of your site.
Each directory level should correspond to a section or sub-section of the site. Each level should have a main index page, but this should be displayed without the final index.html or similar. (Worse still would be to repeat the section name, but constructions such as /about/about.htm are sadly all too common.) An important corollary of the above is that the main page should be found at the root URL. Not only must the root URL work, but there should be no auto-redirection to /index.php or /html/home.aspx etc. Not only does such redirection look unprofessional, but people will inevitably bookmark the resultant URL, and when you change the back end of your site there's a risk that those bookmarks will stop working.
You should avoid running mirror sites with the same content existing at two or more different addresses. There are two main reasons for this. First, search engines will typically penalise duplicate content, and/or split your ranking across the multiple addresses. Secondly, there is the 'trust factor' for users — even if different URLs return the same material from the same file, how are users to know this? They may be left wondering which is authoritative, and whether one version will stop working or cease to be updated.
However, this doesn't mean that you can't allow people to access the same content via two or more different addresses. If you have more than one domain name serving the same content, you should decide which is to be the main one (i.e. used in advertising and as your email domain), and set up 'HTTP 301' server-side redirects from all other domains. This way, any existing rankings should be transferred to the main domain, and the links to the other domain(s) will continue to work. It's also worth setting up redirects for the non-www version of your site (assuming you use the www in the main URL — if not then it's even more important that the www version redirects to the non-www version, as some people will invariably prepend www whatever you tell them).
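Under Apache, for instance, such a redirect might be configured along these lines (a sketch only; example.org stands in for your own domain):

```
# Send all requests for the bare domain to the www form,
# using a permanent (301) redirect so rankings follow.
<VirtualHost *:80>
    ServerName example.org
    Redirect permanent / http://www.example.org/
</VirtualHost>
```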
People bookmark pages, and other sites will link to your content. If these links are to be any use, they need to continue to function months or possibly even years into the future. Therefore, once allocated, a URL should be a permanent address for a particular page or resource. Link rot is a major problem on the web today; please do your best not to add to it.
Think carefully when assigning URLs to resources, so that you can avoid having to make changes in the future. Avoid anything in the URL which reflects the back-end processes used to run the site, as such processes are likely to change in the future. This includes things like .html or .aspx file extensions (you should be able to set your server up so that such extensions are implied when the bare filename is requested), having everything in a /html/ directory, and accessing scripts via an explicit /cgi-bin/ directory.
Where changes are unavoidable, then automatic server-side redirects should be set up to take those still using old links to the new address. Such redirects should always be used in preference to other techniques such as JavaScript or 'meta-refresh' redirects, or even a simple 'this page has moved, click here to continue' message. These other methods are not as robust, will not (in general) be picked up by automatic link-checking software, and are not as efficient at transferring any built-up search engine ranks to the new pages.
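With Apache, a single directive in the server or .htaccess configuration suffices; this is a sketch with hypothetical paths:

```
# Permanently (301) redirect the old address to the new one;
# both paths here are made-up examples.
Redirect permanent /old-section/page /new-section/page
```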
If material needs to be removed from a site, it's not enough to just unlink the page from the rest of your site; the file needs to be removed. If you're really keen you'll ensure that the correct 'HTTP 410 (gone)' error is returned to user agents requesting such pages, rather than the usual 'HTTP 404 (not found)'.
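Under Apache this can again be done with one directive (the path is a hypothetical example):

```
# Return HTTP 410 (gone) for a page that has been deliberately removed,
# rather than the default 404 (not found).
Redirect gone /retired-page
```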
With the popularity of weblogs in the last few years, dated material left as posted to form an archive is now very common. For ease of sorting and language independence, it's good practice to base any 'dated' URLs on the international standard date/time format ISO 8601. For dates, this uses the sequence YYYY-MM-DD with a four-digit year, two-digit month, and two-digit day, separated by optional hyphens for clarity. The ordering with the most significant part first ensures correct ordering when sorted alphabetically, and the four-digit year indicates the use of this ordering.
To put a date stamp on an individual file, naming it YYYY-MM-DD-foo (where "foo" is the subject/topic) is probably the best approach to maximise readability. When managing a large archive of material, it's usually best to set up a hierarchy of directories corresponding to years, and possibly months and even days, depending on the posting frequency of the material. For transparency, you should also usually include a readable text snippet to identify the subject or topic of the file. This leads to URLs of the form http://www.example.org/archive/2006/08/subject-text. The directory levels /archive/2006/08/ and /archive/2006/ should have suitable index pages allowing users to access the months and pages under that particular level.
If there are only a few items per month, and the exact day and month aren't terribly important, then it would be best just to sort by year and omit the month and day levels. Conversely, if there are several items per day, then adding an extra day level would make sense.
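The archive scheme above can be sketched in Python; the function name and the /archive/ prefix are illustrative choices, not part of any particular system:

```python
from datetime import date

def archive_path(d: date, slug: str) -> str:
    """Build an archive path of the form /archive/YYYY/MM/slug.

    Most-significant-first (ISO 8601) ordering means the paths
    sort correctly when listed alphabetically, and the zero-padded
    month keeps single-digit months in order too.
    """
    return f"/archive/{d.year:04d}/{d.month:02d}/{slug}"
```

For example, archive_path(date(2006, 8, 15), "subject-text") gives /archive/2006/08/subject-text, matching the form shown above.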
However, you should avoid using dated forms for things where people will expect the latest version, and older versions are of little interest. For example, if you have an annually revised Safety Handbook, the main links should go to an undated URL. Provide an archive of older versions if you wish, but the main link, which people are likely to bookmark, should be set up to point to the latest version.
Only use query strings for real queries; they have no role in selecting pages as part of a poor Content Management System (see above). As with URLs in general, query strings, when used, should be concise and readable where possible.
When using query strings as part of a search process, help users bookmark results pages they may wish to refer back to or send to others by exposing the URL used to generate the results whenever possible. This means using GET requests (rather than POST), and also avoiding the use of session IDs or cookies to store internal states and previous preferences which influence the results.
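For instance, a search form that exposes its query in the URL might look like this (the /search path and q parameter name are hypothetical):

```
<!-- method="get" puts the query in the URL, e.g. /search?q=widgets,
     so the results page can be bookmarked and shared. -->
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
```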
In general, the POST method should only be used when the server state will be modified and repeating the request doesn't make sense. Cookies should only be used to store preferences relating to data input mechanisms or the general formatting and display of results. Ensure that the content of the result is wholly determined by the URL.