Caching

Introduction

Caching means keeping a copy of data that you've already received, to avoid having to ask for it again.

Most web browsers perform caching to hold copies of web pages and downloaded files. Information on the web can change over time, so the browser needs help from the web server and the site designer to decide how long a cached page should be stored. This help is provided using HTTP headers, which are exchanged between the server and the browser.

If done properly, caching can significantly reduce the bandwidth used by visitors to your site, especially regular visitors.

Recommendations

Use Static Content Where Possible

Static content consists of resources served directly from files on disk, rather than by scripts. It is very easy for the server to tell the browser when static content was last updated, so the browser only needs to download a new copy if the content has changed.

Set and Check Last Modified Date in Scripts

For dynamic content, the server has no such information (the output can change even if the script file does not), so it always calls the script to generate the content again.

However, your script may be able to tell if the browser already has up-to-date information by comparing the request headers with the local files and databases used to generate the response.

Set the Expires Header for All Objects

An Expires header gives a date after which the content it refers to is regarded as out of date. If you know, or can make a good guess, when your content will change, you can use the Expires header to greatly increase the efficiency of browser caches and the speed of page loading.

Unfortunately, it's not always possible to know in advance when the content will change. Nonetheless, it is regarded as good practice to always set an Expires header.

If the content expires before it is changed, it may be downloaded again, wasting bandwidth. If it expires too late, however, there is the risk that users with cached content will not see any changes to your pages until the cached content expires. You should consider the ramifications of these possibilities when deciding how conservatively to estimate the date your content expires.

How Caching Works

Browsers locate a particular piece of information through the Uniform Resource Locator (URL). Each resource, which could be an HTML page or an image, has a unique URL. If the same information (HTML page or image) can be accessed through several different URLs, each is considered to be a separate resource. This is because browsers have no way to know that different URLs access the same information.

A browser fetches a resource from a server by sending it a request. The request contains a URL to retrieve, and some request HTTP headers which provide additional information about the request. The server responds to a request by sending back the resource data together with some response HTTP headers. Response HTTP headers provide additional information about the response or the information that it contains.

Web browsers regularly need to fetch resources to respond to a user action, for example to load and display a web page. This happens when the user clicks on a link to the resource (e.g. a hyperlink in a web page) or enters the resource's address in the address bar. It also happens when the browser discovers a reference to another resource while displaying a web page and wishes to display that resource in the page. Examples of this include images and style sheets referenced by HTML pages, when the browser supports those features.

Resources come in different kinds or formats, including text, HTML, CSS, JavaScript, JPEG, GIF and PNG, but they are all cached and retrieved in the same way, as a block of binary data (a file).

When the browser needs a resource, it may check its own cache (on a local disk or in memory) to see if it already has a copy and that copy has not expired. There are three possible results:

  • I have a local copy that has not expired: do not fetch anything from the server.
  • I have a local copy with no expiry date: ask the server to send a new copy if it has changed since I downloaded it.
  • I have no local copy, or it has expired: download a new copy from the server.

The first case uses the least bandwidth (none at all). The second case uses a little bandwidth, but much less than the third unless the object has changed.

The first case is also the fastest (no round trips to the server). The second case requires at least two round trips (usually four) and if the server sends the data again, additional round trips and download time are required.
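
The three cases can be sketched as a small decision function. This is only an illustration of the logic described above, not how any particular browser is implemented; the CacheEntry class and the decide function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CacheEntry:
    """A cached copy of one resource (hypothetical structure)."""
    body: bytes
    last_modified: Optional[datetime] = None  # from the Last-Modified header
    expires: Optional[datetime] = None        # from the Expires header


def decide(entry: Optional[CacheEntry], now: datetime) -> str:
    """Return which of the three cases applies to a request made at `now`."""
    if entry is None:
        return "download"      # no local copy
    if entry.expires is None:
        return "revalidate"    # copy with no expiry date: ask if it changed
    if now < entry.expires:
        return "use-cache"     # unexpired copy: no request at all
    return "download"          # the copy has expired
```

Only the "use-cache" result avoids contacting the server altogether, which is why setting an expiry date matters so much.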

What actually happens will depend on whether the browser has the content already, and some additional pieces of information from the response headers:

  • The Last-Modified header, which indicates when the content was last modified
  • The Expires header, which indicates when the content will expire and must be retrieved again

If the browser (or a proxy) caches the response, it will save the modification and expiry dates along with the resource. On subsequent requests for the same resource, it will use the expiry date (if any) to determine whether the cached content has expired. If no expiry date is given, it will send the modification date back to the server in an If-Modified-Since request header, and allow the server to decide whether to send a new version of the content, or a 304 Not Modified response.
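
As a rough sketch of the client side of this exchange, the functions below build the request header from a stored modification date and choose which body to display once the server has answered. The function names are hypothetical, and the stored date is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_modified: Optional[datetime]) -> Dict[str, str]:
    """Build the extra request header for a cached copy with a known date."""
    if last_modified is None:
        return {}  # nothing to compare against: make a plain request
    # HTTP dates are always expressed in GMT.
    return {"If-Modified-Since": format_datetime(last_modified, usegmt=True)}


def body_to_display(status: int, response_body: bytes, cached_body: bytes) -> bytes:
    """A 304 response carries no body, so the cached copy is used instead."""
    return cached_body if status == 304 else response_body
```

For a copy last modified at 09:00 GMT on 1 August 2007, the request would carry the header If-Modified-Since: Wed, 01 Aug 2007 09:00:00 GMT.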

Setting the Last-Modified Header

The server's file system keeps track of the last modified date of all files. For static files, the web server sends this date in the Last-Modified header. When it later receives a request with an If-Modified-Since header, it compares the date in that header with the current date on the file. If they match, the web server knows the client already has the latest version, and sends a 304 Not Modified response.
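
The check a web server performs for static files might look roughly like the sketch below. The serve_static function is a hypothetical name, not part of any real server. As an assumption, it treats any If-Modified-Since date at or after the file's date as up to date, rather than requiring an exact match.

```python
import os
from email.utils import formatdate, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def serve_static(path: str, if_modified_since: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a request for the file at `path`."""
    mtime = int(os.stat(path).st_mtime)  # HTTP dates have one-second resolution
    headers = {"Last-Modified": formatdate(mtime, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since).timestamp()
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header
        if since is not None and mtime <= since:
            return 304, headers, b""  # the client's copy is still current
    with open(path, "rb") as f:
        return 200, headers, f.read()
```

A first request without the header returns status 200 and the file contents. Repeating the request with the Last-Modified value echoed back as If-Modified-Since returns 304 and an empty body.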

As mentioned above, this header can also be set by a script. In this case, the script itself, rather than the server, decides whether to send a new version or a 304 Not Modified response.

For this to work, your script must send Last-Modified headers with each response, which the browser will store in its cache.
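
A script that handles this itself could look like the following sketch, written as a Python WSGI application. The DATA_UPDATED value is a made-up stand-in for the date your script would obtain from the files or database behind the page.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical: when the data behind the page last changed. A real script
# would read this from its database or source files.
DATA_UPDATED = datetime(2007, 7, 31, 9, 0, 0, tzinfo=timezone.utc)


def app(environ, start_response):
    """WSGI application that answers conditional requests itself."""
    headers = [("Last-Modified", format_datetime(DATA_UPDATED, usegmt=True))]
    since_text = environ.get("HTTP_IF_MODIFIED_SINCE")
    if since_text:
        try:
            since = parsedate_to_datetime(since_text)
        except (TypeError, ValueError):
            since = None  # unparseable date: treat the request as unconditional
        # Only compare when the parsed date carries a timezone.
        if since is not None and since.tzinfo is not None and DATA_UPDATED <= since:
            start_response("304 Not Modified", headers)
            return []  # no body: the browser reuses its cached copy
    body = b"<html><body>Generated page</body></html>"
    headers += [("Content-Type", "text/html"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]
```

The Last-Modified header is sent on every response, including the 304, so the browser always has a date to send back next time.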

Setting the Expires Header

Advice from Microsoft states:

It is highly recommended that all Web servers use a scheme for the expiration of all Web pages. It is bad practice for a Web server not to supply expiration information via the HTTP Expires response header for every resource returned to requesting clients. Most browsers and intermediate proxies today respect this expiration information and use it to increase the efficiency of communications over the network...

Pages that are not expected to change should be marked with an expiration date of approximately one year.[1]

To send Expires headers, your server needs to be told what to set them to. By default, most web servers do not send Expires headers, because they cannot guess how long your content will remain valid without your input. You can provide this information in the following ways:

Scripts can set the Expires header themselves.
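
A minimal sketch of how a script might build the header, assuming it knows roughly how long its output stays valid. The expires_header function and its lifetime parameter are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(now: datetime, lifetime: timedelta) -> str:
    """Return an Expires header line for content valid for `lifetime` after `now`.

    `now` must be a timezone-aware UTC datetime, because HTTP dates are in GMT.
    """
    return "Expires: " + format_datetime(now + lifetime, usegmt=True)


if __name__ == "__main__":
    # Content generated now and assumed to stay valid for one day.
    print(expires_header(datetime.now(timezone.utc), timedelta(days=1)))
```

A CGI script would print this line along with its other headers, before the blank line that separates the headers from the page content.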

Applying expiration dates to static content requires configuration of the web server:

  • On Apache servers you need to enable and configure the mod_expires module[2] to provide the server with information about the expiry date that applies to your files.
  • On Microsoft IIS servers, use the IIS Manager to configure expiration dates of static content.

Example of use of Expires header

If you update your website every morning at 9:00am, you can set the Expires header at all times to 9:00am the next morning:

Expires: Wed, 01 Aug 2007 09:00:00 GMT

Beware that if you are late updating the site one morning, for example if you update it at 9:15, visitors between 9:00 and 9:15 will get a copy of the old page which is set to expire the next morning. Even if they revisit your site later in the day, they will see yesterday's version, unless they hit their browser's reload or refresh button.

Because of this, you may want to define an "update window", for example 1 hour between 9:00 and 10:00, and during this time you set the expiry time to the end of the current update window. Outside this time, set the expiry time to the start of the next update window.
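
The update window rule can be expressed as a short function. This sketch assumes a daily window that starts at 9:00 and lasts one hour, as in the example above, and that does not cross midnight. The expiry_time function and its parameter names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def expiry_time(now, window_start=time(9, 0), window_length=timedelta(hours=1)):
    """Return the expiry time for a response generated at `now`."""
    start_today = datetime.combine(now.date(), window_start, tzinfo=now.tzinfo)
    end_today = start_today + window_length
    if now < start_today:
        return start_today                  # before the window: expire at its start
    if now < end_today:
        return end_today                    # inside the window: expire at its end
    return start_today + timedelta(days=1)  # after the window: tomorrow's start
```

With this rule, a visitor who arrives at 9:15 during a late update receives a page that expires at 10:00, not the next morning, so the stale copy lasts at most until the window closes.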

If you don't update a site regularly, take a guess as to how often each content item is updated, and set its expiry time to be that long in the future.

If you know that a particular item will be updated on or after a particular date, set the expiry time to that date, until the date passes.

Using the Cache-Control Header

Preventing all caching

The Cache-Control header makes it very easy to defeat caching. For example, any of the following headers tells the browser not to reuse its cached copy without contacting the server first:

Pragma: no-cache
Cache-Control: no-cache
Cache-Control: max-age=0 (or negative)

Pragma: no-cache is an older version of Cache-Control: no-cache, supported by HTTP/1.0 browsers. If you must send one then send both, but otherwise do not send either.

It is advisable to avoid disabling caching in this way. Doing so assumes that your users always want the latest content, even if it takes longer to load. If your site doesn't change constantly, this is unlikely to be the case!

Some Content Management Systems (CMS) and Wikis automatically generate cache control headers. If you can see a Cache-Control line when viewing a page's source or its response headers, check its value, as the page may not be cached.

Preventing caching of private content

If you don't want a page to be cached by an intermediate proxy cache, for example if it contains personal information, use:

Cache-Control: private

This allows the page to be cached by the browser at least, instead of disabling caching entirely.

On top of that, you should aim to minimize the number of pages that contain such personal information to maximize the caching of your site.

Avoid the Auto-Refresh Header

It is possible to make pages reload themselves automatically using a response header like this:

Refresh: 0

Or a META element like this:

<meta http-equiv="refresh" content="0">

It is advisable to avoid this wherever possible. A page left open in a browser keeps reloading itself, so it is very easy for your users to accidentally waste huge amounts of their bandwidth and yours. If you want to do this, make it an optional feature that users must explicitly turn on.

Notes

A web cache can only contain whole resources, not parts of web pages or images. If you modify even one byte in a page or image, the whole thing must be downloaded again.

You have virtually no control over the caching process unless you can directly control or configure your web server, or you use specific CGI scripts. If you use a web hosting company, rather than a dedicated server, you may find it difficult to implement the advice given in this chapter for static content, but we hope it will still help you to analyse and understand caching problems.

Summary

  • Use static content whenever possible
  • Set Last-Modified and Expires headers for all content
  • Avoid Cache-Control headers that disable caching, and avoid Auto-Refresh headers

Footnotes

[1] http://support.microsoft.com/kb/234067

[2] http://httpd.apache.org/docs/1.3/mod/mod_expires.html