It’s important to make resources (images, scripts, stylesheets, etc.) cacheable so your pages load faster when users come back. A recent study from Website Optimization shows that the top 1000 home pages average over 50 resources per page! Being able to skip 50 HTTP requests and just read those resources from the browser’s cache dramatically speeds up pages.
This is covered in my book (High Performance Web Sites) and YSlow by Rule 3: Add an Expires Header. It’s easy to make your resources cacheable – just add an Expires HTTP response header with a date in the future. You can do this in your Apache configuration like this:
<FilesMatch "\.(gif|jpg|js|css)$">
    ExpiresActive On
    ExpiresDefault "access plus 10 years"
</FilesMatch>
That part is easy. The hard part is revving your resource filenames when you make a change. If you make mylogo.gif cacheable for 10 years and then publish a modified version of this file to your servers, users with the old version in their cache won’t get the update. The solution is to rev the name, perhaps by including the file’s timestamp or version number in the URL. But which is better: mylogo.1.2.gif or mylogo.gif?v=1.2? To gain the benefit of caching by popular proxies, avoid revving with a querystring and instead rev the filename itself.
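As a rough illustration of revving the filename itself rather than the querystring, here is a minimal Python sketch. The function name rev_filename and the choice to place the version just before the extension are assumptions made for illustration; only the mylogo.1.2.gif naming pattern comes from the discussion above.

```python
import posixpath


def rev_filename(path: str, version: str) -> str:
    """Insert a version string before the file extension of a URL path.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # posixpath is used because URL paths always use forward slashes.
    root, ext = posixpath.splitext(path)
    return f"{root}.{version}{ext}"


print(rev_filename("mylogo.gif", "1.2"))  # mylogo.1.2.gif
```

The renamed file still has to be published under the new name, and every page that references it has to be updated to match. This sketch leaves both of those steps out.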
There’s a section in my book called Revving Filenames. It contains an example of adding a version number to the filename. That’s prompted several emails where people have asked me about tradeoffs around using a querystring versus embedding something in the filename. I wasn’t aware of any performance difference, but in a meeting this week a co-worker, Jacob Hoffman-Andrews, mentioned that Squid, a popular proxy, doesn’t cache resources with a querystring. This hurts performance when multiple users behind a caching proxy request the same file: rather than getting the cached version, everybody has to send a request to the origin server.
I tested this by creating two resources, mylogo.1.2.gif and mylogo.gif?v=1.2. Both have a far future Expires date. I configured my browser to go through a Squid proxy. I made one request to mylogo.1.2.gif, cleared my cache (to simulate another user making the request), and fetched mylogo.1.2.gif again. This produces the following HTTP headers:
>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: HIT from someserver.com
<< X-Cache-Lookup: HIT from someserver.com
Notice that the second response shows a HIT in the X-Cache and X-Cache-Lookup headers. This shows it was served by the Squid proxy. More evidence of this is the fact that the Date and Expires response headers have the same values, even though I made these requests 10 seconds apart. For conclusive evidence, only one hit shows up in the stevesouders.com access log.
Loading mylogo.gif?v=1.2 twice (clearing the cache in between) results in these headers:
>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:34 GMT
<< Expires: Tue, 21 Aug 2018 00:19:34 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:47 GMT
<< Expires: Tue, 21 Aug 2018 00:19:47 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com
Here it’s clear the second response was not served by the proxy: the caching response headers say MISS, the Date and Expires values change, and tailing the stevesouders.com access log shows two hits.
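The by-eye check of the X-Cache header in the two tests above can be sketched in a few lines of Python. This is only a sketch under assumptions: the function name served_from_proxy_cache is hypothetical, the headers are passed in as a plain mapping, and the header value is assumed to look like the Squid output shown above ("HIT from ..." or "MISS from ..."). Other proxies may report cache status differently or not at all.

```python
from typing import Mapping


def served_from_proxy_cache(headers: Mapping[str, str]) -> bool:
    """Return True if the X-Cache response header reports a HIT.

    Hypothetical helper; assumes Squid-style values such as
    "HIT from someserver.com".
    """
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("x-cache", "").strip().upper().startswith("HIT")


# Header values copied from the second response in each test above.
filename_rev = {"X-Cache": "HIT from someserver.com"}
querystring_rev = {"X-Cache": "MISS from someserver.com"}
print(served_from_proxy_cache(filename_rev))     # True
print(served_from_proxy_cache(querystring_rev))  # False
```

A missing X-Cache header is treated as "not served by the proxy" here, which is a simplification. The origin server's access log remains the conclusive check.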
Proxy administrators can change the configuration to support caching resources with a querystring, when the caching headers indicate that is appropriate. But the default configuration is what web developers should expect to encounter most frequently. Another interesting note about these tests: notice how the proxy downgrades the responses to HTTP/1.0. This is going to alter browser behavior in terms of the number of connections that are opened. When I’m doing performance analysis I make sure to avoid being connected through a proxy.
It’s important to leverage the caching provided by Squid and other proxies. Users get faster load times if resources are served from their proxy, avoiding the trip all the way back to the origin server. How many users does this affect? You can estimate this by tracking how many requests to your servers contain the Via header. The percentage is 5-25% – it varies depending on your audience and geographic region. For those users who are behind proxies, help foster a faster experience by avoiding a querystring for cacheable resources.
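One way to produce that estimate is to count the share of logged requests that carry a Via header. The Python sketch below assumes the request headers have already been captured as one mapping per request; how they get logged depends on your server and is not covered here. The function name via_share and the sample data are made up for illustration.

```python
from typing import Iterable, Mapping


def via_share(requests: Iterable[Mapping[str, str]]) -> float:
    """Return the fraction of requests whose headers include Via.

    Hypothetical helper; each item is one request's headers.
    """
    total = 0
    proxied = 0
    for headers in requests:
        total += 1
        # Header names are case-insensitive.
        if any(name.lower() == "via" for name in headers):
            proxied += 1
    return proxied / total if total else 0.0


# Made-up sample: one of four requests came through a proxy.
sample = [
    {"Host": "stevesouders.com", "Via": "1.0 someserver.com"},
    {"Host": "stevesouders.com"},
    {"Host": "stevesouders.com"},
    {"Host": "stevesouders.com"},
]
print(f"{via_share(sample):.0%}")  # 25%
```

Proxies that do not add a Via header are invisible to this count, so the result is best read as a lower bound.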
Charlie Park | 23-Aug-08 at 3:11 pm | Permalink
Kevin Hale, from Wufoo, has a piece on automatically versioning files. Although I haven’t implemented his code on any of my sites, it looks promising, and works with the non-querystring filename version promoted here.
Yusuf Goolamabbas | 24-Aug-08 at 7:31 am | Permalink
Steve, HTTP 1.0 will also cause a TCP 3-way handshake to occur for each and every request. It would be interesting to benchmark a page which has many responses (think JSON or XML via XHR) fronted by a reverse proxy that can switch between HTTP 1.0 and HTTP 1.1 (lighty 1.5), or to compare Squid versus nginx.
It gets more interesting if you put in a traffic modeller to emulate a high-latency connection between client and proxy, so the time taken by the 3-way handshake starts to dominate.
Steve Souders | 24-Aug-08 at 8:04 am | Permalink
Great comments. The example from Wufoo is awesome.
Yusuf - I’m assuming HTTP/1.0 is still using persistent connections (Connection: keep-alive). How does that affect the 3-way handshake?
Moushumi Kabir | 24-Aug-08 at 8:52 pm | Permalink
This is very useful information. Thanks!
Yusuf Goolamabbas | 25-Aug-08 at 5:00 am | Permalink
Steve, from memory, HTTP/1.0 keepalives and HTTP/1.1 persistent connections are slightly different:
HTTP/1.0 kept-alive connections must be broken after every dynamic object
HTTP/1.0 doesn’t support pipelining
HTTP/1.0 keepalives are not allowed when one speaks through a proxy.
My experience with Squid has been primarily as a reverse proxy, and even when I set
client_persistent_connections on
server_persistent_connections on
Squid did not reuse the connections in all cases
Basically, Squid can’t persist the client-side connection unless it gets a Content-Length: header from the backend.
murray | 27-Aug-08 at 11:20 am | Permalink
FYI: Squid is capable of caching URLs with query strings; it just wasn’t the default behavior because of these configuration lines:
hierarchy_stoplist cgi-bin ?
acl QUERY urlpath_regex cgi-bin \?
cache deny QUERY
Squid actually changed its default policy on caching dynamic URLs with the 2.7 release:
http://wiki.squid-cache.org/ConfigExamples/DynamicContent
One big caveat, though: their dev team told me this will break if you’re using multiple Squid nodes with sibling relationships. Requests coming from cache peers have an additional header which will hang the request.
Łukasz Korzybski | 08-Dec-08 at 5:31 am | Permalink
Great test! It ultimately shows that the more conservative approach, file name versioning, is the safer method (at least for now).