Revving Filenames: don’t use querystring

August 23, 2008 10:51 am | 13 Comments

It’s important to make resources (images, scripts, stylesheets, etc.) cacheable so your pages load faster when users come back. A recent study from Website Optimization shows that the top 1000 home pages average over 50 resources per page! Being able to skip 50 HTTP requests and just read those resources from the browser’s cache dramatically speeds up pages.

This is covered in my book (High Performance Web Sites) and YSlow by Rule 3: Add an Expires Header. It’s easy to make your resources cacheable: just add an Expires HTTP response header with a date in the future. You can do this in your Apache configuration (this assumes mod_expires is enabled) like this:

<FilesMatch "\.(gif|jpg|js|css)$">
  ExpiresActive On
  ExpiresDefault "access plus 10 years"
</FilesMatch>

That part is easy. The hard part is revving your resource filenames when you make a change. If you make mylogo.gif cacheable for 10 years and then publish a modified version of this file to your servers, users with the old version in their cache won’t get the update. The solution is to rev the name, perhaps by including the file’s timestamp or version number in the URL. But which is better: mylogo.1.2.gif or mylogo.gif?v=1.2? To gain the benefit of caching by popular proxies, avoid revving with a querystring and instead rev the filename itself.
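To make the two naming schemes concrete, here is a minimal Python sketch that builds both forms of the URL from a path and a version string. The function names rev_filename and rev_querystring are hypothetical and chosen only for illustration; this is not code from the book.

```python
import posixpath


def rev_filename(path: str, version: str) -> str:
    """Embed the version in the filename itself, just before the extension."""
    root, ext = posixpath.splitext(path)
    return f"{root}.{version}{ext}"


def rev_querystring(path: str, version: str) -> str:
    """Append the version as a querystring (the form this post advises against)."""
    return f"{path}?v={version}"


if __name__ == "__main__":
    print(rev_filename("mylogo.gif", "1.2"))     # filename-based rev
    print(rev_querystring("mylogo.gif", "1.2"))  # querystring-based rev
```

Note that with the filename form, the server has to be able to resolve the versioned name, either by publishing the file under that name or by mapping it back to the original file.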

There’s a section in my book called Revving Filenames. It contains an example of adding a version number to the filename. That’s prompted several emails where people have asked me about tradeoffs around using a querystring versus embedding something in the filename. I wasn’t aware of any performance difference, but in a meeting this week a co-worker, Jacob Hoffman-Andrews, mentioned that Squid, a popular proxy, doesn’t cache resources with a querystring. This hurts performance when multiple users behind a proxy cache request the same file – rather than using the cached version everybody would have to send a request to the origin server.

I tested this by creating two resources, mylogo.1.2.gif and mylogo.gif?v=1.2. Both have a far future Expires date. I configured my browser to go through a Squid proxy. I made one request to mylogo.1.2.gif, cleared my cache (to simulate another user making the request), and fetched mylogo.1.2.gif again. This produces the following HTTP headers:

>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.1.2.gif HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:17:22 GMT
<< Expires: Tue, 21 Aug 2018 00:17:22 GMT
<< X-Cache: HIT from someserver.com
<< X-Cache-Lookup: HIT from someserver.com

Notice that the second response shows a HIT in the X-Cache and X-Cache-Lookup headers. This shows it was served by the Squid proxy. More evidence of this is the fact that the Date and Expires response headers have the same values, even though I made these requests 10 seconds apart. For conclusive evidence, only one hit shows up in the stevesouders.com access log.

Loading mylogo.gif?v=1.2 twice (clearing the cache in between) results in these headers:

>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:34 GMT
<< Expires: Tue, 21 Aug 2018 00:19:34 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

>> GET http://stevesouders.com/mylogo.gif?v=1.2 HTTP/1.1
<< HTTP/1.0 200 OK
<< Date: Sat, 23 Aug 2008 00:19:47 GMT
<< Expires: Tue, 21 Aug 2018 00:19:47 GMT
<< X-Cache: MISS from someserver.com
<< X-Cache-Lookup: MISS from someserver.com

Here it’s clear the second response was not served by the proxy: the caching response headers say MISS, the Date and Expires values change, and tailing the stevesouders.com access log shows two hits.
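The HIT/MISS check used in both tests can be sketched in Python. This is an illustration only, not part of the original test setup: the function name served_from_proxy_cache is hypothetical, and it assumes the proxy adds an X-Cache header in the format shown in the traces above. X-Cache is a non-standard header, so other proxies may not send it or may format it differently.

```python
from typing import Mapping


def served_from_proxy_cache(headers: Mapping[str, str]) -> bool:
    """Return True if the X-Cache response header reports a HIT.

    Assumes the 'HIT from <host>' / 'MISS from <host>' format seen in the
    traces above; a missing header is treated as not served from cache.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-cache", "").strip().upper().startswith("HIT")


if __name__ == "__main__":
    second_response = {
        "Date": "Sat, 23 Aug 2008 00:17:22 GMT",
        "Expires": "Tue, 21 Aug 2018 00:17:22 GMT",
        "X-Cache": "HIT from someserver.com",
    }
    print(served_from_proxy_cache(second_response))
```

The access log on the origin server remains the more conclusive check, since it does not depend on what the proxy chooses to report.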

Proxy administrators can change the configuration to support caching resources with a querystring, when the caching headers indicate that is appropriate. But the default configuration is what web developers should expect to encounter most frequently. Another interesting note about these tests: notice how the proxy downgrades the responses to HTTP/1.0. This is going to alter browser behavior in terms of the number of connections that are opened. When I’m doing performance analysis I make sure to avoid being connected through a proxy.

It’s important to leverage the caching provided by Squid and other proxies. Users get faster load times if resources are served from their proxy, avoiding the trip all the way back to the origin server. How many users does this affect? You can estimate this by tracking how many requests to your servers contain the Via header. The percentage is typically 5-25%, varying with your audience and geographic region. For those users who are behind proxies, help foster a faster experience by avoiding a querystring for cacheable resources.
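As a rough sketch of that estimate, the following Python function computes the fraction of requests that carry a Via header. It assumes you can already obtain each request's headers as a mapping (for example, from a custom log format or application middleware); the function name via_share and the sample data are hypothetical.

```python
from typing import Iterable, Mapping


def via_share(requests: Iterable[Mapping[str, str]]) -> float:
    """Return the fraction of requests that include a Via header."""
    total = 0
    with_via = 0
    for headers in requests:
        total += 1
        # Header names are case-insensitive, so compare in lowercase.
        if any(name.lower() == "via" for name in headers):
            with_via += 1
    return with_via / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"Host": "stevesouders.com", "Via": "1.0 someserver.com (squid)"},
        {"Host": "stevesouders.com"},
        {"Host": "stevesouders.com"},
        {"Host": "stevesouders.com"},
    ]
    print(f"{via_share(sample):.0%} of requests came through a proxy")
```

This only counts proxies that announce themselves with a Via header, so it should be read as a lower bound.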

13 Responses to Revving Filenames: don’t use querystring

  1. Kevin Hale, from Wufoo, has a piece on automatically versioning files. Although I haven’t implemented his code on any of my sites, it looks promising, and works with the non-querystring filename version promoted here.

  2. Steve, HTTP/1.0 will also cause a TCP three-way handshake to occur for each and every request. It would be interesting to benchmark a page that has many responses (think JSON or XML via XHR) fronted by a reverse proxy that can switch between HTTP/1.0 and HTTP/1.1 (lighty 1.5), or to compare Squid versus nginx.

    It gets more interesting if you add a traffic modeler to emulate a high-latency connection between client and proxy, so that the time taken by the three-way handshake starts to dominate.

  3. Great comments. The example from Wufoo is awesome.

    Yusuf – I’m assuming HTTP/1.0 is still using persistent connections (Connection: Keep-Alive). How does that affect the three-way handshake?

  4. This is very useful information. Thanks!

  5. Steve, from memory, HTTP/1.0 keep-alives and HTTP/1.1 persistent connections are slightly different:

    HTTP/1.0 kept-alive connections must be broken after every dynamic object.
    HTTP/1.0 doesn’t support pipelining.
    HTTP/1.0 keep-alives are not allowed when one speaks through a proxy.

    My experience with Squid has been primarily as a reverse proxy, and even when I turned on

    client_persistent_connections on
    server_persistent_connections on

    Squid did not reuse the connections in all cases.
    Basically, Squid can’t persist the client-side connection unless it gets a Content-Length: header from the backend.

  6. FYI: Squid is capable of caching URLs with query strings. It just wasn’t the default behavior, because of these lines in the default configuration:

    hierarchy_stoplist cgi-bin ?
    acl QUERY urlpath_regex cgi-bin \?
    cache deny QUERY

    Squid actually changed its default policy for caching dynamic URLs with the 2.7 release:

    http://wiki.squid-cache.org/ConfigExamples/DynamicContent

    One big caveat, though: their dev team told me this will break if you’re using multiple Squid nodes with sibling relationships. Requests coming from cache peers have an additional header which will hang the request.

  7. Great test! It ultimately shows that the more conservative approach, filename versioning, is the safer method (at least for now).

  8. One safe assumption to make in planning for cacheability is that the client side and all proxies in between are totally borked, and won’t get fixed any time soon.

  9. I ran into a problem with versioning by the file’s last-modified timestamp.

    Turns out using file timestamps as the version can be a bad idea depending on your deployment process and/or version control system.

    Git doesn’t store timestamps at all, and Subversion needs to be configured to preserve them.

    Or your deployment process may change the file timestamps when copying assets, even if the file has changed from the last deployment.

    As a result of this, I’ve ended up using md5 sums of the files themselves for the version.

    This is consistent and only changes if the file contents change.

    There’s more information on this solution (and an accompanying Asset Fingerprint Rails plugin) at http://blog.eliotsykes.com/2010/05/06/why-rails-asset-caching-is-broken/ .

  10. Correction to my previous comment, forgot a “not”:

    “Or your deployment process may change the file timestamps when copying assets, even if the file has *not* changed from the last deployment.”

  11. The advantage of putting the version in the query string is that you only have to change the HTML. The web server does not have to know that you use versioning for 1.jpg, and requests for 1.jpg?version=1 and 1.jpg?version=2 will automatically reach the same file (1.jpg). Managing the version in the filename complicates matters.

    Leonid Fainberg
    Co-Founder & CTO
    AcceloWeb

  12. I was going to post something, but noticed others had already posted the same comment. Most proxy servers, including my favorite, Apache Traffic Server, can easily be configured to cache URLs with query parameters.

    However, there might be intermediaries between the UA and the origin that do not allow this, e.g. transparent ISP proxies, corporate firewall proxies, etc. So the advice is still good: you are better off versioning on the filename.

    A good example of this done “right” is the Flickr web site.

  13. HTTP/1.0 + “Connection: Keep-Alive” basically behaves the same way as HTTP/1.1 in terms of persistent connections. Further, pipelining is currently of little real-world use.

    However, a huge problem with dropping down to HTTP/1.0 is the loss of support for HTTP compression.
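Following up on comment 9, the idea of deriving the version from an MD5 sum of the file contents can be sketched in Python as below. This is an illustration under stated assumptions, not the Asset Fingerprint plugin mentioned in that comment: the function name fingerprint_filename is hypothetical, and truncating the digest to 8 hex characters is an arbitrary choice made here to keep names short.

```python
import hashlib
import posixpath


def fingerprint_filename(path: str, content: bytes, length: int = 8) -> str:
    """Embed a prefix of the content's MD5 hex digest in the filename.

    The name changes only when the file contents change, regardless of
    timestamps or how the file was deployed.
    """
    digest = hashlib.md5(content).hexdigest()[:length]
    root, ext = posixpath.splitext(path)
    return f"{root}.{digest}{ext}"


if __name__ == "__main__":
    # Made-up byte strings standing in for two versions of an image file.
    old = fingerprint_filename("mylogo.gif", b"old image bytes")
    new = fingerprint_filename("mylogo.gif", b"new image bytes")
    print(old)
    print(new)
```

MD5 is used here only as a change detector for cache-busting, not for security.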