HTTP Archive: max-age

April 18, 2011 9:40 pm | 10 Comments

There’s a long list of interesting stats to be added to the HTTP Archive. I’m planning on knocking those off at about one a week. (If someone wants to help that’d be great – contact me. Familiarity with MySQL and Google Charts API is a plus.)

Last week I added a stat looking at the cache lifetime specified for resources – specifically the value set in the Cache-Control: max-age response header. As a reminder, the HTTP Archive is currently analyzing the top ~17K websites worldwide. Across those websites a total of ~1.4M resources are requested. The chart below shows the distribution of max-age values across all those resources.

56% of the resources don’t have a max-age value and 3% have a zero or negative value. That means only 41% of resources are cacheable. In more concrete terms, the average number of resources downloaded per page is 81. 33 of those are cacheable, but the other 48 will likely generate an HTTP request on every page view. Ouch! That’s going to slow things down. Only 24% of resources are cacheable for more than a day. Adding caching headers is an obvious performance win that needs wider adoption.
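As a rough sketch of how a resource might be bucketed for a chart like this – the bucket names and the parsing here are illustrative assumptions, not the HTTP Archive's actual code:

```python
import re

def max_age_seconds(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control
    header string, or return None if no max-age directive is present."""
    if not cache_control:
        return None
    m = re.search(r"max-age\s*=\s*(-?\d+)", cache_control, re.IGNORECASE)
    return int(m.group(1)) if m else None

def classify(cache_control):
    """Bucket a resource by its max-age (hypothetical bucket names)."""
    age = max_age_seconds(cache_control)
    if age is None:
        return "no max-age"
    if age <= 0:
        return "zero or negative"
    if age > 86400:  # more than one day of freshness
        return "more than a day"
    return "a day or less"

print(classify("public, max-age=31536000"))  # -> more than a day
print(classify("no-cache"))                  # -> no max-age
```

Resources in the first two buckets are the ones that likely generate an HTTP request on every page view.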

10 Responses to HTTP Archive: max-age

  1. Hey Steve,

    Remember caches are allowed to apply a heuristic to freshness for many responses [1], and many (browser and intermediary) do, especially if there’s a Last-Modified header present.

    What would be interesting is a measure of how many responses had any of Cache-Control: max-age, Expires *or* Last-Modified (perhaps greater than, say, three days); that’d give a better idea of how much of the Web is really cacheable.

    It’d also be cool to contrast this with responses that have CC: no-cache, CC: no-store or Pragma: no-cache, to filter out those that the server explicitly doesn’t want cached.

  2. Oops, forgot the reference:
    http://tools.ietf.org/html/draft-ietf-httpbis-p6-cache-14#section-2.3.1.1

    BTW, I agree that explicit cacheability is always better. However, people shouldn’t think that those are the only responses that are cached.

    Cheers,

  3. I think instead of the statement “Only 24% of resources are cacheable for more than a day”, it should be “Only 24% of resources are currently being cached for more than a day”. It could just be a byproduct of things not being “fast by default”.

  4. I don’t think IE added the heuristic capability until IE 9 (8 and below don’t cache without an explicit header – well, technically they cache but do a freshness check). Can’t wait until the older versions of IE are gone and we have good market share on the new browsers.

  5. IE has used a heuristic for its caching for a long time. Not sure how well it works and I certainly wouldn’t depend on it. I always set explicit cache headers for exactly what I want.

    (Under the section “Conditional Requests and the WinInet Cache”)
    http://msdn.microsoft.com/en-us/library/bb250442%28VS.85%29.aspx

    Also here (Under “Heuristic Cache Improvements”)
    http://blogs.msdn.com/b/ie/archive/2010/07/14/caching-improvements-in-internet-explorer-9.aspx

  6. @Patrick: It’s true that IE9 introduced heuristic expiration based on the algorithm suggested by RFC2616, but earlier versions had once-per-session heuristics and did something (sorta bizarre) for heuristic caching of images across sessions.

    But yes, we’re all in agreement that explicit freshness directives are the right way to go.

  7. @Mark: In addition to heuristic caching there’s the memory cache – an image may be cached in memory and reused regardless of expiration headers, without even generating a conditional GET request. I agree further analysis of no-cache etc. would be nice. It would be great if you’d enter a bug with specifics on what you’d like to see.

    Great comments!

  8. mmm, it would be nice to know how many of those resources are ads-related. Most ad-serving platforms leave a great number of resources uncached (for obvious reasons), and there’s nothing the administrators or programmers of the site can do to improve them…
     For example, I always run two analyses – one with ads blocked and one with everything – so I can see what is in my hands to fix and what I have to cope with…

  9. Hi Steve

    I imagine you crawl sites without a log-in. In that case, I think you would get a higher proportion of cacheable resources than when a user is logged in, right?

  10. @Ionatan: Sounds interesting.

    @dev: I agree – if we were logged in the number of resources would probably be higher.
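
The heuristic freshness rule that comes up in the comments above – RFC 2616’s suggestion of using 10% of the time since Last-Modified – can be sketched as follows; the one-day cap is an assumed implementation choice here, not something the spec mandates:

```python
from datetime import datetime, timedelta

def heuristic_freshness(date, last_modified, cap=timedelta(days=1)):
    """Heuristic freshness lifetime per RFC 2616's suggestion: 10% of
    the interval between the response Date and Last-Modified headers.
    The cap is an assumed value; real caches vary."""
    lifetime = (date - last_modified) * 0.1
    return min(lifetime, cap)

# A resource last modified 30 days ago would get 3 days of heuristic
# freshness, reduced here by the assumed 1-day cap:
now = datetime(2011, 4, 18)
print(heuristic_freshness(now, now - timedelta(days=30)))  # -> 1 day, 0:00:00
```

This is why responses with a Last-Modified header but no explicit max-age or Expires can still be served from cache by many browsers and intermediaries.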