Cache them if you can

March 22, 2012 10:41 pm | 24 Comments

“The fastest HTTP request is the one not made.”

I always smile when I hear a web performance speaker say this. I forget who said it first, but I’ve heard it numerous times at conferences and meetups over the past few years. It’s true! Caching is critical for making web pages faster. I’ve written extensively about caching.

Things are getting better – but not quickly enough. The chart below from the HTTP Archive shows that the percentage of resources that are cacheable has increased 10% during the past year (from 42% to 46%). Over that same time the number of requests per page has increased 12% and total transfer size has increased 24% (chart).

Perhaps it’s hard to make progress on caching because the problem doesn’t belong to a single group – responsibility spans website owners, third party content providers, and browser developers. One thing is certain – we have to do a better job when it comes to caching. 

I’ve gathered some compelling statistics over the past few weeks that illuminate problems with caching and point to some next steps. Here are the highlights:

  • 55% of resources don’t specify a max-age value
  • 46% of the resources without any max-age remained unchanged over a 2 week period
  • some of the most popular resources on the Web are only cacheable for an hour or two
  • 40-60% of daily users to your site don’t have your resources in their cache
  • 30% of users have a full cache
  • for users with a full cache, the median time to fill their cache is 4 hours of active browsing

Read on to understand the full story.

My kingdom for a max-age header

Many of the caching articles I’ve written address issues such as size & space limitations, bugs with less common HTTP headers, and outdated purging logic. These are critical areas to focus on. But the basic function of caching hinges on websites specifying caching headers for their resources. This is typically done using max-age in the Cache-Control response header. This example specifies that a response can be read from cache for 1 year:

Cache-Control: max-age=31536000
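
As a rough illustration of how that header value is read, here is a minimal Python sketch that pulls max-age out of a Cache-Control value. The name max_age_seconds and the constants are hypothetical (not from any library), and the parsing is deliberately simplified: it splits on commas and ignores quoted strings.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * SECONDS_PER_HOUR  # 31536000, the value in the example header


def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if it is absent or malformed.

    Hypothetical helper for illustration; a simplified parser, not a full
    implementation of the Cache-Control grammar.
    """
    if not cache_control:
        return None
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip()
            return int(value) if value.isdigit() else None
    return None
```

Under these assumptions, max_age_seconds("max-age=31536000") returns 31536000 seconds, which is 8760 hours, or one 365-day year. A header with no max-age directive returns None.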

Since you’re reading this blog post you probably already use max-age, but the following chart from the HTTP Archive shows that 55% of resources don’t specify a max-age value. This translates to 45 of the average website’s 81 resources needing an HTTP request even for repeat visits.

Missing max-age != dynamic

Why do 55% of resources have no caching information? Having looked at caching headers across thousands of websites my first guess is lack of awareness – many website owners simply don’t know about the benefits of caching. An alternative explanation might be that many resources are dynamic (JSON, ads, beacons, etc.) and shouldn’t be cached. Which is the bigger cause – lack of awareness or dynamic resources? Luckily we can quantify the dynamicness of these uncacheable resources using data from the HTTP Archive.

The HTTP Archive analyzes the world’s top ~50K web pages on the 1st and 15th of the month and records the HTTP headers for every resource. Using this history it’s possible to go back in time and quantify how many of today’s resources without any max-age value were identical in previous crawls. The data for the chart above (showing 55% of resources with no max-age) was gathered on Feb 15 2012. The chart below shows the percentage of those uncacheable resources that were identical in the previous crawl on Feb 1 2012. We can go back even further and see how many were identical in both the Feb 1 2012 and the Jan 15 2012 crawls. (The HTTP Archive doesn’t save response bodies so the determination of “identical” is based on the resource having the exact same URL, Last-Modified, ETag, and Content-Length.)
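
The "identical" test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Resource record and the fraction_unchanged function are hypothetical names, not the HTTP Archive's actual schema or code, and a resource with no Last-Modified or ETag is matched here on URL and Content-Length alone, which the post does not specify.

```python
from typing import NamedTuple, Optional


class Resource(NamedTuple):
    """Hypothetical record of the four fields used to decide 'identical'."""
    url: str
    last_modified: Optional[str]
    etag: Optional[str]
    content_length: Optional[int]


def fraction_unchanged(current, previous):
    """Return the fraction of `current` resources found unchanged in `previous`.

    Two resources count as identical only when URL, Last-Modified, ETag,
    and Content-Length all match exactly (tuple equality).
    """
    if not current:
        return 0.0
    previous_set = set(previous)
    unchanged = sum(1 for resource in current if resource in previous_set)
    return unchanged / len(current)
```

Going back more than one crawl amounts to keeping only the resources that also appear in each earlier crawl before computing the fraction.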

46% of the resources without any max-age remained unchanged over a 2 week period. This works out to 21 resources per page that could have been read from cache without any HTTP request but weren’t. Over a 1 month period 38% are unchanged – 17 resources per page.
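
The per-page counts follow from simple arithmetic on the percentages quoted in this post. Here is a small Python check, assuming the average of 81 resources per page and rounding to whole resources (share_of is a hypothetical helper name):

```python
AVG_RESOURCES_PER_PAGE = 81  # average resources per page quoted earlier in the post


def share_of(total, percent):
    """Return percent% of total, rounded to a whole number of resources."""
    return round(total * percent / 100)


# 55% of resources have no max-age value.
no_max_age = share_of(AVG_RESOURCES_PER_PAGE, 55)

# Of those, 46% were unchanged over 2 weeks and 38% over 1 month.
unchanged_2_weeks = share_of(no_max_age, 46)
unchanged_1_month = share_of(no_max_age, 38)
```

This reproduces the 45, 21, and 17 resources per page cited in the text.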

This is a significant missed opportunity. Here are some popular websites and the number of resources that were unchanged for 1 month but did not specify max-age:

Recalling that “the fastest HTTP request is the one not made”, this is a lot of unnecessary HTTP traffic. I can’t prove it, but I strongly believe this is not intentional – it’s just a lack of awareness. The chart below reinforces this belief – it shows the percentage of resources (both cacheable and uncacheable) that remain unchanged starting from Feb 15 2012 and going back for one year.

The percentage of resources that are unchanged is nearly the same when looking at all resources as it is for only uncacheable resources: 44% vs. 46% going back 2 weeks and 35% vs. 38% going back 1 month. Given this similarity in “dynamicness” it’s likely that the absence of max-age has nothing to do with the resources themselves and is instead caused by website owners overlooking this best practice.

3rd party content

If a website owner doesn’t make their resources cacheable, they’re just hurting themselves (and their users). But if a 3rd party content provider doesn’t have good caching behavior it impacts all the websites that embed that content. This is both bad and good. It’s bad in that one uncacheable 3rd party resource can impact multiple sites. The good part is that shifting 3rd party content to adopt good caching practices also has a magnified effect.

So how are we doing when it comes to caching 3rd party content? Below is a list of the top 30 most-used resources according to the HTTP Archive. These are the resources that were used the most across the world’s top 50K web pages. The max-age value (in hours) is also shown.

  1. http://www.google-analytics.com/ga.js (2 hours)
  2. http://ssl.gstatic.com/s2/oz/images/stars/po/Publisher/sprite2.png (8760 hours)
  3. http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (336 hours)
  4. http://pagead2.googlesyndication.com/pagead/render_ads.js (336 hours)
  5. http://pagead2.googlesyndication.com/pagead/show_ads.js (1 hour)
  6. https://apis.google.com/_/apps-static/_/js/gapi/gcm_ppb,googleapis_client,plusone/[...] (720 hours)
  7. http://pagead2.googlesyndication.com/pagead/osd.js (24 hours)
  8. http://pagead2.googlesyndication.com/pagead/expansion_embed.js (24 hours)
  9. https://apis.google.com/js/plusone.js (1 hour)
  10. http://googleads.g.doubleclick.net/pagead/drt/s?safe=on (1 hour)
  11. http://static.ak.fbcdn.net/rsrc.php/v1/y7/r/ql9vukDCc4R.png (3825 hours)
  12. http://connect.facebook.net/rsrc.php/v1/yQ/r/f3KaqM7xIBg.swf (164 hours)
  13. https://ssl.gstatic.com/s2/oz/images/stars/po/Publisher/sprite2.png (8760 hours)
  14. https://apis.google.com/_/apps-static/_/js/gapi/googleapis_client,iframes_styles[...] (720 hours)
  15. http://static.ak.fbcdn.net/rsrc.php/v1/yv/r/ZSM9MGjuEiO.js (8742 hours)
  16. http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/qP7Pvs6bhpP.js (8699 hours)
  17. https://plusone.google.com/_/apps-static/_/ss/plusone/[...] (720 hours)
  18. http://b.scorecardresearch.com/beacon.js (336 hours)
  19. http://static.ak.fbcdn.net/rsrc.php/v1/yx/r/lP_Rtwh3P-S.css (8710 hours)
  20. http://static.ak.fbcdn.net/rsrc.php/v1/yA/r/TSn6F7aukNQ.js (8760 hours)
  21. http://static.ak.fbcdn.net/rsrc.php/v1/yk/r/Wm4bpxemaRU.js (8702 hours)
  22. http://static.ak.fbcdn.net/rsrc.php/v1/yZ/r/TtnIy6IhDUq.js (8699 hours)
  23. http://static.ak.fbcdn.net/rsrc.php/v1/yy/r/0wf7ewMoKC2.css (8699 hours)
  24. http://static.ak.fbcdn.net/rsrc.php/v1/yO/r/H0ip1JFN_jB.js (8760 hours)
  25. http://platform.twitter.com/widgets/hub.1329256447.html (87659 hours)
  26. http://static.ak.fbcdn.net/rsrc.php/v1/yv/r/T9SYP2crSuG.png (8699 hours)
  27. http://platform.twitter.com/widgets.js (1 hour)
  28. https://plusone.google.com/_/apps-static/_/js/plusone/[...] (720 hours)
  29. http://pagead2.googlesyndication.com/pagead/js/graphics.js (24 hours)
  30. http://s0.2mdn.net/879366/flashwrite_1_2.js (720 hours)

There are some interesting patterns.

  • simple URLs have short cache times – Some resources have very short cache times, e.g., ga.js (1), show_ads.js (5), and twitter.com/widgets.js (27). Most of the URLs for these resources are very simple (no querystring or URL “fingerprints”) because these resource URLs are part of the snippet that website owners paste into their page. These “bootstrap” resources are given short cache times because there’s no way for the resource URL to be changed if there’s an emergency fix – instead the cached resource has to expire in order for the emergency update to be retrieved.
  • long URLs have long cache times – Many 3rd party “bootstrap” scripts dynamically load other resources. These code-generated URLs are typically long and complicated because they contain some unique fingerprinting, e.g., http://pagead2.googlesyndication.com/pagead/js/r20120208/r20110914/show_ads_impl.js (3) and http://platform.twitter.com/widgets/hub.1329256447.html (25). If there’s an emergency change to one of these resources, the fingerprint in the bootstrap script can be modified so that a new URL is requested. Therefore, these fingerprinted resources can have long cache times because there’s no need to rev them in the case of an emergency fix.
  • where’s Facebook’s like button? – Facebook’s like.php and likebox.php are also hugely popular but aren’t in this list because the URL contains a querystring that differs across every website. Those resources have an even more aggressive expiration policy compared to other bootstrap resources – they use no-cache, no-store, must-revalidate. Once the like[box] bootstrap resource is loaded, it loads the other required resources: lP_Rtwh3P-S.css (19), TSn6F7aukNQ.js (20), etc. Those resources have long URLs and long cache times because they’re generated by code, as explained in the previous bullet.
  • short caching resources are often async – The fact that bootstrap scripts have short cache times is good for getting emergency updates, but is bad for performance because they generate many Conditional GET requests on subsequent page views. We all know that scripts block pages from loading, so these Conditional GET requests can have a significant impact on the user experience. Luckily, some 3rd party content providers are aware of this and offer async snippets for loading these bootstrap scripts, mitigating the impact of their short cache times. This is true for ga.js (1), plusone.js (9), twitter.com/widgets.js (27), and Facebook’s like[box].php.
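
The fingerprinting idea in the bullets above can be sketched as follows. This is a minimal Python illustration, not how Google, Twitter, or Facebook actually build their URLs: it assumes the fingerprint is a truncated SHA-256 hash of the file contents, and fingerprinted_url is a hypothetical helper name.

```python
import hashlib
import posixpath


def fingerprinted_url(url_path, content, digest_chars=10):
    """Insert a short content hash before the file extension of url_path.

    `content` is the resource body as bytes. Any change to the bytes yields
    a different URL, so the old cached copy is simply never requested again
    and the resource can safely carry a long max-age.
    """
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    stem, extension = posixpath.splitext(url_path)
    return f"{stem}.{digest}{extension}"
```

A bootstrap script with a short cache time would reference the URL returned here. Shipping an emergency fix then means changing the file, which changes the fingerprint, rather than waiting for a long-lived cached copy to expire.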

These extremely popular 3rd party snippets are in pretty good shape, but as we get out of the top widgets we quickly find that these good caching patterns degrade. In addition, more 3rd party providers need to support async snippets.

Cache sizes are too small

In January 2007 Tenni Theurer and I ran an experiment at Yahoo! to estimate how many users had a primed cache. The methodology was to embed a transparent 1×1 image in the page with an expiration date in the past. If users had the expired image in their cache the browser would issue a Conditional GET request and receive a 304 response (primed cache). Otherwise they’d get a 200 response (empty cache). I was surprised to see that 40-60% of daily users to the site didn’t have the site’s resources in their cache and 20% of page views were done without the site’s resources in the cache.
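
The classification step of that methodology can be sketched as below. This is an assumption about how the server side might decide between the two cases, not the code used in the Yahoo! experiment: a request carrying a conditional header (If-Modified-Since or If-None-Match) is treated as coming from a primed cache and answered with 304, anything else as an empty cache and answered with 200. Both function names are hypothetical, and the sketch counts requests rather than unique users.

```python
def classify_beacon_request(headers):
    """Return (status_code, label) for a request for the expired 1x1 image.

    `headers` is a mapping of request header names to values. Header names
    are compared case-insensitively.
    """
    names = {name.lower() for name in headers}
    if "if-modified-since" in names or "if-none-match" in names:
        return 304, "primed"
    return 200, "empty"


def percent_empty(requests):
    """Return the percentage of requests that arrived with an empty cache."""
    if not requests:
        return 0.0
    empty = sum(
        1 for headers in requests
        if classify_beacon_request(headers)[1] == "empty"
    )
    return 100.0 * empty / len(requests)
```

Grouping the same tally by user or by page view would give the two kinds of percentages reported above.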

Numerous factors contribute to this high rate of unique users missing the site’s resources in their cache, but I believe the primary reason is small cache sizes. Browsers have increased the size of their caches since this experiment was run, but not enough. It’s hard to test browser cache size. Blaze.io’s article Understanding Mobile Cache Sizes shows results from their testing. Here are the max cache sizes I found for browsers on my MacBook Air. (Some browsers set the cache size based on available disk space, so let me mention that my drive is 250 GB and has 54 GB available.) I did some testing and searching to find max cache sizes for my mobile devices and IE.

  • Chrome: 320 MB
  • Internet Explorer 9: 250 MB
  • Firefox 11: 830 MB (shown in about:cache)
  • Opera 11: 20 MB (shown in Preferences | Advanced | History)
  • iPhone 4, iOS 5.1: 30-35 MB (based on testing)
  • Galaxy Nexus: 18 MB (based on testing)

I’m surprised that Firefox 11 has such a large cache size – that’s close to what I want. All the others are (way) too small. 18-35 MB on my mobile devices?! I have seven movies on my iPhone – I’d gladly trade Iron Man 2 (1.82 GB) for more cache space.

Caching in the real world

In order to justify increasing browser cache sizes we need some statistics on how many real users overflow their cache. This topic came up at last month’s Velocity Summit where we had representatives from Chrome, Internet Explorer, Firefox, Opera, and Silk. (Safari was invited but didn’t show up.) Will Chan from the Chrome team (working on SPDY) followed up with this post on Chromium cache metrics from Windows Chrome. These are the most informative real user cache statistics I’ve ever seen. I strongly encourage you to read his article.

Some of the takeaways include:

  • ~30% of users have a full cache (capped at 320 MB)
  • for users with a full cache, the median time to fill their cache is 4 hours of active browsing (20 hours of clock time)
  • 7% of users clear their cache at least once per week
  • 19% of users experience “fatal cache corruption” at least once per week thus clearing their cache

The last stat about cache corruption is interesting – I appreciate the honesty. The IE 9 team experienced something similar. In IE 7&8 the cache was capped at 50 MB based on tests showing increasing the cache size didn’t improve the cache hit rate. They revisited this surprising result in IE9 and found that larger cache sizes actually did improve the cache hit rate:

In IE9, we took a much closer look at our cache behaviors to better understand our surprising finding that larger caches were rarely improving our hit rate. We found a number of functional problems related to what IE treats as cacheable and how the cache cleanup algorithm works. After fixing these issues, we found larger cache sizes were again resulting in better hit rates, and as a result, we’ve changed our default cache size algorithm to provide a larger default cache.

Will mentions that Chrome’s 320 MB cap should be revisited. 30% seems like a low percentage for full caches, but could be accounted for by users that aren’t very active and active users that only visit a small number of websites (for example, just Gmail and Facebook). If possible I’d like to see these full cache statistics correlated with activity. It’s likely that users who account for the biggest percentage of web visits are more likely to have a full cache, and thus experience slower page load times.

Next steps

First, much of the data for this post came from the HTTP Archive, so I’d like to thank our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, dynaTrace Software, and Torbit.

The data presented here suggest a few areas to focus on:

Website owners need to increase their use of a Cache-Control max-age, and the max-age times need to be longer. 38% of resources were unchanged over a 1 month period, and yet only 11% of resources have a max-age value that high. Most resources, even if they change, can be refreshed by including a fingerprint in the URL specified in the HTML document. Only bootstrap scripts from 3rd parties should have short cache times (hours). Truly dynamic responses (JSON, etc.) should specify must-revalidate. A year from now rather than seeing 55% of resources without any max-age value we should see 55% cacheable for a month or more.
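
One way to picture that policy is a small lookup from resource kind to header value. The sketch below is illustrative only: the kind labels, the cache_control_for name, the two-hour bootstrap lifetime (borrowed from the ga.js example), and pairing must-revalidate with max-age=0 for dynamic responses are all assumptions, not recommendations taken verbatim from the data.

```python
ONE_YEAR_SECONDS = 365 * 24 * 3600  # 31536000
TWO_HOURS_SECONDS = 2 * 3600        # assumed lifetime for bootstrap scripts


def cache_control_for(kind):
    """Return a Cache-Control value for a hypothetical resource kind.

    "fingerprinted": URL changes whenever the content changes, so cache long.
    "bootstrap":     fixed 3rd party snippet URL, so cache for hours only.
    "dynamic":       JSON and similar responses that must be revalidated.
    """
    if kind == "fingerprinted":
        return f"max-age={ONE_YEAR_SECONDS}"
    if kind == "bootstrap":
        return f"max-age={TWO_HOURS_SECONDS}"
    if kind == "dynamic":
        return "max-age=0, must-revalidate"
    raise ValueError(f"unknown resource kind: {kind!r}")
```

The point of the split is that only resources whose URL cannot change need a short lifetime; everything that can be fingerprinted can be cached for as long as a year.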

3rd party content providers need wider adoption of the caching and async behavior shown by the top Google, Twitter, and Facebook snippets.

Browser developers stand to bring the biggest improvements to caching. Increasing cache sizes is a likely win, especially for mobile devices. Data correlating cache sizes and user activity is needed. More intelligence around purging algorithms, such as IE 9’s prioritization based on mime type, will help when the cache fills up. More focus on personalization (what are the sites I visit most often?) would also create a faster user experience when users go to their favorite websites.

It’s great that the number of resources with caching headers grew 10% over the last year, but that just isn’t enough progress. We should really expect to double the number of resources that can be read from cache over the coming year. Just think about all those HTTP requests that can be avoided!

 

24 Responses to Cache them if you can

  1. Great post, I still don’t really buy that ga.js needs a 2 hour expire, at Stack Overflow we see about 22% of our hits arrive with an unprimed cache (extrapolated from our web logs / cdn logs)

    Cache corruption is an interesting case I never thought about, the link from chrome seems broken. I wonder if cache seek time plays a part in the small cache sizes (if there are too many files perhaps the cache becomes inefficient due to its design), and if browser crashes are causing some of these corrupt caches.

    I feel like window.performance is really lacking when it comes to analysis of these issues, I really wish browsers gave us more information to log and report to central servers … figuring out if an end user has an item cached or not is in many cases a best effort (especially when cdns are involved)

    Many of those sites you mentioned are committing much bigger crimes than the expire one (nih does not even gzip its stuff). I guess part of the reason people are dropping the ball here is that they are just generally dropping the ball.

  2. Just for reference, Will Chan’s linked post is available at https://plus.google.com/103382935642834907366/posts/hsfVHq6wKxG .

  3. Wanna repeat
    “The fastest HTTP request is the one not made.”
    Thanks for such a detailed information.

  4. Hi Steve – great post as usual.

    One thing though – and I don’t mean to be nitpicky, but when you compare 4 percentage *points* (increase in requests with caching headers) to 12 *percent* (increase in number of requests per page) that’s rather misleading.

    The increase from 42% to 46% cached resources is an increase of 9.5 percent. Still a far cry from the 24% increase in total transfer size and certainly disappointing that it’s continuing to trail behind the reqs per page, but still.

  5. I think the cache size of Firefox is determined by the size of the filesystem and/or the amount of free space. There are some settings with the prefix: browser.cache.disk.smart_size

    The size is determined on first use.

    If you copied your Firefox profile from another computer, it might thus be wrong.

    Just now I reset the settings and mine went from 680MB+ to 1GB.

    I believe Firefox also chooses NOT to cache large HTML5-video files.

  6. It wasn’t our “revisiting the finding” so much as “rewriting the cache” that unblocked IE9 from getting better hit ratios with a larger-sized cache. :-)

  7. 315360000 / (3600*24) = 3650 days

    you meant 31536000 for 1 year, not 315360000, right?

  8. Galaxy Nexus has 20MB, as you can see here:
    https://github.com/android/platform_external_webkit/blob/master/Source/WebKit/android/WebCoreSupport/WebCache.cpp

  9. Don’t forget to keep the developer in mind when weaving cache magic – try to make sure that caching is visible, discoverable, and that the cache can be cleared. How many of us have banged our heads on the desk because of a caching behavior that wasn’t discoverable?

    I’m reminded of the way that IE6 used to cache the content type for a page no matter how the content of the page changed (a nightmare for someone trying to generate dynamic PDFs). The only way out was to close and reopen the browser.

    I also recently flailed with browsers caching 301 redirects. Sure a 301 is “permanent,” but some permanent things are less permanent than others.

  10. For the 3rd party scripts, the http request is often the entire point. If you don’t touch their servers, how are they going to track your visit?

  11. @Sam: Thanks for corroborating the ~20% page views with empty cache stat. Other website owners have echoed this stat. I fixed the URL to Will’s Chrome cache post. I agree – there is still a need to evangelize some of the basic WPO techniques like gzip.

    @Jens: Good catch – thank you. I changed “4%” to “10%” in talking about that 42% to 46% increase in use of caching headers.

    @Lennie: Firefox and Chrome both set max cache size based on free disk space – that’s why I included my total and available disk space. What’s yours?

    @Loic: Yes – thank you. I corrected it to be 31536000 (1 year).

  12. Stupid question: What is the difference between specifying a large max-age and setting Expires to a date far in the future? Is using max-age better?

  13. I’ve found this section to be true:

    Perhaps it’s hard to make progress on caching because the problem doesn’t belong to a single group – responsibility spans website owners, third party content providers, and browser developers.

    In the past I’ve tried to convince a few local EDUs, government entities and non-profits to adopt even the most basic of optimizations: compression and caching. These are large entities and it takes months to get just the site owners to think about these things, then more months to get all of the right people to agree and make it happen. In some cases years go by without meaningful changes.

    Now combine that type of site owner situation with all of the other players involved. It is like trying to move mountains with a spoon.

  14. @Eric: Using max-age is better. Expires specifies an exact time value which relies on all the world’s computers’ clocks to be synchronized – not a safe assumption. It’s good to specify both for clients that only support HTTP/1.0 (max-age was added in HTTP/1.1).

  15. There’s a paper coming out at MobiSys’12 on cache behavior of mobile devices, including mobile browsers and embedded HTTP libraries for native apps. Should be out in mid-April or so. Bottom line: Most mobile devices have badly broken cache implementations.

  16. Nice text, specifically on what is actionable on server-side.

    I have more reluctance on the client-side considerations: the cache size on visitors’ browsers is dependent on these persons’ own strategy. For my part, I always bring my caches down to 10-30 MB:
    - Using ADSL, having a larger cache has no perceptible impact on my visits (let’s say a total < 2-3 minutes /day)… Even though the impact on the servers might be lots more
    - Having a larger cache creates larger seeding opportunities for viruses and other malware, since these will have a longer lifespan and more email addresses to collect from webmails

    As a webmaster, my preferred tool is…B-) your own Mobile Perf tool (http://stevesouders.com/mobileperf/mobileperfbkm.php ) which helps a lot pinpointing the easiest performance improvements on the server, eg max age and gzip. Using sprites would also probably be a large source of improvement, but this requires lots more work…

  17. Hey Steve,

    Good stuff. A couple of things:

    * Caching is not just determined by the max-age header. Specifically, you seem to be ignoring heuristic freshness, where a cache can (and almost all do) assign a freshness lifetime if none is explicit in the message.

    Is explicit freshness better? Often, except in cases where the server makes a bad assumption about lifetime (which you note). However, the fact remains that a whole lot more of the Web is cacheable and is cached than the traffic you’re presenting.

    Look for responses that have a status code that allows heuristic freshness AND a Last-Modified header; that’s what most implementations use. See also:
    https://svn.tools.ietf.org/svn/wg/httpbis/draft-ietf-httpbis/latest/p6-cache.html#heuristic.freshness

    * There isn’t a HTTP/1.0 to 1.1 dichotomy on max-age support; 1.0 devices can and almost always do support max-age. HTTP is designed to allow back-porting of features like max-age to 1.0 (since versioning is hop-by-hop anyway).

    In other words, people can drop Expires and just use max-age.

  18. @Mark: You are THE GURU when it comes to HTTP. (Mark heads the IETF HTTPbis Working Group.) You’re right that heuristic caching exists (see this description for heuristic caching in IE9). It’s optional for browsers to implement heuristic caching, and the implementation specifics are also subjective. For example, before IE9 heuristic caching in IE only applied to images. So while it’s true that real world caching is more than the static analysis of max-age headers indicates, we can’t quantify how much of an impact that has. (Although a static analysis of 10% of the interval between now and the Last-Modified date would be interesting.)

    Wrt HTTP/1.0 clients, I was referring to clients that weren’t patched to support HTTP/1.1 changes – i.e., old browsers. This is likely a small fraction of current traffic – but they do exist. The incremental cost of adding Expires might be small enough to justify this benefit. That might be a nice Browserscope user test to determine which browsers in the world today do NOT support max-age. It might be small, but might reveal some significant edge cases just like my authohead user test did.

  19. Blaze claims that persistent cache is still zero, maybe that was changed from iOS 5.01 to iOS 5.1 ?
    http://www.blaze.io/mobile/ios5-top10-performance-changes/
    I understand the 30-35MB is not memory cache but persistent cache of mobile iPhone 4 iOS 5.1

    Galaxy Nexus in code should be 20MB not 18MB (“static const int kMaximumCacheSizeBytes = 20 * 1024 * 1024;”)
    https://github.com/android/platform_external_webkit/blob/master/Source/WebKit/android/WebCoreSupport/WebCache.cpp

    Have you seen this research: http://www.winktoolkit.org/blog/204/

    Maybe it is possible to extend BrowserScope tests to research that

  20. Will using application cache/manifest solve this problem?

  21. @Yaniv: I believe persistent cache was improved. My tests didn’t cover persistent cache.

    @Parvez: I presume that the compressed/uncompressed cache behavior for disk cache is the same as for offline (app) cache. It would be cool if you tested this. ;-)

  22. Don’t forget browsers like Safari and Firefox with a bfcache (Back/Forward cache). Often pages can’t be cached because they contain non-cacheable frames.

    For example, adding a Facebook “like” iframe to the page (with no-cache, no-store, must-revalidate) means the whole page will be re-requested when pressing back/forward.

    It’s pretty annoying if you like DOM modifications to be preserved when pressing back/forward.

  23. “The methodology was to embed a transparent 1×1 image in the page with an expiration date in the past.”

    This is not an appropriate test, since many caches will not add such a resource to cache, and therefore I’d expect a higher number of unconditional requests than full caches.

    If you want to prime a cache with something that will hit your server, you need to use must-revalidate, or max-age = 0 etc.

  24. Adrien: Thanks for the suggestion. I haven’t heard of this. Which browsers behave as you describe?