Call to improve browser caching

April 26, 2010 9:14 pm | 38 Comments

Over Christmas break I wrote Santa my browser wishlist. There was one item I neglected to ask for: improvements to the browser disk cache.

In 2007 Tenni Theurer and I ran an experiment to measure browser cache stats from the server side. Tenni’s write up, Browser Cache Usage – Exposed, is the stuff of legend. There she reveals that while 80% of page views were done with a primed cache, 40-60% of unique users hit the site with an empty cache at least once per day. 40-60% seems high, but I’ve heard similar numbers from respected web devs at other major sites.

Why do so many users have an empty cache at least once per day?

I’ve been racking my brain for years trying to answer this question. Here are some answers I’ve come up with:

  • first time users – Yea, but not 40-60%.
  • cleared cache – It’s true: more and more people are likely using anti-virus software that clears the cache between browser sessions. And since we ran that experiment back in 2007 many browsers have added options for clearing the cache frequently (for example, Firefox’s privacy.clearOnShutdown.cache option). But again, this doesn’t account for the 40-60% number.
  • flawed experiment – It turns out there was a flaw in the experiment (browsers ignore caching headers when an image is in memory), but this would only affect the 80% number, not the 40-60% number. And I expect the impact on the 80% number is small, given the fact that other folks have gotten similar numbers. (In a future blog post I’ll share a new experiment design I’ve been working on.)
  • resources got evicted – hmmmmm

OK, let’s talk about eviction for a minute. The two biggest influencers for a resource getting evicted are the size of the cache and the eviction algorithm. It turns out, the amount of disk space used for caching hasn’t kept pace with the size of people’s drives and their use of the Web. Here are the default disk cache sizes for the major browsers:

  • Internet Explorer: 8-50 MB
  • Firefox: 50 MB
  • Safari: everything I found said there isn’t a max size setting (???)
  • Chrome: < 80 MB (varies depending on available disk space)
  • Opera: 20 MB

Those defaults are too small. My disk drive is 150 GB of which 120 GB is free. I’d gladly give up 5 GB or more to raise the odds of web pages loading faster.

Even with more disk space, the cache is eventually going to fill up. When that happens, cached resources need to be evicted to make room for the new ones. Here’s where eviction algorithms come into play. Most eviction algorithms are LRU-based – the resource that was least recently used is evicted. However, our knowledge of performance pain points has grown dramatically in the last few years. Translating this knowledge into eviction algorithm improvements makes sense. For example, we’re all aware how much costlier it is to download a script than an image. (Scripts block other downloads and rendering.) Scripts, therefore, should be given a higher priority when it comes to caching.
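To make the idea concrete, here is a minimal sketch of an LRU cache that evicts non-script resources before scripts. The class name `TypeAwareLRUCache`, the `kind` labels, and the two-tier policy are all assumptions made for illustration; this is not how any particular browser works.

```python
from collections import OrderedDict


class TypeAwareLRUCache:
    """Toy byte-budgeted cache: LRU order, but scripts are evicted last."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        # url -> (kind, size); least recently used entries come first.
        self.entries = OrderedDict()

    def get(self, url):
        if url not in self.entries:
            return None
        self.entries.move_to_end(url)  # mark as most recently used
        return self.entries[url]

    def put(self, url, kind, size):
        if size > self.max_bytes:
            return  # too big to ever fit
        if url in self.entries:
            self.used_bytes -= self.entries.pop(url)[1]
        while self.used_bytes + size > self.max_bytes:
            self._evict_one()
        self.entries[url] = (kind, size)
        self.used_bytes += size

    def _evict_one(self):
        # Prefer the least recently used non-script; fall back to plain LRU.
        victim = next(
            (u for u, (kind, _) in self.entries.items() if kind != "script"),
            next(iter(self.entries)),
        )
        self.used_bytes -= self.entries.pop(victim)[1]


cache = TypeAwareLRUCache(max_bytes=100)
cache.put("a.js", "script", 40)
cache.put("b.png", "image", 40)
cache.put("c.png", "image", 40)  # evicts b.png, not the older a.js
```

A real policy would more likely weight resource types than rank them strictly, but the sketch shows where such a rule would plug into the eviction step.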

It’s hard to gather browser disk cache stats, so I’m asking people to look up their own settings and share them via the Browser Disk Cache Survey form. I included this in my talks at JSConf and jQueryConf. ~150 folks at those conferences filled out the form. The data shows that 55% of people surveyed have a cache that’s over 90% full. (Caveats: this is a small sample size and the data is self-reported.) It would be great if you would take time to fill out the form. I’ve also started writing instructions for finding your cache settings.

I’m optimistic about the potential speedup that could result from improving browser caching, and fortunately browser vendors seem receptive (for example, the recent Mozilla Caching Summit). I expect we’ll see better default cache sizes and eviction logic in the next major release of each browser. Until then, jack up your defaults as described in the instructions. And please add comments for any browsers I left out or got wrong. Thanks.

38 Responses to Call to improve browser caching

  1. As far as I can tell, the eviction algorithm matters far more than a bigger cache size alone. As you pointed out, no matter how big your cache is, the browser will fill it sooner or later.

    When I tried raising it to just 200 MB, Firefox started caching 10-100 MB Flash movie files.

    Needless to say, if you raise it to 2 GB you will just end up storing a whole theater in your cache… :)

  2. This is with reference to your Browser Disk Cache page – http://stevesouders.com/cache.php

    The path of the folder where Opera caches files (& subsequently its size) can be found by going to Help -> About Opera. It’s listed in the third section (in Opera 9.6)

  3. Is there documentation on which eviction algorithms browsers use? Firefox for example keeps track of some sort of fetch count (about:cache?device=disk) which would seem superfluous if the cache were LRU and not an LFU variant. Expires and last-modified are also fertile ground for heuristics.

  4. @Peter: We need both. During the day an active web user can easily generate more than 50MB of files that should be cached other than movies.

    @RK: Thanks! I updated cache.php.

    @Chris: I haven’t seen any documentation. For Chrome and Firefox, the source code is available.

  5. just use polipo.

    http://www.pps.jussieu.fr/~jch/software/polipo/

  6. This is useful information.

    Can you describe more about the flawed experiment and the conditions under which the caching headers are ignored? What does “image in memory” mean?

    One additional point is that it is common to see 10-20% of visitors get a new persistent cookie dropped (after accounting for bots and such). This high number of new users is a surprise, as 10-20% of the population does not consist of new users. The possible explanations are: browsers configured to not accept persistent cookies, anti-virus software deleting cookies (and the cache), or users deleting cookies. One plausible explanation is that antivirus software plays a role, just as it does in mangling the gzip compression header.

  7. LRU is easy to implement but sub-optimal. Browser makers should consider switching to IBM’s Adaptive Replacement Cache (ARC) algorithm instead, although there are potential patent issues with it.

    http://en.wikipedia.org/wiki/Adaptive_replacement_cache

  8. Caching should depend not just on computer disk capacity but also on the network bandwidth available to the computer. Slower connections could benefit more from a bigger cache.

    So, granted that most users in US urban areas have a huge disk on their computer and are connected with a big pipe, it may be worthwhile to collect the general geography and the network speed of each participant.

  9. Interesting. Do you know if something that is gzipped and cached stays gzipped? Also, are there any algorithms that compare files? For example, if two sites served the same image/CSS file or JS library, it would be good to compare the bytes and names to what was already in the cache and, instead of caching it again, just provide a symlink to keep memory consumption down. This topic is really important for mobile browsers.

  10. > [...]the amount of disk space used
    > for caching hasn’t kept pace with the
    > size of people’s drives [...]

    I like to look at it from a different angle: cache sizes didn’t grow with the size of websites.

    The size of landing pages triples every 5 years. Well, that’s the current trend at least.

    With that in mind one can come up with the following formula:

    (3^((year-2004)/5))*50

    Today it would result in 187 MB. Next year it would be at 233 MB. Then 290 MB, 361 MB, 450 MB, and so on. The 2004/50 MB starting point was chosen because the line for Firefox’s 50 MB default cache size was written back in 2004.
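A quick way to check those numbers is to evaluate the formula directly. The sketch below is only an illustration; the function name `projected_cache_mb` and its parameter names are made up for this example.

```python
def projected_cache_mb(year, base_year=2004, base_mb=50, growth=3, period_years=5):
    """Cache size (MB) that keeps pace with pages tripling every five years."""
    return base_mb * growth ** ((year - base_year) / period_years)


# Rounded projections for 2010 through 2014.
for year in range(2010, 2015):
    print(year, round(projected_cache_mb(year)))
```

Rounded to the nearest MB, the values for 2010 through 2014 come out as 187, 233, 290, 361, and 450, matching the figures quoted above.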

    I can’t really recommend increasing Firefox’s cache size though (on Windows at least). Whenever it goes over 4000 (4096?) entries (easily reachable with a 100+ MB cache) the cache stops working. about:cache will also start to display some nonsense then.

    I haven’t tried it for a while, but it’s probably still broken, since stuff like this isn’t deemed to be a critical issue.

  11. @Mahesh: I’ll describe the experiment issues in a future blog post. Cookies also get evicted based on size etc. I’ve studied this in the past and the eviction algorithms are not well understood.

    @SR: Yes, the eviction algorithm should give a higher cache priority to resources that took a long time to download.

  12. @Simon: Firefox, Safari, Chrome, and Opera store the files gzipped. IE stores them ungzipped. (Tested on Windows XP using http://stevesouders.com/tests/big.php.) There have been discussions in the past about a checksum for comparing files that are identical, but there are significant security issues. I’m not sure it would be much of a win, either.

  13. Ok. I’m at 4282 entries now. Seems like using bigger cache sizes finally does work with Firefox. :)

  14. Forgot to point out that filling up another 100 MB of the cache (50 -> 150) merely took an hour of regular browsing.

    So, yes, those 50 MB don’t really last very long.

  15. One issue could be large corporate installs on particular operating systems. On Linux the cache is stored in my home directory. At my university, home directories have a quota of about 200 MB (primarily because they are backed up every few hours). As the university runs a proxy and has a fast Internet connection, it makes more sense to simply turn off the browser disk cache; otherwise backups become huge. I’ve seen places running XP with roaming profiles that also disabled IE’s cache because at the time it was being copied across the network into the roaming profile (I’d be surprised if this still happens – it should remain local to the machine). Too many different users’ disk caches can suck up disk space though.

    Something similar can sometimes happen on machines with slow writes. If disk writes are unbearably slow and you have a comparatively large amount of RAM you can just disable disk cache entirely.

  16. IME the browser caches are pretty primitive (perhaps because they haven’t seen much attention in so long?), and they all implement the spec a little differently, which leads web devs to spit out a crazy soup of headers to try to control them.

    Cache replacement algorithms are a good start (LFU is generally better for small caches, but there’s been a TON of research in this area that browser vendors should really take advantage of).

    However, they need to also focus on other things like protocol compliance for response headers, cache invalidation, etc. The revised caching spec in HTTPbis [http://tools.ietf.org/html/draft-ietf-httpbis-p6-cache] is trying to clean this up and remove ambiguity, which would make it a great starting place for the browser implementers — both to help them understand HTTP caching but also to get their feedback.

    I’d like to take it one step further and make the cache controllable from scripting; e.g., if you send “Cache-Control: max-age=30, max-stale” from XHR, the browser cache should respect that and do the right thing. This would give Web apps (especially “RESTful” ones) a lot more power.

    There are also a lot of ‘behind-the-scenes’ things in a cache that need attention; e.g., when to swap out memory cache to disk, how to handle concurrent requests, hash algorithms for cache lookup, and handling failure.

    It may be useful to talk with folks in the proxy cache community (e.g., Squid), because they’ve been hyper-focused on these problems for years. Some knowledge transfer (both ways) can only be a good thing.

    So, yes, lots to do :)

  17. Oh, and this is old, but still relevant:
    http://www.mnot.net/blog/2006/05/11/browser_caching

    I need to update the tests there.

  18. I think the size of the cache more than anything else is what needs to be improved upon. Once the HD starts swapping it’s all over for performance.

  19. Steve, just a minor comment: Step 9 of your instructions to find the cache information in IE states “This number is already in KB, so you can paste it directly into the survey form”. It’s in MB for me.

  20. @Steve, thanks for the response, I hadn’t considered the security implications of doing that, it’s a good point.
    I was wondering if you knew the answer to another follow up question.
    When JavaScript engines like V8 or Nitro compile JavaScript to bytecode, do they cache the compiled bytecode version or does it have to be re-compiled every time it’s loaded from the cache?

  21. @Jos Hirth:

    I don’t know about 4096, but Mozilla’s cache is limited to 8192 items. I’m on Linux, and I’ve never seen any odd behaviour — it just evicts something to avoid going over that limit.

  22. Well, at large cache sizes (in terms of files), file system speed issues will become more apparent. On Linux the file search in a directory was usually O(n). So perhaps the cache needs smarter storage as well.

  23. Someone should see about getting Squid’s GDSF (Greedy-Dual Sized Frequency) cache policy into the browser, that combined with a max cacheable size would probably do wonders.

    I’d also like to see first byte and total response time added to the caching heuristic. If something can be refetched quickly and is not often accessed, evict it.

    In any case running a local proxy server can do wonders for browsing performance. If only Cisco WCCPv2 protocol was supported on lower end hardware — then we might actually see home and SMB web caching appliances.
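For readers unfamiliar with GDSF, here is a minimal sketch of the usual formulation: each entry gets a priority of L + frequency × cost / size, the lowest-priority entry is evicted, and L is raised to the evicted entry's priority. The class name `GDSFCache`, the `access` method, and the default `cost=1.0` are assumptions for illustration; Squid's actual implementation differs in detail. Passing a fetch time as `cost` is one way to fold response time into the heuristic, as suggested above.

```python
class GDSFCache:
    """Toy Greedy-Dual Size Frequency cache keyed by URL."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.inflation = 0.0  # "L": rises to the priority of each evicted entry
        self.entries = {}     # url -> {"size", "cost", "freq", "priority"}

    def _priority(self, entry):
        # Small, frequently used, expensive-to-fetch entries score highest.
        return self.inflation + entry["freq"] * entry["cost"] / entry["size"]

    def access(self, url, size, cost=1.0):
        """Record a request for url (size must be > 0). Returns True on a hit."""
        entry = self.entries.get(url)
        if entry is not None:
            entry["freq"] += 1
            entry["priority"] = self._priority(entry)
            return True
        if size > self.max_bytes:
            return False  # never cacheable
        while self.used_bytes + size > self.max_bytes:
            victim = min(self.entries, key=lambda u: self.entries[u]["priority"])
            self.inflation = self.entries[victim]["priority"]
            self.used_bytes -= self.entries.pop(victim)["size"]
        entry = {"size": size, "cost": cost, "freq": 1}
        entry["priority"] = self._priority(entry)
        self.entries[url] = entry
        self.used_bytes += size
        return False


cache = GDSFCache(max_bytes=100)
cache.access("big.swf", 80)
cache.access("small.js", 10)
cache.access("small.js", 10)  # hit: raises small.js's priority
cache.access("new.css", 20)   # evicts big.swf, the lowest-priority entry
```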

  24. The memory cache in Firefox (not the disk cache) uses the LRU-SP algorithm, which seems to be a good candidate for a proxy server, and possibly for a browser cache too (maybe overkill for a memory cache). It has an interesting eviction strategy – it tries to destroy big items first, but it also uses fetch counts to lower the priority for heavily used items. But I found one flaw in the algorithm, as explained in bug 397275: you also need a timer-based eviction strategy, because otherwise very small items or items that were used a lot in the past (but not anymore) will never be deleted.

  25. Link to the LRU-SP algorithm:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.6081&rep=rep1&type=pdf

  26. Steve,

    I see a small issue with your survey audience and methodology. I just submitted my test results for IE, FF, and Chrome. I use Chrome as my default browser and so my cache is nearly filled, but my cache in FF and IE was nearly empty since I use them for web dev only and thus often find myself clearing the caches to view / test things.

    I’m not sure that I have a better suggestion other than to just ask your readers for cache stats on the browser they use to browse the web. Your results still may be skewed even then because I think your readership is more prone to have different numbers in there in the first place and / or clear their cache more often for web dev.

    It’s something to take note of either way when presenting the results.

  27. Interesting experiment.

    In reference to the browser cache page
    http://stevesouders.com/cache.php

    Step 7. The Temporary Internet Files folder may be hidden when moving up to the parent directory on Windows. From folder options select view, remove the check mark from “Hide protected operating system files (Recommended).”

    Step 9. My cache size was in MB not KB. If other people are just pasting in the value it may be skewing the results if their value is in MB.

  28. Andrew, maybe you shouldn’t have submitted results that are known to be unrepresentative. So why did you send data from browsers you don’t use? That’s nonsense at best. :)

    Your second paragraph brings up some useful insight, but as you can see, the results show that web devs clearing their caches more aggressively doesn’t change the outcome of this survey (“55% of people surveyed have a cache that’s over 90% full”).

  29. @Andrew: Yes, it would be preferable if people only entered their primary browser. But I’m also interested in max cache size (primary and otherwise). I’ll probably take medians, so the impact is likely small. This is a self-selected population and self-reported data – we’ll have to take it with a grain of salt until we can do a better survey.

  30. A few days ago I said 150mb works fine with Firefox. Turns out, it doesn’t.

    Number of entries: 0

    It stopped caching completely again. So, back to 50mb it is. :/

  31. Someone gave me a link to the bug report:

    Only 8192 objects (entries) can be stored in disk cache.

    https://bugzilla.mozilla.org/show_bug.cgi?id=175600

  32. Firefox clears the cache after a browser crash: https://bugzilla.mozilla.org/show_bug.cgi?id=105843#c35

    This might explain why so many users have empty caches. Vote to have it fixed!

  33. Good news, everyone. Firefox’s bug #175600 has been fixed. It will be in 4.0 and you can already try it today with the 4.0 betas.

    I just tried it myself and it seems to work great indeed. Finally one can use a reasonably sized disk cache.

    If you want to try it yourself, a great way to fill up your cache quickly is to use Google Maps. Go full screen and look around casually. Within minutes your cache will be filled with thousands of objects.

  34. Chrome has since increased its cache. Mark Larson indicates the data here is out of date.
    http://code.google.com/p/chromium/issues/detail?id=53541

  35. On Chrome’s cache…

    The maximum size of the cache is calculated as a percentage of available disk space. The contents can be viewed at chrome://net-internals/#httpCache. It can be cleared manually at chrome://settings/advanced or programmatically by calling chrome.benchmarking.clearCache() when Chrome is run with the --enable-benchmarking flag set. Note that for incognito windows this cache actually resides in memory.

    ~via http://gent.ilcore.com/2011/02/chromes-10-caches.html

  36. Question: if you visit a site where no expires headers are sent for images, CSS, and JavaScript – considering browser caching, when will the browser send a conditional GET request to gauge content freshness? (apart from when the user initiates a “refresh” command in the browser)

  37. @Expectationgap: I haven’t tested this. Given all the other stuff to look at, since I would never do this it’d be a pretty low priority for research. You should always set caching headers.
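For background, the HTTP caching spec lets a cache estimate a freshness lifetime when a response has a Last-Modified date but no explicit expiration, and mentions 10% of the time since Last-Modified as a typical fraction. The sketch below illustrates that heuristic only; the function name and the assumption that a given browser applies exactly this rule (or applies it without a cap) are hypothetical and untested here.

```python
from datetime import datetime, timedelta, timezone


def heuristic_freshness(date, last_modified, fraction=0.10):
    """Guess a freshness lifetime when no Expires or max-age is present.

    Assumption: the cache uses a fixed fraction of the interval between
    Last-Modified and the response Date; real browsers may differ.
    """
    age_at_fetch = date - last_modified
    # Never return a negative lifetime if Last-Modified is in the future.
    return max(timedelta(0), age_at_fetch * fraction)


fetched = datetime(2010, 4, 26, tzinfo=timezone.utc)
modified = fetched - timedelta(days=10)
# A resource last changed 10 days before it was fetched is treated as
# fresh for roughly one day before a conditional GET would be sent.
print(heuristic_freshness(fetched, modified))
```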

  38. A good alternative is to use the HTML5 cache manifest where available and applicable, http://developer.teradata.com/blog/js186040/2011/11/html5-cache-manifest-an-off-label-usage.