HTTP Archive: new stats

February 16, 2013 11:22 am | 2 Comments

Over the last two months I’ve been coding on the HTTP Archive. I blogged previously about DB enhancements and adding document flush. Much of this work was done in order to add several new metrics. I just finished adding charts for those stats and wanted to explain each one.

Note: In this discussion I want to comment on how these metrics have trended over the last two years. During that time the sample size of URLs has grown from 15K to 300K. In order to have a more consistent comparison I look at trends for the Top 1000 websites. In the HTTP Archive GUI you can choose between “All”, “Top 1000”, and “Top 100”. The links to charts below take you straight to the “Top 1000” set of results.

Speed Index

The Speed Index chart measures rendering speed. Speed Index was invented by Pat Meenan as part of WebPagetest. (WebPagetest is the framework that runs all of the HTTP Archive tests.) It is the average time (in milliseconds) at which visible parts of the page are displayed. (See the Speed Index documentation for more information.) As we move to Web 2.0, with pages that are richer and more dynamic, window.onload is a less accurate representation of the user’s perception of website speed. Speed Index better reflects how quickly the user can see the page’s content. (Note that we’re currently investigating if the September 2012 increase in Speed Index is the result of bandwidth contention caused by the increase to 300K URLs that occurred at the same time.)
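
To make the "average time at which visible parts of the page are displayed" idea concrete, here is a minimal sketch of the calculation as the Speed Index documentation describes it: the area above the visual progress curve. It is not WebPagetest's actual code. The function name and the input format (a list of time and percent-complete samples) are assumptions made for illustration.

```python
def speed_index(progress):
    """Approximate Speed Index in milliseconds.

    progress: list of (time_ms, visual_completeness_percent) samples,
    sorted by time. The page is assumed to be 0% complete at time 0.
    """
    total = 0.0
    last_time, last_pct = 0, 0.0
    for time_ms, pct in progress:
        # Each interval contributes its duration weighted by how much
        # of the page was still NOT visible during that interval.
        total += (time_ms - last_time) * (1.0 - last_pct / 100.0)
        last_time, last_pct = time_ms, pct
    return total


# Hypothetical page: blank for 1s, half painted at 2s, complete at 3s.
samples = [(1000, 0), (2000, 50), (3000, 100)]
print(speed_index(samples))
```

A page that paints most of its content early gets a lower (better) score than one that finishes at the same time but stays blank longer, which is why the metric tracks perceived speed better than window.onload.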

Doc Size

The Doc Size chart shows the size of the main HTML document. To my surprise this has only grown ~10% over the last two years. I would have thought that the use of inlining (i.e., data:) and richer pages would have shown a bigger increase, especially across the Top 1000 sites.

DOM Elements

I’ve hypothesized that the number of DOM elements in a page has a big impact on performance, so I’m excited to be tracking this in the DOM Elements chart. The number of DOM elements has increased ~16% since May 2011 (when this was added to WebPagetest). Note: Number of DOM elements is not currently available on HTTP Archive Mobile.

Max Reqs on 1 Domain

The question of whether domain sharding is still a valid optimization comes up frequently. The arguments against it are that browsers now open more connections per hostname (up from 2 to 6) and that adding more domains increases the time spent doing DNS lookups. While I agree with these points, I still see many websites that download a large number of resources from a single domain and would cut their page load time in half if they sharded across two domains. This is a great example of the need for Situational Performance Optimization evangelized by Guy Podjarny. If a site has a small number of resources on one domain, it probably shouldn’t do domain sharding. Whereas if many resources use the same domain, domain sharding is likely a good choice.

To gauge the opportunity for this best practice we need to know how often a single domain is used for a large number of resources. That metric is provided by the Max Reqs on 1 Domain chart. For a given website, the number of requests for each domain is counted. The number of requests on the most-used domain is saved as the value of “max reqs on 1 domain” for that page. The average of these max request counts is shown in the chart. For the Top 1000 websites the value has hovered around 42 for the past two years, even while the total number of requests per page has increased from 82 to 99. This tells me that third party content is a major contributor to the increase in total requests, and there are still many opportunities where domain sharding could be beneficial.
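
The counting described above can be sketched in a few lines. This is only an illustration, not the HTTP Archive code. The function names are hypothetical, and treating each distinct hostname as a "domain" is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def max_reqs_on_one_domain(request_urls):
    """Number of requests made to the most-used hostname on one page."""
    counts = Counter(urlsplit(url).hostname for url in request_urls)
    return max(counts.values()) if counts else 0


def average_max_reqs(pages):
    """Average of the per-page maximums (the value plotted in the chart)."""
    maxes = [max_reqs_on_one_domain(urls) for urls in pages]
    return sum(maxes) / len(maxes) if maxes else 0.0


# Hypothetical crawl of two pages.
page_a = ["http://a.example/1.js", "http://a.example/2.css",
          "http://cdn.a.example/logo.png"]
page_b = ["http://b.example/%d.png" % i for i in range(4)]
print(max_reqs_on_one_domain(page_a), average_max_reqs([page_a, page_b]))
```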

The average number of domains per page is also shown in this chart. That has risen 50%, further suggesting that third party content is a major contributor to page weight.

Cacheable Resources

This chart was previously called “Requests with Caching Headers”. While the presence of caching headers is interesting, a more important performance metric is the number of resources that have a non-zero cache lifetime (AKA, “freshness lifetime” as defined in the HTTP spec RFC 2616). To that end I now calculate a new stat for requests, “expAge”, that is the cache lifetime (in seconds). The Cacheable Resources chart shows the percentage of resources with a non-zero expAge.

This revamp included a few other improvements over the previous calculations:

  • It takes the Expires header into consideration. I previously assumed that if someone sent Expires they were likely to also send max-age, but it turns out that 9% of requests have an Expires but do not specify max-age. (Max-age takes precedence if both exist.)
  • When the expAge value is based on the Expires date (because max-age is absent), the freshness lifetime is the delta of the Expires date and the Date response header value. For the ~1% of requests that don’t have a Date header, the client’s date value at the time of the request is used.
  • The new calculation takes into consideration Cache-Control no-store, no-cache, and must-revalidate, setting expAge to zero if any of those are present.

The percentage of resources that are cacheable hasn’t increased much in the last two years, hovering around 60%. And remember – the chart shown here is for the Top 1000 websites which are more highly tuned for performance than the long tail. This metric drops down to ~42% across all 300K top sites. I think this is a big opportunity for better performance, especially since I believe many sites don’t specify caching headers due to lack of awareness. A deeper study for a performance researcher out there would be to determine how many of the uncacheable resources truly shouldn’t be cached (e.g., logging beacons) versus static resources that could have a positive cache time (e.g., resources with a Last-Modified date in the past).
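
The expAge rules listed above can be sketched as follows. This is a minimal illustration rather than the HTTP Archive implementation. The function names, the lowercase-keyed header dictionary, and the choice to clamp an Expires date in the past to zero are all assumptions.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _parse_http_date(value):
    """Parse an HTTP date header; return None if missing or malformed."""
    if not value:
        return None
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    # Treat dates without a zone as UTC (assumption).
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def exp_age(headers, request_time=None):
    """Cache lifetime in seconds for one response.

    headers: dict of response headers with lowercase names.
    request_time: client's time of the request, used if Date is absent.
    """
    cc = headers.get("cache-control", "").lower()
    directives = {d.strip().split("=")[0] for d in cc.split(",")}
    # Any of these directives forces a lifetime of zero.
    if directives & {"no-store", "no-cache", "must-revalidate"}:
        return 0
    # max-age takes precedence over Expires when both are present.
    match = re.search(r'max-age\s*=\s*"?(\d+)', cc)
    if match:
        return int(match.group(1))
    expires = _parse_http_date(headers.get("expires"))
    if expires is None:
        return 0
    # Fall back to the client's clock when there is no Date header.
    date = (_parse_http_date(headers.get("date"))
            or request_time or datetime.now(timezone.utc))
    return max(0, int((expires - date).total_seconds()))


print(exp_age({"expires": "Fri, 01 Feb 2013 01:00:00 GMT",
               "date": "Fri, 01 Feb 2013 00:00:00 GMT"}))
```

A resource counts as cacheable in the chart when this value is greater than zero.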

Cache Lifetime

The Cache Lifetime chart gives a histogram of expAge values for an individual crawl. (See the definition of expAge above.) This chart used to be called “Cache-Control: max-age”, but that was only focused on the max-age value. As described previously, the new expAge calculation takes the Expires header into consideration, as well as other Cache-Control options that override cache lifetime. For the Top 1000 sites on Feb 1 2013, 39% of resources had a cache lifetime of 0. Remembering that top sites are typically better tuned for performance, we’re not surprised that this jumps to 59% across all sites.
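
As a rough sketch of how such a histogram could be built from a list of expAge values: the bucket boundaries below (0, one day, 30 days, 365 days) are illustrative assumptions, not necessarily the ones used in the chart, and the function names are hypothetical.

```python
from collections import Counter

SECONDS_PER_DAY = 86400
# Illustrative bucket labels, in display order (assumed boundaries).
LABELS = ["0", "<= 1 day", "<= 30 days", "<= 365 days", "> 365 days"]


def bucket_label(exp_age_seconds):
    """Map one expAge value (in seconds) to a histogram bucket."""
    if exp_age_seconds <= 0:
        return "0"
    days = exp_age_seconds / SECONDS_PER_DAY
    if days <= 1:
        return "<= 1 day"
    if days <= 30:
        return "<= 30 days"
    if days <= 365:
        return "<= 365 days"
    return "> 365 days"


def lifetime_histogram(exp_ages):
    """Percentage of resources that fall into each bucket."""
    counts = Counter(bucket_label(age) for age in exp_ages)
    total = len(exp_ages)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Hypothetical expAge values for five resources.
ages = [0, 0, 3600, 7 * SECONDS_PER_DAY, 400 * SECONDS_PER_DAY]
print(lifetime_histogram(ages))
```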

Sites hosting HTML on CDN

The last new chart is Sites hosting HTML on CDN. This shows the percentage of sites that have their main HTML document hosted on a CDN. WebPagetest started tracking this on Oct 1, 2012. The CDNs recorded in the most recent crawl were Google, Cloudflare, Akamai, Limelight, Level 3, Edgecast, Cotendo CDN, ChinaCache, CDNetworks, Incapsula, Amazon CloudFront, AT&T, Yottaa, NetDNA, Mirror Image, Fastly, Internap, Highwinds, Windows Azure, cubeCDN, Azion, BitGravity, Cachefly, CDN77, Panther, OnApp, Simple CDN, and BO.LT. This is a new feature and I’m sure there are questions about determining and adding CDNs. We’ll follow up on those as they come in. Keep in mind that this is just for the main HTML document.

It’s great to see the HTTP Archive growing both in terms of coverage (number of URLs) and depth of metrics. Make sure to check out the About page to find links to the code, data downloads, FAQ, and discussion group.


2 Responses to HTTP Archive: new stats

  1. “The new calculation takes into consideration Cache-Control no-store, no-cache, and must-revalidate, setting expAge to zero if any of those are present.”

    Perhaps you need to track must-revalidate separately. 304s are still better than 200s.

    “A deeper study for a performance researcher out there would be to determine how many of the uncacheable resources truly shouldn’t be cached (e.g., logging beacons) versus static resources that could have a positive cache time (e.g, resources with a Last-Modified date in the past).”

    Browsers use a heuristic to sometimes serve files from a cache when no cache control header is provided and last-modified is in the near/intermediate/far past. If that heuristic is standardised, can you report what requests are likely to be served from cache anyway?

  2. Nicholas: I don’t think it’s worth separating out must-revalidate. No-cache, max-age=0, etc. can also result in a nice 304 response – so that’s not limited to just must-revalidate.

    There is no standard for heuristic caching. The HTTP spec says browsers MAY use heuristics to calculate a freshness lifetime based on Last-Modified and suggests 10% of now – Last-Modified-date, but goes on to recommend that heuristic calculations “ought to used cautiously” [sic].

    But to answer your question, yes, I could research popular browsers to reverse engineer their heuristic caching formula, and analyze the HTTP Archive data for the presence of the relevant headers and calculate the heuristic expAge. But I was suggesting someone ELSE could do that as a nice research project. ;-)