HTTP Archive: URL list, Flash trends
Last week I announced the launch of the HTTP Archive. The feedback has been very positive. I’ve already heard from a handful of performance gurus who have downloaded the data and done additional analyses. This was a major goal of the project and I’m excited to see it happening.
I made a few changes to HTTP Archive that I wanted to share in this blog post.
First, there’s a potential for apples-to-oranges comparisons because the list of URLs “crawled” by HTTP Archive changes from run to run due to errors and changes in the “top N” sorting of sources like Alexa and Fortune 500. When comparing two runs it’s unclear whether differences are caused by a change in the sample set or by actual changes in Internet behavior. This was exemplified by this tweet from @orionlogic:
The link contains two pie charts from HTTP Archive:
The issue is that the number of URLs grew from ~1000 in October 2010 to ~17,000 in April 2011. Those additional 16,000 websites have different behavior when it comes to using Flash. If we compare Nov 15 2010 to Mar 29 2011, both of which use ~17,000 URLs, the change is only 2%.
I made some changes to mitigate this issue.
- The first three runs, which were done with only ~1,000 URLs, are now hidden in the UI. The data is still available.
- Similar confusion can happen when viewing trending charts. The fix there is to use the “intersection” set of URLs across all runs. I added a note next to the “choose URLs” pick list to point out the benefit of choosing “intersection”. I moved the plot of “URLs Analyzed” to the top so it’s more apparent when the number of URLs changes from run to run.
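To illustrate why the “intersection” set gives a fairer comparison, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the run labels, the example URLs, the Flash flags, and the helper names `intersection_urls` and `flash_share` are hypothetical and are not taken from the HTTP Archive code or data.

```python
from typing import Dict, Set

# Hypothetical data shape: run label -> {URL -> does the page use Flash?}
Runs = Dict[str, Dict[str, bool]]


def intersection_urls(runs: Runs) -> Set[str]:
    """Return the URLs that appear in every run."""
    url_sets = [set(pages) for pages in runs.values()]
    return set.intersection(*url_sets) if url_sets else set()


def flash_share(pages: Dict[str, bool], urls: Set[str]) -> float:
    """Fraction of the given URLs whose page uses Flash in this run."""
    if not urls:
        return 0.0
    return sum(pages[url] for url in urls) / len(urls)


# Made-up sample: the second run crawls two extra sites that do not use Flash.
runs: Runs = {
    "run_a": {"a.example": True, "b.example": False},
    "run_b": {
        "a.example": True,
        "b.example": False,
        "c.example": False,
        "d.example": False,
    },
}

common = intersection_urls(runs)

# Naive comparison: each run is measured over its own, different URL list.
naive = {label: flash_share(pages, set(pages)) for label, pages in runs.items()}

# Intersection comparison: both runs are measured over the same URLs.
fair = {label: flash_share(pages, common) for label, pages in runs.items()}

print("naive:", naive)
print("intersection:", fair)
```

In this toy data the naive numbers suggest Flash usage dropped by half, while the intersection numbers show that nothing changed on the sites both runs have in common. The apparent drop comes entirely from the larger sample.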
I made several other fixes that are less visible. Several people have submitted requests for new stats. I’ll keep knocking those off and blogging about them.