Announcing the HTTP Archive

March 30, 2011 5:44 pm | 11 Comments

I’m proud to announce the release of the HTTP Archive. From the mission statement:

Successful societies and institutions recognize the need to record their history – this provides a way to review the past, find explanations for current behavior, and spot emerging trends. In 1996 Brewster Kahle realized the cultural significance of the Internet and the need to record its history. As a result he founded the Internet Archive which collects and permanently stores the Web’s digitized content.

In addition to the content of web pages, it’s important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

The HTTP Archive code is open source and the data is downloadable. Approximately 17,000 top websites are examined every two weeks. I started gathering the data in October 2010. The list of URLs is derived from various sources including Alexa, Fortune 500, and Quantcast. The system is built on the shoulders of WebPagetest, which downloads each URL and gathers a HAR file, screenshots, a video of the site loading, and other information. From this information the HTTP Archive extracts data, stores it in a database, aggregates the data, and provides various statistical analyses.
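For anyone curious what that extraction step looks like, here is a minimal sketch of pulling per-request data out of a HAR file. It assumes the standard HAR 1.2 fields and is illustrative only; the actual pipeline is the open source code linked above.

```python
import json

def requests_from_har(har_path):
    """Yield (url, mime_type, transfer_bytes) for each entry in a HAR file."""
    with open(har_path) as f:
        har = json.load(f)
    for entry in har["log"]["entries"]:
        resp = entry["response"]
        body = max(resp.get("bodySize", 0), 0)       # -1 means "unknown" in HAR
        headers = max(resp.get("headersSize", 0), 0)
        mime = resp.get("content", {}).get("mimeType", "")
        yield entry["request"]["url"], mime, body + headers
```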

This chart is one of the many interesting stats from the HTTP Archive. It shows the number of bytes downloaded for the average web page broken out by content type.
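The grouping behind a chart like that can be approximated from the per-request data above. The bucketing rules below are an illustrative guess, not the Archive's exact classification:

```python
from collections import defaultdict

def bytes_by_content_type(entries):
    """Sum transfer bytes per coarse content type (html, scripts, css, images, flash, other)."""
    buckets = defaultdict(int)
    for url, mime, size in entries:
        if "javascript" in mime:
            bucket = "scripts"
        elif mime.startswith("image"):
            bucket = "images"
        elif "css" in mime:
            bucket = "stylesheets"
        elif "html" in mime:
            bucket = "html"
        elif "flash" in mime or "shockwave" in mime:
            bucket = "flash"
        else:
            bucket = "other"
        buckets[bucket] += size
    return dict(buckets)

# Example: totals = bytes_by_content_type(requests_from_har("example.har"))
```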

In addition to interesting stats for an individual run, the HTTP Archive also has trending charts. For example, this chart shows the total transfer size of web pages has increased 88 kB (15%) over the last six months. (In order to compare apples-to-apples, since the list of top sites can change month to month, these comparisons are done using the intersection of ~1100 URLs analyzed in every run.)
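Here is a simplified sketch of that apples-to-apples comparison, using a hypothetical data shape (one dict of URL → total transfer bytes per crawl):

```python
def trend_over_common_urls(runs):
    """runs: list of dicts mapping URL -> total transfer bytes, one per crawl.
    Returns the average page size per crawl, restricted to URLs present in every crawl."""
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    return [sum(run[u] for u in common) / len(common) for run in runs]
```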

The likely cause for this increase is the growth in image size. The next chart shows that the total transfer size of images increased 61 kB (22%) over the same time period, even though the total number of image requests only increased by two (6%). Why did the size of images grow so much? This highlights one of the benefits of the HTTP Archive. Anyone who wants to answer that question can download the data and perform an analysis. Possible things to check include whether the ratio of JPG images increased (since the average size of JPG images is 14 kB compared to 3 kB for GIF and 8 kB for PNG) and whether the increase occurs across most images in the page or a few larger images.
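As a starting point for that kind of analysis, here is a sketch that profiles image requests by format; running it over two crawls and comparing the results would show whether JPG count or size is driving the growth. The function name and grouping rules are illustrative assumptions, not the Archive's own code:

```python
from collections import Counter, defaultdict

def image_format_profile(entries):
    """Count image requests and total bytes per format (jpg, gif, png, other)
    so two crawls can be compared side by side."""
    counts, sizes = Counter(), defaultdict(int)
    for url, mime, size in entries:
        if not mime.startswith("image"):
            continue
        if "jpeg" in mime:
            fmt = "jpg"
        elif "gif" in mime:
            fmt = "gif"
        elif "png" in mime:
            fmt = "png"
        else:
            fmt = "other"
        counts[fmt] += 1
        sizes[fmt] += size
    return {f: (counts[f], sizes[f]) for f in counts}
```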

There are hours of slicing and dicing in store for performance engineers and anyone who loves data. About 20 charts are available today. More will come (add suggestions to the list of issues). The most important thing to remember is that the data is being collected and archived, and will be accessible to new views that are added in the future.

Check out the FAQ for more information. I encourage you to join the HTTP Archive group in order to follow our progress and add to the discussion. Once there you’ll see the first email from Brewster Kahle, who says “Fantastic! I have not seen such a wealth of information on the performance of the web recorded and so enriched.” I hope you too will find it worthwhile.

11 Responses to Announcing the HTTP Archive

  1. Fantastic idea and brilliant execution! I really like that this is open source, and the graphs are elegant and useful. This is already a valuable resource!

  2. Absolutely awesome! Totally satisfies my geek craving for perf charts.

  3. Nice, tracking trends can be a big help in getting a feel for where things are going. At a high level it would be nice to see if the top sites tend to improve their page load performance over time.

  4. Great! Glad to see the evolution there.
    Do you foresee beacon integration at some point in the future?

  5. We already have a web archive site on http://www.archive.org

    They have more data than anyone else.

  6. @Zach: Have you even read the post?

    It specifically mentions the difference between that website and this project.

  7. Very cool, my only concern is how big of a time suck this could end up being for us data addicts ;)

    I noticed there are multiple runs in the stats screen, but couldn’t figure out the frequency. Do the pages just repeatedly run? Or do they run every 2 weeks?

  8. Great idea! Thanks for doing this.

    Best,
    Brad

  9. Nice work! But why didn’t you include Yahoo as one of the websites?

  10. @Joseph: I created bug #111 for this.

    @Marin: Not sure what you mean by beacon integration. The HTTP Archive is based on HAR files and screenshots. That can’t be done via a beacon.

    @Zach: The Internet Archive is awesome. That’s why I approached Brewster about the idea first and worked with him to make sure he thought it was a good idea. The Internet Archive doesn’t give visibility to HTTP headers and other info, and doesn’t provide aggregate stats, so there’s a need for the HTTP Archive.

    @Simeon: Thanks for coming to my rescue! I’m sure Zach didn’t mean anything bad.

    @Guypo: WebPagetest loads each URL 5 times and uses the median. Those 5 runs for all ~16K URLs are done in a single ~5 hour period. That ~5 hour period occurs once every 2 weeks.

    @Brad: Hi! Long time.

    @Spartan: ??? See http://httparchive.org/viewsite.php?pageid=177531

  11. This project is great. I see you have my website listed. It will be great to have an archive to go back to and see how the changes to design and functionality that pile up over the years have affected speed and performance.

    As a web publisher, it’s a constant battle to balance performance vs. functionality as the web changes and different third-party widgets become must-have items. It would also be useful to measure the performance of third-party widgets over time. I mean, how long does the Facebook Like button take to load? Has it improved over the years?

    Another thing that will be really interesting to monitor is how different ad networks perform. Adding another ad network may seem like a no-brainer if it generates more revenue, but if it slows down my website by 2 seconds a page I’d have to rethink it. I’ve never had a way to go back in time and measure something like that. This is exciting stuff!