HTTP Archive 3 Year Anniversary (Thank You Pat Meenan)
The earliest results available in the HTTP Archive are from Nov 15 2010, so in a sense this week is our three year anniversary. Three years! This was on my mind during breakfast as I thought back on how the HTTP Archive came about.
The idea came to me in 2007. I spent years evangelizing the idea, which I called “the Internet Performance Archive”. The amount of work seemed large, so instead of building it myself I met with various companies and encouraged them to build it, but to no avail. I knew it was worthwhile to record performance metrics aggregated across top websites. Each year that passed without this archive meant data that we’d never be able to reclaim. I felt a sense of urgency around getting the archive going.
Then, in September 2010 a confluence of events made me realize I could build it myself. The HTTP Archive file format, an effort I coordinated with Jan Odvarko (Firebug) and Simon Perkins (HttpWatch), had been announced the year before and was gaining wider support. There were more tools available that supported the HAR file format.
But the key factor was the work Pat Meenan was doing on WebPagetest. At the time Pat was still working at AOL. He was expanding the features of WebPagetest significantly and it was becoming one of the most important performance tools in the industry. On September 29, 2010 I sent him this email:
Do you have time to talk today about an idea? I’m open 10:30am-12:30pm and after 3:30pm PT.
The project is the Internet Performance Archive (I mention it here) – a data warehouse of web performance stats. I’ve been talking about this for years, and I’d like to put up a first version now that would have stats for Fortune 500, Global 500, Alexa 1000 and perhaps other major lists. I’d like to get your thoughts and figure out a way to generate the HAR files as easily as possible (ie, it doesn’t take any of your time ;-).
In the ensuing discussion I suggested that Pat create an API for WebPagetest, so that I could build the HTTP Archive as a layer on top of it. In usual fashion, Pat informed me that the feature I wanted was already implemented. We proceeded to iterate on the initial LAMP prototype and started recording data less than two months later. After gathering six months of data the HTTP Archive was announced in March 2011.
There was one part of that initial concept that I was UNable to achieve – doing it without taking any of Pat’s time. Just the opposite: Pat has put in a ton of time to make the HTTP Archive possible. All of the tests are done on a private instance of WebPagetest (which Pat set up). When our load became too costly to run on AWS, Pat helped buy our own hardware and get it set up in our data center, Internet Systems Consortium. When we merged with the Internet Archive, Pat integrated our systems to use their S3-like storage system. He has built special widgets, added stats, and customized the data in the HAR file to make the HTTP Archive work better.
At this three year mark I’m thankful that the HTTP Archive has grown to be a popular source for performance and other stats about how the Web works. It’s a successful project. There’s a lot more to do (most importantly moving to GitHub to promote more contributions so I’m less of a bottleneck) but we’ve accomplished a lot and the future is bright.
I’m thankful for our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Radware, dynaTrace Software, and Torbit (now Walmart). Their support allowed us to move to our own hardware and purchase mobile devices for the HTTP Archive Mobile.
And I’m especially thankful for Pat’s help in creating the HTTP Archive from day one. WebPagetest is awesome, as is its creator.