HTTP Archive: new schema & dumps
I spent most of the last month, including my holiday break (cue violin), implementing major changes to the HTTP Archive. Most of the externally visible UI changes aren’t there yet – I’ll be blogging about them when they’re available. The changes I worked on were primarily on the backend, making the database faster to query, smaller on disk, easier to download, and richer in stats. This required going through every crawl since Nov 15, 2010 to import the data, massage the schema, calculate new stats, and export the updated MySQL dump.
You might ask why I had to start with importing the data. Well, that’s a funny story.
The HTTP Archive has 6 MyISAM tables. Most of the data is in the “pages” and “requests” tables. There are ~5.5M rows in pages, and 472M in requests – reflecting the ratio of ~85 requests per page. Keeping in mind that we’re saving all the HTTP request & response headers, it’s not surprising that the requests table is over 400G. After removing some rows and applying some schema changes to reduce the table size, I ran “optimize table requests;” to reclaim the disk space. Unbeknownst to me, MySQL tried to create a temporary copy of the (400G) table on our (660G) disk. After running out of disk space, both the original requests table and the incomplete copy were corrupted. I tried the documented recovery procedures without success. I have no other clues why this happened and assume it’s atypical.
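If you’re about to run the same command, a defensive sketch is worth a minute (the database name here is an assumption): check the table’s on-disk footprint against free disk space first, since rebuilding the table requires roughly a full temporary copy.

```sql
-- How big are the pages and requests tables on disk?
-- (table_schema value is an assumption about the database name)
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS total_gb
  FROM information_schema.tables
 WHERE table_schema = 'httparchive'
   AND table_name IN ('pages', 'requests');

-- OPTIMIZE TABLE rebuilds the table into a temporary copy, so the disk
-- needs roughly the table's full size in free space before running this:
OPTIMIZE TABLE requests;
```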
Nevertheless, in one evening 400G of data gathered over the last 2.5 years disappeared.
I’m being a little melodramatic – it’s not as bad as it sounds. I knew I had MySQL dumps for every crawl. The data wasn’t lost; it just meant it was going to be harder to apply my schema updates and calculate new stats, especially since I never wanted the requests table to climb over 300G again. And it meant that the script for applying these changes had to start by importing the data from the dump file – something that took 1-5 hours depending on the size of the crawl. There are 56 crawls.
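Most of that import time goes to rebuilding indexes. One common MyISAM bulk-load trick is to suspend non-unique index maintenance during the load and rebuild in one pass afterward; a sketch (filenames are illustrative, and note that mysqldump output usually emits the DISABLE/ENABLE KEYS statements for you):

```sql
-- Suspend non-unique index maintenance while rows load (MyISAM only):
ALTER TABLE requests DISABLE KEYS;

SOURCE requests_dump.sql;   -- mysql-client command; filename illustrative

-- Rebuild the indexes in a single pass:
ALTER TABLE requests ENABLE KEYS;
```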
After spending about a minute bemoaning my bad luck (and lack of MySQL skills), I accepted the fact that I was entering a task where it was necessary to put my head down and just push on. I learned this lesson in high school when I showed up for my first construction job. I arrived at the job site where the crew was adding a second story to a house. I knew the foreman and asked him what I should do, bubbling over with excitement. He pointed to the roof that had formerly been on the house and now lay in a giant pile next to the house. They were of equal size. “Load the roof in that dump truck and take it to the dump.” I looked at him bewildered, “But it won’t fit in the truck.” “Fill the truck as full as you can, drive it to the dump, come back and do it again. Keep doing that until it’s gone.” And you know what? Sometimes that’s what happens in life. I taught myself to never look at the size of the pile – it was too discouraging. I just kept my head down and loaded the truck – for three days – until the job was done. (Sorry for the lecture but I think that’s an important lesson.)
Here we are 4 weeks later and the job is done. While the update script was running I made a significant change to the web GUI – nothing on the production website accesses the requests table. This means we can remove old crawls without impacting users, enabling us to keep the requests table to a manageable size, usually just containing the last few crawls. Keeping the last few crawls online is extremely useful for answering one-off questions, like “how many sites use script X?” or “how many requests have header Y?”.
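Pruning old crawls from the requests table might look something like this sketch – I’m guessing at the join column and the date column, so treat all the names as assumptions about the schema:

```sql
-- Delete all requests belonging to crawls older than a cutoff.
-- (pageid join and createDate column are assumptions about the schema)
DELETE requests
  FROM requests
  JOIN pages USING (pageid)
 WHERE pages.createDate < UNIX_TIMESTAMP('2012-10-01');
```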
Since the web GUI doesn’t require the requests table, and since restoring the dump files used to take so long, I split the dump files into two files: one for pages and another for requests. With this change it’s possible for people running private instances of the HTTP Archive code to import the smaller pages table much faster if that’s all they want. Another change is that whereas the dump files were formerly only available in MySQL format, they’re now also available as CSV. You can see the list of dump files on the HTTP Archive downloads page. All of these changes were also applied to the HTTP Archive Mobile.
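For anyone maintaining a private instance, the split is straightforward to reproduce: dump each table to its own file, and MySQL itself can emit CSV. A sketch (database, table, and file paths are illustrative):

```sql
-- Per-table dumps, run from a shell:
--   mysqldump httparchive pages    > httparchive_pages.sql
--   mysqldump httparchive requests > httparchive_requests.sql

-- CSV export from within MySQL; the output path and delimiters are illustrative,
-- and the file is written on the *server* host:
SELECT *
  INTO OUTFILE '/tmp/httparchive_pages.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM pages;
```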
I’ve documented the schema changes in dbapi.inc along with the commands needed to apply the schema to existing tables. Unless you have your own private crawl data, I recommend you start from scratch and create new tables from the new dump files.
My next step is to create charts for the newly added stats:
- Speed Index
- time to first byte
- visual complete time
- fully loaded time
- cdn (if any)
- font requests
- max # of requests on a single domain
- size of the main HTML document
- # of DOM elements
- gzip savings & missed savings
If you’re a MySQL guru, you probably cringed through much of this blog post. If you’d like to help with the MySQL code, or even just offer some advice, please contact me. For example, I just noticed that one of the tables was accidentally recreated as InnoDB (we upgraded to MySQL 5.5, where the default storage engine changed). Strangely, the same query is 4x slower on InnoDB than on MyISAM; it’s a straightforward query, no full-text search or anything exotic. This is the kind of thing where I wish I knew a MySQL guru who was somewhat familiar with the schema, so I could ask a question without a lot of setup. There’s still a lot of fun MySQL ahead.
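If you want to check your own instance for the same problem, a sketch (the database name is an assumption): list each table’s engine, then convert any stragglers back. Note the conversion rebuilds the table, so the free-disk-space caveat from earlier applies here too.

```sql
-- Which storage engine is each table using?
-- (table_schema value is an assumption about the database name)
SELECT table_name, engine
  FROM information_schema.tables
 WHERE table_schema = 'httparchive';

-- Convert a table that was accidentally created as InnoDB back to MyISAM.
-- This rebuilds the table, so make sure there is enough free disk space:
ALTER TABLE requests ENGINE = MyISAM;
```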
This is just the first of many blog posts around these recent HTTP Archive changes. I’ve got a good one coming up on document flush. I’ll also talk about each of the new stats when they’re available.