HTTP Archive: new schema & dumps

January 29, 2013 2:59 pm | 9 Comments

I spent most of the last month, including my holiday break (cue violin), implementing major changes to the HTTP Archive. Most of the externally visible UI changes aren’t there yet – I’ll be blogging about them when they’re available. The changes I worked on were primarily on the backend, to make the database faster to query, smaller on disk, easier to download, and richer in stats. This required going through every crawl since Nov 15, 2010 to import the data, massage the schema, calculate new stats, and export an updated MySQL dump.

You might ask why I had to start with importing the data. Well, that’s a funny story.

Funny Story

The HTTP Archive has 6 MyISAM tables. Most of the data is in the “pages” and “requests” tables. There are ~5.5M rows in pages, and 472M in requests – reflecting the ratio of ~85 requests per page. Keeping in mind that we’re saving all the HTTP request & response headers, it’s not surprising that the requests table is over 400G. After removing some rows and applying some schema changes to reduce the table size, I ran “optimize table requests;” to reclaim the disk space. Unbeknownst to me, MySQL tried to create a temporary copy of the (400G) table on our (660G) disk. When the disk filled up, both the original requests table and the incomplete copy were corrupted. I tried the documented recovery procedures without success. I have no other clues why this happened and assume it’s atypical.
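The lesson for anyone about to run the same command: check first whether a full temporary copy will even fit. A minimal sketch of that pre-flight check, assuming the database is named httparchive:

    SELECT table_name,
           ROUND((data_length + index_length)/1024/1024/1024, 1) AS size_gb,
           ROUND(data_free/1024/1024/1024, 1)                    AS reclaimable_gb
    FROM   information_schema.tables
    WHERE  table_schema = 'httparchive'   -- assumed database name
      AND  table_name IN ('pages', 'requests');

    -- Only run this once free disk space comfortably exceeds size_gb,
    -- since OPTIMIZE TABLE rebuilds the table as a temporary copy:
    OPTIMIZE TABLE requests;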

Nevertheless, in one evening 400G of data gathered over the last 2.5 years disappeared.

I’m being a little melodramatic – it’s not as bad as it sounds. I knew I had MySQL dumps for every crawl. The data wasn’t lost; it just meant applying my schema updates and calculating new stats would be harder, especially since I never wanted the requests table to climb over 300G again. It also meant that the script for applying these changes had to start by importing the data from the dump file – something that took 1-5 hours depending on the size of the crawl. There are 56 crawls.

After spending about a minute bemoaning my bad luck (and lack of MySQL skills), I accepted the fact that I was entering a task where it was necessary to put my head down and just push on. I learned this lesson in high school when I showed up for my first construction job. I arrived at the job site where the crew was adding a second story to a house. I knew the foreman and asked him what I should do, bubbling over with excitement. He pointed to the roof that had formerly been on the house and now lay in a giant pile next to the house. They were of equal size. “Load the roof in that dump truck and take it to the dump.” I looked at him bewildered, “But it won’t fit in the truck.” “Fill the truck as full as you can, drive it to the dump, come back and do it again. Keep doing that until it’s gone.” And you know what? Sometimes that’s what happens in life. I taught myself to never look at the size of the pile – it was too discouraging. I just kept my head down and loaded the truck – for three days – until the job was done. (Sorry for the lecture but I think that’s an important lesson.)

Pushing on

Here we are 4 weeks later and the job is done. While the update script was running I made a significant change to the web GUI – nothing on the production website accesses the requests table. This means we can remove old crawls without impacting users, enabling us to keep the requests table to a manageable size, usually just containing the last few crawls. Keeping the last few crawls online is extremely useful for answering one-off questions, like “how many sites use script X?” or “how many requests have header Y?”.
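Those one-off questions boil down to queries roughly like these – a sketch only; the column names (pageid, url, respOtherHeaders) are illustrative rather than the exact schema:

    -- "How many sites use script X?"
    SELECT COUNT(DISTINCT pageid) FROM requests
    WHERE  url LIKE '%some-library.js%';

    -- "How many requests have header Y?"
    SELECT COUNT(*) FROM requests
    WHERE  respOtherHeaders LIKE '%x-some-header%';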

Since the web GUI doesn’t require the requests table, and since restoring the dump files used to take so long, I split the dump files into two files: one for pages and another for requests. With this change it’s possible for people running private instances of the HTTP Archive code to import the smaller pages table much faster if that’s all they want. Another change is that whereas the dump files were formerly only available in MySQL format, they’re now also available as CSV. You can see the list of dump files on the HTTP Archive downloads page. All of these changes were also applied to the HTTP Archive Mobile.
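Restoring just the pages data from the mysql client looks roughly like this – a sketch with placeholder file names, not the actual download names:

    -- MySQL-format dump of just the pages table:
    SOURCE /tmp/httparchive_pages.sql;

    -- CSV flavor, assuming the pages table has already been created:
    LOAD DATA LOCAL INFILE '/tmp/httparchive_pages.csv'
    INTO TABLE pages
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n';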

I’ve documented the schema changes in dbapi.inc along with the commands needed to apply the schema to existing tables. Unless you have your own private crawl data, I recommend you start from scratch and create new tables from the new dump files.

My next step is to add charts for the new stats that were added (a sketch of the kind of query behind these charts follows the list):

  • Speed Index
  • time to first byte
  • visual complete time
  • fully loaded time
  • cdn (if any)
  • font requests
  • max # of requests on a single domain
  • size of the main HTML document
  • # of DOM elements
  • gzip savings & missed savings
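Each chart will likely boil down to a per-crawl aggregation along these lines – a rough sketch; the column names (label, SpeedIndex) are assumptions for illustration, not necessarily the final schema:

    SELECT label, ROUND(AVG(SpeedIndex)) AS avg_speed_index
    FROM   pages
    GROUP  BY label;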

MySQL Goodness

If you’re a MySQL guru you probably cringed through much of this blog post. If you’d like to help with the MySQL code, or even just give some advice, please contact me. For example, I just noticed that one of the tables was accidentally recreated as InnoDB (since we upgraded to MySQL 5.5, where the default storage engine changed). Strangely, the same query is 4x slower on InnoDB than it is on MyISAM. It’s a straightforward query – no full-text search, etc. This is the kind of thing where I wish I knew a MySQL guru who was somewhat familiar with the schema, so I could explain an issue or question without taking too much time. There’s still a lot of fun MySQL ahead.
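For the curious, here’s roughly how I’ve been poking at the InnoDB question – a sketch with assumed table and column names, not the actual ones:

    -- Which tables silently became InnoDB after the 5.5 upgrade?
    SELECT table_name, engine
    FROM   information_schema.tables
    WHERE  table_schema = 'httparchive';   -- assumed database name

    -- Convert one back (hypothetical table name):
    ALTER TABLE somestats ENGINE = MyISAM;

    -- Compare plans on both engines (illustrative query):
    EXPLAIN SELECT COUNT(*) FROM pages WHERE label = 'Jan 1 2013';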

This is just the first of many blog posts around these recent HTTP Archive changes. I’ve got a good one coming up on document flush. I’ll also talk about each of the new stats when they’re available.

9 Responses to HTTP Archive: new schema & dumps

  1. Dear Steve,

    I really like what you do and I would like to give you some advice, as I consider myself an expert in MySQL, having implemented 2-3 big systems in Greece, such as http://yperdiavgeia.gr/ and http://openarchives.gr/

    First of all, stop using the default MySQL and use another distribution. I’m using Percona (http://www.percona.com/), but there is also MariaDB and others. I’m not an employee of Percona but I have to stress that their system is really great.

    In my experience, default MyISAM is better than default InnoDB; your experience with the 4x slower query is normal.
    Nevertheless, a fine-tuned XtraDB table (the InnoDB table type created by Percona) can really outperform MyISAM.

    What is more, have a look at http://www.mysqlperformanceblog.com/ for tips, configurations of other systems, etc.

    I hope I provided you with some helpful tips.
    Regards,
    Vangelis

  2. Hi Steve,

    datasets of this size need a properly managed database. Unfortunately, this means understanding the myriad configuration options that are available. MySQL’s tendency to copy tables sucks big time. As I have previously written, having the right data types and indices is key to better performance. This requires some familiarity with EXPLAIN and also with logging statistics. In my fork I ran performance stats for all queries, which are universally faster than the default on the same machine.
    (https://bitbucket.org/charlie_x/python-httparchive/src/27b292af3928/httparchive/stats.txt?at=default)

    My local instance uses InnoDB as a backend because I like using foreign key constraints to enforce data integrity, which would have caught some of the orphaned requests. However, InnoDB must be configured to have one file per table, because otherwise the shared tablespace grows and grows and grows. The officially documented way to recover lost space is to dump, truncate and reimport. However, once I had made some recommended tweaks to the system parameters, I found InnoDB importing significantly faster than MyISAM and queries running at the same speed. I reckon data integrity is worth sacrificing some performance for.
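    For reference, the setting I mean looks like this (from my own config; adjust for your setup):

        -- In my.cnf, set before the tables are created:
        --   [mysqld]
        --   innodb_file_per_table = 1
        -- Then verify it took effect:
        SHOW VARIABLES LIKE 'innodb_file_per_table';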

    @Vangelis – personally, I much prefer Postgres because it gets so much of the default right. But that isn’t really the issue here. The dataset is now so large that the database, of whichever flavour, needs carefully setting up to perform acceptably. What are the relevant statistics (table, index size) and corresponding settings?

    Just wondering – why did you decide to go with MySQL? I really don’t want to start a DBMS debate here; I’m just curious what arguments led to this choice.

    Regards,
    Chris

    I have to give an argument in favor of using InnoDB: backups. With InnoDB you are able to do hot backups that don’t require you to lock any tables and also give you much faster restore times.

    Also, if you are looking for high availability I would agree with Vangelis above: Percona, which hooks into the InnoDB engine, has some nice features, for example their Galera implementation, which enables you to load balance several MySQL servers for both reads and writes. I am also told that their backup tool XtraBackup is nice. And it’s all free software, which the InnoDB hot backup tool apparently isn’t.

  5. Thanks for all the feedback.

    @Vangelis: I like the Percona recommendation. Peter Zaitsev spoke at the very first Velocity conference 5 years ago!

    @Charlie: I’ll email you separately with more info on the tables & indices.

    @Chris: I’ve used MySQL for a decade. My focus is more on frontend web performance so it’s hard to find time to stay current with and learn different DB technologies. MySQL has gotten the job done on all my past projects. This project is pushing the limits.

    @Micke: Backups are not an issue for me. The table that takes a long time to backup isn’t accessed by users.

    MySQLTuner is a perfect companion for tuning MySQL yourself. Of course it’s not as good as a MySQL guru, but it’s an easy-to-use tool that really helps.

    http://mysqltuner.pl

  7. Steve,

    Schema changes are so much fun – you can imagine I have similar issues with ShowSlow.

    I had to create the DBUpgrade library to make it easy for people to upgrade their database schema along with the code. Here’s the code, if you are interested: https://github.com/sergeychernyshev/dbupgrade and here’s how it’s used in ShowSlow: https://github.com/sergeychernyshev/showslow/blob/master/dbupgrade.php

    I’m sorry I don’t have much MySQL advice myself, having the same problem as you do – never had time to perfect it.

    Best,

    Sergey

  8. Hi Steve,
    Any interest in exploring context (device type, viewport sizing, user-agent) as a cross section of the data you’re collecting? With the increased adoption of responsively coded pages, it would be fascinating to take context into account w/r/t each of the metrics you’re already tracking.

    Of course, it could blow out your data set by an order of magnitude.

    Thanks
    Scott

  9. Scott: Keep in mind that the HTTP Archive is synthetic testing, *not* real users. This means the device type, viewport size, and user-agent are the same for all results.