HTTP Archive growing

November 3, 2011 2:11 pm

Today the number of URLs analyzed was doubled in both the HTTP Archive (from 17K to 34K URLs) and in the HTTP Archive Mobile (from 1K to 2K URLs).

This is a small step toward our goal of 1 million URLs, but it validates numerous code changes that landed recently:

  • 22: update URL lists – Previously the list of URLs to crawl was manually created (by me) from multiple other lists (Alexa, Quantcast, Fortune 500, etc.). Because it was manually created it wasn’t updated frequently. Now the list is based on the Alexa Top 1,000,000 Sites and is updated every crawl.
  • 243: handle non-ranked URLs – Some of the URLs crawled up until now are NOT in the Alexa Top 1M. In order to support looking at long-term trends (by selecting “intersection”) I wanted to continue crawling these outliers. So the crawl list now supports non-ranked websites. This will allow many other nice features that you’ll hear about next week.
  • 242: rewrite batch_process.php – A lot of the crawl code needed to be made more efficient as we scale up by two orders of magnitude.
  • 68: cache aggregate stats for trends.php – Again, in order to deal with a larger number of URLs and still generate charts quickly, I introduced a caching layer for the aggregate stats.
  • 196: Publish a mysql schema dump – Exploring the data is now easier. Instead of having to set up an entire instance of the code, you simply create the tables based on the schema dump and download the data that is of interest.
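
To make the URL list changes (22 and 243) a bit more concrete, here is a minimal sketch of how a ranked list and a set of previously crawled, non-ranked URLs might be merged into one crawl list. This is not the HTTP Archive code (which is PHP); the function name, the rank-or-None convention, and the example URLs are assumptions made purely for illustration.

```python
def build_crawl_list(ranked_urls, previously_crawled):
    """Merge a ranked URL list with URLs from earlier crawls.

    ranked_urls: URLs ordered by rank, best first (hypothetical input,
        e.g. parsed from a top-sites list).
    previously_crawled: URLs from earlier crawls that should keep being
        crawled even if they are no longer ranked.

    Returns a list of (url, rank) pairs; rank is None for non-ranked URLs.
    """
    crawl_list = [(url, rank) for rank, url in enumerate(ranked_urls, start=1)]
    ranked = set(ranked_urls)
    seen = set()
    for url in previously_crawled:
        # Keep legacy URLs that fell out of the ranked list, once each.
        if url not in ranked and url not in seen:
            crawl_list.append((url, None))
            seen.add(url)
    return crawl_list


# Example with made-up URLs.
ranked = ["http://www.google.com/", "http://www.facebook.com/"]
legacy = ["http://www.facebook.com/", "http://example.org/"]
crawl_list = build_crawl_list(ranked, legacy)
```

Keeping the non-ranked URLs in the same list, marked with an empty rank, means the set of URLs shared across crawls does not shrink just because a site dropped out of the ranking.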

With these and other changes behind us, we’ll continue to increase the number of URLs to reach our goal. There are still some big tasks to tackle including changing the DB schema, increasing the capacity on mobile with more devices or switching to an emulator, and combining these two sites into a single site for easier comparison of desktop & mobile data.

No blog post about HTTP Archive would be complete without some observations. As mentioned earlier, whenever looking at long-term trends I choose the intersection, which means the exact same URLs are included in every data point.
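
For readers who want to reproduce the intersection idea on downloaded data, the following is a rough sketch under stated assumptions: each crawl is represented as a mapping from URL to total bytes, and the trend is the average over only those URLs present in every crawl. The function names, crawl labels, and numbers are hypothetical and are not taken from the HTTP Archive code or data.

```python
def common_urls(crawls):
    """Return the set of URLs that appear in every crawl.

    crawls: dict mapping a crawl label to a dict of {url: total_bytes}.
    """
    url_sets = [set(pages) for pages in crawls.values()]
    return set.intersection(*url_sets) if url_sets else set()


def intersection_trend(crawls):
    """Average total bytes per crawl, restricted to the common URLs."""
    common = common_urls(crawls)
    if not common:
        return {}
    return {
        label: sum(pages[url] for url in common) / len(common)
        for label, pages in crawls.items()
    }


# Made-up numbers: URL "c" is missing from the second crawl, so it is
# excluded from both data points.
crawls = {
    "crawl_1": {"a": 100, "b": 300, "c": 900},
    "crawl_2": {"a": 200, "b": 400},
}
trend = intersection_trend(crawls)  # {"crawl_1": 200.0, "crawl_2": 300.0}
```

Without the restriction, the first data point would be pulled up by "c" and the comparison would mix a change in the URL set with a change in the pages themselves.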

The main trend I’ve been noticing is how the size of resources is growing much faster than the number of resources. This growth is most evident in scripts and images. It’s no surprise – the Web is getting bigger. But now we can see where that’s happening and explore solutions.

I also want to give a shout-out to Pat Meenan and Guy (“Guypo”) Podjarny. Pat works at Google and is the creator of WebPagetest, which is the foundation for the HTTP Archive (Mobile). Guypo works at Blaze and provides additional infrastructure and devices for all the mobile testing. In addition, there is a growing number of contributors to the open source project. And none of this would be happening without support from our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, and dynaTrace Software.

Watch for a fun announcement next week.
