Add your site & custom fonts

November 17, 2011 7:45 pm | 2 Comments

The Nov 15 2011 crawls for the HTTP Archive and HTTP Archive Mobile are done. Two new things were added.

Add your site

Our goal is to crawl the world’s top 1,000,000 URLs. This month we doubled the number of URLs from 17K to 35K. We’re still a long way off, but we’re making progress. But what if you’d like your website to be in the HTTP Archive and it isn’t in the top 1M?

Now you can add your site to the HTTP Archive. If it’s already in the list we’ll tell you and point you to any data that’s been gathered so far. If it’s not in the list we’ll queue it up for the next crawl. We moderate all additions to make sure the URL is valid. We also have a limit of 1 URL per website. We strive to crawl a site’s main URL (e.g., http://stevesouders.com/) but not all the subpages within a site (http://stevesouders.com/about.php, http://www.example.com/videos.php, etc.).
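The "main URL only, one URL per website" rule can be sketched roughly as follows. This is an illustration in Python, not the HTTP Archive’s actual code: the helper names (`main_url`, `site_key`, `add_site`), the in-memory dictionaries, and the choice to treat `www.example.com` and `example.com` as the same site are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def main_url(submitted: str) -> str:
    """Reduce a submitted URL to the site's main URL (scheme + host + "/")."""
    parts = urlsplit(submitted.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a valid http(s) URL: {submitted!r}")
    # Path, query, fragment, and port are dropped: only the main page is kept.
    return f"{parts.scheme}://{parts.hostname}/"


def site_key(url: str) -> str:
    """Hypothetical per-website key: the hostname without a leading "www."."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def add_site(submitted: str, crawl_list: dict, pending: dict):
    """Return a (status, url) pair for a submitted URL.

    crawl_list and pending both map a site key to that site's main URL.
    """
    url = main_url(submitted)
    key = site_key(url)
    if key in crawl_list:
        return "already listed", crawl_list[key]
    if key in pending:
        return "already queued", pending[key]
    pending[key] = url  # a moderator would still review this before the crawl
    return "queued for moderation", url
```

With a crawl list that already contains `stevesouders.com`, submitting `http://stevesouders.com/about.php` reports the existing entry, while `http://www.example.com/videos.php` is queued as `http://www.example.com/`.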

Custom Fonts

I’ve been thinking more about custom fonts after Typekit’s acquisition by Adobe and seeing Jeff Veen at Velocity Europe. (Make sure to watch the video of Jeff’s talk – it’s an amazing presentation with a humorous start.) So this week I added a chart to track the adoption of custom fonts.

Typekit is clearly on to something – the use of custom fonts has tripled in one year. I warn against using @font-face for performance reasons, but performance isn’t all that matters. (Gasp!) Custom fonts obviously have aesthetic benefits that are attractive to website owners.

Fortunately, Typekit has several performance optimizations in how they load fonts. They combine all the fonts in a single stylesheet for browsers that support data: URIs. The fonts are served over a CDN. The fonts are only cacheable for 5 minutes, which hurts repeat visits, but I believe they’re working on longer cache times.
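To make the data: URI idea concrete, here is a minimal Python sketch that inlines several fonts into one stylesheet so the browser needs a single request for all of them. It is not Typekit’s implementation: the function names, the `font/woff` MIME type, and the WOFF format hint are assumptions chosen for the example.

```python
import base64
from typing import Mapping


def font_face_rule(family: str, font_bytes: bytes, mime: str = "font/woff") -> str:
    """Build one @font-face rule with the font inlined as a base64 data: URI."""
    encoded = base64.b64encode(font_bytes).decode("ascii")
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f'  src: url("data:{mime};base64,{encoded}") format("woff");\n'
        "}"
    )


def combined_stylesheet(fonts: Mapping[str, bytes]) -> str:
    """Concatenate one @font-face rule per font into a single stylesheet."""
    return "\n".join(font_face_rule(name, data) for name, data in fonts.items())
```

The tradeoff is that base64 encoding makes each font roughly a third larger, and the fonts share the cache lifetime of the stylesheet that carries them.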

For truly fast and robust font loading we need to lean on browser developers to implement better caching for fonts and better timeout choices during rendering. I’ll be talking about this during my High Performance HTML5 session at QCon on Friday.


HTTP Archive growing

November 3, 2011 2:11 pm

Today the number of URLs analyzed was doubled in both the HTTP Archive (from 17K to 34K URLs) and in the HTTP Archive Mobile (from 1K to 2K URLs).

This is a small step toward our goal of 1 million URLs, but it validates numerous code changes that landed recently:

  • #22: update URL lists – Previously the list of URLs to crawl was manually created (by me) from multiple other lists (Alexa, Quantcast, Fortune 500, etc.). Because it was manually created it wasn’t updated frequently. Now the list is based on the Alexa Top 1,000,000 Sites and is updated every crawl.
  • #243: handle non-ranked URLs – Some of the URLs crawled up until now are NOT in the Alexa Top 1M. In order to support looking at long term trends (by selecting “intersection”) I wanted to continue crawling these outliers. So the crawl list now supports non-ranked websites. This will allow many other nice features that you’ll hear about next week.
  • #242: rewrite batch_process.php – There’s a bunch of code for doing the crawl that needed to be made more efficient as we scale up by two orders of magnitude.
  • #68: cache aggregate stats for trends.php – Again, in order to deal with a larger number of URLs and still generate charts quickly, I introduced a caching layer for the aggregate stats.
  • #196: Publish a mysql schema dump – Exploring the data is now easier. Instead of having to set up an entire instance of the code, you simply create the tables based on the schema dump and download the data that is of interest.
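The caching layer for aggregate stats mentioned in the list above follows a common pattern: compute the expensive per-crawl averages once, then serve later chart requests from the stored result. The sketch below shows that pattern in Python rather than the project’s PHP. The class name, the `load_pages` callable, and the `bytes` and `requests` fields are hypothetical and do not reflect the real schema.

```python
from statistics import mean


class AggregateStatsCache:
    """Compute per-crawl aggregate stats once and reuse them afterwards."""

    def __init__(self, load_pages):
        # load_pages is a hypothetical callable: crawl label -> list of page
        # dicts, each with "bytes" and "requests" keys (the slow query).
        self._load_pages = load_pages
        self._cache = {}

    def get(self, crawl_label):
        """Return the stats for a crawl, running the slow query only on a miss."""
        if crawl_label not in self._cache:
            pages = self._load_pages(crawl_label)
            self._cache[crawl_label] = {
                "num_urls": len(pages),
                "avg_bytes": mean(p["bytes"] for p in pages),
                "avg_requests": mean(p["requests"] for p in pages),
            }
        return self._cache[crawl_label]
```

Because a finished crawl does not change, its entry never needs to be invalidated, which is what makes this kind of cache simple. A persistent store would replace the in-memory dictionary in practice.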

With these and other changes behind us, we’ll continue to increase the number of URLs to reach our goal. There are still some big tasks to tackle including changing the DB schema, increasing the capacity on mobile with more devices or switching to an emulator, and combining these two sites into a single site for easier comparison of desktop & mobile data.

No blog post about HTTP Archive would be complete without some observations. As mentioned earlier, whenever looking at long term trends I choose the intersection – which means the exact same URLs are included in every data point.
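The intersection is just the set of URLs present in every crawl being compared. A small Python sketch, assuming each crawl is available as a collection of URLs keyed by a crawl label (the function name and data layout are made up for the example):

```python
def intersection_urls(crawls):
    """Return the URLs that appear in every crawl.

    crawls maps a crawl label to an iterable of the URLs analyzed in it.
    """
    url_sets = [set(urls) for urls in crawls.values()]
    return set.intersection(*url_sets) if url_sets else set()
```

Restricting each data point to this set means a change in a trend line reflects changes in the same websites, not a change in which websites were measured.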

The main trend I’ve been noticing is how the size of resources is growing much faster than the number of resources. This growth is most evident in scripts and images. It’s no surprise – the Web is getting bigger. But now we can see where that’s happening and explore solutions.

I also wanted to shout out to Pat Meenan and Guy (“Guypo”) Podjarny. Pat works at Google and is the creator of WebPagetest, which is the foundation for the HTTP Archive (Mobile). Guypo works at Blaze and provides additional infrastructure and devices for all the mobile testing. In addition, there are a growing number of contributors to the open source project. And none of this would be happening without support from our sponsors: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, and dynaTrace Software.

Watch for a fun announcement next week.
