Loadtimer: a mobile test harness

December 1, 2011 1:35 am | 14 Comments

Measuring mobile performance is hard

When Amazon announced their Silk browser I got excited reading about the “split architecture”. I’m not big on ereaders but I pre-ordered my Kindle Fire that day. It arrived a week or two ago. I’ve been playing with it trying to find a scientific way to measure page load times for various websites. It’s not easy.

Since it’s a new browser and runs on a tablet we don’t have plugins like Firebug.
It doesn’t (yet) support the Navigation Timing spec, so even though I can inspect pages using Firebug Lite (via the Mobile Perf bookmarklet) and Weinre (I haven’t tried it but I assume it works), there’s no page load time value to extract.
Connecting my Fire to a wifi hotspot on my laptop running tcpdump (the technique evangelized by pcapperf) doesn’t work in accelerated mode because Silk uses SPDY over SSL. This technique works when acceleration is turned off, but I want to see the performance optimizations.

While I was poking at this problem a bunch of Kindle Fire reviews came out. Most of them talked about the performance of Silk, but I was disappointed by the lack of scientific rigor in the testing. Instead of data there were subjective statements like “the iPad took about half as long [compared to Silk]” and “the Fire routinely got beat in rendering pages but often not by much”. Most of the articles did not include a description of the test procedures. I contacted one of the authors who confided that they used a stopwatch to measure page load times.

If we’re going to critique Silk and compare its performance to other browsers we need reproducible, unbiased techniques for testing performance. Using a stopwatch or loading pages side-by-side and doing a visual comparison to determine which is faster are not reliable methods for measuring performance. We need better tools.

Introducing Loadtimer

Anyone doing mobile web development knows that dev tools for mobile are lacking. Firebug came out in 2006. We’re getting close to having that kind of functionality in mobile browsers using remote debuggers, but it’s pretty safe to say the state of mobile dev tools is 3-5 years behind desktop tools. It might not be sexy, but there’s a lot to be gained from taking tools and techniques that worked on the desktop and moving them to mobile.

In that vein I’ve been working the last few days to build an iframe-based test harness similar to one I built back in 2003. I call it Loadtimer. (I was shocked to see this domain was available – that’s a first.) Here’s a screenshot:

The way it works is straightforward:

It’s preloaded with a list of popular URLs. The list of URLs can be modified.
The URLs are loaded one-at-a-time into the iframe lower in the page.
The iframe’s onload time is measured and displayed on the right next to each URL.
If you check “record load times” the page load time is beaconed to the specified URL. The beacon URL defaults to point to loadtimer.org, but you can modify it if, for example, you’re testing some private pages and want the results to go to your own server.
You can’t test websites that have “framebusting” code that prevents them from being loaded in an iframe, such as Google, YouTube, Twitter, and NYTimes.
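
The measurement loop described above can be sketched roughly as follows. This is a minimal TypeScript sketch, not Loadtimer's actual source: the names `timeLoad`, `Loader`, and `buildBeaconUrl`, and the beacon query parameters `url` and `ms`, are assumptions made for illustration. The loader and the clock are passed in so the timing logic can be exercised without a browser.

```typescript
// A Loader starts loading `url` and calls `onLoaded` once the load finishes.
// In a browser this could wrap an iframe (hypothetical adapter):
//   const iframeLoader: Loader = (url, onLoaded) => {
//     iframe.onload = () => onLoaded();
//     iframe.src = url;
//   };
type Loader = (url: string, onLoaded: () => void) => void;

// Measure the time between starting a load and its onload callback.
function timeLoad(
  url: string,
  load: Loader,
  now: () => number,
  done: (elapsedMs: number) => void
): void {
  const start = now();
  load(url, () => done(now() - start));
}

// Build a beacon URL carrying one measurement.
// The parameter names `url` and `ms` are hypothetical, not Loadtimer's.
function buildBeaconUrl(
  beaconBase: string,
  pageUrl: string,
  loadTimeMs: number
): string {
  const separator = beaconBase.indexOf("?") === -1 ? "?" : "&";
  return (
    beaconBase + separator +
    "url=" + encodeURIComponent(pageUrl) +
    "&ms=" + Math.round(loadTimeMs)
  );
}
```

In a browser, `now` could be `() => Date.now()`, and the beacon could be sent by assigning the built URL to the `src` of a `new Image()`.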

There are some subtle optimizations worth noting:

You should clear the cache between each run (unless you explicitly want to test the primed cache experience). There’s no way for the test harness to clear the cache, but it does have a check that helps remind you to clear the cache. (It loads a script that is known to take 3 seconds to load – if it takes less than 3 seconds it means the cache wasn’t cleared.)
It’s possible that URL 1’s unload time could inflate URL 2’s measured onload time. To avoid this, about:blank is loaded between URLs.
The order of the preset URLs is randomized to mitigate biases across URLs, for example, where URL 1 loads resources used by URL 2.
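
The run-order and cache-check logic above might look something like this. Again, this is a sketch under assumptions rather than Loadtimer's code: `shuffle`, `withBlankPages`, and `cacheLooksCleared` are hypothetical names, and the 3000 ms default simply restates the 3-second probe script described above.

```typescript
// Fisher-Yates shuffle on a copy, so the preset list itself is not mutated.
// `random` is injectable to make runs reproducible in tests.
function shuffle<T>(items: T[], random: () => number = Math.random): T[] {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Insert about:blank between consecutive URLs so one page's unload
// work is not counted against the next page's onload time.
function withBlankPages(urls: string[]): string[] {
  const out: string[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      out.push("about:blank");
    }
    out.push(urls[i]);
  }
  return out;
}

// The probe script is known to take `probeDelayMs` over the network.
// If it came back faster than that, it was served from cache.
function cacheLooksCleared(probeLoadMs: number, probeDelayMs: number = 3000): boolean {
  return probeLoadMs >= probeDelayMs;
}
```

A run would then load each entry of `withBlankPages(shuffle(presetUrls))` in turn, where `presetUrls` is whatever list the tester has configured, and skip timing for the about:blank entries.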

Two biases that aren’t addressed by Loadtimer:

DNS resolutions aren’t cleared. I don’t think there’s a way to do this on mobile devices short of power cycling. This could be a significant issue when comparing Silk with acceleration on and off. When acceleration is on there’s only one DNS lookup, whereas when acceleration is off there’s a DNS lookup for each hostname in the page (13 domains per page on average). Having the DNS resolutions cached gives an advantage to acceleration being off.
Favicons aren’t loaded for websites in iframes. This probably has a negligible impact on page load times.

Have at it

The nice thing about the Loadtimer test harness is that it’s web-based – nothing to install. This ensures it’ll work on all mobile phones and tablets that support JavaScript. The code is open source. There’s a forum for questions and discussions.

There’s also a results page. If you select the “record load times” checkbox you’ll be helping out by contributing to the crowdsourced data that’s being gathered. Getting back to what started all of this, I’ve also been using Loadtimer the last few days to compare the performance of Silk to other tablets. Those results are the topic of my next blog post – see you there.

14 Responses to Loadtimer: a mobile test harness

Guypo | 01-Dec-11 at 6:04 am | Permalink |

Very cool – simple and powerful. The type of tool that makes you wonder why nobody wrote it before ;)

That said, one more bias to beware of: Shared 3rd party components across the URLs. If the first URL downloads ga.js, and the second URL uses it too, it will already be in the cache.
Patrick Mueller | 01-Dec-11 at 6:57 am | Permalink |

So, if any of those sites use appcache for caching data, is there a recommended way (per device/os version/etc) to clear that? Or maybe it’s good to not have to clear it, with the implication that the first time you load the page, you’re willing to absorb a perf hit (if any) and it’s subsequent load times that are more important in the long run.

BTW, have you seen http://www.iwebinspector.com/ ?
Steve Souders | 01-Dec-11 at 7:57 am | Permalink |

@Guypo: It is possible that URL 1 loads resources used by URL 2. To mitigate that issue the order of the URLs is randomized. This doesn’t make the issue go away, but if you do enough tests the bias should be distributed.

@Patrick: I have high hopes for @firt’s iWebInspector, but currently it doesn’t run on devices.
jpvincent | 01-Dec-11 at 1:22 pm | Permalink |

hey
nice initiative
Is it just me, or do the current results show almost no difference on the Silk tablet between the accelerated and non-accelerated modes?
I just tested with a Samsung tablet with Windows 8 and IE10, and the results don’t show up on your results page. Do you edit it manually?
Joe | 01-Dec-11 at 1:57 pm | Permalink |

iPad Safari is known to fire its onload event prior to the typical convention of loading, rendering, laying out, and reflowing.

http://www.howtocreate.co.uk/safaribenchmarks.html

This is the start of a great tool, though. Thanks for posting!
Steve Souders | 01-Dec-11 at 2:18 pm | Permalink |

@jpvincent: Please see the next blog post (“Silk, iPad, Galaxy comparison“) for a review of Silk with acceleration off and on.

@Joe: I read that blog post but I don’t see any examples of a page that produces the error described. All the pages I tested produced an accurate onload time. Please send a URL that produces an inaccurate onload time. Otherwise, I think that post is wrong.
Jason Hofmann | 01-Dec-11 at 2:26 pm | Permalink |

Steve,

Nice development work, but I fundamentally disagree with the premise that using a stopwatch is unscientific. In my opinion, it is the only currently available way to consistently and reliably measure an end user’s perception of page load time on their specific device and network. Additionally, the onload event is hardly ever representative of the load time of the page, much less the viewport (the initially visible part of the page).

In the interest of full disclosure, I do work for a company that makes a Front End Optimization solution.

The importance of measuring Perceived Performance, Time-To-Action or Above-The-Fold rendering time is gaining traction. In fact, WebPageTest has introduced an experimental feature called “Measure Above-the-fold rendering time (AFT)”, but because it compares frames of the video looking for changes larger than x pixels, it is easily confused by large animations and image scrollers, design elements prevalent on home pages.

I look forward to your comments and thoughts.

Best Regards,

Jason Hofmann
Steve Souders | 01-Dec-11 at 2:39 pm | Permalink |

@Jason: A human clicking a stopwatch is neither consistent nor reliable. Human reaction times are too slow to measure performance at this level. You do have a valid point debating what is the right thing to measure – onload, AFT, etc. I’m familiar with those. AFT was developed at Google. I advised the work and did the initial integration with WebPagetest. ;-) I’m not aware of the issues you mention – the major work in AFT was to address animated GIFs, rotating ads, streaming video, etc. The primary issue with AFT as an alternative to onload is that it’s not currently feasible to measure AFT outside of a synthetic lab (like WebPagetest). Since none of the labs have the suite of tablets as test agents, Loadtimer is the best alternative. So I disagree with you about stopwatches, but I do agree that finding a better proxy than onload to capture the user’s perception of page loading is a hot topic. We’re making progress on that and I expect we’ll see good, easier solutions in the next year. Perhaps your company has one. ;-)
Steve Souders | 01-Dec-11 at 4:09 pm | Permalink |

@jpvincent: I missed your 2nd question about why you’re not seeing your test results from “Samsung tablet with Windows 8 and IE10”. To see the public data you have to select the “all data” radio button at the bottom of the results page. By default it just shows the data from the blog post’s testing.
Fazal Majid | 01-Dec-11 at 9:46 pm | Permalink |

One way to clear the DNS cache would be to load a page that has a bunch of images served off a wildcard DNS domain. If the DNS cache size is 1000 entries, the page should load 1000 images from foo1.example.com, foo2.example.com and so on (or 100 images that each redirect you 10 times to 10 different domains).

That way you would evict the previous test run’s DNS entry from the device cache.
Yanyun | 04-Dec-11 at 7:02 am | Permalink |

It’s a great tool for mobile testing of a webpage.

I am astonished by the data from loadtimer.org.
It seems mobile visitors have much more patience than PC visitors because the mobile loading time is almost 5~10 times longer.
Pavel Paulau | 05-Dec-11 at 4:45 am | Permalink |

What is the difference between Loadtimer and WebWait? I mean regarding technical implementation.
Steve Souders | 05-Dec-11 at 11:12 pm | Permalink |

@Pavel: It’s the same idea – tracking onload time of an iframe. The main differences are features – multiple URLs, beaconing the results, etc.
Joe | 28-Dec-11 at 12:16 pm | Permalink |

I tried this, then cleared the browser cache, and it still complained that the cache had not been cleared. I think the problem is that there is a proxy between me and the net, so your script got cached by the proxy, which I can’t clear, and then the check would not pass in any browser. The good news is that overriding your function in Firebug fixed that, but who knows how accurate the results were, because at that point I was measuring proxied data.

SteveSouders.com
