User Agents in the morning

January 18, 2009 5:25 pm | 17 Comments

Every working day, a script runs at 7am that opens ~20 websites in my browser. I open them at 7am so that they’re ready for me when I sit down with my coffee. I’m the performance guy – I can’t stand waiting for a page to load. Among the sites that I read every day are blogs (Ajaxian, O’Reilly Radar, Google Reader for the rest), news sites (MarketWatch, CNET Tech News, InternetNews, TheStreet.com), and stuff for fun and life (Dilbert, Woot, The Big Picture, Netflix).

The last site is a page related to UA Profiler. It lists all the new user agents that have been tested in the last day. These are unique user agents – they’ve never been seen by UA Profiler before. When I first launched UA Profiler, there were about 50 each day. Now, it’s down to about 20 per day. But I’ve skipped over the main point.

Why do I review these new user agents every morning?

When I started UA Profiler, I assumed I would be able to find a library to accurately parse the HTTP User-Agent string into its components. I need this in order to categorize the test results. Was the test done with Safari or iPhone? Internet Explorer or Maxthon? NetNewsWire or OmniWeb? My search produced some candidates, but none had the level of accuracy I wanted: they failed to properly classify edge-case browsers, mobile devices, and new browsers (like Chrome and Android).

So, I rolled my own.

I find that it’s very accurate – more accurate than anything else I could find. Another good site out there is UserAgentString.com, but even they misclassify some well-known browsers such as iPhone, Shiretoko, and Lunascape. When I do my daily checks I find that I need to tweak my code about once every 200-400 new user agents. I’ve written some good admin tools to do this check, so it only takes 5 minutes to complete. The code tweaks, when necessary, take less than 15 minutes.
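To show why this kind of classification needs regular tweaking, here is a minimal Python sketch of an ordered-rule parser. It is not the UA Profiler code: the rule list, the patterns, and the names ParsedUA, RULES, and parse_user_agent are all assumptions made for illustration. The point it demonstrates is that rule order matters, because a Chrome or iPhone string also contains the "Safari" token and a Maxthon string also contains "MSIE".

```python
import re
from typing import NamedTuple, Optional

# Dotted version number such as 3.1.1 (trailing tags like "a1pre" are dropped).
VERSION = r"(\d+(?:\.\d+)*)"


class ParsedUA(NamedTuple):
    browser: str
    version: Optional[str]


# Hypothetical rules, checked in order. More specific browsers come first
# because their strings also contain tokens of the browsers they build on.
RULES = [
    ("Maxthon", re.compile(r"Maxthon(?:[ /]" + VERSION + r")?")),
    ("Chrome", re.compile(r"Chrome/" + VERSION)),
    ("iPhone", re.compile(r"iPhone.*?Version/" + VERSION)),
    ("Safari", re.compile(r"Version/" + VERSION + r".*Safari/")),
    ("Minefield", re.compile(r"Minefield/" + VERSION)),
    ("Firefox", re.compile(r"Firefox/" + VERSION)),
    ("IE", re.compile(r"MSIE " + VERSION)),
]


def parse_user_agent(ua: str) -> ParsedUA:
    """Return the first rule that matches, or ("Other", None)."""
    for name, pattern in RULES:
        match = pattern.search(ua)
        if match:
            # group(1) is None when the optional version did not match.
            return ParsedUA(name, match.group(1))
    return ParsedUA("Other", None)


if __name__ == "__main__":
    chrome = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) "
              "AppleWebKit/525.19 (KHTML, like Gecko) "
              "Chrome/1.0.154.43 Safari/525.19")
    print(parse_user_agent(chrome))
    # ParsedUA(browser='Chrome', version='1.0.154.43')
```

Every new browser that borrows an existing token means adding a rule above the one it would otherwise fall into, which is the sort of small tweak described above.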

It’s great that this helps UA Profiler, but I’d really like to share this with the web community. The first step was adding a new Parse User-Agent page to UA Profiler. You can paste any User-Agent string and see how my code classifies it. I also show the results from UserAgentString.com for comparison. The next steps, if there’s interest and I can find the time, would be to make this available as a web service and make the code available, too. What do people think?

  • Do other people share this need for better User Agent parsing?
  • Do you know of something good that’s out there that I missed?
  • Do you see gaps or mistakes in UA Profiler’s parsing?

For now, I’ll keep classifying user agents as I finish the last drops of my (first) coffee in the morning.

17 Responses to User Agents in the morning

  1. Steve, have you thought about testing your library against the browsers listed in the browscap.ini file that is part of Gary Keith’s Browser Capabilities Project? (http://browsers.garykeith.com/) As far as I know it’s the largest publicly available database classifying User-Agent strings according to the browser type, name, version and capabilities.

  2. I visited the parse page with a nightly build of Firefox (Gecko/20090118 Minefield 3.2a1pre). The parser returned a browser of Minefield and a version of 3.2. Is this your desired behavior, or are you hoping to connect Firefox, Shiretoko, Minefield, etc.?

    I presume your current code is largely regex-based? If you would like to send me what you have I can take a look at putting a service up on App Engine.

  3. @Dan: I checked several UA strings against Gary Keith’s User Agent Checker. It got many of them, but I noticed some mistakes. It missed some of the smaller browsers (Shiretoko, Iron, Lunascape), and missed the version #s on Android and iPhone. I used the UA strings from the examples in my parse.php page.

    @Niall: Hi, Niall! Yes, I thought it would be best to differentiate Minefield from Firefox. I presumed that Minefield 3.1 might be different from Firefox 3.1. But I’m open to suggestions on this. App Engine would be good to avoid any load on my hosted servers. I’ll ping you. Thanks.

  4. Awesome work my man! Kudos. Have you thought about getting requests from Browsershots.org? They have a varied range of browsers / systems cause it’s user contributed.

  5. nice! Add to http://wurfl.sourceforge.net/ ??

  6. My gut is to lean towards a library, but I can see how a web service would be able to handle frequent updates better and free users from implementation details. (Is it written in Java, Python, .NET, etc…?)

    A Python package on PyPI would be easy to update frequently, at least….

  7. Steve, do you think identifying browsers only by their name and version is enough, or does it make sense to account for nightly builds and other variations (e.g. special extensions that affect performance)?

  8. Just curious about your thoughts on jQuery 1.3’s recent decision to deprecate its browser sniffing logic?

    http://docs.jquery.com/Release:jQuery_1.3#Upgrading
    (scroll up a little bit)

    “As of 1.3, jQuery no longer uses any form of browser/userAgent sniffing internally – and is the first major JavaScript library to do so. Browser sniffing is a technique in which you make assumptions about how a piece of code will work in the future. Generally this means making an assumption that a specific browser bug will always be there – which frequently leads to code breaking when browsers make changes and fix bugs.”

  9. Steve, I work for Mozilla Corporation doing data mining, ETL, and business intelligence. A lot of the work I do involves parsing the tremendous amounts of web log files we accumulate, and recently I spent a large chunk of time looking around for a really good UA parser myself. I didn’t find one, so like you, I built it myself.

    It outputs 9 different attributes about a UA string:
    os_category (Windows; Macintosh; *nix; Mobile; Other)
    os (PPC Mac; Windows CE; FreeBSD; Vista/Server2008)
    platform (Windows; Linux; Solaris; Symbian; iPhone)
    browser_category (Gecko; MSIE; WebKit; Opera; Other)
    browser_name (Firefox; Lynx; Safari; Fennec)
    browser_version
    engine_1 (Gecko version; WebKit version; Opera Mini version; Opera Cloaking status)
    engine_2 (Gecko build number; Safari build number; Opera cloak type; misc info from mobile browsers)

    Mine is based on the ETL tool I use, Pentaho Data Integration (Kettle). It is a pretty large transformation, and while I am using it in production, it isn’t quite where I want it to be. That said, I’m still really happy with where it is at the moment.

    I’ve been working on releasing my UA parser as open source, and I’d love to collaborate with you on your work. If that sounds good to you, please contact me.

  10. I would love to see this put up (open-source) where it can be made better. What about GitHub?

  11. Hmm, I’m also wondering if you’re going to release your sniffing code integrated with profiler data as some kind of “front-end capabilities” library to automate browser specific optimizations?

    Users could make calls like (PHP):

    <img src="http://cdn<?= connsPerHost() ?>.mycdn.net/path/to/asset.gif">

    or something like this…

  12. Obviously the folks at BrowserHawk have something like this to a degree. You might reach out to them to see if they have interest in sharing.

    For our own use in a number of areas we have addressed this problem and, like you, found first that there is no obvious good solution out there, and second that it quickly becomes problematic with the variations that exist. Once you start looking at pure logs versus those who execute JS you can see a tremendous number of crazy bot user-agent values. The more traffic you have the worse it gets. You might look at some of the folks in the bot control and SEO arena and see them fighting a similar battle.

    Summary: Yes, make this open source. You have people who want it, and many will help improve it.

  13. I would also love to see the code for this UA profiler. I am currently trying to write something that will allow me to determine whether my site is being accessed by a standard browser, or if it is being accessed by one of the more restricted mobile device browsers, such as the ones used on older BlackBerries.

  14. Please please please release this! I have a hacked together browser-detection module, but it’s certainly not up-to-date and doesn’t encompass any of the obscure browsers.

  15. So I just took the test and it reported I was using Firefox. I am not. I am using Opera 10.

    I retook it with IE using a toolbar – it reported me as Firefox again.

    I switched my UA and retook the test and it finally got the UA right at least once in Opera.

    So basically this test was wrong 66% of the time for me.

    It seems pretty common for browsers, toolbars, and other utilities to switch the UA.

    But let’s just do like other UA checkers and stick our heads in the sand and pretend they don’t exist. That way, we can publish pretty graphs and results on our blogs and claim they are accurate. No one is ever going to complain or come up with a reasonable fact checker. Ya, that’s the ticket…a UA profiler.

    Next week kids, we take on Big Foot, the Loch Ness monster, and how CSS can speed up your website…ya, that’s the ticket.

  16. Hi Steve,

    I’ve been working on the same thing for something here at Yahoo! — I’m sure you know what. Perhaps if you put your code on github, I could get permission to contribute to it.

    My current code is in Perl, and chances are that to get permission it would have to be the same, but I’m cool with PHP, Python and C++ as well.

  17. If the license permits I’ll add this to Open Web Analytics (http://www.openwebanalytics.com). A good library would be a cleaner alternative to downloading Gary’s file.