Sunday, March 31, 2013

Using Log Parser to Extract User Agents From Web Log Files

I've been meaning to write a follow-up (or two...) to my I LOVE LogParser post from a few months ago. The time has finally arrived.

Every day I collect and analyze (at least at a high level) somewhere around 1/2 a billion (yes, billion) web server log lines. And about that many more from a couple CDNs. Needless to say that's quite a bit of information. Besides the mundane stuff like number of hits, bandwidth served, hits per file type, etc. I've recently buckled down and written a few pretty good scripts with LogParser to extract and count User Agents.

I know even this isn't all that sexy or sophisticated, and that numerous companies selling analytic have already solved this, but, since I have the data right at my finger tips why should I pay someone else tens of thousands of dollars to analyze my logs and give me a pretty web interface. Yeah, I'll admit that would be nice, but for what I'm after I'm able to get what I need with just a little elbow grease.

This pursuit actually began several months ago when my boss came to me and asked how many and what types of Android devices were hitting our services. Since our product runs on numerous sites around the web we get all kinds of traffic. And, of course, many people (our partners, etc.) all say your product has to run on this Android device, or that Android device. But with so many out there all running so many different OS versions it's crazy. This image (from this phandroid.com article) shows it quite well.

Figure 1 - Android Fragmentation.
At this point I must give credit where credit is due. The LogParser queries below are by no means unique, nor are they original. The best one I found was a little dated. So I took it, made it a little prettier, and adapted it for three distinct outputs. First, is my daily User Agents Summary report (below). This is a high level report showing us the type and distribution of browsers, which are hitting our sites. While others publish  similar information regularly this applies directly to us, to our products, which gives us good, reliable information we can both use to convey to our partners, but also so we know where to focus time and energy on development and QA resources.

The numbers in this summary report and others in this post come from a single web server (I have lots more) for one day (March 28, 2013 UTC). So, this is current as of this posting. (See below for the exact LogParser query I'm using for this summary report.)

UserAgent UAHits UAPercent
Internet Explorer 2,611,065 38.03%
Chrome 1,322,187 19.26%
Firefox 1,241,372 18.08%
Apple IOS 775,693 11.30%
Safari 620,618 9.04%
Android 214,479 3.12%
Other Mobile 25,171 0.37%
Opera 19,262 0.28%
Other User Agents 15,342 0.22%
IEMobile 6,188 0.09%
No User Agent 5,061 0.07%
BlackBerry 4,481 0.07%
Search Bot 3,407 0.05%
Gaming Device 590 0.01%

This summary is great and very useful, but we certainly need some detail. Since the detail report is over 100 lines long I'm only going to show about a dozen lines here. Again, this detail is great for us so we know which versions of which browsers are being used to access our content at any given point.

UserAgent UAHits UAPercent
IE 9 1,170,034 16.99
Firefox 19 968,039 14.06
Chrome 25 846,618 12.30
IE 8 815,316 11.84
Safari 5 622,317 9.04
iPad 553,625 8.04
IE 10 331,290 4.81
Chrome 26 289,195 4.20
IE 7 286,986 4.17
iPhone 234,653 3.41
Android 4.1 72,227 1.05
Android 2 64,688 0.94
Android 4.0 54,877 0.80

Finally, the thing I was really after - what type of Android devices are being used to access our content? Just like the detail report this is only a partial list.

UserAgent UAHits PercentOfAndroid
Other Android Devices 52,470 24.11
Samsung Galaxy S III 25,903 11.90
Motorola Droid RAZR 16,558 7.61
Samsung Galaxy Tab 12,285 5.64
Samsung Galaxy S II 9,589 4.41
Samsung Galaxy Nexus 7 6,772 3.11
Samsung Galaxy Nexus 6,458 2.97
Samsung Galaxy Note II 6,270 2.88
HTC EVO 6,177 2.84
B&N Nook 4,271 1.96
Motorola Droid Bionic 3,736 1.72
Asus Transformer Pad 3,279 1.51
Motorola Droid RAZR M 3,140 1.44
Motorola Droid X 2,479 1.14
Motorola Droid 4 2,308 1.06
Motorola Xoom 2,278 1.05
Kindle Fire 1,498 0.69
Kindle Fire HD 728 0.33

It's no surprise that there are dozens and dozens of Android devices that are all browsing the web and hitting sites like ours. One little surprise is that Barnes and Nobel's Nook registered higher than the Kindle Fire. So many devices so little time.

Here's the Log Parser query I'm using for the User Agent summary (above).

select case strcnt(cs(user-agent),'Android') when 1 THEN 'Android'
else case strcnt(cs(user-agent),'BlackBerry') when 1 THEN 'BlackBerry'
else case strcnt(cs(user-agent),'iPad') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'iPhone') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'iPod') when 1 THEN 'Apple IOS' when 2 THEN 'Apple IOS'
else case strcnt(cs(user-agent),'Opera') when 1 THEN 'Opera'
else case strcnt(cs(user-agent),'Chrome') when 1 THEN 'Chrome'
else case strcnt(cs(user-agent),'Safari') when 1 THEN 'Safari'
else case strcnt(cs(user-agent),'IEMobile') when 1 THEN 'IEMobile'
else case strcnt(cs(user-agent),'MSIE') when 1 THEN 'Internet Explorer'
else case strcnt(cs(user-agent),'Firefox') when 1 THEN 'Firefox'
else case strcnt(cs(user-agent),'Googlebot') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Yahoo!+Slurp') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'bingbot') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Yandex') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Baiduspider') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'loc.gov') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'crawler@alexa.com') when 1 THEN 'Search Bot' when 2 THEN 'Search Bot'
else case strcnt(cs(user-agent),'Mobile') when 1 THEN 'Other Mobile'
else case strcnt(cs(user-agent),'PlayStation') when 1 THEN 'Gaming Device'
else case strcnt(cs(user-agent),'Nintendo') when 1 THEN 'Gaming Device'
else case strcnt(cs(user-agent),'curl') when 1 THEN 'curl'
else case strcnt(cs(user-agent),'wget') when 1 THEN 'wget'
else case strcnt(cs(user-agent),'-') when 1 THEN 'No User Agent'
ELSE 'Other User Agents' End End End End End End End End End End End End End End End End End End End End End End End
AS UserAgent, count(cs(User-Agent)) AS UAHits, MUL(PROPCOUNT(*),100) AS UAPercent
INTO D:\Reports\UserAgent_Summary.csv
FROM D:\Logs\.log