Tuesday, March 10, 2009

What's a "Real User"? Twitter Bots are Crawling Go2.me links.

One of the features of Go2.me is that it can tell you how many users are visiting your link. One problem I've noticed lately is that I am counting users that aren't real people.

I experimented with twittering a link and then looking at the viewers of that link shortly after. I'm seeing a number of Bots crawling twittered links right away. From my server logs, I see the following requests that got counted as real users:

- - [10/Mar/2009:11:57:13 -0700] "GET /xL HTTP/1.1" 200 8416 - "Mediapartners-Google,gzip(gfe),gzip(gfe)"
- - [10/Mar/2009:12:00:25 -0700] "HEAD /xL HTTP/1.1" 200 9468 - "AideRSS 2.0 (postrank.com),gzip(gfe),gzip(gfe)"
xx.xx.xx.xx - - [10/Mar/2009:12:00:56 -0700] "GET /xL HTTP/1.1" 200 9468 - "Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20071204 Ubuntu/7.10 (gutsy) Firefox/ FirePHP/,gzip(gfe),gzip(gfe)"
- - [10/Mar/2009:12:01:22 -0700] "GET /xL HTTP/1.1" 200 9469 - "WWW::Shorten::TinyURL/1.90,gzip(gfe),gzip(gfe)"
- - [10/Mar/2009:12:01:35 -0700] "GET /xL HTTP/1.1" 200 9469 - "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv: Gecko/2009011913 Firefox/3.0.6 TweetmemeBot,gzip(gfe),gzip(gfe)"
- - [10/Mar/2009:12:08:03 -0700] "HEAD /xL HTTP/1.1" 200 9468 - "Chat Catcher - (http://chatcatcher.com/bot.htm),gzip(gfe),gzip(gfe)"
- - [10/Mar/2009:12:08:33 -0700] "HEAD /xL HTTP/1.1" 200 9470 - "PycURL/7.19.3,gzip(gfe),gzip(gfe)"

Note that the link was posted at 11:57, so these requests came in from 6 seconds to 10 minutes after tweeting the original link. Only one of the requests listed came from a "real human", so Go2.me is counting 6 unique users that are actually Bots.
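To check timings like these without reading the log by eye, the entries can be parsed mechanically. The following is a minimal sketch in Python, not the code Go2.me uses: the field layout is inferred only from the entries quoted in this post, the names `LOG_PATTERN` and `parse_entry` are made up for illustration, and stripping the trailing `,gzip(gfe)` markers from the user agent is an assumption about what that suffix means.

```python
import re
from datetime import datetime

# Layout inferred from the entries in this post (an assumption, not a spec):
#   [timestamp] "METHOD path protocol" status bytes - "user agent"
LOG_PATTERN = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+) - "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict for one log entry, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        # Timestamps look like 10/Mar/2009:11:57:13 -0700
        "time": datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": match.group("method"),
        "path": match.group("path"),
        # Drop the ",gzip(gfe)" suffixes that follow the client's own agent string.
        "agent": match.group("agent").split(",gzip(gfe)")[0],
    }


if __name__ == "__main__":
    first = parse_entry(
        '- - [10/Mar/2009:11:57:13 -0700] "GET /xL HTTP/1.1" 200 8416 - '
        '"Mediapartners-Google,gzip(gfe),gzip(gfe)"'
    )
    last = parse_entry(
        '- - [10/Mar/2009:12:08:33 -0700] "HEAD /xL HTTP/1.1" 200 9470 - '
        '"PycURL/7.19.3,gzip(gfe),gzip(gfe)"'
    )
    elapsed = (last["time"] - first["time"]).total_seconds()
    print(first["agent"], last["agent"], elapsed)
```

Listing the method and agent for each entry makes the pattern easy to see: several of the requests are HEAD requests or come from agents that name themselves as libraries or bots.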

To fix this, I'm going to change my counting algorithm to require that you've successfully received a Go2.me cookie (as a normal web browser would), AND that your browser has successfully made an AJAX request to the server to receive a Chat update. This should stop Bots and Crawlers from being falsely reported as "Visitors" to Go2.me links.
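The rule described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the actual Go2.me implementation: the class `VisitorCounter`, its method names, and the idea of keying visitors by a cookie ID per link are all hypothetical.

```python
from collections import defaultdict


class VisitorCounter:
    """Count a client as a visitor only once it has a cookie AND has polled for chat."""

    def __init__(self):
        # link id -> set of cookie ids that have made an AJAX chat request
        self._visitors = defaultdict(set)

    def record_page_request(self, link_id, cookie_id):
        """A plain GET or HEAD of the link page. Deliberately never counted,
        since crawlers make these requests too."""
        return None

    def record_chat_poll(self, link_id, cookie_id):
        """An AJAX chat-update request. Counts only if the client sent back a cookie."""
        if not cookie_id:
            return False
        self._visitors[link_id].add(cookie_id)
        return True

    def count(self, link_id):
        """Number of distinct cookie holders that have polled for this link."""
        return len(self._visitors.get(link_id, ()))


if __name__ == "__main__":
    counter = VisitorCounter()
    # Six crawler hits: page requests only, no cookie, no AJAX poll.
    for _ in range(6):
        counter.record_page_request("xL", None)
    # One browser: loads the page, keeps the cookie, then polls for chat.
    counter.record_page_request("xL", "cookie-123")
    counter.record_chat_poll("xL", "cookie-123")
    print(counter.count("xL"))
```

With the seven requests from the log above, only the one browser that both holds a cookie and polls for chat would be counted.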

1 comment:

Mike Koss said...

I noticed a new crawler on another link:

- - [10/Mar/2009:12:36:16 -0700] "GET /xN HTTP/1.1" 200 8959 - "Twitturly / v0.6,gzip(gfe),gzip(gfe)"