Exactly What Data Are You Sending to Akismet?

Put on your tin-foil hats, fellas…

In my ongoing development of the Akismet plugin, I needed to figure out exactly what data one of its functions was receiving (so I knew which pieces I needed to steal to check against the whitelist / blacklist).

The easiest way to do this was to simply spit out the data right before it’s sent to the Akismet server to be processed there. I load up my test blog, put in a cheeky comment, hit the big red button, then wait for snoopy goodness to get dumped to my newly created logging table in the WP database.

The results? Way more than I expected…

Array
(
    [comment_post_ID] => 7
    [comment_author] => MellerTime
    [comment_author_email] => chris@doesnthaveone.com
    [comment_author_url] => http://chrismeller.com
    [comment_content] => more commenty goodness!!!
    [comment_type] =>
    [user_ID] => 2
    [user_ip] => 127.0.0.1
    [user_agent] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
    [referrer] => http://localhost/noteblog/?p=7
    [blog] => http://localhost/noteblog
    [HTTP_HOST] => localhost
    [HTTP_USER_AGENT] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
    [HTTP_ACCEPT] => text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    [HTTP_ACCEPT_LANGUAGE] => en-us,en;q=0.5
    [HTTP_ACCEPT_ENCODING] => gzip,deflate
    [HTTP_ACCEPT_CHARSET] => ISO-8859-1,utf-8;q=0.7,*;q=0.7
    [HTTP_KEEP_ALIVE] => 300
    [HTTP_CONNECTION] => keep-alive
    [HTTP_REFERER] => http://localhost/noteblog/?p=7
    [HTTP_COOKIE] => [snipped for brevity]
    [CONTENT_TYPE] => application/x-www-form-urlencoded
    [CONTENT_LENGTH] => 79
    [PATH] => C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\Program Files\Common Files\Adobe\AGL;C:\php;;C:\Program Files\QuickTime\QTSystem\;C:\Program Files\MySQL\MySQL Server 4.1\bin;C:\Program Files\Bitvise Tunnelier
    [SystemRoot] => C:\WINDOWS
    [COMSPEC] => C:\WINDOWS\system32\cmd.exe
    [PATHEXT] => .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH
    [WINDIR] => C:\WINDOWS
    [SERVER_SIGNATURE] => Apache/2.0.54 (Win32) PHP/5.0.5 Server at localhost Port 80
    [SERVER_SOFTWARE] => Apache/2.0.54 (Win32) PHP/5.0.5
    [SERVER_NAME] => localhost
    [SERVER_ADDR] => 127.0.0.1
    [SERVER_PORT] => 80
    [REMOTE_ADDR] => 127.0.0.1
    [DOCUMENT_ROOT] => C:/htdocs
    [SERVER_ADMIN] => chris@doesnthaveone.com
    [SCRIPT_FILENAME] => C:/htdocs/noteblog/wp-comments-post.php
    [REMOTE_PORT] => 4751
    [GATEWAY_INTERFACE] => CGI/1.1
    [SERVER_PROTOCOL] => HTTP/1.1
    [REQUEST_METHOD] => POST
    [QUERY_STRING] =>
    [REQUEST_URI] => /noteblog/wp-comments-post.php
    [SCRIPT_NAME] => /noteblog/wp-comments-post.php
    [PHP_SELF] => /noteblog/wp-comments-post.php
)

Needless to say, I was a bit surprised… Why exactly is every $_SERVER[] variable needed to process my blog’s spam? You just manually grabbed the necessary values (as I see them) a few lines previously:

$comment['user_ip'] = $_SERVER['REMOTE_ADDR'];
$comment['user_agent'] = $_SERVER['HTTP_USER_AGENT'];
$comment['referrer'] = $_SERVER['HTTP_REFERER'];
$comment['blog'] = get_option('home');

So why do you need to know the rest? Even if we ignore any possible privacy concerns here, it looks to me like we’re wasting a LOT of bandwidth… Let’s do some quick math, shall we?

All that crap, when saved to a text file, totals 2,639 bytes (2.57 KB). If we keep only the relevant stuff at the beginning (everything after “blog” is removed), we’re down to 437 bytes.
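Here is a minimal sketch of that trimming, assuming modern PHP (7 or later) and assuming the sizes were measured on print_r-style text. The helper names trim_payload() and dumped_size() are hypothetical and are not part of the Akismet plugin; the sample array is a shortened stand-in for the real dump.

```php
<?php
// Hypothetical helper (not from the Akismet plugin): keep only the comment
// fields that come before the raw $_SERVER dump, up to and including "blog".
function trim_payload(array $payload): array
{
    $keep = [
        'comment_post_ID', 'comment_author', 'comment_author_email',
        'comment_author_url', 'comment_content', 'comment_type',
        'user_ID', 'user_ip', 'user_agent', 'referrer', 'blog',
    ];
    // array_intersect_key() keeps the entries of $payload whose keys
    // also appear in the flipped whitelist.
    return array_intersect_key($payload, array_flip($keep));
}

// Assumption: size is measured as the length of the print_r() text dump.
function dumped_size(array $payload): int
{
    return strlen(print_r($payload, true));
}

// Shortened stand-in for the real payload.
$payload = [
    'comment_author' => 'MellerTime',
    'user_ip'        => '127.0.0.1',
    'blog'           => 'http://localhost/noteblog',
    'HTTP_HOST'      => 'localhost',
    'SERVER_PORT'    => '80',
];

$trimmed = trim_payload($payload);
echo dumped_size($payload), " -> ", dumped_size($trimmed), " bytes\n";
```

A whitelist is used rather than a blacklist so that any new $_SERVER key a different host happens to expose is dropped by default.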

After checking the Akismet homepage, we see from their Zeitgeist that they’ve caught a total of 302,974 spam comments, which represents 82% of all comments. If I try and remember some of my high school algebra classes, that means:

302974 = .82(x)
x = 369480.4878

We’ll use 369,480 for simplicity. Time for a little more math:

369480 x 2639 = 975,057,720

You checking me as we go along? Good… So that’s 975 million bytes of data, give or take some gzip compression here and there, some header information, and a few random character sets.

975057720 / 1024 = 952204.8046875 (kbytes)
952204.8046875 / 1024 = 929.8875 (mbytes)

So that’s 929.8875 megabytes of data hitting their servers. In the grand scheme of things, that’s not much, but let’s look at what it would have been with our smaller set of data:

369480 x 437 = 161,462,760

So now we’ve got 161 million bytes:

161462760 / 1024 = 157678.4765625 (kbytes)
157678.4765625 / 1024 = 153.983 (mbytes)

So we’ve gone from almost a gig of data down to about 150 MB… Seems pretty damn sizeable to me, how about you? Hmm, maybe I should offer a neutered Akismet plugin option?
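If you’d rather not check me by hand, the arithmetic can be re-run with a short script. This is only a sketch using the figures quoted in this post; the variable names and the to_megabytes() helper are my own, and it assumes PHP 7 or later.

```php
<?php
// Figures quoted in the post.
$spam_caught   = 302974;  // spam comments caught, per the Akismet Zeitgeist
$spam_share    = 0.82;    // spam as a fraction of all comments
$full_bytes    = 2639;    // full payload, including the $_SERVER dump
$trimmed_bytes = 437;     // payload cut off after "blog"

// Total comments checked, rounded down as in the post.
$total_comments = (int) floor($spam_caught / $spam_share);

// Hypothetical helper: bytes to megabytes using 1024-based units.
function to_megabytes(int $bytes): float
{
    return $bytes / 1024 / 1024;
}

$full_total    = $total_comments * $full_bytes;
$trimmed_total = $total_comments * $trimmed_bytes;

printf("comments: %d\n", $total_comments);
printf("full:     %d bytes (%.4f MB)\n", $full_total, to_megabytes($full_total));
printf("trimmed:  %d bytes (%.3f MB)\n", $trimmed_total, to_megabytes($trimmed_total));
printf("saved:    %.1f%%\n", 100 * (1 - $trimmed_bytes / $full_bytes));
```

The last line prints the share of bytes saved per comment, which depends only on the two payload sizes and not on the comment count.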


Comments

  1. ANONYMOUS?

    I am sure it is partly due to robots… If you can see that there is a HTTP_REFERER/etc. you can tell if it’s a robot or not (or at the least, tell the dumb robots from the smart robots).

    But yeah, there’s no need for the ENTIRE $_SERVER array to be joined, only certain tell-tale array keys.

    November 20, 2005 at 2:57 am | Permalink
  2. Chris Meller

    That’s the point though… We actually manually set “referrer” to HTTP_REFERER and “user_agent” to HTTP_USER_AGENT, and then go and include the whole raw $_SERVER anyway… Seems a bit… fishy. If not fishy, then certainly huge overkill.

    I’ve added said “Neuter Akismet” option to the plugin, which prevents this. No noticeable impact as of yet, but we’ll see how it fares in further testing.

    November 20, 2005 at 4:11 am | Permalink
  3. Incoherent Babble » Blog Archive » Enhanced Akismet Plugin - Version 1.06b5

    […] Plugins « Exactly What Data Are You Sending to Akismet? […]

    November 20, 2005 at 5:59 am | Permalink
  4. Gea-Suan Lin’s BLOG

    WordPress 2.0 Beta 1 - Akismet

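    WordPress 2.0 Beta 1 introduced Akismet, an antispam service. The service requires a WordPress.com API key, which I happen to have, so I installed it to try it out. However, even if testing shows that it works well, I may still switch back to Spam …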

    November 20, 2005 at 6:03 am | Permalink
  5. Wordpress/Automattic: All you data is belong to us - h0bbel

    […] wonder if people actually realize how much data Akismet actually gathers? For some reason it sends much more data than the actual comments, and combine all that information with views, post/page views, referrers, and clicks that the new […]

    May 6, 2007 at 9:55 am | Permalink
  6. h0bbel.p0ggel.org

    […] wonder if people actually realize how much data Akismet actually gathers? For some reason it sends much more data than the actual comments, and combine all that information with views, post/page views, referrers, and clicks that the new […]

    September 7, 2007 at 9:44 am | Permalink
