Blank Comment Spam

There seems to be a lot of comment spam at the moment where the content is just a pair of <a>  </a> tags with no other content.

The common point linking most of them is their email address being usually an, and a handful with Or more specifically, the following:

There have been about 76 of these that I’ve found, and they started appearing on Saturday, March 15th. To say the least, they are quite unusual compared to the regular spam I’ve been getting. I’m sure that with time they will vary those addresses, or that something else is going to be launched, with this being just a simple test run.

Doing a little quick investigation via Google indicates that these aren’t new email spam addresses.

Akismet Rocks

I’ve been using this Comment/Trackback spam stopping plugin for WordPress called Akismet since yesterday.

Thus far, it’s caught 69 comment spams. And all 69 of those were genuine comment spams.

I’m impressed with it thus far.

But since this blog is fairly low on the blog comment traffic, I haven’t noticed any non-spam comments getting stuck.

Sometimes I can get upwards of 150 spam comments per day.

Akismet has been around for a little while now (about 2 months).

I finally decided to install it after reading lots of success stories, in particular this post by Dougal Campbell titled Poisoning the Well. The type of comments that he mentions are similar to the new batch of comment spam that I’ve noticed lately. These comment spams are becoming a little bit harder to tell apart from those written by people living in countries with really bad English.

The only way that I can actually tell is that 1 out of 5 or so of the links they post is not a legitimate link.

New wave of blog spamming

Damn, looks like some of those links in the “spam” comments are links to real blogs.

There are many “abandoned” blogs out there, and it looks like these spammers are taking advantage of them by putting all of their spam links in the comments area of those blogs:

I guess when one thought the “fight” against these spammers was getting under control they’ve started to throw a few new spanners into the mix.

These days all of my comments are moderated because of this spam problem.

Have a read of this blog post for more details about this new form of spamming.

I’ve actually experienced this here. It’s been going on for the past 2 days or so. It looked like a legit foreign-language blog to me at first, and I didn’t think any more of it, aside from the fact that the comments being posted were almost word for word the same as the old spam that I regularly deleted here. That’s when things seemed a bit fishy.

So watch out for this new wave of blog spam all!

Are blog pinging and trackback spam related?

This is interesting…

Is it a coincidence that I posted one blog entry, and published once, then edited it and republished it again…

Then a few minutes later I get two trackback spams?

Could it be that when my blog “pings” one of the ping servers that one of the bots checks for the latest updated blog posts on one of those sites that lists recently updated blogs?

Or is it just too much of a coincidence?

Actually, this is not the first time this has happened.

I’ve also noticed that when there’s been no activity on my blog, there’s not any trackback spam, just your regular run of the mill blog comment spam.

More spam stuff

Spam of all sorts (comment spam, referrer spam, trackback spam…) looks like it is targeting bloggers of all sorts…

What’s next on the list?

Anyway, out of curiosity, I decided to do domain whois on the various referrer spam URLs, as well as visit their actual domain (not the subdomain they actually spam).

It turns out that a great many of the sites have been closed down due to account abuse.

That’s a good sign, but it doesn’t mean the spamming on behalf of those sites has stopped.

Another trend, a lot of sites that do work, seem to be related to this “” domain.

Hrm, weird. What is so special about that domain?

Very odd that the links on that domain seem to lead to many many weird and different domains with odd URLs.

Another spam referrer

While browsing through blog stats, I noticed the following domain:

There seem to be a lot of spam referrers coming to my blog where the referrer is a subdomain of the main site.

In some cases the incoming IP of the spammer that comes via this subdomain also came a few times to my blog via the non-existent URL (as mentioned previously).

Looks like our spammers are evolving as they see fit to do so, but haven’t totally done away with their old stuff!

Update: Looks like the entire domain has been shut down… due to “mis-proper use of the hosting account”.

Comment Spam Prevention Technique – NoFollow

This is a new initiative to prevent comment spam.

There are perhaps many more blogs than I list below mentioning this…

MSN Search Blog
Six Apart

There is a WordPress plugin already developed which implements the nofollow stuff, so you don’t need to wait until there’s an updated WordPress that makes use of it.

Let’s say your normal links inside of your blog comments are of the type:

<a href="">Will’s Domain</a>

Now when Google crawls your blog and sees that link, it will follow the link and index the page it points to.

And through some sort of algorithm, that probably adds some brownie points to the linked site’s PageRank.

Now, this new nofollow attribute is just a bit of extra “fluff” that tells Google and other search engines to not follow links in the comments.

Code as below:

<a href="" rel="nofollow">Will’s Domain</a>

So when a search engine comes to index a blog page and sees a link carrying the nofollow attribute, rather than following the link to index the linked page, it will just continue on to other links that don’t have the attribute.

The hope is that by eliminating the incentive for a PageRank boost, we may see a decline in comment spam.
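To make the idea concrete, here is a minimal sketch of how comment HTML could be rewritten so that every link carries the attribute. This is not the code of the WordPress plugin mentioned above; the function name add_nofollow is made up for illustration, and the regular expression is a simplification that assumes reasonably tidy comment markup rather than arbitrary HTML.

```python
import re

# Matches the opening <a ...> tag of a link. A simplification for
# illustration only, not a full HTML parser.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)


def add_nofollow(comment_html):
    """Add rel="nofollow" to every <a> tag that has no rel attribute yet."""
    def fix(match):
        attrs = match.group(1)
        if re.search(r"\brel\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # leave an existing rel attribute untouched
        return '<a' + attrs + ' rel="nofollow">'
    return A_TAG.sub(fix, comment_html)


if __name__ == "__main__":
    # example.com is a placeholder domain, not one from this post.
    print(add_nofollow('<a href="http://example.com/">Example</a>'))
```

Links that already declare a rel value are left alone in this sketch; a real filter would probably need to merge nofollow into the existing value instead.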

And you can observe the feedback in the blogosphere here

Trackback spam?

Got 2 Trackbacks on my blog overnight.

Here they are:


Website: jzlmiiv
<trackback /><strong>lgjalpppa</strong>


Website: vcoazrkyq
<trackback /><strong>ihpivek</strong>

So why do I think it could be spam?

Well, all the listed details are just plain gibberish, and neither of the domains is a real one.

Both of the IP addresses (not listed above) associated with the trackbacks look like they belong to something hosted on a local machine running off a “home” ISP account.

Could very well be the start of things to come.

I have yet to see any new spam comments get past my spam filter word list, which is good, but if the spammers are that determined, they will find a way through sooner or later.

The battle is never won forever, it is only for the time being.

I’ll be deleting the two trackbacks and will observe what happens.

Oh, and I also noticed that despite the different IP addresses (which may be spoofed, or otherwise belong to an infected machine of some sort) they all have the same User Agent (UA) string (a common trend when looking for patterns amongst the spammers).

Here is the UA string according to my logs:

Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)

Here is another item to consider.

In my logs I have 4 different IPs logged for two spam trackbacks.

Basically I get one IP address visiting an individual post, when it gets the “200” (everything is A-OK) response, a few seconds later, I get another IP address doing a Trackback to that exact post which was just visited.

A few minutes later, another one of my blog posts gets the exact same treatment!

Looking further into my logs I see two IP addresses listed as being from Bermuda: and

I’ve seen these two IP addresses quite often over a period of time.

Both have the following user agent:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)

Oh, and my other visitor tracker states the following UA (instead of the above) for the particular IP addresses:
Mozilla 3.01

Makes me suspect that the two IP addresses may have something to do with this whole comment spamming thing.

Ah, here we go, found something else.

A page of anonymous proxies:

Both of the suspect Bermuda IP addresses are listed there as anonymous proxies.

They are listed as “Elite Proxies”, here is a description of an Elite Proxy on that site:

High anonymity (elite proxy) – HTTP Servers of this type does not send HTTP_X_FORWARDED_FOR, HTTP_VIA and HTTP_PROXY_CONNECTION variables. Host doesn’t even know you are using proxy server an of course it doesn’t know your IP address.

It also has a “+” next to it, which indicates ssl_support.

So, now we have something that seems a bit more interesting, and particularly suspect!

From the RAW access logs:

- - [01/Jan/2005:09:57:44 -0600] "POST /blog/wp-comments-post.php HTTP/1.0" 302 0 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)"
- - [01/Jan/2005:09:57:45 -0600] "GET /blog/archive/2004/07/21/project-ideas-simplify-javascript-learn-c-20 HTTP/1.0" 200 14224 "-" "Mozilla/3.01 (compatible;)"
- - [01/Jan/2005:09:57:45 -0600] "POST /blog/archive/2004/07/21/project-ideas-simplify-javascript-learn-c-20 HTTP/1.0" 200 17638 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)"
- - [02/Jan/2005:16:41:23 -0600] "POST /blog/wp-comments-post.php HTTP/1.0" 302 0 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)"
- - [02/Jan/2005:16:41:24 -0600] "GET /blog/archive/2004/09/03/more-on-ntt-docomo-i-mode HTTP/1.0" 200 14224 "-" "Mozilla/3.01 (compatible;)"
- - [02/Jan/2005:16:41:24 -0600] "POST /blog/archive/2004/09/03/more-on-ntt-docomo-i-mode HTTP/1.0" 200 18962 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)"

A rough approximation of the format of the above RAW access logs is as below:

IP Address - - [dd/Mmm/yyyy:hh:mm:ss timezone] "FORM_ACTION /directory/file HTTP_MODE" HTTP_RESPONSE FILE_SIZE "Referrer" "User Agent"
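For anyone who wants to pull these fields out of their own logs, the layout above can be split with a regular expression. The following is only a sketch: it assumes the common "combined" access log layout (with a referrer field before the User Agent), the name parse_line is made up, and the IP 192.0.2.1 in the sample is a documentation placeholder, not one of the addresses from my logs.

```python
import re
from datetime import datetime

# One named group per field of the layout described above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry


# Placeholder IP address; the rest mirrors one of the entries above.
SAMPLE = ('192.0.2.1 - - [02/Jan/2005:16:41:24 -0600] '
          '"GET /blog/archive/2004/09/03/more-on-ntt-docomo-i-mode HTTP/1.0" '
          '200 14224 "-" "Mozilla/3.01 (compatible;)"')

if __name__ == "__main__":
    entry = parse_line(SAMPLE)
    print(entry["method"], entry["path"], entry["status"], entry["agent"])
```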

Noticed any patterns?

Here’s one:

When it retrieves the file from my blog, it’ll post its User Agent as: Mozilla/3.01 (compatible;)
When it wants to post, the posted User Agent is listed as: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)
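
The pattern can also be spotted mechanically. Below is a rough sketch that assumes the log entries have already been reduced to a time, a method, a path and a User Agent; the function name find_fetch_then_post and the 5 second window are my own choices for illustration, not anything taken from the logs.

```python
from datetime import datetime, timedelta


def find_fetch_then_post(entries, window=timedelta(seconds=5)):
    """Return (get, post) pairs where a GET of a path is followed within
    `window` by a POST to the same path under a different User Agent."""
    pairs = []
    last_get = {}  # path -> most recent GET entry seen for that path
    for entry in sorted(entries, key=lambda e: e["time"]):
        if entry["method"] == "GET":
            last_get[entry["path"]] = entry
        elif entry["method"] == "POST":
            get = last_get.get(entry["path"])
            if (get is not None
                    and entry["time"] - get["time"] <= window
                    and entry["agent"] != get["agent"]):
                pairs.append((get, entry))
    return pairs


# Three of the requests from the log above, reduced to the fields that matter.
PCUSER = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0; PCUser)"
FETCHER = "Mozilla/3.01 (compatible;)"
POST_PATH = "/blog/archive/2004/09/03/more-on-ntt-docomo-i-mode"
ENTRIES = [
    {"time": datetime(2005, 1, 2, 16, 41, 23), "method": "POST",
     "path": "/blog/wp-comments-post.php", "agent": PCUSER},
    {"time": datetime(2005, 1, 2, 16, 41, 24), "method": "GET",
     "path": POST_PATH, "agent": FETCHER},
    {"time": datetime(2005, 1, 2, 16, 41, 24), "method": "POST",
     "path": POST_PATH, "agent": PCUSER},
]

if __name__ == "__main__":
    for get, post in find_fetch_then_post(ENTRIES):
        print(post["path"], "fetched as", get["agent"],
              "then posted as", post["agent"])
```

Entries with the same timestamp keep their original order, so this relies on the log listing the GET before the POST, as it does above.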

GoogleRank, Spammers, Phishing

One interesting thing to note is that a non-existent website has a GoogleRank of 4/10.

[If you don’t know why that above web address is of interest, go read my previous post].

So how does a website that does not exist, end up being given a GoogleRank of 4 out of 10?

Well, some individuals actually have a list of recent referrers on their blog that is publicly visible. Now what Google does is it’ll remember each time it sees a link and it’ll probably record it somewhere.

Google loves blogs. Why? Well, generally there are many links available on blogs. Blog authors tend to link to a lot of websites that are of interest to them and to their blog audience, if they know who that is. (Which is usually quite unlikely).

Ok, so you have the GoogleBot crawling through your blog every now and then (the higher the GoogleRank, the more often it gets indexed), and it’ll index every single page that it can find on your website. It’ll remember that it has seen a link to site x and site y.

In this case GoogleBot has found the referrals page (or links) and finds all of the referrers, including the non-existent address.

Now, GoogleBot tries to follow that link, but is unsuccessful, so it will not actually appear in the search results (as there is nothing there to index). However, when you browse to the web address, the Google Toolbar will request some stats from Google, telling it “I’m at this address, please give me the site rank”. Google will then say, hey, GoogleBot has seen that website linked to a lot, and here are some stats for you on it.

[Note: Most of the above may not be the full truth, but rather is my understanding of it.]

And here is the official Google explanation of how PageRank (or as I call it, GoogleRank) works:

“PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.” ”
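
To see how a page nobody can actually visit could still collect votes, here is a small sketch of the textbook PageRank iteration. It is the simplified, published form of the idea, not whatever Google really runs, and the 0 to 10 toolbar value is a different scale from the numbers computed here. The page names, the damping factor of 0.85 and the iteration count are all assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each link is a vote; a page splits its vote between its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical blogs whose public referrer lists all link to a page
# ("ghost") that has no content and no outgoing links of its own.
LINKS = {
    "blog_a": ["ghost", "blog_b"],
    "blog_b": ["ghost"],
    "blog_c": ["ghost"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(LINKS).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))
```

In this toy graph the "ghost" page ends up with the highest score purely because the other pages link to it, which matches my guess about the referrer lists above.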

Ok, and below is a bunch of statistics I pulled out of my server logs about the number of requests from users who were referred to my site via the spam (possibly) spoofed address:

Month        Pages   Percent   Hits   Percent
Oct 2004     61      15.2 %    61     9.6 %
Sept 2004    36      7.2 %     36     6.4 %

There is no data for the months prior to September 2004 from users being referred to my site via that address.

Also note that after I blocked off users coming in from that referral address, they are still hitting the server, and the server throws the HTTP Response 403 [Forbidden] or 404 [Page Not Found].

Although I’ve noted 7 IPs in my server logs, it doesn’t actually mean that much, as I think they are likely spoofed addresses or belong to machines running something the end user downloaded without being aware of it. Either way, there could be a higher number of different IPs out there than the 7 I’ve found in the logs thus far.

Here are the 7 IPs:

1) 2) 3) 4) 5) 6) 7)

And I’ve also had a look at the previously mentioned “Fetch API Request” as the User Agent. This one was interesting, 14 IP addresses.

While I’m at it, I also saw this one:

Host: /
Http Code: 200 Date: Oct 25 14:52:39 Http Version: HTTP/1.1 Size in Bytes: 2198
Referer: –
Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent

Courtesy of this blog post by Joy Larkin, I realised that is actually the UA for an email harvesting bot.

Upon further investigation I found this blog post by Neil Turner which states that it is actually “used by the Nigerian 419 scammers”.

Hrm, interesting, no?

So they don’t actually go via Google cached version (speculating again, maybe some do?), but rather have their own spider/bot crawling the web for email addresses and so on.

And now, onto more of the Citibank phishing tales.

Thus far I’ve gotten 4 emails since October 8th:

Here are the dates, and the “from” addresses…

8/10/04 –
10/10/04 –
19/10/04 –
25/10/04 –

I actually decided to place them all in a folder in my email client just to see how many they send me.

I should do the same for all the fake rolex emails I’ve been getting lately.

Yep, all 4 emails contain the same characteristics as previously described. They contain 1 image, plus a bunch of random “hidden” text.

Anyway, I should call it a night now and get some rest. Another full day ahead tomorrow!

Got home from the MDNUG presentation by Jonathan Wells a few hours ago and I seem to be quite tired. I’ll write up a summary maybe during my lunch break tomorrow, or tomorrow night (or maybe on the weekend, no promises, whenever I’m free basically!)

And in regards to the announcement of the details of my personal pet project, I’ve decided to delay that until I get something more solid and maybe something to show to people. Maybe a prototype that partially works. Or does enough for people to play around with.

It’s nothing special, it’s just something for me to do and learn the new things that are coming out in .NET 2.0 such as Generics.

And hopefully, it’ll allow me to use some things like TDD (after I’m done with deciding on my specs).

Dissecting Comment Spam

I’ve been getting a lot of comment spam over the last few days and there is one key element that is the same, and that is they all come via the following referrer:

The IPs may differ, but the referrer has consistently been the same address.

I doubt I am the only one being spammed in comments by people coming in via that referrer.

And their “User Agent” seems to always say: “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)”

Don’t think the User Agent part is that interesting though.

Suggests that the spammer is running Internet Explorer 6 on a Windows 2003 machine, with .NET version 1.1 installed.
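
That reading of the string can be written down as a small sketch. The function name describe_agent is made up, the version table only covers a few well-known Windows NT tokens, and of course the whole string can be faked by whoever sends it.

```python
import re

# Well-known meanings of the "Windows NT x.y" token in old MSIE UA strings.
NT_VERSIONS = {
    "4.0": "Windows NT 4.0",
    "5.0": "Windows 2000",
    "5.1": "Windows XP",
    "5.2": "Windows Server 2003 (or XP x64)",
}


def describe_agent(user_agent):
    """Return the browser, OS and .NET versions an MSIE-style UA claims."""
    inside = re.search(r"\(([^)]*)\)", user_agent)
    tokens = [t.strip() for t in inside.group(1).split(";")] if inside else []
    info = {"browser": None, "os": None, "dotnet": []}
    for token in tokens:
        if token.startswith("MSIE "):
            info["browser"] = "Internet Explorer " + token[len("MSIE "):]
        elif token.startswith("Windows NT "):
            version = token[len("Windows NT "):]
            info["os"] = NT_VERSIONS.get(version, token)
        elif token.startswith(".NET CLR "):
            info["dotnet"].append(token[len(".NET CLR "):])
    return info


SPAM_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.1.4322)"

if __name__ == "__main__":
    print(describe_agent(SPAM_AGENT))
```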

I think it would be really easy to speculate without knowing what is up with this virtually non-existent address. What could it be?

I could take a few guesses, but chances are they would be pure speculation based on guesswork.

One of the good things about comment moderation has been that none of the SPAM comments have actually gotten past the moderation and onto the actual blog.

Doing a Google search to find out what in the world is so good about that address, and why IPs being referred from there are spamming my site, I came across the following blog entry by Thomas Strömberg.

He also noticed the familiar referral pattern, so his solution was to do a mod_rewrite (available only in Apache) using the following code (place it into the .htaccess file):

RewriteCond %{HTTP_REFERER} ^
RewriteRule ^(.*) /asshole-bot

That’s basically to block users who use the above address as their referrer.

I’ve just put that into my .htaccess file, and i’ll see how it goes.

I’ve only decided to investigate this further because it got to the point where I was getting emails coming in asking to approve messages. This morning alone I had 7 when I opened up my mail client.

Ah well, if enough people block/redirect users who come via that referral, they will probably just adjust the referrer to something else, and then the battle with comment spam restarts once again.

It also looks like some blogs out there have that as the “trackback” URL to their posts.

My guess is that there are “infected” computers out there that are forcibly pulling down these pages and adding comments any way they can.

Infected with what, you ask? Honestly, I have no idea.

Judging by the User Agent (which could be totally bogus), it is targeting a vulnerability in Windows 2003 machines with .NET 1.1.

Again, this is all just pure speculation.

Looking at something like Mitch Denny’s RoryCom, you can see just how easy it would be to develop such a monster.

Ok, maybe a bit of modification of what Mitch developed, but still, just a few lines of C# and we’ve got ourselves a monster.

There are a million and one ways this comment spam monster could be implemented… (I will refer to it as a monster, what type? use your imagination :P)

Perhaps it sets up a little web server on the host computer with an IP address of The hosted page then uses some form of GET request to pull down a single or multiple URLs.

Hrm, some of that may be a little too complex.

But hey, without knowing what is going on, anything is possible!

Update: Cindy noticed that they are using the following two User Agents to harvest comment link URLs:

“Fetch API Request” as well as “Microsoft Scheduled Cache Content Download Service”.

RewriteCond %{HTTP_USER_AGENT} (Fetch\ API\ Request) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Microsoft\ Scheduled\ Cache\ Content\ Download\ Service) [NC,OR]
RewriteRule .* - [F]

Update: Ok, so the above doesn’t really work the way it should apparently.

After removing the second one, things did work. I noticed that I’ve never actually gotten any User Agent that contained “Microsoft Scheduled Cache Content Download Service” when I looked through my server logs, but I did get the Fetch API Request. So, now I’ve blocked it off.

So, right now what I’ve got contained in the .htaccess file is the following:

RewriteCond %{HTTP_REFERER} ^
RewriteRule ^(.*) /asshole-bot

RewriteCond %{HTTP_USER_AGENT} ^Fetch\ API\ Request
RewriteRule ^.* - [F,L]
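
For those who prefer to read it as logic rather than as mod_rewrite syntax, here is a rough model of what those two rules are meant to do. It is only a sketch for reasoning about the rules, not something Apache runs; the referrer prefix is a made-up placeholder because the real address is not reproduced here, and the function name decide is my own.

```python
# Hypothetical placeholder; substitute the spam referrer you are seeing.
BLOCKED_REFERRER_PREFIXES = ["http://spam-referrer.example/"]
# From the second rule: the User Agent must start with this text.
BLOCKED_AGENT_PREFIXES = ["Fetch API Request"]


def decide(referrer, user_agent):
    """Return "forbidden", "rewrite" or "allow" for one request."""
    # The [F] rule is checked first here because the first rule has no [L]
    # flag, so a request matching both would still reach the second rule.
    if any(user_agent.startswith(p) for p in BLOCKED_AGENT_PREFIXES):
        return "forbidden"  # answered with a 403
    if any(referrer.startswith(p) for p in BLOCKED_REFERRER_PREFIXES):
        return "rewrite"  # sent to /asshole-bot, which does not exist
    return "allow"


if __name__ == "__main__":
    print(decide("http://spam-referrer.example/page", "Mozilla/4.0"))
    print(decide("-", "Fetch API Request"))
    print(decide("-", "Mozilla/4.0"))
```

Both checks are case-sensitive prefix matches, which mirrors the ^ anchor and the absence of the [NC] flag in the rules above.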

Oh, and I found the following list to be handy for the curious.

And for those who are running IIS, you can try ISAPI_Rewrite, which is an ISAPI module for IIS that allows you to do much of the above, but for IIS rather than Apache. (So, basically bringing Apache-like functionality to your IIS server).

Final Update: It seems that WordPress likes to strip the \ off the end of words. So the two mod-rewrites posted by Cindy should have a trailing \ after each word. Well, except for the last one.