GoogleRank, Spammers, Phishing

One interesting thing to note is that a non-existent website ( has a GoogleRank of 4/10.

[If you don’t know why that above web address is of interest, go read my previous post].

So how does a website that does not exist end up being given a GoogleRank of 4 out of 10?

Well, some people have a publicly visible list of recent referrers on their blog. Each time Google sees a link it’ll remember it, and probably record it somewhere.

Google loves blogs. Why? Well, blogs generally contain a lot of links. Blog authors tend to link to a lot of websites that are of interest to them, and to their blog audience, if they know who that is. (Which is usually quite unlikely.)

Ok, so you have the GoogleBot crawling through your blog every now and then (the higher the GoogleRank, the more often it gets indexed), and it’ll index every single page that it can find on your website. It’ll remember that it has seen a link to site x and site y.

In this case GoogleBot has found the referrals page (or links page) and picked up all of the referrers listed on it. Among them it’ll find

Now, GoogleBot tries to follow that link, but is unsuccessful, so it will not actually appear in the search results (as there is nothing there to index). However, when you browse to the web address, the Google Toolbar will request some stats from Google telling it “I’m at, please give me the site rank”. Google will then say, hey, GoogleBot has seen that website linked to a lot, and here are some stats for you on it.

[Note: Most of the above may not be the full truth, but rather is my understanding of it.]

And here is the official Google explanation of how PageRank (or as I call it, GoogleRank) works:

“PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves “important” weigh more heavily and help to make other pages “important.” ”
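To make the "links as votes" idea concrete, here is a minimal Python sketch of the textbook PageRank iteration. It is only an illustration: the damping factor of 0.85, the page names and the link graph are all assumptions made up for the example, and it says nothing about how Google derives the 0 to 10 value shown in the toolbar. What it does show is that an address which can never be crawled can still collect rank purely from the pages that link to it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    # Collect every page, including ones that are only ever linked *to*.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share on each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links (for example one that could
                # not be crawled) has its rank spread evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three blogs whose referrer lists all mention an
# address that does not exist, so it has no outgoing links of its own.
links = {
    "blog_a": ["dead.example", "blog_b"],
    "blog_b": ["dead.example"],
    "blog_c": ["dead.example", "blog_a"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this made-up graph the unreachable address ends up with the highest score, simply because every blog "votes" for it.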

Ok, and below is a bunch of statistics I pulled out of my server logs about the number of requests from users who were referred to my site via the (possibly spoofed) spam address:

Month        Pages   Percent   Hits   Percent
Oct 2004     61      15.2 %    61     9.6 %
Sept 2004    36      7.2 %     36     6.4 %

There is no data for the months prior to September 2004 from users being referred to my site via that address.

Also note that after I blocked users coming in from that referral address, they are still hitting the server, and the server responds with HTTP 403 [Forbidden] or 404 [Not Found].
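As a rough sketch of what blocking by referrer can look like, here is a small Python WSGI wrapper that answers 403 whenever the Referer header points at a blocked host. This is not the setup used on this server (a block like this is usually done in the web server's own configuration); the host name `spam.example` and the function names are placeholders invented for the example.

```python
from urllib.parse import urlparse

# Hypothetical host to block; the real address is not reproduced here.
BLOCKED_REFERRER_HOSTS = {"spam.example"}


def is_blocked(referer_header):
    """Return True if the Referer header points at a blocked host."""
    if not referer_header:
        return False
    host = urlparse(referer_header).hostname  # None when no host can be parsed
    if host is None:
        return False
    return any(
        host == blocked or host.endswith("." + blocked)
        for blocked in BLOCKED_REFERRER_HOSTS
    )


def block_referrers(app):
    """Wrap a WSGI app so that requests from blocked referrers get a 403."""
    def wrapper(environ, start_response):
        if is_blocked(environ.get("HTTP_REFERER", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the wrapped application.
        return app(environ, start_response)
    return wrapper
```

Note that the request still reaches the server and is still answered, which is why blocked requests keep showing up in the logs with a 403 next to them.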

Although I’ve noted 7 IPs in my server logs, it doesn’t actually mean that much, as I think they are likely spoofed addresses, or the requests come from something the end user downloaded and is unaware of. Either way, there could be more distinct IPs out there than the 7 I’ve found in the logs thus far.

Here are the 7 IPs:

1) 2) 3) 4) 5) 6) 7)

I’ve also had a look at requests with the previously mentioned “Fetch API Request” as the User Agent. This one was interesting: 14 IP addresses.
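Counts like these can be pulled out of an access log with a few lines of Python. The sketch below assumes the Apache-style "combined" log format, which may not match the format of the logs used here, and the sample lines (IP addresses, timestamps, the second user agent) are invented purely for illustration.

```python
import re
from collections import defaultdict

# One line of the "combined" log format (an assumption about the log layout):
# ip ident user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def distinct_ips_by_agent(lines):
    """Map each User Agent string to the number of distinct client IPs."""
    ips = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        ips[match.group("agent")].add(match.group("ip"))
    return {agent: len(addresses) for agent, addresses in ips.items()}


# Made-up sample lines using documentation-only IP ranges.
sample = [
    '192.0.2.10 - - [25/Oct/2004:14:52:39 +0000] "GET / HTTP/1.1" 200 2198 "-" "Fetch API Request"',
    '192.0.2.11 - - [25/Oct/2004:14:53:02 +0000] "GET / HTTP/1.1" 200 2198 "-" "Fetch API Request"',
    '192.0.2.10 - - [25/Oct/2004:14:54:10 +0000] "GET / HTTP/1.1" 200 2198 "-" "Fetch API Request"',
    '198.51.100.7 - - [25/Oct/2004:15:01:00 +0000] "GET / HTTP/1.1" 200 2198 "-" "ExampleBrowser/1.0"',
    'not a log line',
]

for agent, count in distinct_ips_by_agent(sample).items():
    print(f"{agent}: {count} distinct IP(s)")
```

Grouping on `referer` instead of `agent` would give the same kind of count per referring address.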

While I’m at it, I also saw this one:

Host: /
Http Code: 200
Date: Oct 25 14:52:39
Http Version: HTTP/1.1
Size in Bytes: 2198
Referer: –
Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt; DTS Agent

Courtesy of this blog post by Joy Larkin, I realised that this is actually the UA (User Agent) of an email harvesting bot.

Upon further investigation I found this blog post by Neil Turner which states that it is actually “used by the Nigerian 419 scammers”.

Hrm, interesting, no?

So they don’t actually go via Google’s cached versions of pages (speculating again, maybe some do?), but rather have their own spider/bot crawling the web for email addresses and so on.

And now, on to more of the Citibank phishing tales.

Thus far I’ve gotten 4 emails since October 8th:

Here are the dates and the “from” addresses…

8/10/04 –
10/10/04 –
19/10/04 –
25/10/04 –

I actually decided to place them all in a folder in my email client just to see how many they send me.

I should do the same for all the fake Rolex emails I’ve been getting lately.

Yep, all 4 emails contain the same characteristics as previously described. They contain 1 image, plus a bunch of random “hidden” text.

Anyway, I should call it a night now and get some rest. Another full day ahead tomorrow!

Got home from the MDNUG presentation by Jonathan Wells a few hours ago and I seem to be quite tired. I’ll write up a summary maybe during my lunch break tomorrow, or tomorrow night (or maybe on the weekend, no promises, whenever I’m free basically!)

And with regard to announcing the details of my personal pet project, I’ve decided to delay that until I have something more solid, and maybe something to show to people. Maybe a prototype that partially works, or does enough for people to play around with.

It’s nothing special, just something for me to do while I learn the new things that are coming out in .NET 2.0, such as Generics.

And hopefully, it’ll allow me to use some things like TDD (after I’m done deciding on my specs).