Wednesday, July 21, 2010

Spam SEO trends & statistics (Part IV)

In Spam SEO trends & statistics (Part III), I showed that hot trends should be scanned five to twelve days after they first appear in Google Hot Trends. To cut the number of URLs to check even further, while still finding as many spam SEO sites as possible, I have analyzed the spam SEO pages found to date.

I've scanned 294,691 unique URLs from the Google search results. About 5% of them (15,913 URLs) turned out to be spam SEO pages, which redirected to 573 different spam or malicious domains.



Filter URLs to scan with a regular expression

The spam URLs often look the same. For the search "word1 word2 word3", most of the spam addresses match this regular expression:


   (\.php|\/)\?[a-z]+=word1(%20|\+|-| )word2(%20|\+|-| )word3

For example, for the search "keith britton actor", I get these types of URLs:
http://t-and-d.net/jzxhe.php?sell=keith%20britton%20actor
http://taos-inc.com/jxddd.php?p=keith%20britton%20actor
http://whitevoice.com/wiktp.php?sell=keith%20britton%20actor
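
As a rough illustration (not the scanner's actual code), the strict filter can be generated from the search terms in Python. The helper names strict_pattern and looks_like_spam are made up for this sketch, and the query is written the way it appears in the example URLs:

```python
import re

# Separators seen between the search words in spam URLs: %20, +, - or a space.
SEPARATOR = r"(?:%20|\+|-| )"

def strict_pattern(query):
    """Build the strict filter for one search query (hypothetical helper)."""
    words = [re.escape(word) for word in query.lower().split()]
    return re.compile(r"(?:\.php|/)\?[a-z]+=" + SEPARATOR.join(words))

def looks_like_spam(url, query):
    # search() rather than match(): the pattern sits in the middle of the URL.
    return strict_pattern(query).search(url.lower()) is not None

urls = [
    "http://t-and-d.net/jzxhe.php?sell=keith%20britton%20actor",
    "http://taos-inc.com/jxddd.php?p=keith%20britton%20actor",
    "http://whitevoice.com/wiktp.php?sell=keith%20britton%20actor",
]
for url in urls:
    print(looks_like_spam(url, "keith britton actor"), url)
```

All three example URLs match. A URL that carries the same words in its path but has no "?param=" part would not.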

I've applied this regular expression to all the spam SEO links. This gives 10,821 matches (68% of all spam links), which lead to 352 domains (61% of all the spam/malicious domains). Applied to all the links, this regular expression also matches 5,455 good URLs.

So, by using this one regular expression, I can scan about 5.5% of all search results (16,276 links) and still catch 68% of all spam links and 61% of the bad domains.


Loose regular expressions

We can catch even more spam by making the regular expression less strict. Some popular searches exist in several variations. For example, the search "keith britton actor" also shows spam results for another popular trend, "keith britton wiki". So I came up with the following regular expression: (\.php|\/)\?[a-z]+=[a-z]

This new regular expression gives us 3,618 additional spam links leading to 56 new domains, but it nearly doubles the number of good URLs scanned by adding 5,047 good search results. So the new numbers are:
  • 14,439 spam results, 91% of all spam links
  • 408 spam/malicious domains, 71% of all bad domains
  • 10,502 good results, 3.5% of all links to scan
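
The trade-off can be seen on a few sample URLs. This is only a sketch: the first URL comes from the examples above, while the other two are hypothetical URLs I made up to show a trend variation and a legitimate page.

```python
import re

# Strict filter for the search "keith britton actor", and the loose filter.
STRICT_ACTOR = re.compile(r"(\.php|\/)\?[a-z]+=keith(%20|\+|-| )britton(%20|\+|-| )actor")
LOOSE = re.compile(r"(\.php|\/)\?[a-z]+=[a-z]")

samples = {
    "trend searched": "http://t-and-d.net/jzxhe.php?sell=keith%20britton%20actor",
    "other trend (hypothetical URL)": "http://t-and-d.net/jzxhe.php?sell=keith%20britton%20wiki",
    "legitimate (hypothetical URL)": "http://example.com/index.php?page=about",
}

def matches(pattern, url):
    return pattern.search(url) is not None

for label, url in samples.items():
    print(f"{label}: strict={matches(STRICT_ACTOR, url)} loose={matches(LOOSE, url)}")
```

The loose filter picks up the spam link for the other trend, which the strict filter misses, but it also matches the legitimate page.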


More effective regular expressions

One more optimization is possible: all hot trends contain at least two words, so the regular expression can require a word separator after the first word: (\.php|\/)\?[a-z]+=[a-z]+(%20|\+|-| )
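
To see what the extra separator changes, here is a small sketch comparing the loose and the optimized expressions. The two example.com URLs are hypothetical legitimate pages I made up for the comparison.

```python
import re

LOOSE = re.compile(r"(\.php|\/)\?[a-z]+=[a-z]")
OPTIMIZED = re.compile(r"(\.php|\/)\?[a-z]+=[a-z]+(%20|\+|-| )")

samples = [
    "http://whitevoice.com/wiktp.php?sell=keith%20britton%20actor",  # spam URL from the examples
    "http://example.com/index.php?page=about",        # hypothetical legitimate URL
    "http://example.com/search.php?q=cheap+flights",  # hypothetical legitimate URL
]

for url in samples:
    loose = LOOSE.search(url) is not None
    optimized = OPTIMIZED.search(url) is not None
    print(f"loose={loose!s:5} optimized={optimized!s:5} {url}")
```

The single-word parameter value ("page=about") no longer matches, while a legitimate URL with a multi-word value still does. That is why some good results remain after filtering.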

Compared with the first, strict expression, this one adds only 1,362 legitimate results. It still reaches the 56 bad domains found with the loose version, and it finds 2,115 additional spam links. Total results:
  • 12,936 spam results, 81% of all spam links
  • 408 spam/malicious domains, 71% of all bad domains
  • 6,817 good results, 2.3% of all links to scan

Filtering the search results with any of the three regular expressions has one more benefit: it tends to filter out false positives, and it increases the proportion of malicious domains relative to spam.
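
The figures reported for the three expressions can be put side by side with a few lines of Python. The counts are the ones given above; the names (filters, summarize, spam_share_of_scanned) are just labels chosen for this sketch.

```python
TOTAL_URLS = 294_691
TOTAL_SPAM = 15_913
TOTAL_BAD_DOMAINS = 573

# (spam links matched, bad domains reached, good URLs matched), as reported above
filters = {
    "strict": (10_821, 352, 5_455),
    "loose": (14_439, 408, 10_502),
    "optimized": (12_936, 408, 6_817),
}

def summarize(spam, domains, good):
    scanned = spam + good  # every matched URL has to be scanned
    return {
        "scanned": scanned,
        "spam_coverage": spam / TOTAL_SPAM,
        "domain_coverage": domains / TOTAL_BAD_DOMAINS,
        "spam_share_of_scanned": spam / scanned,
    }

print(f"no filter: {TOTAL_SPAM / TOTAL_URLS:.1%} of scanned URLs are spam")
for name, counts in filters.items():
    s = summarize(*counts)
    print(f"{name:>9}: scan {s['scanned']:,} URLs, "
          f"{s['spam_coverage']:.0%} of spam, "
          f"{s['domain_coverage']:.0%} of bad domains, "
          f"{s['spam_share_of_scanned']:.0%} of scanned URLs are spam")
```

Without any filter, roughly one scanned URL in twenty is spam. With the optimized expression, roughly two out of three scanned URLs are spam.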


Here is a comparison of the scan efficiency with no filter and with the optimized filter:







Which search pages to scan?

Finally, I wanted to find out which search results pages should be scanned. Here is the distribution of spam SEO links on the first 10 pages of a Google search:


While pages 5 to 10 contain more malicious links, no page should be skipped. The fact that page 10 contains the largest number of links suggests that more than 10 pages should be scanned.

Conclusion

We can optimize the scanning of Google search results by looking at trends 5 to 12 days after they appear, and by scanning only the links that match a regular expression. This makes it possible to scan more pages per search term and to increase the number of malicious domains found each day.
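
Put together, the two filters could look something like the sketch below. It is not the actual scanner: the input format (a dict of trends with their first-seen date, and a dict of result URLs per trend) and the function names are assumptions made for the example, and I treat the 5 to 12 day window as inclusive.

```python
import re
from datetime import date, timedelta

OPTIMIZED = re.compile(r"(\.php|\/)\?[a-z]+=[a-z]+(%20|\+|-| )")

def in_scan_window(first_seen, today):
    """True when the trend first appeared 5 to 12 days ago (inclusive, by assumption)."""
    return timedelta(days=5) <= today - first_seen <= timedelta(days=12)

def urls_to_scan(trends, results_by_trend, today):
    """Yield (trend, url) pairs worth scanning.

    trends: dict mapping a trend to its first-seen date (hypothetical format)
    results_by_trend: dict mapping a trend to its search result URLs (hypothetical format)
    """
    for trend, first_seen in trends.items():
        if not in_scan_window(first_seen, today):
            continue  # trend is too recent or too old
        for url in results_by_trend.get(trend, []):
            if OPTIMIZED.search(url.lower()):
                yield trend, url

today = date(2010, 7, 21)
trends = {
    "keith britton actor": date(2010, 7, 14),  # 7 days old: inside the window
    "brand new trend": date(2010, 7, 20),      # 1 day old: skipped for now
}
results_by_trend = {
    "keith britton actor": [
        "http://t-and-d.net/jzxhe.php?sell=keith%20britton%20actor",
        "http://example.com/index.php?page=about",  # hypothetical legitimate URL
    ],
    "brand new trend": ["http://example.com/x.php?p=brand%20new%20trend"],
}
for trend, url in urls_to_scan(trends, results_by_trend, today):
    print(trend, url)
```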


-- Julien
