What’s with all those spam ping-bots?


Last December, some people started to get annoyed by the pingback-bots, and others were confused by them. What's the deal with those pingback-bots?

It's all about fooling the search engines in order to make money, taking advantage of friendly policies at domain registrars to make it less costly an undertaking.

Step one: Register a bunch of domains with a domain registrar that includes a money-back guarantee.

Step two: Set up fake blogs on each of those sites, with different keywords.

Step three: Use a script to search the blogosphere for articles that contain keywords that match your site. (There appears to be a single script that 90% of the spam blogs use, since they all look exactly the same, and have the same bugs!)

Step four: Create a bogus blog entry for each one that says something like "Hey, here's something interesting I found on the Internet" and then reprints the article in question. (You may notice that many of these sites mis-attribute the authorship; some of them even claim to have written the article themselves!)

Step five: Host ads on the site.

Step six: Just before the money-back guarantee period expires, look at each of your fake blogs to see which ones have made money from the ads and which ones haven't. Cancel the domain registrations of the ones that didn't make money.

Most of these sites are in existence for only a few days, so trying to stop each individual site is a waste of effort; the site is going away soon anyway. The way to get the attention of the spammers is to hit them in the pocketbook.

Go to the site and look at the ads. If they're using Google Ads, look for violations of the terms of service, such as having more than three sets of ads on a single page or hosting ads from other companies on the same page. Even if you can't find anything wrong, click the "Ads by Google" link.

From the Google Ads page, click "Send Google your thoughts on the site or the ads you just saw," then "Also report a violation," and then say that you had a problem with "the website," and then say that "The site violates AdSense policies in other ways." Here is where you can write "Hosted more than three ad blocks" or "Also hosts ads from competing vendor." But always write "Contains no original content."

The theory here is that once Google has determined that the site is violating AdSense policies, they will shut down the account, preventing them from getting any more money, which was the whole point of their scam in the first place.

Now, I don't hold out much hope that this will work, since I've reported sites and found that even weeks later, the site is still up, happily serving up Google ads and pocketing the click-throughs. But maybe it's because they don't act until there is some critical mass of complaints.

(I can find no way of reporting violations to the Yahoo Publisher Network.)

Another category of these types of sites is just people who reprint blog articles (usually erroneously attributed) in order to improve the search engine ranking of the non-spam part of the site.

Now, you may notice also that there is a "The site is hosting/distributing my copyrighted content" checkbox. That box is useless to me because I am not the copyright owner of the content of this blog. The content of this blog is owned by Microsoft Corporation. If you check that box, Google demands that you file a formal DMCA complaint, and I'm pretty sure our legal department is busy with plenty of more important things than chasing down people who rip off the content of some random employee's blog in order to generate ad revenue.

Normally you don't see the spam pingbacks because I tend to delete them pretty quickly. If you're really clever, you might use the fact that the spam pingbacks linger for days at a time to determine that I'm out of the office.

Sidebar: Here are some examples of spambots. Feel free to report them to the ad vendor, if they are hosting ads. And as I already noted above, some of these sites may already be down.

Update: The victory over 247blogging was short-lived. Within a month, they moved to a new ad company whose terms of service have no problem with sites with no original content.

One annoying consequence of all these content-scraping sites is that they end up ranking higher in Google than me, and I'm the one who wrote the article in the first place! For example, a Google search for Joshua Roman groupies on 17 February 2008 doesn't even show my blog article; instead, the top hits are

  1. A site which scraped my entry.
  2. Another page from the same site as #1 which also scraped my entry.
  3. A different site which scraped my entry.
  4. An article from this Web site but not the one that says Joshua Roman groupies in the title.
  5. Another misfire from this Web site.
  6. A third site which scraped my entry.
  7. A fourth site which scraped my entry.
  8. A fifth site which scraped my entry.
  9. An unrelated hit.
  10. Another unrelated hit.

So there you go. The top ten search results contain five sites that scraped my entry and no links to the original! On the other hand, Live Search is not fooled and finds the right article as the top search result. Yahoo ranks my article as #1 and #3 (go figure), which is nice, but all but one of the remaining hits are for scrapers.

A Google search for bands of Valentine minstrels is even worse. The first three hits are sites which scraped my article and there are no hits at all to this Web site in the top 100 search results, although nine scrapers rank in the top 100. Again, Live Search is not fooled and finds my article as its #1 hit. Yahoo also ranks my article at #1 although a scraper sneaks in at #2.

Comments (70)
  1. Spike says:

    Wow, I’m amazed that you put in those apparent plugs for Live Search without a "pre-emptive snarky comment".

    Watch those snarky comments fly…

  2. Will says:

    This is why I think the engines need lists of "trusted original content creators."  Sites like yours would be included as such and whenever outside sites glommed onto your posts the search engines would automatically recognize the hijacked content and erase ALL traces of the scammy site from their index plus deny ALL ad revenues.   It’d take a while to build the list of trusted creators of course.

    All this trackback stuff sure makes it easy though.  If a trackback link is found in your blog comments and it leads to a page on another site that duplicates more than some small percentage of your content then that site is stricken from the index and any ad revenues earned for that site are forfeited.
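
The "duplicates more than some small percentage of your content" test can be approximated by comparing overlapping runs of words (often called shingles) between the original article and the page behind the trackback. The sketch below is only an illustration under stated assumptions: the function names, the five-word shingle size, and the 30% threshold are all made up for the example, and nothing here reflects how any search engine actually detects copies.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: treat the whole thing as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def copied_fraction(original, candidate, size=5):
    """Fraction of the original's shingles that also appear in the candidate."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(candidate, size)) / len(source)


def looks_scraped(original, candidate, threshold=0.3):
    """True if the candidate reproduces more than `threshold` of the original.

    The 0.3 default is an arbitrary value chosen for illustration.
    """
    return copied_fraction(original, candidate) > threshold
```

A page that wraps the full article in a one-line introduction scores 1.0, while a genuine reply that quotes a sentence or two scores much lower. As a later comment points out, a legitimate mirror would also score 1.0, so a check like this can flag candidates but cannot tell permission from theft.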

  3. snark says:

    Your content is copyright MS? No wonder people get confused about its official/unofficial status.

    (How’s that one, spike? :-)

    Is MS not worried about people ripping off its copyright material then?

  4. Puckdropper says:

    Will,

    There are sites out there called "mirrors" which have permission to copy and duplicate all of another site’s content.  Your suggestion would disallow mirroring sites the ability to earn money in much the same way as their original.

    The idea of "trusted original source" is a good one, though.  It’s worth sending to Google.

  5. Nawak says:

    Wasn’t the "link rel=no-follow" attribute supposed to end this practice by nullifying the value of the link in the search engine algorithm?

    So, this seems to indicate that not all blog software use it and/or that Google fails to use it correctly.

    blogs.msdn.com does use this link categorization, therefore Raymond’s copied articles should have no weight in the search engine. (Judging from what we see in Raymond’s blog, I don’t think that the articles copies are linked anywhere else except the site where they are copied from)

    So either Google uses the whole-site weight (which is not null because of other blog software) to promote a page of 0-weight or Google doesn’t use the ‘no-follow’ attribute in its PageRank.

    I’m sure that Google knows that but didn’t act because of some side effect when they tried to do it. LiveSearch and Yahoo that are outsiders may care less about these "backwards compatibility" problems if it means they can surpass Google in an area and gain marketshare from that.
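
Whether a given page marks its outbound links this way is easy to check by hand. The attribute in question is `rel` on an `<a>` element, and the token to look for is `nofollow`. Here is a minimal sketch using only Python's standard-library HTML parser; the class and function names are invented for the example, and it says nothing about how any search engine weighs the links it finds.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, is_nofollow) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href is None:
            return
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def collect_links(html):
    """Parse an HTML string and return its links with their nofollow status."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Feeding it the HTML of a comment section shows at a glance which pingback links are marked `nofollow` and which are not.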

  6. What's with all those spam ping-bots?-download music says:

    PingBack from http://downloads-home.com/test/whats-with-all-those-spam-ping-bots/

  7. Page Rank » Blog Archive » What's with all those spam ping-bots? says:

    PingBack from http://pagerankpimp.com/2008/02/18/whats-with-all-those-spam-ping-bots/

  8. What's with all those spam ping-bots? says:

    PingBack from http://makemoney.buckethunt.com/?p=3380

  9. What's with all those spam ping-bots?-unlimited music downloads says:

    PingBack from http://alaska.realestate-investment-solutions.com/whats-with-all-those-spam-ping-bots/

  10. What's with all those spam ping-bots? says:

    PingBack from http://info.biyad.com/?p=63804

  11. Eric Burnett says:

    I expect that Google does honour ‘nofollow’ links, considering it was their idea in the first place. The site in question would interlink with itself, however, so it wouldn’t be a 0-weight page. It would be applying link weight from other blogs to all pages it scrapes, balancing its pages out.

    To help combat domain tasting, Google is implementing some new strategies, including (I think, not fact) not indexing domains registered within 5 days, until they cannot be returned for free. I would be interested in knowing what Live Search is already doing, however.

    Joi Ito discusses a bit more about domain tasting:

    http://joi.ito.com/archives/2005/12/01/the_parked_domain_monetization_business.html

  12. What's with all those spam ping-bots?-Download Music says:

    PingBack from http://www.online-precision.com/whats-with-all-those-spam-ping-bots/

  13. What's with all those spam ping-bots? | Domains Yahoo says:

    PingBack from http://domains-yahoo.thegeekyblog.com/2008/02/18/whats-with-all-those-spam-ping-bots/

  14. random joe says:

    "If you check that box, Google demands that you file a formal DMCA complaint"

    But I don’t live in the USA so they don’t demand it for me, but what if I were in the USA and the infringing website isn’t? Hmmmm…

  15. /.er says:

    Matt Cutts http://www.mattcutts.com/ are you out there? We’ve got a real problem over here…

  16. Grijan says:

    Raymond, I think that all pingbacks in this article’s comments should be left as an example of the real problem, even if you kill them from all other pages.

    Great article. It gives a detailed explanation about WHY something does happen. It is really interesting, even if it isn’t about Windows’ user interface. Keep up the good work (and don’t worry about nitpickers)!

  17. Dude, thanks a lot for posting this… those things have annoyed and confused me for ages, now that i know what they are I’ve been to every single one of them to report the abuse:)

  18. Asztal says:

    Thankfully, Google puts you at #3, by my reckoning. Amusingly, the second result is a splog syndicating this very post.

  19. Gwyn says:

    Hi Raymond,

    I think you are attacking the problem from the wrong angle. If you want Google to do something about it, you need to hit them where it hurts them, i.e., their pocketbook. Complain to the original advertisers, saying something like "Hey, you are advertising on http://www.spamblog.com/ripping-off-raymond via Google AdSense. Unfortunately this page is just a copy of my own original article at http://blogs.msdn.com/…. I am really upset that you would advertise in this fashion and take advantage of my hard work. Please remove this ad." Advertisers tend to be sensitive to complaints of this nature, and I bet Google will be pretty sensitive to advertisers removing ads.

  20. Geld Lenen says:

    Asztal, in the Netherlands Google shows this entry  on the first 2 results.

    Below these 2 results, a *lot* of scrapers.

    [Oh great, an article talking about how hard it is to find the article titled “Joshua Roman groupies” shows up in a search for “Joshua Roman groupies” and the actual original article doesn’t show up in the Google results. -Raymond]
  21. What’s with all those spam ping-bots? | Domains Yahoo says:

    PingBack from http://domains-yahoo.thegeekyblog.com/2008/02/18/whats-with-all-those-spam-ping-bots-2/

  22. Sitten Spynne says:

    Hmmm… since I browse with JavaScript disabled I’m not seeing any ads on those sites; could it be that this costs bandwidth but doesn’t generate revenue?  (I actually don’t know because I don’t know how web advertising works!  But if I’m not seeing a banner or image, and the script that records my visit doesn’t run, wouldn’t that not count as a hit even though I consumed the images?)

  23. Andrew says:

    Why not host your blog elsewhere and retain copyright ownership of your writing?

    [Microsoft has the copyright to all of my computer-related writing regardless of where I post it. It’s part of the standard employment agreement, and it’s probably part of yours too. -Raymond]
  24. Evan says:

    I *was* about to post that it would be amusing if the spammers would pick up and copy this article that explains how to cause their downfall, but it looks like I’ve been beaten to the punch. (Whether the previous post (12:50pm) is a joke by someone or a "legitimate" ping-back is a question I will not consider further.)

  25. Cheong says:

    I believe the search engines have ways to submit the web links to for them to crawl.

    And the copiers apparently have more time to submit their website before the web-spiders find them (web spiders visits websites on average 2-3 days intervals unless demanded). When they do this, their page contents would appear to be the first occurrence to the search engine, therefore makes the search engine think that it’s "you" who copied from their site. (The create date on the web is not to be trusted, isn’t it?)

    I can easily guess that the web spiders from search.live.com can have higher search rate on Microsoft owned/sponsored website than the others. (Web spiders generate loading to web servers so it’s good to make the visits "not too frequent" unless you know it’s okay)

    Btw, the above is just my guess and may not be true…

  26. Dean Harding says:

    "I can easily guess that the web spiders from search.live.com can have higher search rate on Microsoft owned"

    Actually, I’d say it’s probably because the spammers don’t bother to submit their fake sites to yahoo/microsoft and so they get a more balanced view of the world.

  27. Cheong says:

    Dean Harding: Even if they don’t bother to submit the links, with an even visiting interval and the fact that there are many copiers’ sites versus one original, the copiers’ sites should have a higher probability of being detected first.

    The fact that they return the correct result suggests either that some effective measure is at work or that the search results are simply biased (in a good way).

  28. GreaseMonkey says:

    I think the reason why Google is "fooled" is because spambots tend to "target" Google. This is one of the few cases where I’m glad that many douchebags only support certain things.

    Kinda like how Windows is frequently targeted with viruses. There’s still a fair whack of Linux shellcode lying around on the internet, though. /me wants BSD… or a MIPS Malta…

  29. Anthony Wieser says:

    Equally annoying are all those copies of usenet articles that clutter up the google web search.

    Maybe I should start using live.com to search…

    Sure enough Raymond, "Joshua Roman groupies" would be the I feel lucky hit there!

  30. Dean Harding says:

    Sitten Spynne: You only get money if someone *clicks* on the ad… if nobody clicks, you don’t get any money.

    Personally, I can’t see how the click-through rate is all that high on these sites. But I guess it must work.

    Like Eric Burnett, I have also read that Google is going to be blocking domain tasting. Here’s a link: http://blog.domaintools.com/2008/01/google-to-kill-domain-tasting/

    The idea behind domain tasting was that *legitimate* people would be able to "taste" a domain for a period of time to see how it works in terms of page rank and so on. They could "try out" a couple of different domains and then choose the one which worked the best. Of course, we’re all very much aware of what happens when you trust people to do the right thing…

  31. MS says:

    "Maybe I should start using live.com to search…"

    I have been using it for a while now and I only miss some of the more advanced calculator bits, especially unit conversions.

    I wonder if the spam blogs are using zombie machines.  Otherwise, identifying certain IPs might be a feasible way to block the crawler itself.  Then again, I have never felt particularly fond of pingbacks as it seems like an easily abusable feature.  There just has to be a better solution to the problem than manual action, as abusers are machines who can do this a hundred times a second.

  32. GreaseMonkey says:

    On second thoughts, I should build a script to flood those domain tasting website comment boxes with flames, e.g. "YOU F***ING P***S, GO F***ING EAT A D**K YOU F***ING BUM**XER, YOU ARE A F***ING NOOB SO F*** THIS YOU S***HEAD;", except it’d be like 10 times longer.

  33. Dave says:

    Most of these scrapers are targeted to get a higher rank in google, they don’t try to ‘fool’ live search or yahoo in the first place.

  34. "On second thoughts, I should build a script to flood those domain tasting website comment boxes with flames, e.g. "YOU F***ING P***S, GO F***ING EAT A D**K YOU F***ING BUM**XER, YOU ARE A F***ING NOOB SO F*** THIS YOU S***HEAD;", except it’d be like 10 times longer."

    You can improve on that idea by posting porn and other illegal or un-child-friendly materials. See if you can get the advertisers and search engines to either drop them or label them adults only.

  35. SM says:

    @Gwyn: Unfortunately, the way Google ads works, the advertisers don’t get to micromanage which specific sites are advertised on.  They can elect to only advertise on google.com and opt out of the content networks in general — and many do — but this may not really affect the money between Google and the advertiser.

  36. Alex says:

    I am attempting to not be snarky.

    It seems to me the economics behind the pingback bot makes google the best target.  It is the part of the same reason MS still makes piles of money by taking 90% of the desktop market.  Why go out of your way to make sure you get that extra 2% of the market by fooling whatever the live search people are doing these days when your resources are better spent, counting money, pharming bots, etc, enlarging organs.

    On top of it, Google has published quite a bit about their search engine.  They keep quiet about their anti-junk process but it is defeated by spammers regularly, as is any system that is a big enough target.

  37. What if... says:

    It seems that one technical solution to this might be practical, given the readership.

    For each day, grab all pingbacks from the site

    For each pingback, grab the page, run through ECMA script engine, and extract all URL’s

    For each URL, generate a click (onclick() or just follow URL )

    This should have two effects

    1. The ad servers will detect "fake" clicks and devalue all clicks from the page

    2. It will be difficult to judge the value of the tasted domains

  38. Igor Levicki says:

    74.208.85.32 downloads-home.com

    69.93.239.11 pagerankpimp.com

    74.54.143.50 makemoney.buckethunt.com

    74.208.26.250 alaska.realestate-investment-solutions.com

    74.54.131.194 info.biyad.com

    64.202.166.209 www.online-precision.com

    78.110.160.131 domains-yahoo.thegeekyblog.com

    144.92.194.24 lonesysadmin.net

    203.88.118.161 blog.wisefaq.com

    In case someone wants to do something.
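
One thing a list like this makes possible is spotting hosts that sit in the same address range. The sketch below groups the address/host pairs from the comment above by network prefix, using Python's standard `ipaddress` module. The function names and the choice of a /16 prefix are assumptions made for the example. Sharing a prefix may mean nothing more than sharing a hosting provider, so treat the output as a hint, not as evidence of common ownership.

```python
import ipaddress
from collections import defaultdict

# Address/host pairs copied from the comment above.
RESOLVED = [
    ("74.208.85.32", "downloads-home.com"),
    ("69.93.239.11", "pagerankpimp.com"),
    ("74.54.143.50", "makemoney.buckethunt.com"),
    ("74.208.26.250", "alaska.realestate-investment-solutions.com"),
    ("74.54.131.194", "info.biyad.com"),
    ("64.202.166.209", "www.online-precision.com"),
    ("78.110.160.131", "domains-yahoo.thegeekyblog.com"),
    ("144.92.194.24", "lonesysadmin.net"),
    ("203.88.118.161", "blog.wisefaq.com"),
]


def group_by_network(pairs, prefix_len=16):
    """Map each network (as a string such as '10.1.0.0/16') to its hosts."""
    groups = defaultdict(list)
    for address, host in pairs:
        # strict=False masks off the host bits instead of raising an error.
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        groups[str(network)].append(host)
    return dict(groups)


def shared_networks(pairs, prefix_len=16):
    """Keep only the networks that contain more than one host."""
    groups = group_by_network(pairs, prefix_len)
    return {net: hosts for net, hosts in groups.items() if len(hosts) > 1}
```

With the nine entries above and a /16 prefix, two ranges hold more than one host: `74.208.0.0/16` and `74.54.0.0/16`.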

  39. Cooney says:

    [Microsoft has the copyright to all of my computer-related writing regardless of where I post it. It’s part of the standard employment agreement, and it’s probably part of yours too. -Raymond]

    Not mine. My company owns the stuff I do for them. If I write about Hibernate (or whatever) and it isn’t code related to what my company is doing (i.e., I can write about things I’ve learned on the job or I can write software not related to my job), my company doesn’t own it.

    I’d be surprised if MS can legally claim ownership of your blog, although it may cost too much to test it.

    [Again, you focus on the details and miss the point. I was responding to the suggestion that I move the blog to another site to avoid the Microsoft copyright. Thank you for confirming my point (even if you disagree with the fine details). -Raymond]
  40. GreaseMonkey says:

    Here are what I deem to be dodgy:

    • oggin.net
    • domains-yahoo.thegeekyblog.com

    lonesysadmin.net looks a little fishy. wisefaq.com looks ok.

  41. I was reading this post about spam ping-bots from OldNewThing and it makes me a bit mad because they

  43. Dave says:

    Since the suggestion box is now closed to prevent this kind of abuse, how do we submit suggestions?

    [Dup. The suggestion box is not closed due to spam ping-bots. It’s closed because it’s got enough suggestions to last until 2010 if not beyond. I won’t need any more suggestions for a while. -Raymond]
  44. Alex Railean says:

    Why won’t you turn this feature off?

    If I want to find out who else refers to a story I wrote, I use a search engine to reveal the pages that point to the URL of the original story. Alternatively, I can browse the logs of my site and see which places readers came from.

    [Trackbacks have a useful purpose. If somebody wants to reply to one of my articles on their blog, a trackback is their way of indicating that the conversation has continued elsewhere. Turning off trackbacks won’t stop the scrapers. -Raymond]
  45. What's with all those spam ping-bots?-Download Music says:

    PingBack from http://hobbies-news.info/whats-with-all-those-spam-ping-bots/

Comments are closed.