Syndicated search engines broken – Part II


A few days ago I grumbled at the poor state of the search engines specializing in syndicated (RSS’d or Atomized) content.


Today, Marshall Kirkpatrick is enthusiastically supporting a proposed standard from Bloglines that tries to solve an apparent problem:



“‘Everything you blog goes on your permanent record!’ How many times have we heard that lately? From employment to family situations, many people have been frustrated to find out that things they intended to write for a personal audience are now discoverable by anyone in the world via search engines.”


From the Bloglines proposal:



“As a result, we are proposing (and have implemented) an RSS and ATOM extension that allows publishers to indicate the distribution restrictions of a feed. Setting the access restriction to ‘deny’ will indicate the feed should not be re-distributed. In a nutshell, the proposal…”
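
To make this concrete, here is a minimal sketch of how a server-side aggregator might respect the flag before indexing a feed. The namespace URI and the element/attribute names are my assumptions from a reading of the Bloglines proposal, so check the spec itself before relying on them:

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed namespace and naming from the Bloglines proposal --
    # verify against the published spec before relying on it.
    ACCESS_NS = "http://www.bloglines.com/about/specs/fac-1.0"

    def may_redistribute(feed_url):
        """Return False if the feed carries an access restriction of 'deny'."""
        with urllib.request.urlopen(feed_url) as resp:
            root = ET.fromstring(resp.read())
        # The extension element is expected at the channel (RSS) or feed
        # (Atom) level; scanning the whole tree keeps the sketch simple.
        for el in root.iter("{%s}restriction" % ACCESS_NS):
            if el.get("relationship", "").lower() == "deny":
                return False
        return True

    if may_redistribute("http://example.com/feed.xml"):
        print("OK to index and redistribute this feed")
    else:
        print("Publisher opted out - skip this feed")

The point being: the flag is trivial for a well-behaved aggregator to implement – which, as I argue below, is exactly the problem.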


I respectfully disagree with Marshall’s view here and, as a user of these services, cannot support the proposal, for three reasons:


1. Keeping stuff out of participating engines wouldn’t prevent leakage. As one commenter (‘007’) on the quoted post has already pointed out, how do you avoid the repost scenario? If you really need to sneak stuff under the radar (to avoid getting fired???), use something other than a public blogsite – you will be found. Another reason: why wouldn’t some service providers show up that don’t adhere to the rules – ignoring them ensures they catch the slime? (I could imagine a ‘Slimesearch’…). Private networks are OK – group IM, SSL’d groups, etc. (even company email is considered leaky) – but just don’t use inherently public networks for this kind of stuff.


2. A common issue with search results is spam, and spammers won’t use the tag – an opt-in flag only constrains the well-behaved. I realize spam isn’t a stated concern of the proposal, but it’s worth pointing out, I think.


3. IMHO, these guys (Bloglines, Technorati, etc.) should be focused on precisely the reverse of the ‘problem’ they are trying to solve with the access:restriction tag – they should be working towards more complete indexes, not less complete ones.


Overall, this syndicated-content search space is broken. The priorities seem wrong here – I don’t see this step getting us any closer to better services when there are other, much more fundamental issues that need solving.

Comments (17)

  1. Lauren Smith says:

    I’m not sure I understand the problem.

    There are many free blogging sites that will allow you to create a blog pretty much anonymously, so it’s not like you can’t spout off like a moron to your heart’s content.

    Many of these free blogging sites have features which restrict access to authorized members, even for individual posts, not just whole blogs.

    If you’re running your own webserver, you can turn away most search engines by fixing your robots.txt file (see the first sketch after this thread). And if you’re already doing that much, what could be simpler than setting up a password scheme to block out anyone without a valid logon?

    The whole idea is crazy. I don’t get it.

  2. Lauren, the problem is not so much people spouting off anonymously as it is corporate sources not wanting their content indexed.  This is not necessarily confidential material.  Publishers have their own reasons for not wanting content indexed, and server side aggregators need to respect that.  Currently, publishers have to make requests of individual aggregators, and this proposal would automate that process.

    As for robots.txt, I strongly believe that this is a misapplication of the robots.txt protocol.  Simply polling a syndicated feed is NOT robotic behavior, and imposing the robots.txt convention places undue burden on both sides of the wire.  The issue here is not search engines but aggregators – specifically server-based aggregators: Bloglines, NewsGator, Technorati, etc.

  3. typo

    Marshall Kirkpatrick

  4. I agree, there are lots of problems in the blogosphere, but if you look at this extension as a way of telling Bloglines that you don’t want your RSS in Bloglines’ search results, then it’s a good fit.

  5. Lauren Smith says:

    First, I can’t think of why a program which periodically polls a feed wouldn’t be considered a robot. But beyond that, I also don’t see why content would necessarily be uploaded to a feed without explicit or implicit agreement by the creator of that content.

    In other words, if I write a blog that I don’t want grepped, why would I add it to the feed in the first place? If the blog scraper is actually reading the entire blogsite to find my headlines, 1) it is a misbehaving aggregator and 2) it’s Yet Another Search Engine.

    I suppose it’s my own hubris that’s making me blind to this issue. I publish to the web because I want people to read it. I reserve confidential notes to email and other forms of personal communication.

  6. "I can’t think of why a program which periodically polls a feed wouldn’t be considered a robot."

    http://www.robotstxt.org/wc/faq.html#what

    "A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced."

    I don’t think my conclusion that periodically polling a feed is NOT robotic behavior is entirely unjustified.

  7. cori says:

    Man, where did that link on my pingback come from??  Oi!

  8. MSDN Archive says:

    cori – the pingbacks and trackback behaviours on this blogware are beyond my humble understanding…

  9. Lauren Smith says:

    I’m still going to have to disagree and refer you to the rest of the FAQ answer:

    "Note that "recursive" here doesn’t limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot."

    Also:

    Q: "What other kinds of robots are there?"

    A: "…"What’s New" monitoring…"

    So an autonomous agent that semi-regularly requests a document from a webserver could definitely be categorized as a robot.

    But then we’re back where we started. If a search engine can scrape the blog site because robots.txt doesn’t disallow it, then any information on that site can be found using a search engine. If the only goal is to block it from appearing in your RSS feed, then what’s the point of 1) publishing it to the web in the first place and 2) publishing it to the RSS feed in the second place?

    To publish anything to the web without any protections save some infinitely ignorable tag is an absurdity if your goal is to keep that information restricted. If security and privacy are what you want, you’re better off using a password-protected blog (see the second sketch after this thread) than something publicly accessible.

  10. monkchips says:

    marshall *kirkpatrick*

  11. MSDN Archive says:

    James – just waiting for you to notice 😉

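
Update: two quick sketches to ground the comment thread, both illustrative only.

On the robots.txt question (comments 1, 5, 6 and 9): whether or not a feed poller counts as a ‘robot’, a polite one can check robots.txt for next to nothing with the Python standard library. The user-agent and URLs here are hypothetical:

    import urllib.robotparser

    # Hypothetical poller identity and site -- substitute real values.
    USER_AGENT = "ExampleFeedPoller/1.0"
    ROBOTS_URL = "http://example.com/robots.txt"
    FEED_URL = "http://example.com/feed.xml"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(ROBOTS_URL)
    rp.read()  # fetch and parse the site's robots.txt

    # Note the shape of the program: unlike a crawler, a poller only ever
    # requests one known URL -- the crux of the 'is it a robot?' debate
    # above. Either way, the check costs next to nothing.
    if rp.can_fetch(USER_AGENT, FEED_URL):
        print("robots.txt permits fetching", FEED_URL)
    else:
        print("robots.txt disallows fetching", FEED_URL)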
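
And on Lauren’s password suggestion (comment 9): putting a feed behind HTTP Basic Auth takes a handful of lines in any web stack. A toy, standard-library-only sketch – the credentials, port and feed content are placeholders, and a real deployment would at least add SSL:

    import base64
    from http.server import BaseHTTPRequestHandler, HTTPServer

    USERNAME, PASSWORD = "reader", "secret"  # placeholder credentials

    class FeedHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            token = base64.b64encode(
                ("%s:%s" % (USERNAME, PASSWORD)).encode()).decode()
            if self.headers.get("Authorization") != "Basic " + token:
                # No valid logon: challenge the client instead of serving.
                self.send_response(401)
                self.send_header("WWW-Authenticate",
                                 'Basic realm="private feed"')
                self.end_headers()
                return
            self.send_response(200)
            self.send_header("Content-Type", "application/rss+xml")
            self.end_headers()
            self.wfile.write(b"<rss version='2.0'><channel/></rss>")

    HTTPServer(("", 8080), FeedHandler).serve_forever()

Anyone without the placeholder credentials – a server-side aggregator included – gets a 401 instead of the feed.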