Bayesian filtering for annoying blog topics


Next time I find myself with a few free hours, I know what to do: take a Bayesian spam filter and modify it to do away with the following families of blog posts (which make up 80% of my daily weblog intake):


1.   I just installed [Whidbey / Longhorn / Whatever] and it rocks!


2.  [name]  [ says / on to something / points out / links to / announces  ]   [ Whatever ]


3.  I just saw the [ Matrix / LOTR / Whatever ] and it [ sucks / rocks ]


4. Anything with the words “DON” and “BOX” in the same sentence (except for his blog, of course)


But really, has anyone considered a mechanism for filtering blogs?  I would love a “bool FilterPost(xmlElement Item)” pluggable interface for SharpReader … <hint hint>
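For the curious, here is a rough sketch of what such a hook might look like, written in Python for brevity. Everything in it is an assumption for illustration: the names `NaiveBayesPostFilter` and `filter_post`, the "noise" and "keep" labels, and the idea that each post arrives as an RSS `<item>` element with `title` and `description` children. It is not SharpReader's API, and the classifier is a bare-bones naive Bayes with add-one smoothing rather than a tuned spam filter.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesPostFilter:
    """Hypothetical two-class naive Bayes model: 'noise' vs 'keep'."""

    def __init__(self):
        self.word_counts = {"noise": Counter(), "keep": Counter()}
        self.post_counts = {"noise": 0, "keep": 0}

    def train(self, text, label):
        """Record one example post under the given label."""
        self.word_counts[label].update(tokenize(text))
        self.post_counts[label] += 1

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["noise"]) | set(self.word_counts["keep"]))
        # Class prior and word likelihoods, both with add-one smoothing.
        score = math.log(
            (self.post_counts[label] + 1) / (sum(self.post_counts.values()) + 2)
        )
        for token in tokens:
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def is_noise(self, text):
        # An untrained model has no opinion, so let everything through.
        if not any(self.post_counts.values()):
            return False
        tokens = tokenize(text)
        return self._log_score(tokens, "noise") > self._log_score(tokens, "keep")


def filter_post(item, model):
    """Return True if the aggregator should hide this <item> element."""
    title = item.findtext("title", default="")
    description = item.findtext("description", default="")
    return model.is_noise(title + " " + description)


if __name__ == "__main__":
    model = NaiveBayesPostFilter()
    model.train("I just installed Longhorn and it rocks", "noise")
    model.train("I just saw the Matrix and it sucks", "noise")
    model.train("Notes on garbage collection internals in the CLR", "keep")
    model.train("How finalization and weak references interact", "keep")

    post = ET.fromstring(
        "<item><title>I just installed Whidbey and it rocks!</title>"
        "<description></description></item>"
    )
    print(filter_post(post, model))
```

The aggregator would call `filter_post` once per incoming item, and a "this is noise" button in the reader would feed `train`. With only four training posts the result is a toy, of course; a real filter would need a few hundred labeled posts before I would trust it.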


 


 

Comments (6)

  1. Anon says:

    Could it filter out your posts while you’re at it?

  2. Addy Santo says:

    Cute.

    So I take it you also think the idea has some merit? I’ll filter out my noise and you can filter yours.

  3. Cory Smith says:

    If you don’t like the blogs, don’t subscribe. Blogging is about sharing… just because you don’t like something doesn’t mean others won’t. Also, not everyone reads everyone else’s blog, so part of the way information gets propagated is by people pointing out what others are saying. It’s not like you haven’t asked to receive this information (blogs)… you subscribed to them. If you don’t want them, unsubscribe.

  4. Dumky says:

    It’s a good idea. The only problem is that we won’t know whether it actually works before implementing it 😉

    I recently came across another guy (besides you and me) with a similar idea: http://www.jimohalloran.com/archives/000228.html

    +1 for the hook in SharpReader 🙂

    Note: Bayesian filters aren’t rule-based. You just teach them what is "spam" and what is "ham", and then they use probabilities to filter out future posts.

  5. Addy Santo says:

    Cory:

    You don’t get it. I don’t want to shut anyone up – I just want an automated way to weed the noise out of the data I pull down to my desktop.

    A large percentage of the information flowing into my aggregator is pre-aggregated: the DotNetJunkies main feed, CNET news, Techweb, slashdot, MSDN articles. And even the personal blogs I read have a staggering amount of duplicated information: every Don Box post or new code release is mirrored by dozens of people.

    If everyone were as focused as Chris Brumme I wouldn’t be complaining. But it is a real chore to go over 500+ feeds a day just for the 2-3 gold nuggets I find. I would LOVE a flexible filtering tool which would weed out some of the noise for me.

  6. Jeff Key says:

    My solution is to not subscribe to the aggregate blogs like this one, but to people you can rely on for good posts with little to no noise. Many of the aggregator blog postings are links to other blogs, as you mentioned, so there’s a good chance that you’ll get the same links in the non-aggregator blogs you subscribe to. Works for me. (I do scan the web front-end of aggregated blogs every once in a while if I have a minute, just to see if anything interesting is posted; that’s how I got here.)