Trouble at SORBS

The Register reports today that a glitch on the SORBS blocklist over the past 24 hours caused many thousands of legitimate emails to be blocked:

The problems at SORBS — short for the Spam and Open Relay Blocking System — began on Wednesday and continued into much of Thursday, said Michelle Sullivan, who founded the real-time blacklisting service in 2002 and sold it to GFI Software last year. As a result, messages sent from a huge number of legitimate mail servers were labeled as junk mail and returned to sender.

The snafu was the result of a transition from one SORBS system to another that corrupted a database containing potentially millions of IP addresses, Sullivan told The Register. SORBS admins have responded by temporarily clearing out the entire table of faulty listings under the theory that it's better to let through spam than to block real email. They are in the process of rebuilding the database and populating it to user servers around the world, a process that could take up to 24 hours.

The portion of the database that was corrupted stored entries for its DUHL, or dynamic user host list, which mainly contains dynamic IP addresses offered by ISPs, Sullivan explained.

The DUHL, as I understand it, appears to be similar to the PBL offered by Spamhaus.  These IPs should not be sending mail directly and can therefore be blocked.  If you’ve not been keeping up with what is going on at SORBS, they are upgrading from SORBS 1.0 to SORBS 2.0, which is an overhaul of their existing system.  From what I read above, SORBS maintains a pretty extensive database of IP addresses.  During the data migration, one of the bit fields may not have been set, and the historical list, which contained a mix of live and deactivated entries, ended up with all of the deactivated entries marked as active.  This has since been corrected.
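
To make that concrete, here is a hypothetical sketch (Python with SQLite, purely for illustration; the table and column names are invented and have nothing to do with SORBS’s actual schema) of how dropping a single status flag during a row-by-row copy can silently re-activate every delisted entry:

```python
# Hypothetical sketch of how a missed status bit can corrupt a migrated
# blocklist. "duhl" and "is_active" are invented names, not SORBS's schema.

import sqlite3

def migrate(old_db: sqlite3.Connection, new_db: sqlite3.Connection) -> None:
    # Old schema: (ip TEXT, is_active INTEGER) where 0 = delisted, 1 = listed.
    # New table is created with a default of 1 ("active") on the status column.
    new_db.execute(
        "CREATE TABLE duhl (ip TEXT PRIMARY KEY, is_active INTEGER DEFAULT 1)"
    )

    for ip, is_active in old_db.execute("SELECT ip, is_active FROM duhl"):
        # BUG: the status column is omitted from the INSERT, so every row,
        # including historical/delisted entries, silently becomes active again.
        new_db.execute("INSERT INTO duhl (ip) VALUES (?)", (ip,))

        # FIX: carry the flag across explicitly.
        # new_db.execute(
        #     "INSERT INTO duhl (ip, is_active) VALUES (?, ?)", (ip, is_active)
        # )

    new_db.commit()
```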

This illustrates some of the logistical issues affecting the world of email filtering, and it is not something unique to SORBS per se; rather, it is a software engineering issue, and it happens to everyone.  It is not just about filtering algorithms but also about maintaining the infrastructure necessary to deliver a service.  In order to maintain a blocklist you need the following:

  1. Mechanisms to capture data (e.g., honeypots or log analysis)
  2. Algorithms to analyze the captured data
  3. Databases to store either the data itself or its meta-information
  4. Mechanisms to transfer the data (push it to end users, or pull it from a provider and then push it to your mail filters)
  5. Software that queries the IP blocklists (built into most commonly used MTAs today; see the sketch after this list)
  6. Portals for investigating listings (i.e., a front-end GUI)
  7. A process for delisting

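Point (5) is worth a quick illustration.  Most MTAs do the lookup natively, but mechanically it is just a reversed-octet DNS query against the blocklist zone.  A minimal sketch, assuming SORBS’s public aggregate zone dnsbl.sorbs.net (return codes differ between lists, but an answer in 127.0.0.0/8 generally means “listed”):

```python
# Minimal DNSBL lookup: reverse the IP's octets, append the blocklist zone,
# and do an A-record query. An answer in 127.0.0.0/8 means "listed";
# NXDOMAIN means "not listed".

import socket

def is_listed(ip: str, zone: str = "dnsbl.sorbs.net") -> bool:
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        answer = socket.gethostbyname(query)   # e.g. "127.0.0.10"
        return answer.startswith("127.")
    except socket.gaierror:
        return False                           # NXDOMAIN: not listed

if __name__ == "__main__":
    # 127.0.0.2 is the conventional "always listed" test address for DNSBLs.
    print(is_listed("127.0.0.2"))
```
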
As you can see, when running a filtering service, the operational issues are significant.  For example, points (4) and (5) are going to be prone to failure.  If you have been running a service for years and someone leaves, then a crontab entry may not get checked and disk space can fill up.  If disk space fills up, then perhaps new files cannot be generated.  If new files cannot be generated, particularly for IP blocklists, then after about 3 hours you will start to see a noticeable degradation of the service (3 hours is the magic number I have discovered by trial and error for when we start noticing things change), because the lists are stale and new spamming IPs are being brought up all the time.  When IP blocklists go down, more mail is forced through the content filter, which strains hardware resources since content filtering is more expensive.  More mail then gets delivered to downstream customers, which uses more bandwidth and causes email delays… all because some disk space filled up because someone forgot to check disk usage.  That’s what happens when you have a very complex software ecosystem and a large team: things get missed.
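
The fix for that particular failure mode is boring but effective: a cron-driven guard that checks free space before regenerating list files and complains loudly when it is low.  A minimal sketch, with an invented path and threshold:

```python
# Sketch of a pre-flight disk check for a list-generation job. The path and
# threshold are illustrative only, not from any real deployment.

import shutil
import sys

def check_disk(path: str = "/var/lib/blocklists", min_free_pct: float = 10.0) -> bool:
    usage = shutil.disk_usage(path)
    free_pct = 100.0 * usage.free / usage.total
    if free_pct < min_free_pct:
        print(f"WARNING: only {free_pct:.1f}% free on {path}; "
              "skipping list regeneration", file=sys.stderr)
        return False
    return True

if __name__ == "__main__":
    # Non-zero exit lets cron (or a monitoring wrapper) flag the failure.
    sys.exit(0 if check_disk() else 1)
```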

What I have found to be particularly vulnerable is the migration from an old system to a new one.  This is especially true when changing platforms (e.g., MySQL to SQL Server, or Linux to Windows).  If the old infrastructure has been in place for years, and especially if it was built at a startup, then we need to be aware that startups are created and run with the focus on getting functionality up and running, not on maintainability.  Those of you who are coders will be nodding your heads in agreement.  You stay up all night writing the software or getting the servers set up, but you take a lot of shortcuts and don’t write everything down.  You also don’t come back and document it afterwards because you’re busy with the next thing.  This works well so long as you are the one maintaining the code, but if you ever leave and someone takes over, and then they have to migrate… key functionality can be missed when the migration is done.  “Oh, this variable was supposed to look in that file… and the file didn’t exist and so it crashed.  Oops.”  Or a database value isn’t imported with its defaults.  Or the wrong defaults are set.  Or appropriate seed data isn’t pre-populated.  All of these things can cause errors when running a service.
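
One cheap defense against those last few failure modes is a post-migration sanity check that compares the old and new databases before you cut over.  A sketch along those lines, again using the invented schema from the earlier example:

```python
# Hypothetical post-migration sanity check: compare row counts and the
# active/inactive split between old and new databases before cutting over.
# Schema is illustrative only.

import sqlite3

def sanity_check(old_db: sqlite3.Connection, new_db: sqlite3.Connection) -> None:
    checks = [
        ("total rows", "SELECT COUNT(*) FROM duhl"),
        ("active rows", "SELECT COUNT(*) FROM duhl WHERE is_active = 1"),
        ("inactive rows", "SELECT COUNT(*) FROM duhl WHERE is_active = 0"),
    ]
    for label, query in checks:
        old_count = old_db.execute(query).fetchone()[0]
        new_count = new_db.execute(query).fetchone()[0]
        if old_count != new_count:
            raise RuntimeError(
                f"{label}: old={old_count}, new={new_count}; do not cut over"
            )
    print("counts match; migration looks sane")
```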

As I say, it isn’t an issue with the software itself but one of software engineering.  I don’t know if this is what happened at SORBS, only that I can sympathize with whatever happened over there and recommend we cut them some slack.  Yup, it happens.  Heck, I’ve brought down our service twice since I’ve been here (tied with two other people).  In the services world, you get feedback on these types of things very quickly.