The antispam accuracy of sender verification

Three simple techniques that are used as inputs for filtering spam are the following:

  1. Check to see if the sending domain in the SMTP MAIL FROM has an MX record
  2. Check to see if the sending domain in the SMTP MAIL FROM has an A-record
  3. Check to see if the sending IP has a reverse DNS
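As a rough illustration, the three checks above can be sketched in Python. This is only a sketch under stated assumptions: the function names are hypothetical, the A-record and reverse DNS lookups use the standard `socket` module, and the MX lookup is passed in as a callable because the standard library has no MX resolver (you would supply one from whatever DNS library you use). The result is a set of signals, not a verdict.

```python
import socket
from typing import Callable, Dict


def has_a_record(domain: str) -> bool:
    """True if the domain resolves to at least one IPv4 address."""
    try:
        return bool(socket.getaddrinfo(domain, None, socket.AF_INET))
    except (socket.gaierror, UnicodeError):
        return False


def has_reverse_dns(ip: str) -> bool:
    """True if a reverse (PTR) lookup on the IP succeeds."""
    try:
        socket.gethostbyaddr(ip)
        return True
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def sender_signals(
    mail_from: str,
    sending_ip: str,
    mx_lookup: Callable[[str], bool],  # hypothetical: caller supplies an MX resolver
    a_lookup: Callable[[str], bool] = has_a_record,
    ptr_lookup: Callable[[str], bool] = has_reverse_dns,
) -> Dict[str, bool]:
    """Return the three signals as inputs for a filter; nothing is blocked here."""
    # Pull the domain out of an SMTP MAIL FROM such as "<bob@example.com>".
    address = mail_from.strip().strip("<>")
    domain = address.rsplit("@", 1)[-1].lower() if "@" in address else ""
    if not domain:
        # Null sender ("<>", used by bounces): the domain checks do not apply.
        return {"no_mx": False, "no_a": False, "no_rdns": not ptr_lookup(sending_ip)}
    return {
        "no_mx": not mx_lookup(domain),
        "no_a": not a_lookup(domain),
        "no_rdns": not ptr_lookup(sending_ip),
    }
```

The lookups are injectable so the logic can be exercised without touching the network, and so a caching or asynchronous resolver can be swapped in later.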

The point of the first two is to see if the sending domain exists.  Spammers don't care about receiving answers to their messages (except in the case of 419 spam), so the theory is that a sender whose domain does not exist is probably a spammer.  In the third case, spammers will often hijack IPs with no reverse DNS so as to avoid reputation filters, so no reverse DNS = suspicious.

Customers have often asked why we do not have outright blocks on mail that meets any of these criteria.  My answer is always the same: these techniques are not reliable enough to block mail on.

There are plenty of cases I can name where legitimate mail trips one of these checks.  People sometimes misconfigure mail servers.  People send automated reports.  Small companies might not know enough to set up their reverse DNS, and so forth.  It doesn't matter how many people you get to fix their setup; there will always be more.  Rather than attempting to save the world by fixing everyone else's settings, my philosophy is to avoid being overzealous in spam filtering.  In other words, I acknowledge that people out there do silly things, and I avoid being overly harsh when I encounter them.  The false positive (FP) headaches are not worth the hassle.

To support the assertion that the above three techniques are not enough to block on, I turn to statistics.  Prior to the McColo outage, about 64% of all mail that hit our inbound filters (after IP rejects, which account for the bulk of all mail) was marked as spam.  Here are the numbers for each of the above rules:

  1. No sending domain MX record - 17% spam rate
  2. No sending domain A-record - 16% spam rate
  3. No reverse DNS of sending IP - 29% spam rate

Spam rate means "When this rule hits a message, what percentage of the time do we mark it as spam?"  To interpret this, if spammers exclusively used a technique, we should see a much higher spam rate.  For (2), we should see a 90-95% spam rate (the rest being false negatives and tiny corner cases).  If the rule fired on spammers and misconfigured users in the same proportion as the overall mail stream, then we should see a 64/36 split, or thereabouts.
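To make the definition concrete, here is a minimal sketch of how a per-rule spam rate could be computed and compared against the overall baseline.  The function name and the input shape (a set of rule names plus a spam verdict per message) are assumptions made for illustration, not how our filters actually store this data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def spam_rates(
    messages: Iterable[Tuple[Set[str], bool]],
) -> Tuple[float, Dict[str, float]]:
    """Return (baseline spam rate, spam rate per rule).

    Each message is (rules_hit, is_spam), where rules_hit is the set of
    rule names that fired and is_spam is the filter's final verdict.
    """
    total = 0
    spam_total = 0
    hits: Dict[str, int] = defaultdict(int)
    spam_hits: Dict[str, int] = defaultdict(int)

    for rules_hit, is_spam in messages:
        total += 1
        spam_total += int(is_spam)
        for rule in rules_hit:
            hits[rule] += 1
            spam_hits[rule] += int(is_spam)

    baseline = spam_total / total if total else 0.0
    # "When this rule hits a message, how often do we mark it as spam?"
    per_rule = {rule: spam_hits[rule] / hits[rule] for rule in hits}
    return baseline, per_rule


# Made-up sample shaped to mirror the figures in the text: 64% spam overall,
# and a "no_a" rule that is marked as spam 16% of the time it fires.
sample = (
    [({"no_a"}, True)] * 4
    + [(set(), True)] * 60
    + [({"no_a"}, False)] * 21
    + [(set(), False)] * 15
)
baseline, per_rule = spam_rates(sample)
```

A rule whose rate sits far below the baseline is firing mostly on mail we consider legitimate, which is exactly the situation that makes it dangerous to block on.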

But that's decidedly not what we see.  We mark almost 2/3 of inbound mail as spam, but when this rule fires, the message is marked as spam only 16% of the time.  The fact that there is a nearly 50-point spread makes this unlikely to have occurred by chance, noise, or false negatives.

This means that a disproportionate share of the mail sent with no A-record for the sending domain is legitimate.  The conclusion?  Blocking mail from senders with no A-record will be prone to false positives.  The situation will be the same for the other techniques.

Even throttling on this technique is prone to false positives.  Throttling on misconfiguration is almost as big a problem as blocking on it.  If one user screws up and sends mail with no A-record, they're probably going to send a lot of mail.  Worse, if they script it, they're probably going to send a ton of it.  So, simply because a user has sent a lot of mail with no A-record, it doesn't mean they are spamming.  More analysis is required, like seeing if the domains are all different and who they are sending to.  Simple blocks on these three techniques are a bad idea.
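The kind of extra analysis mentioned above might start with something like the following sketch, which summarizes, per sending IP, how many messages were seen, how many distinct sender domains were used, and how many distinct recipients were targeted.  The names here (`SenderProfile`, `profile_senders`) and the event shape are hypothetical, and the sketch deliberately stops at producing a summary; how to weigh it is a judgment call for the filter.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class SenderProfile:
    """Summary of mail from one IP that tripped a rule such as 'no A-record'."""
    messages: int = 0
    sender_domains: Set[str] = field(default_factory=set)
    recipients: Set[str] = field(default_factory=set)

    @property
    def domain_diversity(self) -> float:
        """Distinct sender domains per message: near 0 means one domain, 1.0 means all different."""
        return len(self.sender_domains) / self.messages if self.messages else 0.0


def profile_senders(
    events: Iterable[Tuple[str, str, str]],
) -> Dict[str, SenderProfile]:
    """Group (sending_ip, sender_domain, recipient) events by sending IP."""
    profiles: Dict[str, SenderProfile] = defaultdict(SenderProfile)
    for ip, sender_domain, recipient in events:
        profile = profiles[ip]
        profile.messages += 1
        profile.sender_domains.add(sender_domain.lower())
        profile.recipients.add(recipient.lower())
    return dict(profiles)


# Made-up events: one IP sending a scripted report from a single domain,
# another IP using a different sender domain on every message.
events = [("192.0.2.10", "reports.example", "ops@example.net")] * 50
events += [
    ("198.51.100.7", f"d{i}.example", f"user{i}@example.org") for i in range(50)
]
profiles = profile_senders(events)
```

Both IPs in this sample send the same volume, so a volume-only throttle would treat them identically; the profile at least shows that their sending patterns differ.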