On using multiple blocklists

All Spammed Up posted an article the other day on whether a filtering service should use multiple blocklists as part of its spam filtering solution.  Quoting the article:

The biggest benefit to using more than one block list provider is that there are more chances to detect spam thanks to a greater diversity of lists being queried.  If you’ve ever had to troubleshoot a deliverability issue by investigating whether a mail server IP is on a block list you would have discovered that of the dozens of lists available not all of them will give the same result for a given query.

The idea here is defense in depth.  No filtering solution is perfect, and the more chances you give yourself of finding spam, the better the end user experience will be.  In other words, independent blocklists with differing methodologies will detect more spam together than any one of them alone.  For example, the Spamhaus XBL is the Exploits Blocklist.  It is a list of machines that have been reported as sending spam and are part of a botnet.  These IPs send to spam traps, and the XBL is very good at removing IPs that are not zombies.  By contrast, the PBL (Policy Blocklist) is a list of IPs, provided by ISPs, that should not be sending out mail directly.  If you get mail from these IPs, you can block it right off the hop because it implies that the machine is compromised.  There is going to be some overlap between the PBL and XBL, but each will contain IPs that the other does not, so using both together will end up catching more spam.

On the drawbacks of using multiple lists, All Spammed Up has the following:

However the biggest drawback is that every additional list provider that you configure means additional resources are consume for every email that is checked, both in terms of server processing and network bandwidth.

The assumption behind this is that using different blocklists requires separate DNS queries.  Many on-premise filtering solutions use public blocklists, and they query them over DNS.  For example, suppose a service wants to use the Spamhaus XBL and PBL along with Jimmy’s Filtering Solution, and it gets mail from the IP 292.143.22.11.  The mail filter then does the following (using quasi-fictional commands):

dig 11.22.143.292.xbl.spamhaus.org -> NXDOMAIN
dig 11.22.143.292.pbl.spamhaus.org -> NXDOMAIN
dig 11.22.143.292.blocklist.jimmys.filtering.net -> 127.0.0.3

In each case, you have to make a DNS query, which takes server processing and network bandwidth.  And if you don’t get a response right off the bat, you have to do another DNS query.
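The per-list lookups above can be sketched in Python.  This is only an illustration under a few assumptions: the helper names (`dnsbl_query_name`, `query_names`, `lookup`) are made up for this example, the zone names are the ones from the quasi-fictional commands, and the sample address 192.0.2.11 is a documentation IP rather than a real sender.

```python
import socket

# Zones taken from the example above; the last one is fictional.
ZONES = [
    "xbl.spamhaus.org",
    "pbl.spamhaus.org",
    "blocklist.jimmys.filtering.net",
]

def dnsbl_query_name(ip, zone):
    """Reverse the IPv4 octets and append the blocklist zone."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected a dotted-quad IPv4 address, got {ip!r}")
    return ".".join(reversed(octets)) + "." + zone

def query_names(ip, zones=ZONES):
    """One query name per list: this is where the per-list cost comes from."""
    return [dnsbl_query_name(ip, zone) for zone in zones]

def lookup(name):
    """Resolve a query name; None stands in for NXDOMAIN (not listed).

    Note that socket.gaierror also covers other resolver failures,
    so a real filter would want to tell those cases apart.
    """
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        return None

names = query_names("192.0.2.11")
```

With three lists configured, `names` holds three separate query names, and calling `lookup` on each one means three round trips per message.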

This problem is solved by tweaking your implementation.  Rather than doing separate DNS queries across the Internet, you can do a zone transfer and publish it to your own local (private) DNS server.  By this, I mean that you download the lists from the various data providers, append them all into one zone file and then publish that zone into your own DNS server.  Each list has a different response code.  You can then map the response codes back to different lists.  For example, suppose we had the following zone:

# Spamhaus XBL
:127.0.0.3:
<list of IPs on XBL goes here>

# Spamhaus PBL
:127.0.0.10:
<list of IPs on PBL goes here>

# Jimmy’s Filtering Service
:127.0.0.4:
<list of IPs goes here>
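As a rough sketch of the append step, the following Python builds one combined text in the same layout as the zone above.  The function name `build_combined_zone`, the `SOURCES` structure, and the sample addresses (all documentation IPs) are assumptions for illustration; how the lists are actually downloaded from each provider is left out.

```python
def build_combined_zone(sources):
    """Join several lists into one zone text.

    sources is a list of (label, response_code, ips) tuples; each list
    becomes a comment line, a response-code line, and then its IPs.
    """
    lines = []
    for label, code, ips in sources:
        lines.append(f"# {label}")
        lines.append(f":{code}:")
        lines.extend(sorted(ips))  # lexical string sort, just for stable output
        lines.append("")           # blank line between sections
    return "\n".join(lines)

# Hypothetical sample data; response codes match the zone shown above.
SOURCES = [
    ("Spamhaus XBL", "127.0.0.3", ["192.0.2.11"]),
    ("Spamhaus PBL", "127.0.0.10", ["198.51.100.7", "198.51.100.8"]),
    ("Jimmy's Filtering Service", "127.0.0.4", ["203.0.113.5"]),
]

zone_text = build_combined_zone(SOURCES)
```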

Next, you would load the combined list into your own zone, say ip-list.blk.  Then, for a given IP, you make a single query:

dig 11.22.143.292.ip-list.blk -> 127.0.0.4
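Mapping the response code back to a list is a simple table lookup.  A minimal sketch, assuming the response codes from the example zone; the names `CODE_TO_LIST` and `attribute` are hypothetical:

```python
# Response codes as assigned in the example zone above.
CODE_TO_LIST = {
    "127.0.0.3": "Spamhaus XBL",
    "127.0.0.10": "Spamhaus PBL",
    "127.0.0.4": "Jimmy's Filtering Service",
}

def attribute(response):
    """Return the list name for a response code.

    None means either no response (the IP is not listed) or a code
    that this table does not know about.
    """
    if response is None:
        return None
    return CODE_TO_LIST.get(response)

source = attribute("127.0.0.4")
```

Here `source` is the name of Jimmy's list, matching the query result above.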

Because you are doing one DNS query (assuming you want to do it over DNS), you have cut down the amount of bandwidth you use.  Because of the nature of this particular protocol, you may get back multiple response codes if an IP is on more than one list.  In a case like that, you have two options:

  1. Clean up the lists ahead of time to make each list unique.  Make decisions as to where the priority should be in terms of attribution.
  2. Take the first response code that is returned and make attributions to that.
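Both options can be sketched briefly.  In this illustration the priority order (XBL before PBL), the function names, and the sample IPs are assumptions, not a recommendation about which list should win.

```python
def dedupe_by_priority(lists_in_priority_order):
    """Option 1: keep each IP only on the highest-priority list that has it.

    lists_in_priority_order is a list of (name, ips) pairs, highest
    priority first.  Returns a dict mapping name to a set of unique IPs.
    """
    seen = set()
    result = {}
    for name, ips in lists_in_priority_order:
        unique = set(ips) - seen   # drop IPs already claimed by an earlier list
        result[name] = unique
        seen |= unique
    return result

def first_response(responses):
    """Option 2: attribute the hit to whichever code came back first."""
    return responses[0] if responses else None

# Hypothetical data: 192.0.2.11 appears on both lists.
cleaned = dedupe_by_priority([
    ("XBL", {"192.0.2.11", "192.0.2.12"}),
    ("PBL", {"192.0.2.11", "198.51.100.7"}),
])
chosen = first_response(["127.0.0.3", "127.0.0.10"])
```

Option 1 makes attribution deterministic at build time, while option 2 depends on the order in which the server happens to return records.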

Doing the blocklist implementation in this manner greatly reduces bandwidth.  But that brings me to my next point – the biggest drawback of using multiple blocklists is not so much the resource consumption, but false positives (FPs).  False positives greatly degrade the user experience, and each additional anti-spam technology or list adds to them.  While there is always going to be a great deal of overlap between different technologies, with only some incremental benefit in terms of catching spam, FPs by and large are additive in nature.  It is difficult to avoid FPs, and when they start to add up, that’s when you have to decide whether the improvement in spam effectiveness is worth the number of complaints that come from using a particular list.
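To make the arithmetic concrete, here is a toy calculation.  The numbers are entirely made up and chosen only to mirror the argument: the spam caught by two hypothetical lists overlaps heavily, while their false positives do not overlap at all.

```python
# Hypothetical message IDs; none of this is measured data.
spam_caught_a = set(range(1, 9))    # list A catches spam messages 1..8
spam_caught_b = set(range(5, 11))   # list B catches spam messages 5..10
fp_a = {101, 102}                   # legitimate mail wrongly blocked by A
fp_b = {103}                        # legitimate mail wrongly blocked by B

combined_spam = spam_caught_a | spam_caught_b
combined_fp = fp_a | fp_b

# Overlap means B adds less spam than its own total suggests...
extra_spam_from_b = len(combined_spam) - len(spam_caught_a)
# ...but with no FP overlap, B's false positives are added in full.
extra_fp_from_b = len(combined_fp) - len(fp_a)
```

In this made-up case, list B catches six spam messages on its own but contributes only two new ones, while its single false positive is added in full.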