Digging through the problem of IPv6 and email, part 1

Recently, a couple of anti-spam (or at least email security related) bloggers have written some articles about IPv6 and the challenges that the email industry faces regarding it. John Levine, who has written numerous RFCs and a couple of books about spam fighting, writes the following in his article A Politically Incorrect Guide to IPv6, part III:

We will eventually figure out both how people use IPv6 addresses for mail, and how to manage and publish v6 reputation data (I've been doing some experiments, which I'll blog about when I have enough results), but until then, running a mail server on v6 will be a lot harder than running one on v4. And since you'll be able to handle all the real mail on v4, why bother?

Barry Leiba, another email security writer, writes the following on CircleID in an article entitled IP Blocklists, Email, and IPv6:

John Levine has one approach: leave the email system on IPv4 for the foreseeable future. Even, John points out, when many other services, customer endpoints, mobile and household devices, and the like have been — have to have been — switched to IPv6, we can still run the Internet email infrastructure on IPv4 for a long time, leaving the IP blocklists with v4 addresses, and a system that we're already managing fine with. 

Of course, some day, we'll want to completely get rid of IPv4 on the Internet, and by then we'll need to have figured out a replacement for the IP blocklist mechanism. But John's right that that won't be happening for many years yet, and he makes a good case for saying that we don't have to worry about it.

Both writers are saying the same thing, and I have been on discussion threads where the consensus was similar: there is no agreement on how to handle email over IPv6, at least in the short term, but eventually it will probably have to be figured out (some believe mail will never move to IPv6, while others think it will have to go there one of these days). In the meantime, just use IPv4 to send mail.

To expand a bit on what both writers are saying, the biggest reason why no mail providers are particularly thrilled about using IPv6 to handle email is that there is no way at the moment to deal with the problem of abuse. Today, spammers make extensive use of botnets. Each day, they compromise new machines and start using them to spew out spam. Each of these bots uses a different IP address, and the IP addresses change all of the time. I haven’t done an analysis in a while, but if you had 10,000 IP addresses sending out spam today, then tomorrow there would be 10,000 again, but at least 9,700 of them would be different from the ones that were there the previous day.
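To make the churn figure concrete, here is a minimal sketch of how day-over-day turnover could be measured from two sets of spamming IPs. The function name `churn_rate` and the synthetic addresses are assumptions made for illustration; the 10,000 and 9,700 figures are the rough ones quoted above, not measured data.

```python
import ipaddress

def churn_rate(yesterday, today):
    """Fraction of today's IPs that were not seen yesterday."""
    today = set(today)
    if not today:
        return 0.0
    return len(today - set(yesterday)) / len(today)

# Synthetic data only: 10,000 addresses per day with 300 in common.
BASE = int(ipaddress.IPv4Address("10.0.0.0"))
yesterday = {ipaddress.IPv4Address(BASE + i) for i in range(0, 10_000)}
today = {ipaddress.IPv4Address(BASE + i) for i in range(9_700, 19_700)}

rate = churn_rate(yesterday, today)
print(f"{rate:.0%} of today's spamming IPs are new")
```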

The reason there is so much rotation in IP addresses is that spam filters today make use of IP blocklists. When a blocklist service detects that an IP is sending spam, it adds the IP to the blocklist, and mail servers that use the list reject all mail from it. There are exceptions to this rule, such as a legitimate IP that sends a majority of good mail (a Hotmail or Gmail IP address, for example), but in general, mail servers reject all mail from blocklisted IPs. They do this for the following reasons:

  1. 90% of all email flowing across the Internet (not including internal mail within an organization) is spam. If a sending IP is on a blocklist, a mail server can reject it in the SMTP transaction and save on all of the processing costs associated with accepting the message and filtering it in the content filter. Many mail servers these days would topple over and crash because they could not keep up with the load if they had to handle all of the mail coming from blocklisted IPs since it would increase the number of total messages to deal with by a factor of 10.
  2. Spam filters get slightly better antispam metrics by using IP blocklists. Content filters are pretty good today, but rejecting 100% of mail from a spamming IP address means that there is no possibility of a false negative from that IP address. By contrast, if a content filter does not use an IP blocklist, the content filter has to learn to recognize the spam coming from that IP address, update the filter and then replicate out the changes. This is almost always slower than pulling down a blocklist and then using it as the first line of defense. Without an IP blocklist, a spam filter might be expected to filter between 80% and 99% of the mail coming from a blocklisted IP. While many spam filters get pretty close to that 99% range, it’s still not 100%.
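Rejecting in the SMTP transaction is commonly done with a DNS lookup against the blocklist: the connecting IP's octets are reversed and prepended to the list's zone, and an answer means the IP is listed. The sketch below assumes that DNS-based convention; the zone `dnsbl.example.org` is a placeholder and not a real list, and `dnsbl_query_name` and `is_listed` are hypothetical helper names.

```python
import ipaddress
import socket

def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a DNS blocklist."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip, zone):
    """Return True if the blocklist zone has an A record for this IP.

    Any resolution failure (including NXDOMAIN) is treated as "not listed",
    so a DNS outage fails open rather than rejecting mail.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False

# Placeholder zone, for illustration only.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
print(name)
```

A mail server would call something like `is_listed` right after the connection is accepted and, on a hit, reject before the message body is ever transferred, which is where the processing savings come from.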

Those are the two primary reasons to use IP blocklists. They are essential in blocking spam. Next up, the question is how blocklists are populated, and I’m going to leave that aside because there are resources elsewhere on how to deal with that. Blocklist operators publish their lists in two ways:

  1. They list the individual IP addresses of all the servers that are sending spam, one by one.
  2. They make use of CIDR notation. CIDR, or Classless Inter-Domain Routing, is basically a way to group large blocks of IP addresses. In IP blocklists, a provider lists a larger group of IP addresses in CIDR notation in order to save space in the file (they don’t have to list them one by one). For example, the XBL is about 7 million entries (lines of text) and is around 100 megs in size. By contrast, the PBL contains 200,000 lines of text (without exceptions in ! notation) and is 6 megs. However, the PBL is represented mostly in CIDR notation. If all of these ranges are expanded, it is over 650 million individual IP addresses. That’s a whole heck of a lot more IPs in the PBL for a whole lot less file size.
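The space saving from CIDR notation can be seen with Python's standard `ipaddress` module. This is only a sketch; the 192.0.2.0/24 range is a documentation prefix chosen for illustration and has nothing to do with the actual contents of the XBL or PBL.

```python
import ipaddress

# 256 single-address entries, as they would appear listed one by one.
singles = [ipaddress.ip_network(f"192.0.2.{i}/32") for i in range(256)]

# The same addresses merged into the smallest set of CIDR blocks.
collapsed = list(ipaddress.collapse_addresses(singles))
block = collapsed[0]

print(f"{len(singles)} lines collapse to {len(collapsed)}: {block}")
print(f"{block} covers {block.num_addresses} addresses")

# A lookup against the CIDR form is a containment test, not a line match.
listed = ipaddress.ip_address("192.0.2.77") in block
```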

In terms of effectiveness, we run XBL in front of PBL, and XBL blocks about 4 times as much mail as PBL (I don’t know how many would be blocked if we ran them in reverse). The XBL is better at catching individual bots that are sending out spam but are not listed anywhere (they are new IPs), whereas the PBL is better at pre-emptively catching IPs that should never be sending mail in the first place (probable bots, but it doesn’t matter because they shouldn’t be sending mail anyhow). They are designed to be used in tandem. However, if we had to list every PBL IP singly instead of compressing it into CIDR ranges, and if we use about the same ratio of 7 million IPs ~ 100 megs, then the PBL would be 9.4 gigs in total size. 9.4 gigs is a large file size. It isn’t completely unmanageable, but it goes from being a minor inconvenience to being a major one. It takes a long time to download/upload/process a 9.4 gig file. It’s also far easier to store the file entries in a database if it is only 500,000 entries (or even 7 million) vs 650 million of them. Databases that large start to run into the problem of scale.
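The file-size estimate is a simple proportion, sketched below using the rounded figures from the text (7 million lines in roughly 100 megs, 650 million expanded PBL addresses). With those rounded inputs the result comes out at roughly 9.3 GB, in line with the 9.4 gigs quoted above, which was presumably worked from less rounded numbers. The function name `estimated_size_gb` is an assumption for illustration.

```python
# Rounded figures quoted in the text: the XBL is ~7 million lines in ~100 MB.
XBL_ENTRIES = 7_000_000
XBL_SIZE_MB = 100

def estimated_size_gb(num_ips, entries=XBL_ENTRIES, size_mb=XBL_SIZE_MB):
    """Scale the XBL's size-per-line ratio to a list of num_ips single IPs."""
    mb_per_entry = size_mb / entries
    return num_ips * mb_per_entry / 1000  # decimal gigabytes

pbl_expanded_gb = estimated_size_gb(650_000_000)
print(f"PBL with one IP per line: ~{pbl_expanded_gb:.1f} GB")
```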

The PBL and XBL are prime examples of why different styles of IP blocklists are required. The PBL lists 650 million IPs and we still have over 7 million IPs on the XBL that aren’t on the PBL. Clearly, spamming bots can and do move around such that they are not listed on the lists that have large swaths listed. Bots are very good at hiding in places that are not called out and blocked yet. If they could not do this they would not be in business, and spammers are still in business. The fact is that given enough space to hide, spammers will hide in that space. The problem that we in the industry face is that as soon as we find a hiding space, we can block it for a bit but the spammer will vacate it, relocate elsewhere and continue to spam.

And therein is the problem of IPv6. An IPv4 address consists of 4 octets, and each octet is a number running from 0 to 255. This means that there are 256 x 256 x 256 x 256 possible IP addresses, which is about 4.3 billion. In reality, there are far fewer than this because lots of ranges of IPs are reserved and not for public consumption. Still, using our formula from above, if you had to list every single IP address singly in a file, then the size of the file would be 61 gigs. 61 gigs is a very large file size, and there are very few pieces of hardware that can handle a file of that size in memory (whether you are doing IP blocklist lookups in rbldnsd or some other in-memory solution on-the-box). Processing the file and cleaning it up would take a very long time; you simply couldn’t do it in real time, and IP blocklists need to be updated frequently (once per hour at a bare minimum).
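The IPv4 numbers above can be checked in a few lines. The sketch below counts the full address space, subtracts a handful of well-known reserved blocks (an illustrative subset, not a complete registry of special-purpose ranges), and applies the same 7 million lines ~ 100 megs ratio to the full space.

```python
import ipaddress

TOTAL_IPV4 = 256 ** 4  # four octets, 256 values each

# A partial list of reserved or special-purpose IPv4 blocks, for illustration.
RESERVED = [
    "0.0.0.0/8",       # "this network"
    "10.0.0.0/8",      # private use
    "100.64.0.0/10",   # carrier-grade NAT shared space
    "127.0.0.0/8",     # loopback
    "169.254.0.0/16",  # link-local
    "172.16.0.0/12",   # private use
    "192.168.0.0/16",  # private use
    "224.0.0.0/4",     # multicast
    "240.0.0.0/4",     # reserved for future use
]

reserved_total = sum(ipaddress.ip_network(n).num_addresses for n in RESERVED)
public_upper_bound = TOTAL_IPV4 - reserved_total

# Same ratio as before: ~100 MB per 7 million single-IP lines.
full_list_gb = TOTAL_IPV4 * (100 / 7_000_000) / 1000

print(f"All IPv4 addresses:        {TOTAL_IPV4:,}")
print(f"Reserved (partial list):   {reserved_total:,}")
print(f"Upper bound on public IPs: {public_upper_bound:,}")
print(f"One line per IPv4 address: ~{full_list_gb:.0f} GB")
```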