Does SPF need an update to handle non-existent includes? I say yes.

Over the past month, my team and I have been going through the logs in our system, looking for SPF PermErrors and trying to figure out how many we had and what caused them. As it turns out, lots of things cause a permanent SPF error. The most common is going over the 10 DNS lookup limit because a domain owner adds too many third parties to their SPF record. Another is publishing two SPF records (two TXT records in DNS) instead of combining them into one, for example with a nested include. However, there are tons of other errors, like:

  • Spaces after the ip4, e.g., ip4: 1.2.3.4
  • No "4", e.g., ip:1.2.3.4
  • Forgetting to include a policy, e.g., no -all, ~all, or ?all
  • Putting quote marks around everything, e.g., "v=spf1" "ip4:1.2.3.4" "-all"
  • And a whole bunch more
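
To make those mistakes concrete, here is a minimal sketch of a checker that flags the four listed above. It is an illustration only, not what our filter runs: the function name lint_spf and the message wording are made up for this post, and it looks at a single TXT string rather than doing any DNS lookups.

```python
import re


def lint_spf(record: str) -> list[str]:
    """Return human-readable problems found in one SPF TXT string.

    Hypothetical helper for illustration; it only covers the mistakes
    listed in the post, not the full SPF grammar.
    """
    problems = []
    terms = record.split()

    # Quote marks inside the value mean the quotes were typed into the record.
    if '"' in record:
        problems.append("literal quote marks inside the record")

    if not terms or terms[0].lower() != "v=spf1":
        problems.append("record does not start with v=spf1")

    for term in terms:
        t = term.lower()
        # "ip4: 1.2.3.4" splits into "ip4:" and "1.2.3.4", so the
        # mechanism is left with an empty value.
        if t in ("ip4:", "ip6:"):
            problems.append(f"space after '{t}' leaves the mechanism empty")
        # "ip:" is not a mechanism; it has to be ip4: or ip6:.
        if t.startswith("ip:"):
            problems.append("'ip:' should be 'ip4:' or 'ip6:'")

    # Flag a record with no closing policy (no "all" and no redirect=).
    has_policy = any(
        re.fullmatch(r"[+\-~?]?all", t.lower()) or t.lower().startswith("redirect=")
        for t in terms
    )
    if not has_policy:
        problems.append("no closing policy such as -all, ~all, or ?all")

    return problems


if __name__ == "__main__":
    for rec in ("v=spf1 ip4:1.2.3.4 -all", "v=spf1 ip4: 1.2.3.4 -all", "v=spf1 ip:1.2.3.4 -all"):
        print(rec, "->", lint_spf(rec))
```

A clean record comes back with an empty list, while each of the broken examples comes back with at least one message.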

But the weirdest one we discovered was the case of a non-existent nested include. This is when a domain owner adds an include mechanism, but the domain does not exist. For example:

 contoso.com IN TXT "v=spf1 ip4:1.2.3.4 include:_spf.fabrikam.com ~all"

When an email comes in from an IP, say 5.6.7.8, the email receiver looks up the SPF record for contoso.com. It sees that 1.2.3.4 is not a match, and then sees the directive to look up _spf.fabrikam.com.

What's supposed to happen is that the email receiver does an SPF lookup on _spf.fabrikam.com and follows that chain. But suppose the TXT record doesn't exist and the lookup returns NXDOMAIN. As it turns out, we were stamping that as a PermError. In other words, a non-existent include was being treated as a syntax error, instead of being treated as a no-op.

"What?" we said. "A PermError?" That should be treated as a non-operation: it should burn one DNS lookup (count once against the limit of 10), and the receiver should assume that the IP could not be found in there. "It must be a bug in our code," we concluded.

As it turns out, it's not a bug in our code, it's how the specification works. From OpenSPF's syntax documentation:


The "include" mechanism

 include:<domain>

The specified domain is searched for a match. If the lookup does not return a match or an error, processing proceeds to the next directive. Warning: If the domain does not have a valid SPF record, the result is a permanent error.


I disagree with the warning at the end of that passage. Rather than being a permanent error, a missing SPF record on the included domain should be a non-match, and the SPF check should continue.

In other words, the result of checking 5.6.7.8 for contoso.com against this:

 contoso.com IN TXT "v=spf1 ip4:1.2.3.4 include:_spf.fabrikam.com ~all"

...should be a soft fail, and not a PermError.
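
Here is a rough sketch of the difference, assuming a toy evaluator that only understands ip4, include, and all. Everything in it is hypothetical: FAKE_DNS is an in-memory dictionary standing in for DNS (a missing key plays the role of NXDOMAIN), and the lenient flag is my name for the behavior I am proposing. It only covers the non-existent include case, not every way an included record can break.

```python
import ipaddress

# Hypothetical stand-in for DNS: domain -> SPF TXT string.
# _spf.fabrikam.com is deliberately absent, which models NXDOMAIN.
FAKE_DNS = {
    "contoso.com": "v=spf1 ip4:1.2.3.4 include:_spf.fabrikam.com ~all",
}

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
LOOKUP_LIMIT = 10


class PermError(Exception):
    """Raised when evaluation hits a permanent error."""


def check_host(ip, domain, lenient, budget):
    record = FAKE_DNS.get(domain)
    if record is None:
        return "none"  # no SPF record published for this domain

    for term in record.split()[1:]:  # skip the leading "v=spf1"
        qualifier = "+"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]

        if term == "all":
            return QUALIFIERS[qualifier]
        if term.startswith("ip4:"):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ipaddress.ip_address(ip) in network:
                return QUALIFIERS[qualifier]
        elif term.startswith("include:"):
            # Every include costs one lookup, whether or not the target exists.
            budget["lookups"] += 1
            if budget["lookups"] > LOOKUP_LIMIT:
                raise PermError("more than 10 DNS lookups")
            inner = check_host(ip, term[8:], lenient, budget)
            if inner == "pass":
                return QUALIFIERS[qualifier]
            if inner == "none" and not lenient:
                # Strict behavior: included domain has no SPF record.
                raise PermError(f"include target {term[8:]} has no SPF record")
            # Otherwise: no match, move on to the next term.
        else:
            raise PermError(f"term not supported in this sketch: {term}")

    return "neutral"  # ran off the end of the record without a match


def evaluate(ip, domain, lenient):
    """Return the SPF result as a lowercase string."""
    try:
        return check_host(ip, domain, lenient, {"lookups": 0})
    except PermError:
        return "permerror"


if __name__ == "__main__":
    print(evaluate("5.6.7.8", "contoso.com", lenient=False))  # strict
    print(evaluate("5.6.7.8", "contoso.com", lenient=True))   # proposed
```

With lenient=False the missing include turns the whole check into a permerror. With lenient=True the include still counts against the lookup limit but is treated as a non-match, so evaluation reaches ~all and 5.6.7.8 gets a soft fail.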

Why?

For a few reasons:

  1. Downgrading to a PermError introduces ambiguity when it could give a more authoritative result

    The domain owner of contoso.com is taking a dependency on the domain owner of fabrikam.com. If fabrikam.com ever updates their own SPF record and breaks it, contoso.com shouldn't be punished. A potential SPF pass should still pass, and a potential SPF fail should still fail rather than give an ambiguous result. When one part of an SPF record is broken and the rest is okay, the okay parts should be respected and the broken part disregarded. The one broken part should not break everything else.

  2. To avoid punishing the original domain owner

    Look, we get it. It turns out that SPF is hard to get right. The domain owner of contoso.com is doing everything they can to do the right thing. They shouldn't be punished for making one mistake; instead, the mistake should be contained (kind of like a try/catch to handle exceptions when coding) in order to limit the damage. The maintainers of email filtering software know how to handle simple errors in SPF records better than domain owners understand how to maintain valid SPF syntax. Therefore, email filtering software should try harder.

  3. For practicality - apparently domain owners don't notice when they have permanent errors

    If this were only a few problematic domains, I might shrug my shoulders and say "Meh, no big deal. The domain owner should be paying attention to these types of things. Surely they've noticed deliverability challenges due to invalid SPF records." Well, it turns out that there are hundreds, perhaps even thousands, of invalid SPF records. Clearly, domain owners are not paying attention. It's simply more practical to detect the domain owner's intent than to fail delivery, hope they notice, and wait for them to fix the record.

I think this change would make SPF work better. Yes, it is more forgiving to the owners of SPF records (the ones with access to update DNS) and takes some of the burden off them, but SPF is not easy to get right. I am well-versed in SPF, and so are many of my readers, so you know what I'm talking about. Software designers need to be able to gracefully handle error conditions, and in Office 365 and Outlook.com we think that, for email authentication records in DNS, it's more important to detect the user's intent than to pay strict attention to the letter of the law when verifying the record.

However, the SPF specification could probably stand to have an update, or perhaps a Best Common Practices document for verifiers. Probably everyone who works with SPF has the same stories as me, so we may as well make our software more robust to predictable errors.

It would make stuff work better.