If everyone spoke English, and spammers only sent spam from bots, spam filtering would be a relatively simple task (relatively speaking, of course). But that simply isn't reality. The biggest shift in spamming over the past 18 months or so is the use of reputation hijacking: hiding behind someone else's good reputation in order to send out spam. This occurs when a spammer compromises accounts at a provider like Gmail or Live Mail (Hotmail) and then uses those accounts to send out spam.
In a case like this, reputation filtering on the recipient side won't help. That small trickle of spam coming from Yahoo Mail's servers can't be blocked on the basis of IP reputation or accepted on the basis of domain reputation. This leaves spam filters relying on one of the older tricks in the book: content filtering.
But leaving aside the question of content filtering for the moment, consider an actual email. How does it get from point A to point B? How is the data inside the email reconstructed? In fact, this goes back to a basic question in all of computing: how do you represent data in an abstract format when all a computer knows is 1s and 0s?
Back in the olden days, a mechanism for representing data was invented called the American Standard Code for Information Interchange, or ASCII (pronounced ask-ee). ASCII is a 7-bit code that is used to represent the various letters and keystrokes found on a traditional computer keyboard in North America. Seven bits means that you can represent 128 (2 to the power of 7) characters.
- The letter A is mapped to the number 65 (in decimal, base 10), which is 1000001 in binary.
- The number 7 is mapped to the number 55, which is 0110111 in binary.
- The DELETE keystroke is mapped to the number 127, which is 1111111 in binary.
- The space character is mapped to the number 32, which is 0100000 in binary.
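To check the mappings above for yourself, here is a minimal Python sketch. The helper name `ascii_bits` is made up for this post; `ord` and `format` are Python built-ins.

```python
def ascii_bits(ch):
    """Return the 7-bit ASCII code of a single character as a string of 0s and 1s."""
    code = ord(ch)  # numeric code of the character
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "07b")  # zero-padded to 7 binary digits


# The four examples from the list above ("\x7f" is the DELETE character).
for ch in ["A", "7", "\x7f", " "]:
    print(repr(ch), ord(ch), ascii_bits(ch))
```

Each printed line shows the character, its decimal code, and the same code in 7-bit binary.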
ASCII encoding allows English speakers in North America to encode all 26 letters of the alphabet in uppercase and lowercase and all 10 digits. That's 62 characters and there are still 66 left over! So, ASCII also lets you encode ?, !, $, %, ^, &, carriage returns, null characters, backspace, and so forth. In other words, everything you type can be encoded using ASCII... so long as you live in North America and speak English.
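The arithmetic is easy to verify. The sketch below uses Python's standard `string` module; the variable names are my own, and the breakdown of the 66 leftovers into punctuation, control codes, and the space character is my addition rather than something spelled out above.

```python
import string

total = 2 ** 7  # a 7-bit code has 128 possible values

letters_and_digits = string.ascii_letters + string.digits
left_over = total - len(letters_and_digits)

# The leftovers are punctuation, the space character, and control codes
# (values 0 through 31, plus 127 for DELETE).
punctuation = string.punctuation
control_codes = [code for code in range(total) if code < 32 or code == 127]

print("letters and digits:", len(letters_and_digits))
print("left over:", left_over)
print("punctuation:", len(punctuation))
print("control codes:", len(control_codes))
print("space:", 1)
```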
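Let's suppose you got a message or document and it was encoded in binary. Let's also assume that it came from a friend of yours and this friend only speaks English and he lives somewhere in the United States. The message looks like the following:
010010000110010101101100011011000110111100100000011010000110111101110111001000000110000101110010011001010010000001111001011011110111010100111111
This may look like a bunch of gobbledygook, but it's not. We know that the message is sent in bytes, so every 8 digits represent 1 byte. This lets us know that the message looks like this for each byte:
01001000 01100101 01101100 01101100 01101111 00100000 01101000 01101111 01110111 00100000 01100001 01110010 01100101 00100000 01111001 01101111 01110101 00111111
Since our friend is in North America, he probably used ASCII to encode his message. Translating each of the above bytes to its decimal equivalent and then looking up the ASCII character for that number, we now have the message: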
Hello how are you?
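The decoding steps just described (split into bytes, convert each byte to a number, look up the character) can be sketched in a few lines of Python. The function name `decode_ascii_bits` and the short sample input are my own illustration, covering just the first word of the message.

```python
def decode_ascii_bits(bits):
    """Decode a string of 0s and 1s into text, 8 bits per ASCII character."""
    bits = "".join(bits.split())  # ignore any spaces between bytes
    if len(bits) % 8 != 0:
        raise ValueError("expected a whole number of 8-bit bytes")
    chars = []
    for start in range(0, len(bits), 8):
        value = int(bits[start:start + 8], 2)  # binary digits -> decimal number
        if value > 127:
            raise ValueError(f"byte value {value} is outside the ASCII range")
        chars.append(chr(value))  # decimal number -> character
    return "".join(chars)


sample = "01001000 01100101 01101100 01101100 01101111"
print(decode_ascii_bits(sample))  # Hello
```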
This simple translation between characters, numbers, and binary forms the basis of all electronic communication. You simply pick an encoding, map the characters you normally write to their numerical equivalents, and then convert those numbers to binary. You transmit the message, and the receiver on the other side picks it up and puts it back together. Simple.
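Here is a rough sketch of that round trip using Python's built-in ASCII codec. The names `to_bits` and `from_bits` are hypothetical helpers for this post, and writing the bits out as a text string is just for readability; real software sends the raw bytes.

```python
def to_bits(text):
    """Sender side: characters -> ASCII byte values -> 8-digit binary strings."""
    data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII text
    return " ".join(format(byte, "08b") for byte in data)


def from_bits(bits):
    """Receiver side: 8-digit binary strings -> byte values -> characters."""
    data = bytes(int(chunk, 2) for chunk in bits.split())
    return data.decode("ascii")


wire = to_bits("Hello how are you?")
print(wire)             # what travels between sender and receiver
print(from_bits(wire))  # what the receiver reads
```

Both sides have to agree on the encoding up front. If the receiver assumed a different mapping, the same bits would come out as different characters.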
But what happens if you speak a language that uses characters other than the ones specified by ASCII? What if you speak French? If you want to say that it's hot in summer, you'd say "C'est chaud dans l'été" or if you were German, you'd say "Es ist heiß im Sommer." What then? ASCII doesn't have any mapping for those crazy characters, like é and ß. Then what do we do?
This is something I shall delve into over the next few posts.
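In the meantime, you can watch the problem happen. This small Python sketch tries to encode the sentences above as ASCII; `non_ascii_chars` is a hypothetical helper of mine, while `str.encode` and `UnicodeEncodeError` are standard Python.

```python
def non_ascii_chars(text):
    """Return the characters in text that have no ASCII code (value above 127)."""
    return [ch for ch in text if ord(ch) > 127]


sentences = [
    "Hello how are you?",
    "C'est chaud dans l'été",
    "Es ist heiß im Sommer.",
]

for sentence in sentences:
    try:
        sentence.encode("ascii")
        print(f"OK:     {sentence}")
    except UnicodeEncodeError:
        # ASCII has no number assigned to these characters.
        print(f"FAILED: {sentence} (no ASCII code for {non_ascii_chars(sentence)})")
```

The English sentence encodes fine, while the French and German ones fail on é and ß.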