My mind tends to wander a bit when I drive long distances. Sometimes it goes in very weird directions. For some reason I started thinking about encoding characters in binary. Yeah, pretty weird. Anyway, I was reminiscing about RADIX-50 (aka RAD-50), which was a way, back in the days of expensive memory, of encoding file names in fewer binary words. Specifically, if you had 8 character file names and 3 character file extensions (which is what we had back in the day), you could use one byte for each character, which meant you had to use eleven bytes, or six 16-bit words. Using RAD-50 you could encode the same file name in four words by packing three characters into each word. Wow! A saving of a third (two words out of six). Of course you had to limit yourself to a small character subset, in this case 40 characters (50 in octal, hence the name). So I started asking myself: how many bits does one need to encode a given number of characters? From there it was a simple jump (OK, it is in my weird mind) to “how many characters are there in an alphabet?”
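To make the packing concrete, here is a minimal sketch of RAD-50 style encoding and decoding in Python. The 40-character table follows the commonly documented PDP-11 ordering (space, A-Z, `$`, `.`, `%`, 0-9), but treat that ordering as an assumption; other DEC systems used slightly different tables. The function names `rad50_encode` and `rad50_decode` are made up for this example.

```python
# Assumed PDP-11 style table: space, A-Z, $, ., %, 0-9 (40 symbols in total).
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def rad50_encode(text):
    """Pack text into a list of 16-bit words, three characters per word."""
    text = text.upper()
    # Pad with spaces so the length is a multiple of three.
    text += " " * (-len(text) % 3)
    words = []
    for i in range(0, len(text), 3):
        # str.index raises ValueError for characters outside the table.
        a, b, c = (RAD50_CHARS.index(ch) for ch in text[i:i + 3])
        # Treat the three characters as digits of a base-40 number.
        words.append(a * 40 * 40 + b * 40 + c)
    return words

def rad50_decode(words):
    """Unpack a list of 16-bit words back into text."""
    chars = []
    for word in words:
        chars.append(RAD50_CHARS[word // 1600])
        chars.append(RAD50_CHARS[(word // 40) % 40])
        chars.append(RAD50_CHARS[word % 40])
    return "".join(chars)

packed = rad50_encode("FILENAMEEXT")   # 8 + 3 = 11 characters
print(len(packed))                     # 4 words instead of 6
print(rad50_decode(packed))            # "FILENAMEEXT " (padded to 12)
```

The largest possible word is 39 * 1600 + 39 * 40 + 39 = 63999, which is why three characters fit in 16 bits.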
If you are into subtle clues you may have noticed that I said “an alphabet” rather than “the alphabet.” Why? Well, it turns out, and this does not seem to be widely known by American students, that there are more alphabets than just the English alphabet. Some of them have more than 26 characters, and some have fewer. So before you know how many bits you need, you have to know which alphabet you are dealing with. Now some people will say “What does it matter? I am writing my application for English speakers.” In today’s world that is not going to cut it. Too much of the market for technology and software is international.
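The arithmetic behind “how many bits?” is short enough to sketch. With a fixed-width code, n distinct symbols need the smallest number of bits b such that 2^b is at least n. The helper name `bits_needed` is hypothetical, and the alphabet sizes below count letters only (one case, no digits or punctuation).

```python
def bits_needed(symbol_count):
    """Smallest number of bits that gives each symbol its own fixed-width code."""
    if symbol_count < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, with no float rounding.
    return (symbol_count - 1).bit_length()

print(bits_needed(24))       # Greek, 24 letters      -> 5
print(bits_needed(26))       # English, 26 letters    -> 5
print(bits_needed(33))       # Russian, 33 letters    -> 6
print(bits_needed(40))       # one RAD-50 character   -> 6
print(bits_needed(40 ** 3))  # three RAD-50 characters -> 16
```

The last line shows where RAD-50 gets its saving: one character on its own needs 6 bits, but three of them taken together as a base-40 number need only 16 bits rather than 18.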
This is why, where the computer industry used to standardize on ASCII and EBCDIC (two standards developed for the English language and very English-centric), we now use Unicode a lot more. It supports some 109,000 characters and 93 different scripts or alphabets. It also requires more bits per character. Good thing memory is cheap these days.
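How many more bits depends on the encoding you pick. As a small illustration, the sketch below uses Python's built-in `str.encode` to count the bytes one character takes in a few common Unicode encodings; the helper name `encoded_size` and the sample characters are just choices made for this example.

```python
def encoded_size(char, encoding="utf-8"):
    """Number of bytes a single character occupies in the given encoding."""
    return len(char.encode(encoding))

# A plain ASCII letter, an accented letter, the euro sign, and an emoji.
for char in ["A", "é", "€", "😀"]:
    print(
        char,
        encoded_size(char, "utf-8"),      # 1, 2, 3, 4 bytes respectively
        encoded_size(char, "utf-16-le"),  # 2, 2, 2, 4 bytes respectively
        encoded_size(char, "utf-32-le"),  # always 4 bytes
    )
```

UTF-8 keeps plain ASCII text at one byte per character and only pays extra for everything else, while UTF-32 spends four bytes on every character regardless.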
Of course, representing the characters is one thing; using them to sort data is another. Did you know that some scripts do not have a specific order and are not used for sorting? Me neither, but I ran into that while researching links for this post. In some languages, letters with special marks over them are separate characters, and in others they are not. And does it make a difference whether the letters are uppercase or lowercase? Ouch. Some of the most common things in the world are more complicated than we realize.
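Here is a small Python sketch of why this matters. A naive sort compares raw code points, so uppercase letters land before lowercase ones and accented letters land after “z”. The `simple_sort_key` helper is a hypothetical, deliberately crude workaround that ignores case and strips accent marks; it is not real language-aware collation, which differs from language to language.

```python
import unicodedata

def simple_sort_key(text):
    """Crude sort key: ignore case and drop combining accent marks."""
    # NFD splits a letter such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

words = ["zebra", "Apple", "éclair", "banana"]

print(sorted(words))
# ['Apple', 'banana', 'zebra', 'éclair']   (raw code point order)

print(sorted(words, key=simple_sort_key))
# ['Apple', 'banana', 'éclair', 'zebra']   (case and accents ignored)
```

Stripping accents is the wrong answer for any language that treats an accented letter as a letter in its own right, which is exactly the complication described above.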
Fortunately for most of us, there are library routines to handle this stuff. Sort routines in the .NET Framework, for example, have options for dealing with multiple languages, different or special sorting orders and, of course, different character sets. Jumping back to the beginning: when I was working for Digital Equipment Corporation, which invented and used the Radix-50 system, the various programming tools had functions to handle the character encoding and decoding. When I left there, I had to write my own functions to do the decoding so that I could read magnetic tapes that had my personal data on them. That was educational.
So what is the point? Well, the point is that these are things we don’t often think about, but people who want to get involved at the systems level, or even just properly understand the issues around making international products, do have to think about them. And by the way, if you read the Wikipedia article on Unicode you’ll find that, for all the good intentions and smart people working on the problem, there is still controversy.
One last thing: if you are ever in an interview and someone asks “how many bits are needed to encode an alphabet?”, be sure to ask them which one. It’s a trick question.