Why can’t we strip the diacritics?


We have some “best-fit” behavior which we generally consider to be “bad”.  Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don’t lose anything).  But assuming you can’t use Unicode, why is it so bad to just make everything ASCII-like?  Maybe you have a publishing house or direct marketing firm that can’t handle Unicode, so you’ll just get rid of those annoying decorations.

In American English the diacritics are effectively quaint decorations.  Many people naïvely assume that when Word auto-corrects naive to naïve, this is just a prettiness factor.  When they resume spell checking their résumé the diacritics become more important.  In English it’s fair to spell résumé as resume, but it seems cooler to add the accents.  Since we stole (borrowed is more politically correct) the word from French, we have a French-like pronunciation of résumé, and aren’t likely to confuse it with resume.

In most other languages diacritics aren’t optional.  You wouldn’t exchange a z with an s in English just because they look similar.  “A real singer” is a lot different than “a real zinger”.

Recently I encountered the following example: a user wanted to get around those pesky diacritics by mapping to ASCII.

The suggested input was:
    último año de carrera

The desired output was:
    ultimo ano de carrera

My Spanish is nearly non-existent; however, Word’s spell checker tells me these are all legitimate Spanish words, even without the accents.  The meaning goes from something like “the last year of the race” to “I completed the anus of the race.”
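To make the problem concrete, here is a minimal Python sketch of the kind of mapping the user was asking for.  It is only an illustration under assumptions: `strip_diacritics` is a hypothetical helper name, and decomposing to NFD and dropping combining marks is one common way to do this, not necessarily what the user had in mind.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks.

    This is lossy on purpose, to show what gets thrown away.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


original = "último año de carrera"
stripped = strip_diacritics(original)

print(stripped)  # ultimo ano de carrera

# Two different Spanish words collapse into the same string,
# so the original can never be recovered from the output.
print(strip_diacritics("año") == strip_diacritics("ano"))  # True
```

The code runs without complaint, which is exactly the trouble: nothing in the output tells you that the meaning changed along the way.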

Now imagine that you’re trying to reach a new market and you do that to your customers’ names or potential customers’ names.  How long will they remain your customers?

– Shawn

 

Comments (5)

  1. alex says:

    "carrera" translates [literally] to race, but in the context you’re using it, it means the studies that someone has to do to get a bachelor degree (or something like that)

    Example:

    The last year of college

    And without diacritics:

    The last ass[hole] of college

    I’m not sure if I’ve illustrated well the example — my English sucks

  2. nalaka says:

    I’m reading some data from my database and writing it to a CSV file using a StreamWriter. I have diacritics in some of the fields.

    e.g. Château, Viña

    When I view the CSV file in Notepad it shows the diacritics correctly, but when I view it in Excel it shows some funny characters, such as:

    Château, Viña

    What is the reason for this, and how can I overcome this problem?

  3. Shawn Steele says:

    StreamWriter should be using UTF-8 as the default output (which is good, don’t change that).  Notepad recognizes that and shows you the UTF-8 data, but it sounds like Excel isn’t doing that.

    I think that changing "File Origin:" in Excel to "65001: Unicode (UTF-8)" (they sort alphabetically by encoding name) will solve your problem.  Since I don’t work for Office I’m not sure, nor do I know if this is the same in all versions.

  4. Sue Cooper says:

    When I receive e-mail that has Unicode (UTF-8) in the right heading, I can never download it.

  5. Shawn Steele says:

    Bummer, is your email client a Microsoft product? (I could pass along a bug).

    Email unfortunately is a bit picky about encodings and code pages.  UTF-8 isn’t particularly special in this way.  Some phones for example only support certain encodings.  

    The good news is that the IETF has an  EAI (Email Address Internationalization) working group – http://www.ietf.org/html.charters/eai-charter.html  

    The EAI is working toward enabling UTF-8 email throughout the system, including the local part and body of the email.  It’ll take a while for the standard to get implemented, but at least it’s progress.
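The Excel symptom from comments 2 and 3 is easy to reproduce.  The following is a minimal Python sketch, not the .NET StreamWriter code from the comment.  It assumes the reading program falls back to Windows-1252 (`cp1252`) when it doesn’t detect UTF-8; the actual fallback depends on the machine’s settings.  Writing a UTF-8 byte order mark is a commonly used workaround that the thread does not mention, and whether a given Excel version honors it is also an assumption.

```python
text = "Château, Viña"

# The writer produces UTF-8 bytes, as recommended in comment 3.
utf8_bytes = text.encode("utf-8")

# A reader that wrongly assumes Windows-1252 turns each two-byte
# UTF-8 sequence into two unrelated characters.
misread = utf8_bytes.decode("cp1252")
print(misread)  # ChÃ¢teau, ViÃ±a

# The data is still UTF-8 either way; "utf-8-sig" only prepends a
# byte order mark that some readers use to detect the encoding.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3])  # b'\xef\xbb\xbf'
print(with_bom.decode("utf-8-sig") == text)  # True
```

Note that nothing was lost in the file itself.  The bytes are correct, and only the reader’s guess about the encoding is wrong, which is why telling Excel the file origin is the right kind of fix.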