How can I detect the language a run of text is written in?


A customer asked, "I have a Unicode string. I want to know what language that string is in. Is there a function that can give me this information? I am most interested in knowing whether it is written in an East Asian language."

The problem of determining the language in which a run of text is written is rather difficult. Many languages share the same script, or at least very similar scripts, so you can't just go based on which Unicode code point ranges appear in the string of text. (And what if the text contains words from multiple languages?) With heuristics and statistical analysis and a large enough sample, the confidence level increases, but reaching 100% confidence is difficult. I vaguely recall that there is a string of text which is a perfectly valid sentence in both Spanish and Portuguese, but with radically different meanings in the two languages!

The customer was unconvinced of the difficulty of this problem. "Language detection of a single Unicode character should work with 100% accuracy. After all, the operating system already has a function to do this. When I pass the run of text to GDI, it knows to use a Chinese font to render the Chinese characters and a Korean font to render the Korean characters."

The customer has fallen into the trap of confusing scripts with languages. The customer in this case is an East Asian company, so they have entered the linguistic world with a mindset that each language has its own unique script, since that is true for the languages in their part of the world.

It's actually kind of interesting seeing a different set of linguistic assumptions. Whereas companies in the United States assume that every language is like English, it appears that companies in East Asia assume that every language is like English, Japanese, Chinese, Korean, or Thai. In this company's world, the letter "A" is clearly English, since it never occurred to them that it might be German, Swedish, or French.

When GDI is asked to render a run of text, it looks for a font that can render each specific character, and once it finds such a font, it tries to keep using that font until it runs into a character which that font doesn't support, and then it begins a new search. You can see this effect when a string containing a non-Western character is rendered on a system whose default code page is Western. GDI will switch to a font that supports the non-Western character, and it will keep using that font for the remainder of the string, even though the rest of the string uses just the letters A through Z. For example, when rendering the string "Dvořak", GDI switches to a different font to render the "ř" and remains in that font instead of returning to the original font for the "ak".
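The "keep using the current font until it fails" behavior described above can be illustrated with a small simulation. This is only a sketch of the idea, not GDI's actual implementation: the function name `split_into_font_runs`, the font names, and the character coverage sets are all hypothetical, invented for the illustration.

```python
import string

def split_into_font_runs(text, fonts):
    """Split text into (font_name, run_text) pairs.

    fonts is a list of (name, supported_chars) in search order.
    Both the fonts and their coverage are made up for this sketch.
    """
    runs = []                      # list of [font_name, run_text]
    current_name = None
    current_chars = frozenset()    # coverage of the font in use
    for ch in text:
        if ch not in current_chars:
            # The current font cannot render this character,
            # so begin a new search from the top of the font list.
            for name, chars in fonts:
                if ch in chars:
                    break
            else:
                name, chars = None, frozenset()   # no font has it
            if not runs or name != current_name:
                runs.append([name, ""])
            current_name, current_chars = name, chars
        # Otherwise stay in the current font, even if an earlier
        # font in the list could also render this character.
        runs[-1][1] += ch
    return [(name, run) for name, run in runs]

# Hypothetical fonts: one covers only A-Z/a-z, the other also has U+0159.
WESTERN = frozenset(string.ascii_letters)
CENTRAL_EUROPEAN = frozenset(string.ascii_letters + "\u0159")
FONTS = [("Western", WESTERN), ("Central European", CENTRAL_EUROPEAN)]

print(split_into_font_runs("Dvo\u0159ak", FONTS))
```

With these made-up fonts, the "ak" stays in the second font because that font also supports those letters, which is the effect described in the Dvořak example. Notice that nothing in this process looks at language at all; it only asks whether a font has a glyph for a character.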

Anyway, the answer to the customer's question of language detection is to use the language detection capability of the Extended Linguistic Services.

If you are operating in the more constrained world of "I just want to know if it's Chinese/Japanese/Korean/Thai or isn't," then you could fall back to checking Unicode character ranges. If you see characters in the ranges dedicated to characters from those East Asian scripts, then you found text which is (at least partially) in one of those languages. Note, however, that this algorithm requires continual tweaking because the Unicode standard is a moving target. For example, the range of characters which can be used by East Asian languages expanded with the introduction of the Supplementary Ideographic Plane. You're probably best just letting somebody else worry about this, say, by asking GetStringTypeEx for CT_CTYPE3 information, or using GetStringScripts (or its redistributable doppelgänger DownlevelGetStringScripts) or simply by asking ELS to do everything.
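For illustration, here is roughly what the do-it-yourself range check might look like. It is a minimal sketch under loud assumptions: the table below is a hand-picked, deliberately incomplete subset of Unicode blocks, the names `EAST_ASIAN_RANGES`, `east_asian_blocks`, and `contains_east_asian` are made up for the example, and the table would need the continual tweaking mentioned above as the standard grows.

```python
# Hand-maintained and NOT exhaustive: an illustrative subset of the
# Unicode blocks used by Thai, Japanese, Korean, and Chinese text.
EAST_ASIAN_RANGES = [
    (0x0E00, 0x0E7F, "Thai"),
    (0x1100, 0x11FF, "Hangul Jamo"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0x20000, 0x2FFFF, "Supplementary Ideographic Plane"),
]

def east_asian_blocks(text):
    """Return the names of the listed ranges that occur in text."""
    found = set()
    for ch in text:
        code_point = ord(ch)
        for low, high, name in EAST_ASIAN_RANGES:
            if low <= code_point <= high:
                found.add(name)
                break
    return found

def contains_east_asian(text):
    """True if at least one character falls in a listed range."""
    return bool(east_asian_blocks(text))

print(contains_east_asian("Dvo\u0159ak"))             # Latin only
print(east_asian_blocks("abc \u65e5\u672c"))          # Han ideographs
print(east_asian_blocks("\ud55c\uad6d\uc5b4"))        # Hangul syllables
```

Even when this reports a hit, it has identified a script, not a language: a run of Han ideographs could be Chinese or Japanese, which is exactly the script-versus-language confusion discussed earlier.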

Comments (25)
  1. Random832 says:

    "When I pass the run of text to GDI, it knows to use a Chinese font to render the Chinese characters and a Korean font to render the Korean characters." – how does he know it's not using a Japanese font to render the Chinese characters?

  2. A few years back, a couple of friends and I were staying in a hostel in the north of Japan. Since we were a big group of foreigners from different countries, we spoke English, since we had no other shared language (other than Japanese of varying levels of proficiency). The hostel owners' ten-year-old daughter commented on this, and we asked her how she knew that we were speaking English (as opposed to some other foreign language). Her reply was simply "because it's not Japanese". :)

    Similarly the neighbour's kids (I live in Japan) also refer to me as the "eigo-jin" (roughly translates as "English language person") despite the fact that I am, in fact, Dutch.

  3. Joshua Ganes says:

    I was all geared up to suggest something similar (but less detailed) to your final paragraph before reading your final paragraph.

    I find that Chrome is surprisingly good when it comes to detecting if a web page is written in another language. It is also reasonably competent at translating the text for me. This also depends a lot on the language.

  4. Matt says:

    For those writing code targeting earlier versions of Windows, Chrome's language detection module is written in C++ and is open-source.

    src.chromium.org/…/cld

    Someone blogged a bit of info on extracting this into a standalone library:

    blog.mikemccandless.com/…/language-detection-with-googles-compact.html

  5. Andreas says:

    I vaguely recall that there is a string of text which is a perfectly valid sentence in both Spanish and Portuguese, but with radically different meanings in the two languages!

    "My hand is in warm water" is perfectly valid English (obviously) and Afrikaans. Means the same thing in both languages, though

  6. RichardDeeming says:

    "I don't like football" is perfectly valid English and American, and has a subtly different meaning in both.

    "I'm not wearing any pants" is also valid in both, and has a significantly different meaning.

  7. SimonRev says:

    Ok, Richard I'll bite.  I know the American meaning for "I'm not wearing any pants", but what would it mean if I said that in England?

  8. RonBass says:

    "Ok, Richard I'll bite.  I know the American meaning for "I'm not wearing any pants", but what would it mean if I said that in England?"

    I'm not wearing underwear.

  9. You wrote: "each language has its own unique script, since that is true for the languages in their part of the world".

    This is not true, even for East Asia.  Obviously so for Chinese script which is used to write Mandarin and Cantonese (linguistically they are related but distinct languages), as well as Japanese, Korean, Vietnamese (historically) and many other languages in the region.  Japanese scripts (two more of them apart from the adopted Chinese script, Kanji — Hiragana and Katakana) are used for Japanese but Hiragana is also used, albeit not consistently, for at least 6 more languages in the Ryukyuan family.  Even Korean script has been used for another language, Cia-Cia of Indonesia.  And Thai script, of course, is used for many languages within Thailand (which is still problematic on Windows because of assumptions that Thai script = Thai language, at input and rendering time).

    [Nitpick: "… since that is true for the languages in their part of the world which they are aware of." -Raymond]

  10. M Hotchin says:

    @SimonRev

    I think it goes like this:

    American 'Pants' == English 'Trousers'

    English 'Pants' == American 'Underwear'

  11. An American says:

    I believe that an Englishman saying "I'm pissed, let's go smoke a ***" would be very poorly misinterpreted by an American.

  12. dave says:

    @An American

    Did you actually type three asterisks, or did some (American!) software decide that the word you typed was a bad word?

  13. mh says:

    *dvořák

  14. cheong00 says:

    Marc: What you said enters a different level of complexity. Mandarin can be written in Traditional Chinese or Simplified Chinese (a lot of earlier literature was still printed in Traditional Chinese), while Cantonese is mostly written in Traditional Chinese. And some of the "Han" characters in the Japanese character set are the same as Simplified Chinese, while the others are mostly the same as Traditional Chinese. (I write it like this because some of these "supposedly Han" characters do not actually exist in either Traditional Chinese or Simplified Chinese.) Not to mention that quite a lot of characters are written the same way in both Traditional Chinese and Simplified Chinese.

    Because of this, accurate detection of language in Unicode, even if restricted to only the CJK region, is next to impossible.

  15. Jon says:

    I enjoyed reading this article, particularly the "It's actually kind of interesting seeing a different set of linguistic assumptions" section.

    @Richard, it's interesting that in Australia, "pants" follows the American meaning.

  16. TC says:

    it's interesting that in Australia, "pants" follows the American meaning.

    as opposed to, say, "root". I just suggested to someone in a climbing website, that he shouldn't talk about doing some "roots" if he comes to Australia! (In oz, "root" is a crude word meaning – how can I put this politely – "having fun" :-)

  17. TC says:

    And further to "pants": in England, climbers and related sportspeople sometimes use "pants" to mean "poor", "unimpressive":

    "On second pitch there was gear left where someone had ab-ed off, guess they thought the rest of the route was pants" … "After 50 miles [on the bike] I came home, I thought the route was pants with far too many long straights"

  18. Drak says:

    @TC: it's not only sportspeople who use it that way. It's pretty common among the British people I associate with.

  19. Rick C says:

    If "pants" means "underwear" or "substandard" in England, what's the word that means pants in American English?

  20. Joshua says:

    @Rick C: "trousers", "lame" depending on how to interpret your question.

  21. Rick C says:

    I can see how I didn't word that perfectly clearly, but "trousers" is the word that answers the question I meant to ask.  I guess that's an example of linguistic drift, because "trousers" is almost an obsolete word here, at least among the general public.  I'd imagine if you went to a tailor or perhaps a gentlemen's clothing store (by which I mean where you'd buy suits, tuxedos, and the like, as opposed to a general clothes store) you might still see the word "trousers" used.

  22. cheong00 says:

    Btw, since we have so many unallocated code point ranges in UTF-32, could we introduce another decomposition form that includes this kind of language information on each character? (I'm half joking, but I do think that it doesn't hurt to provide a standardized way for people to do things that they want to do, while others need not care.)

  23. Yuhong Bao says:

    I think it probably comes from the mindset of the legacy encodings era where all the ASCII extensions were mutually exclusive.

  24. evacchi says:

    «I Vitelli dei Romani sono belli» is a valid sentence in both Italian and Latin, but with different meanings

  25. bumblefoot2004 says:

    Try this program, Polyglot3000

    http://www.polyglot3000.com/
