Detecting what language or script a run of text is written in, redux


Some time ago, I discussed the confusion surrounding the question, "How can I detect the language a run of text is in?" because the person asking the question was from an East Asian country, and in that part of the world, scripts and languages line up pretty closely. Chinese uses Hanzi, Korean uses Hangul, Japanese has a few scripts, Thai has its own alphabet, and so on. There is overlap, sure, but overall, you can tell what language a run of text is in without understanding anything about the language. You just have to see what script it's written in.

By comparison, the languages of Western Europe nearly all use the Latin alphabet. You need to know something about the languages themselves in order to distinguish French from Italian.

And then there are languages like Serbian and Chinese which have multiple writing systems. In Chinese, you can write in either Simplified or Traditional characters. In Serbian, you can choose between Latin or Cyrillic characters.

Extended Linguistic Services tries to address all three of these issues.¹

  • Language Detection guesses what language the text might be written in, offering its results in decreasing order of confidence.

  • Script Detection breaks a string into segments, each of which shares the same script.

  • Transliteration converts text from one writing system to another.
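As a toy illustration of the shape of Language Detection's output (this is emphatically not the ELS algorithm, and the stopword lists are made up for the example), here's a sketch that ranks candidate languages by counting hits against tiny hand-picked function-word sets and returns guesses in decreasing order of confidence:

```python
# Hypothetical mini stopword sets, for illustration only.
STOPWORDS = {
    "en": {"the", "is", "and", "that", "to"},
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "et", "est", "que"},
}

def guess_languages(text):
    # Score each language by how many of its function words appear.
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in STOPWORDS.items()}
    # Best guess first; languages with no hits are dropped entirely.
    return sorted((lang for lang, s in scores.items() if s > 0),
                  key=lambda lang: -scores[lang])
```

Real detectors use character n-gram statistics over large corpora, which is why longer inputs give better results.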

I'm not going to write a Little Program to demonstrate this because there are already plenty of existing samples.

When you adapt these samples into production code, note that MSDN recommends that you enumerate services only once, and then reuse the result, rather than enumerating each time you need the service.

(It appears to me that the Extended Linguistic Services was over-engineered. Enumeration seems unnecessary since there are only three services. Trying to force each service to use the same MAPPING_PROPERTY_BAG seems unnecessarily complicated. But what do I know. Maybe there's a method to their madness.)²

Instead of showing yet another sample, I'll just show the output of the services on various types of input. Note that language detection generally improves the longer the input, so these short snippets can generate lots of false positives.

Language detection

  Input                            Results
  That's Greek to me.              en, hr, sl, sr-Latn, da, es, et, fr, lv, nb, nn, pl, pt, sq, tn, yo
  Das kommt mir spanisch vor.      de, gl, pt, ro
  Αυτά μου φαίνονται κινέζικα.     el
  Это для меня китайская грамота.  ru, be, uk
  看起來像天書。                   zh-Hant, zh

Script detection

  Input: In Greece, they say, "Αυτά μου φαίνονται κινέζικα."
  Runs:  In Greece, they say, "         → Latn
         Αυτά μου φαίνονται κινέζικα."  → Grek

  Input: ラドクリフ、マラソン五輪代表に1m出場にも
  Runs:  ラドクリフ、マラソン  → Kana
         五輪代表              → Hani
         に1                   → Hira
         m                     → Latn
         出場                  → Hani
         にも                  → Hira

Observe that neutral characters (like the quotation mark in the first example and the digit 1 in the second example) get attached to the preceding script run.
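That attachment rule is easy to mimic. Here's a toy sketch (nothing like the real implementation, which consults the full Unicode Script property) that segments a string into script runs, folding neutral characters into the preceding run:

```python
def char_script(ch):
    # Grossly simplified classifier covering just this example's scripts.
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:
        return "Grek"
    if 0x0041 <= cp <= 0x005A or 0x0061 <= cp <= 0x007A:
        return "Latn"
    return None  # neutral: digits, punctuation, spaces, quotation marks

def script_runs(text):
    runs = []  # list of (run_text, script) pairs
    for ch in text:
        script = char_script(ch)
        if runs and (script is None or script == runs[-1][1]):
            # Neutral characters attach to the preceding script run.
            runs[-1] = (runs[-1][0] + ch, runs[-1][1])
        else:
            # Leading neutrals get the placeholder "common" script.
            runs.append((ch, script or "Zyyy"))
    return runs
```

Running this on the first example above reproduces the Latn/Grek split, with the opening quotation mark riding along with the Latin run that precedes it.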

Transliteration

  Transliterator                     Input      Output
  Bengali to Latin                   বাংলা      baaṁmlaa
  Cyrillic to Latin                  Кириллица  Kirillica
  Devanagari to Latin                देवनागरी    devnaagrii
  Malayalam to Latin                 മലയാളം     mlyaaḷṁ
  Simplified to Traditional Chinese  正体字     正體字
  Traditional to Simplified Chinese  正體字     正体字

¹ Why "Extended" linguistic services instead of just plain "linguistic services"? Probably because that gave them a TLA.

² The method to their madness is that they anticipated building an entire empire of linguistic services, maybe even have multiple competing implementations, so your program could say, "You know, the Contoso script detector does a much better job than the Microsoft one. I'll use that if available." Except, of course, in practice, nobody writes script detectors except Microsoft.

Comments (15)
  1. GovindParmar says:

    A lot of languages in north India and Nepal use the Devanagari script. Sure, Hindi is the most common language, but it's also used for Nepali, Marathi, and Sanskrit. Given that this script isn't seen as often as Latin or Chinese, I'm guessing it's not a top priority to distinguish between languages that use Devanagari.

  2. Brian_EE says:

    What a coincidence that this post happens on the same day as this XKCD cartoon: http://xkcd.com/1726/

  3. Eduardo says:

    Over-engineered or protected against anti-trust accusations?

    1. DonH says:

      Exactly correct. Adding new functionality in a way that can't be replaced greatly increases the chances that a DoJ lawyer will redesign your interface.

      1. Antonio Rodríguez says:

        And this is the silliness we have come to. Microsoft provides many non-essential services to developers in Windows, but nobody makes you use them or forbids you from using a competing one. Sure, the interfaces would be different. But if you need to be able to switch libraries or write portable code, the solution is easy: a wrapper library.

  4. Marcin says:

    For a second there, I double-clicked the "TLA" to google what it stands for... sheesh.

    1. Brian_EE says:

      TLA = Truth in Lending Act, also Trial Lawyers Association

  5. Joshua says:

    > Except, of course, in practice, nobody writes script detectors except Microsoft.

    And Google, but I don't think they have a Windows API hook to call it.

  6. cheong00 says:

    I think a long time ago, someone suggested the idea that, "if you have multilingual translation software you can automate, just put the text in and try to translate it into each language the software supports. The language whose translation has the shortest edit distance to the original text is your best guess."

    https://en.wikipedia.org/wiki/Edit_distance
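The edit distance referred to in the comment above is typically Levenshtein distance; a minimal dynamic-programming sketch:

```python
def edit_distance(a, b):
    # Classic Levenshtein distance with a rolling row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```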

  7. Ben says:

    So it turns out that the Extended linguistic services are great at the Stroop test...

  8. Jan says:

    Well, is there a sample how to write a script detector or transliterator?

    1. Um, I linked to two samples in the article.

      1. Jan says:

        As far as I can tell, you linked examples of how to use the existing services Microsoft provides. I was following up on "Except, of course, in practice, nobody writes script detectors except Microsoft." and looking for samples on how to write the services themselves and make them available to the system. Does that make more sense?

        1. Ah, I misunderstood the question. There isn't a sample for writing your own service. I suspect because nobody has come forward and said, "Hey, I want to write an alternate transliterator and make it available to other programs. Can you show me how?"

  9. Brendan Elliott (MSFT) says:

    Running Language Detection on short Latin-script phrases is highly inaccurate if you have fewer than roughly 500 characters.

    If you just want to know the scripts present and not their exact position, GetStringScripts() is a simpler NLS API that does the job just as well.

    Obviously, detecting Hanzi ("Hani") alone is ambiguous as to whether it's Chinese Simplified, Chinese Traditional, Japanese, or Korean, especially for short words/phrases, so that one needs special care as well. Even more so if you are doing Word Breaking afterwards (such as via the WordsSegmenter WinRT API), since there are distinct linguistic behavior differences between the different East Asian word breakers.

    You also forgot to mention the "Korean decomposition" transliteration engine, but that's highly specialized to search prefix matching scenarios and shouldn't be used on user-facing strings.

    And yes, I entirely agree with you that ELS was totally over-engineered...

Comments are closed.
