Statistical Machine Translation – Guest Blog (Updated with additional paper)

Will Lewis is a program manager on the Microsoft Translator team, working on language quality and data acquisition.  Today's guest blog is a high-level explanation of how the engine works:

As many of you know, under the hood Microsoft Translator is powered by a Statistical Machine Translation (SMT) engine.  Statistical systems differ from rule-based ones in that the “rules” mapping words and phrases from one language to another are learned by the system rather than hand-coded.  Training an SMT engine requires amassing a large amount of parallel training data, ideally of good quality and from heterogeneous sources, and training the engine on that data.  (By parallel, we mean data in which the content in one language is the same as the content in the other.)  The engine learns the correspondences between words and phrases in one language and those in another, and these correspondences are reinforced by repeated occurrences of the same words and phrases throughout the input.  For instance, in training the English-German system, if the engine sees the phrase All rights reserved on the English side and Alle Rechte vorbehalten on the German side, it may align these two phrases and assign some probability to this alignment.  Each further occurrence of the source and target phrases together in the training data reinforces this alignment.
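To make the idea of "assigning some probability to an alignment" concrete, here is a minimal sketch in Python. It is not the Microsoft Translator implementation. It assumes phrase pairs have already been extracted from the parallel data, and it uses a simple relative-frequency estimate. The function name `estimate_phrase_probs` and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def estimate_phrase_probs(aligned_pairs):
    """Estimate P(target | source) by relative frequency.

    aligned_pairs: list of (source_phrase, target_phrase) tuples,
    assumed to come from an earlier phrase-extraction step.
    """
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        # The more often a pair co-occurs, the larger its share of the
        # probability mass for that source phrase.
        probs[src][tgt] = n / source_counts[src]
    return dict(probs)


# Hypothetical toy data: the same English phrase seen four times,
# three times with the full German phrase and once with a partial one.
toy_pairs = [
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Alle Rechte vorbehalten"),
    ("All rights reserved", "Alle Rechte"),
]

probs = estimate_phrase_probs(toy_pairs)
print(probs["All rights reserved"])
```

In this toy setting, each additional co-occurrence of All rights reserved with Alle Rechte vorbehalten raises that pair's estimated probability relative to the competing alignment, which is the reinforcement effect described above.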

Generally, having parallel data for a language pair means we can train engines in both directions (i.e., both the English-German and the German-English systems can be trained on the same input sentences).  Some of you asked why we released the English-Spanish system before Spanish-English.  There were two reasons.  First, English-Spanish was the first general domain language pair we released, and releasing one language pair allowed us to test the infrastructure before we started releasing more.  Second, the technology for Spanish-English was slightly different from that used for English-Spanish, and we needed some additional time to make the necessary infrastructure changes to accommodate it.  In the future, we plan to release new translation systems in pairs (with a couple of exceptions).  I can’t reveal what languages we have planned next, but do expect some new ones soon!

For those of you interested in technical discussions regarding our engines and how they work, please refer to some of the papers by the researchers who developed them.  Three recent papers of note are:

Chris Quirk, Arul Menezes. Do we need phrases? Challenging the conventional wisdom in Statistical Machine Translation. Proceedings of HLT-NAACL 2006, New York, New York, USA, May 2006.

Chris Quirk, Arul Menezes. Dependency Treelet Translation: The convergence of statistical and example-based machine translation? Machine Translation, pp. 43-65, March 2006. (Attached file)

Chris Quirk, Arul Menezes. Using Dependency Order Templates to Improve Generality in Translation. Association for Computational Linguistics, July 2007.

Dependency Treelet Translation The convergence of statistical and example-based machinetranslation.pdf

Comments (15)

  1. anony.muos says:

    Hey Machine Translation Team at MS, I’ve a small suggestion. Google’s Translation now has a "Detect language" feature that automatically detects the foreign language which is very useful. Can you add such a feature to Windows Live Translator?

  2. Chris Wendt says:

    Hello someone, thanks for the suggestion. We’ll plan that for one of our next updates.

  3. slawek says:

    Hey, can I expect that the Polish language will be available in the near future?

  4. Lane says:

    Hi Slawek,  We are always looking to add more languages to improve our engine, but we do not have a specific timeline for individual languages.

  5. Is there a way for programmers to access the translation directly from code?  C# or other programming languages.  Thanks

  6. I am hoping that the machine translation is available as a web service that would allow inputting one language and getting a translation to another language.  I am hoping that this would be available by making a call from a .NET programming language such as C# or any of the other programming languages.  My company is an ISV Microsoft Partner that develops applications for retail and manufacturing companies.  Please let me know if this is available.

  7. Please answer my request.  Can we, as a Microsoft Certified Partner (ISV), access the Microsoft Translation service from our program?


  8. My colleagues at Microsoft Research announced it a few days ago: all the language pairs

  9. The Translator team is excited to announce the availability of the English to Russian language pair on

  10. The Microsoft Translator team is very proud to announce the technology preview of an innovative offering

  11. This is a repost from the Microsoft Research Machine Translation (MSR-MT) Team Blog by permission, and

  12. From elsewhere in the collective.

  13. This is a helpful tip on how machine translation works. I’m writing a project on language translation techniques and reading this article has given me much insight.

  14. computerproducts says:

    It’s good to understand how the machine translation works. But an average person doesn’t need to understand this to use the tool.
