Politically Incorrect Machines

While we on the Machine Translation team have seen traffic to our various offerings increase steadily over the past few months, we noticed a sudden bump in traffic yesterday. Having grown up on Agatha Christie and Sherlock Holmes, I find such mysteries irresistible, and a number of other folks on the team were just as curious to find out what caused the bump. We determined that the IE8 Activity/Accelerator, the Messenger Bot, Search translations, and Office translations were all showing the same upward trend as in the days before, and thus were not the specific reason for the bump.

Eventually, we were able to identify one potential reason for the spike. Our user community had found an oddity in how the machine translation engine handled several names when translating from English to German. Given the current political atmosphere in the run-up to the US elections, it was to be expected that an engine translating the name of one party’s candidate into the name of someone from the other party would end up as news. While we certainly welcome all the new users who came by to check this phenomenon out, we wanted to share with our users why such things happen from time to time with statistically trained machine translation systems, both ours and others’.

A Statistical Machine Translation engine is trained on lots and lots of parallel data, that is, data that exists in both a source language (e.g., English) and a target language (e.g., German), where the source and target are translations of one another. Our engine is trained on millions of sentences for each language pair we support. In order to train on a particular corpus of data (say, a large number of newswire articles in English that have been translated into German), we first have to break that corpus down into sentences. Once the corpus is sentence broken, we feed the resulting sentences into a sentence aligner, the sole purpose of which is to find which sentences on the source side align with which sentences on the target side. This is no trivial task, since a sentence on one side could conceivably align with one or more sentences on the other side (or possibly none at all!). The aligner will sometimes make mistakes and misalign one sentence with another that is in fact not its translation.

Such misalignments can lead to mistranslations, especially when the sentences involved contain words that occur infrequently. Since our translation engine is statistical, it is highly reliant on co-occurrence frequencies between words in the source and target data. If certain words occur infrequently (people’s names, for instance, may occur only a few times across a corpus of millions of sentences), the lack of frequency can lead to mistranslations resulting from incorrect “guesses” between source and target (i.e., low probabilities assigned to particular source and target words). This can lead to some comical gaffes in our translation system.
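To make the co-occurrence point concrete, here is a toy sketch in Python. It is not our engine: the five-sentence corpus, the names "alice" and "bob", the functions `train` and `best_translation`, and the use of the Dice coefficient as the association score are all assumptions chosen for illustration. It only shows how a single misaligned sentence pair can dominate the statistics for a word that occurs just once.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus of (English, German) sentence pairs.
# The last pair is deliberately misaligned: the German side is about a
# different person than the English side.
CORPUS = [
    ("the house is small", "das haus ist klein"),
    ("the house is old", "das haus ist alt"),
    ("the book is old", "das buch ist alt"),
    ("the teacher spoke", "der lehrer sprach"),
    ("alice spoke", "bob sprach"),  # misaligned pair
]


def train(pairs):
    """Count, per sentence pair, which source and target words co-occur."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return src_count, tgt_count, pair_count


def best_translation(word, model):
    """Return the target word with the highest Dice score for `word`."""
    src_count, tgt_count, pair_count = model
    scores = {
        t: 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
        if s == word
    }
    if not scores:
        return word  # unseen word: pass it through unchanged
    return max(scores, key=scores.get)


model = train(CORPUS)
print(best_translation("house", model))  # haus
print(best_translation("alice", model))  # bob
```

The frequent word "house" is seen in two correctly aligned pairs, so the right target word wins comfortably. The name "alice" is seen only once, in the misaligned pair, so the only evidence available points at the wrong name, and there is nothing else in the data to outvote it.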

So that is how the “machine” arrived at a translation that the community attributed to our team’s sense of humor. While we continue to work hard to ensure proper alignments, a statistical system built on millions to billions of words can be expected to produce such a situation again.

The current alignment issue should now be resolved, but we urge our community of users to keep helping us identify any such situations by contacting us through this blog.


Vikram Dendi leads Business Strategy & Product Planning for the Microsoft Translator team
Comments (11)

  1. The Microsoft Translator team is very proud to announce the technology preview of an innovative offering

  2. This is a repost from the Microsoft Research Machine Translation (MSR-MT) Team Blog by permission, and

  3. From elsewhere in the collective.

  4. Chee Wee says:

    The translation is badly broken for the Chinese Simplified version of this page.

    To reproduce, just select Chinese Simplified and read the translation 🙂

  5. Leon says:

    There seem to be some mistranslations to Dutch in the system too – "irresistible" is being translated to "bezwijken" (which means to faint or to give in) instead of "onweerstaanbaar".

    I guess this happens when a text is translated with a different expression. "I couldn’t resist" -> "I had to give in"

    And my personal favorite mistranslation is the checkbox "Remember me". This is interpreted as a question, so the Dutch translation reads "Do you still remember me?"

  6. William Lewis says:

    Thanks, Leon, for your very helpful comments on our Dutch translations. We’re constantly working on fixes to our translator, and constructive comments like yours really help!

    William Lewis

    Senior Program Manager, Data Acquisition

    Microsoft Translator

  7. I believe the Statistical Machine Translation engine will get better over time.

  8. computerproducts says:

    This machine translation needs to get better. Microsoft has the resources and the brain to make it happen.

  9. Curious says:

    Given the size of the data set you use as the source of your translation engine, what happens to text that I enter into the online translation engine? Is it stored or retained? Is it attributable to an IP address? Should I be worried someone will see or be able to retrieve the love notes that I’m having translated?

  10. Omid says:

    I believe if you add a Persian dictionary, it can make about 50 million users linked to your web-translator.
