Despite a remarkable increase in computing and internet penetration over the past few decades, India’s population still struggles with a deep digital language divide. While English is the most common language on the internet, only about 12 percent of India’s population is familiar with English. India’s increasing digital literacy needs to be supported by a multilingual digital world. At Microsoft, we are committed to accelerating the use of regional languages online to help more and more Indians experience the power of computing and the internet.
Developing better tools to translate diverse regional languages with high quality and accuracy is essential to our mission to empower everyone, everywhere through the power of computing. As part of this mission, we are launching a cutting-edge deep neural network (DNN)-powered translator for Hindi, Tamil, and Bengali. Here’s how our researchers developed this DNN model to bridge the digital language divide and make the internet more accessible for millions:
Challenges in Indian language translation
The key challenge with developing digital translation capabilities is the availability of data. Traditional statistical machine translation models rely on vast datasets to accurately translate a given query or sentence into another language.
These datasets include sentences in a particular language and accurate corresponding translations of that sentence to another language. Translation tools rely on millions of sets of such unique parallel pairs of sentences.
Although India is the second most populous country in the world with six languages that are globally dominant in terms of the number of native speakers, most Indian languages are underrepresented in online exchanges.
Of the 447 different languages spoken in India, none makes the list of the top 50 digital languages. In other words, parallel sentence pairs are scarce for Indian languages. This lack of training data, in terms of both quantity and quality, poses a major challenge for digital translation.
Adding to the complexity are the subtle differences in enunciation, accent, diction, and slang across various regions in India. For example, two native Hindi speakers from different regions of a Hindi-speaking belt may have divergent ways of constructing a sentence or describing the same thing.
This combination of complexity and lack of data has stymied the development of accurate translation tools for Indian users. However, using recent advances in deep learning and artificial neural networks, we have developed a translation model for Indian languages that is more accurate while relying on less training data.
DNN model more accurate than statistical models
Applying the DNN model to translation produces output that is more accurate than that of traditional statistical machine translation (SMT) models. Accuracy is measured with the BLEU (Bilingual Evaluation Understudy) score, an external metric, alongside an internal test. Microsoft Translator switched from statistical machine translation to deep neural networks in 2016.
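To make the BLEU idea concrete, here is a minimal sketch of a single-reference, sentence-level BLEU score. It is a simplification for illustration only: the function names are hypothetical, there is no smoothing, and production evaluations normally use corpus-level BLEU from an established toolkit rather than hand-written code like this.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precision (n = 1..max_n) and a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count at its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the fire broke out at night"
print(bleu("the fire broke out last night", reference))
```

A candidate identical to the reference scores 1.0, and the score falls as fewer word sequences match.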
Our deep neural network model for language translation is based on mimicking the way the human brain works. An artificial neural network replicates the neurons in the brain to absorb data and learn to translate sentences in various languages with significant accuracy.
Unlike earlier models, deep neural networks or DNN models work on established theories about pattern recognition in the brain of bilingual or multilingual people. In other words, these algorithms learn to translate languages the way humans do. The result is that translations are more accurate, more human-sounding, and more fluent than before.
Training the neural network
The neural network-powered algorithm is trained on a curated database of translated sentences. The database is curated by scrubbing the data to eliminate errors and by standardizing the text encoding (converting everything to Unicode).
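As an illustration of this kind of cleanup, the sketch below normalizes sentence pairs and drops empty or duplicate ones. The specific rules (NFC normalization, replacing control characters, collapsing whitespace) and the function names are assumptions made for the example, not a description of the actual pipeline. Converting text from legacy, non-Unicode encodings is a separate step that is not shown here.

```python
import unicodedata


def normalize_text(text):
    """Return text in NFC form with control characters removed and whitespace collapsed."""
    text = unicodedata.normalize("NFC", text)
    # Replace control characters (category Cc) with spaces. Format characters
    # such as ZWJ/ZWNJ are kept because Indic scripts use them meaningfully.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    return " ".join(text.split())


def clean_corpus(pairs):
    """Normalize (source, target) pairs and drop empty or duplicate pairs."""
    seen = set()
    cleaned = []
    for source, target in pairs:
        pair = (normalize_text(source), normalize_text(target))
        if not pair[0] or not pair[1] or pair in seen:
            continue
        seen.add(pair)
        cleaned.append(pair)
    return cleaned


raw_pairs = [
    ("Hello  world\n", "नमस्ते दुनिया"),
    ("Hello world", "नमस्ते दुनिया"),  # duplicate once normalized
    ("", "दुनिया"),                     # empty source side
]
print(clean_corpus(raw_pairs))
```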
Since Indian languages are morphologically rich, the model deploys a morphological analyzer to enable root and affix segmentation. Moreover, owing to the paucity of data, the model is trained to dynamically decide how much training is required to prevent overfitting.
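The text does not say which mechanism decides when to stop training. One common way to decide this dynamically is early stopping on a held-out validation set, sketched below. The `train_step` and `validate` callables and the `patience` parameter are hypothetical placeholders for this example.

```python
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=3):
    """Train until validation loss has not improved for `patience` epochs.

    Returns (best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = validate(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss stopped improving: likely overfitting
    return best_epoch, best_loss


# Simulated validation losses: they improve, then get worse as the model overfits.
simulated_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 0.6]
best = train_with_early_stopping(
    train_step=lambda epoch: None,  # stand-in for one pass over the training data
    validate=lambda epoch: simulated_losses[epoch],
    max_epochs=len(simulated_losses),
    patience=3,
)
print(best)
```

With a patience of 3, the loop stops before it ever sees the final, lower loss value. That is the trade-off of this heuristic: a small patience saves training time but can stop too soon.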
We also generate data synthetically through back translations, which gives the model more to work with and learn from. Synthetic data not only augments the resources for training the neural network but also facilitates an iterative bootstrapping process of machine learning, allowing the model to gain fluency and accuracy with limited data.
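The back-translation idea can be sketched as follows: take real sentences in the target language, translate them back into the source language with an existing model, and use the resulting pairs as extra training data. The `target_to_source` callable and the toy word-by-word lookup are hypothetical stand-ins. A real system would use a trained reverse translation model.

```python
def back_translate(monolingual_target, target_to_source):
    """Build synthetic (source, target) pairs from target-language sentences.

    The target side is human-written text. Only the source side is
    machine-generated, so it may be noisy.
    """
    synthetic = []
    for target_sentence in monolingual_target:
        source_sentence = target_to_source(target_sentence)
        if source_sentence:  # skip sentences the reverse model could not handle
            synthetic.append((source_sentence, target_sentence))
    return synthetic


# Toy reverse "model": a word-by-word lookup, for demonstration only.
TOY_LEXICON = {"नमस्ते": "hello", "दुनिया": "world"}


def toy_hindi_to_english(sentence):
    words = [TOY_LEXICON.get(word) for word in sentence.split()]
    return " ".join(words) if all(words) else ""


pairs = back_translate(["नमस्ते दुनिया", "अज्ञात शब्द"], toy_hindi_to_english)
print(pairs)
```

In an iterative setup, the forward model trained on these pairs can in turn help improve the reverse model, and the cycle repeats.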
This new neural network architecture is based on a single Recurrent Neural Network (RNN) layer, built from either Gated Recurrent Units (GRU) or Long Short-Term Memory (LSTM) units, at the bottom, with a substantial number of fully connected (FC) layers on top. This design allows CPU-based decoders, which do not require specialized hardware, to be built into mobile software platforms.
The RNN+FC architecture is faster, more accurate, and cheaper to deploy. Training the model is also far more data-efficient.
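To show the general shape of an RNN+FC stack, here is a small forward-pass sketch in plain Python: one GRU cell at the bottom and several fully connected layers on top. Everything specific in it is an assumption made for illustration, including the layer sizes, the ReLU activation, the omitted bias terms, the random weights, and the class names. It is not the production model.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Single GRU cell (bias terms omitted to keep the sketch short)."""

    def __init__(self, input_size, hidden_size, rng):
        # One input matrix W and one recurrent matrix U per gate:
        # update (z), reset (r), and candidate state (h).
        self.W = {g: rand_matrix(hidden_size, input_size, rng) for g in "zrh"}
        self.U = {g: rand_matrix(hidden_size, hidden_size, rng) for g in "zrh"}

    def step(self, x, h):
        z = sigmoid(add(matvec(self.W["z"], x), matvec(self.U["z"], h)))
        r = sigmoid(add(matvec(self.W["r"], x), matvec(self.U["r"], h)))
        gated = [ri * hi for ri, hi in zip(r, h)]
        pre = add(matvec(self.W["h"], x), matvec(self.U["h"], gated))
        candidate = [math.tanh(v) for v in pre]
        # Blend the old state and the candidate state using the update gate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


class RnnFcStack:
    """One recurrent layer at the bottom, fully connected layers on top."""

    def __init__(self, input_size, hidden_size, num_fc_layers, seed=0):
        rng = random.Random(seed)
        self.rnn = GRUCell(input_size, hidden_size, rng)
        self.fc = [rand_matrix(hidden_size, hidden_size, rng) for _ in range(num_fc_layers)]

    def step(self, x, h):
        h = self.rnn.step(x, h)
        out = h
        for weights in self.fc:
            out = [max(0.0, v) for v in matvec(weights, out)]  # ReLU
        return out, h


model = RnnFcStack(input_size=4, hidden_size=8, num_fc_layers=3)
output, state = model.step([1.0, 0.0, 0.0, 0.0], [0.0] * 8)
print(len(output), len(state))
```

Only the single recurrent layer has to be evaluated step by step. The fully connected layers are plain matrix multiplications, which is one plausible reason such a stack suits CPU-based decoding.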
How it works
The algorithm breaks down the process of translating a sentence into four distinct steps:
Step 1. Speech Recognition (Speech to Text)
Recognizing spoken words and converting them into text is the most crucial step in the process, because the quality of the initial input determines the quality of the eventual output. Microsoft speech translation technologies use an advanced LSTM neural network architecture for this step.
Step 2. TrueText
The second step involves applying TrueText techniques to eliminate quirks in the data. This step scrubs the data of natural pauses, incomprehensible words and repetitions so that the text format is more readable by the neural network.
Step 3. Translation
Converting the text from the source language to the target language is the third step in the process. Applying the DNN model to convert text into another language is more accurate than traditional statistical machine translation methods.
Step 4. Output (Text to Speech)
The last step involves synthesizing the text-based translation into speech.
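The four steps above can be sketched as a simple pipeline. In this illustration the speech recognition and speech synthesis functions are placeholders, the filler-word list is invented, and the cleanup rule (drop fillers and immediately repeated words) is only a rough approximation of what a TrueText-style step does. All function names are hypothetical.

```python
FILLERS = {"um", "uh", "er", "hmm"}  # assumed filler words for the example


def recognize_speech(audio):
    """Step 1 placeholder: a real system would run a speech recognizer here."""
    return audio["transcript"]


def true_text(raw_text):
    """Step 2: drop filler words and immediate word repetitions."""
    cleaned = []
    for word in raw_text.split():
        token = word.lower().strip(",.")
        if token in FILLERS:
            continue
        if cleaned and cleaned[-1].lower().strip(",.") == token:
            continue  # simplification: also removes legitimate repeats
        cleaned.append(word)
    return " ".join(cleaned)


def translate(text, translator):
    """Step 3: delegate to whatever translation model is supplied."""
    return translator(text)


def synthesize_speech(text):
    """Step 4 placeholder: a real system would return audio, not a dict."""
    return {"speech_for": text}


def speech_to_speech(audio, translator):
    text = recognize_speech(audio)
    text = true_text(text)
    translated = translate(text, translator)
    return synthesize_speech(translated)


# str.upper stands in for a translation model so the example runs on its own.
result = speech_to_speech({"transcript": "um hello hello world"}, str.upper)
print(result)
```

Each stage consumes the previous stage's output, so an error made in Step 1 propagates through every later step.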
Consider the following example:
Try and compare neural network translations at http://translate.ai
In the English-to-Hindi translation above, the DNN model extracts each word and passes it through a layer of ‘neurons’ before encoding it in a 1,000-dimensional vector. This puts every word within the context of every other word in the sentence.
Once every word is encoded in this manner, the process is repeated several times to refine how each word is placed in context across the 1,000 dimensions. At this point, an attention layer (a software algorithm) eliminates unnecessary words from the final output matrix, and a decoder layer translates every word into the target language.
Every word passes through these layers so that every subsequent word is translated appropriately within the context of the sentence. This results in more accurate and human-like translations.
In the example, the model correctly translates the words ‘broke out’ as ‘भड़क उठी’ rather than ‘तोड़ा दम’ because it is aware of the context.
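The attention step described above can be illustrated with a tiny numeric sketch. Given one vector per source word and the decoder's current state, attention assigns each source word a weight and blends the word vectors accordingly. The dot-product scoring, the 2-dimensional vectors (instead of 1,000), and the function names are assumptions for this example. The source does not specify how the production model scores words.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention over the encoded source words.

    Returns (weights, context), where context is the weighted sum of the
    encoder states.
    """
    scores = [
        sum(q * k for q, k in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Three encoded source words; the decoder state is most similar to the first.
encoder_states = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
weights, context = attend([1.0, 0.0], encoder_states)
print(weights)
```

Source words with near-zero weight contribute almost nothing to the next output word, which is one way to read the idea of dropping unnecessary words.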
Bridging the digital language gap
The more precise and human-like DNN-based translation technologies can help us fill the digital language gap and have potential applications across the Microsoft ecosystem.
For example, this model is already built into Microsoft Edge and helps users convert web pages and online text into their preferred language. Similarly, a user can apply this technology to edit text in several languages while using Microsoft Word or translate an email delivered by Microsoft Outlook. Microsoft PowerPoint has a feature based on this model that generates translated subtitles for a presenter during a slide presentation.
Developers from across the world have access to Cognitive Services APIs based on this model to develop apps in foreign languages and integrate this model into their app’s functionality. Developers can leverage this technology to tackle other language-related challenges and create more accessible apps.
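As a hedged illustration of what calling such an API can look like, the sketch below builds (but does not send) an HTTP request in the style of version 3.0 of the Translator Text REST API. The endpoint, API version, and header names reflect our understanding of that API and should be checked against the current documentation. The key and region values are placeholders, and `build_translate_request` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request

# Assumed global endpoint for the Translator Text API (verify before use).
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, to_lang, key, region=None):
    """Build a POST request asking for `text` to be translated into `to_lang`."""
    query = urllib.parse.urlencode({"api-version": "3.0", "to": to_lang})
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    if region:
        # Some resource types also require the resource's region.
        headers["Ocp-Apim-Subscription-Region"] = region
    body = json.dumps([{"Text": text}]).encode("utf-8")
    return urllib.request.Request(
        f"{ENDPOINT}?{query}", data=body, headers=headers, method="POST"
    )


request = build_translate_request("Hello", "hi", key="YOUR_KEY")
print(request.full_url)
```

Sending the request, for example with `urllib.request.urlopen(request)`, requires a valid subscription key. The response is JSON containing the translated text.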
Increasing computing penetration using regional languages
By applying the new DNN model to machine translation, Indian languages can be converted easily and accurately. Fluent translations of India’s many languages should help narrow the digital gap between users in various parts of the country.
With more developers confidently creating apps in regional Indian languages, the network of online resources can finally be extended to the masses. Hundreds of millions of regional language speakers can then access resources designed in their own language across education, healthcare, banking, e-commerce, entertainment, agriculture, and travel, among others.
Addressing the language barrier can help solve a number of day-to-day challenges. India’s internal migrants can communicate with their new neighbours, and travellers can find their way around by converting signposts from one language to another. Citizens can access central and state government-issued documents in their preferred language. Better translations will also facilitate increased e-commerce, trade, communication, and civic participation.
Indeed, DNN-powered translations can take us one step closer to a world unified by powerful computing.