Mixed language research: The key to making machines human?

Developed over thousands of years, our vast array of languages now helps us do more than simply communicate. They help us express emotions, signal our identities, and convey the nature of our relationships.

A unique element of human communication is the phenomenon of code-switching or code-mixing. Prevalent in most multilingual and multicultural societies, code-mixing is the fusion of two or more languages in everyday speech. Bilingual or multilingual people often mix words from their broad set of languages to convey messages on a deeper, more intimate level. A cacophony of languages is becoming more common on the Internet as social media platforms bring us closer than ever.

While code-mixing comes naturally to people who speak multiple languages, online tools and social platforms cannot yet detect and translate code-mixed messages effectively. Researchers at Microsoft’s Project Mélange studied this phenomenon in an attempt to build tools and machines that can go beyond the literal words and tap into a deeper layer of human communication.

The multilingual, multicultural web

Intriguing results from preliminary studies motivated the Project Mélange team. Technical analysis of a broad dataset of public tweets indicated that roughly 5-10% of tweets were code-mixed to varying degrees. Further observations indicated a higher prevalence of code-mixing in multicultural societies across India and Europe, and lower levels in countries such as China and the United States.

Results also indicated that despite the prevalence of code-mixing, mainstream social platforms and online technologies struggle to accurately detect and translate these messages. Code-mixed messages were being translated based solely on the dominant language. For example, if a message containing 15% Spanish words and 85% English words were to be translated into Mandarin, only the English 85% would be translated, leaving the Spanish words unaltered.
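To make the dominant-language problem concrete, here is a minimal sketch in Python of word-level language tagging by dictionary lookup. It is an illustration only, not the method used by Project Mélange or by any social platform. The word lists, the function names `tag_words` and `language_shares`, and the sample message are all hypothetical.

```python
# Toy word lists. These are hypothetical and far too small for real use.
ENGLISH = {"i", "will", "see", "you", "at", "the", "party", "tomorrow"}
SPANISH = {"amigo", "fiesta", "hola", "gracias"}


def tag_words(message):
    """Label each word 'en', 'es', or 'unknown' by dictionary lookup."""
    tags = []
    for word in message.lower().split():
        token = word.strip(".,!?\"'")  # drop surrounding punctuation
        if token in ENGLISH:
            tags.append((token, "en"))
        elif token in SPANISH:
            tags.append((token, "es"))
        else:
            tags.append((token, "unknown"))
    return tags


def language_shares(tags):
    """Return the fraction of words assigned to each label."""
    counts = {}
    for _, lang in tags:
        counts[lang] = counts.get(lang, 0) + 1
    total = len(tags)
    return {lang: n / total for lang, n in counts.items()}


message = "I will see you at the fiesta tomorrow, amigo!"
tags = tag_words(message)
shares = language_shares(tags)
dominant = max(shares, key=shares.get)  # label with the largest share

# A translator that handles only the dominant language would skip these words.
skipped = [word for word, lang in tags if lang != dominant]
print(dominant, skipped)
```

A tool that translates only the dominant language would leave the words in `skipped` untouched, which is the gap described above. Tagging each word separately is one possible way to expose that gap; real systems would need far richer models than a lookup table.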

Considering there are nearly 328 million active Twitter users who speak 65 different languages, this is a considerable gap. This gap is more pronounced in a global context, since there are 6,900 known languages in the world. That number could be magnified when considering blended languages such as Spanglish (Spanish-English), Singlish (Singapore English), and Hinglish (Hindi-English).

Research into code-mixing and switching is pertinent to ensure the Internet becomes more and more accessible to everyone on the planet.

Researchers at Project Mélange are driven by two important questions:

  1. Can technology detect and translate code-mixed speech and text with higher accuracy?
  2. Can the artificially intelligent machines of the future understand subliminal messages and social gestures conveyed through mixed codes?

In other words, the team wanted to understand how people mix codes and why they do so, in order to build machines that can translate code-mixed messages and apply them in appropriate context.

Making machines understand mixed languages

Detecting and translating code-mixing is an engineering challenge. Algorithms are rule-based and have become steadily better at detecting languages and translating them for users online. However, code-mixing breaks the assumptions these algorithms rely on, leading to lower accuracy.

A recent study by the Microsoft Research team of 1.25 million tweets from Hindi-English bilinguals revealed that users were more likely to switch to English when talking about formal or factual concepts (narrative-evaluative code-mixing). For example, the technical details in the tweet “Petrol prices up by Rs. 3.18/litre, diesel by Rs. 3.09/litre. Sab ki aesi tesi kr di.” are in English. Hindi was more common when users wanted to reinforce a sentiment or be sarcastic. For example, the Hindi words in the tweet “Best wishes to the Indian team Tiranga aapke saath hai!” are used only to reinforce a positive sentiment of encouragement. Mixing codes for narrative-evaluative or reinforcement purposes was most common, making up 21.64% and 19.24%, respectively, of all the code-mixed tweets in the data set.

The most interesting finding of this study is that users prefer to express negative sentiment in Hindi. In fact, the fraction of swear words and other forms of profanity was far higher in the Hindi parts of the tweets than in the English parts.

The study also detected other characteristics of code-mixing. Quotation marks were used to emphasize a word or phrase in its original form. Hindi was retained when the original conversation was in Hindi but was being reported in English.

Project Mélange researchers also found that tools to detect and translate code-mixed messages were available. However, the accuracy of these tools varied. Code-mixing between similar or closely related languages, such as Spanish and Catalan, was more difficult to detect than mixing between dissimilar languages, such as Hindi and English. A broader range of languages also compromised accuracy.

Encouraged by their findings, the team is applying more data and research to create better translators and language detection software. Applying machine learning tools will improve the algorithms over time and eventually lead to machines that can speak to humans on a more personal and relatable level.

Broader questions

Beyond the technical aspects of detecting and translating code-mixed messages, the team wanted to answer some deeper questions about human communication. Although their studies have helped establish how codes are mixed, the researchers are now intrigued by the question of why we mix codes.

Borrowing from decades of research in linguistics, the team has tried to verify whether code-mixing is a complex form of social signaling, whether men are more likely than women to code-mix when swearing, and whether code-mixing helps users establish deeper relationships with their counterparts or signal dominance and social clout.

If research can answer these lingering questions, we believe artificially intelligent machines of the future could do more than simply translate code-mixed messages. They could use code-mixed messages to establish a relationship or detect subconscious emotions conveyed by the user.

These capabilities could make machines more human than ever, thus augmenting their utility in countless ways.
