Case Study – Using Extended Linguistic Services (ELS) for Language Detection

Introduction

In the world of international testing, we face a variety of challenges. One of them is ensuring that the correct localization files end up in the target builds. The build matrix is huge because of the large number of languages we localize into, the large number of SKUs, and the multiple architectures.

Furthermore, the international build process involves extra, complex steps compared to the English builds. Some of the complexity comes from the similarity of language codes, which can cause confusion when people check in localization files manually; some comes from the number of manual steps involved. These complexities add risk and may result in the wrong localization files ending up in the builds.

Regular international testing has a hard time validating that the UI is in the expected language for the following two reasons:

· International testing largely depends on automation, which has to be language neutral. It is not designed to detect which language a given build is in.

· Testers often do not have the specific language skills. They cannot tell whether the UI/context is in the expected language during manual testing.

Previously, we had to rely on test engineers with the relevant language skills to install each build and spot-check the UI to ensure that it appeared in the expected language. An alternative approach is to use “cheat sheets” to verify the product after installation. Both methods are costly and cannot guarantee 100% test coverage.

Challenge for EULA verification

Let’s use EULA verification to exemplify the challenges discussed above.

EULA stands for End User License Agreement. A EULA is a legal contract between the manufacturer and/or the author and the end user of an application. It details how the software can and cannot be used and any restrictions that the manufacturer imposes (e.g., most EULAs for proprietary software prohibit the user from sharing the software with anyone else).

Since a EULA is a legal document in the product, it takes very high priority among all content validation. We have done a lot of upstream testing to ensure that there are no errors in the EULA content, that a unique identifier (EULAID) is correctly placed in each file, that there are no linguistic errors, and so on. Even with all this testing, we could not fully validate that the correct localized EULA file exists in each targeted build. For example, if we accidentally put a Croatian EULA into the Russian build, regular upstream testing would not catch the bug.

We have 36 languages, 13 SKUs, 3 architectures, multiple types of EULA, and so on. The result is a huge number of EULA files that need to be verified. Even though strict processes are followed to avoid introducing defects, the test team still needs to verify that the correct EULA exists in each released build. Doing this validation efficiently without knowing the 36 target languages is a rather challenging task.
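
To put the scale in numbers, the short calculation below multiplies out the dimensions quoted above. It is only a rough estimate: it assumes every language ships in every SKU and architecture, which the text does not confirm, and it leaves out the EULA types because their count is not stated.

```python
# Dimensions quoted in the text. EULA types are not enumerated, so they are omitted.
languages = 36
skus = 13
architectures = 3

# Assumption: the full cross product is built (every language in every SKU/architecture).
combinations = languages * skus * architectures
print(combinations)  # 1404
```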

Solution

To address the challenges discussed above, we decided to extract the EULA files from the builds and then use automation to detect the language of each file and make sure it matches what is expected for the build.

There is a new set of services created by the Windows International team, called Extended Linguistic Services (ELS). The ELS components are installed automatically with Windows 7, and they introduce a new API set that allows us to perform a range of operations on text, including script and language detection. We use this service to analyze EULA content and check whether it is in the expected language.

For more detailed information on ELS, go to https://msdn.microsoft.com/goglobal/dd156834.aspx#ELS5

You can find some code samples that show how to use the Microsoft Language Detection Service here:

https://msdn.microsoft.com/goglobal/dd156834.aspx#ELS8

Using ELS, we developed a test application which validates correctness of the localized EULA content.
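
The comparison step of such a tool can be sketched as follows. This is not our actual test application, only a minimal illustration of the idea in Python. The detector is passed in as a plain function and is assumed to return candidate language tags ordered from most to least likely, with an empty list meaning it could not decide. The names `check_eula`, `verify`, and `toy_detect`, and the file names, are hypothetical. The toy detector merely stands in for the ELS language detection service so that the example runs anywhere.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical detector type: takes text, returns language tags ranked by
# likelihood (most likely first). An empty list means "could not decide".
Detector = Callable[[str], Sequence[str]]


def primary(tag: str) -> str:
    """Return the lowercase primary subtag, e.g. 'hr' for 'hr-HR'."""
    return tag.replace("_", "-").split("-")[0].lower()


def check_eula(text: str, expected: str, detect: Detector) -> str:
    """Classify one EULA as 'ok', 'mismatch', or 'inconclusive'."""
    candidates = [primary(t) for t in detect(text)]
    if not candidates:
        return "inconclusive"
    # Simplification: only the top candidate's primary subtag is compared,
    # so regional variants of one language are not told apart.
    if candidates[0] == primary(expected):
        return "ok"
    return "mismatch"


def verify(eulas: Dict[str, Tuple[str, str]], detect: Detector) -> Dict[str, List[str]]:
    """Group file names by result. `eulas` maps name -> (expected_tag, text)."""
    report: Dict[str, List[str]] = {"ok": [], "mismatch": [], "inconclusive": []}
    for name, (expected, text) in eulas.items():
        report[check_eula(text, expected, detect)].append(name)
    return report


def toy_detect(text: str) -> List[str]:
    """Toy stand-in for a real detection service, for demonstration only."""
    lowered = text.lower()
    if "лицензионное" in lowered:
        return ["ru"]
    if "licencni" in lowered:
        return ["hr", "sr", "bs"]
    return []


# Hypothetical extracted files: the second one is a Croatian EULA in a Russian build.
builds = {
    "ru/eula.txt": ("ru-RU", "Лицензионное соглашение"),
    "ru-x64/eula.txt": ("ru-RU", "Licencni ugovor za krajnjeg korisnika"),
}
report = verify(builds, toy_detect)
print(report["mismatch"])  # ['ru-x64/eula.txt']
```

Keeping "inconclusive" separate from "mismatch" means files the detector cannot decide on can be routed to a manual check instead of being reported as wrong.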

Test Result

We have used our application to validate EULA content. It takes about 3-4 hours to run through about 1600 files. About 1% of the results are false positives, caused by the language detection not being able to conclusively identify the language in every scenario. Overall, the tool has been a great addition to our EULA verification process and has helped us identify incorrect files early on.
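
For a sense of what these figures mean per file, the quick calculation below uses only the approximate numbers quoted above, so the results are rough as well.

```python
files = 1600                  # approximate file count from the text
hours_low, hours_high = 3, 4  # reported run time range

# Average time spent per file at each end of the reported range.
sec_per_file_low = hours_low * 3600 / files
sec_per_file_high = hours_high * 3600 / files

# About 1% of the results are false positives that need a manual look.
flagged = round(files * 0.01)

print(sec_per_file_low, sec_per_file_high, flagged)  # 6.75 9.0 16
```

In other words, a run averages roughly 7 to 9 seconds per file and leaves about 16 files to check by hand.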