Facebook unveiled a multilingual machine translation (MMT) model called M2M-100 that can translate directly between any of 100 languages without relying on English as an intermediate language, as many other systems do. It directly covers 1,100 language pairs out of the 4,950 possible combinations.
The new system was trained on a dataset of 7.5 billion sentence pairs spanning 100 languages, gathered with the help of web crawlers. The crawlers scraped billions of sentences from the web, and another model, fastText, identified the language of each one.
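In practice this step used fastText's pretrained language-identification model; as a toy illustration of how character n-gram statistics can separate languages, here is a simplified stand-in (the scoring scheme and the tiny "training" samples below are made up for the example and are not fastText itself):

```python
# Toy illustration of n-gram-based language identification.
# Facebook's pipeline used fastText's pretrained language-ID model;
# this simplified overlap score is a hypothetical stand-in.
from collections import Counter

def char_trigrams(text):
    """Return the multiset of character trigrams in a lowercased string."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Tiny hypothetical reference samples per language (real systems train on far more data).
PROFILES = {
    "en": char_trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": char_trigrams("le renard brun saute par dessus le chien paresseux et le chat"),
}

def identify(sentence):
    """Pick the language whose trigram profile overlaps the sentence most."""
    grams = char_trigrams(sentence)
    def overlap(profile):
        return sum(min(n, profile[g]) for g, n in grams.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(identify("the dog and the fox"))    # overlaps the English profile
print(identify("le chien et le renard"))  # overlaps the French profile
```

Real language identifiers work on the same principle but learn their n-gram statistics from large labeled corpora and cover well over a hundred languages.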
Facebook then used a program called LASER 2.0, developed earlier by the company’s research lab. It uses machine learning that does not require manually labeled data to match sentences by their meaning: LASER 2.0 embeds sentences from different languages into a shared representation space and pairs up sentences whose embeddings lie close together.
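The matching step can be sketched as a mutual-nearest-neighbour search over sentence embeddings. The hand-made three-dimensional vectors below are stand-ins for real multilingual sentence embeddings, and the threshold is an assumption for illustration; LASER's actual mining uses higher-dimensional embeddings and a margin-based score:

```python
# Sketch of embedding-based sentence alignment in the spirit of LASER:
# sentences from two languages live in a shared vector space, and pairs
# that are mutual nearest neighbours (and similar enough) are kept as
# candidate translations. Vectors and threshold are illustrative only.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Keep (i, j) when i and j are mutual nearest neighbours above threshold."""
    pairs = []
    for i, s in enumerate(src_vecs):
        j = max(range(len(tgt_vecs)), key=lambda k: cosine(s, tgt_vecs[k]))
        back = max(range(len(src_vecs)), key=lambda k: cosine(src_vecs[k], tgt_vecs[j]))
        if back == i and cosine(s, tgt_vecs[j]) >= threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical embeddings: row k of each side expresses the same meaning.
src = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
tgt = [[0.9, 0.2, 0.0], [0.1, 1.0, 0.0], [0.0, 0.2, 0.9]]
print(mine_pairs(src, tgt))  # each source sentence pairs with its counterpart
```

Because no human labels which sentence translates which, the quality of the mined pairs depends entirely on how well the embedding space aligns meanings across languages.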
During training, the model focused more on languages that are most often translated to and from one another than on rare translation pairs. All 100 languages were grouped into 14 collections based on cultural and geographic similarities to train the model more accurately.
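One common way to put more weight on high-traffic pairs without starving rare ones is to sample training pairs in proportion to their data size raised to a power between 0 and 1. This is a generic scheme, not Facebook's published recipe, and the pair names and counts below are invented for illustration:

```python
# Hedged sketch: sampling weights proportional to size**alpha. With
# 0 < alpha < 1, high-resource directions still dominate training while
# low-resource directions get boosted relative to their raw share.
# Pair names and counts are hypothetical; this is not M2M-100's exact scheme.

def sampling_weights(pair_sizes, alpha=0.5):
    """Map {pair: sentence_count} to normalized sampling probabilities."""
    scaled = {pair: count ** alpha for pair, count in pair_sizes.items()}
    total = sum(scaled.values())
    return {pair: w / total for pair, w in scaled.items()}

sizes = {"en-fr": 1_000_000, "fr-de": 100_000, "be-mt": 10_000}
weights = sampling_weights(sizes)
for pair, w in weights.items():
    print(f"{pair}: {w:.3f}")
```

With alpha = 1 the weights match the raw data shares exactly; lowering alpha flattens the distribution toward rare pairs.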
The newly developed model outperforms existing systems by 10 points on the BLEU scale, a 0-to-100 metric used to evaluate the quality of machine translations. Human assessment also rated the new model at about 90% accuracy.
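BLEU itself is a mechanical score: roughly, the geometric mean of n-gram precisions between a candidate translation and a reference, scaled by a brevity penalty. Real evaluations compute it at corpus level with tools such as sacreBLEU; the single-sentence sketch below just shows where the number comes from:

```python
# Simplified single-sentence BLEU sketch (0-100 scale): geometric mean of
# modified n-gram precisions for n = 1..4, times a brevity penalty that
# punishes candidates shorter than the reference. A toy version only.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_grams, ref_grams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, ref_grams[g]) for g, c in cand_grams.items())
        precisions.append(overlap / max(sum(cand_grams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return 100 * brevity * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```

A 10-point gap on this scale is a large margin: state-of-the-art systems for high-resource pairs often differ by only a point or two.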
The system is not yet deployed on Facebook, but the company plans to implement it soon: nearly 20 billion translations are performed on the social media platform every day when users click the “Translate” button below posts written in almost 160 languages.
The model is now open-sourced and available to the research community.