It will soon be easier to see Facebook and Instagram posts in some of the world’s less-spoken languages, but experts suggest that to improve the tool, Meta should talk to native speakers.
Soon you’ll be able to easily view Facebook and Instagram posts in 200 less-spoken languages around the world.
Meta’s No Language Left Behind (NLLB) project announced the expansion of its proprietary technology in a paper published this month.
The project covers 12 “low-resource” European languages, among them Scottish Gaelic, Galician, Irish, Ligurian, Bosnian, Icelandic and Welsh.
Meta defines a “low-resource” language as one with fewer than 1 million sentences of available data.
Experts say the tool still has room for improvement and Meta should consult with native speakers and language experts to improve the service.
How the project works
Meta trains its artificial intelligence (AI) on data from the Opus repository, an open-source collection of authentic spoken and written texts in different languages that can be used to train machine-learning models.
Contributors to the dataset are experts in natural language processing (NLP), a subset of AI research that gives computers the ability to translate and understand human language.
Meta said it also adds data mined from sources such as Wikipedia to its database.
This data is used to create what Meta calls a multilingual language model (MLM), which allows the AI to translate “between any language combination without relying on English language data,” according to the company’s website.
The NLLB team assesses the quality of translations using a benchmark of open-source human-translated sentences they created, which includes a list of “harmful” words and phrases that software can be taught to filter out when translating text.
In its latest paper, the NLLB team reports that translation accuracy has improved by 44 percent over its first model, published in 2020.
Once the technology is fully implemented, Meta estimates that it will enable more than 25 billion translations every day on Facebook News Feed, Instagram, and other platforms.
“Talk to people”
William Lamb, professor of Gaelic ethnology and linguistics at the University of Edinburgh, is an expert on Scottish Gaelic, one of the under-resourced languages identified in Meta’s NLLB project.
Around 2.5% of Scotland’s population, roughly 130,000 people, said in the 2022 census that they could speak at least some of the 13th-century Celtic language.
There are also about 2,000 Gaelic speakers in eastern Canada, where it is a minority language. UNESCO classifies Gaelic as “endangered” because so few people actively speak it.
Lamb noted that although Meta’s “heart is right,” its Scottish Gaelic translations are “still not very good” because of the crowdsourced data they rely on.
“If they really want to improve translations, what they should do is talk to Gaelic speakers who still live and breathe Gaelic,” Lamb said.
“That’s easy to say, but hard to do,” Lamb continued, noting that most native speakers are in their 70s and don’t use computers, and younger speakers “don’t have the habit of using Gaelic like their grandparents did.”
A good alternative, Lamb said, would be for Meta to enter into a licensing agreement with the BBC, which is committed to preserving Gaelic and produces high-quality online content in the language.
“This needs to be done by professionals.”
Alberto Bugarín-Diz, an AI professor at the University of Santiago de Compostela in Spain, thinks linguists like Lamb should work with big tech companies to improve the available datasets.
“This needs to be done by experts who can revise the text, correct it and update it with metadata that we can use,” Bugarín-Diz said.
“We need people from the humanities and technical backgrounds like engineers to work together. This is something we really need,” he added.
Bugarín-Diz went on to say that for Meta, the advantage of using Wikipedia is that its data reflects “almost every aspect of human life,” so the quality of the language can be much higher than if only more formal texts were used.
But Bugarín-Diz suggests that Meta and other AI companies should take the time to find quality data online and then meet the legal requirements necessary to use that data without violating intellectual property laws.
Meanwhile, Lamb said that unless Meta makes some changes to the dataset, he would not recommend that people use the tool, as there are errors in the data.
“Their translation capabilities are not yet at a level where the tools would actually be useful,” Lamb said.
“I still wouldn’t recommend it to anyone as a reliable language tool, and I think they would be honest about that.”
Bugarín-Diz takes a different stance.
He believes that if no one uses Meta’s translations, there will be “no incentive” to invest time and resources in improving them.
As with any AI tool, Bugarín-Diz believes it is important to know the technology’s weaknesses before using it.