Google says it has made progress toward improving translation quality for languages that lack abundant written text. In a forthcoming blog post, the company details new techniques that have enhanced the user experience across the 108 languages supported by Google Translate (particularly data-poor languages such as Yoruba and Malayalam), a service that translates an average of 150 billion words daily.
In the 13 years since the public debut of Google Translate, techniques like neural machine translation, rewriting-based paradigms, and on-device processing have led to quantifiable leaps in the platform’s translation accuracy. But until recently, even the state-of-the-art algorithms underpinning Translate lagged behind human performance. Efforts beyond Google illustrate the magnitude of the problem — the Masakhane project, which aims to render thousands of languages on the African continent automatically translatable, has yet to move beyond the data-gathering and transcription phase. And Common Voice, Mozilla’s effort to build an open source collection of transcribed speech data, has vetted only 40 voices since its June 2017 launch.
Google says its translation breakthroughs weren’t driven by a single technology, but rather a combination of technologies targeting low-resource languages, high-resource languages, general quality, latency, and overall inference speed. Between May 2019 and May 2020, as measured by human evaluations and BLEU, a metric based on the similarity between a system’s translation and human reference translations, Translate improved an average of 5 or more points across all languages and 7 or more across the 50 lowest-resource languages. Moreover, Google says that Translate has become more robust to machine translation hallucination, a phenomenon in which AI models produce strange “translations” when given nonsense input (such as “Shenzhen Shenzhen Shaw International Airport (SSH)” for the Telugu characters “ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష,” which mean “Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh”).
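The BLEU scores Google cites compare n-gram overlap between a system translation and a human reference. A minimal, illustrative sentence-level variant is sketched below; real evaluations use corpus-level BLEU with more careful smoothing, so this is a toy for intuition, not the metric Google ran:

```python
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of smoothed, clipped
    n-gram precisions, multiplied by a brevity penalty that punishes
    translations shorter than the reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: a candidate n-gram only counts as often as it
        # appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0, and unrelated sentences score near zero; the "5 or more points" Google reports are on the conventional 0-100 scale (this function's value times 100).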
Hybrid models and data miners
The first of these technologies is a translation model architecture — a hybrid architecture consisting of a Transformer encoder and a recurrent neural network (RNN) decoder implemented in Lingvo, a TensorFlow framework for sequence modeling.
In machine translation, encoders generally encode words and phrases as internal representations the decoder then uses to generate text in a desired language. Transformer-based models, which Google-affiliated researchers first proposed in 2017, are demonstrably more effective at this than RNNs, but Google says its work suggests most of the quality gains come from only one component of the Transformer: the encoder. That’s perhaps because while both RNNs and Transformers are designed to handle ordered sequences of data, Transformers don’t require that the sequence be processed in order. In other words, if the data in question is natural language, the Transformer doesn’t need to process the beginning of a sentence before it processes the end.
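The point about ordering can be made concrete with a toy sketch: a self-attention encoder computes each position's representation from all positions at once, while a recurrent network must fold through the sequence left to right. The scalar "embeddings" and lack of learned weights below are illustrative simplifications, not Google's architecture:

```python
import math

def attention_encode(embs):
    """Self-attention over scalar embeddings: every position attends to
    every position, so outputs can be computed in parallel and no
    position waits on an earlier one."""
    out = []
    for q in embs:
        scores = [q * k for k in embs]                 # dot-product scores
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]    # softmax (stable)
        z = sum(weights)
        out.append(sum(w / z * v for w, v in zip(weights, embs)))
    return out

def rnn_encode(embs):
    """A minimal recurrence: each hidden state depends on the previous
    one, so the sequence must be consumed strictly in order."""
    h, states = 0.0, []
    for x in embs:
        h = math.tanh(0.5 * h + x)
        states.append(h)
    return states
```

Note the asymmetry: the attention output at position 0 already reflects the last token of the sequence, whereas the RNN's first state cannot, which is one intuition for why the encoder benefits most from the Transformer.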
Still, the RNN decoder remains “much faster” at inference time than the decoder within the Transformer. Cognizant of this, the Google Translate team applied optimizations to the RNN decoder before coupling it with the Transformer encoder to create low-latency hybrid models that are higher in quality and more stable than the four-year-old RNN-based neural machine translation models they replace.
Beyond the novel hybrid model architecture, Google upgraded the decades-old crawler it uses to compile training corpora from millions of example translations in sources like articles, books, documents, and web search results. The new miner, which is embedding-based for 14 large language pairs rather than dictionary-based (meaning it uses vectors of real numbers to represent words and phrases), focuses more on precision (the fraction of retrieved data that is relevant) than recall (the fraction of all relevant data that is actually retrieved). In production, Google says this increased the number of sentences the miner extracted by 29% on average.
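The precision-over-recall trade-off of an embedding-based miner can be sketched in a few lines: candidate sentence pairs are kept only when their vector representations are nearly parallel. The hand-made embedding table and threshold below are stand-ins; production systems use learned multilingual sentence embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_sents, tgt_sents, embed, threshold=0.9):
    """Keep a (source, target) pair only if its embeddings are very
    close. A high threshold favors precision (few false pairs) over
    recall (some true pairs get discarded)."""
    mined = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: cosine(embed(s), embed(t)))
        if cosine(embed(s), embed(best)) >= threshold:
            mined.append((s, best))
    return mined

# Toy bilingual embedding table (hand-made for illustration).
emb = {"hello": [1.0, 0.0], "bonjour": [0.99, 0.1], "cheese": [0.0, 1.0]}
mined = mine_pairs(["hello"], ["bonjour", "cheese"], embed=lambda s: emb[s])
```

Raising the threshold drops borderline matches entirely, which is exactly the precision-first behavior the article describes.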
Noisy data and transfer learning
Another translation performance boost came from a modeling method that better handles noise in training data. Following from the observation that noisy data (data containing a large amount of information that can't be understood or interpreted correctly) harms translation of languages for which data is plentiful, the Google Translate team deployed a system that assigns scores to examples using models trained on noisy data and fine-tuned on “clean” data. Effectively, the models begin training on all data and then gradually train on smaller and cleaner subsets, an approach known in the AI research community as curriculum learning.
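A hedged sketch of that curriculum: score every example with a noise model (a hand-assigned score stands in for it here), then train in phases on progressively smaller, cleaner subsets. The phase fractions and scorer are illustrative assumptions, not Google's published recipe:

```python
def curriculum_phases(examples, noise_score, fractions=(1.0, 0.5, 0.25)):
    """Order examples from cleanest to noisiest, then yield one training
    subset per phase: all data first, then ever-cleaner slices."""
    ranked = sorted(examples, key=noise_score)  # low score = clean
    for frac in fractions:
        keep = max(1, int(len(ranked) * frac))
        yield ranked[:keep]

# Toy corpus of (sentence pair, hand-assigned noise score).
data = [("good pair A", 0.1), ("good pair B", 0.2),
        ("dubious pair", 0.6), ("garbage pair", 0.9)]
phases = list(curriculum_phases(data, noise_score=lambda ex: ex[1]))
```

Each phase would drive further training of the same model, so the noisiest examples still contribute early on but never dominate the final epochs.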
On the low-resource language side of the equation, Google implemented a back-translation scheme in Translate that augments parallel training data, where each sentence in a language is paired with its translation. (Machine translation traditionally relies on the statistics of corpora of paired sentences in both a source and a target language.) In this scheme, training data is automatically augmented with synthetic parallel data, such that the target text is natural language but the source is generated by a neural translation model. The result is that Translate takes advantage of the more abundant monolingual text data for training models, which Google says is especially helpful in increasing fluency.
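Back-translation itself reduces to a short recipe: run a target-to-source model over monolingual target-language text, then pair its synthetic output with the original human text as new training examples. The `fake_reverse_model` below is a placeholder for a trained reverse translator, not a real API:

```python
def back_translate(monolingual_target, reverse_model):
    """Build synthetic parallel data: the target side is natural human
    text; the source side is machine-generated by a target-to-source
    model. The forward model then trains on (synthetic source,
    natural target) pairs."""
    return [(reverse_model(t), t) for t in monolingual_target]

# Stub standing in for a trained target-to-source translation model.
def fake_reverse_model(sentence):
    return "<synthetic> " + sentence

pairs = back_translate(["Bonjour le monde", "Merci beaucoup"],
                       fake_reverse_model)
```

Because the natural text sits on the target side, the forward model learns to produce fluent output even when its source side is imperfect, which matches the fluency gains Google describes.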
Translate also now makes use of M4 modeling, where a single, giant model — M4 — translates among many languages and English. (M4 was first proposed in a paper last year that demonstrated it improved translation quality for over 30 low-resource languages after training on more than 25 billion sentence pairs from over 100 languages.) M4 modeling enabled transfer learning in Translate, so that insights gleaned through training on high-resource languages including French, German, and Spanish (which have billions of parallel examples) can be applied to the translation of low-resource languages like Yoruba, Sindhi, and Hawaiian (which have only tens of thousands of examples).
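One common way a single multilingual model of this kind is steered, as described in the M4 line of work, is by prefixing each source sentence with a token naming the desired target language, so one shared network serves every pair and low-resource pairs inherit from high-resource ones. A minimal sketch of that preprocessing (the token format is illustrative):

```python
def tag_for_target(sentence, target_lang):
    """Prepend a target-language token so a single multilingual model
    knows which language to emit; all other parameters are shared
    across languages, which is what enables transfer from
    high-resource to low-resource pairs."""
    return f"<2{target_lang}> {sentence}"

batch = [("Hello world", "yo"),   # English -> Yoruba (low-resource)
         ("Hello world", "fr")]   # English -> French (high-resource)
tagged = [tag_for_target(s, t) for s, t in batch]
```

At training time both tagged examples flow through the same network, so patterns learned from billions of French examples are available when the model sees its few tens of thousands of Yoruba ones.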
Looking ahead
Translate has improved by at least 1 BLEU point per year since 2010, according to Google, but automatic machine translation is by no means a solved problem. Google concedes that even its enhanced models fall prey to errors including conflating different dialects of a language, producing overly literal translations, and performing poorly on particular genres of subject matter and on informal or spoken language.
The tech giant is attempting to address this in various ways, including through its Google Translate Community, a gamified program that recruits volunteers to help improve performance for low-resource languages by translating words and phrases or checking whether translations are correct. Just in February, the program, in tandem with emerging machine learning techniques, led to the addition to Translate of five languages spoken by a combined 75 million people: Kinyarwanda, Odia (Oriya), Tatar, Turkmen, and Uyghur (Uighur).
Google isn’t alone in its pursuit of a truly universal translator. In August 2018, Facebook revealed an AI model that uses a combination of word-for-word translations, language models, and back-translations to outperform existing systems on some language pairings. More recently, MIT Computer Science and Artificial Intelligence Laboratory researchers presented an unsupervised model — i.e., a model that learns from test data that hasn’t been explicitly labeled or categorized — that can translate between texts in two languages without direct translational data between the two.
In a statement, Google diplomatically said it’s “grateful” for the machine translation research in “academia and industry,” some of which informed its own work. “We accomplished [Google Translate’s recent improvements] by synthesizing and expanding a variety of recent advances,” said the company. “With this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages.”
Author | Bao Yonggang