Multilingual translation at scale: 10000 language pairs and beyond

2021-11-22 18:26 Microsoft Translator Blog

Microsoft is on a quest for AI at Scale with high ambition to enable the next generation of AI experiences. The Microsoft Translator ZCode team is working together with Microsoft Project Turing and Microsoft Research Asia to advance language and multilingual support at the core of this initiative. We continue to push frontiers with multilingual models to support various language scenarios across Microsoft. Last summer, we announced our large-scale Multi-Lingual Mixture of Experts model with DeepSpeed, which can outperform individual large-scale bilingual models. Recently, the latest Turing universal language representation model (T-ULRv5), a Microsoft-created model, was once again state of the art, sitting at the top of the Google XTREME public leaderboard at the time. More recently, Microsoft announced the largest Megatron-Turing NLG model with 530B parameters.

The annual Conference on Machine Translation (aka WMT 2021) concluded last week in beautiful Punta Cana, Dominican Republic. WMT brings together researchers from across the entire machine translation field, both industry and academia, to participate in a series of shared tasks, each defining a benchmark in an important area of machine translation to push the field into new frontiers.

The Microsoft Translator ZCode team, working together with the Turing team and Microsoft Research Asia, competed in the “Large-scale Multilingual Translation” track, which consisted of a Full Task of translating between all 10,000 directions across 101 languages, and two Small Tasks: one focused on 5 central and southern European languages, and one on 5 south-east Asian languages. The Microsoft ZCode-DeltaLM model won all three tasks by huge margins, including an incredible 10+ point gain over the M2M100 model in the large task, evaluated on a massive 10,000 language pairs (Findings of the WMT 2021 Shared Task on Large-Scale Multilingual Machine Translation, Wenzek et al., WMT 2021).

Figure 1: Official results (BLEU scores) on the Full Task and Small Task 1 at the WMT 2021 Large-Scale Multilingual Translation shared task

The ZCode-DeltaLM approach

In this blog post, let’s take a look under the hood at the winning Microsoft ZCode-DeltaLM model. Our starting point was DeltaLM (DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders), the latest in an increasingly powerful series of massively multilingual pretrained language models from Microsoft.

DeltaLM is an encoder-decoder model, but instead of being trained from scratch, it is initialized from a previously pretrained state-of-the-art encoder-only model, specifically TULRv3. While initializing the encoder is straightforward, the decoder is less so, since it adds cross-attention to the encoder’s self-attention. DeltaLM solves this problem with a novel interleaved architecture, in which self-attention and cross-attention alternate between layers: self-attention is used in the odd layers and cross-attention in the even layers. With this interleaving, the decoder structure matches the encoder, so it can also be initialized in the same way from TULRv3.

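To make the interleaved decoder concrete, here is a minimal PyTorch-style sketch of the idea: a stack in which the odd layers carry only self-attention (mirroring an encoder layer) and the even layers carry only cross-attention over the encoder output. The class names, default sizes, and layer details are our own illustration, not the actual DeltaLM implementation.

```python
# Minimal sketch of an interleaved decoder (illustration only, not the DeltaLM code).
# Odd layers (1, 3, 5, ...) use self-attention, matching the encoder layout, while
# even layers (2, 4, 6, ...) use cross-attention over the encoder output.
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class InterleavedDecoderLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int, use_cross_attention: bool):
        super().__init__()
        self.use_cross_attention = use_cross_attention
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_out, self_attn_mask=None):
        if self.use_cross_attention:
            # Even layer: queries come from the decoder states, keys/values from the encoder output.
            attn_out, _ = self.attn(x, encoder_out, encoder_out)
        else:
            # Odd layer: (causal) self-attention over the decoder states only.
            attn_out, _ = self.attn(x, x, x, attn_mask=self_attn_mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))


class InterleavedDecoder(nn.Module):
    def __init__(self, n_layers: int = 12, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        # Layer 1 is self-attention, layer 2 is cross-attention, and so on.
        self.layers = nn.ModuleList(
            InterleavedDecoderLayer(d_model, n_heads, d_ff, use_cross_attention=(i % 2 == 1))
            for i in range(n_layers)
        )

    def forward(self, x, encoder_out, self_attn_mask=None):
        for layer in self.layers:
            x = layer(x, encoder_out, self_attn_mask=self_attn_mask)
        return x
```

Because every layer in this arrangement has an encoder-shaped counterpart, the decoder can be initialized from the pretrained TULRv3 encoder as described above; the sketch only illustrates the layer layout, not the initialization itself.
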
DeltaLM is augmented with ZCode’s powerful multitask learning (Multi-task Learning for Multilingual Neural Machine Translation). Our models show that combining multitask and multilingual learning can significantly improve training of large-scale pretrained language models. This multitask multilingual learning paradigm leverages the inductive bias and regularization from several tasks and languages simultaneously to perform better on various downstream tasks. We use a translation task, a denoising autoencoder task, and a translation span corruption task, as shown in the figure below.

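The figure referenced above is not reproduced here, but as a rough illustration of how the three objectives can be built from a parallel sentence pair and monolingual text, here is a small hypothetical sketch. The noising scheme, masking ratio, and special tokens below are our own assumptions for illustration, not the actual ZCode-DeltaLM preprocessing.

```python
# Hypothetical sketch of building training examples for the three multitask objectives
# (translation, denoising autoencoder, translation span corruption). All details here are
# illustrative assumptions, not the actual ZCode-DeltaLM data pipeline.
import random


def translation_example(src: str, tgt: str, src_lang: str, tgt_lang: str):
    # Plain machine translation: source sentence in, target sentence out.
    return {"input": f"<{src_lang}> {src}", "target": f"<{tgt_lang}> {tgt}"}


def denoising_example(text: str, lang: str, drop_ratio: float = 0.15):
    # Denoising autoencoder: corrupt monolingual text (here by dropping random tokens)
    # and train the model to reconstruct the original sentence.
    tokens = text.split()
    kept = [t for t in tokens if random.random() > drop_ratio]
    return {"input": f"<{lang}> " + " ".join(kept), "target": f"<{lang}> {text}"}


def span_corruption_example(src: str, tgt: str, src_lang: str, tgt_lang: str, span_len: int = 3):
    # Translation span corruption: mask a contiguous span on the target side and ask the
    # model to generate it, conditioned on the source and the visible target context.
    tokens = tgt.split()
    start = random.randrange(max(1, len(tokens) - span_len))
    masked = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    return {
        "input": f"<{src_lang}> {src} </s> <{tgt_lang}> " + " ".join(masked),
        "target": " ".join(tokens[start:start + span_len]),
    }


if __name__ == "__main__":
    src, tgt = "The cat sat on the mat .", "Le chat était assis sur le tapis ."
    print(translation_example(src, tgt, "en", "fr"))
    print(denoising_example(tgt, "fr"))
    print(span_corruption_example(src, tgt, "en", "fr"))
```

In training, examples from all three objectives and from many languages would be mixed into the same batches, which is where the shared inductive bias and regularization come from.
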
Winning the massively multilingual translation track

To build our winning massively multilingual translation system (Multilingual Machine Translation Systems from Microsoft for WMT21 Shared Task), we started with ZCode-DeltaLM and added a few tricks:

- We apply progressive learning, first training a model with 24 encoder layers and 12 decoder layers, then continuing training with 12 added encoder layers, resulting in a deep 36-layer encoder.
- To cover all language pairs, we generate dual-pseudo-parallel data, where both sides of the parallel data are synthetic, translated by the model from English.
- We also apply iterative back-translation to generate synthetic data.
- We apply curriculum learning, starting with the entire noisy training data and then reducing it to a clean subset.
- We re-weight the translation objective to favor parallel data over the back-translation and dual-pseudo-parallel data.
- We apply temperature sampling to balance across language pairs (sketched below).
- For each language pair, we choose, based on the dev set, whether to prefer direct translation or pivot translation through English.

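Among these tricks, temperature sampling is simple enough to sketch. The standard recipe is to sample each language pair with probability proportional to its share of the data raised to the power 1/T: with T = 1 sampling follows the raw data distribution, while larger T up-samples low-resource pairs. The temperature value and corpus sizes below are placeholders, not the ones used for the actual system.

```python
# Hypothetical sketch of temperature sampling over language pairs (illustration only).
# With T = 1 pairs are sampled in proportion to their data size; larger T flattens the
# distribution, giving low-resource pairs a larger share of each training batch.
import random


def temperature_sampling_probs(pair_sizes: dict, temperature: float = 5.0) -> dict:
    total = sum(pair_sizes.values())
    # p_i is proportional to (n_i / total) ** (1 / T), then renormalized.
    scaled = {pair: (n / total) ** (1.0 / temperature) for pair, n in pair_sizes.items()}
    norm = sum(scaled.values())
    return {pair: w / norm for pair, w in scaled.items()}


if __name__ == "__main__":
    # Toy corpus sizes (sentence pairs); real WMT21 sizes differ by orders of magnitude.
    sizes = {"en-fr": 40_000_000, "en-et": 2_000_000, "km-en": 100_000}
    probs = temperature_sampling_probs(sizes, temperature=5.0)
    print(probs)
    # Draw a batch of language pairs according to the balanced distribution.
    batch = random.choices(list(probs), weights=list(probs.values()), k=8)
    print(batch)
```

With the toy sizes above and T = 5, en-fr falls from roughly 95% of sampled pairs to about 54%, while km-en rises from under 0.3% to about 16%.
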
Putting it all together, we knew we had an amazing massively multilingual system, but the official results on the blind test set exceeded our expectations. We scored 2.5 to 9 BLEU points ahead of the next competitor, and 10 to 21 BLEU points ahead of the baseline M2M-175 model. On the dev test we compared against the larger M2M-615 model, which we also beat by 10 to 18 points.

Beyond Translation: Universal Language Generation

While we are excited about the big win at WMT 2021, what’s even more exciting is that, unlike the other competitors, our ZCode-DeltaLM model is not just a translation model, but rather a general pretrained encoder-decoder language model, usable for all kinds of generation tasks beyond translation. This really enables our models to perform quite well on various multilingual natural language generation tasks.

We reached a new SOTA in many popular generation tasks from the GEM Benchmark, including Wikilingua (summarization), text simplification (WikiAuto), and structure-to-text (WebNLG). The DeltaLM-ZCode model widely outperforms much larger models such as mT5 XL (3.7B), which was also trained on much more data. This demonstrates the efficiency and versatility of the models, leading to strong performance across many tasks.

Figure 2: Performance (RL scores) of ZCode-DeltaLM on the summarization and text simplification tasks in the GEM benchmark

Looking Ahead

Multilingual machine translation has reached a point where it performs very well, exceeding bilingual systems, on both low- and high-resource languages. Mixture of Experts (MoE) models have been shown to be a very good fit for scaling up such models, as demonstrated in GShard. We explore how to efficiently scale such models with Mixture of Experts (Scalable and Efficient MoE Training for Multitask Multilingual Models). MoE models with massive multilingual data and unsupervised multitask training present an unprecedented opportunity to build truly universal systems that can further enable the Microsoft Translator team to eliminate language barriers across the world, as well as support a variety of natural language generation tasks.

Acknowledgements

We would like to acknowledge and thank Francisco Guzman and his team, who collected the massively multilingual FLORES test set and organized this WMT track with such a large-scale evaluation.
