US, China, Big Tech, and What’s New in Machine Translation Research


2020-10-16 16:50 slator



The rapid pace of research on all matters machine translation continues unabated on Cornell University's preprint server, arXiv. Studies range from topics of purely academic interest to advances that promise practical improvements to real-life applications. With authors preparing for the (virtual) conference season in the latter part of the year, the September–October period tends to be a busy one for research output.

While authors hail from a variety of institutions and organizations, research out of US and China Big Tech has risen steadily. For Silicon Valley's tech giants, the papers are simply the latest in a stream of continuously accelerating research related to machine translation (MT). Apple ramped up its MT research in preparation for the September 2020 release of its Translate app. Facebook open-sourced CoVoST V2, a "massively multilingual" speech-to-text (STT) translation dataset, in July 2020. And Google's Łukasz Kaiser contributed to a September 2020 paper that claimed MT now outperforms human translation.

Meanwhile, Chinese companies such as Alibaba, WeChat owner Tencent, and TikTok parent ByteDance have released their own studies, in line with the Chinese government's three-year action plan, issued in 2018, to advance the country's AI technology, including speech recognition and MT.

This flurry of machine translation research takes place against the backdrop of a US-China trade war. China updated its restricted-export list in late August 2020 to include more language technologies, and companies must now seek approval from Beijing before exporting such products. Western critics were already wary of Chinese MT offerings due to security issues, such as those raised in an Australian think tank's October 2019 report claiming that China uses state-owned companies, which provide MT services, to collect data on users outside China. (When Microsoft announced in September 2020 that ByteDance had declined to sell TikTok's US operations to Microsoft, it noted that, had the sale gone through, it would have made "significant changes" to the service to protect user privacy.)

Politics aside, the articles reflect some trends in MT research, such as the growing interest in STT translation solutions, as evidenced by Facebook's fairseq S2T and ByteDance's TED and SDST. (Slator covered the Chinese government's investment in speech recognition back in 2019.) Two of Google's articles explore the potential of MT for low-resource languages. The concept of inference also features in two papers, one by Google, the other by Apple.

Uncertainty-Aware Semantic Augmentation for Neural Machine Translation – As every translator knows, there are multiple valid translations for any given source text. In NMT, this concept is called "intrinsic uncertainty." Researchers built a network that does not penalize the use of accurate synonyms and found that MT performance improved consistently across language pairs.

Self-Paced Learning for Neural Machine Translation – Add this paper to the canon of research implying that MT can beat humans at their own game. An NMT engine was improved via "self-paced learning," which mimics the human language learning process, as sketched below.
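Self-paced learning generally weights training examples by the model's current competence, so easier material dominates early training. The PyTorch snippet below is a minimal sketch of that weighting idea; the confidence estimate (an exponentiated negative per-sentence loss) and the function signature are illustrative stand-ins, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def self_paced_loss(logits, targets, pad_id, alpha=1.0):
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )  # (batch, seq_len); padding positions contribute zero
    mask = (targets != pad_id).float()
    sent_loss = (token_loss * mask).sum(1) / mask.sum(1).clamp(min=1)
    # Illustrative confidence: low current loss -> high weight, so the model
    # "studies" easier sentences first. Detached so the weights are not trained.
    confidence = torch.exp(-alpha * sent_loss).detach()
    return (confidence * sent_loss).mean()
```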
Efficient Inference for Neural Machine Translation – Diving deep into the inner workings of NMT, this study explored the ideal combination of techniques for optimizing inference speed in large Transformer models without sacrificing translation quality.

Generative Imagination Elevates Machine Translation – With a title that, at first glance, evokes a certain Noam Chomsky sentence, this paper details the use of an "imagination-based MT model." ImagiT synthesizes visual representations based on source text rather than relying on annotated images as input, which reportedly improved translation quality.

TED: Triple Supervision Decouples End-to-End Speech-to-Text Translation – Traditional cascaded speech translation systems are slow and can introduce content errors. The TED framework, designed to imitate how humans process audio information, aims to avoid these issues by using separately trained subsystems for automatic speech recognition and for MT.

SDST: Successive Decoding for Speech-to-Text Translation – In response to the open question of whether end-to-end or cascaded models are stronger, the authors suggested that their framework offers the best of both worlds, and they stated plans to make their model and code publicly available.

fairseq S2T: Fast Speech-to-Text Modeling with fairseq – Facebook's fairseq S2T extension provides end-to-end speech recognition and STT translation. Documentation and examples are available on GitHub.

KoBE: Knowledge-Based Machine Translation Evaluation – Seeking a method for evaluating MT without reference translations, researchers created and released a large-scale knowledge base covering 18 language pairs. The authors described their process as language-pair agnostic, noted that synonyms in MT output should not be penalized, and expressed interest in scaling the method to much larger or domain-specific datasets.

Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages – Low-resource languages often lack the large amounts of relevant parallel data and high-quality monolingual data necessary for state-of-the-art MT results. A three-stage training plan that incorporates synthetic data and uses higher-resource languages as pivots outperformed all unsupervised baselines and surpassed a variety of WMT submissions.

Inference Strategies for Machine Translation With Conditional Masking – What is the best inference strategy for a trained conditional masked language model? Researchers found that disallowing the re-masking of previously unmasked tokens resulted in "favorable quality-to-speed trade-offs." (A sketch of this strategy closes out this roundup.)

Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task – As a contribution to the WMT 2020 Metrics Shared Task, the main benchmark for automatic MT evaluation, the previously published BLEURT metric was extended beyond English to evaluate 14 language pairs with fine-tuning data available, plus four zero-shot languages.

Token-Level Adaptive Training for Neural Machine Translation – NMT faces varying degrees of learning difficulty across tokens, depending on how frequently the corresponding words appear in natural language. Assigning larger weights to meaningful low-frequency words during training yielded consistent improvements in translation quality for Chinese to English, English to Romanian, and English to German.
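As a rough illustration of that last idea, the sketch below scales each target token's cross-entropy by an inverse-frequency weight, so rare tokens count for more. The weighting function (a floored log of inverse frequency) is a simplification rather than the paper's actual scheme, and token_counts is an assumed precomputed corpus statistic.

```python
import torch
import torch.nn.functional as F

def token_adaptive_loss(logits, targets, token_counts, pad_id):
    # token_counts: float tensor (vocab,) of corpus frequencies, assumed precomputed
    total = token_counts.sum()
    # Floored log inverse frequency: rare tokens get weights above 1.0
    weights = torch.log(total / token_counts.clamp(min=1.0)).clamp(min=1.0)
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none"
    )  # (batch, seq_len)
    mask = (targets != pad_id).float()
    w = weights[targets] * mask  # one weight per target position
    return (token_loss * w).sum() / w.sum().clamp(min=1.0)
```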
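Finally, the promised sketch of the conditional-masking inference strategy. In mask-predict-style decoding, the decoder starts from a fully masked target and iteratively commits its most confident predictions; in the "no re-masking" variant the researchers found favorable, a committed token is never masked or revised again. The model interface (a callable returning per-position logits) and a known target length are assumptions made for illustration; real conditional masked LMs typically predict the target length separately.

```python
import math
import torch

@torch.no_grad()
def no_remask_decode(model, src, tgt_len, mask_id, iterations=4):
    # Assumed interface: model(src, tgt) -> logits of shape (batch, tgt_len, vocab)
    batch = src.size(0)
    tgt = torch.full((batch, tgt_len), mask_id, dtype=torch.long, device=src.device)
    committed = torch.zeros(batch, tgt_len, dtype=torch.bool, device=src.device)
    per_step = math.ceil(tgt_len / iterations)  # positions to commit per pass
    for _ in range(iterations):
        logits = model(src, tgt)
        probs, preds = logits.softmax(-1).max(-1)   # per-position confidence and argmax
        probs = probs.masked_fill(committed, -1.0)  # committed positions are never revisited
        top = probs.topk(per_step, dim=1).indices   # most confident still-masked positions
        sel = torch.zeros_like(committed)
        sel.scatter_(1, top, True)
        sel &= ~committed                           # guard: never overwrite committed tokens
        tgt[sel] = preds[sel]
        committed |= sel
    return tgt
```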

