Yes, Now They Claim Machines ‘Outperform’ Human Translation in Adequacy

2020-09-10 18:50 slator

After machine translation became neural machine translation around 2016, academia and big tech research groups began publishing papers implying machine translation had reached human-level quality (whatever that means). Simplified headlines in tech publications followed, researchers argued journalists cherry-picked quotes, and Slator tried to make sense of it all by asking the experts.

“Achieving human-level translation” is so 2018, though. In 2020, it has become “outperforming human-level translation.” That claim was made in a paper published on September 1, 2020, about CUBBITT, a new Transformer-based deep-learning system. The authors include Google’s Łukasz Kaiser, Jakob Uszkoreit of Google Brain Berlin, and Ondřej Bojar at Charles University in Prague.

Okay, let’s start with the caveats. The paper’s title only mentions reaching translation quality “comparable” to human professionals in the domain of “news” translation. And the claim of outperforming humans is reserved for the “adequacy” metric. Still, the claim of “outperforming humans” on any metric other than speed seems new.

Furthermore, the authors conceded that “highly qualified human translators with [an] infinite amount of time and resources will likely produce better translations than any MT system.”

They added, however, that “many clients cannot afford the costs of such translators and instead use services of professional translation agencies, where the translators are under certain time pressure. Our results show that the quality of professional-agency translations is not unreachable by MT, at least in certain aspects, domains, and languages.” An interesting take on the professionalism of linguists who earn their living from translation.

So how did CUBBITT’s supposed outperformance come about?
The study defined adequacy as “adequately expressing [the source text’s] intended meaning in the target language.” The assertion that CUBBITT outperformed human translation, therefore, means that human evaluators rated CUBBITT’s translations as representing the source text’s meaning better than the human reference translations: 52% of CUBBITT’s sentences scored higher than the human translations; 26% of CUBBITT translations were scored lower than human translations.

Using the same source documents and translations from CUBBITT’s winning performance on the WMT18 news translation task, 15 human evaluators rated the quality of almost 8,000 sentences across 53 documents. Unlike the news translation task, however, evaluators were provided document-level context for the translations. This allowed evaluators to catch errors that might not have been evident without context, such as a gender mismatch or the incorrect translation of an ambiguous expression.

Compared to the human reference translations, the authors observed that “CUBBITT made significantly fewer errors in addition of meaning, omission of meaning, shift of meaning, other adequacy errors, grammar, and spelling.” On the other hand, CUBBITT made significantly more errors due to cross-sentence context (as the researchers anticipated), and human translation was still rated as more fluent.

The group also conducted a “sentence-level translation Turing test” by showing evaluators 100 pairs of sentences, each consisting of a source sentence and a translation. Participants then identified each translation as produced by either a human or by MT. CUBBITT translations were less likely to be identified as MT than translations produced by Google Translate.

“One potential contributor to human-likeness of CUBBITT could be the fact that it is capable of restructuring translated sentences where the English structure would sound unnatural in Czech,” the authors posited, crediting CUBBITT’s training on back-translation data.
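As a rough illustration of how figures like the 52% and 26% adequacy shares above are read, the sketch below tallies paired sentence scores into "MT higher", "MT lower", and "tie". The scores are invented toy values, the function names are hypothetical, and the sign test is included only as one common choice for paired comparisons; it is an assumption here, not a description of the authors' statistical method.

```python
from collections import Counter
from math import comb


def pairwise_outcomes(mt_scores, human_scores):
    """Share of sentences where MT scored higher, lower, or equal to the human translation."""
    if len(mt_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    counts = Counter()
    for mt, human in zip(mt_scores, human_scores):
        if mt > human:
            counts["mt_higher"] += 1
        elif mt < human:
            counts["mt_lower"] += 1
        else:
            counts["tie"] += 1
    total = len(mt_scores)
    return {key: counts[key] / total for key in ("mt_higher", "mt_lower", "tie")}


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test on the non-tied pairs (assumed test, for illustration)."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy adequacy scores for eight sentences (invented, not from the paper).
mt_scores = [8, 7, 9, 6, 5, 7, 8, 9]
human_scores = [7, 7, 8, 7, 5, 6, 7, 8]

shares = pairwise_outcomes(mt_scores, human_scores)
p_value = sign_test_p_value(wins=5, losses=1)
print(shares)   # {'mt_higher': 0.625, 'mt_lower': 0.125, 'tie': 0.25}
print(p_value)  # 0.21875
```

Note that "higher" and "lower" shares need not sum to 100%: in the toy data, as in the reported 52% and 26%, the remainder are pairs with no recorded preference either way.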
To overcome the lack of English–Czech parallel data for training, the researchers used back-translation, translating more widely available monolingual target language data into the source language. The resulting sentence pairs comprise additional synthetic parallel training data, which are traditionally mixed together with authentic sentences in random order.

CUBBITT is “trained with back-translation data in a novel block regime (block-BT), where the training data are presented to the neural network in blocks of authentic parallel data alternated with blocks of synthetic data.”

The authors noted that back-translation can sometimes have the inadvertent benefit of improving the fluency (and sometimes adequacy) of the final translations, “since the target side in back-translation are authentic sentences originally written in the target language.”

The English–French and English–Polish versions of CUBBITT attained BLEU results consistent with those of the English–Czech version. Document-level evaluations suggest that CUBBITT performs best on articles related to business and politics, and performs the worst on articles about art, entertainment, and sports.
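The difference between the traditional mixed ordering and the block regime described above can be sketched in a few lines. This is a minimal illustration of data ordering only, not the authors' training code: `toy_reverse_translate` is a hypothetical stand-in for a real target-to-source MT model, the sentence strings are placeholders, and the block size of 2 is an arbitrary assumption (the article does not give the real block size).

```python
import random
from itertools import zip_longest


def toy_reverse_translate(target_sentence):
    """Hypothetical stand-in for a target-to-source MT model (e.g. Czech to English)."""
    return f"[EN back-translation of] {target_sentence}"


def make_synthetic_pairs(monolingual_target, reverse_translate):
    """Back-translation: pair each authentic target sentence with a machine-made source."""
    return [(reverse_translate(tgt), tgt, "synth") for tgt in monolingual_target]


def chunk(items, block_size):
    """Split a list into consecutive blocks of at most block_size items."""
    return [items[i:i + block_size] for i in range(0, len(items), block_size)]


def mixed_order(authentic, synthetic, seed=0):
    """Traditional regime: authentic and synthetic pairs shuffled together."""
    data = list(authentic) + list(synthetic)
    random.Random(seed).shuffle(data)
    return data


def block_order(authentic, synthetic, block_size):
    """Block regime: a block of authentic pairs, then a block of synthetic pairs, repeated."""
    ordered = []
    blocks = zip_longest(chunk(authentic, block_size),
                         chunk(synthetic, block_size),
                         fillvalue=[])
    for auth_block, synth_block in blocks:
        ordered.extend(auth_block)
        ordered.extend(synth_block)
    return ordered


# Placeholder data: (source, target, origin) triples.
authentic = [(f"en-{i}", f"cs-{i}", "auth") for i in range(4)]
synthetic = make_synthetic_pairs([f"cs-mono-{i}" for i in range(4)], toy_reverse_translate)

block_origins = [origin for _, _, origin in block_order(authentic, synthetic, block_size=2)]
print(block_origins)
# ['auth', 'auth', 'synth', 'synth', 'auth', 'auth', 'synth', 'synth']
```

In both regimes the model sees the same sentence pairs; only the order of presentation changes. Note also that the target side of every synthetic pair is an authentic target-language sentence, which is the property the authors link to improved fluency.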