How to Fix the 5 Flaws in Evaluating Machine Translation


2020-03-26 20:50 | Slator



No one would argue that machine translation quality has not improved significantly over the past three and a half years. It was back then that Google launched neural machine translation into production, (infamously) describing some of the system’s output as “nearly indistinguishable from human translation.” Experts in the field responded with mixed views, and ever since Google’s 2016 claim, many rival machine translation providers, big and small, have proclaimed similar breakthroughs.

Now a new study examines the basis of such claims, namely (as the researchers put it) that “machine translation has increased […] to the degree that it was found to be indistinguishable from professional human translation in a number of empirical investigations.” It does so by taking a closer look at the human assessments that led to those claims.

The new study, published in the peer-reviewed Journal of Artificial Intelligence Research, shows that recent findings of human parity in machine translation were due to “weaknesses” in the way humans evaluated MT output; that is, in MT evaluation protocols currently regarded as best practice. If this is true, then the industry needs to stop in its tracks and, as the researchers suggest, “revisit” these so-called best practices for evaluating MT quality.

The study is called “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation.” Published in March 2020, it was authored by Samuel Läubli (Institute of Computational Linguistics, University of Zurich); Sheila Castilho (ADAPT Centre, Dublin City University); Graham Neubig (Language Technologies Institute, Carnegie Mellon University); Rico Sennrich (Institute of Computational Linguistics, University of Zurich); Qinlan Shen (Language Technologies Institute, Carnegie Mellon University); and Antonio Toral (Center for Language and Cognition, University of Groningen).

“Machine translation (MT) has made astounding progress in recent years thanks to improvements in neural modelling,” the researchers write, “and the resulting increase in translation quality is creating new challenges for MT evaluation. Human evaluation remains the gold standard, but there are many design decisions that potentially affect the validity of such a human evaluation.”

What Läubli et al. did was to examine human evaluation studies in which neural machine translation (NMT) systems had performed at or above the level of human translators, such as a 2018 study, previously covered by Slator, which concluded that NMT had reached human parity because, using current human evaluation best practices, no significant difference between human and machine translation output was found. But in a blind qualitative analysis outlined in the new study, Läubli et al. showed that the earlier study’s MT output “contained significantly more incorrect words, omissions, mistranslated names, and word order errors” than the output of professional human translators.

Moreover, the study showed that human evaluation of MT quality depends on three factors: “the choice of raters, the availability of linguistic context, and the creation of reference translations.” In rating MT output, “professional translators showed a significant preference for human translation, while non-expert raters did not,” the researchers said, pointing out that human assessments typically rely on crowdsourced workers to minimize cost.
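Whether a preference is “significant” in this sense is typically decided with a standard statistical test over pairwise judgements. As a rough illustration only (this is not the authors’ own tooling, and the rating counts below are invented), the following Python sketch applies a two-sided sign test to hypothetical pairwise preferences collected from professional translators and from crowd workers:

```python
# A minimal sketch, not from the study itself: a two-sided sign test over
# pairwise preference judgements, the kind of test commonly used to decide
# whether raters significantly prefer one translation over another.
# All rating counts below are invented for illustration.
from scipy.stats import binomtest

def preference_sign_test(judgements):
    """judgements: iterable of 'H' (prefers human), 'M' (prefers machine), 'T' (tie)."""
    wins_human = sum(1 for j in judgements if j == "H")
    wins_machine = sum(1 for j in judgements if j == "M")
    n = wins_human + wins_machine              # ties are excluded from the test
    result = binomtest(wins_human, n, p=0.5)   # H0: raters have no preference
    return wins_human, wins_machine, result.pvalue

# Hypothetical judgements: professional translators vs. crowd workers
professional = ["H"] * 82 + ["M"] * 38 + ["T"] * 30
crowd = ["H"] * 55 + ["M"] * 52 + ["T"] * 43

for name, ratings in (("professional", professional), ("crowd", crowd)):
    h, m, p = preference_sign_test(ratings)
    print(f"{name}: human preferred in {h} of {h + m} decisive judgements, p = {p:.4f}")
```

With counts like these, the professional raters’ preference for human translation comes out as statistically significant while the crowd workers’ does not, which is the pattern the study describes.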
Professional translators would, therefore, “provide more nuanced ratings than non-experts” (i.e., amateur evaluators with undefined or self-rated proficiency), thus showing a wider gap between MT output and human translation.

Linguistic context was also crucial, the study showed, because evaluators “found human translation significantly more accurate than machine translation when evaluating full documents, but not when evaluating single sentences out of context.” While both machine translation and its evaluation have historically operated at the sentence level, the study said, “human raters do not necessarily understand the intended meaning of a sentence shown out-of-context […] which limits their ability to spot some mistranslations. Also, a sentence-level evaluation will be blind to errors related to textual cohesion and coherence.”

As for the third factor, the construction of reference translations, the researchers noted that the aforementioned 2018 study used an inconsistent source text as its reference: only half of it was originally written in the source language, while the other half had been translated from the target language into the source language. “Since translated texts are usually simpler than their original counterparts […] they should be easier to translate for MT systems. Moreover, different human translations of the same source text sometimes show considerable differences in quality, and a comparison with an MT system only makes sense if the human reference translations are of high quality,” they said.

Crucially, the new study also found that “aggressive editing of human reference translations for target language fluency can decrease adequacy to the point that they become indistinguishable from machine translation, and that raters found human translations significantly better than machine translations of original source texts, but not of source texts that were translations themselves.”

The study concludes that “machine translation quality has not yet reached the level of professional human translation, and that human evaluation methods which are currently considered best practice fail to reveal errors in the output of strong NMT systems.” It therefore behooves those who use machine translation to consider the following design changes to their MT evaluation process, in line with the study’s recommendations:

1. Choose professional translators as raters.
2. Evaluate documents, not isolated sentences.
3. Evaluate fluency in addition to adequacy.
4. Do not heavily edit reference translations for fluency.
5. Use source texts originally written in the source language.

The researchers end by saying that while their recommendations are intended to increase the validity of MT assessments, they are aware that having professional translators perform MT evaluations is expensive. They therefore welcome further studies into “alternative evaluation protocols that can demonstrate their validity at a lower cost.”
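To make the document-level and original-source recommendations more concrete, here is a minimal sketch assuming a simple in-memory corpus format of our own devising; the class, field, and function names are illustrative and not taken from the study’s released materials. It builds one rating item per full document and keeps only source texts originally written in the source language:

```python
# A minimal sketch under assumed data structures, not the study's own pipeline.
# One rating item is produced per document, so raters see full context, and
# documents whose "source" is itself a translation are filtered out.
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    doc_id: str
    source_sentences: List[str]     # full source document, in order
    human_translation: List[str]    # professional reference, same order
    mt_output: List[str]            # MT system output, same order
    source_is_original: bool        # False if the "source" is itself a translation

def build_evaluation_items(documents):
    """Yield document-level rating items instead of isolated sentences."""
    for doc in documents:
        if not doc.source_is_original:
            # Translated ("translationese") sources are simpler and tend to
            # flatter MT, so only original source texts are kept.
            continue
        yield {
            "doc_id": doc.doc_id,
            "source": "\n".join(doc.source_sentences),
            "candidate_a": "\n".join(doc.human_translation),
            "candidate_b": "\n".join(doc.mt_output),
            # Whole-document items keep omissions and cohesion errors visible,
            # which sentence-level evaluation tends to miss.
        }
```

Presenting whole documents to raters preserves the cross-sentence context that, per the study, evaluators need in order to spot omissions and coherence errors, while the source-originality filter avoids the easier translated sources that flattered MT in the 2018 parity claim.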

