How to Improve Automatic Machine Translation Evaluation? Add Humans, Scientists Say

2021-02-04 18:00 slator

A group of researchers has developed a leaderboard to automate the quality evaluation of natural language processing (NLP) programs, including machine translation (MT). The leaderboard, known as GENIE, was discussed in a January 17, 2021 paper on the preprint server arXiv.org. A leaderboard records automatically computed evaluation metrics of NLP programs. Over time, a leaderboard can help researchers compare apples to apples by standardizing comparisons of newer NLP programs with previous state-of-the-art approaches.

Automatic evaluation of MT is notoriously challenging due to the wide range of possible correct translations. The existing metrics for measuring MT, in particular BLEU and ROUGE, fall short by diverging significantly from human evaluations; tuning MT models to maximize BLEU scores has even been linked to biased translations. More generally, as MT quality has improved and produced more nuanced differences in output, these metrics have struggled to keep pace with more sophisticated MT models (SlatorPro).

It follows, then, that academics and tech companies alike are searching for a more efficient, standardized method of human evaluation. (For example, Facebook patented a method for gathering user engagement data to rate MT in 2019.)

The researchers behind GENIE believe they are on the right path. The group comprises Daniel Khashabi, Jonathan Bragg, and Nicholas Lourie of the Allen Institute for AI (AI2); Gabriel Stanovsky from the Hebrew University of Jerusalem; Jungo Kasai from the University of Washington; and Yejin Choi, Noah A. Smith, and Daniel S. Weld, who are affiliated with both AI2 and the University of Washington.

“We must actively rethink the evaluation of AI systems and move the goalposts according to the latest developments,” Khashabi wrote on his personal website, explaining that GENIE was built to present “more comprehensive challenges for our latest technology.”

GENIE is billed as offering “human-in-the-loop” evaluation, which it provides via crowdsourcing. The process begins when a researcher makes a leaderboard submission to GENIE, which then automatically crowdsources human evaluation via Amazon Mechanical Turk. Once human evaluation is complete, GENIE ranks the model relative to previous submissions. Users can view and compare models’ performance either in a task-specific leaderboard or in a meta leaderboard that summarizes statistics from the individual leaderboards. In addition to MT, there are currently three other task-specific leaderboards: question answering, commonsense reasoning, and summarization. The authors encourage researchers and developers to submit new text generation models for evaluation.

According to a VentureBeat article on GENIE, the plan is to cap submission fees at USD 100, with initial submissions paid by academic groups. After that, other options may come into play, such as a sliding scale whereby payments from tech companies help subsidize the cost for smaller organizations. “Even upon any potential updates to the cost model, our effort will be to keep the entry barrier as minimal as possible, particularly to those submissions coming from academia,” the authors wrote.

Of course, GENIE has a ways to go before becoming ubiquitous in NLP. The authors acknowledge that their system will require “substantial effort in training annotators and designing crowdsourcing interfaces,” not to mention the costs associated with each. Procedures for quality assurance of human evaluation have also yet to be finalized.
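To make the metric problem described above concrete, the following is a minimal sketch, not drawn from the GENIE paper, of how BLEU can undervalue a perfectly acceptable translation when only a single reference is available. It assumes the open-source sacrebleu package, and the example sentences are invented for illustration:

```python
# Minimal illustration (not from the GENIE paper) of how BLEU can diverge
# from human judgment when only one reference translation is available.
# Assumes the third-party sacrebleu package: pip install sacrebleu
import sacrebleu

reference = ["The children were playing in the park all afternoon."]

# Hypothesis 1: wording close to the reference.
literal = "The children were playing in the park all afternoon."
# Hypothesis 2: an equally acceptable paraphrase a human would rate highly.
paraphrase = "The kids spent the whole afternoon playing at the park."

print(sacrebleu.sentence_bleu(literal, reference).score)     # near 100
print(sacrebleu.sentence_bleu(paraphrase, reference).score)  # much lower, despite being correct
```

A human evaluator would likely rate both hypotheses as adequate; that gap between n-gram overlap and human judgment is exactly what GENIE’s crowdsourced evaluation is meant to capture.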
On the question of quality assurance in particular, the researchers note that human evaluations are “inevitably noisy,” so studying the variability in human evaluations is a must. Another concern is the reproducibility of human annotations over time and across individuals. The authors suggest estimating annotator variance and spreading annotations over several days to make human annotations more reproducible (a rough sketch of such a variance estimate appears below).

Besides standardizing high-quality human evaluation of NLP systems, GENIE aims to free up model developers’ time: instead of designing and running evaluation programs, they can focus on what they do best. As a “central, updating hub,” GENIE is meant to facilitate an easy submission process, with the ultimate goal of encouraging researchers to report their findings.
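As a rough illustration of the kind of variance estimate the authors allude to, the sketch below decomposes crowd ratings into per-annotator bias and residual noise using plain NumPy. The 1–5 adequacy scores are invented, and this is not the authors’ actual procedure:

```python
# Hypothetical illustration of estimating annotator variance from crowd ratings.
# The ratings are invented 1-5 adequacy scores, not data from the GENIE paper.
import numpy as np

# rows = translated segments, columns = annotators
ratings = np.array([
    [4, 5, 3],
    [2, 3, 2],
    [5, 5, 4],
    [3, 4, 2],
], dtype=float)

grand_mean = ratings.mean()

# Systematic per-annotator bias: how far each annotator's average sits from the grand mean.
annotator_bias = ratings.mean(axis=0) - grand_mean

# Residual noise once item difficulty and annotator bias are removed
# (a simple two-way additive model; the paper may use a different estimator).
item_effect = ratings.mean(axis=1, keepdims=True) - grand_mean
residual = ratings - grand_mean - item_effect - annotator_bias
noise_std = residual.std(ddof=1)

print("annotator bias:", annotator_bias)  # e.g. one rater consistently harsher
print("residual noise std:", noise_std)   # the "inevitably noisy" component
```

Spreading annotation batches over several days, as the authors suggest, would then let a leaderboard check whether these bias and noise estimates hold up over time and across individuals.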