How to Improve Automatic MT Quality Evaluation Metrics


2019-12-14 16:23 TAUS



The MT Evaluation Dilemma

The translation industry is adopting machine translation (MT) in increasing numbers. Yet a prerequisite for efficient adoption, the evaluation of MT output quality, remains a major challenge for all. Each year, many research publications investigate novel approaches that seek to calculate quality automatically. The handful of techniques that have entered the industry over the years are commonly thought to be of limited use. Human evaluation is relatively expensive, time-consuming and prone to subjectivity. However, when done well, human evaluation is still felt to be more trustworthy than automated metrics. There are no specific best practices for undertaking MT quality evaluation, and no reliable benchmarking data is yet available to enable cross-industry comparisons of users' performance. Furthermore, there is little open sharing of learning between industry and the research community.

Simple, Cheap and Fast...but not very Accurate

Automated metrics often assume a single correct output, because only on rare occasions are resources available to produce a handful of alternatives. The most commonly used metrics range from word error rate or edit distance computation to a myriad of string similarity comparisons, the latter including the famous BLEU and METEOR variants. These metrics are simple, cheap and fast, but they are not very accurate. More importantly, they are rarely connected to the notions of quality that are relevant for the intended use of such translations. For a long time, the research community has been using such metrics on standard, often artificial, datasets for which human translations are available. BLEU is by far the most popular option, despite having a number of well-known limitations, such as its low correlation with human judgements at the sentence level and its inability to handle similarity between synonyms. Only rarely are translations also assessed manually to verify whether improvements according to such metrics are indeed observed.

Application-Oriented Metrics

Metrics used for quality assessment of MT tend to be more application-oriented, commonly using information derived from the post-editing of automatic translations, such as the edit distance between the MT system output and its post-edited version, or post-editing time. They are applied to relevant datasets, as opposed to artificial datasets for which reference translations are available. Overall, metrics based on post-edited machine translations provide a good proxy for human judgements of quality, but in practice they are very limited: they cannot be used while the system is being developed, especially for the optimization of parameters in statistical approaches, where millions of sentences need to be scored quickly and multiple times as the algorithm iterates over possible parameter values. In addition, post-editing is only one of the possible ways in which automatic translations can be exploited. The use of raw MT for assimilation or gisting is becoming more popular, and using post-editing as a quality metric is therefore not always appropriate.
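Before moving on, it helps to make both families of metric concrete. The first sketch below is a toy illustration, not the official BLEU or WER implementation (in practice you would use a library such as sacreBLEU); the helper names and example sentences are invented. It scores a hypothesis against a single reference with a word-level edit distance and a plain n-gram precision, and shows how a harmless synonym is penalized exactly like a genuine error.

```python
# Toy sketch of reference-based string metrics (not the official BLEU/WER
# implementations; real BLEU uses clipped 4-gram precision and a brevity
# penalty). Helper names and sentences are invented for illustration.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)]

def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also appear in the reference
    (no clipping, unlike real BLEU)."""
    hyp_ngrams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
    ref_ngrams = [tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)]
    if not hyp_ngrams:
        return 0.0
    return sum(g in ref_ngrams for g in hyp_ngrams) / len(hyp_ngrams)

reference = "the customer bought a new laptop".split()
hypothesis = "the customer purchased a new laptop".split()

print(edit_distance(hypothesis, reference))       # 1 edit
print(ngram_precision(hypothesis, reference, 1))  # 5/6 unigrams match
# "purchased" is a perfectly good synonym of "bought", yet both scores
# penalize it exactly as they would a real mistranslation.
```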
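The second sketch approximates the post-editing-based scores just described, in the spirit of HTER or post-edit distance: the number of word-level edits needed to turn the raw MT output into its post-edited version, normalized by the length of the post-edit. It reuses the edit_distance() helper defined in the previous sketch, and the example sentences are again invented.

```python
# A minimal HTER-style sketch: how far is the raw MT output from the version
# a human post-editor actually shipped? Assumes the edit_distance() helper
# from the previous sketch is in scope.

def post_edit_distance(mt_output, post_edit):
    """Word-level edits needed to turn the MT output into its post-edit,
    normalized by the length of the post-edited sentence."""
    mt_tokens = mt_output.split()
    pe_tokens = post_edit.split()
    return edit_distance(mt_tokens, pe_tokens) / max(len(pe_tokens), 1)

mt = "the contract must signed before friday"
pe = "the contract must be signed by friday"
print(round(post_edit_distance(mt, pe), 2))  # 2 edits / 7 tokens = 0.29
```

Scores like this track post-editing effort well, but they only exist after a human has already post-edited the segment, which is precisely the limitation the next section addresses.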
Quality Estimation

The divergence between metrics used for MT system development and metrics used during production is far from ideal: MT systems should be developed and optimized against metrics that reflect real production needs. To bridge this gap, more advanced metrics are needed: metrics that take the quality requirements at hand into account, but are still cheap and fast to run. Such metrics can be useful for developing or improving MT systems, and in production, in cases where manual evaluation is not feasible or needs to be minimized. These are “trained” metrics, that is, metrics that rely on relevant data at design time as a way of learning how to address specific quality requirements. Once trained, however, they can be applied to new data with the same quality requirements, without the need for human translations. These metrics are commonly known as “quality estimation” metrics. Significant research on quality estimation metrics has been done in recent years: a general framework to build such metrics is available and can be customized to specific language pairs, text types and domains, and quality requirements. However, these metrics have only been tested in very narrow scenarios, for a couple of language pairs and datasets commonly used by the MT research community.

Data Scarcity...NOT!

Work in this area has been held back by the lack of availability of relevant data to train metrics. Relevant data consists of a fairly small number (1,000+) of examples, each pairing a source segment with its translation (preferably at the sentence level), for which a quality assessment has already been performed. This quality assessment can take various forms: post-editing (the actual post-edited translations or statistics from the process, such as time measurements, logged edits or edit distance), accuracy/fluency judgements, error counts, Likert scores, etc. This type of data is often abundant among providers and buyers of automatic translations, since they routinely need to assess translations for quality assurance. Research on better, reference-free automatic evaluation metrics would therefore greatly benefit from a closer relationship between industry and academia.

Benefits

At a first stage, data of the type mentioned above, provided by the industry, could be used to train a number of variants of quality estimation metrics using the existing framework. Industry collaborators could validate these metrics, for example by direct comparison of their scores against those given by humans, or by using them to select relevant data samples to be manually assessed (e.g. the cases estimated to have the lowest quality). Feedback to researchers on the quality of the metrics and on how they need to be further adapted to particular scenarios could also result in further improvements. The benefits for the industry include better automatic metrics to support or minimize the need for human assessment, and potentially better MT systems.
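To make the “trained metric” idea concrete, here is a deliberately minimal quality-estimation sketch. It is not the general framework referred to above (toolkits such as QuEst++ or OpenKiwi implement the idea properly): it simply fits a regressor from a handful of hand-picked source/MT features to an observed post-edit distance, so that new translations can then be scored without any reference. The feature set, the scikit-learn model choice and the tiny training examples are all illustrative assumptions.

```python
# A deliberately tiny quality-estimation sketch: learn to predict a human-
# derived quality label (here, post-edit distance) from features of the
# source and MT output alone, so new translations can be scored without a
# reference. Real QE systems use far richer features or neural models;
# everything below is illustrative only.
from sklearn.ensemble import RandomForestRegressor

def features(source, mt):
    """A handful of toy sentence-level features."""
    src, hyp = source.split(), mt.split()
    return [
        len(src),                          # source length
        len(hyp),                          # MT length
        len(hyp) / max(len(src), 1),       # length ratio
        sum(t.isdigit() for t in hyp),     # numbers often signal risk
        len(set(hyp)) / max(len(hyp), 1),  # type/token ratio
    ]

# Training data: (source, MT output, observed post-edit distance).
# In practice this would come from production QA records, 1,000+ examples.
train = [
    ("Das Haus ist groß.", "The house is big.", 0.05),
    ("Bitte senden Sie die Rechnung.", "Please send invoice the.", 0.40),
    ("Der Vertrag endet im März.", "The contract ends in March.", 0.10),
    ("Zahlung innerhalb von 30 Tagen.", "Payment inside of 30 day.", 0.35),
]

X = [features(src, mt) for src, mt, _ in train]
y = [label for _, _, label in train]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Once trained, the metric needs no reference translation at all:
new_src, new_mt = "Die Lieferung verzögert sich.", "The delivery is delayed."
print(model.predict([features(new_src, new_mt)])[0])
```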
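Finally, the validation loop sketched in the Benefits section, comparing metric scores with human judgements and routing the segments estimated to be worst to manual review, can be prototyped in a few lines. The segment IDs and scores below are placeholders; scipy is used only for the correlation coefficients.

```python
# Sketch of the two validation steps described above: (1) correlate the
# metric's scores with human judgements, (2) send only the segments
# estimated to be worst to manual review. All values are placeholders.
from scipy.stats import pearsonr, spearmanr

segments = ["seg-01", "seg-02", "seg-03", "seg-04", "seg-05", "seg-06"]
metric_scores = [0.12, 0.55, 0.20, 0.71, 0.08, 0.43]  # estimated post-edit distance (higher = worse)
human_scores  = [0.10, 0.48, 0.30, 0.65, 0.05, 0.52]  # observed post-edit distance

# 1) How well does the automatic metric track the human assessment?
print("Pearson r:", pearsonr(metric_scores, human_scores)[0])
print("Spearman rho:", spearmanr(metric_scores, human_scores)[0])

# 2) Flag the k segments estimated to have the lowest quality.
k = 2
worst = sorted(zip(segments, metric_scores), key=lambda p: p[1], reverse=True)[:k]
print("flag for manual assessment:", [seg for seg, _ in worst])
```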

