Quality estimation (QE) for machine translation (MT) is less accurate than commonly thought, according to a paper published by Facebook researchers on July 6, 2020.
Among its many useful applications, QE can be trained to automatically identify and filter out bad translations, reducing costs and human post-editing effort. End-users who cannot read a source language can also use QE as a feedback mechanism. In July 2019, researchers at Unbabel published a paper on post-editing styles and said they were looking to apply their work to, among other things, quality estimation.
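To illustrate the filtering use case, here is a minimal sketch in Python, assuming a hypothetical qe_score(source, translation) function that returns a score in [0, 1]; it is not the API of any particular QE system, only an example of how a score could route translations to publication or to human post-editing.

```python
# Minimal sketch of QE-based filtering. qe_score is a hypothetical
# scoring function, not the interface of any specific QE system.

def route_translations(pairs, qe_score, threshold=0.6):
    """Split (source, translation) pairs into publishable output and
    items flagged for human post-editing, based on a QE score in [0, 1]."""
    publish, post_edit = [], []
    for source, translation in pairs:
        score = qe_score(source, translation)
        (publish if score >= threshold else post_edit).append(
            (source, translation, score)
        )
    return publish, post_edit


if __name__ == "__main__":
    # Toy stand-in scorer that simply favours length-matched outputs.
    def toy_qe_score(src, tgt):
        return min(len(src), len(tgt)) / max(len(src), len(tgt))

    pairs = [("Das Haus ist rot.", "The house is red."),
             ("Wie geht es dir?", "How?")]
    ok, flagged = route_translations(pairs, toy_qe_score, threshold=0.6)
    print(len(ok), "to publish;", len(flagged), "to post-edit")
```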
The paper, “Are we Estimating or Guesstimating Translation Quality?,” is based on research carried out during PhD student Shuo Sun’s internship with Facebook over the summer of 2019. Francisco Guzman, a research scientist manager with Facebook’s Language and Translation Technologies (LATTE) group, supervised the research along with Imperial College London professor and fellow LATTE research collaborator Lucia Specia.
The authors identified three main reasons why QE performance is overstated: (1) an imbalance between high- and low-quality instances; (2) limited lexical variety in test sets; and (3) a lack of robustness to partial input.
They believe these issues with QE datasets lead to “guesstimates” of translation quality rather than estimates. The findings came as a surprise to Sun, whose project originally aimed to examine whether multi-task learning could be used to train better neural QE models.
“I ‘accidentally’ discovered the problems because of a bug in my code,” Sun told Slator. “My QE neural models performed well when they were not properly ingesting source sentences.”
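The bug effectively acted as a partial-input test: if a QE model correlates almost as well with human judgments when the source is withheld, it is likely relying on target-side cues rather than genuinely estimating translation quality. Below is a minimal sketch of such a check, assuming a hypothetical qe_score(source, translation) scorer and a list of human quality labels; it is not code from the paper.

```python
# Partial-input sanity check for a QE model (sketch; qe_score is a
# hypothetical scoring function, not taken from the paper's codebase).
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def partial_input_gap(qe_score, sources, translations, human_scores):
    """Compare correlation with human scores for full input vs. a masked
    source. A small gap suggests the model is ignoring the source side."""
    full = [qe_score(s, t) for s, t in zip(sources, translations)]
    masked = [qe_score("", t) for t in translations]  # source withheld
    return pearson(full, human_scores), pearson(masked, human_scores)
```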
The research team found that QE datasets tended to be unbalanced, often excluding translated sentences with low quality scores. As a result, most of the translated sentences required little to no post-editing.
“This defeats the purpose of QE, especially when the objective of QE is to identify unsatisfactory translations,” the authors wrote. To combat this imbalance, the researchers recommended purposefully designing datasets to include varying levels of sentence quality.
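A quick diagnostic for this kind of imbalance is to bucket the quality labels of an existing QE dataset and check how much of the mass sits at the “near-perfect” end. A minimal sketch, assuming HTER-style labels in [0, 1] where 0 means no post-editing was needed (the example numbers are invented for illustration):

```python
# Sketch of a label-balance check for a QE dataset. Assumes HTER-style
# labels in [0, 1]: 0 = no edits needed, higher = more post-editing.
from collections import Counter


def label_histogram(scores, num_bins=10):
    """Bucket scores into equal-width bins and return bin index -> count."""
    hist = Counter()
    for s in scores:
        hist[min(int(s * num_bins), num_bins - 1)] += 1
    return hist


def near_perfect_fraction(scores, cutoff=0.1):
    """Fraction of segments that needed little to no post-editing."""
    return sum(1 for s in scores if s <= cutoff) / len(scores)


# Toy example of a heavily skewed dataset (invented numbers).
scores = [0.0] * 70 + [0.05] * 15 + [0.3] * 10 + [0.8] * 5
print(label_histogram(scores))
print(f"{near_perfect_fraction(scores):.0%} need little or no editing")
```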
Lexical artifacts (i.e., a lack of diversity across labels, sentences, and vocabulary) can also inflate a QE system’s performance because of the repetitive nature of the content. Sampling source sentences from various documents across multiple domains can provide a more diverse range of material.
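One way to spot such artifacts is to measure lexical diversity on the source side, for example the type-token ratio and the average vocabulary overlap between sentences. The sketch below shows two such diagnostics; the article does not specify what values should count as “too repetitive,” so any thresholds would be the practitioner’s choice.

```python
# Sketch of simple lexical-diversity diagnostics for a QE test set.
from itertools import combinations


def type_token_ratio(sentences):
    """Distinct tokens divided by total tokens; lower = more repetitive."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    return len(set(tokens)) / len(tokens)


def mean_pairwise_overlap(sentences, sample=200):
    """Average Jaccard overlap between sentence vocabularies
    (computed on a sample to keep the pairwise comparison cheap)."""
    vocabs = [set(s.lower().split()) for s in sentences[:sample]]
    overlaps = [len(a & b) / len(a | b) for a, b in combinations(vocabs, 2)]
    return sum(overlaps) / len(overlaps)
```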
The authors also suggested “using a metric that intrinsically represents both fluency and adequacy as labels” when designing and annotating QE datasets.
Building on their own recommendations, the researchers created a new QE dataset, called MLQE. They focused on six language pairs: two high-resource pairs (English–German and English–Chinese); two medium-resource pairs (Romanian–English and Estonian–English); and two low-resource pairs (Sinhala–English and Nepali–English).
For each language pair, 10,000 sentences were extracted from Wikipedia articles on a range of topics to prevent lexical artifacts. These sentences were then translated by state-of-the-art neural models. Finally, the translations were manually annotated using direct assessment (DA) to mitigate sampling bias and the lack of balance between high- and low-quality translations.
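Direct assessment gives each translation an adequacy-style score from several annotators. A common way to aggregate such judgments, in the WMT style, is to z-normalize each annotator’s scores and then average per segment; the exact MLQE protocol may differ, so the sketch below is illustrative rather than a reproduction of the dataset’s pipeline.

```python
# Sketch of aggregating direct-assessment (DA) annotations: z-normalize
# per annotator, then average per segment. Illustrative of common
# WMT-style DA processing; the exact MLQE protocol may differ.
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_da(ratings):
    """ratings: list of (annotator_id, segment_id, raw_score) tuples.
    Returns a dict: segment_id -> mean z-normalized score."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Per-annotator mean and standard deviation (guard against zero spread).
    stats = {a: (mean(s), pstdev(s) or 1.0) for a, s in by_annotator.items()}

    by_segment = defaultdict(list)
    for annotator, segment, score in ratings:
        mu, sigma = stats[annotator]
        by_segment[segment].append((score - mu) / sigma)

    return {seg: mean(zs) for seg, zs in by_segment.items()}
```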
“We decided to build an improved QE dataset for the research community the moment we discovered the issues with current QE datasets,” Sun said. MLQE is now available on GitHub and is currently being used for the WMT 2020 shared task on QE.
Sun said neural QE models seem to perform better on mid- and low-resource language directions than on high-resource language directions.
Sun’s next plans include studying QE in zero-shot and few-shot cross-lingual transfer settings, and experimenting with multilingual QE models that can simultaneously handle multiple language directions.