Machine Translation: Customization Evaluation in Spotlight

2021-01-14 02:25 Nimdzi Insights

In December 2020, Nimdzi was given an opportunity to test a brand new product, Spotlight. It is developed by Intento to support machine translation (MT) curation, enabling quick analysis of MT training results. The product is intended mainly for those who train custom MT models and thus regularly face the task of evaluating MT quality.

The machine translation evaluation process and what Spotlight means for it

The usual evaluation methods involve random sampling and costly human review (which runs the risk of returning different results for the same samples), and the evaluation oftentimes happens after the trained model is already in production (alas!). There is also usually no easy way to tell whether a model can be improved further, or to find examples of improved and degraded segments in the text. All of this can make the MT trainers' and evaluators' job onerous and daunting. Not to mention that the evaluation sometimes occurs after it is actually needed, with the end users of the resulting MT output wondering what the evaluators do in the shadows.

Intento's Spotlight is designed to shed some light on this subject and dispel the gloom. Our initial impression is that this tool offers a useful and quick way to evaluate MT training results by spotlighting the segments that really need to be reviewed.

In the spotlight

Spotlight is a cloud solution available on demand from the Intento Console. The user interface (UI) is lean, and the wizard that walks you through creating an evaluation is straightforward.

Test set description

We tried the new product on COVID-related corpora by TAUS, taken from Intento's research on the best MT engines for this domain. The dataset pitted Google Cloud Advanced Translation API (stock) against Google Cloud Advanced Translation API (custom), translating from English to Russian.

Spotlight applies a "less is more" principle to dataset size: it uses only the first 2,000 segments from the evaluation files, which is considered the optimal size for a sufficiently accurate evaluation.

How does it work?

Spotlight currently scores segments with hLEPOR. In addition, BERT score support is coming soon, and two more metrics, TER and BLEU, are on Intento's roadmap.

Quick evaluation overview

In our small experiment, Spotlight reported a higher overall hLEPOR score of 0.61 for the custom Google Cloud Advanced Translation API, compared to 0.58 for the stock engine.

After getting a quick overview of the evaluation, a reviewer can proceed to a detailed analysis of the segments, e.g., the degraded ones appearing below the line, or check the improved ones. In the process of such a review, a reviewer is able to:

- comment on a segment, for example when the reference translation is wrong or both MT versions are correct
- mark a segment for further checking
- add a spotted issue type (omission, mistranslation, untranslated text, terminology, paraphrases, other)
- download an export of the evaluation as an Excel file

This lightweight review approach helps deliver evaluation results faster by catching and addressing only the issues that need to be improved. Depending on the results of a Spotlight evaluation, users may want to retrain the custom MT engine or flag the particular issues to the post-editors. The reviewed data (already corrected and "annotated") can also be used to retrain the MT model. The two sketches below illustrate, outside of Spotlight, what such a segment-level comparison and an annotated review export might look like.
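As a rough illustration of the segment-level comparison described above, here is a minimal Python sketch. Spotlight scores with hLEPOR; since we cannot vouch for a particular hLEPOR package, chrF from the widely available sacrebleu library stands in as the sentence-level metric. The file names and the one-segment-per-line parallel layout are assumptions made for the example, not anything prescribed by Spotlight.

```python
# Minimal sketch of the segment-level comparison Spotlight automates.
# chrF (sacrebleu) stands in for Spotlight's hLEPOR metric.
from sacrebleu.metrics import CHRF

MAX_SEGMENTS = 2000  # Spotlight's "less is more" cap on evaluation size

def read_segments(path: str) -> list[str]:
    # Assumed layout: one segment per line, files aligned line-by-line.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f][:MAX_SEGMENTS]

reference = read_segments("reference.ru.txt")  # human reference translations
stock = read_segments("stock.ru.txt")          # stock engine output
custom = read_segments("custom.ru.txt")        # custom engine output

chrf = CHRF()

def score(hyp: str, ref: str) -> float:
    return chrf.sentence_score(hyp, [ref]).score

rows = []
for i, (ref, hyp_stock, hyp_custom) in enumerate(zip(reference, stock, custom)):
    s_stock, s_custom = score(hyp_stock, ref), score(hyp_custom, ref)
    rows.append((i, s_stock, s_custom, s_custom - s_stock))

# Corpus-level averages give the "quick overview" numbers...
def avg(xs: list[float]) -> float:
    return sum(xs) / len(xs)

print(f"stock:  {avg([r[1] for r in rows]):.2f}")
print(f"custom: {avg([r[2] for r in rows]):.2f}")

# ...while sorting by the per-segment delta spotlights degraded segments
# (custom worse than stock) that deserve human review first.
degraded = sorted((r for r in rows if r[3] < 0), key=lambda r: r[3])
for i, s_stock, s_custom, delta in degraded[:10]:
    print(f"segment {i}: stock={s_stock:.1f} custom={s_custom:.1f} delta={delta:.1f}")
```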
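And a hedged sketch of the kind of review record the list above describes: a comment, a flag for further checking, an issue type drawn from the article's list, and an Excel export at the end. The schema, helper names, and sample data are hypothetical, not Spotlight's actual export format; pandas needs openpyxl installed to write .xlsx files.

```python
# Hypothetical review record for a Spotlight-style lightweight review.
from dataclasses import dataclass, asdict

import pandas as pd

# Issue types as listed in the article.
ISSUE_TYPES = {"omission", "mistranslation", "untranslated text",
               "terminology", "paraphrases", "other"}

@dataclass
class ReviewRecord:
    segment_id: int
    source: str
    reference: str
    mt_output: str
    comment: str = ""          # e.g. "reference translation is wrong"
    needs_check: bool = False  # marked for further checking
    issue_type: str = ""       # one of ISSUE_TYPES, empty if no issue spotted

    def __post_init__(self) -> None:
        if self.issue_type and self.issue_type not in ISSUE_TYPES:
            raise ValueError(f"unknown issue type: {self.issue_type}")

# Invented sample annotations, purely for illustration.
records = [
    ReviewRecord(0, "Wash your hands.", "Мойте руки.", "Мойте руки.",
                 comment="both MT versions are correct"),
    ReviewRecord(1, "Wear a mask.", "Носите маску.", "Маску.",
                 needs_check=True, issue_type="omission"),
]

# Export the annotated review to an Excel file, mirroring Spotlight's
# download option; the corrected rows could later feed MT retraining.
pd.DataFrame([asdict(r) for r in records]).to_excel("evaluation.xlsx",
                                                    index=False)
```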
Summary

An overview of the segment-level hLEPOR scores helps you understand the current state of an MT customization and saves time by allowing a focused review instead of a full-scope evaluation. Spotlight can definitely save evaluators time and money. It also enables linguistic teams to gain a quick yet sufficient understanding of the customization results before rolling a particular MT engine out to production. This can spare post-editors effort and nerves, especially if something is wrong and the engine needs retraining.

According to the development roadmap mentioned at the presentation of Spotlight in November 2020 (the launch page offers a virtual demo of Spotlight and a slide deck from that event), this is just one of the tools in Intento's MT Studio product. The new toolkit for end-to-end MT curation will include options for data cleaning, training, and evaluation of multiple MT models, which may be even more interesting to a broader audience.

Source: Intento

Being a software company, Intento leaves the task of trying the new service and actually training the engines to language service providers (LSPs). However, Intento does use Spotlight internally, saving its analytics team hours of precious time. Yes, that is correct: even with such agile automation, the human stays in the loop to curate the MT training, evaluation, and fine-tuning process, and adjust it where needed.