Understanding BLEU Scores in Customized Machine Translation


2022-04-09 00:00 TAUS



It doesn’t happen very often nowadays, but every now and then I still find in my inbox a great example of what is becoming a relic from the past: a spam email with a cringy translation. Like everyone else, I’m certainly not too fond of spam, but the ones with horrendous translations do get my attention. The word-by-word translation is like a puzzle to me: I want to know if I can ‘reverse-translate’ it to its original phrasing.

That said, the rarity of such emails doesn’t mean that all automatic translation is now flawless. Far from it. Often, automated translations, though perfectly understandable, still have an out-of-place feeling about them, and reading a dense text on an unfamiliar subject in particular is often very tiring in the long run. It simply takes more effort to get behind the text and grasp its meaning. Automatic translation has moved from incomprehensible to, at times, merely uneasy.

Tuning training data

The big machine translation engines, such as Google Translate, Microsoft Translator, and Amazon Translate, allow customers to adjust the output of translations towards their preferred domain or even their preferred style. Customizing machine translation is thus the next step forward in making translations more in line with the contextual expectations readers might have. The idea here is that neural machine translation offers baseline translations that are good enough for generic use, but that by feeding it custom training material, it gains the extra quality that makes for a more knowledgeable translation. The training material should consist of a good amount of approved translations in a given language pair. Behind the scenes, customization is done through complete retraining of the translation model, or by readjusting the parameters on the fly, but the result is that the translation is more ‘your style’.

TAUS firmly believes in boosting engines this way. As a data company, we are eager to run experiments with different training datasets and see what the impact of training on domain data can be. There are a few steps involved in the training: selecting the domain and language pairs; selecting the right training material; and evaluating the training results. TAUS has a huge repository of language data, but as with any big text corpus, some combinations of language and domain are simply more suitable for customization than others. Based on experience, the chances of success can be estimated.

Selecting data for training is a harder job. It requires thinking about how narrow you want your domain to be, and what the quality of your data should be. As you may expect, narrowing the focus of your training material means lower applicability, but better results. Selecting the domain-relevant parts of the data is an art in itself, and one that will keep improving with the advancement of neural models.

Regarding the quality of the training data: more is not always better. High quality and consistency in the training data outperform quantity. This is true to a larger extent than you might think. It is a variation on the Anna Karenina principle: the number of ways in which things can go wrong is so much larger than the number of ways in which they can go right. That makes the lower end of the quality spectrum suffer much more than average from internal inconsistencies, so it actually pays off not to be too conservative when trimming your data. In fact, it is all about tuning and dialing. We use different metrics for the reliability of our training data. It is much like creating the best espresso with the coffee beans you’re given. Temperature, coarseness, amount: you carefully keep dialing in all the different parameters until you hit the sweet spot.
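To make the idea of trimming a little more concrete, here is a minimal sketch of the kind of simple heuristics that can weed out unreliable segment pairs from a parallel corpus: exact duplicates, empty or untranslated segments, and implausible length ratios. The function names and thresholds are illustrative assumptions only, not the actual metrics we use.

```python
# Hypothetical illustration: simple heuristics for trimming a parallel corpus.
# The checks and thresholds are examples, not an actual production pipeline.

def keep_pair(source: str, target: str, max_length_ratio: float = 2.0) -> bool:
    """Return True if a source/target segment pair passes basic quality checks."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return False                      # drop empty segments
    if source == target:
        return False                      # likely untranslated copies
    src_len, tgt_len = len(source.split()), len(target.split())
    ratio = max(src_len, tgt_len) / max(1, min(src_len, tgt_len))
    return ratio <= max_length_ratio      # drop wildly mismatched lengths

def trim_corpus(pairs):
    """Deduplicate and filter (source, target) segment pairs."""
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        if keep_pair(src, tgt):
            kept.append(key)
    return kept
```

Real pipelines typically layer domain-relevance and fluency scoring on top of such basic checks, but even filters this simple illustrate why somewhat aggressive trimming can pay off.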
Evaluating the quality

The idea of testing machine translation is simple: you create a training set with source sentences and their translations, but you keep a small portion aside with very reliable reference translations. Never train with this test set, because that would give away the right answers to the test. You only use it to try out the translation engine, both before and after customization, and then you compare the generated translations with the given reference translations.

For the estimation of translation quality, there are quite a few different methods and metrics available. Nothing beats good old human review. We should know, as we have worked hard on establishing a dynamic approach to quality evaluation with DQF. For our initial estimations, we used a small-scale human review of reference translations and generated translations. We wanted to know if our reference translations were good in the first place (yes, they were all quite good, and almost always better than the generated translations), and also whether the customized translations changed things for the better.

Beyond that initial exploration, however, human evaluation has its limits: the workload of reviewing experiments quickly becomes too large to handle at scale. That is where the need for automated evaluation comes in. If you are familiar with the debates in natural language processing, you probably know that considerable effort goes into establishing whether automatic evaluation of translation quality is in tune with the human estimation of what a good translation is. The presumption is that shared human judgment is leading here. A good metric should therefore reflect human judgment by giving good translations a high score. The problem with these types of metrics is often that they are not immediately evident or intuitive; the logic of the metric does not appear to follow its purpose. That certainly applies to the most widely used metric for machine translation, the BLEU score.

Calculating BLEU scores

Let’s get you up to speed with calculating BLEU scores. In short, the BLEU score takes already existing, perfectly good translations as the reference, and compares the output of machine translation, the candidate translation, to this reference. Eventually, this comparison is expressed as a number between 0 and 1 (often reported on a 0–100 scale), and higher numbers indicate better scores.

A method like this must somehow make up for the fact that each source segment can have several perfectly good translations. The BLEU score provides for that and allows multiple reference translations, each of which is considered equally good. But any deviation from the reference or references gets a lower score. This is where the BLEU score gets complicated. BLEU checks the words in the candidate translation, counts them, and whenever there are words in the candidate translation that are not in the reference translation, the score suffers. This is a way of calculating the precision of the translation: too much is not good.

From this, you might think that a series of words in random order that happen to appear in the reference translation as well would yield a high score, but it doesn’t work that way. Not only single words are included in the calculation, but also groups of consecutive words. The algorithm offers some leeway for variations, but typically all groups of two, three, and four consecutive words in the candidate translation are counted and compared to all groups of the same number of consecutive words in the reference translations. These groups of consecutive words are so-called n-grams, and they make sure that randomly ordering the correct words will not be rewarded, because n-grams only match the reference when the words are in the same consecutive order.

Also, a brevity penalty is applied to the score. We already saw that words in the candidate sentence that don’t appear in the reference sentences lower the score. Candidates with fewer words than the reference, on the other hand, will lower the maximum possible score by means of the brevity penalty. As a good explanation of the complete calculation, you can check out this page.
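For readers who want the calculation in one line: the standard formulation combines the modified n-gram precisions p_n for n = 1 to 4 with the brevity penalty BP as BLEU = BP · exp((1/4) · Σ log p_n), where BP = 1 if the candidate is at least as long as the reference and BP = e^(1 − r/c) otherwise (r and c being the reference and candidate lengths). In practice you rarely compute this by hand. Below is a minimal sketch of a before/after comparison on a held-out test set using the sacrebleu library; the two segments are borrowed from the examples later in this article, and the variable names are purely illustrative.

```python
# Minimal sketch: corpus-level BLEU before and after customization on a
# held-out test set, using sacrebleu (pip install sacrebleu).
# The segments are illustrative; a real test set would hold thousands.
import sacrebleu

references = [
    "Bedankt, we nemen zo spoedig mogelijk contact met u op.",
    "[Product] wordt toegediend in overeenstemming met officiële aanbevelingen.",
]
baseline_output = [
    "Dank je wel! We nemen zo snel mogelijk contact met je op.",
    "[Product] wordt gegeven volgens officiële aanbevelingen.",
]
customized_output = [
    "Bedankt! We nemen zo spoedig mogelijk contact met u op.",
    "[Product] wordt toegediend in overeenstemming met officiële aanbevelingen.",
]

# sacrebleu expects the hypotheses plus a list of reference streams
# (several streams when multiple references per segment are available).
baseline = sacrebleu.corpus_bleu(baseline_output, [references])
customized = sacrebleu.corpus_bleu(customized_output, [references])

print(f"baseline BLEU:   {baseline.score:.1f}")    # scores on the 0-100 scale
print(f"customized BLEU: {customized.score:.1f}")
print(f"difference:      {customized.score - baseline.score:+.1f}")
```

The .score attribute is reported on the 0–100 scale, which is how corpus-level scores such as the 44.3 and 51.3 in the experiment below should be read.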
BLEU score: more than the sum of its parts

The BLEU score is the type of metric that works best when applied to large amounts of data. First of all, don’t expect that each and every segment will be better if the BLEU score of a complete test translation is higher than that of another test translation. After all, the BLEU score is an average, and that means that individual segments will have different scores, better or worse. Furthermore, even in the case of two different candidate translations of the same segment, the one with the higher BLEU score is not necessarily the better translation. And finally, it is not recommended to compare larger bodies of translated text on their BLEU scores if the source texts are completely different. But on the whole, when comparing two larger bodies of candidate translations of the same source, the candidate with the higher score is generally perceived as the better translation.

What does raising the score look like?

Tuning training sets to boost machine translation requires a lot of trial and error. We had braced ourselves for a modest BLEU score impact of a few points at first, but were quite impressed to find that a good training set could boost a test translation by 6 points, or sometimes even around 10 points. That is actually a lot.

How much is a lot? The numbers may seem pretty impressive, but what do they do to the translations themselves? I’ll present some samples that demonstrate how translations can show great improvement. As it is my native language, I will give the samples for Dutch, but I will explain the subtleties well enough for non-Dutch readers to understand.

One of the customizations involved a large set of medical translations. We trained Amazon Translate using “Active Custom Translation”, which allows for on-the-fly tuning of translations using a bilingual corpus (a rough sketch of such a setup follows the topic list below). Some of the main topics in the training corpus were:

- how and when to administer medicines;
- which effects and side-effects to expect from medical treatments;
- setting up experiments for medical research;
- reporting on life science reports.
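Our own training setup is not documented in detail here, so the following is only a hypothetical sketch of how Active Custom Translation is typically wired up through the AWS SDK for Python (boto3): the bilingual corpus is registered as parallel data, and a batch translation job then references it. The bucket names, file paths, and IAM role ARN are made-up placeholders.

```python
# Hypothetical sketch of Amazon Translate "Active Custom Translation" with boto3.
# All resource names (buckets, paths, role ARN) are placeholders.
import boto3

translate = boto3.client("translate", region_name="eu-west-1")

# 1. Register the approved bilingual corpus (e.g. a two-column TSV of
#    source/target segments) as parallel data for on-the-fly tuning.
translate.create_parallel_data(
    Name="medical-en-nl",
    Description="Approved EN-NL medical translations",
    ParallelDataConfig={
        "S3Uri": "s3://example-bucket/training/medical-en-nl.tsv",
        "Format": "TSV",
    },
)

# 2. Start a batch translation job that adapts its output to the parallel data.
translate.start_text_translation_job(
    JobName="medical-testset-customized",
    InputDataConfig={
        "S3Uri": "s3://example-bucket/testset/source/",
        "ContentType": "text/plain",
    },
    OutputDataConfig={"S3Uri": "s3://example-bucket/testset/output/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ExampleTranslateRole",
    SourceLanguageCode="en",
    TargetLanguageCodes=["nl"],
    ParallelDataNames=["medical-en-nl"],
)
```

Once such a job has finished, the translated output can be compared against the held-out references with a BLEU computation like the sketch shown earlier.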
We used a test set of 2,000 segments. After customizing the translation with our training set, the total BLEU score went up by 7 points, from 44.3 to 51.3. There were 825 segments that showed some sort of change, out of which 600 had a higher BLEU score after customization. The ones with a negative impact on the BLEU score didn’t change as much on average as the ones with a higher BLEU score.

The changes to the translations came in all different forms, but some corrections come back more often than others. The training stimulated much more formal language. The source sentence ‘Thank you! We will contact you as soon as possible.’ changed from:

Dank je wel! We nemen zo snel mogelijk contact met je op.

to:

Bedankt! We nemen zo spoedig mogelijk contact met u op.

whereas the reference was:

Bedankt, we nemen zo spoedig mogelijk contact met u op.

Note that both ‘u’ and ‘je’ are translations of ‘you’, but ‘je’ is much more informal and will not be used to address people in a medical setting. ‘As soon as possible’ changed from ‘zo snel mogelijk’ to ‘zo spoedig mogelijk’. Both are correct, but ‘spoedig’ again has a more formal tone that makes it more like what you expect from an organization in the medical field.

Apart from using more formal language, the customized translation also sounded more professional for the medical field. For example:

[Product] is given according to official recommendations.

had the following reference translation:

[Product] wordt toegediend in overeenstemming met officiële aanbevelingen.

The customized translation was exactly the same as the reference translation. ‘Toegediend’ is a translation of ‘administered’ and is preferred over the uncustomized:

[Product] wordt gegeven volgens officiële aanbevelingen.

which uses the more literal ‘gegeven’ for ‘given’. The same is true for the difference in tone between ‘in overeenstemming met’ and ‘volgens’.

Other changes made the translation less ambiguous. For example:

[Substance] was studied in 14 main studies involving over 10,000 patients with essential hypertension.

was, without customization, translated as:

[Substance] werd bestudeerd in 14 hoofdonderzoeken waarbij meer dan 10.000 patiënten met essentiële hypertensie betrokken waren.

After customization, it was phrased exactly as the reference translation:

[Substance] werd onderzocht in veertien belangrijke studies waaraan meer dan 10 000 patiënten met essentiële hypertensie deelnamen.

Note here that the original sentence uses ‘studied’ in the sense of performing empirical research. The Dutch ‘bestudeerd’ can be used for that as well, but is more commonly used for learning from literature, while ‘onderzocht’ refers less ambiguously to scientific research. The same sort of disambiguation applies to ‘betrokken’ as a translation of ‘involving’: it is a good translation, and actually the most literal one. However, ‘deelnamen’ (‘participating’) is better since it implies more active involvement in the research. Finally, ‘hoofdonderzoeken’ is a bit strange in that it implies a sort of hierarchy among studies, whereas ‘belangrijke studies’ is perfectly natural in this context.

Another example:

Hallucinations are known as a side-effect of treatment with dopamine agonists and levodopa.

translated, without customization, to:

Hallucinaties staan bekend als een neveneffect van behandeling met dopamineagonisten en levodopa.

After customization, it translated to:

Hallucinaties zijn bekend als bijwerking van de behandeling met dopamine-agonisten en levodopa.

Here again, the customized version shows more knowledge of the field. With ‘staan bekend als’, the uncustomized translation suggests ‘Hallucinations are known for being a side-effect of’, hinting that most people probably know hallucinations only as side-effects of these particular treatments, whereas ‘Hallucinaties zijn bekend als’ simply states that it is known that hallucinations can occur as a side-effect. It might be subtle, but it is the difference between a good-sounding statement and one that would surprise the reader for the wrong reasons.
As a final example, customization was able to correct an incomprehensible translation in a very concise way. The source was quite awkward:

[Product name] also induced an advance of the time of sleep onset and of minimum heart rate.

The uncustomized translation was:

[Product name] veroorzaakte ook een voorschot van het begin van de slaap en de minimale hartslag.

which suggested some kind of ‘deposit’ (‘voorschot’) of the start of sleep and of the minimal heart rate. The customized translation removed that financial connotation and got it right that the product brings forward the time of sleep onset and of the minimum heart rate:

[Product name] vervroegt ook de tijd van inslapen en van minimale hartfrequentie.

These examples show how much a domain sets the expectations for the language to be used. When more generic models are used for translation, that expectation is breached, and that makes reading and understanding the text so much harder.

The BLEU score of a translation is not the kind of metric that immediately feels familiar. It has a maximum of 100% and a minimum of 0%, but apart from that, it is difficult to decide on hard limits for good or bad quality. Comparing values across domains and languages is not recommended, but the score will indicate improvements when applied to the same test set, as long as that test set is large enough and the reference translations are reliable.

When do improvements become noticeable? It is a matter of sensitivity, but improvements of more than 5 BLEU points make for a better translation. Not every sentence will be better, but the improvement on the whole is real and makes for a better read overall.

