Approaches to assessing the quality of machine translation


2022-07-05 00:25 GALA



Machine translation is now used almost universally. This raises the issue of assessing the quality of the resulting product. This is partly done with automated algorithms and partly by humans, in the traditional way. Here we present the approaches we currently use.

Language quality

This is the traditional approach to assessing translations. The best-known tool in this respect is LISA QA, where mistakes are graded by severity and by type. A major problem with this kind of assessment is that it leads reviewers to spot far more errors in machine translations than in human translations.

Edit distance

Edit distance is interpreted and used in ways that differ from its original scope. Some see it as the amount of text that needs to be corrected; others see it as the amount of time needed to bring the text up to the required quality level. We follow the second approach.

In theory, a comparison of edit distances would help identify the best machine translation. It is essential to have a clear understanding of the reason for selecting a specific machine translation program: whether it is to provide a usable unedited text or a translation that will then be edited. In the first case, the key is the amount of text to edit (i.e., the number and severity of errors), assuming that the translation is assessed following the traditional approach. In the second case, machine translation should be seen as a tool, and the main issue is how much it improves the effectiveness of the process, i.e., how much time must be spent on editing.

In any case, human assessment is prone to subjectivity. Linguists all have their own strengths and weaknesses. For one, terminology may be a weak point, so a lot of time is spent looking for the right term; another takes longer over grammar and fluency.

Estimating the edit distance

Edit distances can be computed precisely only downstream, and doing so is expensive and time-consuming. We therefore decided to estimate the amount of time needed to edit a text rather than actually measuring it. To this end, we identified three types of errors, based on how long it takes to correct them. Coefficients are applied to errors following this logic: the time needed to correct a comma is assumed to be less than the time needed to correct a grammatical error, which in turn is less than the time needed to rework an unedited fragment. Human error is the main source of inaccuracy in the assessment process. We found that, despite changing the criteria given to editors, they continued to assess translations on linguistic grounds. In fact, in all the batches analyzed, the scores correlated with the results of linguistic quality assessment rather than with the actual edit distance.

Percentage of similarity

Another widely used method involves comparing an unedited translation with an edited version. The idea may at first seem attractive (the fewer the corrections, the better the original translation), but in practice we found this approach ineffective. The algorithms used to assess similarity levels are often rather imprecise. For example, they frequently treat changing a capital letter to a lower-case one as equivalent to changing a whole word. Also, editors often use the filter options in CAT tools: a text may contain a dozen occurrences of a term, but the editor changes them all with the find-and-replace function. This takes just seconds to do, yet the volume of edited text may be substantial.

Automated quality assessment tools

These days there is a great deal of talk about BLEU, hLepor, COMET and CHRF+.
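As a concrete illustration of how such reference-based scores are typically computed, here is a minimal sketch using the open-source sacrebleu library, which implements BLEU and chrF. The choice of library, the engine names and the sentences are illustrative assumptions, not the tooling or data described in this article.

```python
# A minimal sketch (not the tooling described in this article): scoring two
# machine translation engines against the same human reference with BLEU and
# chrF via sacrebleu. Install with `pip install sacrebleu`; the engine names
# and sentences are invented for illustration.
import sacrebleu

# One human reference translation per segment (a single reference stream).
references = [[
    "The spare part must be replaced before the next maintenance cycle.",
    "Tighten the bolts to the torque specified in the manual.",
]]

# Hypothetical raw outputs from an older engine and a newly tuned one.
old_engine = [
    "The spare part must be changed before next maintenance cycle.",
    "Tighten bolts to the torque that the manual specifies.",
]
new_engine = [
    "The spare part must be replaced before the next maintenance cycle.",
    "Tighten the bolts to the torque given in the manual.",
]

for name, hypotheses in (("old engine", old_engine), ("new engine", new_engine)):
    bleu = sacrebleu.corpus_bleu(hypotheses, references)  # n-gram overlap score
    chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score
    print(f"{name}: BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
```

As discussed below, the absolute scores say little on their own; the more useful signal is the change in score between the two engines, measured against the same reference.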
Indeed, these algorithms do not express the quality of a machine translation output. They do allow users to compare dozens of alternative machine translation platforms very quickly, but they are by no means translation quality tools. Also, the results of comparisons can easily be misinterpreted. Most automatic scoring algorithms express the similarity of a sample output to a reference text, typically a human translation. This means that if the same tool is given two different human translations of the same text, neither of which uses the phrasing contained in the reference corpus, the two may be rated quite differently, and possibly worse than the sample machine translation output. This initially led us to false conclusions.

The best way to use this kind of tool is to measure the increase in score, comparing the output of a newly tuned engine with that of the old one. A comparison of many different machine translation systems may simply lead to the conclusion that the system giving the best results is the one that was used to produce the reference translation. Another clear drawback is the need for a reference translation in order to run a comparison.

Conclusion

Automation is becoming unavoidable in our shared future. It may be only a matter of time until we have reliable automated assessment tools, yet it is still too early to rely on them entirely. The use of human-operated tools and assessment by human specialists will remain the standard methods for assessing the quality of machine translations for the next few years, the key being an intelligent and critical approach to the whole process.

