Managing Machine Translation Engine Quality
Machine Translation (MT) has taken tremendous strides in the past decade, improving in quality to become an essential part of translation workflows. Yet taking full advantage of MT can be tricky for both new and existing users who are unsure how to choose the right engine. Here's a breakdown of MT engine quality and how to choose the optimal engine for your content.
A complete newcomer to Machine Translation? Check out our Beginner’s Guide to Machine Translation.
Start your (machine translation) engines
Whether you are starting with MT or already leveraging it in your translations, the single most important factor is your MT engine.
Today there is an enormous number and variety of MT engines to choose from. The MT landscape is also continuously changing, with the constant release of new engines and the ongoing improvements of existing ones. Picking out the best performing engine can be a complicated and frustrating process.
It helps to think of the big picture. The main advantages of using MT are time and cost savings: the speed of translation is effectively instantaneous and the cost minuscule when compared to human translation. This is generally true of all MT engines available today.
This leaves only one major stumbling block: the quality of the MT output. This is perhaps the most important variable to consider when managing MT workflows, as poor output can jeopardize the gains in time and cost.
About machine translation quality
Recent developments in MT, key among them the wholesale transition from statistical machine translation to neural machine translation, have dramatically improved the base level quality of MT output. Our own internal data suggests that since 2017, the likelihood of getting a near-perfect segment, requiring minimal post-editing, has nearly doubled. The most commonly used engines today are likely to produce passable translations that can convey the meaning, if not exactly the nuance of the original text.
The trust you place in the quality of MT is largely dependent on the size and importance of your task. A student hoping to quickly translate a few lines of homework before their language class (shame on you) doesn't have to be especially picky: all of the major MT engines used today are likely to produce a passable translation. Errors are more likely to happen due to an ambiguous source text, rather than a poor MT engine. But if you are looking to translate your life motto into French or Chinese for an interesting tattoo, you might want to have it double-checked by a native speaker. The internet has no shortage of pictures of bad tattoos, a testament to people who put too much trust in their MT engines.
Things change with scale. For a large enterprise, a “passable” translation may not be good enough. With a growing volume of translations, simple errors can start to add up and the likelihood of catastrophic errors is proportionately increased, ultimately requiring more extensive (and expensive) human review and post-editing. Cents become dollars and the workflows start to slow down.
An increase in scale can also reveal positives. The more you translate, the more likely you are to see differences between MT engines that you might not notice when focusing on smaller samples. Incremental differences will start to add up. Some engines will perform better, and using the right engine could result in comparative increases in quality and savings. Choosing the best performing engine is important.
See how different MT engines stack up in our Machine Translation Report.
Machine translation engine types
When choosing an MT engine you can opt for either a generic engine, such as Amazon Translate, Google Translate, or Microsoft Translator, or for a custom engine. Both types of engine rely on past translation data to produce their results.
Custom engines are trained using data that you provide them with to help refine their output. Successful past translations are used to guide the engine, making it more likely to produce the kind of translations that you are used to. Travel and hospitality content, for example, is especially suitable for custom engine training. Hotel listings or user reviews often share similar characteristics, and the sheer amount of content available makes engine training both possible and desirable.
This specificity is the greatest advantage of a custom engine, but also its main drawback. By focusing on specific types of content, out-of-domain performance is likely to be worse. Your engine, trained on hotel descriptions and reviews, may perform poorly translating news articles.
Custom engines are generally more expensive to set up and maintain. They are well suited for businesses that handle large volumes of copy that are quite similar in style and content and are able to justify the slightly higher costs involved.
Generic engines are the best option for most users, as the setup is quick and the costs significantly lower than those of custom engines. Choosing one engine over another may be a slightly more complicated process if you place a premium on quality.
Evaluate or estimate machine translation quality?
When choosing an engine, it is always a good idea to evaluate the MT output quality to know if you will be getting your money's worth. Many MT users carry out extensive evaluations of all available options before they commit to an engine. The industry has adopted a number of quality metrics to help standardize the process.
Generally, we can distinguish between quality evaluation and quality estimation.
Quality evaluation assesses the quality of MT output, usually in reference to a human translation of the same source text. While most readers can easily determine which translation is more ‘natural’, a purely subjective evaluation cannot scale effectively.
One method of evaluation is to rely on the evaluation of bilingual experts, who rate the quality of the MT output and the output of professional translators in a blind test. In the past, this method has been used to make some bold claims about the rising quality of MT, but it does have some significant limitations.
Primarily, it is a question of cost: carrying out this test requires human translators and human evaluators. To get an accurate evaluation you may need to invest significant resources in the test. There are also concerns about the subjectivity inherent in any evaluation; studies have shown that professional translators are more likely than non-professional linguists to give higher marks to human translation. Similarly, evaluation at segment level is more likely to reflect favorably on MT than evaluating each segment in the context of the whole article.
An alternative is to rely on computer algorithms to evaluate high volumes of translation quickly to produce an objective numerical score. This score is produced by an automated comparison of the MT output with a reference translation. The exact variables involved in the calculation differ from algorithm to algorithm, but generally speaking the closer the MT output is to the reference translation, the higher the score.
There is an enormous variety of different algorithms, the most commonly used today include:
BLEU (BiLingual Evaluation Understudy)
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
Each of these algorithms takes a different approach to measuring how “similar” the MT output is to the reference translation, and their relative advantages and disadvantages are a debate in themselves.
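To make the comparison concrete, here is a toy sketch of the clipped n-gram precision idea behind BLEU-style scoring. This is a deliberately simplified illustration with invented example sentences, not the full BLEU metric, which adds smoothing and is normally computed over whole corpora rather than single sentences.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts only
    up to the number of times it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of 1..max_n precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * geo_mean

mt_output = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(round(simple_bleu(mt_output, reference), 3))  # 0.707
```

The closer the MT output's word sequences match the reference, the closer the score gets to 1.0; a perfect match scores exactly 1.0. Real-world implementations such as NLTK's `sentence_bleu` handle edge cases (short sentences, multiple references) that this sketch ignores.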
Generally speaking, quality evaluation is an effective way of assessing output that gives the user a lot of control over the process and a reliable result that allows effective comparisons between engines. However, the need for human-translated texts as a reference point and the process of setting up the evaluation itself make this a relatively slow and expensive method. A further drawback is that these evaluations effectively produce ‘snapshots’ of a given point in time. Most MT engines today improve rapidly, so yesterday’s results might not be true today.
Quality estimation works differently. Rather than evaluating the output of an MT engine, it instead analyzes the source text you wish to translate and, based on certain criteria, predicts how good the translation might be.
To take an example that’s close to home, Memsource has developed a form of quality estimation, known as Machine Translation Quality Estimation (MTQE). No reference translation is needed, only the source text, as the estimation is powered by past performance data. The “quality” itself is estimated as the need to make edits to the output produced by the engine. With MTQE this is expressed as a percentage assigned to specific segments of the translation: a score of 100% suggests that this particular segment is perfect and requires no edits, while a score of 75% suggests that there might be some room for improvement. Though the estimation is made on a granular level for each segment, cumulatively the scores can give you an idea of how well the engine performs. One of the benefits of quality estimation is that it is a dynamic process: it continuously improves its results based on user feedback, rather than remaining ‘static.’
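Segment-level percentages like these can be rolled up into a document-level picture. The sketch below is a hypothetical illustration of that aggregation — the sample scores, thresholds, and category names are ours, not Memsource's actual MTQE scale — bucketing segments so you can see at a glance how much of a document is likely to need post-editing.

```python
# Hypothetical per-segment quality-estimation scores (percent).
segment_scores = [100, 100, 75, 100, 0, 75, 100]

def summarize(scores):
    """Bucket segment scores and compute a simple average.
    Thresholds here are illustrative, not an official MTQE scale."""
    buckets = {
        "no edits needed (100%)": 0,
        "light post-editing (75-99%)": 0,
        "full review (<75%)": 0,
    }
    for s in scores:
        if s == 100:
            buckets["no edits needed (100%)"] += 1
        elif s >= 75:
            buckets["light post-editing (75-99%)"] += 1
        else:
            buckets["full review (<75%)"] += 1
    average = sum(scores) / len(scores)
    return buckets, average

buckets, average = summarize(segment_scores)
print(buckets)            # counts per post-editing category
print(round(average, 1))  # 78.6 for the sample scores above
```

In practice a weighted average (by segment length) would be more informative than the plain mean used here, since long low-scoring segments cost more post-editing effort than short ones.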
Whichever method you choose, it will provide you with a sense of how the different engines perform and which might be the ideal candidate for your machine translation needs.
More than one machine translation engine?
It is also important to note that you do not have to commit to a single engine. Most translation management software enables its users to switch relatively quickly from engine to engine. You might find that engine A is suitable for a certain language pair, while engine B is better suited for translating specific kinds of content. If you commit yourself to engine A or B exclusively, you will lose out on the quality gains achieved by the other engine in specific areas.
At Memsource we have developed Memsource Translate, a unique machine translation management solution that allows you to conveniently leverage multiple engines for the best possible translations. Our AI-powered algorithm automatically selects the best performing MT engine for your content, based on the language pair and content type of your document. Data on engine performance is collected in real-time and used to continuously update the algorithm’s recommendations. Memsource Translate comes with three fully managed engines and allows users to add their own, including customizable engines. The process of engine management and testing becomes automated, helping both newcomers to MT and existing users to optimize their workflows.
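The idea of routing content to the best-performing engine can be sketched in a few lines. Everything below is hypothetical — the engine names and scores are invented for illustration and do not reflect the actual Memsource Translate algorithm — but it shows the core lookup: pick the engine with the best recorded score for a given language pair and content type, and fall back to a default when no performance data exists yet.

```python
# Hypothetical performance data: (language pair, content type) -> engine scores.
# All engine names and score values are invented for illustration.
performance = {
    ("en-fr", "marketing"): {"engine_a": 0.82, "engine_b": 0.74},
    ("en-fr", "legal"):     {"engine_a": 0.61, "engine_b": 0.79},
    ("en-de", "marketing"): {"engine_a": 0.77, "engine_b": 0.70},
}

DEFAULT_ENGINE = "engine_a"

def pick_engine(lang_pair, content_type):
    """Choose the engine with the highest recorded score for this
    (language pair, content type); fall back to a default otherwise."""
    scores = performance.get((lang_pair, content_type))
    if not scores:
        return DEFAULT_ENGINE
    return max(scores, key=scores.get)

print(pick_engine("en-fr", "legal"))      # engine_b scores higher here
print(pick_engine("ja-en", "marketing"))  # no data, so the default engine
```

A production system would also update the score table continuously from post-editing feedback, which is what makes this kind of routing dynamic rather than a one-off choice.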
Machine translation engine quality shouldn’t stop you from leveraging MT to its full potential. There are many ways to approach the quality conundrum, and new innovations will enable you to go further with your translations.