The Next 5 Biggest Challenges in Neural Machine Translation

2021-02-28 03:50 Lingua Greca

The last few years have witnessed the rise of Neural Machine Translation (NMT), which has taken over the entire translation services industry. And this is just the beginning of a new era for the translation business. NMT has displaced rule-based translation systems, statistical machine translation (SMT) and all previous approaches to automatic translation. In some languages, NMT is steadily approaching near-human translation quality. Along with the great results, however, challenges have emerged along the way. In this article we will discuss five of the most difficult challenges facing NMT today.

1. Out of Domain

Let's start by illustrating the concept behind this challenge with a short example. Take the word "second": it can be a measurement of time, or it can denote the placement of something or someone after the "first." These different domains carry different meanings and should end up as different translations.

Why is this such an important topic? The main reason is the need to develop targeted, domain-specific systems instead of broad or cross-domain systems, which produce lower translation quality. A popular approach is to train a general-domain system first, then continue training on in-domain data for a few epochs; in other words, a bit of customization for the specific domain. (A minimal sketch of this recipe follows challenge 2 below.)

In the field, it is very common to find large amounts of training data (OPUS, ParaCrawl, TED Talks, etc.) available for a very broad domain. This makes domain adaptation crucial to building a successful system from those public data sets. Here at Acclaro, we have trained engines targeted at a very specific domain, such as legal, starting from generic data sets. Not only does such a system encounter out-of-domain words, it also finds many new words never seen in training. As a note, this was an exercise for low-resource language pairs and the entry point for building a targeted neural machine translation engine.

To illustrate the problem in more depth, let's look at the following table from Philipp Koehn's book, Neural Machine Translation.

Image caption: Quality of systems (BLEU) when trained on one domain (rows) and tested on another domain (columns), NMT as green bars and SMT as blue.

The experiment confirms the problem and shows very concisely how important it is to address domain adaptation before training the engine, making sure above all that the data is consistently in the desired domain.

2. Amount of Training Data

Machine translation quality relies heavily on the amount of training data. For SMT systems, the correlation between the amount of data and BLEU scores was almost direct, but in NMT the relationship, like a Facebook relationship status, is complicated. Given more data, NMT is more likely to generalize and perform better across a larger context.

To put numbers on it (they vary by language pair): an NMT system needs at least 20 million words of training data, and only truly outperforms other systems once the volume rises above 30 to 35 million words. To strengthen the case with a good example, let's borrow the following chart from Koehn.

The chart illustrates how heavily an NMT system depends on data. It needs a huge amount of training data, which is not always easy to find, especially when dealing with rare domains or low-resource languages.
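To make the recipe from challenge 1 concrete, here is a minimal fine-tuning sketch. It assumes the Hugging Face transformers and datasets libraries, a public general-domain Marian model and a hypothetical legal corpus file; none of these choices reflect our production setup.

```python
# A sketch of "general model first, in-domain customization after."
# The corpus file, column names and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (DataCollatorForSeq2Seq, MarianMTModel, MarianTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

base = "Helsinki-NLP/opus-mt-en-fr"            # general-domain baseline engine
tokenizer = MarianTokenizer.from_pretrained(base)
model = MarianMTModel.from_pretrained(base)

# Hypothetical in-domain (legal) parallel corpus with "en" and "fr" columns.
corpus = load_dataset("csv", data_files="legal_en_fr.csv")["train"]

def encode(batch):
    enc = tokenizer(batch["en"], truncation=True, max_length=64)
    enc["labels"] = tokenizer(text_target=batch["fr"], truncation=True,
                              max_length=64)["input_ids"]
    return enc

train_set = corpus.map(encode, batched=True, remove_columns=corpus.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="marian-legal",
    num_train_epochs=3,                 # "a few epochs" of in-domain training
    learning_rate=2e-5,                 # low rate, so general knowledge survives
    per_device_train_batch_size=16,
)
Seq2SeqTrainer(model=model, args=args, train_dataset=train_set,
               data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)).train()
```

The low learning rate and short schedule express the "bit of customization" idea: enough passes to absorb in-domain vocabulary without washing out the general-domain knowledge underneath.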
3. Long Sentences

A long-known flaw of NMT models, and especially of the early encoder-decoder architectures, was their inability to translate long sentences properly. Fortunately, attention models (part of our engines since 2018) remedied this problem somewhat, although not entirely. In our many experiments with the migration to attention models, we reached a consensus: cap sentences at a threshold of 50 to 60 words to obtain the best possible results for every translation request. Overall, results showed an immense decrease in BLEU scores once a translation request exceeded 54 tokens. (A short corpus-filtering sketch built on this threshold follows the beam search example below.)

As a side note, it is important to mention that SMT on sentences of 60 or more words does not show the same weakness as NMT. However, the same problem appears in SMT at 80 or more words (that is, very long sentences).

4. Beam Search

The task of translation has been tackled over the years with different search techniques that explore a subset of the space of possible translations. A common parameter in those searches is the beam size, which limits the number of candidate translations (partial hypotheses) kept at each step of decoding. In SMT there is typically a clear relationship between this parameter and the model's overall quality score: the larger the beam, the higher the expected score. In NMT there are two factors to consider, quality and speed: with large beam sizes, BLEU scores can slide from top performance to mediocre or even really bad translations, and decoding slows down.

When tuning the beam size, Acclaro's NMT group found that a value between five and eight proved to be the best possible alternative; we never recommend a beam size over 10, as it hurts both output quality and runtime performance. Why? In short, because the number of words per second the model can translate falls as the beam widens: the more partial translations kept alive, the more hypotheses and follow-on word predictions the model must score at each step. That's one of the reasons we recommend never oversizing your beam. Depending on your performance needs, keep the value between five and eight for an optimal balance of score and words-per-second throughput.
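The beam-size trade-off is easy to observe, since most decoding APIs expose the setting directly. Here is a sketch using a public Marian model and the Hugging Face generate() call (the model and sentence are arbitrary, and our own engines differ):

```python
# A sketch comparing beam sizes at decoding time.
from transformers import MarianMTModel, MarianTokenizer

base = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(base)
model = MarianMTModel.from_pretrained(base)

inputs = tokenizer("The second witness arrived a second later.",
                   return_tensors="pt")
for beams in (1, 5, 8, 12):        # 5-8 is the band we recommend; 12 is too wide
    out = model.generate(**inputs, num_beams=beams, max_new_tokens=60)
    print(beams, tokenizer.decode(out[0], skip_special_tokens=True))
```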
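And here is the corpus-filtering sketch promised in challenge 3: a single pass that drops any training pair whose source or target side exceeds the length ceiling. The exact threshold, file names and whitespace tokenization are illustrative.

```python
# A sketch of the 50-60 word cut-off on parallel training data.
MAX_TOKENS = 55                        # inside the 50-60 word band

def keep(src: str, tgt: str) -> bool:
    return len(src.split()) <= MAX_TOKENS and len(tgt.split()) <= MAX_TOKENS

with open("corpus.en") as src_in, open("corpus.fr") as tgt_in, \
     open("corpus.filt.en", "w") as src_out, \
     open("corpus.filt.fr", "w") as tgt_out:
    for src, tgt in zip(src_in, tgt_in):
        if keep(src, tgt):
            src_out.write(src)
            tgt_out.write(tgt)
```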
5. Word Alignment

Almost every translation request requires word alignment between the source and target text. The reasons include tags, formatting, ICU and a long list of other features we encounter daily. Fortunately, a key product of the attention mechanism (included in our NMT engines since 2018) is the attention table, which holds, for every input and output word pair, the probability that they align.

But here lies the challenge surrounding word alignment for NMT: the role the attention mechanism plays in the correspondence between input and output words is not the same as in SMT. In NMT, attention has a broader role: it pays attention to context. For instance, when translating a noun that is the subject of a sentence, attention may also fall on the verb and any other words related to or describing the subject. Context can help clarify the meaning. That said, word alignment by itself is not the most reliable way to determine any given word's meaning, and it certainly cannot be the only one. For these reasons, the attention model may choose alignments that do not correspond to our intuition or to expected alignment points.

We strongly suggest that translation tasks involving alignment features include guided alignment training, in which supervised word alignments (such as those produced by fast-align, which correspond more closely to intuition) are provided to the model during training. (A practitioner's footnote with a sketch of the fast-align data format closes this post.)

Crossing the Final NMT Barriers

Notwithstanding the widespread success of NMT, there are still a few barriers it has yet to cross. Of the five challenges covered here, out-of-domain content and an insufficient amount of training data take the biggest toll on any system's performance. As we seek to address these challenges, Acclaro's NMT group is always happy to collaborate with NMT practitioners and researchers who share this focus. And, last but not least, we encourage our clients to treat these two factors as the most prominent ones whenever they use machine translation: How much data is enough data? And is the data in the domain I want? Talk to us today about how we put this data to work when building a successful localization program.
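A practitioner's footnote on challenge 5: fast-align reads one sentence pair per line in "source ||| target" form and emits Pharaoh-format links such as "0-0 1-2". A minimal data-prep and parsing sketch, with illustrative file names:

```python
# A sketch of preparing bitext for fast-align and reading its output back.
with open("corpus.en") as src_in, open("corpus.fr") as tgt_in, \
     open("bitext.en-fr", "w") as out:
    for src, tgt in zip(src_in, tgt_in):
        out.write(f"{src.strip()} ||| {tgt.strip()}\n")

# Shell step, using the fast_align binary:
#   fast_align -i bitext.en-fr -d -o -v > fwd.align

# Parse one line of the output into (source index, target index) pairs.
with open("fwd.align") as aligns:
    links = [tuple(map(int, link.split("-")))
             for link in aligns.readline().split()]
    print(links)                       # e.g. [(0, 0), (1, 2), ...]
```

These supervised links can then be fed to an NMT toolkit that supports guided alignment training, nudging the attention layer toward intuitive word correspondences.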