Salesforce Just Open-Sourced a Large, XML-Tagged Machine Translation Dataset

2020-07-29 17:20 slator

Training neural machine translation (NMT) engines with XML tags can improve translation accuracy when working with text data, according to a June 2020 paper published by a team of researchers at Salesforce.

As part of this research, the team, which includes language industry veteran Teresa Marshall, Vice President of Globalization and Localization at Salesforce, made available on GitHub a dataset that draws on the software company’s professionally translated online help documentation. The entire dataset covers 17 languages — any of which can be used as a source or target language — and includes about 7,000 pairs of XML files for each language pair.

“Our work is unique in that we focus on how to translate text with XML tags, which is practically important in localization,” lead researcher Kazuma Hashimoto told Slator.

A new dataset was necessary for the team’s research, as widely used datasets of plain text do not reflect the fact that “text data on the Web is often wrapped with markup languages to incorporate document structure and metadata such as formatting information,” the researchers explained.

“We decided to publish our new dataset so that people can use it if interested, and we can also gain significant benefit if they report interesting solutions to our task,” Hashimoto said, pointing out that the source data, online help for Salesforce customers, was already publicly available.

Looking ahead, the team wrote, “As our dataset represents a single, well-defined domain, it can also serve as a corpus for domain adaptation research (either as a source or target domain).”

According to the paper, this online help text has been localized and maintained for 15 years by the same localization service provider and in-house localization program managers. “At every release, we run our system to translate the content in English to other target languages, and then human experts verify the quality and perform post-editing to meet the quality demand,” Hashimoto said.

Drawing on this multilingual content, the researchers created datasets for seven English-based language pairs (English to Dutch, Finnish, French, German, Japanese, Russian, and Simplified Chinese) and one non-English pair, Finnish to Japanese.

The group performed baseline experiments on NMT output with XML tags removed (i.e., plain text) and compared them to experiments on NMT output with XML tags included. The team trained three models for each language pair: one trained only with text, without XML; one trained with XML; and one trained with XML and with copy mechanisms, which copy XML elements from the original source text.

For the plain text NMT, “including segment-internal XML tags tends to improve the BLEU scores,” the authors wrote, which “is not surprising because the XML tags provide information about explicit or implicit alignments of phrases.” This was not the case, however, for English to Finnish, “which indicates that for some languages it is not easy to handle tags within the text.” Similarly, the model trained with both XML and copy mechanisms achieved the best BLEU scores for both plain text and text with XML tags across all language pairs, except for English to French plain text.
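To make the difference between the model variants concrete, here is a minimal sketch of how a plain-text baseline segment can be derived from a tagged one by removing inline XML tags. The <ph> tags and the sentences are invented for illustration and are not taken from the Salesforce dataset or the paper’s preprocessing code.

```python
import re

# Hypothetical tagged segment pair in the spirit of the dataset; the real
# dataset's tag inventory, languages, and file layout may differ.
src_xml = 'Click <ph id="1">Save</ph> to store the record, for example, a <ph id="2">Contact</ph>.'
tgt_xml = 'Cliquez sur <ph id="1">Save</ph> pour enregistrer, par exemple un <ph id="2">Contact</ph>.'

def strip_tags(segment: str) -> str:
    """Drop inline XML tags to recover the plain-text variant of a segment."""
    return re.sub(r"</?[^>]+>", "", segment)

print(strip_tags(src_xml))  # Click Save to store the record, for example, a Contact.
print(strip_tags(tgt_xml))  # Cliquez sur Save pour enregistrer, par exemple un Contact.
```

In the XML-aware settings, the tags are instead kept in the training data, and the copy-mechanism variant additionally learns to carry tag tokens over from source to target.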
“We expected that tagged text would be helpful in improving translation accuracy,” Hashimoto said, “especially when the training dataset size is limited, as in our specific use case, compared with very general machine translation work in existing research papers.”

The researchers also encountered a typical error, undertranslation, when they found that the phrase “for example” was missing in certain translation results, despite the fact that the dataset’s BLEU scores were higher than those of other, standard public datasets. For this reason, and because online help translations must be accurate, the authors concluded that NMT should be used “for the purpose of helping the human translators” perfect final translations.

Although human evaluators identified more than 50% of the translation results as “complete” or “useful in post-editing,” translators still spent a significant amount of time verifying MT and correcting MT errors. Ideally, future translation models that take into account Web-structured text “may help human translators accelerate the localization process,” according to the paper’s authors, whose future work will explore “the effectiveness of using the NMT models in the real-world localization process where a translation memory is available.”
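As a rough illustration of this kind of evaluation, the sketch below computes corpus-level BLEU with the sacrebleu library and adds a naive substring probe for a dropped phrase. The French sentences are invented examples, and this is not the evaluation pipeline used in the paper.

```python
# Illustrative only: corpus BLEU plus a crude check for a dropped phrase.
# Assumes sacrebleu is installed (pip install sacrebleu).
import sacrebleu

hypotheses = [
    "Cliquez sur Save pour enregistrer l'enregistrement.",           # phrase dropped
    "Cliquez sur Save pour enregistrer, par exemple un Contact.",
]
references = [
    "Cliquez sur Save pour enregistrer l'enregistrement, par exemple un Contact.",
    "Cliquez sur Save pour enregistrer, par exemple un Contact.",
]

# sacrebleu expects a list of reference streams, one stream per reference set.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Naive undertranslation probe: flag hypotheses missing a phrase that the
# reference contains (here the French counterpart of "for example").
for hyp, ref in zip(hypotheses, references):
    if "par exemple" in ref and "par exemple" not in hyp:
        print("Possible undertranslation:", hyp)
```

A corpus-level score can stay high even when short phrases are dropped, which is why a targeted check of this kind can complement BLEU in a post-editing workflow.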