Effectiveness of Domain-specific Language Data


2019-10-07 17:00 TAUS



Not so long ago, I was at the airport, listening to a voice announcing a gate change. I couldn't help but notice that the sentences sounded somewhat unnatural, as if parts of them had been cut and pasted together. Shortly after, I had a chance to see backstage the technology used by a company providing natural voice announcements. I was surprised to learn that it consisted of cutting and pasting parts of pre-recorded sentences into new ones, under the guidance of native speakers. I'm no expert in the field, but opting for what seems a largely manual process struck me as old-fashioned in this age of data and technology.

Language Data Barrier

You could say that the technology is not there yet, but it is. And I don't mean only the Alexas, Cortanas and Google Assistants, but also the available open-source systems and algorithms, used both to empower research and to support derivative work. In 2017, we wrote about this democratization of algorithms and concluded that machine translation is a simple sum of algorithms and data (see Nunc est Tempus). The algorithms and models are out there, up for grabs - statistical, hybrid, neural - the choice is yours. And what about the data? Google's recent Massively Multilingual NMT research is proof of the leap forward that can be achieved when you have large volumes of language data. What still appears to be the main bottleneck is the lack of data in a particular target language or domain.

Language data do exist, often in large quantities (see the Powering Language Data blog). Before they can be used, these data need to be transformed - into the right structure, fit for a specific purpose or domain. While generic engines perform well on general-purpose text, a machine translating text in a particular linguistic domain will give the best results when trained on a customized data set, carefully selected to cover the vocabulary and semantic specificity of the content.
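As a toy illustration of this point (not the TAUS pipeline, and with invented counts): an engine's preference for one sense of an ambiguous word like virus simply mirrors the relative frequencies observed in its training corpus.

```python
from collections import Counter

# Hypothetical sense observations for the ambiguous word "virus"
# in two toy training corpora (all counts invented for illustration).
it_corpus = ["malware"] * 3 + ["pathogen"] * 1        # IT help-desk data
life_sci_corpus = ["pathogen"] * 3 + ["malware"] * 1  # life-sciences data

def sense_probs(observations):
    """Relative frequency of each candidate sense."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}

it_probs = sense_probs(it_corpus)
bio_probs = sense_probs(life_sci_corpus)

# A domain-trained engine simply inherits the dominant sense of its data:
print(max(it_probs, key=it_probs.get))    # → malware
print(max(bio_probs, key=bio_probs.get))  # → pathogen
```

In a statistical engine such as Moses, the same effect shows up as higher phrase-table probabilities for the domain's preferred translations.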
A simple explanation is that an engine trained on data relevant to the domain will have a built-in bias towards that domain, and will deal better with issues like word sense disambiguation - determining which meaning of a word is the best fit in a given context. Take, for example, the word virus and the different meanings it has in software translations as opposed to the life sciences domain.

To address this lack of availability of domain-specific data and the growing number of engine customization use cases in the industry, we released the TAUS Matching Data clustered search technology earlier in 2019. It is a search technique that uses an example data set to query a data repository, calculates matching scores on segment level and returns high-fidelity matched data. That was also a perfect opportunity for us to test our own hypothesis - that the data today are just as relevant as, if not more relevant than, the algorithms.

In-domain Data Experiment

For our experiment, we chose a well-known platform in the machine translation field - the WMT workshop, which posts a number of machine translation tasks every year. Each of the WMT tasks has a goal and a set of success metrics, inviting MT practitioners to perform experiments with their algorithms or data and to submit their results. The shared task focused on domain adaptation, and hence relevant to our hypothesis, was the WMT16 IT translation task. The task provided both in-domain and out-of-domain training data, together with in-domain test data in seven languages for the IT domain.

The rules of the game are simple:

- One can use any system and any data to train an engine and submit the relevant results.
- Using additional training data next to the provided WMT data is possible, but must be flagged. This type of training is called unconstrained, while constrained training only uses the provided WMT training data.
- The test data set and the corresponding reference translations are included in the task.
- The testing consists of translating the provided test data set with the chosen engine, and using the reference translations to compute a BLEU score.
- The final step is to post your score and compare it with the other submissions.

TAUS Data Selection and Training Process

The in-domain training data provided in the WMT task (Batch1 and Batch2) consisted of 2,000 answers from a cross-lingual hardware and software help-desk service. In order to test the performance of the in-domain IT corpus created with the Matching Data technology, we used the WMT training data as the query corpus to select matching sentences from our vast language data pool (the TAUS Data Cloud) and create our own IT training corpus for the selected six languages. That qualified our training as unconstrained. This Matching Data was returned from the Data Cloud:

As for the system to train, we deliberately chose an open-source, vanilla Moses engine, and not a more advanced neural or hybrid system, so that the focus would be solely on the performance of the training data and not of the system itself. After training the engine on the Matching Data IT corpus, we used the WMT test set (Batch 3) and the reference translations to score our engine output and calculate a BLEU score.

Performance

Overall, the MT engines trained with the matching data corpora performed strongly across all language pairs. For three language pairs, they beat all the other systems submitted as part of the WMT 2016 IT task. For the other three languages, they were on par with the other submissions for the particular language pair. What is more important with regard to our hypothesis is that the matching data corpora outperformed all submissions where only the WMT in-domain training corpora were used (constrained), regardless of the engine model (see the table below), proving that there is a guaranteed improvement coming solely from fine-tuning the data.
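The evaluation step - translating the test set and scoring the output against the references - comes down to computing BLEU. Below is a minimal, self-contained sketch of corpus-level BLEU (modified n-gram precision with a brevity penalty); a real evaluation would use a standard tool such as Moses' multi-bleu.perl, and the single-reference, zero-smoothing setup here is a simplifying assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram precisions
    (orders 1..max_n) times a brevity penalty. One reference per
    hypothesis, for simplicity."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if min(clipped) == 0:   # avoid log(0); real tools apply smoothing here
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

# A translation identical to its reference scores a perfect 1.0:
print(corpus_bleu(["click the start button to open the menu"],
                  ["click the start button to open the menu"]))  # → 1.0
```

Submissions to the WMT matrix report this score scaled to 0-100.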
What this also proves is the relevance of the data to the target domain, even in a setup where a basic Moses machine translation system and a relatively small training set are used. The scores and comparison are based on the system submissions available on http://matrix.statmt.org for the it-test2016 test set.

Sourcing Domain-specific Training Data

Adaptation of MT engines for domains is quite common these days. Most MT providers offer it as a service, and so do giants like Google (AutoML) and Microsoft (MS Business Translator). Whichever option you choose, you will first need to source the in-domain parallel training data. You can start by looking at the data that your organization collects, which is most likely not MT-ready, or search for a third-party provider able to supply in-domain data, which might still be too generic for your purpose. Or, if you want to make sure to get training data that matches your specific needs, you can have TAUS perform data matching for you, based on your own query corpus - nothing can be more customized than that! Have a look at the corpora we have already created in our MD Library; you might find a ready-made one that is just what you are looking for.
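The core idea of selecting matching data with a query corpus can be sketched in a few lines. This is not the TAUS Matching Data algorithm (which uses clustered search over a large repository); it is a hypothetical miniature using plain bag-of-words cosine similarity, with invented example segments, just to make the "score every candidate segment against a query-corpus profile and keep the best" step concrete.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def select_matching(query_corpus, data_pool, top_k=2):
    """Score every pool segment against the aggregate query-corpus
    profile and return the top_k best-matching segments."""
    profile = Counter(tok for seg in query_corpus for tok in seg.lower().split())
    scored = [(cosine(Counter(seg.lower().split()), profile), seg)
              for seg in data_pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]

# Invented miniature example: an IT help-desk query corpus, a mixed pool.
query = ["restart the server and check the error log",
         "update the driver to fix the printer error"]
pool = ["the patient showed symptoms of a viral infection",
        "check the server log for the driver error",
        "the recipe calls for two cups of flour"]

print(select_matching(query, pool, top_k=1))
# → ['check the server log for the driver error']
```

At production scale, the same principle is applied per segment over a repository of billions of words, with match scores used to guarantee the fidelity of the returned corpus.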

