How Netflix Researchers Simplify Subtitles for Translation

2020-05-28 14:25 slator

As original productions of media entertainment content have come to a halt amid coronavirus lockdowns, streaming services have turned their attention to localizing back-catalog content into more languages.

With high levels of localization demand, even in times of lockdown, streaming providers such as Amazon Prime Video are increasingly active participants in the machine translation (MT) research space.

Streaming giant Netflix confirmed back in April 2019 that they had not yet rolled out MT for their subtitle operations, but said they were investigating the use of the technology.

Investigating they are: In May 2020, a paper published by a group of computer scientists at Netflix explored how to improve MT quality for low-resource languages, with the intended use likely to be in subtitles and meta-descriptions.

The paper, entitled “Simplify-then-Translate: Automatic Preprocessing for Black-Box Translation,” was published on pre-print platform arXiv on May 22, 2020.

The study is a collaboration between former Netflix Research Intern Sneha Mehta, former Engineering Manager Ballav Bihani, and current Netflix employees Bahareh Azarnoush (Data Science Manager), Boris Chen (Machine Learning Engineer), Vinith Misra (Artwork and Video Data Science Manager), Avneesh Saluja (Research Scientist), and Ritwik Kumar (Machine Learning Director).

Kumar’s LinkedIn profile provides a glimpse into wider MT-related research areas at Netflix, and lists a number of the team’s projects: deep learning for high-quality machine translations, predicting per-title language demand, and deep learning for text understanding such as customer complaint mining.

Azarnoush’s LinkedIn profile also outlines her mandate to “partner with localization experts to unleash the power of data to transcend language barriers and ensure the best local user experience at scale.” Her focus includes, among other things, “experimentation and causal inference to support localization decisions.”

Netflix’s Simplify-then-Translate paper brings together two natural language processing (NLP) disciplines: sentence simplification and machine translation. Sentence simplification is nothing new. As the paper points out, it was originally explored in the 1990s as a way to improve machine translation. The idea was that simpler source sentences lead to more fluent translations and “reduce technical post-editing effort.”

Netflix’s method relies on this premise and also leverages the notion that translated content is fundamentally simpler than original source content. By extension, they argued, back-translations are simpler than the original source sentences and can be used to build a simplification model. This is what is novel about Netflix’s approach.

First, Netflix took content previously translated by humans (reference translations) and used MT to back-translate it into the original source language, in this case English. From there, the researchers used the simpler, back-translated sentences to build a simplification model for English sentences. The simplification model, called an automatic pre-processing model or APP, would then be applied to any English source content prior to the machine translation step, to improve the resulting output.

Netflix’s flagship APP for English, the figsAPP, is built specifically to tackle tricky content such as idioms by replacing such expressions with a simplified alternative. Given that they focus on “conversational language as used in dialogues of TV shows, [which] tends to be colloquial and idiomatic,” Netflix judged that it was important to use reference translations from this domain.
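
The two steps described above (pairing original English lines with back-translated references, then simplifying before translating) can be sketched roughly as follows. This is a minimal illustration, not Netflix's implementation: the `Translator` signature, the function names, the toy lookup-table "model," and the example idiom are all assumptions made up for this sketch, and in the paper the simplifier is a trained model rather than a dictionary.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature for a black-box MT call: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def build_simplification_pairs(
    sources: List[str],
    references: List[str],
    ref_lang: str,
    translate: Translator,
) -> List[Tuple[str, str]]:
    """Pair each original English line with the back-translation of its human reference."""
    return [
        (src, translate(ref, ref_lang, "en"))
        for src, ref in zip(sources, references)
    ]


def simplify_then_translate(
    text: str,
    target_lang: str,
    simplify: Callable[[str], str],
    translate: Translator,
) -> str:
    """Apply the simplifier (the APP) first, then the black-box translator."""
    return translate(simplify(text), "en", target_lang)


# --- Toy stand-ins so the sketch runs; none of this is real data or a real MT system ---
_TOY_TABLE: Dict[Tuple[str, str, str], str] = {
    ("Il pleut beaucoup.", "fr", "en"): "It is raining a lot.",
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    # Return a canned translation if one exists, otherwise tag the text with the target language.
    return _TOY_TABLE.get((text, src, tgt), f"[{tgt}] {text}")


pairs = build_simplification_pairs(
    ["It's raining cats and dogs."], ["Il pleut beaucoup."], "fr", toy_translate
)

# A lookup table stands in for the trained simplification model.
toy_app = dict(pairs)


def simplify(sentence: str) -> str:
    return toy_app.get(sentence, sentence)


result = simplify_then_translate(
    "It's raining cats and dogs.", "hu", simplify, toy_translate
)
print(pairs)
print(result)
```

The point of the sketch is the ordering: the translator is only ever called as a black box, once to create training pairs and once on the simplified input, which is why the approach can sit in front of any third-party MT system.
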

Suitably, Netflix used entertainment content in high-resource languages to build the figsAPP, employing French, Italian, German, and Spanish (FIGS) reference translations for a number of titles including “How to Get Away with Murder,” “Star Trek: Deep Space Nine,” and “Full Metal Alchemist.”

To conduct their experiments, Netflix used a “black-box” machine translation system, Google Translate. To compare the figsAPP against an APP built from an out-of-domain simplification dataset, Netflix machine-translated simplified content into seven low-resource languages: Hungarian, Ukrainian, Czech, Romanian, Bulgarian, Hindi, and Malay.

Source content that had been simplified with the figsAPP resulted in better quality translations in all seven languages, compared to translations resulting from non-simplified, original source content. Source content pre-processed with the out-of-domain APP performed significantly worse than the original, confirming Netflix’s hypothesis that using domain-specific content improves the performance of the APP.

Netflix also looked at the Translation Edit Rate (TER), and found that using figsAPP-treated source content improved edit distance by between 1.3% and 7.3% for the seven languages tested. This is “intuitive,” Netflix said, “because the APP simplification brings the sentences closer to their literal human translation.”

The researchers also used humans to evaluate the quality of a sample of the translations resulting from figsAPP-treated source content for five of the seven low-resource languages. Here, too, Netflix found that, at least for three languages, figsAPP-treatment resulted in improved translation output.

Although English source content is Netflix’s primary focus for the purposes of the research, APPs can also be built in any language for which enough corresponding reference translations exist.
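
For readers unfamiliar with the TER figure cited above, the sketch below shows the basic idea: count the word-level edits needed to turn an MT output into the human reference, then divide by the reference length, so lower is better. It is a simplified approximation under stated assumptions: full TER also counts block shifts (moving a phrase) as single edits, which this version omits, and the function names and example sentences are invented for illustration rather than taken from the paper.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis: str, reference: str) -> float:
    """Edits divided by reference length; omits the shift operation of full TER."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


# One missing word against a six-word reference.
score = simplified_ter("the cat sat on mat", "the cat sat on the mat")
print(round(score, 3))
```

Under this definition, a simplified source sentence helps when it nudges the MT output toward the wording a human translator chose, which is the intuition Netflix gives for the improvement.
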