Overcoming the challenges of machine translation for long-tail languages

2021-03-17 23:50 SDL blog

Even in the middle of a pandemic with serious effects on the economic climate, globalization shows no sign of letting up. On the contrary, international collaboration and joint partnerships are more important than ever. Just take the development and roll-out of successful vaccines as an example: none of this could happen without businesses and people working together across borders and language divides.

Global collaboration and communication matter for all businesses, and machine translation is playing an increasingly important role in enabling businesses to speak to their customers in their own language.

Of course, this also means communicating in languages that are considered long-tail or niche languages for machine translation. Long-tail languages are languages that are less frequently localized and have therefore not been the main focus for machine translation or post-editing.

Why is that? There are several reasons. Long-tail languages are a very diverse group, some with just a few thousand speakers and others with millions. But they also share characteristics that can block a successful MT strategy.

Low data resources

Leading commercial languages such as English, FIGS (French, Italian, German and Spanish), Dutch or Portuguese are frequently localized and have huge data resources which can be used for building or enhancing machine translation capabilities. Long-tail languages often have fewer data resources, or data of potentially lower quality, which can have a negative effect on machine translation quality.

Little Post-Editing experience

Throwing Post-Editing into the mix adds another layer of complexity. Freelance markets are often small and very conservative, with no previous exposure to MT or Post-Editing.

Lack of market awareness in the translation industry

Many of our global customers require fast and effective localization solutions for a large number of language pairs, including long-tail languages, but very few Language Service Providers have the know-how to penetrate these markets and successfully introduce machine translation and Post-Editing.

How do we address these challenges?

Understanding and acknowledging the challenges that long-tail languages pose for MTPE (Machine Translation Post-Editing) adoption in terms of technology, translation resources and local market know-how is the first step in building a successful strategy.

Deep MT experience is the second. NMT technology has already proven to be a game changer for previously very challenging languages such as Japanese or Russian and is now key to paving the way for long-tail languages. It has transformed our approach to developing direct models for a number of long-tail languages, producing tangible quality improvements confirmed by translators. And thanks to powerful enhancements such as any-to-any translations, where English is used as a pivot language behind the scenes, we can continue to expand our footprint for these languages. This is especially relevant in terms of regional commercial developments that are increasing the need for more specialist MT. A good example is the 2020 Regional Comprehensive Economic Partnership (RCEP) free trade agreement between 15 Asia-Pacific nations, which together account for about 30% of the world's population.

Our third building block is our Post-Editing expertise. SDL have been committed to sharing knowledge and spreading the word about machine translation right from the start, not only within our organization but also with our freelance community through training, collateral and our very own Post-Editing certification course. With NMT technology moving very quickly, our popular course is currently undergoing an update to include the latest developments as well as real-life examples from a wide range of languages.

And last but not least, we have a dedicated Global Language Office handling all external (primarily long-tail) languages on a 24/7 basis, offering specialist support to the rest of the organization and managing the all-important vendor relationships with a view to introducing MT in a sustainable fashion.
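
For readers curious about what "English as a pivot language behind the scenes" can look like in practice, here is a minimal Python sketch. It is not SDL's implementation: the language codes `xx` and `yy`, the word lists and the function names are all invented for illustration, and the toy word-for-word "models" stand in for real NMT engines. The idea it shows is simply that a direct model is used when one exists, and otherwise the text is translated into English and then out of English.

```python
from typing import Callable, Dict, Tuple

# A translator maps text to text. Real systems would call an NMT engine here.
Translator = Callable[[str], str]

PIVOT = "en"  # English as the pivot, as described in the article


def make_toy_model(lexicon: Dict[str, str]) -> Translator:
    """Build a word-for-word toy translator (hypothetical stand-in for NMT).

    Words missing from the lexicon pass through unchanged.
    """
    def translate(text: str) -> str:
        return " ".join(lexicon.get(word, word) for word in text.split())
    return translate


def translate_any_to_any(
    text: str,
    src: str,
    tgt: str,
    models: Dict[Tuple[str, str], Translator],
) -> str:
    """Translate src -> tgt, falling back to src -> pivot -> tgt."""
    if src == tgt:
        return text

    # Prefer a direct model when one has been built for this pair.
    direct = models.get((src, tgt))
    if direct is not None:
        return direct(text)

    # Otherwise chain two models through the pivot language.
    to_pivot = models.get((src, PIVOT))
    from_pivot = models.get((PIVOT, tgt))
    if to_pivot is None or from_pivot is None:
        raise KeyError(f"no direct or pivot path for {src} -> {tgt}")
    return from_pivot(to_pivot(text))


# Invented language codes and vocabularies, for illustration only.
MODELS: Dict[Tuple[str, str], Translator] = {
    ("xx", "en"): make_toy_model({"halo": "hello", "dunya": "world"}),
    ("en", "yy"): make_toy_model({"hello": "tere", "world": "maailm"}),
}

if __name__ == "__main__":
    # No direct xx -> yy model exists, so the pivot path is used.
    print(translate_any_to_any("halo dunya", "xx", "yy", MODELS))
    # prints: tere maailm
```

The trade-off in a design like this is that any error made on the way into English carries over to the second step, which is one reason direct models remain worth building where enough data exists.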