Data-Enhanced Machine Translation

2021-12-02 22:25 TAUS

This is the third article in my series on Translation Economics of the 2020s. In the first article, published in Multilingual, I sketched from an economic perspective how technological breakthroughs have driven the evolution of the translation industry. In the second article, Reconfiguring the Translation Ecosystem, I laid out the emerging new business models and ended with the observation that new, smarter models still need to be invented. This is where I will now pick up the thread and introduce you to the next logical translation solution. I call it Data-Enhanced Machine Translation.

Machine Translation technology has brought the promise of a world without language barriers. The need for translation is astronomical: the total output from MT platforms is tens of thousands of times bigger than the translation production of all professional translators in the world combined. And yet, given the current state and use of the technology, we are still looking at a half-baked solution. What’s wrong? Well, the quality of course, and more specifically the coverage of domains and languages. Imagine where we would be if we could mobilize the technology to handle all languages and domains equally well.

Let’s zoom out here for a moment and put ourselves in the shoes of an average netizen somewhere in the world. The 4.6 billion people who live and work online on our planet look at translation as a utility: translation out of the wall, not much different from electricity or the internet. It’s always there, and if it isn’t, well, that’s most inconvenient of course. The quality may not always be great, but it will work better next time, right? Remember the poor quality of Skype calls in the early days, or the trouble we had connecting to the internet by dialing in through our computers. We accepted all of these inconveniences because we anticipated that the technology would catch up and improve.

So do millions of small, medium and large businesses around the world. Translation is a utility they purchase from their cloud provider and pay for through monthly billing, along with hosting costs for their websites and other IT services. If the translation ‘signal’ is not available for a particular country, then so be it. They have to live with the disconnect until the ‘coverage’ is available. The alternative of a full human translation is not an option, because it may take a week, a month or more before the translation is finished, and that simply does not match the fast pace of global business these days. Not to speak of the cost of human translation: probably 500 times higher than what the company pays for using the MT service from the cloud provider.

So what can we do to bridge the gap: boost the quality and expand the coverage of the translation ‘signal’ for the billions of end-users and millions of business users? The sobering truth is that the biggest shortcoming is not so much in the technology but in the way we go about it. We need an innovation ecosystem.

But the best we are getting at the moment is a translation industry making compromises: the new MT technology is molded into existing business models and processes. If you can’t beat ’em, eat ’em, seems to be the mantra of some of the most forward-thinking translation platforms, judging from the sheer number of MT engines they claim to integrate with. What they achieve is a productivity gain resulting in a price reduction for the customer: ten to twenty percent faster and cheaper every year. At the core, though, nothing is changing. Except that in this race to the bottom they are dragging along the professional translators and giving them ever more dumb post-editing work to do.

The compromises made in the translation industry come, on the one hand, from a long tradition of craftsmanship and, on the other, from the principle that a translation is a service that is always commissioned and paid for by a customer. The post-editing model, or ‘human-in-the-loop’ as it is more neatly referred to, is in a way a half-baked solution, a compromise. These compromises stop us from seeing the bigger picture and bigger opportunities. What if these traditions and principles no longer count under the new economics?

As we have seen at other ruptures in economic history, it is out-of-the-box thinking that will help us reap the full benefits of a technology breakthrough. Rethinking everything, a tabula rasa kind of approach to the definitions of markets and products, is the key to innovation.

The translation industry, I believe, has the mission to help the world communicate better. The world is our market. Operators in the translation industry must see beyond the inner circle of customers that they are serving today and help every business, small and large, become truly world-ready. The challenges of practically real-time quality translation and utility-based pricing are waiting to be solved by innovative thinkers from the core language industries. Others won’t do it or can’t do it. The tech companies bring the translation utility only so far. They can’t close the quality gap, or don’t want to, because they can’t excel in everyone’s field of expertise and style preferences. The reward for solving the problems is a market that is thousands of times bigger, at least in volumes of output, than the translation industry today.

To deliver on our mission to help the world communicate better and support every business in becoming truly world-ready, we need to really think of translation as a product, rather than a creative service. There is no other way we can scale up and deliver in real time. And as we can see all around us, translation is already a utility, a feature or a product on the internet. The central problem is that it’s just not good enough. To make it better, the professionals in the language industry need to look under the hood, analyze the product and figure out how it works. They will then discover that MT is not a magic black box, and realize that the difference between good and bad or not-so-good MT lies in the data that we put into the engine.

Think of MT as tasteless instant food: the product of a recipe (the algorithms) and ingredients (the data). To produce really good, tasty food we need to get the best ingredients. Yes, we can perhaps tweak the recipe a bit here and there, but the biggest gain is in the quality of the ingredients. Putting it this way, we can say that the data, and not the translation process, is in fact the product. There is a growing awareness in the MT and AI industries overall that too much emphasis is being put on the models and that the more important data work is undervalued.

So now that we have redefined the market and agreed that it is so much bigger than what we consider our market today, and now that we have redefined translation as essentially a data product, the question remains: how do we turn this into a viable business, and how do we scale?

There is a huge need for translation and it is continuing to grow. Yet the population of professional translators is not infinite. MT technology, as we have seen, helps to increase productivity. But is post-editing MT the best use of our human capital? The core competencies of the translation industry are of course the skills of the people and their deep knowledge of vernaculars. If we want to optimize the utilization of this human capital, we are much better off investing in data products that we can sell multiple times.

Knowing how to assemble high-quality data in their domain is the professional competitive advantage of the people working in the translation industry. For example, an English to French professional translator with subject matter expertise in fishery and maritime law can alone meet the needs of a hundred different lawyers who may now be engaging with customers on both sides of the Channel over fishing rights after Brexit. Or think of Mohamed Alkhateeb, the Syrian medical doctor and translator who uploaded his English to Arabic medical translation memories to the TAUS Data Marketplace, helped to boost the quality of Systran’s medical engine for the Arabic markets and created a good income for himself.

The final piece of the puzzle is how we bring demand and supply together in actual practice. How can specialized translators and language service providers operating in a niche market find as many customers as possible for their translation or data products?

In my previous article, Reconfiguring the Translation Ecosystem, I introduced marketplaces and collaborative platforms as the new sharing models that would best support innovation in the translation ecosystem. As translation is now becoming another form of AI, we see that the big AI marketplaces are gradually starting to include data for translation. We are also seeing the emergence of more specialized aggregation platforms such as aiXplain, Systran’s Marketplace, Hugging Face, and of course the TAUS Data Marketplace. All of these sharing platforms will push the one-to-many business model and help the translation industry to scale up and eventually build bridges to the millions of new business customers waiting for the translation ‘signal’ and better quality.

The next logical translation solution, therefore, is Data-Enhanced Machine Translation. This is the premium quality level of real-time translation, perhaps not as good as human translation quality or transcreation, but good enough for 90% of all use cases. The crucial fact is that it’s a product, a feature. Supply is driven by demand. An Azerbaijani translator specialized in VAT regulations may be the sole provider of translation data in this niche, whereas French e-commerce specialists must share customers with many others. Dashboards on collaborative platforms will help operators in the translation industry allocate their resources to where they can make the biggest difference. This innovation wave may make the translation industry more transparent and create a more level playing field for all providers.

Data sellers: go to TAUS Data Marketplace to check the value of your data.

Translation buyers: build your own domain-specific dataset with the Matching Data feature on the TAUS Data Marketplace (Coming Soon).

See our blog article on data cascades and check out Andrew Ng’s plea for data-centric AI.