The long-expected technological revolution is here. Automatic translation is no longer just a freebie on the internet. It is now entering the ‘real’ economy of the translation sector, and it is changing everything.
1. A Short History of the Translation Industry
Over a period of four decades, the translation sector has gone through regular shifts of adaptation, instigated by changes in the business and technological environment (see image below).
However impressive the journey has been so far, nothing compares to what is still to come: the Singularity. In this new phase, technology essentially takes over completely. The human translator is no longer needed in the process. Google and Microsoft alluded to this future state when they claimed that their MT engines translated as well as human professional translators. However, this has led to heated debates both in academic and professional circles about what this so-called human parity really means.
2. The Rise of Zero-Cost Translation
The global translation industry now finds itself in a ‘mixed economy’: on one side the traditional vertical, cascaded supply chain, on the other the new flat ‘free machines’ model. The speed at which the machines improve when fed with the right quality and volume of data makes translation a near-zero-marginal-cost business (in the spirit of Jeremy Rifkin). This means that once the right infrastructure is in place, producing a new translation costs next to nothing and capacity becomes virtually infinite.
As long as the translation industry is locked into a vertical, labor-based cost model, how realistic is it to think that we can simply add more capacity and skills to the existing economic model and still generate that global business impact?
For operators in the translation industry to follow the trend and transition to the new free machines model, they need to consider fundamental changes in the economics of their business: be prepared to break down existing structures, adopt new behaviors of sharing and collaboration, reduce the need for human tasks and activities, and advance the technology. Under the new economic model the concepts of linguistic quality, translation memories, and word rates will lose their meaning. Instead, we will talk about global business impact, data and models, and value-based pricing.
3. Kill Your Business Before It Kills You
In 2019 Google alone translated 300 trillion words, compared to an estimated 200 billion words translated by the professional translation industry. Add other big players such as Microsoft Bing Translator, Yandex MT, Alibaba, Tencent, Amazon and Apple, and the total output of MT engines is probably already ten thousand times larger than the overall production capacity of all professional translators on the planet.
Until only two or three years ago, even as the first Neural MT success stories were settling in, human professional translation and MT existed in two parallel worlds. Even inside Google and Microsoft, the product localization divisions didn’t make use of their company’s own MT engines. But that has changed now. MT is integrated into almost every translation tool and workflow.
The question for operators in the translation industry, therefore, is whether the two processes continue to co-exist, or whether MT will completely wash away the old business. The increasing pressure that LSPs are feeling already pushes them to offer various data or transcreation services or to start building their own MT systems and services. Gartner, in their recent research report, reckons that by 2025 enterprises will see 75% of the work of translators shift from creating translations to reviewing and editing machine translation output.
The takeaway for language service providers is that ignoring AI and MT is not an option. To grow the business they need to get out of their localization niche and use data and technology to scale up and expand into new services.
4. Buying the Best MT Engine
The question is asked time and again: which MT engine is the very best for language A or for domain B? Since MT developers use the same frameworks and models, such as Marian, BERT or OpenNMT, shared under open-source licenses on GitHub, the answer to all of these questions is that the “best” out-of-the-box MT engine does not exist. MT is not static: the models are constantly being improved, and the output of the machine depends on the data used to train and customize the models. It is a constant process of tuning and measuring the results.
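To make that concrete, here is a minimal sketch of how one of these open-source engines can be queried out of the box. It assumes the Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-de Marian checkpoint, both illustrative choices rather than a recommendation; the point is that the code itself is trivial, and what differentiates engines is the data used to train and tune them.

```python
# Minimal sketch: querying an off-the-shelf Marian-based engine.
# Assumes the Hugging Face "transformers" package and the public
# Helsinki-NLP/opus-mt-en-de checkpoint (illustrative choices only).
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = ["The translation industry is changing rapidly."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```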
For LSPs, it is much more important to have an easy way to customize MT engines with their own high-quality language data. Some of the disruptive innovators in the translation industry have implemented a real-time, dynamically adaptive MT process and show how easy this can be with “predictive translation”: the engine learns almost immediately from the edits made by human translators. This real-time, adaptive MT is only available in a closed software-as-a-service offering, which is understandable given the immediate data-feedback loop required to optimize the speed and success of the learning process.
Language companies that need more flexibility and control over the technology should build and customize their own end-to-end solution. Their main challenge is to set up a diligent pipeline for data preparation, training and measurement. Their language operations then become a data-driven solution.
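As an illustration of the measurement leg of such a pipeline, the sketch below scores an engine's output against human references with the sacreBLEU package. The sentences are invented examples; a real evaluation would use a held-out, in-domain test set of thousands of segments per language pair, tracked over time as the models are retrained.

```python
# Minimal sketch of the "measure" step: scoring engine output against
# human references with sacreBLEU. Assumes the sacrebleu package; the
# sentences below are invented, illustrative examples.
import sacrebleu

# System output from the engine under test.
hypotheses = [
    "Die Pumpe muss jedes Jahr gewartet werden.",
    "Der Filter ist monatlich zu reinigen.",
]
# One reference stream, parallel to the hypotheses.
references = [[
    "Die Pumpe muss jährlich gewartet werden.",
    "Der Filter muss monatlich gereinigt werden.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```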
5. No Data, No Future
Over the past thirty years, the translation industry has accumulated a massive amount of text in source and target languages, stored in databases referred to as translation memories. However, these do not always make the best training data. Translation memories are often poorly maintained over time; they may be too specific or too repetitive, or contain names and attributes that can confuse MT engines.
To optimize the output quality of machine translation, the language data needs to be of the highest possible quality. Data cleaning and corpus preparation should include steps like deduplication, tokenization, anonymization, alignment checks, named entity tagging and more. To ensure that the language data used for customization of MT engines is on topic, more advanced techniques can be used to select and cluster data to match the domain.
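The sketch below illustrates, in simplified form, a few of the steps named above: deduplication, a crude alignment check based on length ratios, and anonymization of email addresses. The segment pairs are invented, and a production pipeline would add tokenization, named entity tagging and domain selection on top.

```python
# Simplified sketch of a few corpus-cleaning steps on invented data:
# deduplication, a crude length-ratio alignment check, and email
# anonymization. Real pipelines add tokenization, NER and domain filters.
import re

segment_pairs = [
    ("Contact us at sales@example.com.", "Kontaktieren Sie uns unter sales@example.com."),
    ("The pump must be serviced yearly.", "Die Pumpe muss jährlich gewartet werden."),
    ("The pump must be serviced yearly.", "Die Pumpe muss jährlich gewartet werden."),  # duplicate
    ("OK", "Die vollständige Bedienungsanleitung finden Sie im Anhang dieses Dokuments."),  # misaligned
]

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def clean(pairs):
    seen, kept = set(), []
    for src, tgt in pairs:
        # Deduplication: skip exact repeats of the same pair.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        # Crude alignment check: wildly different lengths suggest a bad pair.
        ratio = len(src) / max(len(tgt), 1)
        if ratio < 0.3 or ratio > 3.0:
            continue
        # Anonymization: mask email addresses on both sides.
        kept.append((EMAIL.sub("<EMAIL>", src), EMAIL.sub("<EMAIL>", tgt)))
    return kept

for src, tgt in clean(segment_pairs):
    print(src, "|||", tgt)
```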
Even if you decide to outsource most of the data-related activities, your business will need new skills, new talent and a new organizational structure to get ahead in the new AI-enabled translation space.
6. Who Owns My Language Data?
Convinced as they may be that the future lies in taking control over the data, many owners of agencies, as well as translation buyers, still hesitate to move forward because they are in doubt about their legal rights to the data. There is a strong feeling across the translation industry that translations are copyright-protected and can never be used to train systems. If uncertainty over ownership of data is a factor that slows down innovation, it is time to get more clarity on this matter.
In the Who Owns My Language Data White Paper, Baker McKenzie and TAUS address important questions regarding the privacy and copyright of language datasets, individual segments, GDPR and international rulings, and more. The white paper functions as a blueprint for the global translation industry. One important point to highlight is that copyright is intended to apply to complete works, or parts of works, rather than to individual segments. Since MT developers normally use datasets that consist of randomly collected segments to train their engines, the chances of a copyright clash are minimal.
Copyright on language data is complex and involves multiple stakeholders and many exceptions. Customers expect vendors to use the best tools and resources available, and today that means using MT and data to customize the engines. To our knowledge, there is no precedent of a lawsuit over the use of translation memories for the training of MT engines, and the risk of being penalized is negligible. But in case of doubt, you can always consult your stakeholders about the use of the data.
7. Breaking the Data Monopolies
If there is no future in translation without access to data, it is in the interest of all language service providers, their customers and translators in the world to break the monopolies on language data. Right now, a handful of big tech companies and a few dozen large language service providers have taken control of the most precious resource of the new AI-driven translation economy. A more circular, sharing and cooperative economic model would fit better into our modern way of working.
One solution is to unbundle the offering tied up in the AI-driven translation solution and to recognize the value of all the different contributors:
Hosting the powerful, scalable infrastructure that can support the ever-growing AI systems is a task that can only be managed by the largest companies.
Customizing the models for specific domains and languages is a specialized service that may best be left to service companies that have expertise in these fields and are capable of adding value through their offering.
And since the best-quality training data is vital for everyone, why not let the translators and linguistic reviewers who produce this data take full responsibility for it and earn money every time an engine is trained with their data?
The process of creative destruction is now in full swing and may lead to a redesign of our entire ecosystem. The first marketplaces to enable this new dispensation are now out there: SYSTRAN launched a marketplace that allows service providers to train and trade translation models, while TAUS launched a data marketplace that allows stakeholders in the translation industry to monetize their language data. These first steps should lead to healthy debate across the industry as we feel the shock waves of an industry reconfiguration driven by radical digitization, human reskilling, and exponential data intelligence.
** The long version of this article has been published in the July/August 2021 issue of Multilingual magazine. This shorter version was composed by Anne-Maj van der Meer.