A Win-Win Situation


2020-12-29 23:32 TAUS



AI systems are becoming a global trend. Businesses around the world are starting to explore how these systems can benefit them and their customers, but AI is not yet at the stage where it can simply be plugged in and expected to work. These systems require an immense amount of data and training to produce the desired outputs. "Data is the new oil" has therefore become something of a buzz phrase: those who possess data will be the next generation of power holders.

In the language industry, language service providers (LSPs) are often the ones with access to great quantities of language data. Yet only those that know how to turn that data into an actionable business will hold the key to unleashing their potential in the data space.

Based in St. Petersburg, EGO Translating Company is one of those. The company was founded in 1990; 80% of its services focus on translation and interpretation, with the rest focusing on related technologies and platforms. To produce its own technology solutions, it has opened a technology branch that includes an MT department, which collects and cleans language data to feed back into the company's MT systems.

We spoke with Margarita Menyaylova, Head of Machine Translation Division, and Evgenia Gorodetskaya, Vice President of Technology Development. In their quest for more language data, they came across the TAUS Data Marketplace. They have now published about half a million words of English-Russian language data in manufacturing and related domains. So how did their quest for more data turn into sharing their own data for others to purchase and use to improve their systems?

“Why can’t we also share the data we accumulated over time with other data consumers?” asks Margarita. “We are aware of the importance of language data abundance for the growth of ML systems. And we also know how important and challenging data cleaning is. The fact that the Data Marketplace allows data sellers to buy back a cleaned and anonymized version of their dataset has been the greatest motivator for us.” She sees this as a win-win situation: the marketplace is enriched with more data for the growth of the industry, while LSPs can download a clean and anonymized version of the data they have just uploaded.

The hardest nut to crack for many LSPs is the issue of language data ownership. Who owns the data they process? Is it the translator who provides the translated text? Or the customer who provided the source text? These questions, and more, were significant for EGO Translating Company as well. “We had a discussion around this with our colleagues and business partners and what convinced us were the answers we found in the Who Owns My Language Data White Paper,” says Margarita.

The key takeaway from the white paper is that existing laws do not provide black-and-white answers to questions that appear simple and straightforward. Seeking clarity about what data you process, and for whom, is an important first step for any organization. Common sense, combined with some essential rules of thumb, will help you get to grips with legal compliance. “It’s also good to note that different data processing and privacy rules apply in Russia and other parts of the world. We have milder laws and regulations regarding data ownership and intellectual property,” adds Evgenia.

EGO Translating Company is sure that other LSPs will be just as excited about this opportunity, yet will share similar concerns around data privacy. Margarita and Evgenia also emphasize the importance of analyzing each provider-client agreement separately to decide what data can be monetized safely. Both are confident that, in the end, most LSPs will catch up with the new reality of the industry and become more willing to share their language data.

“A similar approach is widely used in other industries, such as the IT industry. They freely share their codes, APIs and program solutions as open source. This ecosystem of sharing turns out to be effective for both sides,” says Margarita. “Those who will manage to align with realities of the translation industry will also win in the end. As LSPs, we have more business potential and assets to monetize than just simply translation.”

Their datasets are available for purchase on the Data Marketplace, where AI and ML service providers can use them to train their systems with Russian-English data. In the meantime, EGO Translating Company continues to improve its own MT systems with the cleaned and anonymized versions of its own datasets that TAUS provides.

