Going Beyond 100 Languages with Data

2020-11-19 04:00 TAUS

On 10-11 November 2020, TAUS held its first virtual Data Summit. On the agenda were presentations and conversations on the new “Language Data for AI” subsector, the impact of neural MT on data needs, collection methods and data services, massive web-crawling projects, “mindful” AI, and data ownership, privacy and copyright. An international audience of 100+ people came together to learn more about language data.

Language Data for AI

Earlier this month, TAUS published the Language Data for AI (LD4AI) Report and announced this as a new industry sub-sector. To analyze how language really became data and how that megatrend is starting to overhaul the professional language industry, we need to go back a few decades, Jaap van der Meer and Andrew Joscelyne said in their opening conversation with Şölen Aslan. And that is exactly what we have done in this new TAUS report, freely downloadable on the TAUS website: it offers context and perspective and assesses the opportunities and challenges for both buyers and new providers entering this industry. The main takeaways are:

- Language is core to AI
- Data-first paradigm shift
- Acceleration of change
- Rise of the new cultural professional
- New markets move faster

In the conversation with Aaron Schliem (Welocalize), Adam LaMontagne (RWS Moravia) and Satish Annapureddy (Microsoft), we dove a little deeper into a few of these takeaways. When it comes to data, Aaron says: so far our industry has used data to produce content or optimize content flows, but now we use data to interact more intelligently with human beings. How exactly we use data, and what we use it for, differs per stakeholder. This means that one of the takeaways (language is core to AI) will most likely mean something slightly different for everyone. It also extends beyond just language.

Ever since the rise of the machine, there has been talk of translators and language professionals fearing for their jobs. However, as Adam also confirmed, we are simply going to see a shift beyond professional translation services, towards linguistic data tagging, labeling and other data-related services and tasks. The “rise of the new cultural professional” means there is a great opportunity in our industry for a new kind of professional. Aaron says the changes in our industry open the door to bringing in different kinds of talent who want to get involved in building language solutions.

In addition to new skills, we are also looking into “new” (low-resource) languages. Both Aaron and Adam say that the demand for these low-resource languages is mostly for machine translation and for training the engines. How do you get this data? There are initiatives like the TAUS Human Language Project and the newly launched Data Marketplace, which create and sell datasets in various languages.

With all these new technologies, MT engines and platforms emerging so quickly, we need to make sure we adapt well to the digital systems and use them to our advantage. The end goal of all professionals working in the language industry, everyone agreed, is of course for every individual to be able to communicate seamlessly with everyone else, regardless of the language they speak.
Massively Multilingual

To talk more in-depth about the data-related changes in the MT field, we were joined by Orhan Firat (Google), Paco Guzmán (Facebook), Hany Hassan Awadalla (Microsoft) and Achim Ruopp (Polyglot Technologies). Massively Multilingual Machine Translation is a new buzzword in the MT world. Orhan explained what it means for current developments in three simple bullet points:

- Positive transfer: learning one language helps the learning of another. This boosts the quality of low-resource languages and enables the use of less data.
- Less supervision: massively multilingual systems can make better use of monolingual data.
- Need for less structured data: being massively multilingual allows these systems to use less structured data such as monolingual phrases, documents, etc.

The focus is shifting towards exploiting publicly available monolingual data to generate artificial data (a minimal code sketch of this idea follows at the end of this section). Bilingual parallel data is expensive and hard to obtain; its main role now comes up in the evaluation and benchmarking step. Paco says they are reaching the long tail of languages and it is important to evaluate the translations better.

Hany adds: you can now jumpstart a low-resource language translation system with minimal data. However, it is still far from any high-quality language, and this gap is not easy to close. Currently, MT only covers about 100+ languages. There are many more languages out there with fewer language resources but millions of speakers. What does it take to go beyond 100+ languages? Is there a playbook for defining the minimum amount of data (monolingual or bilingual) needed to run a system? Hany says that about 100K monolingual segments would be enough to get started, but beyond that it depends on the difficulty of the language and other similar variables. Paco adds that the answer may change for certain domains, genres and languages. Close languages such as Ukrainian-Belarusian are easier: if you have supervision for one of them, it will benefit the other. If they are farther apart in terms of language families, it will be harder. Beyond 1,000 languages, the only data you will have is Bibles or other religious texts (a very specific domain). If you have at least some, but very well-structured, parallel data (even under 1,000 segments), that is good enough. When you approach the tail, you face vocabulary and learning problems. Other important factors are the noise level, the data quality, the objective function you are optimizing, and so on. He concludes by saying that, when it comes to data, one is always better than zero.

Some institutions and governments fund research that tries to tackle low-resource languages, but we do not have enough open-source datasets, and this should be a shared responsibility. Breaking language barriers and connecting people is crucial for our business, Paco says. Orhan adds that it should be a collaborative effort. The presence of data is not always related to a language being out there on the internet; it can be the case that there is no online content in that language at all. We need to enable that community to create content online. If a person speaks a language that is on the verge of extinction but has no phone or internet, we should give them the means to go online and generate the content. For that, we need companies and governments to work collaboratively.
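To make the idea of a single massively multilingual model more concrete, here is a minimal sketch, not the speakers' production systems: it assumes the Hugging Face transformers library and Facebook's publicly released M2M-100 checkpoint (facebook/m2m100_418M), and shows one shared model translating a closely related low-resource pair (Belarusian to Ukrainian) while turning monolingual text into synthetic parallel segments.

```python
# Minimal sketch, assuming the Hugging Face transformers library and the
# public facebook/m2m100_418M checkpoint (not the systems discussed above).
# pip install transformers sentencepiece torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # one shared model covering ~100 languages
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Monolingual Belarusian sentences (placeholder examples).
monolingual_be = [
    "Добры дзень, свет!",
]

tokenizer.src_lang = "be"  # source language code
encoded = tokenizer(monolingual_be, return_tensors="pt", padding=True)

# Force the decoder to produce Ukrainian; the same shared model serves any
# of its supported directions, which is where positive transfer comes from.
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("uk")
)
synthetic_uk = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Pairing the monolingual input with the model output yields synthetic
# parallel segments created from monolingual text alone.
for be, uk in zip(monolingual_be, synthetic_uk):
    print(f"be: {be}\nuk: {uk}")
```

In practice, synthetic pairs produced this way would be filtered and mixed with whatever authentic parallel data exists before training or adapting a dedicated engine.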
When asked what role the new Data Marketplace can play in this massively multilingual world, Paco says it might help solve the benchmarking issue: if we can have datasets that are highly curated, they can become the standard. For high-quality languages, we are in an evaluation crisis phase, adds Orhan. The future will be all about less data but higher-quality data, even if it is monolingual, that is natural and not machine-translated, Orhan says. Hany adds that we need better data in more domains, and that speech-to-speech data is an interesting area on the way to 100 languages. Paco emphasizes high-quality evaluation data once more, along with revising processes for low-resource languages, finding what causes catastrophic mistakes, and sourcing data to solve those issues.

Mindful AI

The term “Mindful AI” stands for being thoughtful and purposeful about the intentions and emotions evoked by an artificially intelligent experience. A recent AI survey by Gartner found that people believe AI will transform businesses and will be part of future business and consumer applications. In reality, however, AI adoption has not lived up to its potential. The failure to embrace AI seems to be a human problem rather than a technology problem. In order to allow a machine to make decisions for us, we need to trust it to be fair, safe and reliable. We need to know the quality of the data and have transparency about how it is used. Generic models must be free of bias (gender, racial, political, religious), clean and balanced, and trained on large quantities of unprejudiced and diverse data. There are three key pillars to operationalizing AI in a mindful way:

- Human-centric: designing AI systems with end-to-end, human-in-the-loop integration across the AI solution lifecycle, from concept discovery and data collection to model testing, training and scaling
- Trustworthy: being transparent about the way the models are built and how they work
- Responsible: ensuring that AI systems are free of bias and grounded in ethics

AI localization at Pactera EDGE is defined as the cultural and linguistic adaptation of machine learning solutions through the injection of securely obtained and meticulously processed localized AI data, Ahmer Inam and Ilia Shifrin (Pactera EDGE) explained in their presentation. The majority of consumers nowadays expect personalized services. There is not sufficient data for such customization, which is why AI sometimes gets it wrong (inherits historical gender bias, etc.). Moreover, all AI models naturally become obsolete: production models can degrade in accuracy by 20% per year, which means they need to be fed new data continuously. Here is what they recommend to get started with AI localization:

- Build a strong foundation of professionally localized data on which to build AI models
- Find a reliable and scalable AI data partner
- Designate R&D resources to drive AI solutions

The Data Marketplace

In November 2020, TAUS launched the Language Data Marketplace, a project funded by the European Commission. The main objectives of the platform are to provide high-quality data for machine translation engines and to bridge the gap for low-resource languages and domains. The TAUS team presented the available features, such as data analysis, cleaning and smart price suggestions for data sellers, and an easy exploration flow that helps buyers identify the data they need (a generic cleaning sketch follows below). The team also shared a roadmap of upcoming functionality.
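As a rough illustration of the kind of cleaning typically applied to parallel segments, the sketch below applies a few common heuristics (deduplication, empty-segment removal, length and length-ratio checks). It is not the Data Marketplace's actual pipeline; the filter_pairs helper and its thresholds are hypothetical.

```python
# Generic illustration of basic parallel-data cleaning, not the Data
# Marketplace's actual pipeline. Helper name and thresholds are hypothetical.

def filter_pairs(pairs, max_len_ratio=3.0, max_tokens=250):
    """Keep (source, target) pairs that pass simple quality heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # drop pairs with an empty side
        if (src, tgt) in seen:
            continue  # drop exact duplicates
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # drop overly long segments
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_len_ratio:
            continue  # drop likely misaligned pairs
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    sample = [
        ("Hello world", "Прывітанне, свет"),
        ("Hello world", "Прывітанне, свет"),  # exact duplicate, dropped
        ("", "Пусты сегмент"),                # empty source, dropped
        ("Hi", "Гэта вельмі доўгі пераклад кароткага прывітання, хутчэй за ўсё памылковы"),  # bad length ratio, dropped
    ]
    print(filter_pairs(sample))  # only the first pair survives
```

Real pipelines typically add language identification, boilerplate detection and fuzzy deduplication on top of these basic checks.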
Some of the early adopters (data sellers) explained why the Data Marketplace is a great opportunity for them:

- Mikhail Gilin from TransLink explained that, as a large Russian LSP, they like the prospect of using this new and unique technology and its advanced NLP capabilities to sell the data they have created internally.
- Margarita Menyailova from EGO Translating Company recognized the need for data in order to develop new services and explained how the Data Marketplace is key in shortening the supplier-customer chain.
- Adéṣínà Ayẹni, a journalist with an ambition to bring the Yoruba language into the digital space, was excited about the opportunity the Data Marketplace gives him: to address the marginalization of African languages while earning a monetary reward.

Who Owns My Language Data

Questions around privacy and ownership of language data are becoming more pressing in this time of AI and machine learning. Wouter Seinen, Partner at Baker McKenzie and leader of the IP and Commercial Practice Group in Amsterdam, shared the highlights of the White Paper on ownership, privacy and best practices around language data, jointly published by Baker McKenzie and TAUS.

Use your common sense, was Wouter's main advice. Language as such cannot be owned, although the use of language data can fall within the realm of intellectual property (IP) rights. In the era of digitalization, reality is diverging from the law books of 20-30 years ago, as content is freely copied and shared online. The type of data in question here is functional, segment-based text, not highly creative content. Yes, there is a chance that there could be a name in the data that is subject to GDPR, or a unique set of words subject to IP rights, but in the specific niche where we operate, these issues are more likely to be an exception than a rule. Massive web-crawling projects (like Paracrawl) that proceed without written consent from the content owners are an issue purely from a legal perspective: in theory, copying someone else's text can only be done with permission. But then again, the Google caching program is basically copying the whole internet. Since this happens on a very large scale and the majority of people have no issue with it, there seems to be a shift arising from the discrepancy between what the law prescribes and what people are actually doing. Cleaning data does not move the needle of ownership much from an IP perspective, just as a translation made without permission is considered an infringement of the original. In light of this and other new data scenarios, the TAUS Data Terms of Use from 2008 were updated to cover the new use scenarios and are now available as the Data Marketplace Terms of Use.
