Data and AI Trends in 2022 - Translation Technology Express

Technologies such as natural language processing (NLP), deep learning, and computer vision have thrived since data science became well established as a field of study and expertise. These developments have paved the way for machine learning (ML) as a route to realizing artificial intelligence (AI). The transformative effects of these technologies continue to show up in our daily lives at a steadily increasing pace as we move into 2022. The importance of data, and of knowing how to work with it, is becoming a common theme, which in turn makes data science more accessible.

Just a decade ago, data and AI were niche concepts that were only vaguely understood. Now the methodologies of data and the science behind it are well established, and excellence in these areas is what most companies strive for. Despite this growing understanding and popularity, research still shows that “everyone wants to do the model work, not the data work”. This leaves room for companies such as TAUS to emerge as pure data and data services companies that provide tailored, high-quality data for AI solutions.

Each year we traditionally present an outlook on which trends in language data for AI are likely to rise in the coming year. It’s time to look at what is in store for 2022 and list the trends to keep an eye on. The trends below are based on the Data Trends 2022 panel discussion at the TAUS Virtual Data Summit 2021, featuring Matthew Jackowski (Localization Director at eBay), Jonas Ryberg (Chief Globalization Officer at PacteraEDGE), and Watson Srivathsan (Product Manager at Amazon AI).

1. Expansion of Multilingual AI Data and Models

One of the main goals of machine translation researchers and developers has long been to create a single model that supports all languages, dialects, and modalities. These efforts have resulted in the emergence of multilingual models.
Multilingual models rest on a common approach: text phrases from all languages are embedded in the same vector space. In other words, input in any language is mapped to a language-agnostic vector, so that phrases with the same or similar meaning land in the same region of the space regardless of the language they were written in. Research in this area has been ongoing for a while, and new, highly functional multilingual AI models are expected to trend in the near future.

In pursuing this goal, multilingual training datasets play a crucial role and can therefore be listed as a trend in their own right. Because bilingual and monolingual datasets are more common and easier to access, the generation of multilingual domain-specific datasets will be a trending topic of conversation, especially as most brands now strive to communicate with a much more global audience in an ever more local manner. Community-based platforms such as the TAUS HLP Platform, where tailored datasets are generated and annotated by a specially formed community of contributors, can be used to obtain multilingual training datasets.

2. More Companies Joining the Data Market

Language service providers (LSPs), individual translators, publishing companies, and many other players in similar sectors have now discovered that they hold the oil that makes machine learning engines run smoothly. The texts they have collected, generated, or processed over the years can now be used as training datasets and, in this way, turned into business assets. With the emergence of marketplaces for almost any type of data, from language data to geodata or marketing data, these new companies gain direct access to the market. Along with this growing awareness, many success stories are emerging. However, it’s good to note that having the data is only the initial step.
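To give a rough sense of why holding the texts is only a first step, here is a minimal Python sketch of the kind of work that stands between stored segment pairs and a usable training dataset. It is an illustration under stated assumptions, not a description of how any particular marketplace processes uploads: the function name `prepare_pairs`, the `<EMAIL>` placeholder, the email pattern, and the sample sentences are all hypothetical, and real anonymization has to cover far more than email addresses.

```python
import re

# Hypothetical pattern for this sketch only; it catches simple email
# addresses and nothing else (no names, phone numbers, or IDs).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def prepare_pairs(raw_pairs):
    """Turn raw (source, target) segment pairs into a cleaner list of records."""
    seen = set()
    cleaned = []
    for source, target in raw_pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        # Skip pairs where either side is empty.
        if not source or not target:
            continue
        # Mask email addresses with a placeholder token.
        source = EMAIL_RE.sub("<EMAIL>", source)
        target = EMAIL_RE.sub("<EMAIL>", target)
        # Drop exact duplicates (after normalization and masking).
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"source": source, "target": target})
    return cleaned


# Invented sample data standing in for segments collected over the years.
raw = [
    ("Contact  us at help@example.com.", "Contactez-nous à help@example.com."),
    ("Contact us at help@example.com.", "Contactez-nous à help@example.com."),
    ("Thank you!", ""),
    ("Your order has shipped.", "Votre commande a été expédiée."),
]

dataset = prepare_pairs(raw)
for row in dataset:
    print(row)
```

Of the four raw pairs, only two survive: the second is a duplicate of the first once whitespace is normalized, and the third has no target text. Even this toy version shows that cleaning decisions (what counts as a duplicate, what must be masked) shape the dataset that a model eventually sees.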
Pre-processing the data to mold it into a usable format for AI training is the key element. Because most newcomers do not really know how to handle or prepare data for AI training, marketplaces such as the TAUS Data Marketplace offer cleaning and anonymization services for every uploaded dataset.

3. Data Diversity

Artificial intelligence needs data diversity. The topic has two sides: diversity in the types of data, and diversity that eliminates bias in data.

Training datasets come in many forms: text, speech, image, multimedia, and so forth. Text is more common than any other form of dataset available on the market. However, as user habits change in highly digitalized daily environments (and attention spans shrink with them), voice messages, voice commands, images, and memes are becoming the preferred means of communication, both online and offline. Catering to these needs requires AI-supported services; voice assistants and voice-to-text features on mobile phones are good examples. Improving these systems takes more and more speech data, and as the services reach a more global market, the voice data must cover a more diverse set of languages.

To provide these services to everyone, regardless of gender, origin, age, race, or physical restrictions such as speech impediments, diverse voice datasets should be fed into the AI systems. This is key to providing a bias-free experience for all users. However, voice data is harder to find than text data. Even when it can be found, finding it with the needed diversity of languages, dialects, personas, conversation types, and domains remains highly challenging. This can be overcome with custom datasets created by a specially formed group of people, taking all of the points above into account. Such services are offered through platforms like the TAUS HLP Platform.

4. Lifelong Learning Machines

A new branch of AI approaches, called lifelong learning machines, is being designed to pull and feed data into AI systems continually and indefinitely. A lifelong learning system can be defined as a model that efficiently and effectively retains the knowledge it has learned from other tasks and selectively transfers it to the learning of new tasks. In basic terms, lifelong learning machines defy forgetting in classification tasks. Researchers at Western University, Canada, present this mechanism in their paper A Deep Learning Framework for Lifelong Machine Learning.

Generally, the concept of lifelong learning is concerned with developing techniques and architectures that enable machine learning models to learn sequentially without the need to retrain from scratch. Chatbots and production lines are practical examples of the scope of solutions that can be built with lifelong learning methods. Lifelong learning is still a fairly new topic, and more research and development is likely in 2022 and beyond.

Conclusion

The future is definitely data-centric, more so than ever. But data on its own cannot do the magic. It seems we will be seeing more data enhanced with professional pre-processing; data that is multilingual; data that is diverse in representation and format; and systems that learn on their own. In addition, more providers of all of these types of data will appear. It’s a given that in 2022 a growing number of exciting data science applications and research projects will emerge from the convergence of these transformative technologies and concepts, which augment and complement each other.

