The value of language data


2022-11-16 12:00 ELRC (European Language Resource Coordination)


Since the COVID-19 pandemic, the importance of language-centric AI has increased significantly, and not only in Europe, as Language Technologies (LT) provided valuable tools and services to facilitate, and in many cases actually enable, the exchange of information. These significant changes in the way we work contributed to new trends and to greater availability and uptake of language-centric AI in general. Today, the increased use of LT is no longer limited to Machine Translation: more and more organisations have recognised the usefulness of LT tools such as fake news detection, anonymisation, speech recognition and text-to-speech, to name just a few, in facilitating their daily operations.

Language Data Management and Sharing

For all LT applications, language data plays a crucial role. This is even more true given the exponential growth of digital communication platforms, which in turn increases the need for more efficient and reliable LT. Organisations, however, can only collect the amount of language data required to develop competitive language-centric AI if they invest considerable effort, in terms of both time and resources. For this reason, data sharing is increasingly considered the best way towards truly sustainable language data management. Nonetheless, in many EU countries the sharing of language data is still not common practice, even though large volumes of data are produced daily in public administrations, research and industry.

During our investigation for the 2022 edition of the ELRC White Paper, we tried to find out more about language data management and sharing in European public administrations (PA) and SMEs.

According to the results of our 2022 survey, the value of language data is increasingly recognised all over Europe (see Figure 1): 17 of the ELRC National Anchor Points (NAPs) stated that their organisations store language data whenever possible (dark blue), while only 4 indicated that this is still hardly ever or never the case (yellow). In the remaining countries/organisations, language data is stored at least sometimes.

Similarly, the majority of the external survey contributors indicated that language data is stored whenever possible in their organisations (59%), but the share of those who indicated that they hardly ever or never store such data is not negligible (19%).

Despite the encouraging results, this confirms that awareness-raising efforts must be maintained, including on the part of European governments. This is also reflected by the fact that, although 17 NAPs know that language data is explicitly mentioned in the AI regulations of their countries, only 4 are aware of a corresponding strategic or financial plan. Moreover, 6 stated that language data is mentioned only as a side note (e.g., as a useful example of AI), while 5 indicated that language data is not mentioned at all.

However, this doesn’t necessarily mean that the topic is completely disregarded by a country’s AI regulations. For example, the external survey contributors from Spain gave partly divergent answers, and the largest share of them (34%) openly stated that they don’t know whether language data is mentioned in the national AI regulations. So it is likely that there is simply a lack of communication and information in this respect.

In fact, we could already find numerous best-practice examples of language data management in European AI regulations. To name just a few:

The Norwegian strategy includes a full chapter on LT and language data, which highlights the crucial importance of language resources, especially for NLP systems targeting less-resourced languages such as the Sami languages.

The Spanish AI Strategy lists boosting the National LT Plan and creating resources in the Spanish language among its action items.

In Ireland, the value of language data is publicised: one of the strategy’s action items is to move away from US-based language data and to use sources that include the everyday language used by Irish citizens. In addition, the development of language resources for Irish is mentioned as one of the key enablers for providing digital services in Irish.

Such developments confirm that the value of language data has increased significantly and will continue to increase in Europe, within public administrations and organisations as well as in national regulations.

The 2022 edition of the ELRC White Paper is now available for download.
