Data collection powered by ELRC: More than 80 million TUs to support the development of language-centric AI

2021-07-21 12:00 ELRC – European Language Resource Coordination

Language-centric Artificial Intelligence applications are popping up all around us in our everyday lives. Voice assistants, TV show recommenders and automated translation services, to name a few, are used more and more. The development of such powerful applications is fuelled by three key factors: algorithms, computing power, and data. Nowadays, well-established deep learning algorithms are available and can be implemented via open-source machine learning libraries. In parallel, hardware IT infrastructures are continuously extended in terms of computing power and big-data storage. The third factor concerns the acquisition of sizable data appropriate for the problem at hand (e.g. millions of translated sentence pairs in various language combinations for training Machine Translation engines).

With the goal of enabling multilingual communication across Europe, the EC provides access to a variety of language tools for public administrations and SMEs in all CEF-affiliated countries. Automated translation services like eTranslation, however, require continuous extension of the supported languages and ongoing quality improvement. ELRC contributes to this objective by coordinating the collection and processing of language resources (LRs) and by maintaining the language data repository ELRC-SHARE, through which the collected data are made available not only to CEF AT but also, depending on their terms of use, to the general public.

To this end, the Institute for Language and Speech Processing / Athena Research Centre, one of the founding partners of the ELRC initiative, has set up a workflow and developed a pipeline for acquiring parallel language data from the web. The ELRC workflow supports several CEF languages in any domain. To address the EC’s requirements for domain-specific language data, the three domains currently in focus are “Health”, “Culture” and “Scientific Research”. The process is triggered by identifying multilingual or bilingual websites with content related to the targeted domains; the main sources are the websites of national agencies, international organisations and broadcasters. The ILSP Focused Crawler (ILSP-FC) toolkit is then used to acquire the main content of the detected websites and to identify pairs of candidate parallel documents. Depending on the format of the source data, efficient text-extraction methods are applied, including, for instance, OCR on PDF files. The next step leverages multilingual embeddings to extract Translation Units (TUs). Finally, a battery of criteria is applied to filter out TUs of limited or no use (e.g. sentences containing only numbers), thus generating high-quality parallel LRs. It is worth mentioning that the constructed datasets are clustered into groups according to the conditions of use indicated on the websites from which the data originated.

In the current COVID-19 crisis, the ELRC language data collection activities also addressed the growing demand for improved, technology-enhanced multilingual access to COVID-19 information. As part of the data collection activities in the “Health” domain, efforts focused on identifying reliable sources of language data and on compiling dedicated resources on the pandemic. To this end, the relevant MEDISYS metadata collections were parsed and harvested in order to extract pairs of parallel sentences from comparable corpora by applying the above-mentioned workflow. Parts of these datasets were offered to the MLIA-Eval initiative.
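To make the TU extraction and filtering steps more concrete, the following is a minimal sketch in Python of how parallel sentences can be paired via multilingual sentence embeddings and then cleaned with simple rules (such as discarding sentences that contain only numbers). The model choice (LaBSE through the sentence-transformers library), the similarity threshold and all function names are assumptions made for illustration; the actual ELRC/ILSP-FC implementation may differ.

```python
# Minimal sketch of embedding-based TU extraction and rule-based filtering.
# The model choice (LaBSE via sentence-transformers), the 0.8 threshold and
# all helper names are illustrative assumptions, not the ELRC/ILSP-FC code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual embeddings

def mine_translation_units(src_sentences, tgt_sentences, threshold=0.8):
    """Pair each source sentence with its most similar target sentence."""
    src_emb = model.encode(src_sentences, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sentences, convert_to_tensor=True)
    sims = util.cos_sim(src_emb, tgt_emb)  # (n_src, n_tgt) cosine similarities
    tus = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        score = float(row[j])
        if score >= threshold:  # keep only confident pairs
            tus.append((src_sentences[i], tgt_sentences[j], score))
    return tus

def keep_tu(src, tgt, min_chars=10, max_len_ratio=2.0):
    """Filter out TUs of limited use, e.g. numeric-only or length-mismatched pairs."""
    for sentence in (src, tgt):
        if len(sentence) < min_chars:
            return False
        if not any(ch.isalpha() for ch in sentence):  # only numbers/punctuation
            return False
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / max(shorter, 1) <= max_len_ratio

candidates = mine_translation_units(
    ["The ministry issued new health guidance."],
    ["Das Ministerium hat neue Gesundheitsleitlinien veröffentlicht."],
)
translation_units = [tu for tu in candidates if keep_tu(tu[0], tu[1])]
print(translation_units)
```

For illustration only: large-scale parallel-sentence mining typically scores candidate pairs in both directions (e.g. with margin-based similarity) rather than thresholding a single cosine score, trading recall against TU quality.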
The EN-X translation units collected as part of the ELRC activities during the last two years amount to more than 40 million in total. Further to the above, a considerable number of TUs have been identified for X-Y language pairs, where X and Y are CEF languages other than EN, while millions of TUs have been extracted from websites with multi-domain content, with the aim of clustering them into domain-specific subsets. In total, the constructed LRs comprise approximately 80 million TUs.[1] Depending on their conditions of use as indicated by the source websites, parts of these datasets are available for download through the ELRC-SHARE repository (https://elrc-share.eu/).

[1] Note: Data acquisition and identification of parallel sentences are work in progress. The numbers provided reflect only the current state of these tasks and are growing constantly.