A Recap on CLARIN Café: Bilingual and Multilingual Corpora


2022-05-24 22:25 CLARIN


About

The CLARIN Café on Bilingual and Multilingual Corpora took place on 29 April 2022. More than forty participants from various countries and organisations joined the online event. The event was organised by Eva Soroli, Thomas Gaillat and Franck Cinato, and was divided into two parts:

Part I: The CLARIN infrastructure and the new French CLARIN Knowledge Centre CORLI, dedicated to providing expertise in corpus linguistics and the languages spoken in France, and to supporting academic communities through actions towards FAIR and open data.

Part II: Examples of parallel, comparable and dialectal corpora (new or already published), together with demonstrations on how to collect/build, annotate, explore, analyse and archive such corpora in an interoperable way.

Part I: Introduction to CLARIN and Its Knowledge Centres

After a short introduction to CLARIN and its services by Eva Soroli (CLARIN ambassador and associate professor at the University of Lille, France), the CORLI team coordinators presented the new French CLARIN Knowledge Centre CORLI and the new directions of the consortium. The CORLI speakers were Christophe PARISSE, INSERM researcher in cognitive and computer sciences working at the University of Nanterre (France) in the domain of corpus linguistics, language development, language change and language pathology, and Céline POUDAT, associate professor of linguistics and discourse analysis at the University of Côte d’Azur (France). They presented the CORLI consortium, which involves members from more than 20 research labs and 15 universities, is part of the French infrastructure Huma-Num and has been a certified CLARIN K-centre since 2020. They also discussed its recent national developments and the progress made in building a sustainable national consortium similar to what European Research Infrastructure Consortia are doing on a European scale.

The speakers described the development of the CORLI K-Centre, its scope and its organisation in working groups (e.g., the consortium's working group on Multilingualism), and its intuitive and interactive online platform, which centralises and offers both proactive and reactive services about available language resources, databases and depositories, training opportunities and best research practices. The speakers also discussed the new directions of the consortium: actions towards the development of a collaborative annotation platform, solutions for the sustainable citation of corpora, and the creation of an open reference corpus for the French language. The large number of participants shows that the centre's topics and services are of great interest and relevance to other CLARIN national initiatives, as well as to researchers and professionals from other research infrastructures and communities (data scientists, engineers, educators, etc.).

The last speaker and co-organiser of this event was Thomas GAILLAT, associate professor of corpus linguistics at the University of Rennes (France). He works at the intersection of natural language processing, corpus linguistics and machine learning. His current research focuses mostly on language acquisition questions and on the development of tools that automatically extract and visualise linguistic profiles in texts written by learners of English. His talk covered the issue of storing a comparable learner corpus in a data repository. Comparable corpora are made up of many files, which need to be accessed in an orderly manner in order to extract coherent datasets. Subsequent analyses can then be conducted to make comparisons between speakers of different L1s or L2s.

T. Gaillat illustrated the issue with the Corpus InterLangue (CIL), a learner corpus of L2 French and English. He showed that the corpus storage architecture now supports online extractions and comparisons in both languages. Based on the Huma-Num Nakala infrastructure and with the use of R scripts, it is possible to extract corpus items, annotate texts automatically and create datasets supporting comparisons.

Comparability is ensured in several stages. Firstly, the French and English subsets of the corpus were collected on the same basis, i.e. identical tasks, similar proficiency, same file types and same metadata types. Secondly, the corpus data was formatted following the same transcription protocol and the same data formats (WAV for audio recordings, XML and txt for transcriptions and CSV for metadata). Finally, the queries can be conducted with scripts that apply consistent extraction based on identical metadata information. The scripts also include automated linguistic annotation with UDPipe, providing French and English texts with Universal Dependencies and part-of-speech annotation. The scripts can be modified, as they are distributed under a Creative Commons licence.

This CLARIN Café offered the perfect space to encourage discussions regarding corpora and multilinguality. The event presented some national and international initiatives in the domain and highlighted the emergence of three new corpus projects, including their purposes and features. Thanks to the presence of researchers from all around the world (83 participants from Europe, two from South America, two from Canada, two from the United States, four from Africa and three from Asia), the presentations and discussions provided some new insights into the specifics of multilingual and multidialectal corpora and underlined the need for common practices in the domain of multilingual corpus building and management.

Additional information on this CLARIN Café and the slides of the event are available on the event page.
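
As a rough illustration of the metadata-based extraction described above for the CIL corpus, the sketch below applies one and the same filter to both language subsets, so that the resulting file lists are comparable. It is written in Python rather than the R used by the project, and it does not reproduce the project's scripts or the Nakala interface. The column names (`file`, `language`, `task`, `level`), the file names and all values are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata table: column names and values are invented for
# illustration and are not taken from the actual CIL metadata files.
SAMPLE_METADATA = """file,language,task,level
fr_001.txt,French,narration,B1
fr_002.txt,French,interview,B1
en_001.txt,English,narration,B1
en_002.txt,English,narration,C1
"""


def load_metadata(csv_text):
    """Parse CSV metadata into a list of dicts, one per corpus file."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def select_comparable(rows, task, level):
    """Apply the same filter to every language subset.

    Returns a dict mapping each language to the sorted list of files
    whose metadata match the requested task and proficiency level.
    """
    subsets = defaultdict(list)
    for row in rows:
        if row["task"] == task and row["level"] == level:
            subsets[row["language"]].append(row["file"])
    return {lang: sorted(files) for lang, files in subsets.items()}


rows = load_metadata(SAMPLE_METADATA)
comparable = select_comparable(rows, task="narration", level="B1")
for language, files in sorted(comparable.items()):
    print(language, files)
```

Because the filter is defined once and applied to every row regardless of language, the French and English subsets are selected on identical criteria, which is the property the comparability stages aim for. Annotation (for example with UDPipe) would be a separate step applied to the selected files.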