A Recap on the CLARIN and Libraries Workshop


2022-09-09 03:50 CLARIN


The CLARIN and Libraries workshop took place at KB National Library of the Netherlands on 9 and 10 May 2022. This was the first workshop with the explicit aim of bringing together the CLARIN community across Europe and research libraries to discuss issues relating to the delivery of digital content for researchers, and to plan practical steps for future collaboration.

There were 30 participants in the workshop, the majority with library-based roles, from 15 different European countries, which led to a stimulating and fruitful discussion. Participants were able to reflect on a number of major library initiatives (past, present and future) involving the delivery of textual content from large text collections. These projects, usually with target audiences of both readers and researchers, are often somewhat disconnected from each other, and also disconnected from research infrastructures.

Bridging Two Cultures

While new research infrastructures for arts, humanities and social sciences, such as CLARIN and DARIAH have emerged in recent decades, libraries have been for many centuries the most important resource for researchers, and remain so today in the digital age. For virtual, digital, distributed research infrastructures such as CLARIN to be effective, they need to work closely with libraries, which play key roles as creators and curators of digital data, and as intermediaries between researchers and digital data, tools and expertise.

While there are already existing collaborations that have broken down the separation between the new and old infrastructures, it was acknowledged that there are different communities of practice used to working with different datasets, different software environments and tools, and using different methods. The workshop explored a number of past, current and future initiatives to overcome these barriers.
Nederlab

Hennie Brugman presented the Nederlab research portal, which offers a platform where researchers can access library textual data and perform a number of operations on it, and which was developed in collaboration with CLARIAH-NL. The platform offers access to digital Dutch historical text collections that are aggregated, harmonised and collectively made searchable and analysable. The project to develop Nederlab ended in 2018, and although it remains live and is regularly used by scholars, there is no continued software development, and there are limited updates to collections. Nevertheless, the scope is impressive, with 24 collections, 19 billion words and almost 100 different annotation layers.

The lessons learned from the project to build and deliver the portal included the necessity of keeping a firm grip on the difference between running a project and a service, keeping know-how on board, the importance of delivering enrichments to collections back to the providers, and offering services that users want, including direct access to text files, access via APIs, and flexible ways of segmenting documents.

Over the period in which Nederlab was developed, and since, the architects detected a shift of emphasis for researchers from wanting interactive research environments to the need for online accessible data, and the need for an ‘IIIF for text’, meaning effective, flexible and robust ways to reference and use fragments of text, in the same way that the International Image Interoperability Framework (IIIF) makes this possible for images (including images of text on pages).

Text+

Peter Leinen from the German National Library presented Text+, a new German research data activity, which is being developed in a major project, also involving a range of partners from academia, including CLARIN and DARIAH, and infrastructure institutions including libraries. Text+ is a part of the National Research Data Infrastructure.
The aim is to build a research data infrastructure focused on language and text data, for a wide range of disciplines in the humanities and social sciences. The data which Text+ aims to deliver includes not only collections of historical texts, but also contemporary language corpora, lexical resources, and digital editions. Text+ will be not just a network of repositories but will offer a comprehensive support infrastructure for all issues regarding collections, including interfaces, standards, authority data, long-term preservation, etc.

Access to Data for Researchers at a National Library

The KB, National Library of the Netherlands, has a relatively long history of offering online data services, which has included access to datasets of historical printed books, newspapers, mediaeval manuscripts, transcripts of radio news, and parliamentary papers. This can be dated back through more than 30 years of digitisation, and 10 years of providing interfaces to support distant reading, including projects such as Delpher, KB Lab Datasets and Linked Open Data. These initiatives have resulted in data-driven humanities research projects, and also in the development of new research tools and environments.

Looking to the future, the KB is looking for more and better ways to make data available via a variety of routes to researchers, with activities such as the FAIR@KB manifesto and CLARIAH FAIR dataset register, a new text and data mining room, and plans for developing a text suite for corpus selection operations and a tools-to-data solution for in-copyright collections.

Linking with Cultural Heritage

A current project in Belgium, entitled DATA-KBR-BE, is an interdisciplinary collaboration between cultural heritage experts, digital humanities researchers and data scientists. It is also highly relevant in this context, addressing many of the same issues which have already been highlighted in the other projects.
The project is taking place in collaboration with the DARIAH and CLARIN consortia in Flanders and Belgium, and builds on much recent and ongoing work in DARIAH and in the digital humanities community more widely relating to the topic of ‘collections as data’. The vision behind DATA-KBR-BE of the optimal set of conditions for the proper exploitation of collections by researchers was an important point of reference for the discussion in the KB workshop. DATA-KBR-BE will offer data-level access to digitised collections for digital humanities research.

Unlocking Digital Texts

Neil Jefferies (Bodleian Libraries, University of Oxford) presented Unlocking Digital Texts, a new collaboration between the Universities of Cambridge (UK), Oxford (UK), and Notre Dame (USA), with contributions from other institutions, and part of the AHRC/NEH New Directions in Digital Scholarship in Cultural Institutions programme. The project aims to make it easier to use a variety of textual formats as data in research, and will develop outline standards, prototypes, and proofs-of-concept, emulating the approach used with IIIF. It will build on existing standards and technologies (such as Text Encoding Initiative XML, IIIF, and the Oxford Common File Layout), rather than creating new formats or specific code dependencies. The project has links to Text+ and Nederlab, and is looking for further collaboration and knowledge exchange opportunities.

The workshop also reflected on the digital libraries landscape and differing levels of ongoing collaboration with CLARIN in Bulgaria, Czechia, Finland, Lithuania, Norway, Poland and Sweden.

Next Steps

An initial list of possible areas for collaboration included sharing CLARIN technologies in areas such as:

- interactive online corpus linguistics platforms, many now curated and developed by CLARIN centres, e.g. Korp, Corpuscle
- linguistic annotation of texts to enable more effective search
- higher-level processing of texts, e.g. stylometry, named entity recognition
- platforms to connect tools and texts to each other and in processing pipelines, via services such as the Language Resources Switchboard and WebLicht.

Discussion at the workshop identified further areas where more collaboration could be useful, which included:

- making use of the libraries’ role in providing front-line research support embedded in universities and research institutions
- working to overcome barriers presented by copyright and other legal and ethical restrictions on the use of digital texts
- other parts of the research life-cycle: technologies, formats and tools in the digitisation and representation of texts.

Discussion started in the workshop will undoubtedly be continued in new projects such as Text+ and DATA-KBR-BE, via existing forums such as the Conference of European National Librarians, and in emerging initiatives such as the European data space for cultural heritage.

The organisers were very happy to have taken part in CLARIN’s first post-pandemic international in-person gathering, and to have had the opportunity once again to meet old friends and make new ones, after such a long suspension of normal social activity. We look forward to more!

More details of the event, including the slides from the presentations, are available on the event page.