Tour de CLARIN: Interview with Sidsel Boldsen

2022-04-25 19:00 CLARIN

Sidsel Boldsen is a PhD student in natural language processing and digital humanities, with a special interest in historical languages and linguistic knowledge representation. She is part of the interdisciplinary research project ‘Script and Text in Time and Place’ at the University of Copenhagen.

1. Please describe your academic background and current position.

I’ve just handed in my PhD thesis, and am due to defend it at the beginning of May. My background is in historical and comparative linguistics, which I studied for my BA. But then I moved on to language technology and did a Master’s in IT and Cognition at the University of Copenhagen. In my PhD, I focus on language technology, with a special interest in language change, which comes from my background in comparative linguistics. I became interested in language technology because I thought that programming and scripting could offer interesting research avenues for linguistic studies involving digital corpora.

2. You are part of the research project ‘Script and Text in Time and Place’ at the Department of Nordic Research at the University of Copenhagen. Could you describe the project?

The project is highly interdisciplinary: a qualitative and quantitative study of about 300 medieval Danish charters from the thirteenth to the sixteenth centuries. The goal is to study the script and language of medieval Denmark through these resources. These charters are very interesting from both a historical and a linguistic point of view because they have not been edited in any way, so they are direct sources of language and history. They are also dated and geographically localised, based on where they were produced, so you can get a very nuanced picture. Often when you work with historical texts, it’s an edition of an edition, and it can be difficult to say what the ‘real’ language actually is and what the later additions are. So there are philologists working on the project, but also historians looking into the monastic history.

The project will end in May 2022. The main output will be an open-source, digital scholarly edition of the charters. This scholarly edition will enable scholars within philology and history to search these charters – to see the texts in different layers, where we have annotated the different features of the script, and other levels, too. For instance, we lemmatised the texts so that you can search for word forms, and we also annotated the people and places that occur in the text.
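To make the idea of a lemmatised, searchable layer concrete, here is a minimal Python sketch of how token-level lemma annotation lets a scholar retrieve every attested spelling of a word. The record structure and the Old Danish forms are hypothetical illustrations, not the project’s actual data model.

```python
# Minimal sketch of lemma-based search over a token-annotated text.
# The record structure and the Old Danish forms are hypothetical.
tokens = [
    {"form": "konungh", "lemma": "konung", "pos": "NOUN"},
    {"form": "kunung",  "lemma": "konung", "pos": "NOUN"},
    {"form": "oc",      "lemma": "ok",     "pos": "CONJ"},
]

def forms_of(lemma, tokens):
    """Collect every attested spelling of a lemma in the token layer."""
    return {t["form"] for t in tokens if t["lemma"] == lemma}

print(forms_of("konung", tokens))  # -> {'konungh', 'kunung'}
```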
3. What has your role in the project been?

One focus of the project has been to develop tools for automatic linguistic analysis of texts, as well as for automatic dating, localisation and the identification of scribal schools. Such customised digital tools should improve the quantitative analysis of historical sources. I am involved in that part. There aren’t many tools available for dating or localising Danish script, and little systematic analysis has been conducted on Danish medieval texts so far. We have been working towards new methods for automated dating, localisation and grouping of texts based on machine learning techniques (MLT), which will improve our understanding of what the relevant factors are for establishing the date, place and scribe of primary sources from the Middle Ages. The benefit of using MLT in the project is that it allows us to take advantage of the dated charter material for building a reference and training corpus.

What I did was to look at how language changed and whether it was possible to develop tools to date undated texts. These corpora are all dated, so the question was: can you use these dated corpora to develop tools that automatically date texts that lack a date? Perhaps such a tool could help make undated texts more accessible. We started to develop the tool – that was the starting point for my thesis. But then my work became more theoretical, and I began to study how language change is actually captured in language models, and what kinds of features they recognise or are sensitive to. I’ve tried to look at different layers. Of course, there’s topical change, with different places being named or different expressions being used through time – a sort of topic model. But then I also looked at sound change. In that period, we know that certain sound changes are supposed to have happened, but when we look in the corpus, can we actually identify them? So I was also interested in change on a phonological level.

4. Can you say a little more about the tool you have been developing?

The tool – it’s not a fully developed tool yet – uses support vector machines (SVMs) to automatically assign the manuscripts to a specific time period, or bin. It represents a text in a vector space, based either on the words it contains or on smaller segments such as character n-grams; it projects the texts into that space and tries to learn decision boundaries between the different classes. In this case, the classes are specific time periods – centuries, or spans of 50 years. The tool then learns how to separate the documents projected into that space. When you receive a new document, you map it into that space, and you can evaluate how well the space was constructed by how well documents are divided into those periods. We achieved pretty good accuracy: we reached around 75%, which means we were able to date almost 75% of the charters within a 25-year error margin – the standard philologists use for how precisely medieval texts can be dated manually. But how these so-called bins are actually constructed is also a bit complex, and we didn’t really find a way to address that in our paper on the topic.

If I were to develop the tool further, I think the work should focus on what type of features it actually recognises. For it to be useful, it would have to work for another corpus that had been annotated using different schemas, for example. We have yet to test that. But that would be the question: how well is it able to generalise across corpora? We have another corpus of charters, or medieval documents, called Diplomatarium Danicum, and it would be interesting to test the tool on this resource because it is a much bigger corpus spanning a broad period. It would be interesting to see how well a tool trained on this very specific corpus would transfer to that one. In principle, I think the tool could be useful for other corpora as well, at least within the same domain. We are currently working on a contribution investigating the generalisability of such methods, in which we also plan to make the tool available for scholars to test on their own data. So, stay tuned!
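The approach described here can be sketched roughly as follows, assuming a scikit-learn-style pipeline. The toy charters, the 50-year bin size and the n-gram settings are illustrative stand-ins, not the project’s actual configuration.

```python
# Sketch of dating-by-classification: character n-gram features and a
# linear SVM that learns decision boundaries between time-period bins.
# Data, bin size and feature settings are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Dated training charters: (text, year of issue) -- toy stand-ins.
charters = [
    ("...text of a charter dated 1285...", 1285),
    ("...text of a charter dated 1325...", 1325),
    ("...text of a charter dated 1410...", 1410),
]

BIN = 50  # classify into 50-year bins, one of the spans mentioned above

texts = [text for text, _ in charters]
labels = [(year // BIN) * BIN for _, year in charters]  # 1325 -> 1300

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char n-grams
    LinearSVC(),  # learns a linear decision boundary per class
)
model.fit(texts, labels)

# An undated document is mapped into the same space and assigned a bin.
print(model.predict(["...text of an undated charter..."]))
```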
5. You are applying machine learning techniques to the analysis of medieval Danish texts in cooperation with CLARIN-DK. How did you start collaborating with CLARIN-DK? Have you used any specific CLARIN tools as part of your research?

One of our project members, Bart Jongejan, is in the CLARIN-DK team. So when we needed tools, we used one that was in the CLARIN-DK repository. Although the charters had already been transcribed, they were annotated in a CSV-like format, in which each row represents one token, and they needed to be converted to XML format for the actual edition. We used the NLP workflow manager Text Tonsorium to convert the format automatically. We used the same tool for automatic part-of-speech (PoS) tagging of the Latin charters. For the Danish charters, we wanted to annotate PoS and lemma manually, and in this case we used the tools offered in Text Tonsorium as a starting point for the annotation. This was very useful, as it dramatically reduced the workload of the manual annotation. One of the great features of Text Tonsorium is that it offers many different pipelines and workflows, so you can quickly test out different parsers or lemmatisers. Otherwise, you would have to set up all the different tools and try them out one by one, but here there is one common format in which you can try them all out in one go. For my research area, CLARIN-DK provides everything I need. They have trained both an Old Danish lemmatiser and a Latin one, and they also offer PoS tagging.
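As a rough illustration of the kind of format conversion involved (the project itself used Text Tonsorium for this step), here is a toy Python sketch that turns a token-per-row CSV layer into simple XML. The column names and the target schema are hypothetical.

```python
# Toy sketch of converting a token-per-row CSV layer into XML for an
# edition. Column names and the target schema are hypothetical; the
# project itself used the Text Tonsorium workflow manager for this.
import csv
import io
import xml.etree.ElementTree as ET

csv_data = """form,lemma,pos
konungh,konung,NOUN
oc,ok,CONJ
"""

root = ET.Element("charter")
for row in csv.DictReader(io.StringIO(csv_data)):
    w = ET.SubElement(root, "w", lemma=row["lemma"], pos=row["pos"])
    w.text = row["form"]  # the token form becomes the element text

print(ET.tostring(root, encoding="unicode"))
# <charter><w lemma="konung" pos="NOUN">konungh</w> ... </charter>
```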
6. Why is it important to take a computational approach, such as natural language processing, in the humanities?

I think the contribution of computational methods is two-fold. The most important, in my view, is to make corpora more accessible and searchable, so that qualitative researchers can use these resources in a more focused way. It’s about how language technology can assist the research questions that qualitative researchers work with – a way to assist, not to replace, different methods. You can work with larger resources and filter them, for example. The other is to actually use those tools to carry out humanities research, which can also be very interesting. For example, if you develop a tool to date texts automatically, you could also learn from it: what are the predictive features of language change? In that way you use the tools not only to annotate, but also to learn from the models and use them to actually study language and text. But I think the first contribution is the more important one.

7. What is in store for your future collaboration with CLARIN-DK?

I don’t have a project in mind at the moment, but I’d definitely be open to working with CLARIN again. The collaboration with CLARIN-DK was a very positive experience for me. I’m busy finishing the project and getting the whole scholarly edition online. My defence is at the beginning of May, and after that I’ll be able to think about what lies ahead.

The scholarly digital edition will be accessible here at the end of May 2022.
