Manually Annotated Corpora for Teaching and Learning: The CrowLL Project

By Tanara Zingano Kuhn, Rina Zviel Girshin, Špela Arhar Holdt, Kristina Koppel, Iztok Kosem, Carole Tiberius, and Ana R. Luís

In February 2019, members of the enetCollect COST Action gathered in Brussels for the first edition of the Crowdfest, a hackathon for developing language learning projects that use crowdsourcing techniques. At this event, the present project leader, together with a researcher who had already expressed interest in collaborating, pitched her idea: use crowdsourcing to remove offensive and/or sensitive content from corpora, as a more adequate way of making pedagogical corpora that could be used to develop a Portuguese version of the auxiliary language learning resource Sketch Engine for Language Learning (SKELL). Two other researchers then joined them at the hackathon, and together they brainstormed ideas and came up with a first methodological approach. Later that year, researchers from other countries joined the group and further developed the initial methodology, leading to the proposal of a crowdsourcing experiment using Pybossa. The experiment was conducted for Dutch, Serbian, Slovene, and Portuguese.

The lessons learned from this experiment led us to make some changes to the initial idea. Firstly, the final application of the corpora was broadened. Not only did we want to develop SKELL for each of the languages, but we also wanted to make the corpora available to language teachers, lexicographers, and Natural Language Processing (NLP) researchers. Secondly, the objective of the crowdsourcing tasks changed from corpus filtering to corpus labelling. This way, sentences are not removed from the corpus but labelled, so users can choose the sentences they want to use depending on their objectives. Finally, we decided to adopt a more engaging approach: a gamified solution.
This is when CrowLL, the Crowdsourcing for Language Learning game, was born. The main goal of this project was to create manually annotated corpora for teaching and learning Brazilian Portuguese, Dutch, Estonian, and Slovene that can be used by lexicographers, language teachers, and NLP researchers, as well as for the development of SKELL for each of the languages. The process involved two stages, namely data preparation and game development, each with its own outcomes. In the future, researchers wanting to create such annotated corpora for their language can choose the expert approach (the annotation guidelines), crowdsourcing (the game), or both.

STAGE 1 – Data Preparation

In this stage, data for the game was prepared, which involved:

1. Definition of the source corpora from which sentences would be extracted
2. Provision of pedagogically oriented GDEX configurations
3. Creation of lemma lists to extract sentences from the corpora

The process is described in detail here. The result is manually annotated corpora for teaching and learning Dutch, Estonian, Slovene, and Brazilian Portuguese, each containing 10,000 sentences. Sentences in the corpora are marked with Y if the sentence was considered ‘problematic’ for teaching the language and with N if it was considered ‘non-problematic’. All the problematic sentences additionally have labels indicating the category of the problem (offensive, vulgar, sensitive content, grammar/spelling problems, incomprehensible/lack of context). These corpora are available on PORTULAN CLARIN, together with the guidelines and the list of lemmas used for extraction.

STAGE 2 – Game Development

By streamlining annotation and including more participants in the process, larger amounts of data can be manually processed. This is why we developed a crowdsourcing-based game for further corpus growth. The code for gamified annotation is published on GitHub as open access under an Apache 2.0 licence.
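As an illustration of how the labelling scheme from stage 1 lets users choose sentences according to their objectives, here is a minimal Python sketch. It is not the project's published code and does not reflect the actual file format of the corpora: the class name, field names, the `select` helper, and the assumption that a sentence may carry several category labels are all hypothetical. Only the Y/N marking and the five category names come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Category names follow the article; everything else here is an assumption.
CATEGORIES = {
    "offensive",
    "vulgar",
    "sensitive content",
    "grammar/spelling problems",
    "incomprehensible/lack of context",
}


@dataclass
class LabelledSentence:
    """One corpus sentence with its Y/N mark and optional problem categories."""

    text: str
    problematic: str  # "Y" (problematic) or "N" (non-problematic)
    categories: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        if self.problematic not in {"Y", "N"}:
            raise ValueError("problematic must be 'Y' or 'N'")
        unknown = self.categories - CATEGORIES
        if unknown:
            raise ValueError(f"unknown categories: {sorted(unknown)}")
        if self.problematic == "N" and self.categories:
            raise ValueError("non-problematic sentences carry no category labels")


def select(corpus, exclude=frozenset()):
    """Keep every sentence that has none of the excluded categories."""
    excluded = set(exclude)
    return [s for s in corpus if not (s.categories & excluded)]
```

With this shape, a language teacher might exclude offensive and vulgar sentences but keep those with spelling problems for an error-correction exercise, whereas an NLP researcher might keep everything and use the labels as training data.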
The idea for CrowLL was originally inspired by the Matchin game (Hacker and von Ahn, 2009). In Matchin, two players compete with each other to guess which of the two pictures they are shown will be chosen by their opponent. If their predictions match, they score points. For CrowLL, we opted to start with the development of a single-player mode, where players score points if their choices match those previously made by other players. This means that, to launch the game, previously annotated sentences must be fed to it. We thus used the manually annotated corpora created in stage 1 as ‘seed corpora’ so that players’ answers can be matched to existing answers (experts’ annotations).

In terms of the type of crowdsourced work, we consider CrowLL a crowdrating game, given that ‘crowdrating systems commonly seek to harness the so-called wisdom of crowds (Surowiecki, 2005) to perform collective assessments or predictions. In this case, the emergent value arises from a huge number of homogeneous “votes”’ (Morschheuser et al., 2017, p. 27). With this game, the decision on whether a sentence is problematic or not, which category of problem it belongs to, and which constituent part of the sentence is problematic will emerge from the majority consensus.

At present, the CrowLL game can be played on computers or mobile devices at this link.
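The two mechanisms described above, scoring a player against earlier answers and letting the final label emerge from majority consensus, can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the game's actual logic: the article does not specify the matching rule, the point values, or any vote threshold, so the function names, the "match the most common earlier answer" rule, the `points` value, and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter


def score_answer(player_label, previous_labels, points=1):
    """Award points when the player's label matches the most common earlier label.

    previous_labels would initially hold only the expert annotation from the
    seed corpus and grow as more players answer (an assumption).
    """
    if not previous_labels:
        return 0
    majority, _ = Counter(previous_labels).most_common(1)[0]
    return points if player_label == majority else 0


def consensus(labels, min_votes=3):
    """Return the label chosen by a strict majority, or None if undecided.

    None is returned when there are fewer than min_votes answers or when no
    label holds more than half of the votes.
    """
    if len(labels) < min_votes:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None
```

For example, `score_answer("Y", ["Y"])` rewards a player who agrees with a single expert annotation, while `consensus(["Y", "N"])` stays undecided because two answers fall below the assumed threshold. The same functions could be applied separately to the Y/N decision, the problem category, and the problematic part of the sentence.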

