Translation brawls: what happens when annotators disagree?

2020-02-28 19:07 unbabel

There’s this saying about how if you give the same text to 10 different translators, they will render 10 different, equally valid translations. After all, language is highly subjective, so when it comes to translation, there’s not one universally accepted answer. And so, naturally, linguists have very strong opinions on which translation best expresses the original meaning of the message. Since we’re looking for the highest translation quality, this poses a big challenge to us.

It turns out the same applies to the annotation of translation errors. Annotators don’t always agree, and not because a translation error has been categorized wrongly, but rather because the same error can be categorized differently, depending on the angle you look at it from. So how can we ever hope to train our models to be accurate when even we can’t agree on what’s wrong? And could this diversity of opinions be a good thing?

Supervised learning needs examples

First, we need to take a step back: why are we interested in what annotators have to say? The reason is simple: currently, almost all the successful AI methods are supervised methods. This means they learn from examples. For image recognition, the examples are images annotated with labeled bounding boxes (this part of the image is a cat, this part of the image is a dog, and so on); for speech recognition, the examples are speech recordings with their text transcriptions; and for machine translation, this means sentences with example translations. Some tasks require the classification of words or entire sentences into fixed classes. The challenge with named entity recognition (NER), for instance, is to recognize the parts of a sentence that indicate certain classes of interest, like location, name, or date.

An example of the type of data used and produced in NER: LOC is location, ORG is organization and NORP is nationalities or religious or political groups. This particular example is the prediction of spaCy’s large English model on a news article from Eater. Note that an entity can consist of multiple words, and that the last instance of Corona was mistakenly tagged as a location.

This labeled data is the bedrock of any machine learning application that is successful in the real world, because these examples don’t only train models: they also evaluate whether the models have really learned the task at hand. After all, we do not simply want them to copy the examples they were shown; we want them to generalize to unseen cases. For this reason, we always keep a number of examples aside, to be used later for testing the models.

The important thing to remember is that these examples are provided by us, the humans! We carefully create the example translations, we decide on the categories for the images, we choose the taxonomy of classes that go into the NER system. We can call this effort, the process of creating examples with labels, annotation, and the person doing it an annotator.

At Unbabel, we use the Multidimensional Quality Metrics framework, or MQM, to assess the quality of our translations. Annotators are a big part of the process: they conduct error annotation, which involves, for each translation error encountered, highlighting the span of the error, classifying it from the list of issues, and finally assigning it a severity (minor, major or critical). This is a bilingual effort, so the annotator has to be competent in both languages.
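As a rough illustration of what a single error annotation might look like as data, here is a minimal Python sketch. The field names, the example sentence and the issue labels are invented for illustration and are not Unbabel’s actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    """One annotated translation error (illustrative fields, not Unbabel's actual schema)."""
    start: int      # character offset where the error span begins (inclusive)
    end: int        # character offset where the error span ends (exclusive)
    issue: str      # issue type from the MQM taxonomy, e.g. "Spelling"
    severity: str   # "minor", "major" or "critical"

# A (deliberately flawed) translation with two annotated errors:
target = "The new email adress you wants to attach."
annotations = [
    ErrorAnnotation(start=14, end=20, issue="Spelling", severity="minor"),   # "adress"
    ErrorAnnotation(start=25, end=30, issue="Agreement", severity="major"),  # "wants"
]
```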
Their job comes at different levels of granularity: some of it is fine-grained error annotation, as when they’re evaluating whether words are incorrectly translated or overly literal. But sometimes, error annotation happens at a higher level, for example when they’re judging whether this sentence is a better translation than that other sentence (ranking), or that this sentence is a 9/10 but that other one a 3/10 (direct assessment). In some cases, especially with direct assessment, it might be hard to understand what drove the judgement of the annotator. It’s one of the reasons why we are particularly fond of the MQM approach: we get a lot of insight into the perceived nature of the errors.

Because here’s the thing: annotators don’t always agree. When we onboard new annotators, it’s not uncommon to see disagreements where, for the same error, one annotator claims it’s minor, one claims it’s major, and one claims it’s critical! And these annotators are already highly qualified; it’s just not an easy task.

Disagreement happens for several reasons. First of all, the annotation task is an inherently subjective one. Annotators can simply have different preferences: some prefer translations that show greater grammatical fluency, while others place greater value on preserving the meaning of the original. But there are other reasons. Despite the best efforts and constant tuning, instructions aren’t always crystal clear: we can’t predict every case in which a particular tag should be used, and again, language is ambiguous and poses challenges when you try to classify it. Plus, humans make mistakes. A lot. They’re also famously riddled with biases, both at an individual level (e.g. they consistently prefer one reading or interpretation over another) and at a group level, in the more socio-cultural sense of the term. Lastly, even the quality of a competent annotator may vary; just try taking a language test in your own native language when you are tired or distracted.

But while disagreement is somewhat normal, it can certainly become a problem. If annotators don’t agree on the severity of an error, how do we know what it is?

Measuring (dis)agreement

For a start, we could use features of the annotation process to measure quality. But that can be problematic. Take as an example the time the annotator takes to complete the task, a very simple quantity to obtain. We’re assuming that a fast annotator is probably hasty, and therefore prone to mistakes, while an annotator who takes a bit more time is just being thorough. But it might as well be the case that the fast annotator is simply experienced and efficient, while the slow annotator is dragging their feet. It’s very hard to distinguish annotators by simple features alone. But when the metadata is more expressive of the task, like the keystroke behaviour of an editor, it can become very predictive of quality, as is shown by Translator2Vec, a model developed at Unbabel.

Instead of looking at behavioural data, we can look at the annotations themselves. If we gather multiple judgements on the same item, we can do something more than characterize: we can compare! And this is where inter-annotator agreement comes in. Inter-annotator agreement is typically measured with statistics that summarize, in a single number, the degree of agreement between different annotators. Take raw agreement, which is simply the fraction of items on which annotators make the same judgement.
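To make this concrete, here is a minimal sketch on invented severity judgements from two annotators. It computes raw agreement by hand and, for comparison, the chance-corrected Cohen’s kappa discussed next, using scikit-learn’s implementation.

```python
from sklearn.metrics import cohen_kappa_score

# Severity judgements from two annotators on the same ten error spans (invented data).
annotator_a = ["minor", "minor", "major", "critical", "minor",
               "major", "minor", "minor", "major", "critical"]
annotator_b = ["minor", "major", "major", "critical", "minor",
               "minor", "minor", "minor", "major", "major"]

# Raw agreement: the fraction of items on which the two judgements coincide.
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects that fraction for the agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```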
This does present a problem: if people pick random labels often enough, they are bound to agree at some point, and we do not want to count that in. That’s precisely why Cohen’s kappa enjoys much greater popularity: it corrects for those chance agreements. This idea can be further extended to measure the consistency of an annotator, or in other words, the intra-annotator agreement. If there are multiple judgements by the same person on the same item, preferably with some time in between, then the same metrics as above can be used to measure an annotator against themselves.

Illustration of annotator agreement (-1 to 1) on a clear example (first) and a questionable example (second) of sentiment rating (0 to 100), taken from Jamison and Gurevych (2015). The second example is one where the coherence of the task and the labels breaks down: is a war zone sad or just bad? And on the other hand, is a limit on war not a good thing? This objection is reflected in the agreement score, which indicates that there was almost no correlation in the annotators’ judgements (0 means no correlation).

In the end, these metrics can help you get a grip on the quality of your data. They provide you with a number that can guide decision making: do you need to demote certain annotators? Do you need to discard certain examples? But don’t be fooled: all metrics have flaws, and Cohen’s kappa is no exception.

Agree to disagree?

Should we always punish differences of judgement? Some data labelling tasks are inherently ambiguous, and in those, disagreement could be telling us something. Consider this example:

Unbabel example of MQM annotations on English-German from two different annotators. Yellow is a minor error, red a critical one. The example comes from an internally used test batch used to train and evaluate annotators. (The visualization was created using an adaptation of displaCy.)

The source sentence is “Could you also give me the new email address you would like me to attach to your account.” It’s clear that the annotators have different approaches, with one clear point of agreement (the word neuen) and one big disagreement: the last part of the sentence. The MQM score resulting from the second annotation is 70, while that resulting from the first annotation is 40, which illustrates the big influence a critical error can have on the final score (a sketch of how severities translate into such scores follows below).

In this example, we prefer the second annotation. The first annotator claims that the last bit of the sentence is unintelligible, which, according to MQM guidelines, means that the exact nature of the error cannot be determined, but that it causes a major breakdown in fluency. This is an error you would apply to a garbled sequence of characters and numbers such as in “The brake from whe this કુતારો િસ S149235 part numbr,,.”, which is not necessarily what happens in the sentence above. But we could argue that there is an interesting question here: if the last section of the translation contains so many mistakes that it almost becomes impossible to understand, doesn’t this constitute a “major breakdown in fluency”?

This example is taken from an experiment in which we compare and align annotators. Because both annotators are competent, and the source of disagreement can be understood, the step that follows the above observation is one of calibration: making sure that all annotators are on the same page, with us and with each other.
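As for how a handful of severity judgements becomes a single score: the sketch below uses a common MQM-style scheme that weights minor, major and critical errors as 1, 5 and 10 and normalizes by translation length. The weights and the normalization are assumptions for illustration, not necessarily the exact configuration behind the 70 and 40 above.

```python
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(severities, n_words):
    """Length-normalized quality score: 100 with no errors, lower as penalties pile up."""
    penalty = sum(SEVERITY_WEIGHTS[s] for s in severities)
    return 100 * (1 - penalty / n_words)

# On a 20-word translation, trading one minor error for a critical one halves the score:
print(mqm_score(["minor", "minor"], n_words=20))     # 90.0
print(mqm_score(["minor", "critical"], n_words=20))  # 45.0
```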
Embracing the chaos

When dealing with this kind of disagreement, there are always a few things we can do to mitigate it. Sometimes, you can reduce disagreement just by providing more guidance. This is a matter of investing more human hours, understanding which labels and which tasks are causing the disagreement, and the solution can include rethinking labels, tools, incentives, and interfaces. This is a tried and trusted approach here at Unbabel.

Or you ask other experts to repair your data. When this was recently done for a classic, and still widely used, NER dataset, researchers found label mistakes in more than 5 percent of the test sentences. That might not sound very significant, but it is a pretty large number for a dataset on which state-of-the-art methods achieve performance of over 93 percent!

Example of corrections made by Wang et al. (2019) to the CoNLL03 NER dataset. (Adapted from Wang et al. using displaCy.)

An interesting approach is to merge judgements: if you can get multiple annotations on the same data item, why not try to combine them into one? We tend to rely on experts, because we believe they are more accurate, thorough, and ultimately, reliable. Since the annotations we use deal with a specialized taxonomy of errors and require a great level of language understanding in order to be used correctly, we rely on highly qualified annotators. But here’s the fascinating thing: for some tasks that do not use a very specialized typology or assume a specialized type of knowledge, the aggregated judgement of several non-experts is as reliable as a single judgement from an expert. In other words: enough non-experts average into one expert. And the number of non-experts required for this can be surprisingly low. It’s this type of collective knowledge that built Wikipedia, for example.

Take the task of recognizing textual entailment (RTE). Textual entailment is a logical relation between two text fragments; the relation holds whenever the truth of one sentence follows from the other. For example: “Crude oil prices slump” entails that “Oil prices drop”; it does not entail that “The government will raise oil prices” (adapted from Snow et al., 2008).

Aggregating the judgements of multiple non-experts into that of a single expert (green dashed line). Adapted from Snow et al. (2008).

Here, we see how aggregating the judgements of those non-experts can improve the accuracy of annotations (black line). And we can boost it even further by weighting each non-expert judgement with an automatically determined score, computed from their agreement with an expert, effectively correcting for their biases, as the blue line shows.

Instead of weighting your annotators by confidence, you can also try to weight your examples by their difficulty, for example by assigning less importance to the easy examples, or, more rigorously, by removing them entirely. The beauty of these two approaches is that the models themselves can be used to identify the candidates.

All in all, it’s hard to remove all ambiguity. Take translation: for a single sentence, there is a multitude (possibly a very large number) of valid translations, each perhaps prioritizing a different aspect of translation quality. Just think about the many translations of a novel by different translators, or across the decades. This is explicitly accounted for in the evaluation of translation systems, where it is considered best practice to always consider multiple valid reference translations when using an automatic metric.
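As an illustration of multi-reference scoring, here is a minimal sketch with the sacrebleu library. The sentences are invented, and BLEU stands in for whichever automatic metric you prefer; each hypothesis is scored against every reference it is given.

```python
import sacrebleu

# One system output and two equally valid reference translations (invented examples).
hypotheses = ["Oil prices drop sharply."]
references = [
    ["Crude oil prices slump."],           # first reference for each hypothesis
    ["The price of oil falls sharply."],   # second reference for each hypothesis
]

# corpus_bleu takes the hypotheses plus one list per reference translation.
# On real data you would pass whole corpora; single toy sentences give unstable scores.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```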
In the training of machine translation models, on the other hand, it remains an open question how to promote diversity or, in broader terms, how to deal with the fundamental uncertainty of the translation task.

It turns out too much agreement isn’t good for your models either. When that happens, annotators can start to leave behind easy patterns, the so-called “annotator artifacts”, which are easily picked up by the models. The problem is caused by features in the input example that correlate strongly with the output label but do not capture anything essential about the task. For example, if all the pictures of wolves in the training data show snow and all the pictures of huskies do not, then this is very easy to pick up on, and equally easy to fool: the models fail, having assumed that the lack of snow is what characterises a husky.

It turns out that language has its own version of snow, as was discovered for a dataset in natural language inference, a generalized version of RTE. The dataset is part of a very popular benchmark for training and evaluating language understanding systems, one that provides a “single-number metric that summarizes progress on a diverse set of such tasks” and that has been an important driver of the trend towards bigger, stronger, faster models.

Natural language inference (NLI) example sentences created from a premise by following simple heuristics. (Taken from Gururangan et al. (2018).) The annotator is given the premise and constructs a sentence for each of the three logical relations (entailment, neutral, and contradiction). The generated sentence is called the hypothesis. The machine learning task is to predict the relation given the premise and the hypothesis.

The examples in this dataset are created by humans, who, it turns out, often rely on simple heuristics in the process. The result is a dataset where hypotheses that contradict the premise disproportionately contain “not”, “nobody”, “no”, “never” and “nothing”, while the entailed hypotheses are riddled with hypernyms like “animal”, “instrument” and “outdoors” to generalize over “dog”, “guitar” and “beach”, or approximate numbers like “at least three” instead of “two”. No wonder many examples can be accurately predicted from the hypothesis alone: all the model needs is to pick up on the presence of such words (a minimal sketch of such a hypothesis-only probe appears a few paragraphs below)! And because different annotators resort to different tactics, it helps the model to know which annotator created an example, while it struggles to correctly predict examples from new annotators.

In practice, learning this type of relation will prevent generalization to examples that do not show this correlation, and this generalization is precisely what we are after. After all, you don’t want to be right for the wrong reasons: you will be very easy to fool with adversarially constructed examples. And the best solution to this problem can be harsh, as in the above case, where it was decided not to include the dataset in the second iteration of the benchmark, a laudable example of attentiveness to advancing insights in our community.

At some point, you’ll have to embrace the chaos. Diversity in data is a good thing, and we should cherish it. From this viewpoint, annotator disagreement is a signal, not noise. We could even make ambiguity an explicit feature of our models, an approach that has been successfully applied in quality estimation of machine translation systems.
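Here is the hypothesis-only probe mentioned above, as a minimal sketch: a bag-of-words classifier that never sees the premise, fitted on a handful of invented NLI items rather than the real benchmark data. If a probe like this performs well above chance on a real dataset, the labels are leaking through surface cues.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy (hypothesis, label) pairs; the premise is deliberately left out.
train = [
    ("The dog is not outdoors.", "contradiction"),
    ("Nobody is playing a guitar.", "contradiction"),
    ("An animal is on the beach.", "entailment"),
    ("Someone is playing an instrument.", "entailment"),
]
hypotheses, labels = zip(*train)

# A bag-of-words model fitted on hypotheses alone: good accuracy here would mean
# the label is predictable from surface cues such as "not" or "nobody".
probe = make_pipeline(CountVectorizer(), LogisticRegression())
probe.fit(hypotheses, labels)
print(probe.predict(["A man is not holding an instrument."]))
```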
Explicit ambiguity in a dataset on frame semantics (from Dumitrache et al., 2019).

The first example fits relatively neatly with the categorization, as demonstrated by the high confidence in both of the labels and in the sentence overall. The second example shows a much greater overlap in labels, as it can be seen as a combination of each of them, to some degree.

Taking this one step further, you can decide to create a dataset that contains ambiguity on purpose. Instead of providing a single label per data point, annotators are allowed to provide multiple labels, and instead of relying on a single annotator per item, judgements are requested from multiple annotators. This multitude of judgements allows you to create a dataset with multiple correct answers, each weighted by a disagreement score that indicates the confidence in that label. Take the example above, which shows the results of that effort. The task is one of recognizing the multiple plausible word senses (“frames”), and you get a sense of the uncertainty surrounding each item. This uncertainty is expressed by the weights assigned to the classes and to the sentences (Dumitrache et al., 2019). The label score is the degree to which annotators agreed on that single label, weighted by the quality of the annotator, and the sentence score is the degree to which all annotators agreed on all the labels in the sentence.

In their research, Anca Dumitrache and her colleagues “found many examples where the semantics of individual frames overlap sufficiently to make them acceptable alternatives for interpreting a sentence.” She argues that ignoring this ambiguity creates an overly arbitrary target for training and evaluating natural language processing systems: “if humans cannot agree, why would we expect the answer from a machine to be any different?”

And indeed, our research is constantly evolving in this direction. This diversity of annotations is actually helping us build better labels, better tools, and ultimately better machine learning models. And while someone who’s pretty organised wouldn’t normally admit this, sometimes you just need to stop worrying and learn to embrace the chaos.