ELRC at LREC 2022

2022-07-20 12:00 ELRC (European Language Resource Coordination)

ELRC participated in LREC 2022 in Marseille with an on-site booth at the HLT Village and a remote presentation of the paper "ELRC Action: Covering Confidentiality, Correctness and Cross-linguality", which describes the language technology (LT) assessments performed as part of the ELRC action of the European Commission.

The assessments consist of testing various tools and techniques, documenting them in a hands-on way, performing experiments with them, and setting up proof-of-concept environments that demonstrate their potential and their challenges to EC staff and EU Member State representatives, thus facilitating their uptake by public sector users. The paper zoomed in on the two most extensive assessments (LT specifications), including a consultation round with various types of stakeholders.

In the Automated Anonymisation specification, tools and techniques for de-identifying monolingual or bilingual texts were investigated. These tools aim at replacing Named Entities (NE) and specific patterns with NE labels or other words of a similar type, thus supporting the effort to make (unstructured) text GDPR-compliant for organisations that want to store and process text containing personal information and/or share it with other organisations. Of course, the sensitivity of the information influences the choice of replacement strategy; using similar words instead of NE labels might be better suited to hampering malicious attempts at re-identification, but is also more challenging from a linguistic point of view.

One of the long-term goals is to give the user of an anonymisation tool sufficient control, i.e. the possibility to create custom NE lists and patterns and, ideally, to run the tool in-house. However, a lot of exploration is still needed before machine translation (MT) systems can properly translate anonymised text.

In the Multilingual Fake News Processing specification, tools and techniques for detecting articles that spread false information were investigated. The goal is to keep readers from being deceived and, as such, to help prevent damage at political or other levels. Despite the global character of disinformation, the publicly available datasets required for training deep learning models are limited in terms of available languages. Therefore, the specification concentrates on ways to increase multilingual support.

In addition to experiments with supervised classification, which make use of text-inherent as well as categorical and numerical features (such as the Alexa rank), a novel approach for unsupervised classification was proposed. This approach applies anomaly detection to train a model using various types of features, but making use only of articles known to constitute true news. This strategy aims at reducing the impact of data sparsity, at the level of both languages and topics. When the model is applied to an unseen article, the article is considered fake news if the model identifies it as an anomaly.
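
The idea of training only on true news and flagging outliers can be illustrated with a minimal sketch. It is not the model from the paper: the three text features, the z-score rule, the threshold and the sample sentences are all assumptions chosen to keep the example small and dependency-free.

```python
from statistics import mean, pstdev


def extract_features(text):
    """Toy text-inherent features (hypothetical, not the paper's feature set)."""
    words = text.split()
    n_words = max(len(words), 1)
    letters = [c for c in text if c.isalpha()]
    n_letters = max(len(letters), 1)
    return [
        text.count("!") / n_words,                      # exclamation marks per word
        sum(c.isupper() for c in letters) / n_letters,  # share of uppercase letters
        sum(len(w) for w in words) / n_words,           # average token length
    ]


class TrueNewsAnomalyDetector:
    """Fits per-feature mean and spread on true news only (assumed z-score model)."""

    def __init__(self, threshold=3.0, min_std=0.05):
        self.threshold = threshold  # assumed cut-off, in standard deviations
        self.min_std = min_std      # floor that avoids division by zero
        self.means = []
        self.stds = []

    def fit(self, true_articles):
        rows = [extract_features(text) for text in true_articles]
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        self.stds = [max(pstdev(col), self.min_std) for col in columns]
        return self

    def score(self, text):
        """Largest absolute z-score over all features."""
        feats = extract_features(text)
        return max(abs(f - m) / s for f, m, s in zip(feats, self.means, self.stds))

    def is_anomaly(self, text):
        """An unseen article is treated as fake news if it is an anomaly."""
        return self.score(text) > self.threshold


# Invented sample sentences standing in for articles known to be true news.
TRUE_NEWS = [
    "The city council approved the new budget on Tuesday after a short debate.",
    "Researchers published a study on regional language resources this week.",
    "The ministry announced updated guidelines for public sector translation services.",
    "Local officials confirmed that the bridge will reopen next month.",
]

detector = TrueNewsAnomalyDetector().fit(TRUE_NEWS)
print(detector.is_anomaly("The agency released its annual report on Monday."))
print(detector.is_anomaly("SHOCKING!!! You WON'T believe what THEY are hiding!!!"))
```

Because no fake-news examples are needed for training, a detector of this kind can in principle be fitted for any language or topic for which trustworthy articles exist, which is the data-sparsity argument made above. A realistic system would use far richer features and a dedicated one-class model.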
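
The two replacement strategies discussed for the Automated Anonymisation specification (NE labels versus similar words of the same type) can likewise be sketched in a few lines. This is an illustration only, assuming a user-supplied entity list and pattern set; the names `CUSTOM_ENTITIES`, `CUSTOM_PATTERNS`, `SURROGATES` and `anonymise` are hypothetical and do not come from any ELRC tool.

```python
import re
from itertools import cycle

# Hypothetical user-defined NE lists and patterns (the "custom" control
# mentioned in the text); all entries are invented for illustration.
CUSTOM_ENTITIES = {
    "PERSON": ["Maria Rossi", "Jan Novak"],
    "ORG": ["Acme Corp"],
}
CUSTOM_PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
}
# Replacement words of the same type, used by the "substitute" strategy.
SURROGATES = {
    "PERSON": ["Alex Smith", "Kim Jones"],
    "ORG": ["Example Ltd"],
    "EMAIL": ["user@example.org"],
}


def build_pattern():
    """Combine patterns and entity lists into one regex with a named group per label."""
    parts = [f"(?P<{label}>{regex})" for label, regex in CUSTOM_PATTERNS.items()]
    for label, names in CUSTOM_ENTITIES.items():
        # Longest names first so that a longer entity wins over a shorter prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternation = "|".join(re.escape(name) for name in ordered)
        parts.append(f"(?P<{label}>\\b(?:{alternation})\\b)")
    return re.compile("|".join(parts))


def anonymise(text, strategy="label"):
    """Replace matches with an NE label ("label") or a same-type word ("substitute")."""
    pools = {label: cycle(words) for label, words in SURROGATES.items()}
    mapping = {}  # original string -> surrogate, so repeated mentions stay consistent

    def replace(match):
        label = match.lastgroup
        if strategy == "label":
            return f"[{label}]"
        original = match.group(0)
        if original not in mapping:
            mapping[original] = next(pools[label])
        return mapping[original]

    return build_pattern().sub(replace, text)


TEXT = ("Maria Rossi of Acme Corp wrote to Jan Novak at "
        "jan.novak@acme.example. Maria Rossi signed it.")

print(anonymise(TEXT))
print(anonymise(TEXT, strategy="substitute"))
```

The label strategy makes the redaction obvious, while the substitute strategy keeps the text fluent, which is the trade-off against re-identification and linguistic difficulty noted in the article. A production tool would rely on a trained NE recogniser rather than fixed lists, and would have to handle inflection and agreement when substituting words.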