[GALA Valencia 2024] Can we obtain enough quality data for AI…


2024-05-04 00:50 GALA

Word count: ~350; reading time: 4 minutes


AI's efficacy relies heavily on the quality and quantity of training data. In recent years, we have witnessed a profound shift in the field, driven by the advent of Neural Machine Translation (NMT). Today, we stand on the cusp of another transformative era with the rise of GenAI. Language Service Providers (LSPs) often grapple with a dual challenge: acquiring the technical expertise to create, integrate, and harness these cutting-edge technologies, and ensuring access to sufficient high-quality data to effectively train and personalize deep learning-based systems. The overarching question in the forums is clear: while technical know-how is crucial, the backbone of AI's prowess lies in the availability of robust data. Obtaining such data can be an expensive and labor-intensive endeavor. To thrive in this rapidly evolving landscape, we need a technology capable of surmounting this obstacle, whether we are building models from scratch or customizing Large Language Models (LLMs) for translation and other applications.

Are you in need of a bilingual TMX corpus for a specific language pair? Do you need that corpus to cover particular topics or domains? Or are you looking for monolingual corpora in various languages?

Enter SmartBiC (Bilingual Corpora), a project funded by the European Union under the NextGenerationEU initiative's 2021 call for Research and Development projects in Artificial Intelligence and digital technologies, driven by the Spanish public business entity RED.ES. SmartBiC builds on the success of the Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages project (paracrawl.eu) and is scheduled to conclude in June 2024. SmartBiC is designed to equip us with the technology needed to efficiently identify, collect, align, tag, and filter bilingual data from the Internet.
This data will serve as the bedrock for training selective neural engines and Deep Learning models, empowering a range of applications and sectors. These include targeted search for bilingual text based on specific criteria or domain focus; training and customization of neural machine translation systems and LLMs; training other neuro-linguistic systems; and terminology extraction, text preprocessing, cleaning, filtering, and annotation, among others.
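To give a concrete flavor of the filtering step mentioned above, here is a minimal sketch of one of the simplest heuristics commonly applied when cleaning web-crawled bilingual data: rejecting segment pairs whose lengths differ too much, which often signals misalignment. The threshold and sample pairs are invented for illustration and are not taken from the SmartBiC pipeline itself.

```python
# Minimal sketch of a length-ratio filter for web-crawled segment pairs.
# A real pipeline (as described for SmartBiC) would combine many such
# signals: language identification, fluency scores, deduplication, etc.

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Accept a segment pair only if character lengths are comparable."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return False  # empty segments are never valid pairs
    return max(a, b) / min(a, b) <= max_ratio

# Invented example pairs: one plausible translation, one misalignment.
pairs = [
    ("Good morning", "Buenos días"),
    ("Good morning", "El acuerdo de licencia completo aparece abajo"),
]
clean = [(s, t) for s, t in pairs if length_ratio_ok(s, t)]
# Only the plausible pair survives the filter.
```

In practice this heuristic is tuned per language pair (some languages are systematically longer than others) and used alongside stronger statistical filters rather than on its own.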