Large Language Models (LLMs) on the rise

2022-06-30 12:00 ELRC - European Language Resource Coordination

Following the recent technical workshop by ELRC and DG CNECT on large language models (LLMs), it is worth taking a closer look at European developments in this area, particularly where they concern less widely used and morphologically rich languages such as Polish. The National Information Processing Institute (Ośrodek Przetwarzania Informacji - Państwowy Instytut Badawczy, OPI PIB), a Polish interdisciplinary research institute, can boast interesting achievements in this field. Experts from its Laboratory of Linguistic Engineering (LIL) developed the Polish RoBERTa large model, which was trained on the largest text corpus ever used for Polish.

The work started with the extension of an existing text corpus, a collection of about 15 GB of text data previously used to train an ELMo model. As BERT-type models have a much larger capacity and require a correspondingly large dataset to fully exploit their potential, in December 2019 OPI PIB experts started downloading data from Common Crawl, a public archive containing petabytes of web page copies. After filtering and cleaning, the Common Crawl data from November-December 2019 and January 2020 yielded a sufficiently large collection. The actual training of the model lasted from February to May 2020. With a corpus of 130 GB of data, equivalent to over 400 thousand books, Polish RoBERTa large became the largest model ever trained in Poland.

The model was tested on the Comprehensive Language Evaluation List (the KLEJ benchmark) developed by the company Allegro, which made it possible to evaluate the model's performance on nine tasks, such as sentiment analysis and semantic similarity of texts. Based on the KLEJ analysis, the OPI PIB model took first place in the ranking.

In 2021, updated versions of the Polish RoBERTa models, along with a GPT-2 model designed for text generation tasks, were released. The base part of their data corpus consists of high-quality texts (Wikipedia, documents of the Polish parliament, statements from social media, books, articles, and longer written forms). The web part of the corpus, in turn, consists of extracts from websites (the Common Crawl project) that were filtered and thoroughly cleaned beforehand. Training a single neural language model takes about 3-4 months, but the results are very promising.

All neural models developed at OPI PIB target Polish texts, which is particularly valuable, as most solutions of this type are developed for English. These transformer-type models represent the syntax and semantics of Polish precisely and make it possible to build advanced Polish language processing tools. Commendably, the Institute makes the models available to the public free of charge on its website: https://opi.org.pl/modele-uczenia-maszynowego-udostepnione-przez-opi-pib/. In September, researchers from the Institute are expected to deliver a presentation at the 3rd National ELRC workshop in Warsaw.
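
The corpus-building step described above, distilling usable training text from raw Common Crawl dumps, hinges on filtering and cleaning. The article does not describe OPI PIB's actual pipeline, so the Python sketch below only illustrates the general idea with assumed heuristics: a minimum document length, an alphabetic-character ratio, a diacritic-based check that the text is plausibly Polish, and exact-duplicate removal by hashing. Production pipelines typically add proper language identification and near-duplicate detection.

    import hashlib
    import re

    # Polish diacritics; their presence is a cheap signal that text is Polish.
    POLISH_CHARS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

    def looks_polish(text, min_ratio=0.005):
        """Heuristic language check: require a minimal share of Polish diacritics."""
        if not text:
            return False
        hits = sum(1 for ch in text if ch in POLISH_CHARS)
        return hits / len(text) >= min_ratio

    def clean(text):
        """Drop control characters and normalise whitespace."""
        text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def filter_documents(docs):
        """Yield cleaned, deduplicated, plausibly Polish documents."""
        seen = set()
        for doc in docs:
            doc = clean(doc)
            if len(doc) < 200:            # drop very short fragments
                continue
            letters = sum(ch.isalpha() for ch in doc)
            if letters / len(doc) < 0.7:  # drop markup- and boilerplate-heavy pages
                continue
            if not looks_polish(doc):
                continue
            digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
            if digest in seen:            # exact-duplicate removal
                continue
            seen.add(digest)
            yield doc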
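
The KLEJ evaluation mentioned above amounts to fine-tuning the pretrained model on each of nine downstream tasks, sentiment classification among them. Below is a minimal fine-tuning sketch assuming the Hugging Face transformers library and PyTorch; the model identifier is an assumption (the authoritative source is the OPI PIB page linked above), and the two hard-coded examples merely stand in for the benchmark's real data.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    MODEL_ID = "sdadas/polish-roberta-large-v1"  # assumed identifier, see note above

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

    # Two toy examples standing in for a KLEJ sentiment task; the real
    # benchmark data is distributed separately via klejbenchmark.com.
    texts = ["Świetny produkt, gorąco polecam!", "Fatalna obsługa, nie polecam."]
    labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    for step in range(3):  # a few demonstration steps, not a real training run
        outputs = model(**batch, labels=labels)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"step {step}: loss = {outputs.loss.item():.4f}")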
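
Because the Institute publishes the models free of charge, they can be loaded with standard tooling. The sketch below assumes the checkpoints are mirrored in Hugging Face format under hypothetical identifiers; again, the official download location is the OPI PIB page cited in the article. It exercises both model families: masked-token prediction with Polish RoBERTa and open-ended generation with the Polish GPT-2.

    from transformers import pipeline

    # Assumed Hugging Face identifiers; the official distribution point is
    # the OPI PIB page linked in the article.
    fill_mask = pipeline("fill-mask", model="sdadas/polish-roberta-large-v1")
    masked = f"Warszawa to największe {fill_mask.tokenizer.mask_token} w Polsce."
    for pred in fill_mask(masked):
        print(f"{pred['token_str']:>12}  score={pred['score']:.3f}")

    generator = pipeline("text-generation", model="sdadas/polish-gpt2-medium")
    out = generator("Przetwarzanie języka naturalnego", max_new_tokens=20)
    print(out[0]["generated_text"])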