The First Large Language Model Supporting All EU Languages Is Here

2024-09-27 09:00 Slator

On September 24, 2024, researchers from Unbabel, the University of Edinburgh, CentraleSupélec, and other partners introduced the EuroLLM project and released its first models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, as part of an open-weight, open-source suite of large language models (LLMs).

In a post on X, Pedro Martins, Senior AI Research Scientist at Unbabel, highlighted that the models can "understand and generate text in all EU languages." Specifically, the models support the 24 official EU languages and 11 additional languages, including Arabic, Russian, Turkish, and Chinese. Manuel Faysse, Research Scientist at Illuin Technology, noted in another post on X that EuroLLM has "a strong focus on multilinguality."

The researchers explained that while models like OpenAI's GPT-4 and Meta's LLaMA have brought significant advancements, they remain largely focused on English and a few high-resource languages, leaving many languages underserved. To address this, the EuroLLM team aims to create "a suite of LLMs capable of understanding and generating text in all European Union languages […] as well as some additional relevant languages."

EuroLLM-1.7B was trained on 4 trillion tokens divided across the considered languages and several data sources, including web data, parallel data (en-xx and xx-en), and high-quality datasets from sources like Wikipedia and arXiv. The EuroLLM-1.7B-Instruct model was further instruction-tuned on EuroBlocks, an instruction-tuning dataset designed for general instruction-following and machine translation (MT).

The team evaluated EuroLLM-1.7B-Instruct on several MT benchmarks, including FLORES-200, WMT-23, and WMT-24, and compared it with Gemma-2B and Gemma-7B, both instruction-tuned on EuroBlocks. They used COMET-22 to score the models' MT performance. Despite its small size, EuroLLM-1.7B-Instruct outperformed Gemma-2B-Instruct on all language pairs and datasets and remained competitive with Gemma-7B-Instruct.

Martins, in another X post, emphasized, "EuroLLM-1.7B excels at machine translation." Faysse added, "For the small size, it really excels on translation tasks, which is super promising once we'll scale up."

While the models demonstrate strong translation capabilities, the researchers acknowledged that EuroLLM-1.7B has not yet been fully aligned with human preferences, which means it may occasionally produce problematic outputs, such as hallucinations or inaccurate statements.

Looking ahead, the EuroLLM team plans to scale up the model and improve data quality. Martins and Ricardo Rei, also a Senior Research Scientist at Unbabel, both confirmed this in posts on X, with Rei teasing, "New models are coming (9B and 22B) as well as strong instruct models! Stay tuned!"

The EuroLLM models are now available on Hugging Face.

Authors: Pedro Henrique Martins, Patrick Fernandes, João Alves, Nuno M. Guerreiro, Ricardo Rei, Duarte M. Alves, José Pombal, Amin Farajian, Manuel Faysse, Mateusz Klimaszewski, Pierre Colombo, Barry Haddow, José G. C. de Souza, Alexandra Birch, and André F. T. Martins.
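For readers who want to try the models, the snippet below is a minimal sketch of loading EuroLLM-1.7B-Instruct from Hugging Face and prompting it for translation with the Transformers library. The repository id and the chat-template prompt format are assumptions based on common Hugging Face conventions, not details confirmed in the article.

```python
# Minimal sketch: load EuroLLM-1.7B-Instruct and prompt it for translation.
# The repo id and prompt format below are assumptions, not confirmed details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-1.7B-Instruct"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Instruction-tuned models are typically queried through their chat template.
messages = [
    {
        "role": "user",
        "content": "Translate the following English sentence to German: "
                   "The weather is nice today.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```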
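Since the evaluation relied on COMET-22, here is also a minimal sketch of scoring translations with Unbabel's open-source comet package (installed via `pip install unbabel-comet`) and the wmt22-comet-da checkpoint. The example sentences are illustrative; this is not the team's actual evaluation pipeline.

```python
# Minimal sketch: score MT output with COMET-22 (Unbabel/wmt22-comet-da).
# The example data is illustrative; this is not the authors' pipeline.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # COMET-22 checkpoint
comet_model = load_from_checkpoint(model_path)

# COMET-22 scores each (source, hypothesis, reference) triplet.
data = [
    {
        "src": "Das Wetter ist heute schön.",
        "mt": "The weather is nice today.",
        "ref": "The weather is lovely today.",
    }
]
result = comet_model.predict(data, batch_size=8, gpus=0)  # gpus=0 -> CPU
print(result.scores)        # per-segment scores
print(result.system_score)  # corpus-level average
```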