Hugging Face dives into machine translation with release of 1,000 models


2020-05-27 22:30 VentureBeat



Hugging Face is taking its first step into machine translation this week with the release of more than 1,000 models. Researchers trained the models using unsupervised learning and the Open Parallel Corpus (OPUS). OPUS is a project undertaken by the University of Helsinki and global partners to gather and open-source a wide variety of language data sets, particularly for low-resource languages. Low-resource languages are those with less training data than more commonly used languages like English.

Started in 2010, the OPUS project incorporates popular data sets like JW300. Available in 380 languages, the Jehovah's Witness text is used by a number of open source projects for low-resource languages, like Masakhane, to create machine translation from English to 2,000 African languages. Translation can enable interpersonal communication between people who speak different languages and empower people around the world to participate in online and in-person commerce, something that will be especially important for the foreseeable future.

The launch Thursday means models trained with OPUS data now make up the majority of models provided by Hugging Face, and it makes the University of Helsinki's Language Technology and Research Group the largest contributing organization.

Before this week, Hugging Face was best known for enabling easy access to state-of-the-art language models and language generation models, like Google's BERT, which can predict the next characters, words, or sentences that will appear in text. With more than 500,000 pip installs, the Hugging Face Transformers library for Python includes pretrained versions of state-of-the-art NLP models like Google AI's BERT and XLNet, Facebook AI's RoBERTa, and OpenAI's GPT-2.

Hugging Face CEO Clément Delangue told VentureBeat that the venture into machine translation was a community-driven initiative the company undertook to build more community around cutting-edge NLP, following a $15 million funding round in late 2019.

"Because we open source, and so many people are using our libraries, we started to see more and more groups of people in different languages getting together to work on pretraining some of our models in different languages, especially low resource languages, which are kind of like a bit forgotten by a lot of people in the NLP community," he said. "It made us realize that in our goal of democratizing NLP, a big part to achieve that was not only to get the best results in English, as we've been doing, but more and more provide access to other languages in the model and also provide translation."

Delangue also said the decision was driven by recent advances in machine translation and sequence-to-sequence (Seq2Seq) models. Hugging Face first started working with Seq2Seq models in the past few months, Delangue said. Notable recent machine translation models include T5 from Google and Facebook AI Research's BART, an autoencoder for training Seq2Seq models.

"Even a year ago we might not have done it just because the results of pure machine translation weren't that good. Now it's getting to a level where it's starting to make sense and starting to work," he said. Delangue added that Hugging Face will continue to explore data augmentation techniques for translation.

The news follows an integration earlier this week with Weights & Biases to power visualizations that track, log, and compare training experiments. Hugging Face brought its Transformers library to TensorFlow last fall.
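For readers who want to try the new translation models, they are hosted on the Hugging Face model hub and can be loaded through the Transformers library. The sketch below is illustrative rather than taken from the article: the English-to-French checkpoint name and the MarianMT classes are assumptions based on the hub's public naming conventions, and other language pairs swap in the same way.

```python
# Minimal sketch: translating English to French with an OPUS-MT model.
# Assumes the `transformers` library is installed and that the checkpoint name
# below exists on the model hub (other pairs follow the opus-mt-{src}-{tgt} pattern).
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"  # illustrative checkpoint name
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the source sentence, generate a translation, and decode it back to text.
batch = tokenizer(["Hugging Face released over 1,000 translation models."],
                  return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```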