The Masakhane Project Puts Africa on the Machine Translation Map

How can MT work for low-resource languages? See how Masakhane is extending the machine translation map to Africa

2020-07-28 15:00 Slator

Machine translation (MT) has been shown to be stronger, and to improve more quickly, for languages with ample reference data. One area where such data has historically been lacking is Africa, whose 2,000-plus languages are underrepresented in natural language processing (NLP), according to Masakhane project co-founders and chief investigators Laura Martinus and Jade Abbott.

The two South Africans have described a self-defeating cycle in which speakers believe their languages will not be accepted as prime modes of communication. This, in turn, leads to a lack of funding for translation projects and a dearth of language resources; those that do exist are often siloed in country-specific institutions.

Inspired by the Deep Learning Indaba theme for 2018, Martinus and Abbott started the Masakhane project (whose name means “we build together” in isiZulu) to connect NLP professionals in different countries, with the ultimate goal of translating the Internet “and its content into our languages, and vice versa.” Now, over 60 participants in 15 countries are involved in a continent-wide effort to build MT models for African languages. (The Masakhane project also collaborates with the RAIL Lab at the University of the Witwatersrand and Translators Without Borders.)

The plan: gather language data and develop MT models, which will then be analyzed and fine-tuned. Martinus and Abbott have already trained models to translate English into five of South Africa’s 11 official languages (Afrikaans, isiZulu, Northern Sotho, Setswana, Xitsonga) using Convolutional Sequence-to-Sequence (ConvS2S) and Transformer architectures. They presented their findings at the 2019 Annual Meeting of the Association for Computational Linguistics (ACL).

Since being profiled by VentureBeat in November 2019, the group has continued its work with a range of languages and has made a point of releasing its gains publicly to combat the “low discoverability” of relevant resources, a major challenge for many African languages. Chief Investigator Kathleen Siminyu told Slator that the project now has benchmarks for 16 languages, which can be seen on the Masakhane project’s GitHub page. “We are currently getting a lot of submissions, so this number is increasing often,” Martinus told Slator. “There are a few people I know who want to submit benchmarks soon, but have yet to finish up.”

On a less field-specific platform, Abbott tweeted on January 22, 2020 that contributor Julia Kreutzer, a PhD student in Germany, had “used JoeyNMT to train an English-to-Afrikaans model and deploy it as a slack bot on our @MasakhaneMt slack account (Afrikaans chosen because as a German speaker, she could sorta figure out that it was sorta working).” Kreutzer has described JoeyNMT (also available on GitHub) as a “minimalist neural machine translation toolkit […] specifically designed for novices.” (A sketch of what such a training setup might look like appears at the end of this article.)

The Masakhane project plans to present at the AfricaNLP workshop set for April 2020 in Ethiopia. “At the moment, it looks like we will submit six papers, maybe more,” Siminyu said. Martinus added that many Masakhane participants are also currently writing papers for the first workshop on Resources for African Indigenous Languages (RAIL) in May 2020, to be hosted by the South African Centre for Digital Language Resources (SADiLaR).
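
For readers curious how an experiment like Kreutzer’s is set up: the sketch below shows what training an English-to-Afrikaans model with JoeyNMT might look like. JoeyNMT is driven by a YAML configuration file and a command-line interface; the config keys follow the JoeyNMT 1.x format, but the corpus paths (data/enaf/...), model size, and training hyperparameters here are illustrative assumptions, not the values actually used in the experiment described above.

```python
# A minimal sketch of training an English-to-Afrikaans model with JoeyNMT,
# in the spirit of the Slack-bot experiment described above. Config keys
# follow the JoeyNMT 1.x YAML format; all paths and hyperparameters are
# illustrative assumptions.
import subprocess

config = """
name: "en_af_transformer"

data:
    src: "en"                    # source suffix: train.en, dev.en, test.en
    trg: "af"                    # target suffix: train.af, dev.af, test.af
    train: "data/enaf/train"     # hypothetical BPE-preprocessed parallel corpus
    dev: "data/enaf/dev"
    test: "data/enaf/test"
    level: "bpe"
    lowercase: False
    max_sent_length: 100

model:
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        hidden_size: 256         # must equal embedding_dim for transformers
        ff_size: 1024
        dropout: 0.1
        embeddings:
            embedding_dim: 256
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        embeddings:
            embedding_dim: 256

training:
    optimizer: "adam"
    learning_rate: 0.0003
    batch_size: 4096
    batch_type: "token"
    epochs: 30
    eval_metric: "bleu"
    model_dir: "models/en_af_transformer"
    use_cuda: True
"""

with open("en_af_transformer.yaml", "w") as f:
    f.write(config)

# Train, then translate source sentences read from stdin, using JoeyNMT's CLI.
subprocess.run(["python3", "-m", "joeynmt", "train", "en_af_transformer.yaml"], check=True)
subprocess.run(["python3", "-m", "joeynmt", "translate", "en_af_transformer.yaml"], check=True)
```

Training writes checkpoints to model_dir; a trained model can then be served behind a chat integration such as a Slack bot, as in the experiment Abbott described.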