The Underrepresentation of African Languages in Tech

2021-06-17 01:00 multilingual

Like any other continent on Earth, Africa is home to hundreds of indigenous languages, and potentially thousands, depending on who you ask. Some of them you’ve likely heard of: languages like Kiswahili or Igbo, which are spoken by millions of people. Then, of course, there’s a wealth of languages spoken by smaller populations, often limited to small geographical areas, and a wide range of languages between those two ends of the spectrum.

While the languages of Africa represent a host of language families, features, and cultures, they have one thing in common, no matter how big or small: They’ve been all but neglected by the tech industry. Even widely spoken African languages like Kiswahili are underrepresented in fields like machine translation (MT) and speech recognition, even though they are spoken across numerous nations.

“We are not the main target market for these big companies, the companies like Google, Amazon, Apple, have really concentrated on maintaining their key business sectors, their biggest clients, who are mainly Europe and the U.S.,” said Joshua Businge, a Uganda-based tech entrepreneur, in an interview with DW News.

For example, Apple’s virtual assistant Siri is capable of speech recognition in numerous languages and dialects, though most are Indo-European or native to Asia. Two Afro-Asiatic languages, Arabic and Hebrew, are available on Siri; however, both of these are native to Asia. Although Kiswahili has a native speaker base roughly ten times the size of Hebrew’s, it is nowhere to be seen on the list of languages available on Siri, indicating a significant oversight on Apple’s part.

In the field of speech recognition, there appear to have been some major strides toward improving accessibility for African languages: Facebook’s recently launched wav2vec-Unsupervised program can develop speech recognition systems for under-resourced languages like Kiswahili because it works on untranscribed speech data. Traditional speech recognition systems, by contrast, require transcribed speech, and transcription takes significant time and energy. As for MT, things may be looking up in this field as well: A team of researchers across the continent recently won the Wikimedia Foundation Research Award for their development of MT benchmarks for African languages classified as “left behind.”