The trials and tribulations of digitizing Urdu, the 10th most spoken...


2022-02-14 16:12 multilingual


Urdu speakers have long lamented the poor digitization of one of their traditional scripts, Nastalīq. A recent story in Rest of World highlights the trials and tribulations that some Urdu speakers have endured to adapt their script to digital formats.

Nastalīq is an Arabic-based script originally developed for Persian calligraphy. Its characters are mostly similar in shape to standard Arabic characters (also known as Naskh), but they are traditionally written along a fluid, diagonal orientation in which letters at the beginning of a word sit slightly higher than letters at the end. Persian speakers reserve the script for poetry and, as a result, have not faced the same struggle to use their language effectively in digital formats. In Urdu, by contrast, Nastalīq is far more widespread in day-to-day life. As digital communication has grown in importance over the last three decades or so, Nastalīq users have struggled to make the script work in digital formats.

Historically, system interfaces have been built primarily with Latinate scripts in mind, in which the characters are neatly arranged along horizontal lines and written from left to right. Tech leaders have generally been slow to functionally digitize scripts that deviate from the grid-like orientation of scripts written strictly along straight lines from left to right or right to left. Even the traditional Mongolian script, which is written along straight vertical lines from top to bottom, has yet to be fully utilized online. Because Nastalīq has been so challenging to adapt to a digital format, many Urdu speakers have taken to using Naskh, which is written along straight horizontal lines, or even a non-standardized form of the language written in Latin script.

In 2014, Pakistani American programmer Mudassir Azeemi wrote a letter to Apple explaining the issue. Three years later, the company released its first Nastalīq typeface for iOS users, but some pitfalls remain. Over time, this Latin-centric approach to rendering text digitally has become somewhat fossilized. One Pakistani software engineer, Zeerak Ahmed, told Rest of World that some words are rendered too small to read in Apple’s typeface.

Ahmed is currently developing an Urdu language dataset to support better machine learning projects for the language. Because Urdu has taken so long to digitize properly, artificial intelligence models for the language also lag behind their counterparts for other languages. Although Urdu has more than 200 million speakers worldwide (Ethnologue states that it is the tenth most widely spoken language in the world), it has not enjoyed the same level of advancement in fields like machine translation as other widely spoken languages such as Hindi, Arabic, or English. Ahmed told Rest of World that, unfortunately, “all Urdu software is broken because the underlying data is broken.”