Difficulties With Developing NLP for Vietnamese

2023-02-27 18:25 GALA

The Vietnamese language is spoken by around 75 million people across the world and is the official language of Vietnam. It has undergone several centuries of development, starting with the borrowing of words from Chinese and continuing to the 17th century, when the language was Romanized by a Jesuit missionary. Today, the Vietnamese alphabet contains 29 letters, and the writing system uses one digraph and nine diacritics. Five of these diacritics designate tone, while the remaining four form separate letters of the alphabet. In other words, some marks indicate tone while others change the letter itself.

What makes Vietnamese harder still is its tonal system, which is more complex than that of Mandarin Chinese. Vietnamese has six basic tones, two more than Mandarin, and these tones are pronounced differently depending on the region of Vietnam.

Thus, natural language processing (NLP) for Vietnamese faces particular difficulties that need further exploration if English-to-Vietnamese and Vietnamese-to-English translation is to become more accurate. Wondering what some of these challenges are? Let’s take a look below.

Challenges When Developing NLP for Vietnamese

NLP for Vietnamese is a complex field when it comes to producing accurate translations. Creating algorithms and software to translate any language is difficult enough. But English-to-Vietnamese translation, especially when NLP is involved, poses many difficulties for human translators and machines alike. Here are a few of them.

Let’s begin with the basics. A word is a linguistic unit made up of one or more morphemes.
Word segmentation, in turn, is the process by which a computer program determines the word boundaries in a sentence or document. This matters for Vietnamese in particular, because written Vietnamese puts spaces between syllables rather than between words. Word segmentation is therefore one of the first steps to consider in NLP for Vietnamese, and getting it wrong can make the rest of the translation nonsensical or inaccurate.

Next up is part-of-speech (POS) tagging. In NLP, POS tagging means assigning each word its part of speech (noun, verb, adjective, and so on) based on both its definition and its context in the sentence. In Vietnamese, for example, a sentence that can be read as “The old man walks too fast” can also mean “The father walks too fast”, “The old man died too fast”, “My father died too fast”, “You get old too fast”, “Grandfather gets old too fast”, and more. There is therefore a lot of ambiguity in the language that must be resolved before an accurate translation can be made.

Following POS tagging is the challenge of syntactic parsing, which deals with the syntactic structure of a sentence. According to sources, “the word ‘syntax’ refers to the grammatical arrangement of words in a sentence and their relationship with each other. The objective of syntactic analysis is to find the syntactic structure of a sentence which is usually depicted as a tree.” In its basic sentence structure, Vietnamese is similar to English: both follow a Subject+Verb+Object order. NLP for Vietnamese needs to take this into consideration in translation, too.
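
To make the word segmentation step described earlier in this section more concrete, here is a minimal sketch of greedy longest-match segmentation, one of the simplest dictionary-based approaches. The three-entry lexicon, the function name `segment`, and the use of underscores to join the syllables of one word are all assumptions made for illustration; this is not how any particular Vietnamese NLP tool is implemented.

```python
# Toy lexicon of two-syllable Vietnamese words (hypothetical, for illustration only).
LEXICON = {
    ("sinh", "viên"),  # "university student"
    ("học", "sinh"),   # "pupil"
    ("sinh", "học"),   # "biology"
}
MAX_WORD_LEN = max(len(entry) for entry in LEXICON)


def segment(sentence: str) -> list[str]:
    """Greedy left-to-right longest-match segmentation over syllables."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to a single syllable.
        for size in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            candidate = tuple(s.lower() for s in syllables[i:i + size])
            if size == 1 or candidate in LEXICON:
                # Join the syllables of one word so the boundary is visible.
                words.append("_".join(syllables[i:i + size]))
                i += size
                break
    return words


print(segment("Tôi là sinh viên"))       # ['Tôi', 'là', 'sinh_viên']
print(segment("Học sinh học sinh học"))  # ['Học_sinh', 'học_sinh', 'học']
```

The first sentence ("I am a student") is segmented correctly. The second is intended to read "Học_sinh học sinh_học" ("pupils study biology"), but the greedy strategy commits to "học_sinh" twice and gets it wrong, which is one reason pure dictionary matching is not enough on its own.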

Named-entity recognition (NER) is another aspect of NLP that must be taken into account when translating Vietnamese, whether from English to Vietnamese or from Vietnamese to English. Essentially, NER identifies names of people, organizations, locations, times, quantities, monetary values, percentages, and more within a sentence, giving the system more context about the text so that it can produce a logical and accurate translation. One example sentence that illustrates this point is: “Ousted XYZ founder John Jones sells London penthouse for £10 million”. This sentence contains information about the organization (XYZ), the person (John Jones), the location (London), and the monetary value (£10 million). Each of these components helps give the sentence its meaning. This is why NLP for Vietnamese needs to take NER into account in order to produce accurate translations. However, NER is not always straightforward, and NLP software must have accurate NER inputs to yield the desired result.

Coreference resolution (CR) is a task closely related to NER and is sometimes treated as a subtask of it. When a sentence or document refers to an entity, it is common to use a pronoun rather than repeat the same entity several times. For example, one would not say “John Jones is selling John Jones’ penthouse” but rather “John Jones is selling his penthouse”, and a translation system has to work out that “his” refers to John Jones. CR, however, has received very little attention in the Vietnamese NLP community. In fact, it appears at present that only two researchers have used CR as a subtask of NER. This is another challenge for NLP for Vietnamese: there is simply too little annotated data available to yield better results.
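
Returning to the entity types in the John Jones example, the sketch below shows roughly what an NER step produces for that sentence. It is a deliberately simplified illustration: the hand-written `GAZETTEER`, the `MONEY` pattern, and the function `tag_entities` are assumptions made for this example, whereas real NER systems are typically trained on annotated data rather than built from lookup tables.

```python
import re

# Hypothetical lookup table; a real system would learn entities from annotated data.
GAZETTEER = {
    "XYZ": "ORGANIZATION",
    "John Jones": "PERSON",
    "London": "LOCATION",
}
# Matches amounts such as "£10 million" or "$5".
MONEY = re.compile(r"[£$€]\d+(?:\.\d+)?(?:\s(?:million|billion))?")


def tag_entities(text: str) -> list[tuple[str, str]]:
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    # Sort by start offset so the entities come out in reading order.
    return [(span, label) for _, span, label in sorted(found)]


sentence = "Ousted XYZ founder John Jones sells London penthouse for £10 million"
for span, label in tag_entities(sentence):
    print(f"{label:<12} {span}")
```

A lookup table like this only works for names it already knows, which is exactly why the shortage of annotated Vietnamese data described above is such a limiting factor.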

Other Challenges

NLP for Vietnamese must also take into consideration Vietnam’s unique writing system and the lack of resources for the Vietnamese language. For example, some sources state that approximately 40,000 to 50,000 Vietnamese words have been defined in modern dictionaries. This, coupled with the fact that several space-separated units in Vietnamese text often represent a single word, makes translation between English and Vietnamese much more difficult in either direction.

NLP for Vietnamese Is Making Progress

Despite the challenges outlined above, research is slowly but surely making progress in identifying the difficulties of NLP for Vietnamese and in addressing the intricacies of the language when developing software and algorithms for more accurate translation. Several studies have found that hybrid algorithms can address these challenges with relatively high accuracy. Nevertheless, a lack of resources still keeps this work from advancing as quickly as it could. Translators who are aware of these difficulties will be better placed to render the language faithfully. In addition, as technology advances and more research is carried out in this field, we are likely to see better NLP for Vietnamese in the future.
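
The writing-system point also has a concrete encoding consequence: the same Vietnamese letter can be stored either as one precomposed Unicode character or as a base letter followed by combining marks, so text usually needs normalizing before any further processing. The sketch below uses Python's standard `unicodedata` module to separate the five tone marks from the letter-forming diacritics. The helper name `split_tone` and the `TONE_MARKS` table are illustrative assumptions, not part of any library.

```python
import unicodedata

# Combining marks for the five tone diacritics, mapped to the usual tone names.
TONE_MARKS = {
    "\u0300": "huyền",  # grave accent
    "\u0301": "sắc",    # acute accent
    "\u0303": "ngã",    # tilde
    "\u0309": "hỏi",    # hook above
    "\u0323": "nặng",   # dot below
}


def split_tone(syllable: str) -> tuple[str, str]:
    """Separate a syllable into its toneless form and the name of its tone."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "ngang"  # the sixth tone, which is written without a mark
    kept = []
    for char in decomposed:
        if char in TONE_MARKS:
            tone = TONE_MARKS[char]
        else:
            kept.append(char)
    # NFC recombines what is left, so letter-forming marks such as the
    # circumflex in "ê" stay attached to their letter.
    return unicodedata.normalize("NFC", "".join(kept)), tone


for syllable in ["ma", "má", "mà", "mả", "mã", "mạ", "Việt"]:
    print(syllable, split_tone(syllable))
```

Here "Việt" comes back as "Viêt" with the tone "nặng": the dot below is removed as a tone mark, while the circumflex is kept because it belongs to the letter ê.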