1. What is term extraction?
Identifying equivalents for specialised terms is a major part of any translation job. Subject fields, such as the various sectors of law and industry, all have significant amounts of field-specific terminology. In addition, many document initiators use their own preferred terminology. Researching the specific terms needed to complete a given translation is time-consuming, and term extraction tools have proved to be of great help.
Term extraction is normally either monolingual or bilingual. Monolingual term extraction attempts to analyse a text or corpus in order to identify candidate terms, while bilingual term extraction analyses existing source texts along with their translations in an attempt to identify potential terms and their equivalents.
Term extraction tools can therefore assist in populating term bases and setting up the terminology for specific tasks or projects. Although the tools facilitate extraction, the resulting list of candidate terms must still be verified by a human terminologist or translator. The process of term extraction is thus computer-aided rather than fully automatic.
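The text does not say how a bilingual tool pairs a term with its equivalent. One simple idea, assumed here purely for illustration, is to score how often a source word and a target word appear in the same aligned segment, using the Dice coefficient. The segment pairs and the function name `dice_scores` are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical aligned segments (English source, French translation).
ALIGNED = [
    ("open the term base", "ouvrir la base terminologique"),
    ("close the file", "fermer le fichier"),
    ("open the file", "ouvrir le fichier"),
    ("close the term base", "fermer la base terminologique"),
]


def dice_scores(pairs):
    """Score each (source word, target word) pair by segment co-occurrence.

    Dice = 2 * joint / (source count + target count), counted per segment.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


scores = dice_scores(ALIGNED)
# Target words ranked for the source word "open", best first.
ranked = sorted(
    ((t, score) for (s, t), score in scores.items() if s == "open"),
    key=lambda item: (-item[1], item[0]),
)
print(ranked[0])
```

On this toy data "open" pairs cleanly with "ouvrir", but "term" scores equally well against "la", "base" and "terminologique", which is one reason the candidate list still needs human verification.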
2. Main term extraction approaches/methods
Three main term extraction approaches are usually implemented in terminology management: linguistic, statistical, and hybrid.
Linguistic
Term extraction tools using a linguistic approach typically attempt to identify word combinations that match certain morphological or syntactic patterns (e.g. “adjective+noun” or “noun+noun”). For this purpose, parsers, part-of-speech taggers and morphological analysers are used to annotate the content of the corpus, and term candidates are then filtered using pattern-matching techniques. The linguistic approach is heavily language-dependent because term formation patterns differ from language to language. Consequently, term extraction tools that use a linguistic approach are generally designed to work in a single language (or closely related languages) and cannot easily be extended to other languages. They are therefore not well suited for integration into translation memory (TM) systems, which are usually language-independent.
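The pattern-matching step can be sketched as follows. This is a minimal illustration, not how any particular tool works: the input is assumed to be already tagged, and the tag names, the sample tokens and the function `match_patterns` are all hypothetical.

```python
# Hypothetical pre-tagged input: (token, part-of-speech) pairs,
# of the kind a part-of-speech tagger might produce.
TAGGED = [
    ("the", "DET"), ("statistical", "ADJ"), ("approach", "NOUN"),
    ("uses", "VERB"), ("a", "DET"), ("frequency", "NOUN"),
    ("threshold", "NOUN"), (".", "PUNCT"),
]

# Illustrative English patterns only; other languages need different ones.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN")]


def match_patterns(tagged, patterns):
    """Return every token sequence whose tag sequence equals one of the patterns."""
    candidates = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if tuple(tag for _, tag in window) == tuple(pattern):
                candidates.append(" ".join(token for token, _ in window))
    return candidates


linguistic_candidates = match_patterns(TAGGED, PATTERNS)
print(linguistic_candidates)  # ['statistical approach', 'frequency threshold']
```

The patterns are the language-specific part: moving to another language would mean a different tagger and a different pattern list.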
Statistical
Term extraction tools using a statistical approach look for repeated sequences of lexical items. The user can often specify the frequency threshold, that is, the number of times a word or sequence of words must be repeated to be considered a candidate term. The major strength of the statistical approach is its language-independence.
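A minimal sketch of frequency-based extraction, assuming simple regular-expression tokenisation; the function name `repeated_sequences` and its parameters `max_len` and `min_freq` are made up for this example, with `min_freq` playing the role of the user-specified frequency threshold.

```python
import re
from collections import Counter


def repeated_sequences(text, max_len=3, min_freq=2):
    """Count word sequences of 1..max_len words and keep those seen at least min_freq times.

    Punctuation is discarded, so sequences may span sentence boundaries.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {seq: freq for seq, freq in counts.items() if freq >= min_freq}


sample = (
    "The term base stores approved terms. "
    "A term base is shared by the team. "
    "Each term base entry is reviewed."
)
frequent = repeated_sequences(sample, max_len=2, min_freq=3)
print(frequent)  # {'term': 3, 'base': 3, 'term base': 3}
```

Nothing here depends on the language of the text. The output also contains the single words "term" and "base" alongside "term base", the kind of result a human reviewer would need to sort through.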
Hybrid
The most common approach in term extraction is the hybrid one, which uses both statistical and linguistic information. Although such approaches are mainly statistical, they incorporate syntactic rules and filters so that only candidate terms with certain syntactic structures are selected.
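A hybrid filter can be sketched by counting two-word sequences and then keeping only those that are both frequent enough and syntactically plausible. The tagged sentences, the tag names and the function `hybrid_candidates` are hypothetical and serve only to illustrate the combination.

```python
from collections import Counter

# Hypothetical tagged corpus: one list of (token, tag) pairs per sentence.
TAGGED_SENTENCES = [
    [("the", "DET"), ("term", "NOUN"), ("base", "NOUN"), ("is", "VERB"), ("empty", "ADJ")],
    [("open", "VERB"), ("the", "DET"), ("term", "NOUN"), ("base", "NOUN")],
    [("the", "DET"), ("term", "NOUN"), ("list", "NOUN"), ("is", "VERB"), ("long", "ADJ")],
]

# Illustrative syntactic filter: only these tag sequences may be terms.
ALLOWED = {("ADJ", "NOUN"), ("NOUN", "NOUN")}


def hybrid_candidates(sentences, allowed, min_freq=2):
    """Keep two-word sequences that meet the frequency threshold and an allowed tag pattern."""
    counts = Counter()
    for sentence in sentences:
        # Pair each token with its successor inside the same sentence.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            counts[(f"{w1} {w2}", (t1, t2))] += 1
    return sorted(
        phrase
        for (phrase, tags), freq in counts.items()
        if freq >= min_freq and tags in allowed
    )


hybrid = hybrid_candidates(TAGGED_SENTENCES, ALLOWED)
print(hybrid)  # ['term base']
```

In this toy corpus "the term" is the most frequent pair but fails the syntactic filter, while "term list" passes the filter but occurs only once, so only "term base" survives both checks.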
Besides accuracy in selecting term candidates, other important evaluation criteria for terminology extraction tools are the file formats and languages they support. Not every extraction tool supports every format in which texts are available.