Domain Classification with Natural Language Processing

2021-08-19 21:50 TAUS

There is a vast collection of textual data on the internet and in various organizational databases today, the overwhelming majority of which is not structured in an easily accessible manner. Natural language processing (NLP) can be used to make sense of unstructured data collections in a way that allows the automation of important decision-making processes that would otherwise require a significant investment of time and effort to carry out manually. In this article, we discuss one of the ways in which NLP can be used to structure textual data.

Domain classification, also known as topic labeling or topic identification, is a text classification method used to assign domain or category labels to documents of various types and lengths. A "domain" or "category" can be understood as a conversational domain, a particular industry segment, or even a specific genre of text, depending on the application. For instance, a textual database may contain documents that pertain to the Legal domain, Healthcare, the Hospitality industry, and many others. Organizations structure their data in this manner to make individual documents more readily accessible and the retrieval of relevant information more efficient.

This article is accompanied by a hands-on tutorial that walks the reader through the entire process of building a domain classification pipeline, from data preprocessing to the training and evaluation of an artificial neural network. With just basic knowledge of the Python programming language, anyone can use the tutorial to achieve around 84 percent accuracy on a sentence-level domain labeling task using the BBC news data set, and learn about sentence embeddings, an increasingly popular numerical text representation method, along the way.

Applications

Automatic text classification methods, such as domain classification, make it possible for data owners to structure their data in a scalable and reproducible manner. Not only can large numbers of documents be sorted automatically in a very short amount of time, but classification criteria can also remain consistent over long periods. Moreover, text classification methods allow businesses to acquire valuable, real-time knowledge about the performance of their tools or services, which enables better and more accurate decision-making.

For example, a domain classification algorithm can be used to organize documents in an unstructured database by topic. This opens the way to a large variety of further, more specific applications, ranging from analyzing current trends in online discussion forums to selecting the right kind of training data for a machine translation (MT) system.

Other text classification tasks include NLP problems such as spam filtering, sentiment analysis, and language identification, all of which share the fundamental challenge of making sense of structural, linguistic, and semantic cues in written documents in order to assign the correct category label. Domain identification, along with related tasks like these, provides the foundation for a wide range of NLP solutions.

Methods

Domain classification can be performed either manually or automatically. However, since manual text classification requires intensive expert labor and is therefore no longer practical in the age of big data, we focus on automatic methods for the remainder of this article.
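To make the task concrete before surveying those methods, here is a deliberately naive sketch of a keyword-based domain labeler in Python. The word lists and categories are invented for illustration; the next paragraphs explain where this rule-based style of system works and why it is hard to maintain at scale.

```python
# A toy rule-based domain labeler. The keyword lists below are invented
# for illustration; real rule-based systems use far richer resources.
KEYWORDS = {
    "Aviation": {"aircraft", "altitude", "radar", "runway"},
    "Finance": {"stock", "dividend", "inflation", "portfolio"},
}

def label_domain(text: str) -> str:
    tokens = set(text.lower().split())
    # Score each domain by how many of its keywords appear in the text.
    scores = {domain: len(tokens & words) for domain, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"

print(label_domain("the aircraft climbed to cruising altitude"))  # Aviation
```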
Domain identification is a subfield of NLP, a discipline that lies at the intersection of linguistics and information technology. Its aim is to gain an understanding of human language using the tools of computer science, primarily machine learning (ML) and artificial intelligence (AI).

Text classification algorithms are generally divided into three distinct categories: rule-based systems, machine learning models, and hybrid approaches. Rule-based systems, which were particularly popular in the early days of NLP, make use of carefully designed linguistic rules and heuristics that identify patterns in the language of a document, on the basis of which domain labels can be applied. They often rely on meticulously crafted word lists that help determine the topic of a given text; consider, for example, the Aviation category, which is likely to contain words such as "aircraft", "altitude", and "radar". Although such algorithms can perform rather well at determining the domain of a piece of text, and their inner workings are easy to understand, they require considerable domain expertise and effort to create and maintain.

Today, however, the field is dominated by machine learning systems rather than applications based primarily on manually prepared linguistic rules. These can be further separated into three main categories: supervised, unsupervised, and semi-supervised systems.

Supervised machine learning algorithms learn from associations between a collection of data points and their corresponding labels. For example, one can train an ML system to correctly identify the topic of news articles by showing the model thousands or even millions of examples from a variety of categories and employing one of many available learning mechanisms. Whether based on deliberately selected features, such as bag-of-words representations or tf-idf vectors, or on characteristics of the data that the models discover themselves, such systems are capable of applying the learned "knowledge" when predicting labels for previously unseen texts. Supervised learning mechanisms include the Naive Bayes algorithm, support vector machines, neural networks, and deep learning.

In contrast, unsupervised systems can be used when the training set does not contain any labels, so the model has to learn based solely on the internal characteristics of the data it encounters. Clustering algorithms, which are popular in text classification, fall under this category. Semi-supervised ML algorithms, in turn, learn associations from a combination of labeled and unlabeled training data.

Finally, many domain classification approaches employ a combination of handcrafted rules and machine learning techniques, which makes them reliable and flexible tools for data enhancement. These methods are often referred to as hybrid systems.

Datasets

All machine learning algorithms require plentiful, high-quality training data in order to perform well. Fortunately, there are a number of open-source datasets on the internet that anyone can download free of charge and use to train or improve domain classification models.

One of the most popular benchmarks in text classification research is the BBC news data set, which contains 2225 articles from five different domains (Business, Entertainment, Politics, Sport, and Tech). This data set is also used in the accompanying tutorial.
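As a hedged sketch of the supervised approach described above, the following snippet trains a Naive Bayes classifier on tf-idf vectors of the BBC articles. It assumes the collection has been unpacked into one folder of plain-text files per category (bbc/business, bbc/sport, and so on), which is the layout scikit-learn's load_files expects; the exact layout of your copy of the data may differ.

```python
# Supervised domain classification on the BBC news data with scikit-learn.
# Assumption: the data set sits in a local "bbc" directory with one
# subfolder per category, each containing plain-text articles.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

data = load_files("bbc", encoding="utf-8", decode_error="replace")
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(stop_words="english")  # tf-idf document vectors
classifier = MultinomialNB()                        # Naive Bayes, as discussed above
classifier.fit(vectorizer.fit_transform(X_train), y_train)

predictions = classifier.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, predictions))
```

Swapping MultinomialNB for a support vector machine or another classifier requires changing only one line, which illustrates why such libraries make supervised experiments cheap to run.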
Another collection of news articles, the Reuters-21578 data set, contains an even larger variety of domains covering a broad range of economic subjects; some examples are Coconut, Gold, Inventories, and Money-supply. As the name suggests, the collection contains 21,578 news articles of variable length.

Similarly, the 20 Newsgroups data set contains 20,000 messages taken from twenty newsgroups, which are organized into hierarchical categories. Some of the top-level categories are Computers, Recreation, and Science, while lower-level domains include Windows, Hardware, Autos, Motorcycles, Religion, and Politics.

Furthermore, many organizations possess internal datasets in the form of structured or unstructured collections of text that they have been accumulating over the years. Such collections can be converted into useful training data for ML models, including domain classifiers and other advanced NLP applications, by first applying a set of data cleaning and enhancement techniques.

Available Tools

Automatic domain classification relies on NLP, which involves a variety of techniques ranging from text preprocessing tools to machine learning algorithms. Therefore, in order to build a domain identification system, one must be familiar with a programming language (Python is the most widely used in the NLP community), various libraries and toolkits, and the basics of machine learning and statistical analysis.

The term "text preprocessing" refers to the steps in an NLP pipeline that prepare data for ML systems. Basic preprocessing steps include, but are not limited to, tokenization (splitting sentences into roughly word-level chunks), lemmatization (converting words into their dictionary form), and part-of-speech tagging (labeling each word according to its grammatical properties). Preprocessing can also refer to converting human language into a numerical representation so that computers can make sense of the textual information. An example of this is vectorization, in which words or sentences are converted into vectors that encode their linguistic and semantic characteristics. Python libraries that offer text preprocessing capabilities include Stanza, spaCy, AllenNLP, TextBlob, and the NLTK platform.

When it comes to the ML component, there is a variety of well-maintained and easy-to-use libraries to choose from. Scikit-learn is a popular option for beginners, as it has excellent documentation and offers many different classifier models that can be implemented with little effort. For deep learning, both the TensorFlow and PyTorch software libraries are popular choices. Using either of these platforms, anyone with basic programming skills can build efficient neural networks capable of performing at nearly state-of-the-art level on various NLP tasks.
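To illustrate the basic preprocessing steps described above, here is a minimal spaCy example covering tokenization, lemmatization, and part-of-speech tagging. It assumes the library and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm); the example sentence is invented.

```python
# Tokenization, lemmatization, and part-of-speech tagging with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The aircraft descended to a lower altitude before landing.")

for token in doc:
    # token.text: surface form; token.lemma_: dictionary form;
    # token.pos_: coarse-grained part-of-speech tag
    print(f"{token.text:12} {token.lemma_:12} {token.pos_}")
```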
Tutorial

In our tutorial, we build a domain classification system based on the BBC News data set. The goal is to create a domain classifier capable of assigning category labels to the individual sentences of the collection. This is somewhat different from the traditional use case of the task, in which documents such as articles and reviews are sorted by topic rather than individual sentences. One might even argue that the task is considerably more difficult at this level, because sentences contain much less domain-specific information than longer texts. For example, consider the following sentences from the data set:

- "But it was so dramatic!"
- "I found him very powerful."
- "It's not good, but it is very understandable."

Taken out of context, it is practically impossible to determine which domain these sentences belong to, regardless of whether a human or an ML model attempts the task. In reality, domain classification might not be appropriate for such sentences at all.

In the tutorial, we cover the following steps of the domain classification pipeline:

- Prerequisites
- Instructions on how to access the data
- Data preprocessing
- Building a simple feedforward neural network (sketched below)
- Evaluation

Head over to this GitHub repository for the full tutorial.
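To give a flavor of the network-building step, below is a minimal PyTorch sketch of the kind of feedforward classifier the tutorial constructs. The embedding dimensionality, layer sizes, and training details are illustrative assumptions rather than the tutorial's exact configuration, and in the real pipeline the inputs would be sentence embeddings instead of random stand-in data.

```python
# A minimal feedforward classifier over sentence embeddings (PyTorch).
# Assumptions: 384-dimensional sentence vectors and the five BBC categories.
import torch
import torch.nn as nn

EMBEDDING_DIM = 384   # depends on the sentence-embedding model used
NUM_CLASSES = 5       # business, entertainment, politics, sport, tech

class DomainClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBEDDING_DIM, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; apply softmax for probabilities

model = DomainClassifier()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random stand-in data.
embeddings = torch.randn(32, EMBEDDING_DIM)    # a batch of 32 sentence vectors
labels = torch.randint(0, NUM_CLASSES, (32,))  # their (fake) domain labels
optimizer.zero_grad()
loss = loss_fn(model(embeddings), labels)
loss.backward()
optimizer.step()
```

For the complete, runnable pipeline, from sentence embedding to evaluation, see the tutorial repository linked above.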