NLP-driven Word Clouds in Data Marketplace


2022-01-03 21:25 TAUS



Bilingual, NLP-driven word clouds are now available in TAUS Data Marketplace. In this article, we discuss what word clouds are and what they can tell us about the contents of a document containing bilingual text data.

When it comes to understanding the contents of large collections of data, visualization is one of the key techniques that many organizations and individuals rely on. Choosing the right visualization method is crucial when attempting to convey the kinds of information about a dataset that its creators or users find important or interesting. For example, using different shades of a color on a map of the world is a great way to illustrate global population density, while a population pyramid can tell us a lot about the demographics of a certain region with regard to age and sex. Datasets composed of text, however, contain a very different kind of data: they are made up of words and sentences rather than numerical variables.

One way to visualize such data is to use word clouds. A word cloud is a simple, weighted visual representation of the vocabulary contained in a textual dataset that allows us to estimate the contents of the data at a glance. It contains the most frequently occurring words in the data, with more frequent words appearing larger than less frequent ones. Word clouds can also show a frequency count for each word: on document sample pages, for example, you can see the number of times a certain word occurs in the document by hovering over it with the mouse pointer. You can try this yourself by browsing the documents on the Marketplace sellers page.

Word clouds also provide excellent insight into the domain of a text document. By inspecting a document's word cloud, you can immediately see whether the vocabulary matches what you would expect to find based on the dataset's domain labels.
For instance, if a document comes from the Healthcare / Medical Equipment & Supplies domain and the word cloud contains words like “treatment”, “clinical”, and “patients”, then you can be confident that the document in question contains high-quality, domain-specific data.

Figure: A word cloud based on this article and its Hungarian translation (machine-generated)

Generating word clouds for bilingual documents might sound like a relatively simple task, but in reality, there is a lot going on under the hood. TAUS applies natural language processing techniques to produce high-quality word clouds that represent the contents of each document in the best possible way.

All of the data in the TAUS Data Marketplace comes in the form of sentences, so the first step in the generation of our “NLP-driven” word clouds is to split these into word-level units, also known as tokens. This process, called tokenization, requires specific solutions for almost every language and can get quite tricky when dealing with languages with logographic writing systems such as Chinese, where words are not separated by spaces. This is why we rely on the spaCy NLP library, which allows us to tokenize data in dozens of different languages quickly and efficiently.

Once we have a list of all tokens in a document, we apply some additional filtering so that only the most important content words are retained. The next step is therefore to remove stop words, which include short function words such as articles (“the”, “a”, “an”) and prepositions (“to”, “from”, “in”, “on”, etc.), as well as words that are common in all kinds of texts, such as “was”, “were”, and “can”. For this, we use a combination of the stop word lists provided by spaCy and a collection of such words that we maintain ourselves. Naturally, this must be done separately for every single language in our database. In addition, tokens that convey little or no information, such as digits and single-character tokens, are also removed.
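The tokenization and filtering steps described above can be illustrated with a minimal sketch. It is not the TAUS implementation: to stay dependency-free, it swaps spaCy for a simple regular-expression tokenizer (which would not work for languages such as Chinese), and the `STOP_WORDS` lists and the function names `tokenize` and `content_tokens` are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny stop word lists. A real pipeline would
# combine library-provided lists with custom ones, per language.
STOP_WORDS = {
    "en": {"the", "a", "an", "to", "from", "in", "on", "was", "were", "can"},
    "hu": {"a", "az", "és", "is", "egy"},
}

# Regex stand-in for a real tokenizer: runs of word characters.
# This only works for languages that separate words with spaces/punctuation.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize(sentence):
    """Split a sentence into lowercase word-level tokens."""
    return [token.lower() for token in TOKEN_PATTERN.findall(sentence)]


def content_tokens(sentence, lang):
    """Keep only tokens that are likely to be content words.

    Drops stop words for the given language, digit-only tokens,
    and single-character tokens.
    """
    stop_words = STOP_WORDS.get(lang, set())
    return [
        token
        for token in tokenize(sentence)
        if token not in stop_words
        and len(token) > 1
        and not token.isdigit()
    ]


if __name__ == "__main__":
    print(content_tokens("The patients were treated in 2 clinical trials.", "en"))
    print(content_tokens("A betegek kezelése", "hu"))
```

Keeping the stop word lists in a dictionary keyed by language code mirrors the point made above: filtering has to be configured separately for every language.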
After stop word removal, the number of times each content word occurs in a given document is counted and saved in a frequency table. To obtain the final counts for a bilingual document, we merge the frequency data from both the source and the target language and retain only the most frequent entries. These counts indicate how many times each token occurs in the document and are used to generate the word clouds on the Data Marketplace website.

Word clouds are a simple yet effective way to visualize textual data in a clear and easily digestible manner. By adding them to TAUS Data Marketplace, we hope to improve the user experience so that both data sellers and buyers can gain a better understanding of the contents of their documents. Take a look at the word cloud on one of our document sample pages to explore the data yourself.
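The counting and merging step described above can be sketched in a few lines of Python using the standard library. This is an illustration under stated assumptions rather than the production code: the function names `merged_frequencies` and `font_sizes`, the `top_n` cutoff, and the linear mapping from counts to font sizes are all hypothetical; the article only states that more frequent words appear larger.

```python
from collections import Counter


def merged_frequencies(source_tokens, target_tokens, top_n=50):
    """Merge source- and target-language counts and keep the top entries.

    Returns a list of (token, count) pairs, most frequent first.
    """
    counts = Counter(source_tokens)
    counts.update(target_tokens)  # adds the target-language counts
    return counts.most_common(top_n)


def font_sizes(entries, min_size=12, max_size=48):
    """Map counts to font sizes linearly (an assumed scaling, for illustration)."""
    if not entries:
        return {}
    values = [count for _, count in entries]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All words are equally frequent, so give them all the same size.
        return {token: max_size for token, _ in entries}
    scale = (max_size - min_size) / (highest - lowest)
    return {
        token: round(min_size + (count - lowest) * scale)
        for token, count in entries
    }


if __name__ == "__main__":
    source = ["patients", "clinical", "patients", "treatment"]
    target = ["betegek", "klinikai", "betegek", "betegek"]
    top = merged_frequencies(source, target, top_n=2)
    print(top)
    print(font_sizes(top))
```

The `(token, count)` pairs are what a front end would need both to size each word and to show its count on hover.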
