3 Tips to Process Your Text Data

2021-03-29 23:25 TAUS

The discipline concerned with extracting information from text data is known as natural language processing, or NLP for short. NLP has many different use cases in artificial intelligence (AI) and machine learning (ML) tasks, and text processing is a pre-processing step in training data preparation for any text-based machine learning model. Some common use cases of NLP include spam filtering, sentiment analysis, topic analysis, information retrieval, data mining, and language translation.

Why is NLP Important?

Much of the technology consumers use on a day-to-day basis contains some sort of NLP-based model behind the scenes. Text data has become very useful for companies seeking to understand how consumers interact with their product or business, because users interact with technology by producing text: writing emails, sending text messages, posting on social media, and so on. An example is the predictive typing suggestions one sees on an iPhone or in a Gmail account: to suggest a response or the next word in a sentence, the technology behind the application uses a model trained on large-scale text data. Text data is thus an important medium of human communication today. From emails to text messages to annotated videos, it is everywhere.

Furthermore, NLP has enabled the creation of language models. Language or translation models power NLP-based applications that perform a variety of tasks, such as language translation, speech recognition, and audio-to-text conversion. Platforms now exist for enriching language data and making it AI-ready through techniques such as annotation and labeling. TAUS HLP is one such platform, where a global community of data contributors and annotators can perform a variety of language-related tasks.

NLP Tips

Having covered what NLP is and why it matters, we now look at some methods and tips for model pre-processing and building. These methods can improve an ML-based language model and ultimately extract better text-based insights. Three important tips to consider when building any NLP model are proper text data pre-processing, feature extraction, and model selection.

1. Pre-processing

Before we feed any text-based machine learning model with data, we need to pre-process that data. This means cleaning it to remove words with little meaning so that the model can capture meaning more appropriately. This step includes the removal of stop words, capitalization normalization, and the stripping of punctuation and tags. The techniques outlined below are commonly used text data pre-processing methods; a minimal code sketch of these steps follows the list.

- Data normalization enforces consistent capitalization and word forms (“uk”, “UK”, or “united kingdom” become “United Kingdom”).
- Data cleaning removes unnecessary punctuation, tags, or other noise.
- Stopword removal omits frequent words such as “the”, “is”, and “a”, which carry little to no meaning.
- Stemming reduces word variations (suffixes, prefixes) to a root form (“running” or “runs” becomes “run”).
- Lemmatization is similar to stemming but more thorough, in that it tracks parts of speech and word context while processing.
- Tokenization converts or reduces sentences to individual words (tokens), using the methods above to add structure to a body of text.
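As an illustration, here is a minimal sketch of these pre-processing steps in Python, assuming the NLTK library is installed; the function name and sample sentence are hypothetical, and spaCy or plain string methods would work just as well.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    # Normalization: consistent (lower) casing.
    text = text.lower()
    # Cleaning: strip tags, then punctuation and other noise.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: reduce the sentence to individual words.
    tokens = nltk.word_tokenize(text)
    # Stopword removal: drop frequent, low-meaning words.
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    # Stemming: reduce word variations to a root form.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]
    # Lemmatization: a more thorough, context-aware alternative.
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas

stems, lemmas = preprocess("The <b>runners</b> were running, and he runs daily!")
print(stems)   # e.g. ['runner', 'run', 'run', 'daili']
print(lemmas)  # e.g. ['runner', 'running', 'run', 'daily']
```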
2. Feature Extraction

Once our data has been cleaned in pre-processing, we turn to feature extraction. Feature extraction prepares the data for model use through categorization and/or organization. Two common ways to perform text feature extraction are count-based vectorization and word embeddings.

Vectorization is the process of mapping text data into a numerical structure, which can ultimately optimize a model by speeding up computation and supporting more accurate results. One of the most common vectorization techniques is bag of words (BOW), which focuses on the number of occurrences of each word in a document: every unique word in the vocabulary is assigned an index, and its value is its frequency count. TF-IDF vectorization is another text feature extraction method, in which the relative frequency of a word is computed: the frequency of a given word in a given document is weighed against how often the word appears across all documents in the corpus. This technique is common in applications such as information retrieval, search engine scoring, and document clustering.

The biggest drawback of both BOW and TF-IDF, however, is that they discard the contextual meaning of words. One approach that compensates for this is word embedding. Word embeddings capture the relationships between words, thereby storing contextual meaning: in a vector space structure, words that are related to each other are placed closer together. One way to implement this is with Word2Vec, a popular technique that learns word embeddings from a corpus. Examples of corpora containing language data can be seen in the TAUS Data Marketplace, where sentence embeddings are used for data cleaning purposes.

[Diagram omitted: word embeddings in a vector space, where similar words sit in close proximity to one another.]

3. Model Selection

The third and final tip is choosing the appropriate model for your text-based task. The first thing to do is assess the output or target of your model; in a text-based setting, a supervised learning approach using a classification model is common. Next, determine the volume of your data: deep learning approaches are often better suited to large volumes of data, whereas classical supervised machine learning works better with smaller volumes of text. Lastly, consider multiple models and assess which gives the best outcome. Ensemble methods in machine learning combine several models rather than relying on a single algorithm; for example, the random forest algorithm uses multiple decision trees to produce a single aggregated prediction.

After choosing and testing different approaches, model evaluation is an important last step to assess how well your model performed. Techniques like cross-validation (training on multiple subsets of the training data over multiple iterations) and grid search (trying every combination of parameters in a grid to find the best set) can help to assess and fine-tune the results of your model. Furthermore, if optimal results are not achieved, revisiting the initial data pre-processing step is a good place to start; perhaps further text processing is required for the model to better understand the data and make more accurate predictions. The sketches below illustrate, in turn, count-based and TF-IDF vectorization, Word2Vec embeddings, and an end-to-end classifier tuned with cross-validated grid search.
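First, a minimal sketch of count-based (BOW) and TF-IDF vectorization, assuming scikit-learn; the toy corpus is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

# Bag of words: each unique word gets an index; values are raw counts.
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())             # one count vector per document

# TF-IDF: counts weighted down for words common across all documents.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(weights.toarray().round(2))
```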
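Next, a sketch of word embeddings with Word2Vec, assuming the gensim library (version 4 or later); the tiny corpus is hypothetical, and real embeddings need far more text to be meaningful.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "make", "good", "pets"],
]

# Learn small embeddings; related words end up closer in vector space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["cat"][:5])                   # first dimensions of one vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbors in the space
```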
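Finally, an end-to-end sketch tying the tips together: TF-IDF features feed a random forest classifier, tuned with cross-validated grid search in scikit-learn. The texts and sentiment labels are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["great product", "terrible service", "love it", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative sentiment

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Grid search tries every parameter combination; cv=2 folds only because
# the toy dataset is tiny (cv=5 is a more common default).
grid = GridSearchCV(
    pipeline,
    param_grid={"clf__n_estimators": [50, 100], "clf__max_depth": [None, 10]},
    cv=2,
)
grid.fit(texts, labels)
print(grid.best_params_)  # the best parameter set found
print(grid.best_score_)   # mean cross-validated accuracy of that set
```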
Summary

In conclusion, the three NLP tips outlined above can have a great effect on your text-based model and its output. Pre-processing extracts, cleans, and gives structure to a body of text. Feature extraction assigns meaning to words through vectorization and word embedding methods. Model selection uses the results of the first two steps to learn from the data and produce predictions on unseen text. TAUS can help you through all of these text data processes. Contact our team of experts for a solution tailored to your specific needs.