What are Sentence Embeddings and Their Applications?


2021-02-10 21:50 TAUS



Embeddings have radically transformed the field of natural language processing (NLP) in recent years by making it possible to encode pieces of text as fixed-size vectors. One of the most recent breakthroughs born out of this innovative way of representing textual data is a collection of methods for creating sentence embeddings, also known as sentence vectors. These embeddings make it possible to represent longer pieces of text numerically, as vectors that computer algorithms, such as machine learning (ML) models, can handle directly. In this article, we will discuss the key ideas behind this technique, list some of its possible applications, and provide an overview of some of the state-of-the-art sentence embedding approaches commonly used in NLP research and the language industry.

A principal question in NLP is how to represent textual data in a format that computers can understand and work with easily. The solution is to convert language to numerical data: traditional methods like TF-IDF and one-hot encoding have been applied in the field for several decades. However, these methods have a major limitation, namely that they fail to capture the fine-grained semantic information present in human languages. For example, the popular bag-of-words approach only takes into account whether or not vocabulary items are present in a sentence or document, ignoring wider context and the semantic relatedness between words.

For many NLP tasks, however, it is crucial to have access to semantic knowledge that goes beyond simple count-based representations. This is where embeddings come to the rescue. Embeddings are fixed-length, multi-dimensional vectors that make it possible to extract and manipulate the meaning of the segments they represent, for example by comparing how semantically similar two sentences are to each other.

Background: Word Embeddings

To understand the way sentence embeddings work, one must first become familiar with the concept that inspired them, namely word embeddings. In contrast to binary vectors (e.g. one-hot encoding) that are computed by mapping tokens to integer values, word embeddings are learned by ML algorithms in an unsupervised manner.

The idea behind learning word embeddings is grounded in the theory of distributional semantics, according to which similar words appear in similar contexts. Thus, by looking at the contexts in which a word frequently appears in a large body of text, it is possible to find similar words that occur in nearly the same contexts. This allows neural network architectures to extract the semantics of each word in a given corpus. Word vectors, the end product of this process, encode semantic relations, such as the fact that the relationship between Paris and France is the same as the one between Berlin and Germany, and much more.

Common approaches for computing word vectors include:

- Word2Vec: the first approach that efficiently used neural networks to learn embeddings from large datasets (see the sketch below)
- fastText: an efficient character-based model that can process out-of-vocabulary words
- ELMo: deep, contextualized word representations that can handle polysemy (words with multiple meanings in different contexts)
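To make the idea of learning word vectors from context more concrete, here is a minimal sketch using the gensim library's Word2Vec implementation. The toy corpus, the hyperparameters, and the queries are illustrative assumptions rather than anything from the original article; real word embeddings are trained on corpora with billions of tokens, so the neighbors returned by such a tiny example will not be meaningful.

```python
# Minimal Word2Vec sketch with gensim (illustrative only; not from the article).
from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences; real models use billions of tokens.
corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# Train a small skip-gram model (sg=1); all hyperparameters here are arbitrary.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Every vocabulary item is now a fixed-length vector.
paris_vector = model.wv["paris"]            # numpy array of shape (50,)

# Similar words and analogies (Paris : France :: Berlin : Germany) can be queried,
# although a corpus this small is far too tiny to produce sensible results.
print(model.wv.most_similar("paris", topn=3))
print(model.wv.most_similar(positive=["france", "berlin"], negative=["paris"], topn=1))
```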
The Rise of Sentence Embeddings

Sentence embeddings can be understood as an extension of the key ideas behind word embeddings. Being representations of longer chunks of text as numerical vectors, they open up a wide range of possibilities in NLP research. The same core characteristics apply to them as to word embeddings: for example, they capture a range of semantic relationships between sentences, such as similarity, contradiction, and entailment.

A simple and straightforward baseline method for creating sentence vectors is to use a word embedding model to encode all the words of a given sentence and take the average of the resulting vectors. While this provides a strong baseline, it falls short of capturing information related to word order and other aspects of overall sentence semantics.

Beyond this baseline, sentence embeddings can be learned using ML algorithms in both supervised and unsupervised ways. Often, these algorithms are trained to achieve a number of different objectives in a process known as multi-task learning. By solving some NLP task using a labeled dataset (supervised learning), these models produce universal sentence embeddings that can be further optimized for a variety of downstream applications. Ultimately, these methods provide semantically richer representations and have been shown to be highly effective in applications where semantic knowledge is required. Furthermore, by using multilingual training data, it is possible to create language-agnostic sentence embeddings that are capable of handling text in different languages.

Applications

Sentence embeddings can be applied in nearly all NLP tasks and can dramatically improve performance when compared to count-based vectorization methods. For instance, they can be used to compute the degree of semantic relatedness between two sentences, expressed as the cosine similarity between their vectors:

[Sentence 1] Sentence embeddings are a great way to represent textual data numerically.
[Sentence 2] Sentence vectors are very useful for encoding language data as numbers.
[Cosine similarity] between the above: 0.775

Based on this simple mathematical calculation, it is possible to adapt sentence embeddings for tasks such as semantic search, text clustering, intent detection, and paraphrase detection, as well as for the development of virtual assistants and smart-reply algorithms. Moreover, cross-lingual sentence embedding models can be used for parallel text mining or translation pair detection. For example, TAUS Data Marketplace uses a data cleaning algorithm which leverages sentence vectors to compute the semantic similarity between parallel segments in different languages in order to estimate translation quality.
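The following sketch illustrates both the averaging baseline and the cosine-similarity computation described above. It assumes the sentence-transformers library and its pretrained "all-MiniLM-L6-v2" model, neither of which is mentioned in the article, so the score it prints will not necessarily match the 0.775 figure quoted above.

```python
# Sketch: sentence vectors and cosine similarity (model choice is an assumption).
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_embedding(tokens, word_vectors):
    # The baseline described above: a sentence vector as the mean of its word
    # vectors (word_vectors can be any token-to-vector mapping, e.g. model.wv
    # from the Word2Vec sketch earlier).
    return np.mean([word_vectors[t] for t in tokens if t in word_vectors], axis=0)

sentences = [
    "Sentence embeddings are a great way to represent textual data numerically.",
    "Sentence vectors are very useful for encoding language data as numbers.",
]

# A dedicated sentence encoder (any pretrained sentence embedding model would do).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

print(cosine_similarity(embeddings[0], embeddings[1]))
```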
State-of-the-art Sentence Embedding Methods

A variety of sentence embedding techniques exist for obtaining vector representations of sentences that can be used in downstream NLP applications. The current state of the art includes:

- SkipThought: an adaptation of the Word2Vec idea to the sentence level that produces embeddings by learning to predict the sentences surrounding an encoded sentence
- SentenceBERT: based on the popular BERT model, this framework combines the power of transformer architectures and siamese (twin) neural networks to create high-quality sentence representations
- InferSent: produces sentence embeddings by training neural networks to identify semantic relationships between sentences in a supervised manner
- Universal Sentence Encoder (USE): a collection of two models that leverage multi-task learning to encode sentences into highly generic sentence vectors that are easily adaptable for a wide range of NLP tasks

Cross-lingual Embeddings

Sentence embeddings were originally conceived in a monolingual context; in other words, they were only capable of encoding sentences in a single language. Recently, however, multilingual models have been published that create shared, cross-lingual vector spaces in which semantically equivalent or similar sentences from different languages appear close to each other. Two notable examples are listed below, followed by a short usage sketch.

- LASER: calculates universal, language-agnostic sentence vectors, based on a shared byte-pair encoding vocabulary, that can be used in different NLP tasks
- LaBSE: generates language-agnostic BERT sentence embeddings that are capable of generalizing to languages not seen during training, by combining masked language modeling with cross-lingual language modeling
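To make the translation pair detection use case concrete, here is a minimal sketch that scores candidate source-target segment pairs by cosine similarity in a shared cross-lingual space. It assumes the sentence-transformers library and the "sentence-transformers/LaBSE" checkpoint; the example segments and the 0.8 threshold are illustrative choices, not recommendations from the article.

```python
# Minimal sketch: scoring candidate translation pairs with a cross-lingual model.
# Assumes the sentence-transformers library and its "sentence-transformers/LaBSE"
# checkpoint; the example pairs and the 0.8 threshold are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["Sentence embeddings are very useful.", "The cat sat on the mat."]
german = ["Satzeinbettungen sind sehr nützlich.", "Es regnet heute den ganzen Tag."]

# Both languages are encoded into the same vector space.
emb_en = model.encode(english)
emb_de = model.encode(german)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Keep pairs whose similarity clears an (illustrative) threshold.
for en, v_en in zip(english, emb_en):
    for de, v_de in zip(german, emb_de):
        score = cosine_similarity(v_en, v_de)
        if score > 0.8:
            print(f"{score:.3f}  {en}  <->  {de}")
```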
Conclusion

We have seen that sentence embeddings are an effective and versatile method of converting raw textual data into numerical vector representations for use in a wide range of natural language processing applications. Not only are they useful for encoding sentences in a single language, but they can also be applied to solve cross-lingual tasks, such as translation pair detection and quality estimation. The state-of-the-art approaches discussed in this article are easily accessible and can be plugged into existing models to improve their performance, which makes them an essential part of the NLP professional's toolkit and an exciting addition to the offering of language data service providers.
