Language Models: The Big Money Race

2020-11-24 06:40 Lingua Greca

Let's Start Simple: What Is a Language Model?

A language model (LM) is nothing more than a set of statistical and probabilistic techniques used to determine the probability of a given word occurring in a sentence, or of the sentence as a whole. These models interpret language by feeding it through algorithms that learn rules for context in natural language, using large corpora of text as the basis for their word predictions. Today, LMs are the backbone of almost every application in natural language processing (NLP).

Let's take a look at some of the most important applications. Maybe you're already using an LM without even knowing it!

Optical Character Recognition (OCR)

OCR is used all over the world to recognize text inside images, from scanned documents to photos. This technology can convert virtually any kind of image containing written text into computer-readable text.

Machine Translation

At its most basic, machine translation (MT) is the substitution of words in one language for words in another. In a globalized world, it is a core capability for text and speech applications in fields as varied as government, the military, finance, health care and e-commerce.

Speech Recognition

Voice assistants such as Siri or Alexa are examples of language models in our everyday lives. They are prominent examples of how LMs help machines process speech audio.

Sentiment Analysis

There's a good chance that every opinion or comment you've ever posted on social media has been used in a sentiment analysis process. Businesses use sentiment analysis to understand social sentiment about their brands; in fact, it's one of the main ways social networks are monetized. No wonder they're billion-dollar businesses!

But getting back to our topic: language models are crucial in modern NLP applications. They are the main reason machines are able to understand language, transforming qualitative information into quantitative information and thereby allowing machines to understand people.

The roots go back to the 1948 paper "A Mathematical Theory of Communication," in which Claude Shannon used a stochastic model called a Markov chain to build a statistical model of sequences of English text. Remarkably, the paper already makes reference to n-grams. But it wasn't until the 1980s and the rise of computers that more complex systems made statistical models the norm. It was a big decade for NLP: John Hopfield introduced recurrent neural networks, and Geoffrey Hinton, one of the fathers of modern AI, introduced the idea of representing words as vectors. We had to wait until 2003 for the first neural language model, a feed-forward neural network trained to predict the next word, but from then on we haven't looked back.
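To make the statistical idea concrete, here is a minimal sketch (not from the original article) of a bigram language model in the spirit of Shannon's Markov chains: it counts word pairs in a toy corpus and scores a sentence by multiplying conditional probabilities. The corpus and the smoothing choice are illustrative assumptions only.

```python
from collections import Counter

# Toy corpus standing in for the large corpora a real LM is trained on.
corpus = [
    "she gave the right answer",
    "you have the right to defend yourself",
    "make a right at the next corner",
]

# Count unigrams and bigrams (a first-order Markov chain over words).
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def sentence_probability(sentence: str) -> float:
    """P(sentence) ~ product of P(w_i | w_{i-1}), with add-one smoothing."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams) + 1  # crude estimate, enough for a demo
    prob = 1.0
    for prev, curr in zip(words[:-1], words[1:]):
        prob *= (bigrams[(prev, curr)] + 1) / (unigrams[prev] + vocab_size)
    return prob

print(sentence_probability("she gave the right answer"))   # relatively high
print(sentence_probability("answer right the gave she"))   # much lower
```

Modern systems replace these counts with neural networks holding billions of parameters, but the goal is unchanged: assign probabilities to sequences of words.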
Let's take a look at some of the most important language models built on today's neural networks.

The Top 5 Language Models That Accelerated Natural Language Processing

Google BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a pre-trained NLP language model developed by Google in 2018. Unlike previous models, BERT was the first truly bidirectional, or non-directional, unsupervised language representation. Earlier models such as Word2vec or GloVe generate a single embedding for each word in the vocabulary, whereas BERT takes into account the context and position of each occurrence of a given word.

For example, a model like Word2vec will have the same representation for the word "right" in the three following sentences:

You have the right to defend yourself.
She just gave the right answer.
We should make a right at the next corner.

BERT, on the other hand, produces a contextualized embedding that differs from sentence to sentence, treating "right" as a different word with a different meaning in each case, just as humans understand it with no effort.
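To see that context sensitivity concretely, here is a minimal sketch assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint (tooling choices of this write-up, not prescribed by the article). It compares BERT's vectors for "right" in two of the sentences above.

```python
# pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word: str, sentence: str) -> torch.Tensor:
    """Return BERT's contextual embedding of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

legal = embedding_of("right", "You have the right to defend yourself.")
turn = embedding_of("right", "We should make a right at the next corner.")

# A static model such as Word2vec would give identical vectors here;
# BERT's two vectors differ because the surrounding context differs.
print(torch.cosine_similarity(legal, turn, dim=0).item())
```

The printed similarity is noticeably below 1.0: the same surface word lands in different places in the vector space, which is exactly what a static embedding cannot express.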
T-NLG

In early 2020, T-NLG, or Turing Natural Language Generation, became the largest model ever published, with 17 billion parameters, outperforming state-of-the-art language models on benchmarks for very practical tasks such as summarization and question answering. T-NLG is a Transformer-based generative LM, which means it can generate words to complete open-ended textual tasks.

"Beyond saving our users time by summarizing documents and emails, T-NLG can enhance experiences with the Microsoft Office suite by offering writing assistance to authors and answering questions that readers may ask about a document," noted Microsoft AI Research applied scientist Corby Rosset.

OpenAI's GPT-3

From OpenAI comes GPT-3, the successor of GPT and GPT-2, in that order. For comparison, the previous version, GPT-2, was trained with around 1.5 billion parameters, far short of Microsoft's largest Transformer-based language model. But erase all that from your memory, because OpenAI went to 175 billion parameters with GPT-3, roughly ten times larger than the next closest model.

"GPT-3 achieves strong performance on many NLP data sets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing three-digit arithmetic," according to its researchers.

The sheer size of the model puts it out of reach for almost everyone except a few select companies and research labs. But its main contribution is making cutting-edge NLP more accessible without requiring large, task-specific data sets or task-specific model architectures.

Megatron-LM

Following the latest advances in NLP (BERT, GPT-2 and GPT-3), it was only a matter of time before a big competitor in GPU manufacturing pushed the limits of the technology. Enter Nvidia's Megatron-LM, an 8.3 billion parameter Transformer language model trained on 512 GPUs. Nvidia took on the question, "Is getting better NLP models as easy as building larger models?" They showed that naively scaling a BERT-style model from 336 million to 1.3 billion parameters actually decreased accuracy and compounded larger models' memory problems, and that the placement of layer normalization becomes critical at scale. To see how careful you must be with layer normalization when increasing model size, Nvidia's write-up "Language Modeling Using Megatron A100 GPU" is a must-read.

ELMo

(The author's attachment to this model should be taken into account.) In 2018, the paper "Deep Contextualized Word Representations" introduced ELMo, a new technique for embedding words into a vector space using bidirectional LSTMs trained on a language modeling objective. In addition to beating several NLP benchmarks, ELMo proved able to cut the amount of training data needed by as much as a factor of ten while achieving the same results. The model, developed by AllenNLP, is a deep bidirectional language model (biLM) built on top of bidirectional LSTMs and pre-trained on a huge text corpus. Its main differentiator is how easily it can be added to existing models, drastically improving tasks such as question answering, sentiment analysis or summarization.

The Future of Language Models

Although bigger is not always better, when working with language models the amount of data is critical: the bigger the model, and the more diverse and comprehensive the pre-training data, the better the results. As Microsoft scientist Corby Rosset put it, "We believe it is more efficient to train a large centralized multitask model and share its capabilities across numerous tasks."

Like GPT-3 or BERT, language models may be able to complete open-ended textual tasks by generating words, building summaries or answering direct questions, but the costs are high, with expensive data sets and millions in resources. So, although we're not in the race ourselves, we'll definitely stay tuned and take the benefits that come our way. Let's just hope they're for everyone.
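In the meantime, readers who want a feel for open-ended generation without GPT-3-scale resources can try GPT-3's freely available predecessor, GPT-2. Below is a minimal sketch using the Hugging Face transformers pipeline (an assumed toolchain, not something discussed above).

```python
# pip install transformers torch
from transformers import pipeline

# The small 124M-parameter GPT-2 checkpoint stands in for its much larger
# successor GPT-3, which is only reachable through OpenAI's API.
generator = pipeline("text-generation", model="gpt2")

prompt = "Language models are important because"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=2, do_sample=True)

for out in outputs:
    print(out["generated_text"])
    print("---")
```

A 124-million-parameter checkpoint is a far cry from 175 billion, but the principle is the same: the model continues a prompt it has never seen, one probable word at a time.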