Article co-written by Laszlo Varga and Yulia Akhulkova.
It won’t come as any surprise that, like so many other organizations, Nimdzi Insights has been experimenting with ChatGPT since its public release in November 2022. While large language models (LLMs) have been around for a number of years, they had mostly gone unnoticed until ChatGPT broke the dam. We at Nimdzi are undoubtedly thankful and excited that language technology has suddenly become the talk of the town, and we are closely watching and researching the propagation of LLMs in the translation and localization industry and beyond.
As we’ve been monitoring the technology, we couldn’t help but notice a lot of confusion around the core fundamentals (which is understandable, as many are still getting up to speed on the topic and learning the proper lingo). We see GenAI, LLMs, GPT, and other buzzwords and acronyms flying around, intermingled and misused. In this article, we sort out the terminology so you don’t have to. Get ready for a lot of three-letter acronyms (TLAs)!
Language models: from n-grams to LLMs
The modeling of language has a long history. Much like machine translation, it began with statistical models (n-grams) in the 1950s, with moderate success. The advent of neural networks (NNs), whose principles are loosely inspired by how the brain works, brought about a revolution in AI and language modeling.
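To give a flavor of what those early statistical models did, here is a toy bigram sketch in Python: it counts which word follows which in a tiny made-up corpus and predicts the most frequent continuation. The corpus and function are purely illustrative, not taken from any real system.

```python
from collections import Counter, defaultdict

# A toy bigram language model: count which word follows which,
# then predict the most frequent continuation.
corpus = "the cat sat on the mat and the cat slept".split()

bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def predict_next(word):
    # Return the most likely next word observed in the corpus, or None.
    followers = bigram_counts.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the"))  # -> "cat" (it follows "the" twice, "mat" only once)
```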
The underlying mechanism is still probabilistic, but NNs allow machines to learn far more effectively than earlier models, especially in the new age of deep neural networks (DNNs), where multiple layers of artificial neurons (“nodes”) make up the model. DNNs had already proved useful for various tasks, such as categorization and image processing, but they were not very effective at handling language.
“Wait, DNNs are not effective language models?” you might ask with surprise. Well, to reach their full potential, they needed a novel architecture. This is where transformers and vast amounts of text data came into play, along with upgraded computing power.
Transformer models
Transformers were introduced in Google’s by-now-legendary 2017 paper “Attention Is All You Need.” Models based on this specific DNN architecture signaled a quantum leap in AI and are applied to various purposes, including language modeling. By introducing attention and self-attention layers, and by allowing computation to be parallelized efficiently, the transformer architecture opened the door to the exponential scaling of neural language models.
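For readers who want to see what “self-attention” actually computes, below is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. The dimensions and inputs are invented for illustration; real models use learned projections and many attention heads.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every other position, weighting the values V
    by how strongly its query Q matches each key K."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                                   # pairwise match strength
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ V                                                # weighted mix of values

# Toy example: a "sentence" of 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real model Q, K and V come from learned projections of x;
# here we reuse x itself just to show the mechanics.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): one context-aware vector per token
```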
To further mark their importance, in 2021 Stanford researchers dubbed these large pre-trained transformer models “foundation models,” signaling a paradigm shift in the field of AI.
The main features of transformer models:
Based on the concept of “self-attention,” they learn to predict the strength of dependencies between distant data elements – such as the long-range dependencies and relationships between words (“tokens”) in text.
They can be pre-trained on massive amounts of (multilingual) text data with enormous potential to scale, and be fine-tuned with custom datasets.
Because they represent text as numbers in vectors or tensors (“embeddings”; see the sketch after this list), transformer models can take advantage of specialized hardware such as graphics or tensor processing units (GPUs or TPUs) for parallelization and compute efficiency. Recent developments in these computing technologies have greatly contributed to the advances in language modeling.
Their best-known applications are large language models, which achieve human-like performance on tasks such as content generation, summarization, and labeling.
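As a concrete illustration of the “embeddings” mentioned above, the short sketch below maps tokens to vectors through a lookup table, which is essentially what the first layer of a transformer does. The vocabulary, dimensions, and values are invented for the example; in a real model the table is learned during training and processed in parallel on GPUs or TPUs.

```python
import numpy as np

# A toy vocabulary and an embedding table: one row of numbers per token.
vocab = {"translation": 0, "is": 1, "fun": 2}
embedding_dim = 4
rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

tokens = ["translation", "is", "fun"]
token_ids = [vocab[t] for t in tokens]
vectors = embedding_table[token_ids]  # shape (3, 4): the numbers the model actually "sees"
print(vectors)
```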
Generative pre-trained transformer
The first generative pre-trained transformer (or GPT), GPT-1 by OpenAI, appeared in 2018. It generated text, was pre-trained with billions of words, and was based on the transformer architecture. It was the first proof that DNNs based on the scalable and efficient transformer architecture could be used for generative tasks.
GPT-1, however, was not the only early transformer language model: Google released BERT the same year. While BERT by now powers most of Google Search, it is not a generative model, as it doesn’t actually create text; it is more useful for natural language understanding (NLU) tasks. GPT-1 was created with that specific purpose in mind: generating text by predicting the next word.
Large language models
So GPT-1 is a generative transformer DNN LM. So far so good, right? But it is NOT an LLM. Let’s unpack this.
Size turns out to be a critical characteristic of neural networks in two distinct ways: the number of parameters that make up the model, and the size of the dataset used to train it.
GPT-1 had 117 million parameters and was trained on a few billion words. While these were big numbers in 2018, they are minuscule compared to today’s scales. Its successor, GPT-2, had roughly ten times the parameters (1.5 billion), ten times the training data, and four times the neural layers. In 2020, GPT-3 upped the game again: 100 times more parameters (175 billion), ten times the training data, and double the number of layers of GPT-2. Once the number of parameters exceeded a few billion (an arbitrary cut-off), these models started to be dubbed “large.”
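To put those jumps in perspective, here is a quick back-of-the-envelope comparison using the commonly cited parameter counts (the exact ratios are approximate and shown only to convey the scale-up):

```python
# Commonly cited parameter counts for the first three GPT generations.
params = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

names = list(params)
for prev, curr in zip(names, names[1:]):
    ratio = params[curr] / params[prev]
    print(f"{curr}: ~{ratio:.0f}x the parameters of {prev}")
# GPT-2: ~13x the parameters of GPT-1
# GPT-3: ~117x the parameters of GPT-2
```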
And so the large language model was formally born with GPT-3.
Other companies have also created LLMs in different shapes and sizes, such as LaMDA and PaLM (Google), the LLaMA family (Meta), Claude (Anthropic), Cohere, BLOOM (led by Hugging Face), Luminous (Aleph Alpha), and Jurassic (AI21), with varying accessibility and usability features. OpenAI’s latest-generation LLM (GPT-4) arrived in March 2023, and we can be certain that GPT-5 is already in the works.
So what do they do? All an LLM really does is guess the next word (or rather, “token”) that should come after the existing text. Every single token an LLM produces requires a complete pass through the huge model, and that token is then fed back in as part of the input for generating the next one.
What are these tokens? Why not just predict words? Great question. The simple answer is that languages are complex, and chunking longer words into so-called tokens yields better results. It also has additional benefits: programming languages, strictly speaking, are not made up of words, yet LLMs can learn them just as well as, if not better than, human languages.
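To see both ideas at once (text being chunked into tokens, and the model repeatedly predicting the next one), here is a minimal sketch using the openly available GPT-2 model via the Hugging Face transformers library. It greedily picks the most likely next token ten times; GPT-2 is, of course, tiny and dated compared with the models discussed above, and the prompt is just an example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2: an openly available (and by today's standards small) generative LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Localization is the process of"
input_ids = tokenizer(text, return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))  # the words, chunked into tokens

# Greedy decoding: every new token requires a full pass through the model.
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits      # a score for every token in the vocabulary
    next_id = logits[0, -1].argmax()          # pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```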
Generative AI
So, here’s what we have learned so far: GPT-1 is a generative LM, and GPT-4 is a generative LLM. Oh, and BERT is a non-generative language model, but they are all transformers. How do these relate to generative AI (GenAI)?
All generative LLMs are GenAI, but generated output can be more than just text. Diffusion models (which are also DNNs utilizing transformer-like features) such as Midjourney, Stable Diffusion, DALL-E, and Bing Image Creator do just that: they create images from text or even image input. There are also attempts at voice, music, video, and even game creation using AI, with architectures that differ widely from transformers.
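For the curious, text-to-image generation with a diffusion model can be tried in a few lines, for example with the Hugging Face diffusers library. The sketch below is illustrative only: the checkpoint named here is just one publicly available Stable Diffusion model, and a GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly available Stable Diffusion checkpoint (several exist).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # image generation is compute-heavy; a GPU is assumed here

image = pipe("an isometric illustration of a busy translation office").images[0]
image.save("office.png")
```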
What about ChatGPT?
LLMs had long stayed under the public radar because they needed an engineer to operate them. For all the rapid technology development, the breakthrough came with a product innovation: a public, natural-language interface to a GPT-3.5 model fine-tuned for conversational use. Suddenly, practically anyone, not just developers, could give this language technology a try.
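For developers, the same conversational pattern is available programmatically: you send a list of messages and get the model’s next reply back. Below is a minimal sketch with the OpenAI Python client (the exact client syntax has changed across library versions, and the model name and prompt are only examples).

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the conversational model family behind ChatGPT
    messages=[
        {"role": "system", "content": "You are a helpful localization assistant."},
        {"role": "user", "content": "Explain in one sentence what a TLA is."},
    ],
)
print(response.choices[0].message.content)
```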
Since then, GPT-4 has been integrated into Bing AI Chat and powers Microsoft 365 Copilot, and other LLMs have also received simple conversational interfaces: Google Bard, You.com, and Perplexity.ai are just a few examples.
And so, with the birth of ChatGPT in November 2022, all the terms explained above became part of public discourse as well as language industry events and publications.
(Figure note: the ellipses in the diagram do not correspond to the real conceptual size of the listed terms.)
Unlike LLMs, we can’t predict the next big word in the AI-powered world, but we can at least hope that with this explanatory note, the modern terminology of language AI has become clearer.
Disclaimer: No parts of this post were written or prompted by AI.