Situational Awareness in Machine Interpreting

2024-01-22 18:50 GALA


Machine interpreting, a subset of spoken language translation, is undergoing rapid advancement. Recent strides in this domain are particularly evident in the development of robust end-to-end systems, which use a single language model to translate spoken content directly from one language into another. Impressive as this technology is, it currently finds its best application only in offline speech translation tasks. For real-time simultaneous translation, which is my primary area of interest, cascading systems, with their many components and possible configurations, remain the gold standard. Despite their inherent complexity and intrinsic limitations, cascading systems offer a distinct advantage: they are adept at incorporating the latest innovations in generative AI. This compatibility paves the way for immediate improvements in speech translation quality.

Getting AI to Read Between the Lines

I recently gave an interview to El País in which I argued that one of the biggest challenges of real-world speech translation can be solved, at least to a certain extent, with Large Language Models (LLMs) like ChatGPT or Llama 2. The challenge I am addressing is the capacity to translate in a manner informed by the communicative context, which requires a form of “understanding” of the contingent situation. I use the term “understanding” in quotes, given its contentious nature and the absence of a universally accepted definition. For our purposes, let’s define understanding as the capability to amass enough knowledge for the system to respond coherently and in alignment with the communicative context. This encompasses skills such as basic co-reference resolution (identifying who is speaking to whom, and their gender, status, and role), adjusting terminology, register, and style (speaking as an expert versus a layperson), and discerning implied meanings beyond literal statements (inferring subtext, intent, and so on), to mention a few. Traditional Neural Machine Translation (NMT) falls short in these areas. Conversely, and notwithstanding their intrinsic limitations, the reasoning and in-context learning abilities of LLMs have shown remarkable proficiency in these domains. They could therefore be pivotal in helping speech translation transcend its primary constraint: the lack of inherent ties to the communicative context. Needless to say, this paves the way for a more enriched translation experience.

Enhancing Speech Translation Through LLMs

If you have interacted extensively with an advanced large language model (comparable to GPT-3.5-turbo, for instance), its potential becomes clear. Dissect a communicative act into its core components. As the act progresses and a participant introduces new information, evaluate the likelihood of specific actions being taken by either party. Probe the model about the speakers’ intentions, predict the potential trajectory of the conversation, and, with sufficient contextual information at hand, you will observe the intriguing insights an LLM can glean from such data. This is the basis of what I call situational awareness (distinct from the “higher” level of awareness described here).

This capability warrants exploration. At present, my research is geared towards leveraging Large Language Models to:

- Disambiguate meaning through context.
- Comprehend and continuously expand the knowledge about the communicative event.
- Assess the system’s confidence in the knowledge gained.
- Trigger translation decisions based on this understanding of the communication.
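To make the first three goals concrete, here is a minimal sketch of this kind of probing, assuming the OpenAI Python client and GPT-3.5-turbo; the example dialogue, the prompt wording, and the probe helper are illustrative assumptions, not the actual research setup.

```python
# Minimal sketch: probing an LLM for situational knowledge about an
# unfolding communicative event. Dialogue and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

transcript = [
    "Receptionist: Good morning, how can I help you?",
    "Patient: I'm here for my follow-up. The knee is still swollen.",
]

def probe(question: str) -> str:
    """Ask the model a question about the conversation so far,
    requesting a self-reported confidence for each answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You observe a conversation and answer questions "
                        "about the situation. Mark each answer with your "
                        "confidence: low, medium, or high."},
            {"role": "user",
             "content": "Conversation so far:\n" + "\n".join(transcript)
                        + "\n\nQuestion: " + question},
        ],
    )
    return response.choices[0].message.content

# The goals above, phrased as probes:
print(probe("Who is speaking to whom, and in what roles?"))   # co-reference
print(probe("What does 'follow-up' most likely mean here?"))  # disambiguation
print(probe("What is the patient likely to ask about next?")) # trajectory
```

The fourth goal, triggering translation decisions, builds on answers like these, as the later sketches suggest.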
This process presents fascinating challenges across various dimensions. From a computer science perspective, the question arises: how deeply can an LLM comprehend a communicative event, and what measures can we take to aid its understanding? On the translation front, once we have amassed sufficient contextual data, how can we harness it to enhance machine interpreting strategically?

Integrating Frame Semantics for Contextual Awareness

The methodology I am developing to address the first aspect of this challenge draws inspiration from Frame Semantics, a theory formulated by Charles J. Fillmore in the 1970s that relates linguistic semantics to encyclopedic knowledge. Within this framework, employing a word in a novel context means comparing it with past experiences to see whether they match in meaning. Fillmore elucidates this using the notions of scenes and frames. The term “frame” denotes a collection of linguistic options or constructs, which in turn evokes a mental representation, or “scene”. Fillmore characterizes a scene as any recognizable experience, interaction, belief, or imagination, whether visual or not. Scenes and frames perpetually stimulate one another in patterns such as frame-to-scene, scene-to-frame, scene-to-scene, and frame-to-frame. Specifically, the activation process refers to instances where a distinct linguistic structure, such as a clause, triggers associations; these associations then prompt other linguistic structures and incite further associations. This interplay ensures that every linguistic element in a text is influenced by another, facilitating the extraction, or even construction, of meaning from linguistic statements. Essentially, it fosters understanding, or interpretation, of the situation. My ambition is to synthetically instigate and regulate this interplay between scenes and frames using an LLM, aiming to infuse contextual awareness into the translation procedure.
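As a rough illustration of how this interplay might be operationalized, here is a toy rendering of the frame-to-scene direction, with an LLM standing in for the association step. The Scene class, the prompt, and the llm callable are all hypothetical, a sketch of the idea rather than the actual method.

```python
# Toy sketch of frame-to-scene activation: each new utterance (frame)
# enriches an evolving representation of the situation (scene).
# All names and prompts here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scene:
    """The evolving representation of the communicative event."""
    description: str  # seeded externally at the start of the interaction
    facts: list[str] = field(default_factory=list)

    def integrate(self, frame: str, llm: Callable[[str], str]) -> None:
        """Ask the LLM what the new frame adds to the current scene."""
        prompt = (
            f"Scene so far: {self.description}\n"
            f"Known facts: {'; '.join(self.facts) or 'none'}\n"
            f"New utterance: {frame}\n"
            "List any new facts about the situation, one per line."
        )
        self.facts.extend(
            line for line in llm(prompt).splitlines() if line.strip()
        )

# The scene-to-frame direction closes the loop: the enriched scene is
# later used to bias translation choices (terminology, register, etc.).
scene = Scene(description="A consultation at an orthopedic clinic.")
```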
Challenges and Opportunities in Situational Awareness

Speech translation offers an ideal setting in which to explore this approach. It bears similarities to conversation design, arguably its most direct application, yet comes with the advantage of more straightforward evaluation criteria and metrics. In my working hypothesis, for an LLM to significantly enhance the translation process, it needs to become a communicative agent adept at discerning the nuances, logic, and dynamics of real-world scenarios, and then channel this understanding into the translation. This is no small feat, given the current limitations of LLMs on the one hand, and the complexity of real-life communication, especially multilingual communication, on the other. LLMs’ understanding is grounded solely in the insights text can offer. They lack, for example, the ability to process visual indicators and, vitally for speech translation, they cannot decode nuances from acoustic cues such as prosody. Undoubtedly, the abilities on this expanding list of challenges, currently limitations of the technology, are crucial for effective oral communication.

Deciphering Linguistic Input

Yet LLMs’ ability to decipher pure linguistic input is commendable, as is the amount of knowledge that can be derived by simple inference. It is astonishing how well LLMs can extract insights from minimal and often partial input (the frame). The availability of specific contextual data (the scene) enhances the agent’s comprehension of the situation in a top-down fashion, creating a continuous feedback loop of scene-frame-scene activations. Interestingly, while an initial understanding of the scene is vital to initiate this loop, and must therefore be supplied externally, for example by describing the general communicative setting, the scene is progressively and automatically enriched by the integration of new frames as the communication evolves. This, in turn, allows the agent to adapt autonomously to the evolving communicative context. This is situational awareness at play.

Ensuring Precision in Linguistic Processing

Let’s be clear: this approach is not without its challenges. Since LLMs primarily operate on linguistic surface structures, the frame/scene activations can easily go astray. I am not referring to the well-known hallucinations, but to glaring misinterpretations of ongoing conversations. Humans possess robust control mechanisms that prevent such deviations and allow the interlocutor (or, in our specific case, the interpreter) to remain aligned with the unfolding communicative situation. Of course, humans are not perfect here either, and “miscommunication” or “misunderstanding” happens all the time. But these mechanisms are very intricate, and they remain, as of now, particularly challenging for computers to emulate. Let’s not forget that we are not aiming at perfection, but at climbing this ladder of complexity one step at a time.

Harnessing Insights for Real-Time Translation

Now, provided we have obtained some level of “understanding” of the communication by means of scene and frame activations, the pressing issue becomes how to harness these insights to improve translation, and how to do so in real time, that is, without knowing the full context of the conversation (which, incidentally, is another of the peculiar challenges of machine interpreting).

Two primary approaches surface: the implicit and the explicit. They can coexist harmoniously, but let’s briefly consider them separately. The implicit strategy involves using the LLM both to grasp the context and to simultaneously adapt the translation based on this comprehension. Essentially, the LLM directly offers, without external intervention, a more contextually appropriate translation thanks to its inherent processes. We have already been able to demonstrate impressive improvements (around 25%, depending on the language combination) simply by injecting an LLM into the translation pipeline and crafting instructions aligned with the task at hand.
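A minimal sketch of what such injection can look like, again assuming the OpenAI Python client; the setting description and prompt structure are illustrative assumptions, not the evaluated pipeline.

```python
# Sketch of the implicit strategy: the LLM translates directly, with the
# communicative setting folded into its instructions, so contextual
# adaptation happens inside the model. Prompts are illustrative.
from openai import OpenAI

client = OpenAI()

SETTING = (
    "You are interpreting a medical consultation. Translate each English "
    "utterance into German. Address the patient informally and prefer "
    "patient-friendly wording over clinical jargon."
)

def translate(utterance: str, history: list[str]) -> str:
    """Translate one utterance, letting prior turns supply context."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SETTING},
            {"role": "user",
             "content": "Previous utterances:\n" + "\n".join(history)
                        + "\n\nTranslate: " + utterance},
        ],
    )
    return response.choices[0].message.content
```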
While this method is straightforward and produces visible improvements, I find it less captivating, and it is not without drawbacks. More fascinating is the explicit strategy. Here we seek to extract insights from the scene/frame activations and use this meta-linguistic information to steer the translation process, for example by embedding this knowledge into dynamic prompt sequences. This bears resemblance to both In-Context Learning and the Chain-of-Thought prompting technique, but it requires significant modifications to address the unique challenges posed by spoken translation, which are too extensive to delve into here.
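To suggest the shape of such a dynamic prompt sequence, here is a sketch that reuses the toy Scene from above; the prompt layout and per-turn ordering are assumptions for illustration, not the modified technique itself.

```python
# Sketch of the explicit strategy: situational knowledge gathered in the
# scene is serialized into the translation prompt and refreshed between
# turns, so each decision is steered by the current context.
from typing import Callable

def translate_with_scene(utterance: str, scene: "Scene",
                         llm: Callable[[str], str]) -> str:
    prompt = (
        "You are interpreting a live conversation.\n"
        f"Situation: {scene.description}\n"
        f"Established facts: {'; '.join(scene.facts) or 'none'}\n"
        "Resolve references and choose terminology and register "
        "accordingly.\n"
        f"Translate into Spanish: {utterance}"
    )
    return llm(prompt)

# Per turn: integrate the new frame first, then translate with the
# enriched scene.
#   scene.integrate(utterance, llm)
#   print(translate_with_scene(utterance, scene, llm))
```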
