ByteDance Unveils a ‘Human-Like’ Speech Translation System

2024-08-28 06:00 slator

On July 31, 2024, ByteDance’s Cross Language Agent Team presented a system designed to deliver “high-quality” and “human-like” simultaneous speech translation (SiST).

The researchers underscored the complexity of SiST, describing it as “one of the most challenging tasks in the translation domain.” Despite notable advancements in academic and commercial SiST models, they acknowledged that “the translation quality is still far from satisfactory,” highlighting the need for a more effective solution.

Inspired by the success of large language models (LLMs) in machine translation (MT) and speech translation, the ByteDance team leveraged LLMs to tackle the SiST challenges. Their solution is a Cross-Lingual Agent that performs Simultaneous Interpretation (“CLASI”) through a systematic execution of various operations.

CLASI operates through a structured five-step process, starting with the processing of incoming audio data. To mimic professional human interpreters, who often break down sentences into smaller “semantic chunks” based on natural pauses, punctuation marks, and meaning, CLASI employs a “data-driven policy learning” method.

By training on human-annotated speech data, CLASI learns how to recognize natural breaks in speech, developing a robust “read-write policy” that guides it on when to listen (read) and when to translate (write) during the speech.

In the second step, CLASI employs a multi-modal retriever to access relevant information from an external knowledge base.

The third step involves retrieving context from the last round memory, which stores data from previous translations. By appending this retrieved information from the external knowledge base and the context from the translation memory into the LLM agent’s prompt, CLASI dynamically integrates relevant knowledge, significantly improving the accuracy and coherence of its translations, according to the researchers.

After processing the input and retrieving relevant information, CLASI generates the transcription (if needed), the translation output, and a timestamp that indicates when the current translation round ends. This timestamp allows the system to determine where to begin for the next round of audio input. It then updates its memory with the new translations, ensuring the retention of context for future processing. This cycle then restarts from step one for the next speech segment.

“Supported by LLMs, our approach can generate error-tolerated translation by considering the input audio, historical context, and retrieved information,” the researchers said.

To assess CLASI’s performance, the team developed a new human evaluation metric called “VIP” (Valid Information Proportion), which measures the amount of information that can be successfully conveyed to listeners during simultaneous speech translation/interpretation.

According to the researchers, VIP better reflects the performance of SiST systems in real-world scenarios. They tested CLASI against other top simultaneous interpretation systems, both commercial and open-source, and found that CLASI outperformed them “by significant margins.”

CLASI achieved a VIP score of 81.3% for Chinese-to-English and 78.0% for English-to-Chinese translations. In contrast, state-of-the-art commercial or open-source systems only achieved VIP scores of 35.4% and 41.6%, respectively. Even on extremely challenging datasets, where other systems scored under 13% VIP, CLASI maintained a VIP of 70%, said the researchers.

The researchers ventured as far as stating that “these results are close to the performance of human interpreters, who typically achieve around 80% VIP.”

The researchers believe the system can be applied in various scenarios to facilitate cross-lingual communication, such as international conferences and daily meetings, enabling attendees to understand speeches in different languages.

CLASI can also function as a system-level translation module, enhancing the viewing experience for users watching videos in foreign languages by providing real-time translations, added the researchers.

In the online gaming sector, CLASI could aid communications among players speaking different languages, fostering a more inclusive gaming environment. Additionally, with its “human parity performance,” it could improve the efficiency of professional human interpreters, claim the researchers.

“With the powerful translation ability of CLASI, we believe it can further make cross-lingual communication seamless across different places all over the world,” the researchers concluded.

Looking ahead, the ByteDance team plans to expand CLASI to support additional languages, including low-resource ones.

Demonstrations and human-annotated test sets are available on GitHub.

Authors: Shanbo Cheng, Zhichao Huang, Tom Ko, Hang Li, Ningxin Peng, Lu Xu, Qini Zhang
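
For readers who think in code, the five-step cycle described earlier (read a semantic chunk, retrieve from a knowledge base, recall the previous rounds, generate output with an end timestamp, update memory) might be sketched roughly as follows. This is only an illustration, not ByteDance's implementation: every name here (`read_chunk`, `retrieve`, `interpret`, `echo_translate`, `RoundOutput`) is hypothetical, timestamped text tokens stand in for audio, a punctuation check stands in for the learned read-write policy, a substring lookup stands in for the multi-modal retriever, and a stub function stands in for the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: timestamped text tokens stand in for the incoming audio stream.
Word = Tuple[float, str]  # (end time in seconds, token)


@dataclass
class RoundOutput:
    transcription: str  # source text of the chunk handled this round
    translation: str    # translation emitted this round
    end_time: float     # timestamp where this round ends


def read_chunk(stream: List[Word], start: float) -> List[Word]:
    """Step 1: read until a chunk boundary is reached.

    A punctuation check is a toy stand-in for the learned read-write policy.
    """
    chunk: List[Word] = []
    for end, token in stream:
        if end <= start:
            continue  # already consumed in an earlier round
        chunk.append((end, token))
        if token.endswith((",", ".", "?", "!")):
            break  # boundary found: stop reading, start writing
    return chunk


def retrieve(text: str, knowledge_base: Dict[str, str]) -> List[str]:
    """Step 2: look up terms from an external knowledge base (substring match)."""
    lowered = text.lower()
    return [f"{term}: {note}" for term, note in knowledge_base.items()
            if term.lower() in lowered]


def interpret(
    stream: List[Word],
    knowledge_base: Dict[str, str],
    translate: Callable[[str, List[str], List[str]], str],
    memory_size: int = 2,
) -> List[RoundOutput]:
    memory: List[str] = []  # translations from previous rounds
    outputs: List[RoundOutput] = []
    cursor = 0.0            # where the next round starts reading
    while True:
        chunk = read_chunk(stream, cursor)             # step 1
        if not chunk:
            break                                      # stream exhausted
        text = " ".join(token for _, token in chunk)
        hints = retrieve(text, knowledge_base)         # step 2
        context = memory[-memory_size:]                # step 3
        translation = translate(text, hints, context)  # step 4
        cursor = chunk[-1][0]                          # end timestamp of this round
        outputs.append(RoundOutput(text, translation, cursor))
        memory.append(translation)                     # step 5
    return outputs


def echo_translate(text: str, hints: List[str], context: List[str]) -> str:
    """Stub in place of the LLM: reports what it was given instead of translating."""
    return f"[{len(hints)} hints, {len(context)} prior] {text}"


demo_stream = [(0.5, "Hello,"), (1.0, "welcome"), (1.6, "to"), (2.0, "CLASI.")]
demo_kb = {"CLASI": "keep the system name untranslated"}
for r in interpret(demo_stream, demo_kb, echo_translate):
    print(r.end_time, r.translation)
```

The point of the sketch is the control flow: each round's end timestamp becomes the starting point of the next read, and each round's translation is carried forward as context for the next.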
