Alibaba Says Its FunAudioLLM Adds Original Tone and Emotion to AI Interpreting


2024-07-31 08:00 slator



In a July 11, 2024 paper, Alibaba Group’s Tongyi Speech Team presented FunAudioLLM, a family of large language models (LLMs) that integrates voice understanding and generation technologies to enable natural, speech-driven interactions.

The researchers explained that recent advances in artificial intelligence (AI) have transformed how humans interact with machines. Their key focus is “to enhance natural voice interactions between humans and LLMs” by developing models that can effectively process and generate speech, not just text, allowing for more natural, hands-free interactions between humans and AI systems.

The FunAudioLLM framework is built upon two core models: SenseVoice, a voice model for multilingual speech recognition and emotion detection, and CosyVoice, a text-to-speech synthesizer for speech generation. “FunAudioLLM leverages the strengths of SenseVoice and CosyVoice to push the boundaries of voice interaction technology, enabling more natural and seamless communication between humans and large language models,” said the researchers.

FunAudioLLM is designed to improve a variety of voice interaction applications, among them speech-to-speech translation. “By combining SenseVoice, LLMs, and CosyVoice, we can effortlessly perform speech-to-speech translation,” said the researchers. SenseVoice recognizes the input speech in its original language, the LLM translates the source language into the target language, and CosyVoice synthesizes the translated text into speech, producing audio that retains the user’s voice characteristics through cross-lingual voice cloning. “This allows users to speak in foreign languages using their own voice,” they noted.
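The three-stage pipeline the researchers describe (recognize, translate, then synthesize with voice cloning) can be sketched in code. This is a toy illustration with placeholder stubs: the function names, signatures, and return values are assumptions for exposition, not the actual SenseVoice or CosyVoice APIs.

```python
def recognize(audio: bytes) -> tuple[str, str, str]:
    # Stand-in for SenseVoice: multilingual speech recognition plus
    # emotion detection. Returns (transcript, language, emotion).
    return "你好，世界", "zh", "happy"

def translate(text: str, src: str, tgt: str) -> str:
    # Stand-in for the LLM translation step (here, a toy lookup table).
    toy = {("你好，世界", "zh", "en"): "Hello, world"}
    return toy.get((text, src, tgt), text)

def synthesize(text: str, reference_audio: bytes, emotion: str) -> bytes:
    # Stand-in for CosyVoice: cross-lingual voice cloning keeps the
    # speaker's timbre from reference_audio and the detected emotion.
    return f"[voice-of:{len(reference_audio)}B|{emotion}] {text}".encode()

def speech_to_speech(audio: bytes, target_lang: str) -> bytes:
    # Chain the three stages: the same input audio serves both as the
    # source of the transcript and as the voice-cloning reference.
    transcript, source_lang, emotion = recognize(audio)
    translated = translate(transcript, source_lang, target_lang)
    return synthesize(translated, audio, emotion)

out = speech_to_speech(b"\x00" * 16, "en")
```

The key design point carried over from the article is that the original audio flows through to the synthesis stage, so the output can preserve the speaker's voice characteristics and emotional tone rather than only the translated words.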
In a post on X, the researchers highlighted that this method not only improves translation efficiency and fluency but also captures the emotions and tones in the original speech, reproducing these emotional nuances in the translated speech. “This makes conversations more authentic and engaging,” they said, and “significantly reduces language barriers and communication losses” in contexts such as multilingual conference interpreting, cross-cultural communication, or providing instant voice translation services for non-native speakers. FunAudioLLM supports a wide range of languages, enhancing its utility in global applications. Demos and the code are available on GitHub.

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng

