Social network giant Meta made a splash in August 2023 with SeamlessM4T, a model that offers various combinations of text and speech translation for at least dozens of languages.
What makes the multimodal model unique is its ability to perform in both text and speech — as opposed to siloing these capabilities in separate models.
In their 100-page paper on the model, the more than 60 authors encouraged comparisons to the science fiction touchpoint often invoked in hype for multilingual technological advancements. Observers obliged.
“How far are we from the BabelFish?” one asked on X. “Put this sucker in a phone and we are pretty much there.”
But, as any language industry veteran will attest, silver bullets seem to lose their luster upon closer inspection.
Responding to a LinkedIn post by Meta VP and Chief AI Scientist Yann LeCun, commenters asked whether SeamlessM4T offered speaker recognition, probed the model’s ability to handle source speech containing more than one language, and pointed out specific languages currently unavailable for certain speech/text translation combinations.
Even fans shared their reservations, or more measured takes. Translator William Dan praised SeamlessM4T for its speed and its ability to run on a user’s local GPU — naturally a better option for data protection than a model accessed online.
Still, Dan admitted, the text it produces is undeniably machine translation, with the accompanying grammatical mistakes and even issues with missing translations.
“But to be honest, if companies don’t cut translator rates with SeamlessM4T, I’ll post-edit the outputs without a single complaint,” Dan added.
Of course, with a model that provides more than “just” text-to-text translation, translators are not the only language professionals interested in how SeamlessM4T might fit into their work.
Notably, Claudio Fantinuoli, CTO of interpreting tech company Kudo, was quoted as saying that Meta’s tool, and others that similarly combine functions in one system, are the future. The quote appeared in an article on AI’s influence on simultaneous interpreting in the English-language edition of El País, a leading newspaper in Spain.
According to the article, Kudo already has 20 clients who use the company’s tool for automated interpreting; Kudo product manager Tzachi Levy describes this solution as “effective” for smaller meetings where human interpreters are not present.
Just as Kudo continues to improve its own offering — which could be likened to a real-time dub — Meta’s SeamlessM4T seems to be another big advance in speech translation research.
Beyond bragging rights, Meta’s interest in real-time speech translation is likely tied to the company’s goal of bringing its “metaverse” to the widest audience possible, with frictionless communication across languages as a selling point.
This was the case for Meta’s No Language Left Behind (NLLB) project, touted in July 2022 as helping speakers of thousands of languages access “new, immersive experiences in virtual worlds.” While NLLB’s focus was low-resource languages, its scale (200 languages and 40,000 possible translation directions) eclipses that of SeamlessM4T, at least for now.
Meta’s Massively Multilingual Speech project, which debuted in May 2023, brought the company one step closer to SeamlessM4T by offering the usually disparate speech-to-text and text-to-speech translation in a single system. Again, Meta went big, covering 1,100 languages, albeit with mixed results.
Voicebox, introduced in June 2023, was billed as the first model to generalize to speech-generation tasks (i.e., to be able to handle speech generation without specifically being trained for that kind of task). Trained on over 50,000 hours of recorded speech and transcripts in six languages, Voicebox can take text, or text plus audio, in one of the six languages as input and generate audio output in another language.
Unlike with its other recent advancements, Meta decided not to open-source the Voicebox model or code, citing the need to prevent misuse. SeamlessM4T, by contrast, is available on GitHub along with its code and metadata.
SeamlessM4T has already inspired a September 2023 hackathon, with 265 participants and eight AI applications (finalists TBD). The questions now are which direction Meta will take this latest development in, and how soon the public will find out.