New Study Challenges LLM Dominance with Specialized Medical Translation Models


2024-08-14 12:40 Slator



While large language models (LLMs) are quickly replacing neural machine translation (NMT) models, as Unbabel's CTO João Graca mentioned in a recent Slator podcast, NMT is holding out in certain niche fields.

A December 2023 study from Logrus Global, Ocean Translations, and the University of Manchester found that fine-tuning small language models in the clinical domain produces significantly better translations than LLMs. A new study, published on July 26, 2024, builds on that finding: Bunyamin Keles, Murat Gunay, and Serdar Caglar from AI Amplified, an AI research and infrastructure company specializing in training AI models, further explored the power of tailored NMT models in medical translation.

Specifically, the AI Amplified team developed small NMT models tailored for medical texts, using the MarianMT base model. Diverging from the December 2023 study, they incorporated LLMs in the loop to create synthetic training data. "We've observed that LLMs are particularly effective at generating synthetic data, which has proven invaluable for training our models," Murat Gunay told Slator.

Their models were trained on both synthetic and real medical data sourced from scientific articles, clinical documents, and other medical texts, and cover English paired with six languages: German, Turkish, French, Romanian, Spanish, and Portuguese.

The authors argue that their LLMs-in-the-loop approach, combined with fine-tuning on high-quality, domain-specific data, enables these specialized NMT models to outperform general-purpose models and even some of the leading LLMs. They pointed out that models with more parameters do not necessarily yield better quality scores, stressing that the quality of the data and the fine-tuning process are often more important than model size alone. "LLMs may not necessarily be better [than NMT], and […] the quality of the data set and training is also essential," they emphasized.
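The study does not publish its pipeline, but the "LLMs in the loop" idea described above can be sketched in a few lines: an LLM generates parallel sentence pairs, a quality filter rejects degenerate generations, and the survivors are mixed with real data before fine-tuning. Everything here is illustrative, not AI Amplified's actual method: the function names, the length-ratio heuristic, and the stubbed LLM call are all assumptions.

```python
from itertools import cycle

def quality_filter(pair, min_ratio=0.5, max_ratio=2.0):
    """Reject pairs whose target/source length ratio suggests a degenerate generation."""
    src_len = len(pair["src"].split())
    tgt_len = len(pair["tgt"].split())
    return min_ratio <= tgt_len / src_len <= max_ratio

def build_training_corpus(real_pairs, generate_pair, n_synthetic):
    """Mix real medical sentence pairs with LLM-generated synthetic ones,
    keeping only the synthetic pairs that pass the quality check."""
    synthetic = [p for p in (generate_pair() for _ in range(n_synthetic))
                 if quality_filter(p)]
    return real_pairs + synthetic

# Stand-in for a real LLM API call; cycles through canned outputs, one of
# which is deliberately degenerate so the filter has something to reject.
_canned = cycle([
    {"src": "The patient reported severe chest pain.",
     "tgt": "Der Patient berichtete über starke Brustschmerzen."},
    {"src": "Administer 5 mg of the drug twice daily.",
     "tgt": "Ja."},  # far too short relative to the source: filtered out
])

def fake_llm_pair():
    return next(_canned)

real = [{"src": "No known drug allergies.",
         "tgt": "Keine bekannten Arzneimittelallergien."}]
corpus = build_training_corpus(real, fake_llm_pair, 4)
print(len(corpus))  # 1 real pair + 2 surviving synthetic pairs = 3
```

In a real pipeline the stub would be a chat-completion call and the filter would be far stricter (round-trip checks, terminology validation), but the shape — generate, filter, merge, fine-tune — is the same.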
The authors compared the translation quality of their models against Google Translate, DeepL, and GPT-4-Turbo across all language pairs. For the English-to-German medical translation model, they extended the comparison to include Claude-3.

Their models outperformed Google Translate, DeepL, and GPT-4-Turbo across multiple automatic evaluation metrics, including BLEU, METEOR, ROUGE, and BERT, as well as in evaluations by ChatGPT and Claude AI acting as "impartial judges." They opted for automatic and LLM-based evaluations over human evaluation "to mitigate time and cost constraints" while still obtaining "valuable insights into translation quality."

"Analysis […] demonstrates that our models achieve highly satisfactory and statistically significant results," they said, though they remain committed to continually improving their datasets and models to achieve even higher performance scores. To this end, they also highlighted the need for "more shared open-source benchmark test data." In an effort to standardize evaluation in this domain, they introduced a new medical translation test dataset. Their models are available for testing on their website, where users can explore demo translations and see the models' capabilities firsthand.

The authors' primary objective was to achieve "zero-error translation of medical texts," recognizing the risks that mistranslations can pose in healthcare settings. "A mistranslation between patient and physician can jeopardize patient safety," they said.

Despite the availability of some medical translation models in various languages, they pointed out that "there is still a great need for medical text translation models," given the "continued demand for high-end translation services" in the medical field. They also stressed that medical translation is "crucial" for bridging communication gaps, underscoring the "indispensable" role of machine translation in healthcare.
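BLEU, the first of the automatic metrics cited, scores a translation by how many of its n-grams (up to length 4) appear in a reference translation, combined as a geometric mean with a brevity penalty. The sketch below is a deliberately simplified single-reference version for intuition (no smoothing, whitespace tokenization), not the standard tooling the study would have used:

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: the fraction of the candidate's n-grams
    that also occur in the reference (counts capped at the reference's)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n n-gram precisions times a brevity
    penalty that punishes translations shorter than the reference."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # any empty overlap zeroes the geometric mean
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_avg)

ref = "the patient was administered 5 mg of the drug daily".split()
hyp = "the patient received 5 mg of the drug daily".split()
print(f"{simple_bleu(hyp, ref):.3f}")  # 0.591 for this near-match pair
```

One word swapped ("received" for "was administered") already costs a third of the score, which is why a metric like BLEU is usually read alongside METEOR, ROUGE, and embedding-based scores rather than alone.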
Designed for use by healthcare professionals and various stakeholders, these models aim to “significantly contribute to the global health community,” paving the way for “improved knowledge dissemination and better healthcare outcomes.” “This research […] paves the way for future healthcare-related AI developments,” the authors concluded.

