TAUS Data Sale to Boost Multilingual LLMs


2024-04-09 07:05 multilingual



TAUS is now offering its comprehensive data collection of close to 7.4 billion words for sale at discounts of more than 97% off the original value. The sale ends on April 30, 2024. The 7.4 billion words on offer are all non-public, unique data of human translation quality, covering 483 language pairs.

TAUS has been collecting translation data since 2008 and has been selling it to Big Tech companies for the training of their MT engines for the last 15 years. Now the attention is shifting from MT to LLMs. LLMs are supposed to be good at translation as well, but they could be much better if trained on more high-quality multilingual data.

In the early days of Statistical and then Neural MT, TAUS data served a relatively small audience of a few dozen MT developers. The landscape has changed drastically since 2023. With GenAI and LLMs, there are thousands of new players interested in customizing and improving generic models. The TAUS multilingual data is particularly relevant and valuable because most LLMs have been trained almost solely (more than 90%) on English-language data. However, the rates TAUS has historically charged (1,500 to 2,500 euros per million words) are now too high for the new generation of smaller-scale users, who are less focused on generic models and more on customized models. That’s why the TAUS data is now available at steep discounts of up to 97%.

“There are shifts in the needs for data,” says Amir Kamran, solution architect at TAUS. “The LLM developers are now looking for data with a lot more context to improve the overall performance and accuracy of the language generation features. For the translation performance, they tend to rely on transfer learning, which results in underperformance of the multilingual and translation features of LLMs. The TAUS data helps to improve the translation quality scores by double-digit percentage points.”

Please contact TAUS or complete the online form to request the data catalog, samples, and pricing. You can purchase the entire collection or choose specific language pairs.