Breaking the Publishing Ground: From Dictionaries to Linguistic Data


2021-09-14 21:50 TAUS



The TAUS Data Marketplace (DMP) has brought new opportunities to everyone, from individual linguists and LSPs to data and publishing companies, to leverage and monetize their content. The key to being part of the surging trend of language data for AI is successfully converting available multilingual content into language data that is directly usable for AI model training.

This remains challenging for many companies with roots in the publishing business. Lexicala is a notable example: it emerged from the publishing world as a provider of quality lexicographic content for leading dictionary publishers worldwide, has overcome this challenge, and has joined the TAUS Data Marketplace as a data seller. The company came across the Marketplace during its market research and decided that it would be an interesting platform for its business development goals, and that it would be fairly simple to adapt its data to publish and sell as language data.

We talked to Ilan Kernerman, CEO; Raya Abu Ahmad, Content Manager; and Maayan Orner, Software Manager, from Lexicala about the journey that led them onto this path.

Lexicala was established in Tel Aviv as K Dictionaries, which had its roots in English learner’s dictionaries. During the 1990s, K Dictionaries developed a unique collaborative network with publishing partners around its innovative customized dictionaries, and established its name as a pioneer in bilingual, pedagogical, digital, and user-oriented lexicography. These days, its most notable close partner in these domains is Cambridge University Press, whose dictionary website, the world’s most popular for learners of English, includes dozens of K Dictionaries titles.

“At the turn of the century, we expanded to multilingual lexicography and started exploring new methodologies and technologies. This has led to the creation of a systemic, ground-breaking series of monolingual datasets for selected world languages, focusing on the data structure and format, that served for developing fully bilingual language pairs and diverse multilingual combinations, and to our gradual evolution into a technology-driven content creator,” says Ilan.

Today, the company combines smart automated processes for data generation and validation with expert human editing to make its resources interoperable and beneficial for NMT and other NLP and AI applications, offering high-end cross-lingual lexical data under the new trade name Lexicala.

“TAUS Data Marketplace presents us with an excellent opportunity to reach more potential customers who can benefit from the added value of our parallel corpora to enhance the training of their ML models and improve the results of their NMT solutions,” adds Ilan, in line with their business strategy.

In August 2021, Lexicala uploaded to the TAUS Data Marketplace the first release of 357 bilingual datasets in 20 languages, totaling 1.7 million parallel segments with 43 million tokens. The languages include Arabic, Chinese (Simplified), Danish, Dutch, English, French, German, Greek, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish, as well as Latin, which is translated to French only. The segments in these datasets all stem from manually curated examples of usage and their translation equivalents. They consist only of full sentences and feature general language, i.e. domain-independent rather than vertical vocabularies.

Lexicala datasets are available for purchase on the TAUS Data Marketplace. Check the samples and start training!

“The data were created by our editors around the world based on corpus evidence and frequency for each language. They create, review, select and manually curate examples of usage as part of compiling dictionary entries for the most important lemmas, senses and multiword expressions. These usage examples are then translated by professional translators and are at the heart of the parallel corpora now available on DMP,” says Raya.

Lexicala explains that they faced several challenges in the process, such as noisy segments and mislabeled datasets. For the former, they developed an algorithm that eliminates noisy segments based on basic statistical and heuristic rules. For the latter, they developed and used another algorithm that classifies tuples as correct or incorrect, based on an existing language identification model and a feed-forward neural network. “Our custom model improved over the baseline language identification model (checking if the labeled language is the same as the identified language) significantly for the specific task, mostly for highly ambiguous and mutually intelligible language groups, such as the Nordic ones,” explains Maayan.

They hope that joining the TAUS Data Marketplace will increase their exposure in the MT training market and expand their clientele. “And vice-versa, the considerable volume and diversity of our data, and its advantages over more conventional automatically harvested parallel and comparable corpora, can help boost the appeal of the Marketplace to buyers,” says Ilan.

Lexicala also attaches the utmost importance to data privacy and ownership. “This has also been a vital topic in our discussions with TAUS, to make sure that the data we upload to DMP is both highly protected and serves customers uniquely for incorporating it into their NMT systems to upgrade inhouse processes and results without making them available as-is to others,” explains Ilan.

As the CEO of a data company with roots in the publishing industry, Ilan shares that publishers in the dictionary industry have traditionally been conservative about sharing their resources with others, but that this has been changing as dictionaries are made freely available online and generate revenue from ads. Lexicala is one of the early adopters; however, it seems that more companies from the publishing world are about to hop on the LD4AI (Language Data for AI) bandwagon soon.

Although it is difficult to predict the future, particularly given the fast-paced advancements in AI training, Lexicala expects the global NLP market to continue to grow enormously, citing a recent Fortune report that estimates its overall size will shoot up from USD 21 billion in 2021 to USD 127 billion in 2028. They expect growing demand, combined with more specialization and customization, for data sharing and marketplaces of all kinds.
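
For readers curious about what the noise-elimination step described above might look like in practice, here is a minimal Python sketch of a rule-based filter for parallel segments. Lexicala's actual rules are not given in this article, so everything here is an assumption made for illustration: the `Segment` type, the `is_noisy` and `filter_segments` functions, and the three thresholds (`min_chars`, `max_length_ratio`, `min_letter_share`) are hypothetical, and real thresholds would need tuning per language pair.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Segment:
    """One parallel segment: a source sentence and its translation."""
    source: str
    target: str


def is_noisy(
    seg: Segment,
    min_chars: int = 3,             # assumed threshold, not from the article
    max_length_ratio: float = 3.0,  # assumed threshold, not from the article
    min_letter_share: float = 0.5,  # assumed threshold, not from the article
) -> bool:
    """Return True if the segment trips any of the illustrative heuristics."""
    src, tgt = seg.source.strip(), seg.target.strip()

    # Rule 1: either side is too short to be a full sentence.
    if len(src) < min_chars or len(tgt) < min_chars:
        return True

    # Rule 2: the "translation" is an untranslated copy of the source.
    if src == tgt:
        return True

    # Rule 3: the two sides differ too much in character length.
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_length_ratio:
        return True

    # Rule 4: either side is mostly digits, punctuation, or whitespace.
    for text in (src, tgt):
        letters = sum(ch.isalpha() for ch in text)
        if letters / len(text) < min_letter_share:
            return True

    return False


def filter_segments(segments: Iterable[Segment]) -> List[Segment]:
    """Keep only the segments that pass every heuristic."""
    return [seg for seg in segments if not is_noisy(seg)]


samples = [
    Segment("The cat sat on the mat.", "Le chat était assis sur le tapis."),
    Segment("Hello", "Hello"),                          # untranslated copy
    Segment("OK", "D'accord"),                          # source too short
    Segment("12345 -- 67890 !!", "12345 -- 67890 ??"),  # no letters at all
    Segment("Yes.", "Oui, bien sûr, absolument, sans aucun doute possible."),
]
clean = filter_segments(samples)
```

With these sample pairs, only the first one passes all four rules. A character-length ratio is a crude signal for pairs that mix very different scripts (for example English and Chinese), which is one reason the thresholds above should be read as placeholders rather than recommendations.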

