2020-12-22


Data is no longer just a good idea. Now that so many businesses are using and monetizing it, running the business on data and adopting this new trend have become essential for keeping up with the competition. According to the technology adoption lifecycle model, the first group of people to use a new technology is called "innovators," followed by "early adopters." Next come the "early majority" and "late majority," and the last group to adopt a new technology is called "laggards." Being among the early adopters brings a head start advantage.

A head start advantage can be defined simply as a company's ability to be better off than its competitors as a result of being first to market in a new category. Although no advantage lasts forever, companies that succeed in building durable head start advantages tend to dominate their categories for many years, from a market's infancy until well into its maturity.

The use of language data in AI and ML applications is a fast-growing technology that has almost passed the early adoption phase. The linguists and language service providers (LSPs) who see the business opportunity in this new data realm will have an advantage over the late majority and laggards.

TransLink, based in Russia and ranked #84 among global LSPs, is one of the early adopters of the language data monetization trend. "Having been among the earliest adopters of advanced technologies such as NMT, our team couldn't have let the opportunity the Data Marketplace offers slip," says Mikhail Gilin, Head of R&D at TransLink. The company sees its participation in the Data Marketplace as a business opportunity to monetize the language data it generates, and sees its early adoption as a significant head start for growing in this new space.
Additionally, they believe that by making their multilingual data available, they will be providing a valuable resource for research into technologies that benefit the language industry as a whole.

On the Data Marketplace, they have published five corpora in the news and sports domains, from Russian to English, German, Spanish and French. This bilingual translation data was collected over the course of the 2018 World Cup, for which TransLink was the main language service provider for written communications.

Whenever data sharing comes up, the recurring question for many LSPs is that of data ownership and privacy. "It's significant to differentiate between the Translation Memories (TM) and the corpora we upload," says Mikhail. "The datasets can be crawled or based on an existing TM and in that case the segments are altered beyond recognition." He also highlights that they ensure all numeric units are reduced so that they no longer represent any personally identifiable information, and that private information is removed or replaced before a dataset is made available for purchase.

As the Data Marketplace continues to grow, so do the expectations of existing and potential data sellers and buyers. "We have multilingual corpora, and we are considering uploading multilingual data into the Data Marketplace. That improvement could let LSPs form quality multilingual MT systems, both for production and research purposes," says Mikhail, adding that "the latter is very important, too, because the collection of the exact same corpus in different languages allows for adequate research of the NMT algorithms. We see that as a great scientific opportunity for those doing linguistic research in the translation industry."

By joining the Data Marketplace ahead of many other LSPs, TransLink is securing its share of the emerging market for language data for AI. "This platform is unique for the moment. And by the time there are any other similar platforms, Data Marketplace will provide a huge advantage both in terms of data processing expertise and overall volume of the uploaded corpora," says Mikhail Gilin.


