The Data + Signals Cycle in Translation

2020-11-18 21:50 TAUS

TAUS has just published a new report on the role of language data in the AI paradigm – LD4AI. It explores how language data moderation, supported by "humans in the loop," came to play its current role in machine-learning-driven translation pipelines, and how that role is scaling up. One finding is that large-scale data management will expand the range of jobs required. In this respect, it may be useful to understand how language also produces data beyond the translation moment, which will likely foster new types of work for language professionals. Let's look a little closer at why "language data" is a richer concept than you might think.

Language becomes data in two distinct ways – let's call them HLD (Human Language Data) and DLD (Digital Language Data).

Human LD is the visual or acoustic content that humans produce and consume: physical words on screens and paper, or speech transmitted through the air. Historically, all of this stacks up into a mass of recordings we now naturally call data or content. But in a networked economy, HLD also becomes a source of secondary data because of the way it signals socio-cognitive meanings. More on that below.

Digital LD is specifically digital: ultimately it consists of vectors of numbers representing various linguistic dimensions of HLD, used to prime a machine learning algorithm inside an AI software program. It is, in effect, the formal twin of HLD, used to teach a machine rather than acting interactively as a communication medium.

When we use DLD to drive a translation process, we select a body of bilingual text and train an algorithm to find patterns in it, so that the machine can then help produce a new batch of target-language data from new source text in the same language pair. To improve quality and machine-readability, we first clean up and enrich the source data: tagging phenomena such as untranslatables or ambiguous expressions, removing unwanted gender or racial bias, annotating named entities, and so on. This human-moderated data is then ready for the machine to learn from and to translate appropriately into another language. Data moderation, in other words, optimizes DLD for a machine learning or AI operation.

Data as Signals

Back in the social world of encounters between content, people and language, that same translated content will have a particular impact on each human reader. For them, language is not the mass of word embeddings and vectors familiar to neural MT engineers; it is a medium for messaging in a specific human tongue for some further purpose – informing, engaging, seducing, evaluating, making decisions, entertaining. And telling lies. Human language, in other words, is always grounded in speech acts that create various psychological effects.

And in today's online life, readers' and listeners' reactions – slowing down or speed-reading a text, hesitating over an unknown word, eyeballing a certain proper name for more than two seconds, "liking" it, requoting it, and so on – all become useful new data for the publisher of that text. These reactions are not the producer's language data in our LD4AI sense, but information about a receiver's behavior that signals attitude and sentiment, engagement or rejection. Surveillance data, if you prefer, though the term has dark connotations.
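To make the idea of reader reactions becoming data a little more concrete, here is a minimal Python sketch of how a publisher might log such reactions as structured events and roll them up into a simple per-segment engagement score. The event types, fields, and weights (ReactionEvent, WEIGHTS, and so on) are hypothetical illustrations, not anything specified in the TAUS report.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical reaction events a publisher might log for a piece of
# translated content: dwell time on a segment, hesitation over a word,
# a "like", a requote, and so on.
@dataclass
class ReactionEvent:
    reader_id: str
    segment_id: str     # which sentence/segment of the translation
    kind: str           # e.g. "dwell", "hover_term", "like", "requote"
    value: float = 1.0  # seconds of dwell, or 1.0 for a discrete act

# Illustrative weights: how strongly each reaction type signals engagement.
WEIGHTS = {"dwell": 0.1, "hover_term": 0.5, "like": 2.0, "requote": 3.0}

def engagement_by_segment(events):
    """Aggregate raw reaction events into a per-segment engagement score."""
    scores = defaultdict(float)
    for e in events:
        scores[e.segment_id] += WEIGHTS.get(e.kind, 0.0) * e.value
    return dict(scores)

if __name__ == "__main__":
    log = [
        ReactionEvent("r1", "seg-12", "dwell", 8.0),  # slow reading
        ReactionEvent("r1", "seg-12", "hover_term"),  # hesitation over a word
        ReactionEvent("r2", "seg-12", "like"),
        ReactionEvent("r2", "seg-03", "requote"),
    ]
    scores = {k: round(v, 2) for k, v in engagement_by_segment(log).items()}
    print(scores)  # {'seg-12': 3.3, 'seg-03': 3.0}
```

Aggregates like these are the "signals" described above: not language data in the LD4AI sense, but behavioral data about how the translated language landed with its readers.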
Signals as Data

Surveillance of this type is a constant in our own conversations: we instinctively scan each other's faces and body language for signs of assent, discord, doubt, collusion, or rejection. We have evolved to be alert to unusual word choices, voice tones, and hesitations. When we scan a tweet, we note tell-tale signs in the humor, register, misspellings, or word choices. Not all of these micro-signals are encoded clearly in the language itself, but they are easily inferred from the overall communicative experience. Indeed, one of the distinguishing marks of digital network life as a whole has been the automated surveillance of content for signs that yield useful data for other purposes. This is especially true of our acts of speaking and writing, reading and listening. Even silence can speak volumes...

Now that content owners, marketers, communicators, and internauts globally can elicit more insight by tracking reader-users' reactions to the varied signals encoded in acts of language, they will inevitably attempt to control the game by designing forms of language communication that amplify the desired signals. The aim is to optimize such audience reactions, even weaponize them, not only in written text but, even more effectively, in the spoken language now spreading through all our new voice channels. This form of DLD will also expand the range of potentially translatable content.

As part of this transition to LD4AI, therefore, we are entering a virtuous circle of mutual reinforcement between data and signals. Translation suppliers are already providing language data moderation services to better inform the machines that speak, write and translate; their journey may soon include harvesting new types of speech, signed, and text data derived from human reactions to their clients' translated content as well.
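To close with a concrete picture of the language data moderation mentioned above, here is a minimal Python sketch of what a single human-moderated bilingual training record might look like, with named entities annotated and untranslatable terms flagged. The schema and the ModeratedSegment / is_trainable names are assumptions made for illustration, not a format defined by TAUS or the LD4AI report.

```python
from dataclasses import dataclass, field

# One bilingual segment as it might look after human moderation:
# cleaned, annotated, and ready for an MT training pipeline.
@dataclass
class ModeratedSegment:
    source: str
    target: str
    lang_pair: tuple                                       # e.g. ("en", "de")
    named_entities: list = field(default_factory=list)     # spans to preserve verbatim
    do_not_translate: list = field(default_factory=list)   # product names, codes, etc.
    flags: list = field(default_factory=list)              # open bias or ambiguity issues

def is_trainable(seg: ModeratedSegment) -> bool:
    """A toy acceptance check: keep only complete segments with no open flags."""
    return bool(seg.source and seg.target) and not seg.flags

if __name__ == "__main__":
    seg = ModeratedSegment(
        source="Ask TAUS about the LD4AI report.",
        target="Fragen Sie TAUS nach dem LD4AI-Bericht.",
        lang_pair=("en", "de"),
        named_entities=["TAUS"],
        do_not_translate=["LD4AI"],
    )
    print(is_trainable(seg))  # True: cleaned, annotated, no open flags
```

In a pipeline of the kind described earlier, records like this would feed the training step, with the annotations helping the machine handle names and untranslatable terms appropriately.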