Most machine translation (MT) still needs a human touch before it is deemed usable in a professional context. But even without post-editing, machine-generated text is becoming harder to distinguish from human writing.
Yet these gains have one drawback: It is now easier for MT to be abused for malicious purposes, such as plagiarism and fake reviews.
Even Google is having trouble keeping up. In a 2018 Google Webmasters Hangouts session, Senior Webmaster Trends Analyst John Mueller told participants it was possible that some machine-translated content had become fluent enough to fool Google’s own ranking algorithms.
A year later, on October 22, 2019, Mueller responded to related questions on Twitter, stating that text translated by services such as DeepL or Google Translate does not automatically trigger a Google penalty or manual action. (The caveat, though, is that if the translation quality is poor, the content might not rank well.)
Traditional methods of detecting machine translation have their own shortcomings, such as ignoring semantics or only working well with large texts. But an approach explored by Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, and Shinsaku Kiyomoto, researchers at KDDI Research and the University of Tokyo, may help.
The team randomly selected 2,000 English-to-French sentence pairs to test the new method. In this case, researchers used the English sentences as the human-written text, and the French translations were back-translated into English using Google Translate. The resulting English texts were then back-translated by Google Translate (into French, back into English) several more times.
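The repeated round trips described above can be sketched in a few lines. This is a minimal illustration, not the researchers' code: `translate` is a hypothetical callable standing in for whichever MT service is used (the article names Google Translate), and `toy_translate` is a made-up stand-in that only lowercases text so the sketch runs offline.

```python
from typing import Callable, List

# Hypothetical signature: translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def round_trips(text: str, translate: Translator, rounds: int = 3,
                source: str = "en", pivot: str = "fr") -> List[str]:
    """Return [text, 1st back-translation, 2nd back-translation, ...]."""
    versions = [text]
    for _ in range(rounds):
        # One round trip: source language -> pivot language -> source language.
        pivot_text = translate(versions[-1], source, pivot)
        versions.append(translate(pivot_text, pivot, source))
    return versions


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Made-up stand-in for an MT system: it just lowercases the text."""
    return text.lower()
```

Calling `round_trips("Hello World", toy_translate, rounds=2)` yields the input followed by two back-translations. With the toy translator the text changes on the first pass and then stays fixed, which loosely mirrors the stabilizing behavior the study relies on.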
According to the team’s October 2019 paper, human-written sentences showed more variation in word usage and structure between back-translations than machine-translated sentences did. In other words, the more times a text had been machine-translated, the more closely its next back-translation resembled the text it was produced from.
The researchers then used BLEU scores to estimate the similarity between a text and its back-translation, and used that similarity to identify content that had been machine-translated or machine back-translated.
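To make the idea concrete, here is a rough sketch of scoring a text against its back-translation. It rests on several assumptions: `bleu` is a simplified sentence-level BLEU with add-one smoothing rather than the exact configuration used in the paper, `back_translate` is a hypothetical callable wrapping an MT round trip, and the fixed `threshold` is purely illustrative, since the article does not say how scores are turned into a decision.

```python
import math
from collections import Counter
from typing import Callable, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with add-one smoothing, between 0 and 1."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = ngrams(ref, n), ngrams(cand, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty lowers the score of candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


def looks_machine_translated(text: str, back_translate: Callable[[str], str],
                             threshold: float = 0.6) -> bool:
    """Flag text whose back-translation stays very close to it (illustrative threshold)."""
    return bleu(text, back_translate(text)) >= threshold
```

A text that survives the round trip unchanged scores 1.0 and is flagged, while a text whose back-translation shares no words with it scores far below the threshold. In practice the cut-off would have to be tuned on labeled data, or replaced by a trained classifier.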
Based on their experiments with English and French, and on later experiments with Japanese, the team concluded that their method outperformed previous ones.
Future research will evaluate how well the new technique can identify problematic text, such as fake news, which may become even more relevant to Google if the tech giant ever changes its webmaster guidelines to permit machine-generated content.
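Post-editing of the Chinese translation: Luo Wenxin (Sun Yat-sen University)
The Chinese version of this article is a machine translation and contains varying degrees of deviation and error; please bear this in mind and refer to the English original.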