OCR: Still a Hassle for Translation Project Managers, But Improving

2020-09-24 19:20 Slator

Language industry project managers are all too familiar with this scenario: A client wants to translate a document in an uneditable file format. But, before anything else, the PM must put the document through a round of optical character recognition (OCR) just to determine the word count. The task can be further complicated if the document is handwritten or contains text in an unknown language (or both, for a real headache).

Many companies have found ways around the problem of OCR. For small businesses, Adobe Acrobat might get the job done; but as a company grows, it might explore other options, such as OpenText’s series of Capture engines. ABBYY FineReader Engine also offers a suite of recognition products, including OCR technology advertised as working for up to 200 languages.

Google, for its part, has sponsored further development of the open source OCR engine Tesseract since 2006; the engine was originally developed by Hewlett-Packard in the 1980s. The Google Cloud Platform also provides a tutorial on performing OCR using a collection of billable Cloud products. Amazon, meanwhile, prides itself on Textract’s ability to extract data from tables and charts while maintaining original formatting.

Each newcomer to the OCR scene touts its algorithms and technology as the definitive answer to the OCR challenge. Language service provider Tarjama, based in Dubai, UAE, has built proprietary OCR tech based on neural networks.

Singaporean startup Staple specializes in documents where layout is important, such as invoices, tax forms, and bank statements; users can input documents in 100 languages via WeChat, Google Drive, and Dropbox.

Sid Newby, creator and CTO of Cullable (and owner of the domain ocrsucks.com), embraces OCR’s bad reputation. He founded Cullable in 2015 based on years of experience in business litigation with eDiscovery (i.e., sifting through thousands of pages of documents for any possibly relevant information). Attorneys can miss a needle of critical evidence in a haystack of unsearchable text, which could be disastrous for their case.

Newby believes that the AI behind Cullable’s system makes it superior to competitors’ offerings. “Every page we process, essentially, we get a little bit better,” Newby told Slator. On completing and recognizing partial words in text, he said, “We’re trying to understand thoughts. Then AI improves upon that knowledge base with new datasets that come in.”

Available to consumers since 2019, Cullable serves customers who are predominantly US-based, with a few in the UK and South Africa. “Several translation companies have come to us with projects in the past,” Newby said. “They send us what they have problems with: poor image quality, skewed images, partially redacted words, handwriting.”

In addition to Cullable’s core OCR service, machine translation (MT) is integrated into the application. “Really good OCR machine translation sings and dances,” Newby said. “We use the Google Translate API because it’s native to our stack in Google.” Of course, a language service provider with its own proprietary MT engine would use that instead.
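For readers curious about what such a pipeline looks like in practice, below is a minimal sketch of an OCR-plus-MT workflow of the kind described above: it recognizes text from a scanned page with the open source Tesseract engine (via the pytesseract wrapper), reports the word count a project manager would need, and passes the result to the Google Translate API. The file name, language codes, and choice of libraries are illustrative assumptions, not details of Cullable’s actual stack.

```python
# Minimal OCR-to-MT sketch (illustrative only, not Cullable's implementation).
# Assumes: pytesseract plus the Tesseract binary, Pillow, and google-cloud-translate
# are installed, and GOOGLE_APPLICATION_CREDENTIALS points to valid credentials.
from PIL import Image
import pytesseract
from google.cloud import translate_v2 as translate

# Step 1: OCR a scanned page ("scan.png" is a placeholder file name).
text = pytesseract.image_to_string(Image.open("scan.png"), lang="eng")

# Step 2: the rough word count a PM needs before quoting the job.
print(f"Approximate word count: {len(text.split())}")

# Step 3: send the recognized text to the Google Translate API.
client = translate.Client()
result = client.translate(text, target_language="fr")
print(result["translatedText"])
```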
Looking ahead, OCR still stands to benefit from research. A September 2020 paper details how two researchers in Argentina created a dataset of annotated images from Japanese manga. The goal: enable OCR in manga at the pixel level. Existing annotated, pixel-level datasets, the authors wrote, typically consist of real-world images, which lack speech balloons. Most of the text is usually in English and is rarely hand-drawn in artistic styles, as it is in manga. Although this specific dataset was designed around manga, the principles behind it could be applied to OCR of Japanese texts in other domains.

A recent literature review, published in July 2020, laid out the limitations of OCR research thus far. First, most research deals with the most widely spoken languages on the planet, partly because datasets are often unavailable for languages with fewer speakers. It can also be difficult for systems to recognize characters handwritten by many different people, each with their own distinct handwriting.

Interest continues to grow in OCR of “text in the wild” (that is, on-screen characters and text in different settings), which might eventually be relevant to translators dealing with text in streaming media. But that may depend on the potential earnings at stake. The authors concluded that the commercialization of research needs to improve to help build “low-cost, real-life systems for OCR that can turn lots of invaluable information into searchable/digital data.”