Tags in the Lilt Editor: Then and Now

In written content, tags provide important formatting information, such as turning text into a hyperlink or making text bold. Lilt’s translation editor supports tags to save translators the manual effort of adding tags back into a translation. A translation editor needs the following three features:

- Tag parsing: Automatically parse user-uploaded content to extract tags.
- Tag projection: Project tags from the source content to the target content after translation. Lilt does this via machine learning -- see our blog post on Format Transfer for more details.
- Tag editing: Allow linguists to modify tag placement in the target content. Note that linguists should never be able to create or delete tags. If a linguist selects and deletes text containing tags, the tags should be moved rather than removed.

In this blog post, we’ll explore the efforts of our engineering team to improve our tag editing functionality and the challenges we’ve faced along the way.

The Lilt CAT Editor, Then

We’ll start our journey with the second version of Lilt’s Computer-Aided Translation (CAT) editor, internally known as CATv2. CATv2 existed in Lilt from September 2017 to May 2021. It was built in Angular 1.5 using Quill, a rich text editor framework. However, Quill didn’t support tags, so we built a distinct tag editing mode that was separate from the Quill text editor system.

In this mode-based system, after users added text in text editing mode, tags were invisibly projected from the source text into the target text. Linguists who wanted to view and edit tags would switch modes:

- Text mode only displayed text, and allowed linguists to add/edit target text for a segment.
- Tags mode displayed both text and tags, but only allowed linguists to edit tags.

Here is a screenshot of the original version of Lilt (with dummy German text -- ist ja keine richtige Übersetzung!), where you can see the “Tags” mode at the bottom that linguists had to switch to.

This system was clunky, but better than no tags!
Towards the end of 2019, a tag editor overhaul became crucial. The mode-based system was painful for linguists and production managers, who are responsible for delivering final translations to customers. Modifying text in text mode did not update the relative position of tags, and there was no way to see where the tags were without switching into tags mode. So, editing text after manipulating tags would often result in misplaced tags.

Further, changing tags in accepted segments took many clicks. Users first had to un-accept the segment, switch the segment to tags mode, move the desired tags, and then re-accept the segment. Production managers had to repeat this multiple times when running tag quality assurance.

The solution was to allow users to simultaneously edit text and tags in the editor, requiring a complete rework of CATv2. Make way for CATv3!

Strategy for Improvement

Proudly Found Elsewhere

We first thought of rebuilding the entire editor from scratch, thinking our needs to be highly specialized, and perhaps suffering from a bit of “not invented here” syndrome. But we soon realized that text editors are complex beasts, and we were reinventing the wheel. So, we decided to stop initial development work and reinvestigate existing solutions.

We chanced upon Slate, a customizable framework for creating rich text editors. Slate did not come with built-in support for tags, but it would allow us to build a custom tag editor system on top of it. Although it seemed to be in permanent beta and suffered from stale documentation, it was the most viable option for implementing a simultaneous text and tags editor. We stumbled at first, encountering long-standing open PRs, such as one fixing a bug with cursor movement in right-to-left elements, but eventually overcame these hurdles.

Syncing Tag Data

Lilt’s editor autosaves, meaning whenever a user makes a change in the HTML editor, those changes are automatically synced within the editor and the backing database.
Syncing is a challenge given that our custom tag system sits on top of a text editor that was not designed to handle tags. Specifically, we ran into the following issues in our previous implementation:

- Our tags in HTML lacked IDs, meaning they could not be uniquely referenced. Thus, the editor sometimes could not determine which tag had moved if more than one tag occupied the same position.
- We did not keep track of relative tag order, only the character position within the string. If multiple tags occupied the same position, there wasn’t a reliable way to determine the correct order of those tags.

The Lilt CAT Editor, Now

With the challenge in front of us, we set off to build our new text and tags system on top of Slate. We wrote the simultaneous text/tag system in React, although the rest of the surrounding editor, CATv2, remained Angular 1.5. Further, we used React DnD to create the drag and drop functionality for tags. And we modified the editor to give tags unique IDs in HTML so that they can be referred to consistently, ensuring proper syncing.

Once everything was put together, we ended up with a simultaneous text and tags editor that fell somewhere between a simple text editor and a WYSIWYG editor. Tags are now visibly projected when a segment is confirmed, and users can view and move tags directly in the editing area.

When we launched, we ran into many issues syncing the HTML display, editor system, and internal representation. The CATv2 Angular codebase was reaching its limit with both the tag system and the various business rules surrounding editing, confirming, and accepting segments. Adding new functionality or making changes was dangerous, as the code was dense, tightly coupled, and beginning to age.

Thus, after reworking the text/tags editor in React, we started rewriting the entire editor system in React + TypeScript, using modern coding conventions. We rolled out CATv3 as an optional beta to customers, allowing a fallback to CATv2 to retain productivity.
For a few months, we maintained both CATv2 and CATv3. We finally pulled the trigger and made CATv3 generally available in May 2021, and tags became a lot simpler to manage. Users rejoiced!

Tag Data Structure

Current Structure

Here is an example JSON structure of our tags, which typically come in pairs. Tags are uniquely identified by id, their absolute positions are given by the position key, and nested references are indicated via the parent key. solos are tags that do not have a pair, for example a standalone tag in HTML. nts refer to non-translatables, which are usually placeholders in the text that should not be translated, but are also represented as tag-like structures.

    {
      "pairs": [
        {
          "open": {
            "tag": "g id=\"1\" ctype=\"link\" equiv-text=\"[#$dp3]\"",
            "position": 47,
            "origin": "xliff"
          },
          "close": {
            "tag": "/g",
            "position": 68,
            "origin": "xliff"
          },
          "parent": -1,
          "id": 1
        },
        ...
      ],
      "solos": [],
      "nts": []
    }

Future Structure

The future JSON structure of our tags will be a one-dimensional array per segment, where each array represents all the tags in one text segment. The previous format can cause ambiguity in tag/nt positions if they are overlapping. The 1D array approach is a reduction in complexity, as it would allow us to pass the data structure through various layers of our system (starting from the database) without transformation. An example follows:

    [
      {
        "id": "1",
        "type": "openingTag",
        "position": 4,
        "name": "strong",
        "style": ...,
        "properties": {
          "id": "1",
          "ctype": "x-strong",
          "equiv-text": "strong"
        }
      },
      {
        "id": "1",
        "type": "closingTag",
        "position": 20,
        "name": "strong",
        "style": { },
        "properties": ...
      },
      ...
    ]

What’s Next

Our future improvements include continuing to resolve corner cases with tag de-syncing, and reconfiguring the internal representation of tags to be a 1D array.

Our tags solution has at times been programmatically crude, but functional. Keeping our linguists happy with a text editor -- a core part of their workflow -- while maintaining overlapping implementations is a hard challenge, but one we gladly tackle for our customers!
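To make the move from the pairs structure to the one-dimensional array more concrete, here is a minimal TypeScript sketch of how such a conversion might look. It is an illustration under stated assumptions, not Lilt's actual code: the interface names, the `tagName` helper, and the `flattenPairs` function are hypothetical, the style and properties fields are left out, and only pairs are handled (solos and nts are ignored).

```typescript
// Shapes mirror the two JSON examples in this post. All type and function
// names here are hypothetical and exist only for illustration.
interface LegacyTagEnd {
  tag: string;
  position: number;
  origin: string;
}

interface LegacyPair {
  open: LegacyTagEnd;
  close: LegacyTagEnd;
  parent: number;
  id: number;
}

interface LegacyTags {
  pairs: LegacyPair[];
  solos: unknown[];
  nts: unknown[];
}

interface FlatTag {
  id: string;
  type: "openingTag" | "closingTag";
  position: number;
  name: string;
}

// Hypothetical helper: extract the element name from a raw tag string such as
// 'g id="1" ctype="link"' or '/g'.
function tagName(raw: string): string {
  const [name = ""] = raw.replace(/^\//, "").trim().split(/\s+/);
  return name;
}

// Flatten every pair into an opening and a closing entry, then order the
// entries by character position.
function flattenPairs(legacy: LegacyTags): FlatTag[] {
  const flat: FlatTag[] = [];
  for (const pair of legacy.pairs) {
    const id = String(pair.id);
    flat.push({ id, type: "openingTag", position: pair.open.position, name: tagName(pair.open.tag) });
    flat.push({ id, type: "closingTag", position: pair.close.position, name: tagName(pair.close.tag) });
  }
  // Array.prototype.sort is stable in ES2019+, so entries that share a
  // position keep their insertion order. That tie-break is an assumption of
  // this sketch; it does not by itself guarantee correct nesting.
  return flat.sort((a, b) => a.position - b.position);
}
```

Once tags live in a single array, the array index itself records the relative order of tags that share a position, which is the information the position-only format could not express. A real conversion would likely need the parent key to order nested tags that start or end at the same character.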

