Interview with Dagmar Gromann

2020-09-21 17:20 | Terminology Coordination

Dagmar Gromann is a computer scientist and linguist currently working as a tenure-track Assistant Professor at the Centre for Translation Studies of the University of Vienna in Vienna, Austria. Before that, she worked at the International Center for Computational Logic at TU Dresden in Dresden, Germany, and was a post-doc research fellow in a Marie Curie Initial Training Network at the Artificial Intelligence Research Institute (IIIA) in Barcelona, Spain. She has worked with numerous project partners in the field of Artificial Intelligence and NLP, such as the German Research Center for Artificial Intelligence, to mention just one. She earned her doctorate from the University of Vienna under the supervision of Prof. Gerhard Budin. Among her primary research interests are ontology learning and learning structured knowledge utilizing deep learning methods. Other areas of Gromann's interest include, among other things, machine learning and cognitive theories. She has been a host, co-organizer, and member of numerous scientific committees and conferences, most recently EMNLP 2020, ISWC 2020, LREC 2020, IJCAI-PRICAI 2020, and AAAI 2020. She is active in the international language technology community as National Anchor Point contact person for the European Language Resource Coordination (ELRC) and as National Competence Center main contact for the European Language Grid (ELG). She is also a management committee member and working group leader in the expert network created by the COST Action NexusLinguarum (CA18209) on web-centred linguistic data science.

Looking at your resume, I noticed that your professional background covers a vast array of different topics: cognitive linguistics, translation, computer science, and even business. How does this experience relate to your work in terminology?

Let me start by explaining a little bit about those different research interests. In fact, they developed quite naturally out of my educational background, industry experience, and the research positions I have held in the past. For instance, I completed my PhD with a grant from the Vienna University of Economics and Business, which explains the focus on terminology in the domain of finance. My educational background, however, includes linguistics and computer science. Working as a translator, I became fascinated with computational approaches to terminology. For me, combining the most central language resource, terminology, with computer science seemed like a natural fit. Therefore, I started working on, among other things, computational concept modeling and terminology extraction. After completing my thesis, I joined the Artificial Intelligence Research Institute in Barcelona, Spain, where many people worked on mathematical models of embodied cognition. Their work sparked my interest, in particular the theory of image schemas, which has a clear connection to cognitive linguistics. This robust linguistic perspective prompted me to work on embodied cognition with a colleague of mine, Maria M. Hedblom, a cognitive scientist. The research I did in Barcelona and the connections and contacts I made there all shared a common terminological focus. Ultimately, the aim was to utilize my computational skills for terminology work and to integrate the cognitive component, answering the question of how image schemas help to analyze differences between languages in a specialized domain.
One of your research interests involves integrating terminologies with ontologies (i.e. ontoterminology). Could you briefly explain the differences between these two knowledge representation models? What does 'keeping ontological and terminological dimensions both separated and linked' in knowledge modeling mean?

Ontologies and terminologies both seek to organize, structure, and represent knowledge. Their perspectives on it, however, are radically different. Ontologies are computational artifacts that formally model the structure of reality or, to put it another way, they represent relevant entities and relations observable in certain situations or events. Let us take the example of a university: What can you observe there? Who are the main actors? What are their main actions? One might have relevant entities such as students, professors, researchers, and lecturers. There are some physical entities as well, such as the lecture_hall, offices, and so on. The idea behind ontology modeling is thus – similar to terminologies – to put these entities together, abstract their properties into concepts, and relate these concepts – or ontology classes – to each other. Such relations can be hierarchical: one thing is another, as in a student is_a person, a professor is_a person, a lecture_hall is_a room. What is left is to relate these entities with non-hierarchical relations, as in professor supervises students.

Ontologies are formal representation systems, which means that they must be machine-readable. Consequently, one can automatically draw conclusions about new knowledge based on the knowledge already existing; this process is called inference. It also means that one must represent knowledge in a strict way to avoid misinterpretation, as it is processed automatically. So, for instance, in our very basic relation of a professor supervising students, the relation is modeled as asymmetric or, to put it differently, one-directional: the professor supervises the student but not the other way around. This piece of information must be specified in the ontology by adding formal axioms and structures to avoid misinterpretation. As you can see, with this heavy focus on formality, natural language becomes secondary, and the main issue is to make reality and knowledge about it machine-readable.

Terminologies, on the other hand, are created based on the natural language used in a specific domain. Rather than starting from observations of entities, events, and actions in these domains, one starts at the language level. Natural language automatically reflects how human beings perceive, measure, and understand reality, which makes it a filtered version of reality – this is what we call epistemology. Terminologies are interested in HOW we talk about things more than in how things ARE in specific domains. In talking about these domains, we use linguistic expressions, which we then group to form concepts and relations between them. However, terminologies are not formal (or machine-readable), and hence one cannot automatically draw any conclusions from them. Also, terminology science has been somewhat weak on the definition of what a concept is: the literature and standards talk about concepts or concept systems, but they do not provide an answer as to their exact nature.

In my approach to ontology-terminology modeling, I strove to combine the strengths of both these resources. For example, an ontology's linguistic aspects can be enhanced by associating terminological information with ontological concepts. Conversely, you can provide a formal and strongly specified concept system for terminologies by using the ontology as the concept system. One should, however, bear in mind that, since these two resources have significantly different perspectives on knowledge, one cannot simply convert a terminology into an ontology (or the other way around). They must be kept separate and intact yet interlinked, which is made possible through Semantic Web standards and specified relations.
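A minimal sketch of the university example may make this concrete. It is written in Python with rdflib (an assumption; any OWL-capable toolkit would do), with an invented namespace: it encodes the is_a hierarchy, adds the formal axiom that supervises is asymmetric, and attaches multilingual labels to a class as a very simple terminological layer that stays separate from, yet linked to, the formal model.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/university#")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Hierarchical (is_a) relations: a student is a person, a professor is a
# person, a lecture hall is a room.
for cls in (EX.Person, EX.Student, EX.Professor, EX.Room, EX.LectureHall):
    g.add((cls, RDF.type, OWL.Class))
g.add((EX.Student, RDFS.subClassOf, EX.Person))
g.add((EX.Professor, RDFS.subClassOf, EX.Person))
g.add((EX.LectureHall, RDFS.subClassOf, EX.Room))

# Non-hierarchical relation: professor supervises student. The formal
# axiom declares it asymmetric, i.e. one-directional, so a reasoner
# cannot misinterpret it.
g.add((EX.supervises, RDF.type, OWL.ObjectProperty))
g.add((EX.supervises, RDF.type, OWL.AsymmetricProperty))
g.add((EX.supervises, RDFS.domain, EX.Professor))
g.add((EX.supervises, RDFS.range, EX.Student))

# Terminological dimension: multilingual labels linked to the concept,
# kept apart from the axioms above.
g.add((EX.Professor, RDFS.label, Literal("professor", lang="en")))
g.add((EX.Professor, RDFS.label, Literal("Professorin", lang="de")))

print(g.serialize(format="turtle"))
```

In richer setups the terminological dimension is typically modeled with a dedicated vocabulary such as OntoLex-Lemon rather than plain rdfs:label, which makes the separate-yet-interlinked design explicit.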
What advantages can be gained by integrating ontologies and terminologies? To what extent can the terminology and/or ontology communities benefit from it?

The advantage is a fully machine-readable resource with rich multilingual terminological information. It is something that industry can benefit from immensely. Not only is it possible to consult the knowledge, in the sense of searching for (multilingual) information and seeing what is out there, but also to reason on the knowledge that already exists.

Does this general ontology-terminology idea find practical application in terminology management in industry?

Yes. For instance, major airplane producers use ontologies to model requirements in airplane designs: how much space is needed for the feet, between the seats, etc. Modeling this kind of information with an ontology-terminology is a natural choice, especially in a multilingual context, not only for the creation of (multilingual) documentation but also for reasoning on previously collected knowledge.

Not only modeling but also publishing, sharing, and connecting terminological resources (LLOD) as part of the Semantic Web is an interest of yours. How can terminology (or linguistic resources in general) become part of the Semantic Web? What requirements (e.g. formats) must it fulfill?

The starting point for Linked Data (LD) was to specify several necessary principles that an LD resource must fulfill. The same principles apply to the Linguistic Linked Open Data (LLOD) cloud and to any kind of linguistic resource published as Linked Data. The key principles are: 1) the data has to be under an open license; 2) each element in a dataset needs to be uniquely identified; 3) it should be represented in a specific web standard, usually the Resource Description Framework (RDF), but it could also be another web standard such as the Web Ontology Language (OWL); and 4) it should be linked to an already existing resource, which gives you all the benefits of interlinked resources on the LLOD cloud.
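As a hedged illustration of these principles, the sketch below publishes a single terminological entry as Linked Data: a unique URI for each element (principle 2), RDF as the representation standard (principle 3), and a link to an already existing resource (principle 4), with the open license (principle 1) declared as dataset metadata. All URIs and the SKOS-based modeling are illustrative assumptions, not a prescribed format.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, SKOS

TERMS = Namespace("http://example.org/terms/")  # hypothetical base URI

g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

# Principle 1: declare an open license as metadata on the dataset itself.
g.add((URIRef("http://example.org/terms/"), DCTERMS.license,
       URIRef("https://creativecommons.org/licenses/by/4.0/")))

# Principle 2: the concept gets its own unique identifier.
concept = TERMS["supervision"]
g.add((concept, RDF.type, SKOS.Concept))

# Multilingual terminological information attached to the concept.
g.add((concept, SKOS.prefLabel, Literal("supervision", lang="en")))
g.add((concept, SKOS.prefLabel, Literal("Betreuung", lang="de")))

# Principle 4: link to an already existing resource on the LOD cloud.
g.add((concept, SKOS.exactMatch,
       URIRef("http://dbpedia.org/resource/Doctoral_advisor")))

# Principle 3: serialize in a web standard (RDF, shown here as Turtle).
print(g.serialize(format="turtle"))
```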
What are the benefits and limitations of publishing terminological resources as linked data?

Reuse is a benefit in itself, especially with this kind of format. It also allows one to interchange data easily, since it is globally available. With an open representation, resources evolve faster and are freely extendable. This differs significantly from a database, where it is difficult to add or change elements; LD, quite the opposite, is very flexible in terms of both adding and changing resources. One limitation may be that certain types of information currently cannot be represented, such as diachronic information for digital humanities and similar fields that utilize historical data and display the evolution of concepts in a language. This applies to other types of linguistic description too, for instance phonological, morphological, and multimodal information. This, however, is something I am happy to report we are working on quite actively in a COST Action called NexusLinguarum. COST Actions are networks of experts who come together to boost a certain field, which in this case is linguistic data science in general. Our main objectives at NexusLinguarum are to evaluate LLOD resources, approaches, and standards, to provide reports on state-of-the-art developments in linguistic data science, and to propose best practices, training schools, and training materials. We strive to extend the current state of research by coming up with solutions for how best to model different levels of description, such as diachronic, morphological, and phonological information. Another aim is to report on and expand ways of utilizing deep learning, Big Data, and other Natural Language Processing (NLP) techniques in the creation, use, and application of LLOD, including an extensive collection of use cases in various domains. This allows newcomers to the field to see what linguistic data science can do.

Semantic Deep Learning is the name of the workshop series you co-organize. Can you tell me more about it? What fascinates you about this topic?

Semantic Deep Learning refers to the combination of the Semantic Web and deep learning – the two research fields that have accompanied my career for the past couple of years. As the name suggests, it involves integrating ontologies and other types of Semantic Web technologies into deep learning to guide the machine's decisions. To this end, we have organized five workshops, co-located with major artificial intelligence (e.g. IJCAI), computational linguistics, and international Semantic Web conferences, to get different communities on board. We have also organized a special issue on Semantic Deep Learning in the Semantic Web journal. It is truly fascinating how creative people are in combining Semantic Web technologies with deep learning. Some even use this combination to provide explanations for deep learning, which remains an open research challenge: we understand the technical side, but how does the neural model learn the representations of texts and images that it uses to make predictions? There is still a lot to be discovered here.
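The interview does not describe a concrete architecture, so the following is only a toy sketch of the general pattern of letting Semantic Web knowledge guide a neural model's decision: a hand-coded subclass hierarchy (standing in for an ontology) masks out class predictions that contradict known context. All labels, logits, and the masking strategy are invented for illustration.

```python
import numpy as np

# Toy "ontology": fine-grained labels mapped to their parent class.
# In a real system this hierarchy would be read from an OWL ontology.
ONTOLOGY = {"dog": "animal", "cat": "animal", "car": "vehicle", "bus": "vehicle"}
LABELS = list(ONTOLOGY)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ontology_guided_predict(logits, required_parent):
    """Zero out labels whose parent class contradicts prior (symbolic)
    knowledge, then renormalize, so the ontology guides the decision."""
    probs = softmax(np.asarray(logits, dtype=float))
    mask = np.array([ONTOLOGY[label] == required_parent for label in LABELS])
    guided = probs * mask
    return LABELS[int(np.argmax(guided))], guided / guided.sum()

# The (made-up) network output slightly prefers "car", but the context is
# known to describe an animal, so the ontology corrects the prediction.
label, probs = ontology_guided_predict([1.2, 1.0, 1.3, 0.2], "animal")
print(label, probs.round(3))  # dog [0.55 0.45 0.   0.  ]
```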
What are your responsibilities as National Anchor Point contact person for the European Language Resource Coordination (ELRC) and as National Competence Center main contact for the European Language Grid (ELG)?

These are two different initiatives. The first one, ELRC, focuses predominantly on collecting and providing country-specific and multilingual language resources to train European Machine Translation (MT) systems. My role here is to keep track of publicly available language resources developed in Austria by different institutions and then to point these out and provide them to the EU. My colleagues from the Centre for Translation Studies (CTS), students, and I are actively creating language resources for this purpose too. Actually, ELRC had been operating long before I joined the CTS; I only recently took over from my dear colleague Gerhard Budin, who has been very active in this field and created a portal with publicly available resources. Additionally, we organize local workshops to bring industry and academia together in the field of language resources and technologies in Austria.

The second initiative, ELG, is an EU project. Its goal is to make language technologies available globally and publicly, in as neat a format as possible and in a consistent manner, on one holistic platform. The idea here is to provide web-based, easily accessible tools for machine translation, terminology extraction, and lexicography, among many others. My task is, again, on the one hand, to make this initiative known to the Austrian public and, on the other hand, to involve companies by asking them about their needs and possible contributions. Furthermore, we cooperate with public institutions such as the Language Center of the military in Austria.

What are your next research goals?

Though I have a couple of different machine translation-related projects, the most interesting ones relate to terminology: how can you integrate terminologies into neural machine translation models to guide decisions for low-resource languages, such as Standard Austrian German? Since we don't have enough data to train machine translation systems on Austrian German, we need to use a model that is already trained on English to German and try to readjust it for the Austrian standard variety. This is what I am currently working on: using Austrian German terminologies and integrating them into the training process to help the system learn this variety, which machine translation systems should take into account to be more usable for any application that requires Austrian German. (A sketch of one such integration technique follows after this answer.)

The other two major terminology-related projects concern term extraction. There are many term extraction tools, but none of them provide a full concept system, merely a list of term candidates. For this purpose, we want to build on ontology learning. This project, called Text to Terminological Concept System (Text2TCS), will be financed by and integrated into the ELG. The idea is to produce a very useful tool that can extract full terminological concept systems across languages.

Finally, my research activities reflect my cognitive interests. I try to use the idea of embodied cognition to analyze differences across languages in specific domains. This extends the idea of the ontology, which assumes general knowledge that exists in the world – universal knowledge – whereas the cognitive perspective focuses more on the individual, on the physical experiences people have with their bodies. I think it is interesting to bring this strongly individualized approach into the mix and then to analyze specialized natural language expressions across different natural languages.
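To make the terminology-integration idea sketched in this answer concrete, here is a minimal, hedged example of one widely used technique, source-side inline annotation: matched source terms are tagged with the desired target-variety term so that the model learns to produce it during fine-tuning. The tag scheme, the whitespace tokenizer, and the termbase entries (real German/Austrian German variant pairs, but arbitrarily chosen) are illustrative assumptions, not necessarily the project's actual setup.

```python
# German -> Austrian German term pairs; an invented, illustrative termbase
# rather than an actual project resource.
TERMBASE = {
    "Januar": "Jänner",       # January
    "Aprikose": "Marille",    # apricot
    "Tomate": "Paradeiser",   # tomato
}

def annotate_source(sentence: str) -> str:
    """Tag each termbase hit with the desired target term, e.g.
    'Januar' -> '<term> Januar <trans> Jänner </term>'."""
    out = []
    for tok in sentence.split():
        bare = tok.rstrip(".,;:!?")
        suffix = tok[len(bare):]  # keep trailing punctuation
        if bare in TERMBASE:
            out.append(f"<term> {bare} <trans> {TERMBASE[bare]} </term>{suffix}")
        else:
            out.append(tok)
    return " ".join(out)

print(annotate_source("Die Tomate reift im Januar nicht."))
# Die <term> Tomate <trans> Paradeiser </term> reift im
# <term> Januar <trans> Jänner </term> nicht.
```

During fine-tuning, the annotated source side is paired with references in the Austrian standard variety, so the model learns to carry the material marked as <trans> over into its output.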
Interview by Justyna Dlociok, former trainee at the Polish Translation Unit, DG TRAD, at the European Parliament. English Language and Linguistics, University of Vienna, Vienna; Specialized Translation and Language Industry, University of Vienna, Vienna.

