Apple and USC Propose Solution for Gender Bias in Machine Translation


2024-08-26 08:13 slator


In a July 29, 2024 paper, researchers from Apple and the University of Southern California introduced a new approach to addressing gender bias in machine translation (MT) systems.

As the researchers explained, traditional MT systems often default to the most statistically prevalent gender forms in the training data, which can lead to translations that misrepresent the intended meaning and reinforce societal stereotypes. While context sometimes helps determine the appropriate gender, many situations lack sufficient contextual clues, leading to incorrect gender assignments in translations, they added.

To tackle this issue, the researchers developed a method that identifies gender ambiguity in source texts and offers multiple translation alternatives, covering all possible gender combinations (masculine and feminine) for the ambiguous entities. “Our work advocates and proposes a solution for enabling users to choose from all equally correct translation alternatives,” the researchers said.

For instance, the sentence “The secretary was angry with the boss.” contains two entities (secretary and boss) and could yield four grammatically correct translations in Spanish, depending on the gender assigned to each role. The researchers emphasized that offering multiple translation alternatives that reflect all valid gender choices is a “reasonable approach.”

Unlike existing methods that operate at the sentence level, this new approach functions at the entity level, allowing for a more nuanced handling of gender-specific references.

The process begins by analyzing the source sentence to identify entities (such as nouns or pronouns) with ambiguous gender references. Once identified, two separate translations are created: one using masculine forms and another one using feminine forms. The final step integrates these translations into a single output that maintains the grammatical integrity of the target language.

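To make the “four translations” arithmetic concrete, here is a minimal Python sketch that enumerates every masculine/feminine assignment for the two ambiguous entities in the example sentence. It is an illustration only, not the researchers' system: the lexicon `SPANISH_FORMS`, the helper names, and the fixed Spanish template are all assumptions made for this example, whereas the paper's method generates alternatives with MT models.

```python
from itertools import product

# Hypothetical, hand-written lexicon for this one sentence (an assumption,
# not data from the paper).
SPANISH_FORMS = {
    "secretary": {"M": "el secretario", "F": "la secretaria"},
    "boss": {"M": "el jefe", "F": "la jefa"},
}


def gender_alternatives(entities):
    """Return every masculine/feminine assignment for the given entities."""
    return [
        dict(zip(entities, combo))
        for combo in product("MF", repeat=len(entities))
    ]


def render(assignment):
    """Fill a fixed Spanish template for 'The secretary was angry with the boss.'"""
    subject = SPANISH_FORMS["secretary"][assignment["secretary"]]
    obj = SPANISH_FORMS["boss"][assignment["boss"]]
    # The adjective agrees with the gender of the subject (the secretary).
    adjective = "enfadado" if assignment["secretary"] == "M" else "enfadada"
    return f"{subject.capitalize()} estaba {adjective} con {obj}."


if __name__ == "__main__":
    # Two ambiguous entities with two genders each give 2 ** 2 = 4 alternatives.
    for assignment in gender_alternatives(["secretary", "boss"]):
        print(assignment, "->", render(assignment))
```

With n ambiguous entities the number of alternatives grows as 2^n. Working at the entity level lets a user pick the gender for each entity separately, as the article describes, instead of choosing among whole sentences.
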
To generate these translations, fine-tuned MT models or large language models (LLMs) can be employed.

The researchers highlighted that, when combined with a proper user interface, their approach allows translators to select the correct gender for each entity. “Our key technical contribution is a novel semi-supervised solution for generating alternatives that integrates seamlessly with standard MT models,” they explained.

This solution not only facilitates new translation interfaces with precise gender control but also aids human translators by automatically identifying ambiguities and suggesting alternative translations, they added.

To encourage further research, the researchers open-sourced training and test datasets for five language pairs: English into German, Spanish, French, Portuguese, Russian, and Italian.

Looking ahead, they plan to explore other genderless source languages, such as Chinese, Korean, and Japanese, and the unique challenges they present. They also aim to extend their approach to include non-binary and gender-neutral forms.

Authors: Sarthak Garg, Mozhdeh Gheini, Clara Emmanuel, Tatiana Likhomanenko, Qin Gao, and Matthias Paulik

