9 Types of Data Bias in Machine Learning

2021-03-23 19:08 TAUS

Bias, as a general term, is the tendency to lean in a certain direction, either in favor of or against a given topic, person, or thing. In data science, a biased dataset is one that does not fully represent a model's use case and therefore produces skewed or inaccurate results and analytical errors. The term bias entered the machine learning space with Tom Mitchell's 1980 paper, "The need for biases in learning generalizations". Perceptions of bias in data and ML models have changed greatly since then. In that paper, a model was considered biased in the sense that it gave more weight to some features in order to generalize better across a larger dataset with many other attributes. Nowadays, the concern is that the assumptions behind a generalized algorithm can produce systematic prejudice. When an ML model is trained on biased data, its output is expected not only to reflect stereotypes but to amplify them in society. One example was Amazon's resume-filtering tool, which was later discontinued because it showed bias against women. This concern has become more widespread as AI-enabled services and tools play a bigger role in our lives, from smart devices and bot communication to medical diagnosis and recruitment decisions. The first step toward eliminating such biases is being able to identify them. Here are nine types of bias that we have defined for you.

Selection Bias

Selection bias happens when the training data is not large or representative enough, resulting in a misrepresentation of the true population. Sampling bias, a form of selection bias, refers to any sampling method that fails to achieve true randomization before selection. For example, a voice recognition system trained only on audio data from speakers with a British accent will have difficulty recognizing speech when a speaker with an Indian accent interacts with it, resulting in lower accuracy.

Overfitting and Underfitting

When a model is trained on large amounts of data, it also starts learning from the noise and inaccurate entries in the dataset. As a result, the model fails to categorize the data correctly because it captures too much detail and noise. This situation is referred to as overfitting. Underfitting occurs when a machine learning model cannot capture the underlying trend of the data. It is commonly observed when there is too little data to build an accurate model or when a linear model is fit to non-linear data. Underfitting creates a model with high bias, and a high-bias model may not be flexible enough when predicting outcomes.

Outliers

Outliers are extreme data points that lie exceptionally far from the bulk of a dataset. They can be caused by measurement or input errors, or by data corruption. If an experiment's conclusions are based on the average, extreme data points will distort that decision and bias the output.
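To illustrate the average-versus-outlier point above, here is a minimal Python sketch using made-up latency measurements (the values and variable names are hypothetical, not from the article). It compares the mean and the median and flags outliers with the common 1.5 × IQR rule.

```python
import statistics

# Hypothetical response-time measurements in milliseconds; the 9800 entry
# simulates a data-entry error or corrupted record.
latencies_ms = [120, 135, 128, 142, 131, 138, 125, 9800]

mean_latency = statistics.mean(latencies_ms)      # pulled far upward by the outlier
median_latency = statistics.median(latencies_ms)  # barely affected by it

print(f"mean   = {mean_latency:.1f} ms")    # ~1339.9 ms
print(f"median = {median_latency:.1f} ms")  # 133.0 ms

# Flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(latencies_ms, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("flagged outliers:", [x for x in latencies_ms if x < lower or x > upper])  # [9800]
```

An average-based decision rule would conclude that typical latency is over a second, while the median tells the more representative story; checking for such points before aggregating is a cheap guard against outlier-driven bias.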
Measurement Bias

Measurement bias is linked to underlying problems with the accuracy of the training data and with how it is measured or assessed. An experiment that uses invalid measurement or data collection methods will produce measurement bias and biased output. For example, when testing a new feature of a mobile app that is available to both Android and iPhone users, running the experiment only on a subset of iPhone users means the results cannot truly reflect the whole user base, and this introduces measurement bias into the experiment.

Recall Bias

Recall bias commonly arises in the data labeling stage, when labels are given inconsistently based on subjective observations. In machine learning, recall measures the proportion of actual positive cases that a model labels correctly out of all positive cases, so inconsistent labels distort it directly. Let's say a group of test subjects reports how many calories they consumed per day over the last week. Because they cannot recall the precise amounts, they provide estimates. These estimates deviate from the true values, resulting in recall bias.

Observer Bias

Observer bias, or confirmation bias, occurs when the person conducting the experiment lets their expected outcome influence the study. It can happen when a researcher starts a project with subjective assumptions about its results, knowingly or unconsciously. An example can be seen in data labeling tasks where one annotator chooses a different label based on their subjective opinion while the other annotators follow the provided objective guidelines. Imagine the guidelines state that all tomato images should be tagged as fruit, yet one labeler believes tomatoes should be classified as vegetables and labels them accordingly. This results in inaccurate data.

Exclusion Bias

During data pre-processing, features that are considered irrelevant are removed. This can include dropping null values, outliers, or other seemingly extraneous data points. The removal may lead to exclusion bias: the removed data ends up underrepresented when the model is applied to a real-world problem, and the true accuracy of the collected data is lost. Imagine that referral rates from the English and Sinhala versions of a website are being compared. 98% of the clicks come from the English version and 2% from the Sinhala version. One might choose to leave out the 2%, assuming it would not affect the final analysis. By doing so, one could miss that Sinhala clicks convert at a higher rate than English clicks. This leads to exclusion bias and an inaccurate representation of the collected data.
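To make the website example concrete, here is a small Python sketch with hypothetical conversion counts (the 98%/2% click split comes from the text above; the conversion numbers and names are invented for illustration). It shows how dropping the minority segment hides the fact that it converts at a much higher rate.

```python
# Hypothetical click and conversion counts for the two site versions.
segments = {
    "english": {"clicks": 9800, "conversions": 196},  # 2.0% conversion rate
    "sinhala": {"clicks": 200, "conversions": 30},    # 15.0% conversion rate
}

for name, counts in segments.items():
    rate = counts["conversions"] / counts["clicks"]
    print(f"{name}: {counts['clicks']} clicks, conversion rate = {rate:.1%}")

# Excluding the Sinhala segment as "only 2% of traffic" would silently discard
# the best-converting audience and bias any conclusion drawn from the data.
```

A pre-processing step that reports per-segment rates before any rows are dropped makes this kind of exclusion bias visible early.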
Racial Bias

Racial bias, or demographic bias, occurs when the training data reflects only a certain demographic, such as a particular race. When a model is trained on racially biased data, its output can be skewed. Imagine that the image data used to train self-driving cars mostly features Caucasian individuals. Such cars will be more likely to recognize Caucasian pedestrians than darker-skinned pedestrians, making the technology less safe for darker-skinned individuals as it becomes more widespread. Other forms of demographic bias include class and gender bias, which affect training outcomes in similar ways.

Association Bias

Association bias skews or distorts the way a machine learning model learns which features go together, based on the training data. Essentially, it reinforces a cultural bias if the data was not collected thoughtfully. If the training dataset labels all pilots as men and all flight attendants as women, then for that specific model female pilots and male flight attendants do not exist, creating an association bias.
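One simple way to surface this kind of skew before training is to cross-tabulate a label against a demographic attribute and warn when a category only ever co-occurs with one group. This is a rough sketch; the records and field names below are hypothetical, not from the article.

```python
from collections import defaultdict

# Hypothetical labeled records: (occupation label, gender attribute).
records = [
    ("pilot", "male"), ("pilot", "male"), ("pilot", "male"),
    ("flight_attendant", "female"), ("flight_attendant", "female"),
    ("engineer", "male"), ("engineer", "female"),
]

# Collect which genders appear with each occupation label.
genders_by_occupation = defaultdict(set)
for occupation, gender in records:
    genders_by_occupation[occupation].add(gender)

for occupation, genders in sorted(genders_by_occupation.items()):
    if len(genders) == 1:
        only = next(iter(genders))
        print(f"warning: '{occupation}' only co-occurs with '{only}' "
              "in the training data -- possible association bias")
```

Run on the sample data, this flags 'pilot' and 'flight_attendant' but not 'engineer', which appears with both groups; the same check generalizes to any label/attribute pair.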
In Conclusion

Being aware of the potential types of bias in your dataset is the first step towards eliminating them. With that knowledge, it is important to monitor the data preparation process closely to make sure datasets are as bias-free as possible before they are used in the training phase. If you'd like to ensure your training data meets quality standards, contact us. With a highly specialized data science team and a community of over 3,000 data contributors, TAUS can help generate, collect, prepare and annotate text, image, or audio datasets fit for your project specifications.
