Why Do Data Cleaning and Anonymization Matter?


2021-10-04 21:25 TAUS



Data cleaning is an essential step in machine learning that takes place before model training. It matters because your machine learning model can only produce results as good as the data you feed it. If your dataset contains too much noise, your model will learn that noise. Messy data can also break your model outright or lower its accuracy. Common data cleaning techniques include syntax error removal, data normalization, duplicate removal, outlier detection and removal, and fixing encoding issues.

Data anonymization is another important step in machine learning: it is the process of removing sensitive or personally identifiable information from datasets. For many organizations, data privacy laws make this step vital. Common data anonymization techniques include perturbation, generalization, shuffling, scrambling, and synthetic data generation. Synthetic data can be a good alternative when dealing with sensitive data. It can be generated in-house and can mirror the characteristics of naturally occurring data without including any personally identifiable information.
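
As a rough illustration of the techniques named above, the following Python sketch cleans a tiny hand-made dataset (encoding repair, normalization, duplicate removal, outlier removal) and then anonymizes it (dropping names, generalization, perturbation, shuffling). The record fields, the sample rows, the 1.5 x IQR outlier rule, and the noise scale of 500 are all assumptions made for this example and do not come from the article. Real pipelines usually need stronger safeguards than this.

```python
import random
import statistics
import unicodedata

# Hypothetical sample records, written with escapes so the bytes are unambiguous.
RAW = [
    {"name": "Jos\u00c3\u00a9 Silva", "age": 34, "city": "lisbon ", "income": 41000},  # mojibake
    {"name": "jos\u00e9  silva", "age": 34, "city": "Lisbon", "income": 41000},  # duplicate once normalized
    {"name": "Ana Costa", "age": 29, "city": "Porto", "income": 38000},
    {"name": "Li Wei", "age": 41, "city": "Faro", "income": 52000},
    {"name": "Omar Haddad", "age": 38, "city": "Braga", "income": 47000},
    {"name": "Eva Novak", "age": 250, "city": "Porto", "income": 45000},  # implausible age
    {"name": "Tom Berg", "age": 45, "city": "Lisbon", "income": 60000},
]


def fix_mojibake(text):
    """Heuristic repair for UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return text  # already fine, or not repairable this way


def clean(records):
    """Fix encoding, normalize text, drop duplicates, then drop age outliers."""
    seen, unique = set(), []
    for rec in records:
        name = unicodedata.normalize("NFKC", fix_mojibake(rec["name"]))
        row = {
            "name": " ".join(name.split()).title(),  # collapse spaces, unify case
            "age": rec["age"],
            "city": rec["city"].strip().title(),
            "income": rec["income"],
        }
        key = tuple(row.values())
        if key not in seen:  # duplicate removal on the normalized row
            seen.add(key)
            unique.append(row)
    # Outlier removal with the common 1.5 * IQR rule (an assumed threshold).
    q1, _, q3 = statistics.quantiles([r["age"] for r in unique], n=4)
    spread = 1.5 * (q3 - q1)
    return [r for r in unique if q1 - spread <= r["age"] <= q3 + spread]


def anonymize(records, seed=0):
    """Drop names, generalize age, perturb income, shuffle the city column."""
    rng = random.Random(seed)
    cities = [r["city"] for r in records]
    rng.shuffle(cities)  # shuffling: values stay realistic but leave their rows
    released = []
    for rec, city in zip(records, cities):
        decade = rec["age"] // 10 * 10
        released.append({
            "age_band": f"{decade}-{decade + 9}",  # generalization
            "income": round(rec["income"] + rng.gauss(0, 500)),  # perturbation
            "city": city,
        })
    return released


if __name__ == "__main__":
    for row in anonymize(clean(RAW)):
        print(row)
```

Cleaning runs before anonymization here so that duplicates and the implausible age are removed while the records can still be compared row by row. Once the city column is shuffled and the names are dropped, that kind of check is no longer possible.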
