How to Improve Data Quality? -- Translation Technology Express

We live in a data-driven world where many of society’s key decisions, whether governmental, industrial, or commercial, are based on data. Data science and AI (artificial intelligence) would not be possible without an abundance of data. Now that data has become mainstream in almost every industry, its quality matters more and more. High-quality data is frequently talked about and sought after, but what does data quality really mean?

High-quality data can be defined as any qualitative or quantitative data that is captured, stored, and used for its intended purposes. “Quality” here means the data is accurate, complete, clean, consistent, and valid for its intended use case.

There are a few key ways to improve the overall quality of data: defining and refining data integrity practices, ensuring proper data sourcing, applying data cleaning techniques, and choosing sound data storage methods.

Data Capture

Given the context of your data, how is the data being sourced, and are these sources trusted? Once your data sourcing is validated, the next step is to assess whether your data complies with data entry standards. These standards often depend on the business context. In general, this means defining a set of guidelines your data must meet before it can be used or stored for business purposes. The guidelines can cover required features, duplication management, deletion of records, formatting, and data privacy standards. Companies often have their own unique set of data entry standards.

Data Modeling

Once your data has been properly captured, another way to improve data quality is data organization and storage, also known as data modeling. Data modeling bridges the gap between business processes and data by analyzing the data requirements needed to support the business.

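To make the idea of a data model more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Source` and `Dataset` entities, their fields, the `build_index` and `orphaned` helpers, and the sample values are hypothetical and do not describe any particular organization's model. It shows two things a data model typically provides: an explicit relationship between elements (each dataset refers to a source) and an organizing scheme that makes data easy to find.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A hypothetical data provider."""
    source_id: int
    name: str


@dataclass(frozen=True)
class Dataset:
    """A hypothetical dataset that belongs to exactly one source."""
    dataset_id: int
    source_id: int  # relationship: refers to Source.source_id
    domain: str
    language_pair: str


def build_index(datasets):
    """Group dataset ids by (domain, language_pair) for quick lookup."""
    index = defaultdict(list)
    for ds in datasets:
        index[(ds.domain, ds.language_pair)].append(ds.dataset_id)
    return dict(index)


def orphaned(datasets, sources):
    """Return datasets whose source_id matches no known source."""
    known = {s.source_id for s in sources}
    return [ds for ds in datasets if ds.source_id not in known]


sources = [Source(1, "Vendor A"), Source(2, "Vendor B")]
datasets = [
    Dataset(10, 1, "legal", "en-de"),
    Dataset(11, 2, "legal", "en-de"),
    Dataset(12, 2, "medical", "en-fr"),
    Dataset(13, 3, "medical", "en-fr"),  # source 3 does not exist
]

index = build_index(datasets)
print(index[("legal", "en-de")])
print([ds.dataset_id for ds in orphaned(datasets, sources)])
```

Because the relationship is explicit, a broken reference such as dataset 13 can be detected mechanically instead of surfacing later as an unexplained error.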
This practice establishes the relationships between data elements and structures within the business model. Diagrams or other visual representations are often used to model the flow of data within the organization. As the scale and volume of data increase, data modeling becomes more vital for ensuring data consistency. The TAUS Data Marketplace is a good example of hundreds of data sources organized and stored in an efficient model according to several differentiators, including domain and language pair.

Data models have many benefits within an organization, namely improved software development, analytics, application performance, risk management, data tracking, and documentation, as well as faster market turnarounds. When data is organized, it becomes easier to understand and easier to apply any layer of analytics or modeling on top of it. This in turn improves data quality: data modeling yields fewer errors in the data, better documentation, and fewer errors across the organization overall.

Data Integrity

Data quality is directly tied to data integrity. Data integrity refers to the accuracy, consistency, completeness, and reliability of data throughout its lifecycle. It is introduced during the database modeling and design phase and enforced through standard procedures and rules consisting of various validation checks. Hence, data with high integrity is an indication of high data quality as well.

Each organization creates these procedures independently, and there is no one-size-fits-all method. There are, however, some procedures common to many enterprises. One example is following a software development lifecycle (SDLC), which is a set of guidelines and rules for following standard business practices while building any application.

SDLC methodologies improve data quality because they provide a scalable view of the organization from both a technological and a business standpoint. They give the organization a way to track all of its data, applications, tests, code, transactions, and so on. Furthermore, an SDLC shows exactly how the data is being used, as well as any vulnerabilities or areas for improvement.

Data Cleaning

Once data integrity practices and guidelines are established, the next step toward better data quality is to clean the data. Data cleaning is a part of data integrity and an essential part of any data science and AI use case. Clean data yields better algorithms and results. Messy, noisy data can mask potentially insightful results or introduce bias into your outcomes. Hence, data cleaning is an effective measure for ensuring data quality.

Cleaning your data can fix common errors such as syntax problems, type conversion issues, and duplicates. There are a variety of data cleaning techniques, and the right choice ultimately depends on the use case and business model. Some common techniques are data normalization, standardization, anonymization, duplicate prevention, and data inspection. Data inspection practices can help identify incorrect and inconsistent data points. Data normalization helps ensure consistency in case, abbreviations, and other syntax-related matters. For example, we can normalize data points such as “U.S.” and “America” to a single representative entry such as “United States.” TAUS Data Services is an example of a service that cleans, prepares, secures, fine-tunes, and customizes data.

Impact of Data Quality on Machine Learning

One area of AI that shows how profound the impact of data quality can be is machine learning. Machine learning is a subset of AI, defined as a set of methods that automate model building and decision making.

Machine learning models often require training data to build intuition and make decisions, and their performance generally improves with more time and more training data. It is easy to see how inconsistent, noisy, or subpar data will skew a model’s output. In some cases, this can have drastic business implications.

One way to build intuition about data quality in a machine learning model is to assess bias and variance. Underfitting occurs when the model has not sufficiently captured the underlying patterns of the data and has over-generalized. This means there was high bias, which can indicate that there was not sufficient data for the model to train on. High variance, on the other hand, occurs when the model learns noise from the training set. This leads to overfitting, where the model fits the training data too closely and generalizes poorly to new data. In both scenarios, the quality of the training data plays a key role in the outcome of the model’s training phase. Here, the factors that made the training data low quality were either too little data or noisy data. You can improve the quality of the data by increasing the size of the training set and applying proper data cleaning techniques to remove noise, as mentioned above.

Data Quality Review

Data quality review is an essential part of any organization. The above diagram outlines the lifecycle of data and shows how data quality improves through each phase. It is important to note, however, that organizations may structure these phases differently according to their standards and processes. One example of a real-world data platform is the Human Language Project (HLP). HLP is a micro-task-based platform where people can generate and annotate data or evaluate data quality in given domains and projects.

The more steps that are taken for data to accurately represent real-world constructs, the more trustworthy and meaningful the results can be. Through proper data capture, integrity practices, and data modeling methodologies, data quality can be significantly improved.
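
As an illustration of the data cleaning techniques discussed earlier, the following Python sketch combines normalization, a simple inspection step, and duplicate removal. It is only a sketch under stated assumptions: the `COUNTRY_ALIASES` table, the `name` and `country` fields, the function names, and the sample records are all hypothetical. In practice the rules would come from an organization's own data entry standards.

```python
# Hypothetical alias table: maps known variants to one canonical value.
COUNTRY_ALIASES = {
    "u.s.": "United States",
    "us": "United States",
    "usa": "United States",
    "america": "United States",
    "united states": "United States",
}


def normalize_country(value):
    """Trim whitespace, ignore case, and map known aliases to a canonical name."""
    key = " ".join(value.split()).lower()
    return COUNTRY_ALIASES.get(key, value.strip())


def clean(records):
    """Normalize the country field, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name", "").strip()
        country = rec.get("country", "").strip()
        if not name or not country:
            continue  # data inspection: skip rows missing a required field
        row = (name, normalize_country(country))
        if row in seen:
            continue  # duplicate once the values are normalized
        seen.add(row)
        cleaned.append({"name": row[0], "country": row[1]})
    return cleaned


raw = [
    {"name": "Ada", "country": "U.S."},
    {"name": "Ada", "country": " America "},
    {"name": "Bo", "country": "Canada"},
    {"name": "", "country": "US"},
]
result = clean(raw)
for row in result:
    print(row)
```

Note that the two "Ada" records only become recognizable as duplicates after normalization, which is why the sketch normalizes before it deduplicates.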

