What is Synthetic Data?

2021-04-08 20:50 TAUS


As data science methods become more technologically advanced, new tools emerge within the field of artificial intelligence (AI). One such tool, now increasingly commonplace, is synthetic data. Synthetic data is artificial data created by a computer program, hence the name “synthetic”. Although synthetic data is not a new concept, the technological resources and computing power available today have made it far more popular.

Why is it Relevant?

Synthetic data can be crucial to organizations that need a specific kind of dataset whose conditions cannot be met by naturally occurring, or organic, data. It is primarily used for training data generation, model training, testing, and privacy protection, provided that it meets certain predefined conditions specific to its use case.

Machine learning models require large volumes of training data, and that data can be difficult to acquire. Companies can have a tough time not only capturing data but also storing and handling it. Labeling the data can also be a lengthy task that often requires dedicated resources. In these cases, producing synthetic data tailored to a specific use case can be a powerful alternative.

How is Synthetic Data Produced?

How synthetic data is generated can have a large impact on how well it serves its application. Depending on the use case, synthetic data should meet certain requirements before it is used in a production setting. Often it is generated by a machine learning model, such as a deep learning model, in order to create more training data. In most cases, statistical properties such as spread and distribution should be taken into account when synthetic data is meant to mimic real-world data.
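To make the point about spread and distribution concrete, here is a minimal sketch in Python using only the standard library. It assumes the real data is roughly Gaussian, which will not hold for every dataset. The "real" values are themselves simulated stand-ins, and the function names `fit_gaussian` and `generate_synthetic` are hypothetical, chosen for illustration only.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to real_values."""
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # seeded so the run is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Stand-in for organic data: hypothetical measurements, not from the article.
_source = random.Random(42)
real = [_source.gauss(170.0, 8.0) for _ in range(2000)]

synthetic = generate_synthetic(real, n=5000, seed=1)

# Comparison check: the synthetic sample should reproduce the real sample's
# center (mean) and spread (standard deviation) closely.
real_mu, real_sigma = fit_gaussian(real)
syn_mu, syn_sigma = fit_gaussian(synthetic)
print(f"real:      mean={real_mu:.2f} stdev={real_sigma:.2f}")
print(f"synthetic: mean={syn_mu:.2f} stdev={syn_sigma:.2f}")
```

Matching the mean and standard deviation is only a first check. If the real data is skewed or multimodal, a single fitted Gaussian would match these two numbers while still missing the shape of the distribution.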
Companies use a variety of techniques to generate synthetic data, such as Monte Carlo simulation, deep learning models, decision trees, reverse-engineering techniques, and iterative proportional fitting.

When real data does not exist, intuition about the underlying distribution of the dataset can greatly help during synthetic data generation. For example, if the data is known to follow a Gaussian, Poisson, exponential, or some other well-known statistical distribution, then random sampling from that distribution can produce synthetic data. Of course, how closely the synthetic data resembles real data ultimately depends on prior knowledge of the data. When real data is available to mimic and the distribution parameters are known, Monte Carlo simulation can be applied.

Using deep learning to generate synthetic data is a good option when a very large dataset is needed and the underlying distributions are not well known. One example of a deep learning model that can generate synthetic data is the generative adversarial network (GAN). GANs are a good option because they learn to generate samples that follow the distribution of the data they are trained on. A network called the generator produces fake data from randomized inputs. This output is then fed into a discriminative network, which tries to separate the fake data from the real data. Both networks are trained using forward and backward propagation: the discriminative network improves at telling the synthetic dataset from the real one, and the generator improves at producing data that is hard to tell apart from it. The diagram below shows this system in action:

Organic vs Synthetic Data

Organic data is the data that many of us encounter on a daily basis. As mentioned above, it is naturally occurring data produced by real-world phenomena. Organic data can be difficult to capture, especially in large volumes. It can also require a lot of cleaning and tweaking to fit a given business need.
Furthermore, a specific dataset may sometimes not be readily available or may not occur naturally. To overcome these drawbacks, synthetic data can be used as an alternative. Both organic and synthetic data have their pros and cons, depending on the use case. The table below outlines some of the comparisons between the two types of data:

Conclusion

Depending on the application, either synthetic or organic data can prove useful. When organic data is not readily available or attainable for a specific application, synthetic data may be a viable option. Synthetic data can be generated through many different statistical, classical machine learning, or deep learning methods, such as the GAN described above. An important point during synthetic data generation is to follow the desired data’s distributions and statistical properties as closely as possible, and to verify this with comparison checks. As the need for tailored training sets, product testing, and privacy protection increases, synthetic data generation can be used to fit business needs.
