Training Data Sourcing Methods

Training data can be sourced from many different places, depending on your machine learning application. Data can be found just about anywhere: in free, publicly available datasets, in privately held data available for purchase, and in crowdsourced data. These types of datasets are known as organic data or naturally occurring datasets.

Synthetic Data

Synthetic datasets are another common option for training data. The benefit of synthetic data is that it can be produced internally under any given set of applicable constraints. It can also be generated in large volumes, has a short turnaround from generation to model training, and is easy to create when the prior conditions are known. The downside is that producing synthetic data can be costly and consumes resources.

Public Datasets

Other alternatives include using platforms like Google or Kaggle to pull datasets. The datasets on offer there are often maintained by government agencies or enterprise companies. Some companies have in-house teams, or use a data labeling or data collection service, to acquire the training data they are looking for.

Crowd-sourced Datasets

Crowd-sourced data is another way to source training data, depending on the application. The TAUS HLP Platform is one example of a crowd-sourced data solution. Through this platform, TAUS offers tailor-made datasets based on the specific requirements of an application.

Marketplaces

How and where you source your training dataset, whether organic or synthetic, depends on what you are using it for. If you wish to train an NLP model, for example, you need a large dataset of audio or text data. One platform that offers training data is the TAUS Data Marketplace, which hosts hundreds of datasets in numerous world languages.
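To illustrate the point above that synthetic data is easy to create when the prior conditions are known, here is a minimal Python sketch. It is not taken from the article: the function name `make_synthetic_samples`, the linear rule, and all parameter values are hypothetical choices made only for illustration. It assumes the "prior condition" is a known linear relationship with Gaussian noise.

```python
import random


def make_synthetic_samples(n, slope=2.0, intercept=1.0, noise_std=0.5, seed=0):
    """Generate n (x, y) pairs from an assumed linear rule plus Gaussian noise.

    The rule y = slope * x + intercept stands in for the "known prior
    conditions"; every default value here is an illustrative assumption.
    """
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    samples = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        y = slope * x + intercept + rng.gauss(0.0, noise_std)
        samples.append((x, y))
    return samples


data = make_synthetic_samples(1000)
print(len(data))
```

Because the generating rule is fixed in advance, the dataset can be made as large as needed and regenerated on demand. A real application would replace the linear rule with whatever constraints apply to the task.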
