What is Speech Recognition and how to do it?


2022-06-22 20:00 TAUS

Speech recognition is a complex blend of linguistics, mathematics, and statistics. Also known as speech-to-text, it identifies spoken words and converts human speech into written text. To do so in the most natural and precise way, artificial intelligence (AI) and machine learning (ML) are used to integrate the grammar, syntax, structure, and composition of audio and voice signals in order to understand and process human speech.

In practice, different projects have different speech recognition requirements, and these determine which features are best suited to the task. Some common features of speech recognition are:

- Language weighting: accuracy improves when words that are used frequently in a specific scenario (e.g. product or brand names, industry jargon) are given more weight than more commonly used expressions.
- Speaker labeling: in multi-speaker conversations, each participant's contribution is tagged separately, making it easier to identify who said what.
- Acoustics training: the system adapts to external sounds that may be present during a conversation (e.g. wind gusts, traffic noise, coughing) so that they do not interfere with word recognition.
- Profanity filtering: as the name suggests, filters remove unwanted words or phrases that are profane.

How does Speech Recognition work?
Speech recognizers are composed of several components: speech input, feature extraction, feature vectors, a decoder, and word output.
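To illustrate the feature extraction and feature vector components listed above, here is a minimal sketch in Python. It is a deliberately simplified stand-in, not how any particular recognizer works: production systems typically use richer features such as MFCCs or log-mel filterbanks. The function names (`frame_signal`, `frame_features`, `extract_features`) and the 16 kHz sample rate, 25 ms frame, and 10 ms hop are assumptions chosen for illustration.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a list of audio samples into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_features(frame):
    """Return a tiny feature vector for one frame: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def extract_features(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Turn raw samples into a sequence of feature vectors, one per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop_len)]

# Usage: one second of a synthetic 440 Hz tone stands in for recorded speech.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
features = extract_features(tone, sample_rate)
print(len(features), "frames, first vector:", features[0])
```

Each frame becomes one short vector of numbers, and it is this sequence of vectors, rather than the raw audio, that a decoder would consume.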
In simpler terms, speech recognizers use algorithms to convert spoken words into text by following these steps:

1. They analyze the audio.
2. They break the audio into parts.
3. They digitize the audio into a computer-readable format.
4. They use an algorithm to match the audio to the most suitable text representation.

The fourth step is done by the decoder, which leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.

In terms of quality metrics, speech recognition is measured by its accuracy rate. Pronunciation, accent, pitch, volume, and background noise all affect the word error rate of the output, so both acoustic and language models must be taken into consideration:

- Acoustic models: represent the relationship between linguistic units of speech and audio signals.
- Language models: match sounds with word sequences to distinguish between words that sound similar.

AI and ML help improve accuracy by applying various algorithms and computation techniques to convert speech into text. The most commonly used ones are natural language processing (NLP), hidden Markov models, n-grams, neural networks, and speaker diarization.

Use Cases: what is Speech Recognition typically used for?
- Automotive: more recent car models include voice-activated tools that let the driver adjust things such as the navigation system without looking away from the road or using their hands, which increases overall road safety.
- Customer service: virtual assistants are becoming increasingly common, for example to help handle telephone calls.
- Day-to-day technology: a clear example is our use of virtual assistants on smartphones, such as Siri, or on other devices, such as Alexa.
- Education: speech recognition can help enhance pronunciation-related language instruction.
- Emotion recognition: by analyzing vocal characteristics, speech recognition software can determine the emotion someone is trying to convey. This is particularly useful when paired with sentiment analysis, as it can help with understanding how a customer feels about a certain product or service.
- Hands-free communication: as with automotive uses, speech recognition can be used in other situations, such as answering a call without having to pick up your smartphone.
- Security: voice-based authentication is one way speech recognition is used for security purposes in day-to-day activities.

Speech recognition offers many benefits, but doing it well requires high-quality training data, where diversity is key. Through the TAUS HLP Platform, we are able to provide this data for your specific speech recognition project needs, with the help of our community of workers. Get in touch with us to receive more information about our speech recognition services.
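As a supplement to the quality-metric discussion earlier in the article, the sketch below shows one common way the word error rate is computed: the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The function name `word_error_rate` and the example sentences are assumptions for illustration; real evaluations usually also normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Usage: one substitution ("sat" -> "sit") and one deletion ("the")
# against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

Because insertions count as errors, the word error rate can exceed 1.0 when the output contains many more words than the reference.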
