
Overview of Speech Recognition Technology

posted May 25, 2018, 9:20 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated May 26, 2018, 1:33 AM ]
Speech recognition is a technology that converts speech to text. It has evolved over several decades and has become one of the most promising applications in the AI field. So what are the basic principles behind this seemingly mysterious technology? For reasons of space, we can only briefly explain them here.

Traditional speech recognition systems are created using statistical machine learning methods. A typical speech recognition system consists of the following modules:
  1. Voice Acquisition Module: In this module, the microphone audio input is acquired and expressed as a digitized voice signal. For example, a 16-bit digital voice signal with a 16 kHz sampling rate means that each second of speech is represented as 16,000 16-bit integers.
  2. Feature Extraction Module: This module is primarily tasked with converting the digital voice signal into a feature vector and supplying it to the acoustic model for processing. The most common acoustic features extracted from speech input signals are Mel-Frequency Cepstral Coefficients, or MFCCs, although there are other methods, including Linear Predictive Coding (LPC) and Perceptual Linear Prediction (PLP).
  3. Acoustic Model: The acoustic model represents the relationship between the audio signal and basic units of the language (feature -> phoneme) to reconstruct what was actually uttered. Traditionally, speech recognition used Hidden Markov Model-Gaussian Mixture Model (HMM-GMM), but in recent years, many systems have used Hidden Markov Model-Deep Neural Network (HMM-DNN) or other improved models.
  4. Lexicon: The lexicon, or pronunciation dictionary, lists every word the speech recognition system can handle together with its pronunciation. It thereby links the acoustic model, which works in phonemes, to the language model, which works in words.
  5. Language Model: The language model constrains the decoder's choice of words to those that are most likely to make sense. It predicts the next word based on the preceding words. Statistical language models (typically n-grams) are compiled from large corpora of text, usually from specific "domains" or subject areas.
  6. Decoder: The decoder receives the input feature vectors and searches for the word string that most probably produced them, given the acoustic model, lexicon, and language model. This search is generally performed with a Viterbi algorithm combined with beam search.
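The arithmetic in step 1 can be checked directly, and Python's standard library can even write audio in exactly that format. The sketch below (the 440 Hz test tone is just an illustrative signal, not speech) produces one second of mono 16-bit audio at a 16 kHz sampling rate:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000   # 16 kHz, as in the example above
SAMPLE_WIDTH = 2      # 16-bit samples = 2 bytes each

# One second of speech at these settings is 16,000 16-bit integers,
# i.e. 32,000 bytes of raw audio data.
samples_per_second = SAMPLE_RATE
bytes_per_second = SAMPLE_RATE * SAMPLE_WIDTH

# Generate one second of a 440 Hz tone at half amplitude.
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]

# Write it as a mono 16-bit WAV stream, the same digitized form a
# voice acquisition module would hand to the feature extractor.
buf = io.BytesIO()
with wave.open(buf, "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(SAMPLE_WIDTH)
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))
```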
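The n-gram language model in step 5 boils down to counting: a bigram model estimates the probability of a word given the previous word from corpus counts. A minimal sketch, using a tiny made-up corpus in place of a large domain text collection:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a large domain corpus (illustrative only).
corpus = [
    "i want to eat".split(),
    "i want to sleep".split(),
    "i want food".split(),
]

# Count each bigram (prev, word) and each context word prev.
bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[prev][word] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) by maximum likelihood (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][word] / context_counts[prev]
```

Here "want" is followed by "to" in two of its three occurrences, so `bigram_prob("want", "to")` is 2/3; this is how the model steers the decoder toward word sequences that make sense. Real systems use higher-order n-grams with smoothing to handle unseen word pairs.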
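The Viterbi search with beam pruning in step 6 can be sketched on a toy HMM. The states, observation symbols, and probabilities below are all made up for illustration; a real decoder searches over phoneme HMM states weighted by the lexicon and language model, but the core recursion is the same:

```python
import math

# Toy HMM: two phoneme-like states emitting symbols "a" and "b".
states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {
    ("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
    ("s2", "s1"): math.log(0.4), ("s2", "s2"): math.log(0.6),
}
log_emit = {
    "s1": {"a": math.log(0.8), "b": math.log(0.2)},
    "s2": {"a": math.log(0.3), "b": math.log(0.7)},
}

def viterbi_beam(observations, beam_width=2):
    """Viterbi search keeping only the beam_width best hypotheses per step."""
    # Each hypothesis: state -> (log score of best path ending there, path)
    beam = {s: (log_init[s] + log_emit[s][observations[0]], [s])
            for s in states}
    for obs in observations[1:]:
        candidates = {}
        for prev, (score, path) in beam.items():
            for s in states:
                new_score = score + log_trans[(prev, s)] + log_emit[s][obs]
                if s not in candidates or new_score > candidates[s][0]:
                    candidates[s] = (new_score, path + [s])
        # Beam pruning: discard all but the beam_width best hypotheses.
        beam = dict(sorted(candidates.items(),
                           key=lambda kv: kv[1][0], reverse=True)[:beam_width])
    best_score, best_path = max(beam.values(), key=lambda v: v[0])
    return best_path, best_score

best_path, best_score = viterbi_beam(["a", "a", "b"])
# best_path is ["s1", "s1", "s2"]: the path that most probably
# produced the observation sequence.
```

Log probabilities are used so that long products of small probabilities become sums, avoiding floating-point underflow; the beam width trades search accuracy for speed, which is what makes decoding over huge vocabularies tractable.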
Current speech recognition systems on mobile devices (Google Now, Apple Siri, Amazon Alexa, Microsoft Cortana) instead use deep learning throughout. In the statistical pipeline above, deep learning was used for the acoustic model only; fully deep learning approaches are also known as end-to-end speech recognition.