A practical approach to Automatic Speech Recognition using Deep Learning

Build from scratch an Automatic Speech Recognition system that can recognise spoken numerical digits from 0 to 9. We discuss how Convolutional Neural Networks, the current state of the art for image recognition systems, might just provide the perfect solution!


Deep learning is nothing less than magic! It is amazing to see, again and again, how such a simple (but deep) network of mathematical operations and real numbers can represent astonishingly complex phenomena. Be it the unstoppable Recurrent Neural Networks or those treats for your GPU, Convolutional Networks, deep learning has proven powerful enough to perform complex tasks like identifying faces, driving cars or understanding language.

In this project, we use deep learning to model something we all do in our daily lives, probably without ever worrying about its inner complexity - recognising speech.


When we speak, we produce sound through the movements and vibrations of our vocal cords, tongue, mouth and even teeth. The nature, intensity and variation of these movements depend on the words we speak, each of which is composed of one or more phonemes. A particular word therefore produces a particular kind of movement, which in turn generates a sound whose characteristics depend on the word spoken.

The human ear is very sensitive to these sound characteristics and can accurately distinguish between different "kinds" of sounds. We are taught how to map these different sounds into words. That's how we humans learn to recognise and understand what is being spoken. To do the same with a digital audio signal, our AI system should be able to:

  • distinguish between different types of sound
  • map combinations of one or more sounds into words/phonemes

Distinguishing different sounds in audio signals

A sound signal is composed of many sine waves having different amplitudes and frequencies. Two different sound signals decompose into different sets of constituent sine waves. A digital audio signal can be decomposed into its constituent frequencies by applying the Fast Fourier Transform to short, successive windows of the signal. The transform gives the energy distribution of the audio signal among different frequencies at each point in time. This can be visualised as a 2-d image like the ones shown below. This distribution is called the frequency spectrum (y-axis: frequency, x-axis: time, color: intensity corresponding to a frequency).
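
To make the decomposition concrete, here is a minimal sketch that computes such a time-frequency image with NumPy's FFT on a synthetic two-tone signal. The frame length, hop size, Hann window and the 8000 Hz sampling rate are illustrative assumptions, not the settings behind the images in this post.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and taper each one with the window
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                           # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate     # one second of audio
# Synthetic test signal: a 440 Hz tone plus a quieter 1000 Hz tone
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(signal)
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)   # frequency of each row in Hz
loudest = freqs[spec.mean(axis=1).argmax()]       # row carrying the most energy
print(spec.shape, loudest)
```

The row with the most energy should sit within one frequency bin (about 31 Hz here) of the 440 Hz tone, which is the kind of structure the spectrum images make visible.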

Different types of sound signal have different frequency spectrums. In the images given below, the first row contains spectrums of three different speakers saying the word "one". Note the similarity between the images in each row: the same word produces a similar spectrum. Also note that the spectrums of the words "one" and "two" are quite different. Thus we can say that the image of a frequency spectrum contains enough information to differentiate between sound signals of different words.

Too much information!

The frequency spectrum represents the sound signal, which contains information about the spoken word as well as the tone of the speaker, loudness, noise and many other signals. The challenge is to extract only the part of this information that gives hints about the words being spoken. In other words, we need to filter this information down to features that can be mapped to a single word/phoneme.

In traditional ASR systems, feature engineering exploits the fact that the ability of the human ear to differentiate frequencies depends on the frequency range. The Mel-frequency cepstrum (MFC) is a representation of the sound signal based on the mel scale of frequency - a scale modeled closely on the human auditory response. It has proven to provide a better representation for speech than the linear frequency spectrum. Although more efficient, this representation may still contain additional information (speaker tone, noise) that has to be filtered out.
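
As a rough illustration of what the mel scale does, the sketch below uses one widely used form of the Hz-to-mel conversion (other variants exist, and I am not claiming this is the exact one used by any particular library). It places ten points evenly on the mel scale and converts them back to Hz.

```python
import math

def hz_to_mel(f_hz):
    # One common variant of the mel formula; treat the constants as an assumption
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten points equally spaced on the mel scale between 0 Hz and 4000 Hz
top = hz_to_mel(4000.0)
centres_hz = [mel_to_hz(top * i / 9) for i in range(10)]
# Distance in Hz between neighbouring points
gaps = [b - a for a, b in zip(centres_hz, centres_hz[1:])]
print([round(c) for c in centres_hz])
```

The points crowd together at low frequencies and spread out at high ones, mirroring the ear's finer resolution at the low end.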

Enter deep learning

In a traditional machine learning system, high quality features can be engineered manually, by carefully analysing the domain and weaving together all known information. However, in deep learning, the features are left for the machine itself to learn. The system is not just learning how to differentiate using features, but it is also learning how to generate only those features which would filter out irrelevant information, and keep those which are most useful to the problem.

The Convolutional Neural Network, a deep learning phenomenon, has become the state of the art for image recognition tasks. The network generalises the process of feature extraction from images by representing a filter as a set of numerical parameters (convolution kernels). These parameters determine the type of filter being applied to an image. They are learnt in the training process, so we say that the filters adapt to the learning problem, or that the features are learnt.

Applying Convolutional Neural Nets to ASR

We can represent the frequency spectrum (on either a linear scale or the Mel scale) of a sound signal as a 2-d matrix or an image. The image can then be fed to a Convolutional Neural Network, which should extract only those features relevant to speech recognition. In higher layers, the features can be classified into words/phonemes using a neural network classifier (Fully Connected Layer).
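
To show what "a filter as a set of numerical parameters" means, here is a small sketch that slides one hand-picked kernel over a toy image. The image, the kernel values and the helper name convolve2d_valid are made up for illustration; in a real network the kernel values would be learnt during training.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it is a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy "spectrum": silence everywhere except one horizontal band of energy
image = np.zeros((6, 6))
image[2, :] = 1.0

# Hand-picked kernel that fires where energy appears just below silence.
# In a CNN these six numbers would be learnt, not chosen by hand.
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 1.0,  1.0,  1.0]])

response = convolve2d_valid(image, kernel)
print(response)
```

The response is strongly positive on one side of the band, strongly negative on the other, and zero elsewhere: six numbers are enough to define a simple band-edge detector.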

It is estimated (see this research paper) that a particular phoneme causes a distinct frequency pattern to appear in the sound signal. These patterns usually appear in small, closely localized clusters of frequencies called frequency bands. These bands can be processed independently by the network's convolution kernels to extract features that cause internal activations in the network corresponding to the relevant phoneme. Moreover, the following properties of CNNs make them very efficient for phoneme recognition:

1. Local Connectedness: This helps in tackling noise in the speech. Noise and actual speech often lie in different regions of the frequency spectrum. In CNNs, since small regions of the image are fed to the network independently (kernel convolution), the speech region of the image can still activate the neurons that lead to correct classification. The noise might also activate some unwanted neurons, but these should get suppressed in higher-level features. Moreover, the network can explicitly learn to recognise and discard noise if we train the model with some samples of speech in noisy environments.

2. Weight Sharing: Different speakers might speak at different frequencies. Hence the speech information might vary slightly along the frequency dimension, although the pattern that determines the phoneme remains the same. Sharing weights for each kernel allows it to look for the same pattern in multiple regions of the image. This also helps in reducing the number of training parameters in the network.

3. Pooling Layer: This layer pools the activations of a small region in the image, usually by considering only the maximum activation value in the region and discarding the others. This helps to filter out unwanted activations generated by small frequency variations in the signal. The Maxpool layer, together with weight sharing, makes our model more robust to the nuances produced in the audio signal by different speakers at different speaking rates.
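
The effect of pooling on small frequency shifts can be seen in a toy example. The activation values below are invented: they stand for one feature map along the frequency axis, with the same pattern firing one bin higher for the second speaker.

```python
def max_pool_1d(values, size):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

# Hypothetical activations of one feature map for two speakers.
# The pattern is identical but shifted up by one frequency bin.
speaker_a = [0, 9, 0, 0, 0, 0, 7, 0, 0]
speaker_b = [0, 0, 9, 0, 0, 0, 0, 7, 0]

pooled_a = max_pool_1d(speaker_a, size=3)
pooled_b = max_pool_1d(speaker_b, size=3)
print(pooled_a, pooled_b)   # [9, 0, 7] [9, 0, 7]
```

Both speakers end up with the same pooled activations. The invariance is only local, though: a shift large enough to push a peak into the neighbouring window would change the pooled output.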


I conducted a small experiment to verify whether convolutional networks could in fact produce features that distinguish different sounds. The experiment should also provide evidence on whether the frequency spectrum of a sound signal represents the speech well enough to serve as the input to an ASR system. As a start, we train a system that can recognise English numerical digits from 0 to 9. We can call this the MNIST of Speech Recognition Systems.

The dataset used in this experiment can be downloaded from here thanks to It contains around 6000 audio samples of spoken numbers from 0 to 9 by many speakers with multiple speed variations. The first step is to convert these audio samples from wav format into trainable data which is done as follows:

  • Make each audio sample a uniform duration (2 seconds in this case) by padding it with background noise, keeping the spoken part at the center of the sample.
  • Generate a 2-d frequency spectrogram image (in Mel Scale) from each audio sample. I used this library for the spectrum generation.
  • Preprocess each image by resizing it to standard dimensions (26 pixels wide and 198 pixels high in my case). Additionally, I subtract the mean of the image from each pixel.
  • It is crazy important to randomize the samples before feeding them for training. In some experiments, I could see the model not converging at all if I did not randomize the training data.
  • Train a Convolutional Neural Network on the spectrogram images to predict the digit spoken.
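
Some of the preparation steps above can be sketched in a few lines of NumPy. This is not the code from the experiment: the helper names, the Gaussian noise used as a stand-in for background noise, the 8000 Hz sampling rate and the random stand-in images are all assumptions, and the spectrogram generation and resizing steps are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def pad_to_duration(clip, target_len, noise_scale=0.001):
    """Centre the clip in a buffer of low-level noise (assumes clip fits)."""
    padded = rng.normal(0.0, noise_scale, size=target_len)
    start = (target_len - len(clip)) // 2
    padded[start:start + len(clip)] = clip
    return padded

def subtract_mean(image):
    """Subtract the image's own mean from every pixel."""
    return image - image.mean()

def shuffle_together(images, labels):
    """Apply the same random permutation to images and labels."""
    order = rng.permutation(len(images))
    return images[order], labels[order]

sample_rate = 8000                        # assumed sampling rate in Hz
clip = np.ones(3000)                      # stand-in for a short spoken digit
padded = pad_to_duration(clip, target_len=2 * sample_rate)   # 2 seconds

# Stand-ins for five spectrogram images, 198 pixels high and 26 wide
images = rng.random((5, 198, 26))
labels = np.arange(5)
images = np.stack([subtract_mean(img) for img in images])
shuffled_images, shuffled_labels = shuffle_together(images, labels)
print(padded.shape, shuffled_labels)
```

Shuffling images and labels with a single permutation keeps every image paired with its own label, which is easy to get wrong when the two arrays are shuffled separately.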

The architecture of the network contains one Convolution Layer with kernel size 8, stride 2 and 128 feature maps. This is followed by a pooling layer of size 6. After that, there are two fully connected layers, each containing 1024 units. Finally, a softmax layer of size 10 (for the 10 digits) is responsible for classification. The complete Python code for this experiment can be downloaded from my repository.
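
For a feel of the size of such a network, here is a back-of-the-envelope calculation of the layer shapes and parameter counts. Several details are not stated above, so treat them as assumptions: a square 8x8 kernel, a single input channel, no padding, and a 6x6 pooling window with stride 6. Different choices would give different numbers.

```python
def out_size(n, kernel, stride):
    """Output length of a convolution or pooling window with no padding."""
    return (n - kernel) // stride + 1

height, width = 198, 26            # input image size used in this experiment

# Convolution layer: 8x8 kernel (assumed square), stride 2, 128 feature maps
conv_h = out_size(height, kernel=8, stride=2)
conv_w = out_size(width, kernel=8, stride=2)

# Pooling layer: 6x6 window, stride 6 (stride is an assumption)
pool_h = out_size(conv_h, kernel=6, stride=6)
pool_w = out_size(conv_w, kernel=6, stride=6)

flat = pool_h * pool_w * 128       # values fed to the first dense layer

# Weights plus biases for each layer
params = {
    "conv": 8 * 8 * 1 * 128 + 128,
    "fc1": flat * 1024 + 1024,
    "fc2": 1024 * 1024 + 1024,
    "softmax": 1024 * 10 + 10,
}
total_params = sum(params.values())
print((conv_h, conv_w), (pool_h, pool_w), flat, total_params)
```

Under these assumptions almost all of the roughly 3.2 million parameters sit in the two fully connected layers, while the convolution layer needs only a few thousand thanks to weight sharing.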


After 40 epochs of training, I managed to get a validation set accuracy of almost 99%. This sounds like bad news, since the model might have overfit to the voices of the speakers present in the dataset. Nevertheless, we can say that the CNN could successfully learn to recognise the voices of the training set speakers using the frequency spectrum as the input. Now, I wanted to see how well this model generalised. For that, I tested it using my own voice! Mind you, I'm Indian and my accent is a little different from that of the speakers in the dataset (mostly Western accents). I recorded some samples of myself speaking out digits. I used this Python script to make this task less tedious. I managed to record around 400 samples and put them directly to the test against this freshly trained model suspected of overfitting.

Surprisingly, I got 75% test accuracy on the 40th epoch snapshot (snapshot meaning the weights of the model at the 40th epoch). After fiddling around with the model and choosing a different epoch snapshot, I managed to increase that accuracy to 80%. I would say that is not bad for a starter model, given my Indian English accent and the exceptionally high validation set accuracy.


The results indicate that this deep learning model was in fact able to learn the complex patterns hidden in the frequency spectrum of sound signals containing speech. It is likely that the accuracy of such ASR systems could be increased by simply using a larger training dataset, fiddling around with model parameters, adding more layers or widening each layer. Internet giant Baidu was recently able to achieve state of the art performance in ASR systems using Convolutional Nets together with LSTMs. In my view, deep learning is a really promising field with lots of research opportunities, though some questions still remain to be answered.

Rohan Raja

A recent graduate in Mathematics and Computing from IIT Kharagpur, Rohan is a technology enthusiast and passionate programmer. He likes to apply Mathematics and Artificial Intelligence to devise creative solutions to common problems.