Sound Classification with a Convolutional Neural Net

Eric Pietrowicz
4 min read · Oct 2, 2019

Convolutional neural nets (CNNs) are among the most effective network models for image classification, thanks to the algorithm’s ability to window the input neurons, which improves learning time and accuracy. With the proper pre-processing of the data, those same advantages can be used to develop a sound classification model.

The following algorithm is heavily influenced by the work done here:

https://www.youtube.com/watch?v=Z7YM-HAz-IY&list=PLhA3b2k8R3t2Ng1WW_7MiXeh1pfQJQi_P

In this example we’ll build a model that attempts to classify a two-second sound bite as one of three categories:

  1. Noise on the microphone
  2. Wind interference
  3. Silence

First, we’ll need to gather a data set. Using the Python library PyAudio, we’ll build a short script that grabs the sound bites used for the training set.

 pip install pyaudio

Gathering each category separately, and starting with wind, the following script will grab a two-second bite, allocate a file name, and store it in an audio directory. While the script is running, for wind, try to blow on the computer’s microphone; for noise, rub a finger over the microphone to create a noisy response; for ambient, just let the script run with no microphone interference.

Change the cat_str variable for each category condition, running the script three times in total to gather the data set.
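A minimal sketch of such a capture script, assuming 16 kHz mono recording; cat_str and n_samples are placeholders to adjust for each run:

import os
import wave
import pyaudio

cat_str = "wind"        # one of "wind", "noise", "ambient"
n_samples = 26          # number of two-second clips to record per category
RATE = 16000            # sample rate in Hz (assumed)
CHUNK = 1024            # frames per buffer
SECONDS = 2

os.makedirs("audio", exist_ok=True)
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

for i in range(n_samples):
    print(f"Recording {cat_str}_{i}.wav ...")
    # Read roughly two seconds of audio from the microphone.
    frames = [stream.read(CHUNK) for _ in range(int(RATE / CHUNK * SECONDS))]
    # Write the clip out as a 16-bit mono .wav file in the audio directory.
    with wave.open(os.path.join("audio", f"{cat_str}_{i}.wav"), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(p.get_sample_size(pyaudio.paInt16))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))

stream.stop_stream()
stream.close()
p.terminate()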

After all of the .wav files have been gathered, a metadata CSV file will need to be generated. Its layout is shown below:

index,slice_file_name,class_name
1,noise_0.wav,noise
...
25,wind_0.wav,wind
...
50,ambient_0.wav,ambient

There is certainly a Pythonic way to generate this file; however, as my data set is small, I just used Excel.
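For reference, a quick way to generate it with pandas, assuming the file names follow the category_index.wav pattern used above:

import os
import pandas as pd

# Build office_sounds.csv from the contents of the audio directory.
files = sorted(os.listdir("audio"))
df = pd.DataFrame({
    "slice_file_name": files,
    "class_name": [f.split("_")[0] for f in files],  # wind_0.wav -> wind
})
df.index.name = "index"
df.to_csv("office_sounds.csv")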

Now the working directory should be configured as below:

Sound classification:
|_ audio_capture.py
|_ office_sounds.csv
|_ audio
   |_ wind_0.wav
   ...
   |_ wind_25.wav
   |_ ambient_0.wav
   ...
   |_ ambient_25.wav
   |_ noise_0.wav
   ...
   |_ noise_25.wav

Let’s explore the data a bit.

Notice the imports in explore.py; the most important library used is python_speech_features.

pip install python_speech_features

https://github.com/jameslyons/python_speech_features

This converts the sound data from the time domain to the frequency domain, with an emphasis on the lower end of the audible band (the mel scale).

Uncomment the different “plot_xx” sections to explore the progression of the data from the raw signal to the MFCCs.
Noise in the time domain.
Wind in the time domain.

The “image” we’ll feed into the CNN classifier is generated from the signal’s Mel-frequency cepstral coefficients (MFCCs).
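As a rough sketch of that step in explore.py (the file name and MFCC parameters here are my own assumptions):

import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc

# Read one clip and compute its MFCCs (13 coefficients over 26 mel filters).
rate, signal = wavfile.read("audio/wind_0.wav")
mel = mfcc(signal, rate, numcep=13, nfilt=26, nfft=512)

# Plot the MFCC "image": one column of cepstral coefficients per analysis frame.
plt.imshow(mel.T, cmap="hot", origin="lower", aspect="auto")
plt.title("Wind - MFCC")
plt.xlabel("Frame")
plt.ylabel("Cepstral coefficient")
plt.show()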

Let’s clean up the data by removing dead space (using a signal envelope) and dumping the result into a clean directory.

Clean the stored data.
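A sketch of what clean.py might look like, assuming a rolling-mean envelope; the threshold value is my own assumption and needs tuning for 16-bit audio:

import os
import pandas as pd
from scipy.io import wavfile

def envelope(signal, rate, threshold):
    # Mask out stretches whose rolling mean magnitude falls below the threshold.
    y = pd.Series(signal).apply(abs)
    y_mean = y.rolling(window=int(rate / 10), min_periods=1, center=True).mean()
    return (y_mean > threshold).values

os.makedirs("clean", exist_ok=True)
df = pd.read_csv("office_sounds.csv")
for fname in df.slice_file_name:
    rate, signal = wavfile.read(os.path.join("audio", fname))
    mask = envelope(signal, rate, threshold=100)   # threshold is an assumed value
    wavfile.write(os.path.join("clean", fname), rate, signal[mask])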

The new directory structure:

Sound classification:
|_ clean.py
|_ explore.py
|_ audio_capture.py
|_ office_sounds.csv
|_ audio
   |_ wind_0.wav
   ...
   |_ wind_25.wav
   |_ ambient_0.wav
   ...
   |_ ambient_25.wav
   |_ noise_0.wav
   ...
   |_ noise_25.wav
|_ clean
   |_ wind_0.wav
   ...
   |_ wind_25.wav
   |_ ambient_0.wav
   ...
   |_ ambient_25.wav
   |_ noise_0.wav
   ...
   |_ noise_25.wav

To build the model, create two new Python files: cfg.py and model_implement.py.

cfg.py will store some basic information about the model. This split will become useful if we want to experiment with additional neural nets.

Storing configuration data.
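A minimal cfg.py sketch along the lines of the referenced tutorial; every value here is an assumption that can be tuned:

import os

class Config:
    # Shared settings for feature extraction and the model.
    def __init__(self, mode="conv", nfilt=26, nfeat=13, nfft=512, rate=16000):
        self.mode = mode
        self.nfilt = nfilt
        self.nfeat = nfeat
        self.nfft = nfft
        self.rate = rate
        self.step = int(rate / 10)          # 1/10 s rolling window through each clip
        self.model_path = os.path.join("models", mode + ".model")
        self.p_path = os.path.join("pickles", mode + ".p")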

model_implement.py calls on cfg to create a CNN with four layers, three of which are hidden.
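The exact architecture isn’t critical for a toy problem like this; a small Keras CNN sketch (layer counts and sizes are my assumptions) could look like:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout

def get_conv_model(input_shape, num_classes=3):
    # Stack a few small convolutions over the MFCC "image", then classify.
    model = Sequential()
    model.add(Conv2D(16, (3, 3), activation="relu", padding="same",
                     input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), activation="relu", padding="same"))
    model.add(Conv2D(64, (3, 3), activation="relu", padding="same"))
    model.add(MaxPool2D((2, 2)))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(64, activation="relu"))
    model.add(Dense(num_classes, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["acc"])
    return model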

build_rand_feat() will perform the grunt work for each file. This is more or less a repeat of what was performed in explore.py, however, now we’re building random features based on a rolling window through our sound bite.
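A sketch of what build_rand_feat() might do, reusing the Config sketch above; the windowing and normalization details are assumptions:

import os
import numpy as np
import pandas as pd
from scipy.io import wavfile
from python_speech_features import mfcc
from keras.utils import to_categorical

def build_rand_feat(df, classes, config, n_samples):
    # Sample random 1/10 s windows from the cleaned clips and return MFCC "images".
    X, y = [], []
    class_dist = df.groupby("class_name").size()
    prob_dist = class_dist / class_dist.sum()
    for _ in range(n_samples):
        rand_class = np.random.choice(class_dist.index, p=prob_dist)
        fname = np.random.choice(df[df.class_name == rand_class].slice_file_name)
        rate, signal = wavfile.read(os.path.join("clean", fname))
        start = np.random.randint(0, signal.shape[0] - config.step)
        sample = signal[start:start + config.step]
        X_sample = mfcc(sample, rate, numcep=config.nfeat,
                        nfilt=config.nfilt, nfft=config.nfft)
        X.append(X_sample)
        y.append(classes.index(rand_class))
    X = np.array(X)
    X = (X - X.min()) / (X.max() - X.min())            # scale features to [0, 1]
    X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)  # add a channel dimension
    y = to_categorical(np.array(y), num_classes=len(classes))
    return X, y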

Training the model and saving it out for future use.
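Putting it together in model_implement.py, reusing the sketches above (the sample count, epochs, and batch size are assumptions):

import os
import pickle
import numpy as np
import pandas as pd

# Config, get_conv_model, and build_rand_feat are the sketches defined above.
df = pd.read_csv("office_sounds.csv")
classes = list(np.unique(df.class_name))
config = Config(mode="conv")

X, y = build_rand_feat(df, classes, config, n_samples=2000)   # sample count assumed
model = get_conv_model(input_shape=X.shape[1:], num_classes=len(classes))
model.fit(X, y, epochs=10, batch_size=32, shuffle=True, validation_split=0.1)

# Persist the trained model and the feature configuration for predict.py.
os.makedirs("models", exist_ok=True)
os.makedirs("pickles", exist_ok=True)
model.save(config.model_path)
with open(config.p_path, "wb") as handle:
    pickle.dump(config, handle, protocol=pickle.HIGHEST_PROTOCOL)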

This will store the trained model in a “models” directory and configuration data in the “pickles” directory. The updated directory is shown here:

Sound classification:
|_ cfg.py
|_ model_implement.py
|_ clean.py
|_ explore.py
|_ audio_capture.py
|_ office_sounds.csv
|_ models
   |_ conv.model
|_ pickles
   |_ conv.p
|_ audio
   |_ wind_0.wav
   ...
   |_ wind_25.wav
   |_ ambient_0.wav
   ...
   |_ ambient_25.wav
   |_ noise_0.wav
   ...
   |_ noise_25.wav
|_ clean
   |_ wind_0.wav
   ...
   |_ wind_25.wav
   |_ ambient_0.wav
   ...
   |_ ambient_25.wav
   |_ noise_0.wav
   ...
   |_ noise_25.wav

Finally, add a predict.py script to check the performance of our model.

It should be noted that this is purely experimental, with quite a small data set; accuracy this high should not be expected in a real-world scenario.

Store predictions of the validation data.
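A rough predict.py sketch; the class ordering, normalization, and example file path are assumptions, and in practice the training-set min and max would be stored alongside the config for consistent scaling:

import os
import pickle
import numpy as np
from keras.models import load_model
from scipy.io import wavfile
from python_speech_features import mfcc
from cfg import Config  # needed so pickle can rebuild the saved config object

model = load_model(os.path.join("models", "conv.model"))
with open(os.path.join("pickles", "conv.p"), "rb") as handle:
    config = pickle.load(handle)

classes = ["ambient", "noise", "wind"]   # assumed to match the training order

def predict_file(path):
    rate, signal = wavfile.read(path)
    window_probs = []
    # Slide a window of config.step samples through the clip and average predictions.
    for start in range(0, signal.shape[0] - config.step, config.step):
        sample = signal[start:start + config.step]
        x = mfcc(sample, rate, numcep=config.nfeat,
                 nfilt=config.nfilt, nfft=config.nfft)
        x = (x - x.min()) / (x.max() - x.min() + 1e-8)   # ideally use training min/max
        x = x.reshape(1, x.shape[0], x.shape[1], 1)
        window_probs.append(model.predict(x)[0])
    return classes[int(np.argmax(np.mean(window_probs, axis=0)))]

print(predict_file("clean/wind_0.wav"))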

This is an example of using a convolutional neural network to classify bites of sound data. A key component of successfully implementing a CNN for sound classification is transforming the data from the time domain to the frequency domain using Mel-frequency cepstral coefficients. In explore.py it’s clear that the MFCCs generated are unique to each class, while plotting the same signals in the time domain retains a level of ambiguity that might otherwise have inhibited the algorithm.
