Urban Sound Classification with Neural Networks in TensorFlow
This post discusses techniques for extracting features from sound in Python using the open source library Librosa, and implements a neural network in TensorFlow to classify urban sounds, including car horns, children playing, dogs barking, and more.
To extract useful features from the sound data, we will use the Librosa library. It provides several methods for extracting different features from sound clips. We are going to use the following methods:
- melspectrogram: Compute a mel-scaled power spectrogram
- mfcc: Compute Mel-frequency cepstral coefficients
- chroma_stft: Compute a chromagram from a waveform or power spectrogram
- spectral_contrast: Compute spectral contrast
- tonnetz: Compute tonal centroid features (tonnetz)
To make feature extraction from sound clips easy, two helper methods are defined. The first, parse_audio_files, takes a parent directory name, the subdirectories within the parent directory, and a file extension (default .wav) as input. It iterates over all files within the subdirectories and calls the second helper, extract_feature, which takes a file path as input, reads the file by calling the librosa.load method, and extracts and returns the features discussed above. These two methods are all we require to convert raw sound clips into informative features (along with a class label for each sound clip) that we can feed directly into our classifier. Remember, the class label of each sound clip is encoded in the file name. For example, if the file name is 108041-9-0-4.wav, then the class label is 9: splitting the name on "-" and taking the second item of the resulting array gives us the class label.
Classification using Multilayer Neural Network
Note: If you want to use scikit-learn or any other library for training the classifier, feel free to do so. The goal of this tutorial is to provide an implementation of a neural network in TensorFlow for classification tasks.
Now that our dataset is ready, let's implement a two-layer neural network in TensorFlow to classify each sound clip into its category. Before starting, let's encode the class labels into one-hot vectors using the method one_hot_encode, and divide the dataset into training and test sets with the following code.
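A sketch of the encoding and splitting step; the dummy `features`/`labels` arrays below stand in for what parse_audio_files returns, and the ~70/30 random split is an assumption:

```python
import numpy as np


def one_hot_encode(labels):
    """Turn integer class labels into one-hot row vectors."""
    n_labels = len(labels)
    n_unique_labels = len(np.unique(labels))
    one_hot = np.zeros((n_labels, n_unique_labels))
    one_hot[np.arange(n_labels), labels] = 1
    return one_hot


# Dummy stand-ins for the arrays returned by parse_audio_files.
features = np.random.rand(100, 193)
labels = one_hot_encode(np.repeat(np.arange(10), 10))

# Random ~70/30 train/test split.
mask = np.random.rand(len(features)) < 0.70
train_x, train_y = features[mask], labels[mask]
test_x, test_y = features[~mask], labels[~mask]
```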
The code provided below defines the configuration parameters required by the neural network model, such as the number of training epochs, the number of neurons in each hidden layer, and the learning rate.
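For instance, a configuration along these lines (the specific values are illustrative assumptions and should be tuned):

```python
import numpy as np

# Illustrative hyper-parameters for the two-layer network.
training_epochs = 5000
n_dim = 193               # length of the stacked feature vector
n_classes = 10            # UrbanSound8K has ten classes
n_hidden_units_one = 280  # neurons in the first hidden layer
n_hidden_units_two = 300  # neurons in the second hidden layer
sd = 1 / np.sqrt(n_dim)   # std-dev used to initialise weights and biases
learning_rate = 0.01
```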
Now define placeholders for the features and class labels, which TensorFlow will fill with data at runtime. Furthermore, define weights and biases for the hidden and output layers of the network. For non-linearity, we use the sigmoid function in the first hidden layer and tanh in the second. The output layer uses softmax, as we are dealing with a multiclass classification problem.
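A sketch of that graph, written against the TF1-style API (via tf.compat.v1 on current TensorFlow installs); layer sizes match the configuration above:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # this post uses the graph-mode (TF1) API

n_dim, n_classes = 193, 10
n_hidden_units_one, n_hidden_units_two = 280, 300
sd = 1 / np.sqrt(n_dim)

# Placeholders TensorFlow fills with feature/label batches at runtime.
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, n_classes])

# First hidden layer: sigmoid non-linearity.
W_1 = tf.Variable(tf.random_normal([n_dim, n_hidden_units_one], mean=0, stddev=sd))
b_1 = tf.Variable(tf.random_normal([n_hidden_units_one], mean=0, stddev=sd))
h_1 = tf.nn.sigmoid(tf.matmul(X, W_1) + b_1)

# Second hidden layer: tanh non-linearity.
W_2 = tf.Variable(tf.random_normal([n_hidden_units_one, n_hidden_units_two],
                                   mean=0, stddev=sd))
b_2 = tf.Variable(tf.random_normal([n_hidden_units_two], mean=0, stddev=sd))
h_2 = tf.nn.tanh(tf.matmul(h_1, W_2) + b_2)

# Output layer: softmax over the ten classes.
W = tf.Variable(tf.random_normal([n_hidden_units_two, n_classes], mean=0, stddev=sd))
b = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd))
y_ = tf.nn.softmax(tf.matmul(h_2, W) + b)
```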
The cross-entropy cost function will be minimised using a gradient descent optimizer; the code provided below initializes the cost function and the optimizer. It also defines and initializes the variables needed to calculate the accuracy of the model's predictions.
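A self-contained sketch of the cost, optimizer, and accuracy ops. The single linear layer here is only a stand-in for the two-layer model's softmax output `y_`, so the snippet builds on its own:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

learning_rate = 0.01
n_dim, n_classes = 193, 10

# Stand-ins for the placeholders and softmax output defined earlier.
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, n_classes])
W = tf.Variable(tf.random_normal([n_dim, n_classes], stddev=1 / np.sqrt(n_dim)))
b = tf.Variable(tf.zeros([n_classes]))
y_ = tf.nn.softmax(tf.matmul(X, W) + b)

# Cross-entropy cost, minimised with plain gradient descent.
cost_function = -tf.reduce_sum(Y * tf.log(y_ + 1e-10))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

# Fraction of clips whose predicted class matches the true class.
correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```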
We have all the required pieces in place. Now let's train the neural network model, visualise whether the cost decreases with each epoch, and make predictions on the test set, using the following code:
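An end-to-end sketch of the training loop. For self-containment it uses random dummy data and a single-layer stand-in for the two-layer model; `cost_history` records the per-epoch cost, which you can pass to matplotlib's `plt.plot` to reproduce the cost curve:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Dummy data standing in for the real train/test split built earlier.
n_dim, n_classes, training_epochs, learning_rate = 193, 10, 200, 0.01
rng = np.random.RandomState(0)
train_x = rng.rand(80, n_dim).astype(np.float32)
train_y = np.eye(n_classes)[rng.randint(0, n_classes, 80)]
test_x = rng.rand(20, n_dim).astype(np.float32)
test_y = np.eye(n_classes)[rng.randint(0, n_classes, 20)]

# Stand-in for the model, cost, optimizer and accuracy ops defined earlier.
X = tf.placeholder(tf.float32, [None, n_dim])
Y = tf.placeholder(tf.float32, [None, n_classes])
W = tf.Variable(tf.random_normal([n_dim, n_classes], stddev=1 / np.sqrt(n_dim)))
b = tf.Variable(tf.zeros([n_classes]))
y_ = tf.nn.softmax(tf.matmul(X, W) + b)
cost_function = -tf.reduce_sum(Y * tf.log(y_ + 1e-10))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
accuracy = tf.reduce_mean(
    tf.cast(tf.equal(tf.argmax(y_, 1), tf.argmax(Y, 1)), tf.float32))

cost_history = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        _, cost = sess.run([optimizer, cost_function],
                           feed_dict={X: train_x, Y: train_y})
        cost_history.append(cost)  # plot this to see cost vs. epoch
    y_pred = sess.run(tf.argmax(y_, 1), feed_dict={X: test_x})
    test_accuracy = sess.run(accuracy, feed_dict={X: test_x, Y: test_y})
```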
In this tutorial, we saw how to extract features from a sound dataset and train a two-layer neural network model in TensorFlow to classify sounds. Without much tuning, the above NN architecture achieved around 82% accuracy on fold 1 of the UrbanSound8K dataset. I would encourage you to check the documentation of Librosa and experiment with different neural network configurations, e.g. by changing the number of neurons, changing the number of hidden layers, and introducing dropout.
The Python notebook is available at the following link.
Bio: Aaqib Saeed is a graduate student of Computer Science (specializing in Data Science and Smart Services) at University of Twente (The Netherlands).
Original. Reposted with permission.