Music Genre Recognition using Convolutional Neural Networks (CNN) — Part 1

Learn how to develop a deep learning model for recognising genre of music.

Kunal Vaidya
Towards Data Science



Music is important in everyone's life; it brings out so many emotions in us, like nostalgia and excitement. Music can change someone's mood or make them productive; the possibilities are endless.

I had just started out in the fascinating field of deep learning and was looking for a project. I came across the problem of Music Genre Recognition and loved the idea of using neural networks to predict the genre of music. As an avid listener of music, it was the perfect project for me, so I thought: let's do it!

This article has two parts. In the first part, we will see the preprocessing of the data and the training of our deep learning model. In the second part, we will build an app and deploy it on an Amazon EC2 instance.

What is Music Genre Recognition?

Music Genre Recognition is an important field of research in Music Information Retrieval (MIR). A music genre is a conventional category that identifies some pieces of music as belonging to a shared tradition or set of conventions, i.e. it depicts the style of music.

Music Genre Recognition involves making use of features such as spectrograms and MFCCs (mel-frequency cepstral coefficients) to predict the genre of a piece of music.

Setup

Import all the required packages

Data Pre processing and Feature Extraction

I am going to make use of the GTZAN dataset, which is really famous in Music Information Retrieval (MIR). The dataset comprises 10 genres: Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Pop, Reggae, and Rock.

Each genre comprises 100 audio files (.wav) of 30 seconds each, which means we have 1,000 examples in total; if we keep 20% of them for validation, that leaves just 800 training examples.

We can usually recognise the genre of a song by listening to just 4–5 seconds of it, so 30 seconds is too much information for the model to take in at once. That is why I decided to split each audio file into 10 clips of 3 seconds each.

Our training examples have now grown tenfold: each genre has 1,000 examples, for 10,000 in total. This larger dataset will help the deep learning model, which generally benefits from more data.

As we are going to use a convolutional neural network, we need images as input. For this, we will use the mel spectrograms of the audio files and save them as image files (.jpg or .png).

I will not dive into mel spectrograms here, as that would make this article a little too long. For background, see "Understanding the Mel Spectrogram" by Leland Roberts (Analytics Vidhya on Medium).

Step 1

Before we split the audio files, we make empty directories for each genre.
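A sketch of this directory setup (the folder names here are my assumptions, chosen to match the paths used in the later steps):

```python
import os

# Jazz is omitted because of the spectrogram issue noted in the article.
genres = "blues classical country disco hiphop metal pop reggae rock".split()

def make_dirs(base="."):
    # One folder per genre for the 3-second audio clips, and one per
    # genre each for the train and test spectrogram images.
    for root in ("audio3sec", "spectrograms3sec/train", "spectrograms3sec/test"):
        for genre in genres:
            os.makedirs(os.path.join(base, root, genre), exist_ok=True)

make_dirs()
```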

The above code block is pretty self-explanatory: it just makes empty directories to store the audio files (after we split them) and their corresponding spectrograms.

Note: I have not used the Jazz genre because there was an error in generating its mel spectrograms.

Step 2

Now we will make use of AudioSegment from pydub package to split our audio files.

There are three for loops here. The first loops over the genres, the second goes through every audio file of that genre, and the third splits each audio file into 10 parts and saves them into the empty directories we created for audio files previously.

Step 3

Now we will use librosa to generate mel spectrograms for the audio files.

The above code block will generate a mel spectrogram for each audio file in every genre and save them according to their genre in empty directories we created in step 1.

Note: This step might take a long time to complete, because there are almost 10,000 mel spectrograms to generate, one per audio file, so be patient :).

Step 4

Now that we have our complete data, we need to split it into a training set and a validation set. All of it currently sits in the spectrograms3sec/train directory, so we need to move part of it to our test directory.
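The move itself can be sketched with the standard library (the directory names match the earlier steps; the seed is my addition, for reproducibility):

```python
import os
import random
import shutil

def move_validation_files(train_dir, test_dir, genres, n_per_genre=100, seed=42):
    random.seed(seed)  # make the split reproducible
    for genre in genres:
        filenames = os.listdir(os.path.join(train_dir, genre))
        random.shuffle(filenames)
        # Move the first n_per_genre shuffled files to the test directory.
        for fname in filenames[:n_per_genre]:
            shutil.move(os.path.join(train_dir, genre, fname),
                        os.path.join(test_dir, genre, fname))
```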

For each genre, we randomly shuffle the filenames, select the first 100, and move them to the test/validation directory.

The train directory contains one subdirectory per genre, with the corresponding spectrogram images inside (the test directory has the same structure). We have prepared our data this way so we can use Keras's handy ImageDataGenerator, which is very helpful for training models on large datasets.

Step 5

We will create data generators for both training and testing set
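A sketch of the generator setup; the target size and batch size here are illustrative choices, not the article's exact values:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_generators(train_dir, test_dir, target_size=(288, 432), batch_size=128):
    # Rescale pixel values to [0, 1]; labels are inferred from subfolders.
    train_datagen = ImageDataGenerator(rescale=1.0 / 255)
    test_datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = train_datagen.flow_from_directory(
        train_dir, target_size=target_size,
        batch_size=batch_size, class_mode="categorical")
    val_gen = test_datagen.flow_from_directory(
        test_dir, target_size=target_size,
        batch_size=batch_size, class_mode="categorical")
    return train_gen, val_gen

# Example usage:
# train_generator, validation_generator = make_generators(
#     "spectrograms3sec/train", "spectrograms3sec/test")
```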

The flow_from_directory() method automatically infers the labels from our directory structure and encodes them accordingly.

ImageDataGenerator makes training on large datasets simple: since the model trains on only one batch per step, the data generator loads just one batch into memory at a time, so we never exhaust memory.

Convolutional Network Model

We will build our CNN model using keras
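A sketch consistent with that description (the filter counts, kernel sizes, and input shape are my guesses, not the article's exact architecture; nine output classes because Jazz was dropped):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(288, 432, 3), num_classes=9):
    model = keras.Sequential([keras.Input(shape=input_shape)])
    # Five convolutional blocks with increasing filter counts.
    for filters in (16, 32, 64, 128, 256):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.3))  # regularisation against over-fitting
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```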

The model comprises five convolutional layers, followed by a dropout layer to avoid over-fitting, and finally a dense layer with softmax activation to output class probabilities.

Training

Now, we will finally train our model on the dataset we prepared
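The article's custom get_f1 metric is not shown; a common Keras implementation of a batch-wise F1 score looks like this (written with plain TensorFlow ops; the compile/fit calls below are illustrative):

```python
import tensorflow as tf

def get_f1(y_true, y_pred):
    # Batch-wise F1 score: harmonic mean of precision and recall.
    eps = 1e-7  # avoid division by zero
    tp = tf.reduce_sum(tf.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    predicted_pos = tf.reduce_sum(tf.round(tf.clip_by_value(y_pred, 0, 1)))
    actual_pos = tf.reduce_sum(tf.round(tf.clip_by_value(y_true, 0, 1)))
    precision = tp / (predicted_pos + eps)
    recall = tp / (actual_pos + eps)
    return 2 * precision * recall / (precision + recall + eps)

# Illustrative training call (fit_generator is deprecated in newer Keras;
# model.fit accepts generators directly):
# model.compile(optimizer="adam", loss="categorical_crossentropy",
#               metrics=["accuracy", get_f1])
# history = model.fit(train_generator, epochs=70,
#                     validation_data=validation_generator)
```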

get_f1() computes the F1 score, and because we are training from data generators, we use the fit_generator() method.

After training for 70 epochs, I got a training accuracy of 99.57% and a validation accuracy of 89.03%. This really good validation accuracy is largely due to splitting the audio files into 10 equal parts.

I also trained the model on the original GTZAN dataset, which comprised 1,000 audio files of 30 s each, and got an accuracy of around 50%, so splitting the audio files into parts increased our accuracy drastically.

Results

Now, I will pick some songs and predict their genres using our model.

1. Smells Like Teen Spirit — Nirvana

Predicted Genre: Rock; True Genre: Rock

2. She Will Be Loved — Maroon 5

Predicted Genre: Pop; True Genre: Pop

3. Viva la Vida — Coldplay (my favourite song :P)

Predicted Genre: Rock; True Genre: Baroque Pop

So, we can see the model produces very good predictions: it got the first two genres exactly right, and Rock is a reasonable guess for Viva la Vida, since Baroque Pop is not one of the genres the model was trained on.

Conclusion

We saw how to develop a convolutional neural network for music genre recognition. This was part 1; in the next part, we will build an app for music genre recognition and deploy it on an Amazon EC2 instance.

I am new to writing on Medium, so please let me know if you liked it. Thanks for reading!

Cheers! 😀

Find GitHub Repository Below
