Building a Music Recommender with Deep Learning

I’ve spent a lot of money on music over the years, and one website I’ve bought mp3s from is JunoDownload. It’s a digital download store, used predominantly by DJs, with a huge back catalogue of tracks for sale.

It’s a great music resource, and they provide a generous two-minute sample mp3 for every song they have for sale. The only problem is that it’s really hard to find music on the site that isn’t a new release or currently top of the sales charts.

The website is heavily geared towards promoting new content, and that makes sense as it’s going to be the new music that generates the most revenue – but what about the other 99% of tracks for sale on the website?

Music recommendations

There are a number of track recommendations on the website already. On the main site there are sales chart lists, new release lists and a plethora of recommendation lists curated by staff and DJs.

On top of that, each individual track / single page has recommenders running along the right-hand side of the window – ‘people who bought this also bought’, ‘other releases by the artist’ and ‘other releases on this record label’ – which are also useful.

But with such a large database of music, I feel the site is missing a content-based ‘you might also like’ recommender: one that suggests similar-sounding songs based on what a user is currently listening to, has added to their cart, and so on.

Wouldn’t it be cool if you could discover music that was released a few years ago that sounds similar to a new song that you like? Surely Juno are missing out on potential sales by not offering this type of feature on their website.

After being inspired by a blog post I’d read recently from somebody who had classified music genres for songs in their own music library, I decided to see if I could adapt that methodology to build a music recommender.


Achieving this goal required a number of data acquisition, processing and model training steps. Here’s a rundown of all the steps involved:

Download mp3 files

The first thing I needed to do was download a large number of the sample mp3 files to work with.

After scraping track information for more than 400,000 music files available for sale on the website, I arbitrarily picked 9 different music genres and then selected at random 1,000 tracks from each of these genres.

The 9 genres were:

  • Breakbeat
  • Dancehall
  • Downtempo
  • Drum and Bass
  • Funky House
  • Hip Hop / R&B
  • Minimal House
  • Rock/Indie
  • Trance

Over the next few days (my script was deliberately slow so I didn’t bombard the website with download requests) I downloaded all 9,000 mp3 files.
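
A throttled download loop like the one described above can be sketched in a few lines of Python. This is only an illustration of the pacing idea; the actual scraper and Juno’s URLs aren’t shown in this post, and the `filename_from_url` helper is hypothetical:

```python
import time
import urllib.request

def filename_from_url(url):
    """Derive a local file name from a sample's URL (illustrative helper)."""
    return url.rsplit("/", 1)[-1]

def download_samples(urls, delay_seconds=10):
    """Fetch each sample mp3 in turn, sleeping between requests so the
    site isn't bombarded; this pacing is why the downloads took days."""
    for url in urls:
        urllib.request.urlretrieve(url, filename_from_url(url))
        time.sleep(delay_seconds)  # deliberate throttle
```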

Convert audio to spectrograms

An audio file contains far too much data to work with directly, so a large part of this whole process is condensing the information in the music: extracting the main features while eliminating all the ‘noise’. It’s essentially an exercise in dimensionality reduction, and the first stage was to convert the audio into an image format.

Using Discrete Fourier Transforms to convert the audio signals into the frequency domain, I processed each of my 9,000 mp3 audio files and saved spectrogram images for each song. A spectrogram is a visual representation of the spectrum of frequencies of sound as it varies with time. The intensity of colour on the image represents the amplitude of the sound at that frequency.

I chose to create monochrome spectrograms, like this one below:

This is around 20 seconds of audio generated from a hip hop track. On the x-axis is time, and on the y-axis are the frequencies of the sound.
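
As mentioned in the comments below, the spectrograms were generated with SoX. Invoking it from Python might look like the sketch below; the exact flags used in the project are an assumption (`-m` renders monochrome, `-r` suppresses the axes and legend, `-y` sets the pixel height):

```python
import subprocess

def spectrogram_command(mp3_path, png_path, height=256):
    """Build the SoX command line: -m = monochrome, -r = raw output
    (no axes or legend), -y = image height in pixels."""
    return ["sox", mp3_path, "-n", "spectrogram",
            "-m", "-r", "-y", str(height), "-o", png_path]

def make_spectrogram(mp3_path, png_path, height=256):
    """Render a monochrome spectrogram of an mp3 (requires SoX installed)."""
    subprocess.run(spectrogram_command(mp3_path, png_path, height), check=True)
```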

Split images into 256×256 squares

In order to train a model on this data, I needed all of my images to be of equal dimensions, so I split all of my spectrograms into 256×256 squares. This represents just over 5 seconds of audio on each image.
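
Slicing each long spectrogram strip into equal squares is simple array indexing. A sketch with NumPy, treating the spectrogram as a 2-D array 256 pixels tall (a trailing partial tile narrower than 256 pixels is simply dropped, which is an assumption about how leftovers were handled):

```python
import numpy as np

def tile_offsets(width, size=256):
    """x-offsets of the full-width tiles; a trailing partial tile is dropped."""
    return list(range(0, width - size + 1, size))

def slice_spectrogram(spec, size=256):
    """Cut a spectrogram array (frequency bins x time frames) into
    size x size squares along the time axis."""
    return [spec[:size, x:x + size] for x in tile_offsets(spec.shape[1], size)]
```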

By now, I had more than 185,000 images in total, each with a label for the music genre it represented.

I split my data into a training set of 120,000, a validation set of 45,000 and a holdout set of 20,000 images.
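
The three-way split can be done with a single shuffle; a minimal sketch (the seed is an illustrative choice, and the default sizes are the ones quoted above):

```python
import random

def split_dataset(items, n_train=120_000, n_val=45_000, seed=42):
    """Shuffle once, then carve off training, validation and holdout sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    holdout = items[n_train + n_val:]
    return train, val, holdout
```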

Train a Convolutional Neural Network on the images

I trained a CNN on my image data. I needed to teach it to recognise what the different types of music ‘looked’ like in the spectrogram images, so I used the genre labels and trained it to identify the music genre from the images.

Below is a visualisation of the CNN pipeline:

Starting with the spectrogram image on the upper left hand side, the image is converted into a matrix of numbers representing the colours in each of the pixels. From there, the data passes through various layers in the pipeline and through each layer the shape of the matrix is transformed until it eventually reaches a softmax classifier in the bottom right hand corner. This is a vector of 9 numbers and contains the probabilities for each of the 9 music genres the CNN assigns to the image.

One step in from that is the fully connected layer. This is a vector of 128 numbers and these are essentially 128 music features that have been extracted from the image after passing through the various layers. Another way of thinking about this layer is that all the key information in the original image has been compacted into 128 numbers that ‘explain’ the image.
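
The post doesn’t spell out the exact architecture, but a CNN matching the description above could be sketched in Keras roughly like this. Only the 256×256 input, the 128-unit fully connected layer and the 9-way softmax come from the text; the convolutional layer sizes are assumptions:

```python
from tensorflow.keras import layers, models

def build_genre_cnn(input_shape=(256, 256, 1), n_genres=9):
    """Conv/pool stack -> 128-unit dense 'music feature' layer -> 9-way softmax."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu", name="music_features"),
        layers.Dense(n_genres, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```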

So how well did the CNN do?

It was capable of classifying the music genre of a song with 75% accuracy, which I felt was pretty good. Music genres are somewhat subjective and music often transcends more than one genre, so I felt happy that it was doing a good job. Here’s a breakdown of the classification accuracies:

  • Trance: 91%
  • Drum & Bass: 90%
  • Dancehall: 79%
  • Breakbeat: 78%
  • Funky House: 71%
  • Downtempo: 71%
  • Rock/Indie: 70%
  • Minimal House: 63%
  • Hip Hop / R&B: 61%

It did a really good job classifying trance music while at the other end of the scale was hip hop / R&B with 61%, which is still almost 6 times better than randomly assigning a genre to the image. I suspect that there’s some crossover between hip hop, breakbeat and dancehall and that might have resulted in a lower classification accuracy. Trance music is quite different to the other 8 genres in the list, so perhaps that’s also why it did so much better at identifying that type of music.

Nevertheless, these numbers weren’t too important to me; what was important was that it was capable of differentiating between different types of music.

What about the music recommender?

Now that I had a trained neural network capable of ‘seeing’ music in spectrograms, I no longer needed the softmax classifier, so I removed that layer and extracted the 128 music feature vectors for all 185,000 images in my data set.
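
In Keras terms, ‘removing the softmax’ just means building a second model that stops at the 128-unit dense layer of the trained network. A sketch, assuming that layer was given the name `music_features` (the name is hypothetical):

```python
from tensorflow.keras import models

def feature_extractor(trained_model, layer_name="music_features"):
    """Re-use the trained genre CNN up to its 128-unit dense layer,
    discarding the softmax classifier on top."""
    return models.Model(
        inputs=trained_model.input,
        outputs=trained_model.get_layer(layer_name).output,
    )
```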

With each image representing just over 5 seconds of audio, and the sample mp3 files being around 2 minutes long in total, I had approximately 23 images – and therefore 23 feature vectors – for each music file. I calculated the mean vector for each song, giving me 9,000 feature vectors: one for each of the 9,000 songs I had originally downloaded.

So just as a quick recap – I started with 9,000 audio files, converted them into 9,000 spectrograms, split them up into 185,000 smaller spectrograms and trained a convolutional neural network on these images. I then extracted 185,000 feature vectors for all these images and calculated the average vector for each of the 9,000 original audio files.

At this point I had now extracted 128 features from the music files that identified different characteristics in the music. So in order to create recommendations of songs that shared similar characteristics, all I needed to find were the vectors that were most similar to one another. To do that, I calculated the cosine similarity between all 9,000 vectors.
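
The averaging and similarity steps can be written down in a few lines of NumPy; a sketch (variable names are illustrative):

```python
import numpy as np

def song_vector(tile_vectors):
    """Collapse the ~23 per-tile feature vectors of one song
    into a single 128-dimensional mean vector."""
    return np.mean(tile_vectors, axis=0)

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between all song vectors (one per row)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T
```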

Example recommendations

The last step was to select a song at random, and then have the model return the best recommendations of similar music (the songs with the greatest cosine similarities) out of the entire data set of 9,000 mp3s I had downloaded.
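
Given the similarity matrix, returning the best recommendations is an argsort that excludes the query song itself; a minimal sketch:

```python
import numpy as np

def top_recommendations(similarities, song_index, k=3):
    """Indices of the k songs most cosine-similar to the chosen song,
    excluding the song itself."""
    order = np.argsort(similarities[song_index])[::-1]
    return [i for i in order if i != song_index][:k]
```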

Below are a few examples of the recommender in action. The first song that plays is the one that I picked at random, and the 3 that follow are the 3 best recommendations of songs it returned. I think it’s pretty awesome, but see what you think for yourself:

I think the coolest thing about it is that the whole process is automatic: nobody ever has to listen to the audio and annotate it. Imagine how long it would take if you actually had to listen to each one of these 9,000 songs and take notes evaluating them against different features and characteristics. Now imagine doing it for a million songs or more.

The folks at Pandora are the only people I know of who have attempted this, as part of their ongoing “Music Genome Project”. According to this article, they have 25 music analysts listening to and grading 10,000 songs a month on up to 450 different musical features (for some context, this entire project took me three weeks from start to finish).

That is why I’m really pleased with the results from my recommender; it does a pretty good job of finding songs that sound alike without requiring anybody to have to listen to the audio beforehand – and in a fraction of the time it would take a human to do the same.



  • Matt

    August 5, 2017 at 11:26 am

    Thanks for sharing this link with me – looks cool! Also, I think I have you to thank for this blog post ending up at the top of Hacker News 🙂


  • Mike Armstrong

    August 2, 2017 at 11:01 am

    Really good work!
    Please could you upload a graphical representation of the network you trained?

  • Hayden

    August 2, 2017 at 11:40 am

    Would be great to see some of the examples that ‘fail’. And are the relationships reversed too? eg (1st video) is Infectious the top recommendation for Spectral Radar too?

  • Robert Agthe

    August 2, 2017 at 2:34 pm

    soundcloud better be quick 😉 great stuff, this is was i really want on a music service. giving me stuff i like.

  • Kyle Flanigan

    August 2, 2017 at 4:02 pm

    A much better solution would be to create a cnn autoencoder and take the middle layer for the feature vectors. Then you don’t need any training data at all and you can scale it to N feature points which will greatly improve your accuracy. If you then changed from 256×256 cnn slices to using small slices fed to a biderectional LSTM layer, you would likely have state of the art.

    (I was doing some very similar work using CNN’s for audio processing)

  • Matt

    August 5, 2017 at 11:04 am

    Thanks for the suggestions Kyle, I will look into it 🙂

  • Renier Botha

    August 3, 2017 at 7:13 am

    Hey, great work and good read.
    Would you be willing to write something similar, albeit a bit shorter, on exactly how you went about writing the scraper for data acquisition? I feel many people (including me) fall short in their real world deep learning projects at the data acquisition stage, so this would be really helpful.


  • Matt

    August 5, 2017 at 11:06 am

    Hi Renier. In fact, I was intending on writing a few blog posts on web scraping because there are a number of different scraping methods for collecting data. Scrapy is probably the most difficult to use, but at the same time very powerful. I’m also looking into the possibility of uploading my data set of spectrograms so that people can use them for their own projects.

  • Gregg Tavares

    August 3, 2017 at 9:04 am

    I feel like the human classifications are a net negative. certain generes are all over the place in term of how they get categorized. In particular if a certain artist is labeled say hiphop, after that point all of their music is labeled hiphop even if some of their songs are dance music and others are ballads and yet others are rap.

    It seems like the algorithm would do best with no influence from bad human categories

    in other words , let the system make its own bins of similar music then see what it came up with.

  • Matt

    August 5, 2017 at 11:15 am

    Sure, the genre labels were created by a human and that’s a subjective thing (in reality, songs also tend to fit into more than one genre category) which will have introduced a bit of bias into the process, but I used them simply to ‘teach’ the neural network to recognise different types of music from the images. By training it on 1,000 songs from each genre, I hoped that the most important aspects of the images that defined each genre would be picked out by the model.

  • Sarnath k

    August 4, 2017 at 3:02 am

    Smart Matt! Impressive and Creative. Using Image processing and recognition for Music!!! Thats what a Data Scientist is all about. Fantastic!

  • Sarnath k

    August 4, 2017 at 3:04 am

    Did time stop at July 10, 2017 at 2:49 pm!?

  • Matt

    August 5, 2017 at 11:03 am

    Haha, that’s now fixed 🙂 It was a bug with the Theme and was showing the date I published my post.

  • Colin

    August 4, 2017 at 4:37 pm

    Love the result; I do however feel that the approach of looking for similar music might not be the way to go for a recommendation engine. If you listen to your Drum & Bass results for instance; buying the recommendations would give me multiple identical sounding songs..

    Perhaps if you combine these results with buyers / listening behavorial data the results would be even better?

    What I could imagine is that your system would be brilliant in detecting plagiarism. Don’t know whether such systems already exist automated. But I could imagine that music labels would be interested in such a service.

  • Matt

    August 5, 2017 at 11:23 am

    Yep, they do sound very similar to one another, and maybe for a general radio station it would benefit from using additional data to improve the recommendations like you say. But in the context of a DJ looking for music to buy, it’s actually quite nice to have a few songs that are similar to each other, because you can mix up your playlists and swap out different tracks that fit just as well, if that makes sense.

  • Renier Botha

    August 4, 2017 at 8:48 pm

    Thanks for updating with the scrapy code 😉

  • hannraoi

    August 8, 2017 at 2:16 am

    I have a question, will one song be contained in different genres in your dataset ? Thank u for ur great post and work BTW 🙂

  • Matt

    August 8, 2017 at 8:23 am

    Hi, no in this data set each song had only one genre.

  • Roman

    August 15, 2017 at 1:23 am

    I would expect the encoding NN to be more intuitive, as encoding NN completely removes the bias essential to songs grading.
    And the question to still remain – how to proof such recommendations.

  • Matt

    August 23, 2017 at 11:15 pm

    Possibly, yes – but the motivation for my project was to adapt somebody else’s methodology with a CNN and see if it would work.

  • DennisShaw

    August 23, 2017 at 3:05 am

    May I ask a question?I am New to this area.I wonder how I convert the audio signals into the frequency domain.Would you give me some help to this?Thanks

  • Matt

    August 23, 2017 at 11:18 pm

    Generating spectrograms is quite easy. I used (and my code on GitHub will work with) SoX. Try installing it and let me know if you need some further assistance.

  • DennisShaw

    August 29, 2017 at 8:18 am

    I am back again.I am wondering if there is any public music dataset for this project to train,I want to find out some datasets which I can trust to train it.

  • Matt

    August 29, 2017 at 4:40 pm

    I’m not sure. You could always train a model on your own labelled music library.


  • pat stroh

    October 13, 2017 at 7:21 pm

    Neat-o. A classification and recommendation system that is “75% accurate” might be good, because it introduces variety and “testing.”

  • DennisShaw

    October 16, 2017 at 3:14 am

    Would you mind telling me the sequence of your code on github ? I am confused about how to get start…Which file run at the beginning and which one run after the first file.Thanks^_^

  • cin

    November 23, 2017 at 3:50 pm

    Learning Ai will be awesome to music.


    Start by generating a 100% random audio file, someone give a score between 1 and 100 (will be the fitness). After some amount of time, it start to generate song based at your taste.


    Have an online station that use AI to discover what to play fitness is amount of users listening the radio when the previous song started divided by the amount of SAME users that are still listening to the song when this song end.
