Predicting Musical Genre from MIDI Files

Sean Hobin
5 min read · Nov 2, 2019

When I was a kid, my musician Dad would play guitar and sing songs with a full backing band of drums, keyboards, bass, horns… the works. Was he hiring an entire band? No — he was utilizing the then-new technology of MIDI (Musical Instrument Digital Interface) files to play song backing tracks.

The technology was cutting edge at the time (the early to mid-90s), and it was very novel to have instrumentals of pop songs soon after they came out. I remember being amazed by MIDI, and curious about how it worked. That curiosity planted the seeds for this project some 20+ years later.

And so, I decided to work with MIDI files for this project! When deciding upon my target, I briefly considered predicting a classical music composer based on MIDI files of their compositions, but I decided that predicting musical genre would be more interesting.

[Image: How the MIDI file torrent is formatted]

Data collection was slightly insane, because I wanted massive amounts of MIDI files, but they had to be labeled by genre, and had to have some modicum of quality (MIDI files can be hit or miss). I found a torrent of 130,000 MIDI files that formed the basis of my dataset. I also downloaded some MIDI file collections from Kaggle, Google Datasets, and various MIDI websites, but most of these either had few songs, were of low quality, or both. I ended up cleaning up the data I had by writing several batch (.bat) files to pull the songs out of thousands of folders. Since I ended up with a lot of quality classical, jazz, ragtime, and folk MIDI files, I created a folder for each genre and filled them with the appropriate MIDI files.
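
For illustration, here's a rough Python equivalent of what those batch files did. The real cleanup used .bat scripts; the paths and the genre-matching rule below are just stand-ins:

```python
# Rough Python equivalent of the .bat cleanup: flatten thousands of nested
# folders into one folder per genre. Paths and the genre-matching rule are
# illustrative, not my actual scripts.
import shutil
from pathlib import Path

SOURCE = Path("midi_torrent")   # the nested torrent folders
DEST = Path("data")             # one flat folder per genre
GENRES = ["classical", "jazz", "ragtime", "folk"]

for genre in GENRES:
    (DEST / genre).mkdir(parents=True, exist_ok=True)

for midi_path in SOURCE.rglob("*.mid"):
    for genre in GENRES:
        # assume the genre name appears somewhere in the file's path
        if genre in str(midi_path).lower():
            shutil.copy(midi_path, DEST / genre / midi_path.name)
            break
```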

I decided my DataFrame would consist of a row for each song and columns with features of the songs. I quickly ran into the problem, though, of how I was going to parse these songs and pull out the useful information, like key signature, tempo, song length, number of instruments, etc. Well, it turns out there’s a miraculous Python library for MIDI called music21.

music21 (maintained out of MIT) can dig into a MIDI file and extract useful musical information by processing it with a series of time-proven algorithms. I won't delve into the details here, but further reading is available on the music21 website if you would like to learn more. Basically, most of my features came either from music21 or from another Python library for MIDI called mido.
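
Here's a minimal sketch of what that per-song feature extraction can look like. My exact feature set and helper code differed, and labeled_paths (a list of (path, genre) pairs) is a hypothetical stand-in:

```python
# A rough sketch of per-song feature extraction with music21 and mido.
# The feature list and column names here are illustrative, not my exact set.
import os
import pandas as pd
from music21 import converter
import mido

def extract_features(path, genre):
    score = converter.parse(path)      # music21 parses the MIDI file
    key = score.analyze('key')         # Krumhansl-Schmuckler key estimate
    midi = mido.MidiFile(path)
    return {
        'filesize': os.path.getsize(path),
        'key_signature': f"{key.tonic.name} {key.mode}",
        'song_length_sec': midi.length,     # playback length in seconds (mido)
        'num_tracks': len(midi.tracks),
        'num_notes': len(score.flat.notes),
        'genre': genre,
    }

# labeled_paths is a hypothetical list of (filepath, genre) tuples
rows = [extract_features(path, genre) for path, genre in labeled_paths]
df = pd.DataFrame(rows)
```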

Each MIDI file was processed one at a time, which ended up taking around 5–7 seconds per file. This doesn't seem like much, but 6 seconds times 130,000 MIDI files ends up being over 9 full days. That made me cut back on the size of my dataset really quickly!

[Image: What my DataFrame looked like with processed features]

I split the data into train, validation, and test sets, then used XGBoost to create my model. I also did a touch of hyperparameter tuning. The following feature importances were the result:
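
For reference, here's a minimal sketch of that modeling step, assuming the df built above. The split sizes, encoding choice, and hyperparameters shown are stand-ins rather than my exact settings:

```python
# A minimal sketch of the split + XGBoost modeling step. Split sizes,
# encoding, and hyperparameters are illustrative, not my exact settings.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = pd.get_dummies(df.drop(columns=['genre']))   # one-hot any string features
y = df['genre'].astype('category').cat.codes     # 0=classical, 1=folk, 2=jazz, 3=ragtime

# 70% train, 15% validation, 15% test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

print(accuracy_score(y_test, model.predict(X_test)))
```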

The ‘filesize’ feature offered pretty good target stratification:

SHAP graph demonstrating feature importances in an intuitive way:

Class 0 = Classical, Class 1 = Folk, Class 2 = Jazz, Class 3 = Ragtime
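
If you're curious, this kind of SHAP summary can be generated straight from the trained XGBoost model. A minimal sketch, assuming the model and X_val from the modeling step above (API details vary a bit between shap versions):

```python
# A minimal sketch of producing a SHAP summary plot for the XGBoost model.
# Assumes `model` and `X_val` from the modeling sketch above.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)   # one set of SHAP values per class
shap.summary_plot(shap_values, X_val)        # mean |SHAP| per feature, split by class
```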

Max pitch and min pitch are broken features because I didn't have time to engineer them properly. They would need to be converted from musical notation (C4 being the C note in the fourth octave, A2 being an A note in the second octave) into an ordinal int like 56; a quick sketch of that conversion is included below. As for key signature, my hypothesis for why it's so poor for stratification is that either it wasn't being processed properly, or key signature is simply not indicative of musical genre (which makes sense when you think about it). Since I had 4 classes of equal size, my baseline accuracy was 25%. Well, it turns out my model smashed the baseline accuracy, coming in at 90.625%:

The money shot.
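
About that pitch fix: music21 can already map note names to MIDI note numbers, which are exactly the kind of ordinal ints those features need. A quick sketch:

```python
# Converting note names into ordinal MIDI note numbers with music21,
# which is the conversion the max/min pitch features would need.
from music21 import pitch

print(pitch.Pitch('C4').midi)   # 60 (middle C)
print(pitch.Pitch('A2').midi)   # 45
```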

I was pretty surprised by this result, because it was the first score I got, and it was on a dataset with broken engineered features. I think, given more time, I'd be able to increase the accuracy to over 96%. Or I could expand the genres to include things like rock, blues, hip-hop, etc. The issue, though, is that newer styles of music have more sonic complexity, and they're more difficult to transcribe (for example, the living, breathing warble of a dubstep bass would sound atrocious in MIDI). And so, another interesting direction to go in would be to use audio waveforms instead of MIDI files. This would of course involve a much bigger dataset in terms of file size, and I believe it would require much more computation.

All in all, I really enjoyed this project. It was fun to work with such a large dataset — 130,000+ distinct files nested haphazardly in different folders — and engineer features using novel libraries I learned from scratch. I believe this project was more representative of the types of data and challenges I’ll encounter in the real world (not pre-cleaned, tabular data in tidy format), so it will be vital to my growth as a data scientist. Thanks for reading!
