Data-Driven Approach to Finding Average Music


I collected song files in WAV format and used a program called "waon" (homepage here) to convert them to MIDI files. The conversion interpolates some notes, so it is not perfect. Next, I converted the MIDI files to CSV to make them readable and easy to parse, using an online converter (here). Then I wrote Python/PHP scripts that accumulate the data from the CSV files. They will be posted on GitHub here.
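As a rough illustration of the parsing step, here is a minimal Python sketch. It assumes the converter produces midicsv-style rows of the form `track, time, Note_on_c, channel, note, velocity`, with `Note_off_c` (or a `Note_on_c` with velocity 0) ending a note. That layout, the function name `parse_notes`, and the sample data are assumptions for illustration, not the actual scripts.

```python
import csv
import io

def parse_notes(csv_text):
    """Return a list of (pitch, start, end, velocity) tuples.

    Assumes midicsv-style rows: track, time, type, channel, note, velocity.
    """
    active = {}  # (track, channel, pitch) -> (start_time, velocity)
    notes = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 6:
            continue  # rows such as Start_track carry no note data
        kind = row[2].strip()
        if kind not in ("Note_on_c", "Note_off_c"):
            continue
        track, time = int(row[0]), int(row[1])
        channel, pitch, velocity = int(row[3]), int(row[4]), int(row[5])
        key = (track, channel, pitch)
        if kind == "Note_on_c" and velocity > 0:
            # The note starts sounding.
            active[key] = (time, velocity)
        elif key in active:
            # Note_off_c, or Note_on_c with velocity 0, ends the note.
            start, start_velocity = active.pop(key)
            notes.append((pitch, start, time, start_velocity))
    return notes

# Made-up sample data, not taken from a real song.
sample = """\
0, 0, Header, 1, 1, 480
1, 0, Start_track
1, 0, Note_on_c, 0, 60, 100
1, 480, Note_off_c, 0, 60, 0
1, 480, Note_on_c, 0, 64, 80
1, 960, Note_on_c, 0, 64, 0
1, 960, End_track
0, 0, End_of_file
"""

print(parse_notes(sample))
```

Times here are in MIDI ticks, so note lengths come out in ticks as well. That is fine for the normalized weights used later, since the units cancel.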


How often are certain notes present in a song? Let's take a sample song, "Love Yourself" by Justin Bieber, and produce a graph of the presence of each note. Let's use two different weighting processes. First, we'll use the sum of the lengths of a note's occurrences in the song as its weight. Second, we'll use the lengths of the notes multiplied by their respective velocities. In MIDI, the velocity attribute indicates how loud a note is. By multiplying each note's length by its velocity, we can measure the presence of a note based on how loud it is. Both weightings are normalized by dividing by the total length of the song.
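The two weightings can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual script. Notes are assumed to be `(pitch, start, end, velocity)` tuples, and "total length of the song" is assumed to mean the span from the first note start to the last note end. The function name `note_presence` is hypothetical.

```python
from collections import defaultdict

def note_presence(notes, use_velocity=False):
    """Return {pitch: weight}, normalized by the total song length.

    notes: iterable of (pitch, start, end, velocity) tuples.
    use_velocity: if True, weight each note's length by its velocity.
    """
    notes = list(notes)
    if not notes:
        return {}
    # Assumption: song length is the span covered by the notes.
    first_start = min(start for _, start, _, _ in notes)
    last_end = max(end for _, _, end, _ in notes)
    song_length = last_end - first_start
    if song_length <= 0:
        return {}
    weights = defaultdict(float)
    for pitch, start, end, velocity in notes:
        length = end - start
        weights[pitch] += length * velocity if use_velocity else length
    return {pitch: w / song_length for pitch, w in weights.items()}

# Made-up example: middle C played twice, E played once.
example = [(60, 0, 480, 100), (64, 480, 960, 80), (60, 960, 1440, 50)]
by_length = note_presence(example)
by_loudness = note_presence(example, use_velocity=True)
print(by_length)
print(by_loudness)
```

With the length-only weighting, a note's weight is the fraction of the song during which it sounds. With the velocity weighting, the same fraction is scaled by loudness, so quiet notes contribute less.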

To the left above, only the sum of the lengths of each note in the song is considered. To the right above, the sum of the velocity-weighted lengths of each note is graphed. The x-axis uses numbers to represent different notes, where 60 is middle C; e.g., 61 is C# and 62 is D. Let's see whether using an instrumental music file (a file with only instruments and no singing) makes the notes show up more clearly on this graph.
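For reference, the numbers on the x-axis can be turned into note names with a few lines of Python. This sketch assumes the common convention in which middle C (note 60) is named C4. The helper name `note_name` is made up for illustration.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(number):
    """Convert a MIDI note number (0-127) to a name such as 'C4'.

    Assumes the convention where note 60 (middle C) is C4.
    """
    if not 0 <= number <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    octave = number // 12 - 1
    return f"{NOTE_NAMES[number % 12]}{octave}"

for n in (60, 61, 62):
    print(n, note_name(n))
```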

Below is "Stayin' Alive" by the Bee Gees. It has the same range as "Love Yourself", but the majority of its mass is shifted up by an octave or so. Let's find and compare the average note (pitch) of several songs.

Notes that are 12 half-steps away from each other are one octave apart. These appear frequently in groups in major pop songs. "Love Yourself" is one exception in that it presents more single notes than layered chords. Still, one can see below that these octaves are present.
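One way to spot such octave groups programmatically is to bucket the notes by their remainder modulo 12, since notes that share a remainder are a whole number of octaves apart. The sketch below is only an illustration; the function name `octave_groups` and the sample pitches are made up.

```python
from collections import defaultdict

def octave_groups(pitches):
    """Group MIDI note numbers that are one or more octaves apart.

    Returns only the groups containing at least two distinct notes.
    """
    groups = defaultdict(list)
    for pitch in sorted(set(pitches)):
        groups[pitch % 12].append(pitch)
    return [group for group in groups.values() if len(group) > 1]

# Made-up example: three Cs in different octaves, plus an E and a G.
print(octave_groups([48, 60, 64, 72, 67]))
```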


Below to the left are the note presence graphs of 20 classical piano songs, and to the right those of 20 of the most popular R&B songs. Note how the R&B music is more spread out, between notes 20 and 100 (G#0 - E7), whereas the classical music has a narrower distribution, usually between 40 and 100 (E2 - E7). The shapes of the graphs are also of interest. The classical piano on the left has a triangle-shaped distribution, indicating a longer presence of the peak note.

This information can be further condensed into a graph that displays only the crucial statistics of each song. The graphs below show the average pitch, standard deviation, and variance of each song. The first graph shows only classical piano songs. The second shows R&B songs. Note how the averages of the classical piano songs are mostly below note 70, whereas the R&B songs are centered around 70. This is consistent with the previous graphs, which show that the "center of mass" of most of the R&B songs is farther right, or higher pitched, than that of the classical piano songs.
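These summary statistics can be computed from a song's note presence weights. The sketch below assumes the average is weighted by each note's presence, so that it matches the "center of mass" reading of the graphs. The exact weighting used for the figures is not stated, so treat this as one plausible choice. The function name `pitch_stats` and the sample weights are hypothetical.

```python
import math

def pitch_stats(presence):
    """Return (mean, standard deviation, variance) of pitch.

    presence: {pitch: weight}, e.g. normalized note lengths.
    Each pitch counts in proportion to its weight.
    """
    total = sum(presence.values())
    if total <= 0:
        raise ValueError("presence must contain positive total weight")
    mean = sum(pitch * w for pitch, w in presence.items()) / total
    variance = sum(w * (pitch - mean) ** 2 for pitch, w in presence.items()) / total
    return mean, math.sqrt(variance), variance

# Made-up example: equal presence of C4 (60) and C5 (72).
mean, std, var = pitch_stats({60: 0.5, 72: 0.5})
print(mean, std, var)
```

The variance is the square of the standard deviation, so the two convey the same spread; the standard deviation is in half-steps, which makes it easier to read against the x-axis.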

Future work: divide each song into 1/2, 1/3, 1/4. Bee Gees. Average for a Bee Gees album. Average note per top hit per year. Average pitch per genre. Presence of notes averaged over multiple songs per genre. Can I get a curve? Volume groups.