32leaves.net

MFCC based feature vectors of my music collection

After reading a set of lecture notes about neural networks I wanted to get my hands on them. So I wrote some rather simple implementations of Hopfield networks [1], multi-layer perceptrons [2] and, last but not least, (serial) Self-Organizing Maps [3] (SOMs).
As the Hopfield nets and multi-layer perceptrons worked fine with some artificial test data (hand-crafted patterns for the Hopfield nets and a simple \(\sin\) function for the MLP), I needed some test data for the SOMs. The idea was to create a map of my iTunes library. So I used categorical data stored in the library as input vectors for the SOM. It turned out that the data was not suited for this purpose, as it was too evenly distributed and not significant enough to identify meaningful clusters on the resulting U-Matrix [4].

After searching for a more useful way of extracting feature vectors from my music collection I eventually decided to go with so-called “Mel frequency cepstral coefficients” [5].
According to Wikipedia, the mel scale is “a perceptual scale of pitches judged by listeners to be equal in distance from one another” (Mel scale [6]). Actually converting “normal” frequency into Mel frequency is pretty easy:

    \[f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)\]
I did not find a suitable implementation for calculating those MFCCs, so I wrote my own:
/*
 * stdin: WAV, mono, 16-bit, 44100 Hz, with a standard 48-byte WAV header
 * stdout: the MFCC feature vectors as raw doubles
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <assert.h>
#include <fftw3.h>

/**
 * The range of the filter bank in Hz
 */

#define F_MIN   20.0
#define F_MAX   11025.0

/**
 * Basically N is the window size in samples. The WAV we're reading is supposed
 * to be sampled at 44.1kHz and we want a window size of 0.1 sec, so we set
 * this to 4410 samples.
 */

#define N       4410
#define N2_p1   2206    /* N/2 + 1: number of bins in the r2c FFT output */
#define N2_m1   2204    /* N/2 - 1 */

#define SAMPLE_RATE         44100.0
#define SAMPLE_RATE_DIV_N   10.0

/**
 * The number of filters in the Mel filter bank and the number of cepstral
 * coefficients to compute
 */

#define M_NUMCH 24
#define M_COEFF 20

double melScaleInverse(double f) {
    return 700 * (pow(10, f / 2595.0) - 1);
}

double melScale(double f) {
    return 2595 * log10(1 + (f / 700.0));
}

void buildFilterBank(double* filterBank) {
    int i = 0;

    /* M_NUMCH triangular filters need M_NUMCH + 2 boundary points, spaced
     * evenly on the mel scale between F_MIN and F_MAX. */
    double melMin = melScale(F_MIN);
    double melDelta = (melScale(F_MAX) - melMin) / (M_NUMCH + 1);
    for(i = 0; i < M_NUMCH + 2; i++) {
        filterBank[i] = melScaleInverse(melMin + melDelta * i);
    }
}

/* Triangular filter m (1 <= m <= M_NUMCH) evaluated at frequency f_k */
double filter(double* f /*filterBank*/, int m /* filterIndex */, double f_k) {

    if(f_k < f[m - 1] || f_k >= f[m + 1]) {
        return 0;
    } else if(f[m - 1] <= f_k && f_k < f[m]) {
        return (f_k - f[m - 1]) / (f[m] - f[m - 1]);
    } else {
        return (f_k - f[m + 1]) / (f[m] - f[m + 1]);
    }
}

void computeFFT() {
    int i, k, nread;
    int16_t *iobuf;     /* WAV samples are signed 16-bit */
    double *in;
    double filterBank[M_NUMCH + 2];
    fftw_complex *out;
    fftw_plan plan;

    iobuf = calloc(N, sizeof(int16_t));
    in = (double*) fftw_malloc(sizeof(double) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N2_p1);
    plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE);

    buildFilterBank(filterBank);

    /* pass thru the 48 byte wav-header */
    fread(iobuf, sizeof(int16_t), 24, stdin);

    double c_ls[N2_p1];
    double c_mf[M_NUMCH];
    double c_mc[M_COEFF];
    while( (nread = fread(iobuf, sizeof(int16_t), N, stdin)) == N ) {
        for(i = 0; i < N; i++) {
            in[i] = iobuf[i];
        }

        fftw_execute(plan);

        /* magnitude spectrum, normalized by the window size */
        for(k = 0; k < N2_p1; k++) {
            c_ls[k] = sqrt((out[k][0] * out[k][0]) + (out[k][1] * out[k][1])) / N;
        }

        /* apply the Mel filter bank; bin k corresponds to k * 10 Hz */
        for(i = 0; i < M_NUMCH; i++) {
            c_mf[i] = 0;

            for(k = 0; k < N2_p1; k++) {
                c_mf[i] += c_ls[k] * filter(filterBank, i + 1, k * SAMPLE_RATE_DIV_N);
            }
        }

        /* DCT-II of the log filter bank energies */
        for(k = 0; k < M_COEFF; k++) {
            c_mc[k] = 0;

            for(i = 0; i < M_NUMCH; i++) {
                if(c_mf[i] != 0) c_mc[k] += log(c_mf[i]) * cos(k * (M_PI / M_NUMCH) * (i + 0.5));
            }
        }

        fwrite(c_mc, sizeof(double), M_COEFF, stdout);
    }

    fftw_destroy_plan(plan);
    fftw_free(out);
    fftw_free(in);
    free(iobuf);
}


int main (int argc, char *argv[]) {
    computeFFT();
    return 0;
}
The implementation above reads raw WAV from stdin and calculates the MFCCs from it. It's invoked using

lame --decode -m m -t $filename | ./mfcc > $filename.features

and works as follows:

  1. Read the data using a rectangular 100ms window (usually a Hamming window [7] or similar is used). This results in 4410 samples (16 bit wide each) as the WAV is supposed to be sampled at 44.1kHz.
  2. Run an FFT [8] on the acquired samples – here we use the FFTW [9] library – and compute the magnitude \(\vert Y_k\vert\) of each transformation result \(Y_k\).

  3. Scale the result of step 2 using the Mel scale filter bank. The one I used looks like this (this presentation [10] helped me a lot with implementing the filter bank):

    [Figure: the triangular Mel filter bank]

    We'll call the Mel-transformed values \(c^{mf}_i\).

  4. Finally compute the \(n\)th coefficient using a discrete cosine transformation:

        \[c_n = \sum\limits_{i=1}^{M} \log(c^{mf}_i)\cdot\cos\left(\frac{n(2i-1)\pi}{2M}\right)\]

    where \(M\) is the number of Mel filter channels (24 here) and \(n\) runs from 0 to 19, the dimensionality of the resulting feature vector.

  5. Write the feature vector \(\vec{c} = (c_0, \dots, c_{19})^T\) to stdout as 8 byte wide little endian IEEE 754 double precision floats.

Running the program above on my music collection (roughly 28 gigabytes) took about 24 hours and resulted in 487747560 feature vectors. That leaves me with two questions:
  1. How good is the quality of those feature vectors?
  2. How do I process such a vast amount of data?
In order to get an idea of the quality (in the context of feeding them into a SOM) I used some simple statistical methods. As the SOM estimates the probability density function of the samples, I had a closer look at the distribution of those vectors in \(\mathbb{R}^{20}\).
The idea is to compute the mean vector

    \[\vec{m} = \frac{1}{N}\sum\limits_{i=1}^N\vec{c}_i\]

and then the Euclidean distance (as this is the measure of distance a SOM implementation would use) of each vector \(\vec{c}_i\) to \(\vec{m}\):

    \[d_i = \|\vec{m} - \vec{c}_i\|\]

Finally we take the mean \(d_m\) of the \(d_i\) and calculate the deviations

    \[\sigma_i = d_m - d_i\]

Those \(\sigma_i\) plotted in a histogram look like this:

    [Figure: histogram of the \(\sigma_i\)]

The image above leaves us with two possible conclusions:
  1. The music I listen to is pretty much all the same
  2. The feature vectors are not really of good quality

Personally I'd go with the latter, as the music in my library is not that much the same. One way to improve the quality of those feature vectors could be to use the first and second derivatives of the cepstral coefficients as proposed here [11], or to use a Hamming window for reading the data.

Nonetheless, once I finally get around to finishing a working batch SOM [12] implementation (and this paper [13]), I'll do another post describing the results. Or I might simply abandon this project, as the semester starts in a few days and I might run out of time.


2 Responses to “MFCC based feature vectors of my music collection”

  1. Arxi says:

    Now I would like to thank you very much, sir. I've been trying to understand the mel filter bank for quite some time now and been going desperate in my search for some understandable implementation in C++. While searching the google images for at least a graphical representation of a mel filter bank, I randomly clicked on an image on this website, which led me here, and, to my surprise, I found the implementation here. Took me some time to understand your code thoroughly, but now I finally know what exactly the filterbank does and I really want to thank you again for this piece of information. Saved my sanity.

    Greetings from Slovakia,

    Arxi

  2. Josher says:

    There is a third possible conclusion: You are not implementing the MFCC feature vector for its intended use. Generally, feature vectors are concatenated with each other to generate a unique string of frequency-domain data for an utterance (or other audio feature). From my understanding, it is used on something as short as a syllable and as long as a word, with one MFCC for each 10~100ms time-domain segment. Using MFCC as a feature vector for an entire song seems like a strange application, and it seems that the LLN would imply (or predict) the histogram result you've received, considering that songs are generally all over the mel scale, even if they have different musical styles.
    However, that is just my speculation as a novice. Feel free to correct me if I’m mistaken.

