function for the MLP), I needed some test data for the SOMs. The idea was to cartograph my iTunes library. So I used categorical data stored in the library as input vectors for the SOM. It turned out that the data was not suited for this purpose, as it was too evenly distributed and not significant enough to identify meaningful clusters on the resulting U-Matrix [4].

According to Wikipedia, the mel scale is

*a perceptual scale of pitches judged by listeners to be equal in distance from one another* (Mel scale [6]).

Converting a “normal” frequency f (in Hz) into mel frequency is actually pretty easy: m(f) = 2595 · log10(1 + f / 700).

```c
/*
 * stdin:  WAV, mono 16-bit 44100 Hz with a standard 48-byte WAV header
 * stdout: ditto
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fftw3.h>

/**
 * The range of the filter bank in Hz
 */
#define F_MIN 20.0
#define F_MAX 11025.0

/**
 * N is the window size in samples. The WAV we're reading is supposed to be
 * sampled at 44.1 kHz and we want a window size of 0.1 s, so we set this
 * to 4410.
 */
#define N 4410
#define N2_p1 2206
#define N2_m1 2204
#define SAMPLE_RATE 44100.0
#define SAMPLE_RATE_DIV_N 10.0

/**
 * The number of filters in the mel filter bank, and the number of
 * cepstral coefficients we keep
 */
#define M_NUMCH 24
#define M_COEFF 20

double melScaleInverse(double f)
{
    return 700 * (pow(10, f / 2595.0) - 1);
}

double melScale(double f)
{
    return 2595 * log10(1 + (f / 700.0));
}

/*
 * M_NUMCH + 2 edge frequencies, equally spaced on the mel scale;
 * filter m spans [f[m - 1], f[m + 1]] for m = 1 .. M_NUMCH.
 */
void buildFilterBank(double* filterBank)
{
    int i;
    double melMin = melScale(F_MIN);
    double melDelta = (melScale(F_MAX) - melMin) / (M_NUMCH + 1);
    for(i = 0; i < M_NUMCH + 2; i++) {
        filterBank[i] = melScaleInverse(melMin + melDelta * i);
    }
}

/* triangular filter m of the bank, evaluated at frequency f_k */
double filter(double* f /* filterBank */, int m /* filterIndex */, double f_k)
{
    if(f_k < f[m - 1] || f_k >= f[m + 1]) {
        return 0;
    } else if(f[m - 1] <= f_k && f_k < f[m]) {
        return (f_k - f[m - 1]) / (f[m] - f[m - 1]);
    } else { /* f[m] <= f_k < f[m + 1] */
        return (f_k - f[m + 1]) / (f[m] - f[m + 1]);
    }
}

void computeFFT()
{
    int i, k;
    int16_t *iobuf;           /* 16-bit WAV samples are signed */
    double *in;
    double filterBank[M_NUMCH + 2];
    fftw_complex *out;
    fftw_plan plan;

    iobuf = calloc(N, sizeof(int16_t));
    in = (double*) fftw_malloc(sizeof(double) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N2_p1);
    plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE);

    buildFilterBank(filterBank);

    /* skip the 48-byte WAV header */
    fread(iobuf, sizeof(int16_t), 24, stdin);

    double c_ls[N2_m1];   /* magnitude spectrum */
    double c_mf[M_NUMCH]; /* mel-filtered spectrum */
    double c_mc[M_COEFF]; /* mel cepstral coefficients */

    while(fread(iobuf, sizeof(int16_t), N, stdin) == N) {
        for(i = 0; i < N; i++) {
            in[i] = iobuf[i];
        }
        fftw_execute(plan);

        for(k = 0; k < N2_m1; k++) {
            c_ls[k] = sqrt(((out[k][0] / N) * (out[k][0] / N)) + (out[k][1] * out[k][1]));
        }

        for(i = 0; i < M_NUMCH; i++) {
            c_mf[i] = 0;
            for(k = 0; k < N2_m1; k++) {
                c_mf[i] += c_ls[k] * filter(filterBank, i + 1, k * SAMPLE_RATE_DIV_N);
            }
        }

        for(k = 0; k < M_COEFF; k++) {
            c_mc[k] = 0;
            for(i = 0; i < M_NUMCH; i++) {
                if(c_mf[i] != 0)
                    c_mc[k] += log(c_mf[i]) * cos(k * (M_PI / M_NUMCH) * (i - 0.5));
            }
        }

        fwrite(c_mc, sizeof(double), M_COEFF, stdout);
    }

    fftw_destroy_plan(plan);
    fftw_free(out);
    fftw_free(in);
    free(iobuf);
}

int main(int argc, char *argv[])
{
    computeFFT();
    return 0;
}
```

The whole thing is driven from the command line like this:

```shell
lame --decode -m m -t $filename | ./mfcc > $filename.features
```

The program works as follows:

- Read the data using a rectangular 100 ms window (usually a Hamming window [7] or similar is used instead). This results in N = 4410 samples (16 bits wide each), as the WAV is supposed to be sampled at 44.1 kHz.
- Run an FFT [8] on the acquired samples – here we use the FFTW [9] library. For each transformation result out[k] we then take the magnitude: c_ls[k] = sqrt((Re(out[k]) / N)² + Im(out[k])²), as in the first inner loop of computeFFT.

- Scale the result of step 2 using the mel-scale filter bank (this presentation [10] helped me a lot when implementing the filter bank). The bank consists of M_NUMCH = 24 triangular filters spaced evenly on the mel scale between F_MIN = 20 Hz and F_MAX = 11025 Hz; see buildFilterBank and filter above. We'll call c_mf[i] the mel-transformed value: c_mf[i] = Σ_k c_ls[k] · H_i(f_k), where H_i is the i-th triangular filter and f_k = k · 10 Hz is the frequency of FFT bin k.

- Finally compute the k-th coefficient using a discrete cosine transform: c_mc[k] = Σ_{i=1..24} log(c_mf[i]) · cos(k · (π/24) · (i − 0.5)), where M_COEFF = 20 is the dimensionality of the resulting feature vector (so k runs from 0 to 19).

- Write the feature vector c_mc to stdout as 8-byte-wide little-endian IEEE 754 double-precision floats.

Two questions remained:

- How good is the quality of those feature vectors?
- How do I process such a vast amount of data?


The idea is to compute the mean vector m of all feature vectors, then compute the Euclidean distance (as this is the measure of distance a SOM implementation would use) of each feature vector c_i to m: d_i = ‖c_i − m‖. Finally we take the mean of the d_i and look at how the individual d_i are distributed around it. The d_i turned out to cluster closely around their mean, which leaves two possible conclusions:

- The music I listen to is pretty much all the same
- The feature vectors are not really of good quality

Personally I’d go with the latter, as the music in my library is not that homogeneous. One way to improve the quality of the feature vectors could be to use the first and second derivatives of the cepstral coefficients, as proposed here [11], or to use a Hamming window when reading the data.

## References

- [1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences of the USA, vol. 79 no. 8 pp. 2554-2558, April 1982.: http://www.pnas.org/content/79/8/2554.abstract
- [2] Rosenblatt, Frank (1958): The perceptron : a probabilistic model for information storage and organization in the brain. Psychological Reviews 65 (1958) 386-408: http://www.citeulike.org/user/kira/article/2051038
- [3] Wikipedia: Self Organizing Map: http://en.wikipedia.org/wiki/Self_organizing_map
- [4] Jaakko Hollmén, 1996: Using the Self-Organizing Map: http://www.cis.hut.fi/jhollmen/dippa/node24.html
- [5] Beth Logan: Mel frequency cepstral coefficients for musical modeling: http://ciir.cs.umass.edu/music2000/papers/logan_paper.pdf
- [6] Wikipedia: Mel scale: http://en.wikipedia.org/wiki/Mel_scale
- [7] Wikipedia: Hamming window: http://en.wikipedia.org/wiki/Window_function#Hamming_window
- [8] FFTW documentation: The 1d Real-data DFT: http://www.fftw.org/fftw3_doc/The-1d-Real_002ddata-DFT.html#The-1d-Real_002ddata-DFT
- [9] Homepage: Fastest Fourier Transform in the West: http://www.fftw.org/
- [10] Stefan Scherer: MFCC Implementierung: http://www.informatik.uni-ulm.de/ni/Lehre/SS09/PraktikumNI/MFCC.pdf
- [11] Lehrbuch zur Mustererkennung: http://www5.informatik.uni-erlangen.de/fileadmin/Persons/NiemannHeinrich/klassifikation-von-mustern/m00-www.pdf
- [12] Teuvo Kohonen, et al.: Self organization of a massive document collection (2000): http://130.203.133.121:8080/viewdoc/summary;jsessionid=73500594ABCEE7205702FB7CB3DBF8AB?doi=10.1.1.32.2117
- [13] R. D. Lawrence, et al.: A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems: http://portal.acm.org/citation.cfm?id=593480

## Comments

Now I would like to thank you very much, sir. I’ve been trying to understand the mel filter bank for quite some time now and was growing desperate in my search for an understandable implementation in C++. While searching Google Images for at least a graphical representation of a mel filter bank, I randomly clicked an image that led me here, and, to my surprise, I found the implementation here. It took me some time to understand your code thoroughly, but now I finally know what exactly the filter bank does, and I really want to thank you again for this piece of information. Saved my sanity.

Greetings from Slovakia,

Arxi

There is a third possible conclusion: You are not implementing the MFCC feature vector for its intended use. Generally, feature vectors are concatenated with each other to generate a unique string of frequency-domain data for an utterance (or other audio feature). From my understanding, it is used on something as short as a syllable and as long as a word, with one MFCC for each 10–100 ms time-domain segment. Using MFCC as a feature vector for an entire song seems like a strange application, and it seems that the law of large numbers would imply (or predict) the histogram result you’ve received, considering songs are generally all over the mel scale, even if they have different musical styles.

However, that is just my speculation as a novice. Feel free to correct me if I’m mistaken.