function for the MLP), I needed some test data for the SOMs. The idea was to map my iTunes library. So I used categorical data stored in the library as input vectors for the SOM. It turned out that the data was not suited for this purpose, as it was too evenly distributed and not significant enough to identify meaningful clusters on the resulting U-Matrix [4].

According to Wikipedia the mel scale is

*a perceptual scale of pitches judged by listeners to be equal in distance from one another* (Mel scale [6]). Actually converting a "normal" frequency f (in Hz) into a Mel frequency is pretty easy: m = 2595 · log10(1 + f / 700), with the inverse f = 700 · (10^(m / 2595) − 1).

```c
/*
 * stdin:  WAV, mono, 16-bit, 44100 Hz, with a standard 48-byte WAV header
 * stdout: the MFCC feature stream
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <fftw3.h>

/**
 * The range of the filter bank in Hz
 */
#define F_MIN 20.0
#define F_MAX 11025.0

/**
 * N is the window size in samples. The WAV we're reading is supposed to
 * have a sample rate of 44.1 kHz and we want a window size of 0.1 s, so
 * we set this to 4410.
 */
#define N 4410
#define N2_p1 2206
#define N2_m1 2204
#define SAMPLE_RATE 44100.0
#define SAMPLE_RATE_DIV_N 10.0

/**
 * The number of filters in the Mel filter bank and the number of
 * cepstral coefficients emitted per window
 */
#define M_NUMCH 24
#define M_COEFF 20

double melScaleInverse(double f) {
    return 700 * (pow(10, f / 2595.0) - 1);
}

double melScale(double f) {
    return 2595 * log10(1 + (f / 700.0));
}

/*
 * The bank needs M_NUMCH + 2 edge frequencies, because filter m reaches
 * from edge m - 1 to edge m + 1.
 */
void buildFilterBank(double *filterBank) {
    int i;
    double melMin = melScale(F_MIN);
    double melDelta = (melScale(F_MAX) - melMin) / (M_NUMCH + 1);
    for (i = 0; i < M_NUMCH + 2; i++) {
        filterBank[i] = melScaleInverse(melMin + melDelta * i);
    }
}

/* Triangular filter m (1-based) of the bank evaluated at frequency f_k */
double filter(double *f /* filterBank */, int m /* filter index */, double f_k) {
    if (f_k < f[m - 1] || f_k >= f[m + 1]) {
        return 0;
    } else if (f_k < f[m]) {
        return (f_k - f[m - 1]) / (f[m] - f[m - 1]);
    } else {
        return (f_k - f[m + 1]) / (f[m] - f[m + 1]);
    }
}

void computeFFT(void) {
    int i, k;
    int16_t *iobuf;        /* the WAV samples are signed 16-bit */
    double *in;
    double filterBank[M_NUMCH + 2];
    fftw_complex *out;
    fftw_plan plan;

    iobuf = calloc(N, sizeof(int16_t));
    in = (double *) fftw_malloc(sizeof(double) * N);
    out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * (N / 2 + 1));
    plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_MEASURE);
    buildFilterBank(filterBank);

    /* pass over the 48-byte WAV header */
    fread(iobuf, sizeof(int16_t), 24, stdin);

    double c_ls[N2_m1];   /* linear spectrum (bin magnitudes) */
    double c_mf[M_NUMCH]; /* Mel-filtered spectrum            */
    double c_mc[M_COEFF]; /* Mel cepstral coefficients        */

    while (fread(iobuf, sizeof(int16_t), N, stdin) == N) {
        for (i = 0; i < N; i++) {
            in[i] = iobuf[i];
        }
        fftw_execute(plan);

        /* normalized magnitude of each FFT bin */
        for (k = 0; k < N2_m1; k++) {
            c_ls[k] = sqrt((out[k][0] / N) * (out[k][0] / N)
                         + (out[k][1] / N) * (out[k][1] / N));
        }

        /* apply the Mel filter bank; bin k sits at k * 10 Hz */
        for (i = 0; i < M_NUMCH; i++) {
            c_mf[i] = 0;
            for (k = 0; k < N2_m1; k++) {
                c_mf[i] += c_ls[k] * filter(filterBank, i + 1, k * SAMPLE_RATE_DIV_N);
            }
        }

        /* DCT of the log filter outputs */
        for (k = 0; k < M_COEFF; k++) {
            c_mc[k] = 0;
            for (i = 0; i < M_NUMCH; i++) {
                if (c_mf[i] != 0)
                    c_mc[k] += log(c_mf[i]) * cos(k * (M_PI / M_NUMCH) * (i + 0.5));
            }
        }
        fwrite(c_mc, sizeof(double), M_COEFF, stdout);
    }

    fftw_destroy_plan(plan);
    fftw_free(out);
    fftw_free(in);
    free(iobuf);
}

int main(int argc, char *argv[]) {
    computeFFT();
    return 0;
}
```

```sh
lame --decode -m m -t $filename | ./mfcc > $filename.features
```

The program works as follows:

- Read the data using a rectangular 100 ms window (usually a Hamming window [7] or similar is used). This results in 4410 samples (16 bits wide each), as the WAV is supposed to be sampled at 44.1 kHz.
- Run the FFT [8] on the acquired samples – here we use the FFTW [9] library. Then take the magnitude of the transformation results: if X_k is the k-th transformation result, we use |X_k| = sqrt(Re(X_k)^2 + Im(X_k)^2).

- Scale the result of step 2 using the Mel scale filter bank (this presentation [10] helped me a lot when implementing the filter bank). We'll call the result c_mf(i) the Mel-transformed value.

- Finally compute the k-th coefficient using a discrete cosine transformation: c_mc(k) = Σ_{i=0}^{M−1} log(c_mf(i)) · cos(k · (π/M) · (i + 0.5)), where M = 24 is the number of Mel channels and the number of coefficients, 20, is the dimensionality of the resulting feature vector.

- Write the feature vector c_mc to stdout as 8-byte-wide little-endian IEEE 754 double precision floats.
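A minimal consumer for this feature stream might look like the following sketch (the reader function and its name are my own; it assumes the reading machine uses the same native double layout as the writer):

```c
#include <stdio.h>

#define M_COEFF 20  /* dimensionality of one feature vector */

/* Read one M_COEFF-dimensional feature vector from fp.
 * Returns 1 on success, 0 on EOF or short read. */
int read_frame(FILE *fp, double vec[M_COEFF]) {
    return fread(vec, sizeof(double), M_COEFF, fp) == M_COEFF;
}
```

A loop like `while (read_frame(stdin, vec)) …` then walks the `.features` file frame by frame.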

Two questions remained:
- How good is the quality of those feature vectors?
- How do I process such a vast amount of data?


The idea is to compute the mean vector m of all feature vectors, then compute the Euclidean distance (as this is the measure of distance a SOM implementation would use) of each vector v_i to m: d_i = ||v_i − m||. Finally we take the mean of the d_i (call it d̄) and calculate the standard deviation of the distances. Those numbers turned out to be rather small, which suggests one of two things:

- The music I listen to is pretty much all the same
- The feature vectors are not really of good quality

Personally I’d go with the latter, as the music in my library is not that homogeneous. One way to improve the quality of those feature vectors could be to use the first and second derivatives of the cepstrum coefficients, as proposed here [11], or to use a Hamming window for reading the data.

## References

- [1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proceedings of the National Academy of Sciences of the USA, vol. 79 no. 8 pp. 2554-2558, April 1982.: http://www.pnas.org/content/79/8/2554.abstract
- [2] Rosenblatt, Frank (1958): The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 (1958) 386-408: http://www.citeulike.org/user/kira/article/2051038
- [3] Wikipedia: Self Organizing Map: http://en.wikipedia.org/wiki/Self_organizing_map
- [4] Jaakko Hollmén, 1996: Using the Self-Organizing Map: http://www.cis.hut.fi/jhollmen/dippa/node24.html
- [5] Beth Logan: Mel frequency cepstral coefficients for musical modeling: http://ciir.cs.umass.edu/music2000/papers/logan_paper.pdf
- [6] Wikipedia: Mel scale: http://en.wikipedia.org/wiki/Mel_scale
- [7] Wikipedia: Hamming window: http://en.wikipedia.org/wiki/Window_function#Hamming_window
- [8] FFTW documentation: The 1d Real-data DFT: http://www.fftw.org/fftw3_doc/The-1d-Real_002ddata-DFT.html#The-1d-Real_002ddata-DFT
- [9] Homepage: Fastest Fourier Transform in the West: http://www.fftw.org/
- [10] Stefan Scherer: MFCC Implementierung: http://www.informatik.uni-ulm.de/ni/Lehre/SS09/PraktikumNI/MFCC.pdf
- [11] Lehrbuch zur Mustererkennung: http://www5.informatik.uni-erlangen.de/fileadmin/Persons/NiemannHeinrich/klassifikation-von-mustern/m00-www.pdf
- [12] Teuvo Kohonen, et al.: Self organization of a massive document collection (2000): http://130.203.133.121:8080/viewdoc/summary;jsessionid=73500594ABCEE7205702FB7CB3DBF8AB?doi=10.1.1.32.2117
- [13] R. D. Lawrence, et al.: A Scalable Parallel Algorithm for Self-Organizing Maps with Applications to Sparse Data Mining Problems: http://portal.acm.org/citation.cfm?id=593480