
Hello CUDA

As I wrote in my last post, we started fiddling around with CUDA today. First thing to notice: the demos are just totally awesome (like this particle demo here). We’re talking teraflops of computing power here. After doing the basic CUDA exercises (which were really helpful) I started to implement a CA again. And guess which one: A2, of course.

// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg);

// Read one cell; out-of-range coordinates count as dead (0).
// dim3 members are unsigned, so x - 1 at x == 0 wraps around
// and fails the bounds check, as intended.
__device__ void get_field(int* d_res, dim3 d_pos, int *d_p0, int d_size) {
    (*d_res) = 0;
    if(d_pos.x < d_size && d_pos.y < d_size) (*d_res) = d_p0[(d_pos.y * d_size) + d_pos.x];
}

__global__ void caKernel( int *d_p0, int size, int *d_p1 )
{
    // global cell coordinates (blockDim, not gridDim, is the per-block extent)
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    // sum up the eight neighbours
    int sum = 0;
    int v = 0;
    get_field(&v, dim3(x - 1, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 0, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 1, y - 0), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y - 0), d_p0, size); sum += v;
    get_field(&v, dim3(x - 1, y + 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 0, y + 1), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y + 1), d_p0, size); sum += v;

    // A2 rule: a cell lives iff exactly two of its neighbours are alive
    d_p1[y * size + x] = sum == 2 ? 1 : 0;
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    if(argc < 3) {
        fprintf(stderr, "usage: %s <size> <gens>\n", argv[0]);
        return -1;
    }
    int size = atoi(argv[1]);
    int gens = atoi(argv[2]);

    printf("Calculating %i generations for a %ix%i (=%i cells) population\n", gens, size*size, size*size, size*size*size*size);

    // pointer for host memory
    int *h_a;

    // pointer for device memory
    int *d_a;
    int *d_b;

    size_t memSize = size * size * size * size * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void**)&d_a, memSize  );
    cudaMalloc( (void**)&d_b, memSize  );

    memset(h_a, 0, memSize);
    // seed the first row of the field with an alternating 0/1 pattern
    int i;
    for(i = 0; i < size*size; i++) h_a[i] = i%2;
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // one thread per cell: size x size blocks of size x size threads each
    // (a block is capped at 512 threads on this generation of hardware,
    // so size can be at most 22)
    dim3 dimGrid( size, size );
    dim3 dimBlock( size, size );

    // ping-pong between the two device buffers
    for(i = 0; i < gens; i++) {
        caKernel<<< dimGrid, dimBlock >>>( i % 2 == 0 ? d_a : d_b, size * size, i % 2 == 0 ? d_b : d_a );
    }

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel execution");

    // the last generation sits in d_b after an odd number of
    // generations and in d_a after an even number
    cudaMemcpy( h_a, gens % 2 == 1 ? d_b : d_a, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("cudaMemcpy");

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(-1);
    }                        
}

The implementation is roughly based on those exercises. What makes CUDA so powerful is that it employs massively parallel computing. The time complexity of the algorithm itself is \(\mathcal{O}(n)\) per generation, where \(n = k^2\) and \(k\) is the population size (the X-axis in the graph below). But as we can see, the factor by which the time grows is pretty low.
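
A quick back-of-the-envelope argument for the linear bound (a sketch only, not a formal proof): each cell reads its eight neighbours and does constant work, so one generation costs

    \[T_{\text{gen}}(k) = c \cdot k^2 = \Theta(k^2) = \Theta(n), \qquad n = k^2,\]

where \(c\) is a small constant (eight neighbour reads and one write per cell).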

Note: this is all unproven. A formal proof may follow some time (especially for the algorithm being \(\mathcal{O}(n)\)).

[Figure: Time behaviour (X-axis: population size)]
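
How to get such numbers? The post doesn't include the measurement code; a minimal sketch of one way to time the kernel loop with CUDA events, reusing dimGrid, dimBlock, gens and the buffers from the listing above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for(int i = 0; i < gens; i++) {
    caKernel<<< dimGrid, dimBlock >>>( i % 2 == 0 ? d_a : d_b, size * size, i % 2 == 0 ? d_b : d_a );
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);              // block until the device is done

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("GPU time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);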

It’s also pretty interesting to compare those results with the time required to do all this computation solely on the CPU. Using the implementation of the A2 I presented two posts ago, computing 4096 generations of a \(16 \times 16\) population takes roughly 0.154s; the CUDA implementation requires roughly 0.044s for the same work. I used an nVidia GeForce 8600 GT (rev a1) for the CUDA stuff and a 2GHz Intel Core 2 Duo for the CPU version. This stuff really is fun, but there is one question left: what to do with all this power?

Louder now: A2

And no, this is not about the Taking Back Sunday EP :)

This is the sound I generated yesterday, together with the corresponding cell states. Remember: the lower the tone, the fewer cells are alive, and vice versa. So if you listen and watch very carefully, you can connect the sound to the pictures.
Just wanted to have this on YouTube

Listening to the A2

Shortcut: just listen here:
But if you want to know how all of this came to be, read on.

It got kinda boring today, so I decided to play around with the A2 automaton once again, and I reimplemented it. This time in C:

#include <stdlib.h>
#include <stdio.h>

// each uint_16 holds one 16-cell row of the field as a bit mask
typedef int uint_16;

int get_cell(uint_16 *m, int x, int y);
void set_cell(uint_16 *m, int x, int y, int val);

// sum of aliveness over the 3x3 block around (x, y),
// including the cell itself
int neighborhood_aliveness(uint_16 *m, int x, int y) {
    int s = 0, i = 0, j = 0;

    for(i = 0; i <= 2; i++)
        for(j = 0; j <= 2; j++)
            s += get_cell(m, (x + 1) - j, (y + 1) - i);

    return s;
}

// set or clear bit x in row y; out-of-range coordinates are ignored
void set_cell(uint_16 *m, int x, int y, int val) {
    if(x > 15 || y > 15 || x < 0 || y < 0) return;

    if(val > 0) m[y] |= (1 << x);
    else m[y] &= ~(1 << x);
}

// read bit x of row y; out-of-range coordinates count as dead (0)
int get_cell(uint_16 *m, int x, int y) {
    return x > 15 || y > 15 || x < 0 || y < 0 ? 0 : (m[y] & (1 << x)) >> x;
}

uint_16* nextgen(uint_16* m) {
    // calloc so every row of the new generation starts out zeroed
    uint_16 *ng = calloc(16, sizeof(uint_16));

    int i = 0, j = 0;
    for(i = 0; i < 16; i++) {
        for(j = 0; j < 16; j++) {
            set_cell(ng, j, i, neighborhood_aliveness(m, j, i) == 2 ? 1 : 0);
        }
    }
    return ng;
}

void print_m(uint_16 *m) {
    int i = 0, j = 0;
    for(i = 0; i < 16; i++) {
        for(j = 0; j < 16; j++) {
            printf("%i", get_cell(m, j, i));
        }
        printf("\n");
    }  
    printf("\n");
}

int count_alive_cells(uint_16 *m) {
    int i = 0, j = 0, s = 0;
    for(i = 0; i < 16; i++) {
        for(j = 0; j < 16; j++) {
            s += get_cell(m, j, i);
        }
    }  
    return s;
}

int main(int argc, char **argv) {
    // calloc: all rows start out dead before the seed pattern is set
    uint_16 *field = calloc(16, sizeof(uint_16));
    field[0] = 0xAAAA;

    int i = 0;
    for(i = 0; i < 32000; i++) {
        uint_16 *next = nextgen(field);
        free(field);            // don't leak the previous generation
        field = next;
        printf("%i\t%i\n", i, count_alive_cells(field));
    }

    free(field);
    printf("\n");
    return 0;
}

It’s not a particularly beautiful implementation, but it gets the job done. Once that was settled, I plotted the number of alive cells per generation.
[Figure: alive cells per generation]
After doing some more rather standard stuff with data like this (like calculating the standard deviation and plotting it: nothing fancy here), I noticed that the plot kinda reminded me of the wave patterns one usually finds in sound editors. So, why not use the data the A2 gave us to generate some sort of music? No sooner said than done. Once I had SuperCollider downloaded and running, I did some tutorials and fiddled around with that massive and powerful tool. Eventually I managed to come up with the following piece of code:

(
s.recSampleFormat = 'int16';
s.recHeaderFormat = 'wav'
)

(
SynthDef(\help_sinegrain,
    { arg out=0, freq=440, sustain=0.5;
        var env;
        env = EnvGen.kr(Env.perc(0.05, sustain, 0.2), doneAction:2);
        Out.ar(out, SinOsc.ar(freq, 0, env))
    }).send(s);
)



(
a = Pseq(#[14, 24, 14, 24, 22, 34, 33, 32, 47, 48, 38, 43, 47, 55, 57, 59, 67, 58, 60, 53, 65, 68, 75, 67, 65, 63, 77, 53, 74, 79, 86, 52, 68, 84, 61, 55, 55, 58, 62, 71, 74, 66, 59, 66, 78, 68, 64, 73, 71, 52, 80, 66, 56, 64, 67, 59, 69, 75, 62, 74, 62, 99, 49, 62, 54, 74, 57, 63, 70, 71, 67, 69, 75, 57, 65, 85, 72, 71, 68, 78, 67, 60, 81, 70, 52, 70, 83, 82, 57, 65, 64, 68, 61, 73, 69, 78, 41, 56, 58, 64, 73, 66, 54, 74, 62, 68, 64, 84, 72, 56, 66, 73, 83, 54, 78, 45, 68, 68, 79, 70, 59, 59, 77, 71, 56, 73, 53, 83, 84, 52, 62, 77, 59, 87, 62, 70, 65, 74, 80, 62, 56, 64, 73, 71, 79, 56, 71, 66, 86, 70, 77, 70, 76, 51, 57, 51, 65, 45, 56, 47, 79, 58, 87, 71, 71, 60, 59, 88, 55, 53, 68, 70, 78, 62, 79, 57, 55, 67, 66, 61, 68, 80, 85, 45, 55, 73, 56, 77, 58, 54, 71, 71, 67, 65, 73, 78, 73, 69, 81, 54, 66], 1).asStream;
Routine({
    loop({
        Synth(\help_sinegrain, [\freq, a.next.midicps]);
        0.2.wait;
    });
}).play();
)

That Pseq contains the first 200 samples from the A2 automaton. The sequence is then used as MIDI note numbers a few lines below (midicps converts them to frequencies). The outcome actually sounds pretty neat. I did not expect to be able to create such nice-sounding stuff within this short amount of time. But anyway, the A2 is still interesting and SuperCollider is super cool. Alright, I really do need to get some sleep now. It’s like 4 o’clock in the morning, and a fellow student of mine and I want to start playing around with CUDA tomorrow.
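
PS: for completeness, getting the numbers from the C program into that Pseq was plain copy-and-paste. A hypothetical helper (it would live in the same file as the C listing above, reusing nextgen and count_alive_cells) could print the counts in SuperCollider’s array syntax directly:

/* Hypothetical helper: prints the first n alive-cell counts as a
   SuperCollider Pseq literal. Consumes (frees) the field it is given. */
void print_as_pseq(uint_16 *field, int n) {
    int i;
    printf("a = Pseq(#[");
    for(i = 0; i < n; i++) {
        uint_16 *next = nextgen(field);
        free(field);
        field = next;
        printf(i == 0 ? "%i" : ", %i", count_alive_cells(field));
    }
    free(field);
    printf("], 1).asStream;\n");
}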
