Hello CUDA

As I wrote in my last post, we started fiddling around with CUDA today. First thing to notice: the demos are just totally awesome (like this particle demo here). We’re talking teraflops of computing power here. After doing the basic CUDA exercises (which were really helpful), I started to implement a cellular automaton (CA) again. And guess which one: A2, of course.

// includes, system
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg);

// Reads the cell at d_pos into d_res; out-of-range positions count as dead
// (0). dim3 components are unsigned, so e.g. x - 1 at x == 0 wraps around and
// fails the bounds check, which is exactly what we want here.
__device__ void get_field(int* d_res, dim3 d_pos, int *d_p0, int d_size) {
    (*d_res) = 0;
    if(d_pos.x < d_size && d_pos.y < d_size) (*d_res) = d_p0[(d_pos.y * d_size) + d_pos.x];
}

__global__ void caKernel( int *d_p0, int size, int *d_p1 )
{
    // Global cell coordinates; note the factor is blockDim, not gridDim (with
    // the launch configuration below the two happen to coincide, but blockDim
    // is the correct one)
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    // Sum up the eight neighbours of (x, y); out-of-range neighbours read as 0
    int sum = 0;
    int v = 0;
    get_field(&v, dim3(x - 1, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 0, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y - 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 1, y - 0), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y - 0), d_p0, size); sum += v;
    get_field(&v, dim3(x - 1, y + 1), d_p0, size); sum += v;
    get_field(&v, dim3(x - 0, y + 1), d_p0, size); sum += v;
    get_field(&v, dim3(x + 1, y + 1), d_p0, size); sum += v;
   
    // A2 rule: a cell is alive in the next generation iff exactly two of its
    // eight neighbours are alive
    d_p1[y * size + x] = sum == 2 ? 1 : 0;
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    if(argc < 3) {
        fprintf(stderr, "usage: %s <size> <gens>\n", argv[0]);
        return -1;
    }
    int size = atoi(argv[1]);
    int gens = atoi(argv[2]);

    printf("Calculating %i generations for a %ix%i (=%i cells) poulation\n", gens, size*size, size*size, size*size*size*size);

    // pointer for host memory
    int *h_a;

    // pointer for device memory
    int *d_a;
    int *d_b;

    size_t memSize = size * size * size * size * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void**)&d_a, memSize  );
    cudaMalloc( (void**)&d_b, memSize  );

    // Seed: the first size*size cells (one row of the grid) alternate dead/alive
    memset(h_a, 0, memSize);
    int i; for(i = 0; i < size*size; i++) h_a[i] = i%2;
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    dim3 dimGrid( size, size );
    dim3 dimBlock( size, size );
    // Ping-pong between the two device buffers: even generations read d_a and
    // write d_b, odd generations the other way around
    for(i = 0; i < gens; i++) {
        caKernel<<< dimGrid, dimBlock >>>( i % 2 == 0 ? d_a : d_b, size * size, i % 2 == 0 ? d_b : d_a );
    }

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel execution");

    // The last generation ended up in d_b if gens is odd, in d_a otherwise
    cudaMemcpy( h_a, gens % 2 == 1 ? d_b : d_a, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("cudaMemcpy");

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(-1);
    }                        
}
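
In case you want to convince yourself that the kernel does the right thing, here is a minimal serial reference for one generation – my own sketch, not part of the code above, and the names get_host and step_host are made up. It applies the same rule as caKernel, so the host and device grids can be compared after each generation:

// Hypothetical serial reference for one A2 generation (a sketch; the names
// get_host and step_host are mine). Out-of-range neighbours count as dead,
// matching get_field above.
int get_host(const int *p, int x, int y, int n) {
    if (x < 0 || y < 0 || x >= n || y >= n) return 0;
    return p[y * n + x];
}

void step_host(const int *src, int *dst, int n) {
    int x, y, dx, dy;
    for (y = 0; y < n; y++) {
        for (x = 0; x < n; x++) {
            int sum = 0;
            for (dy = -1; dy <= 1; dy++)
                for (dx = -1; dx <= 1; dx++)
                    if (dx != 0 || dy != 0)
                        sum += get_host(src, x + dx, y + dy, n);
            dst[y * n + x] = (sum == 2) ? 1 : 0; // same rule as the kernel
        }
    }
}

To try the program itself, something like nvcc -o ca ca.cu && ./ca 4 4096 should do (assuming the file is called ca.cu); note that the grid side is the square of the first argument, so 4 yields a 16x16 population.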

The implementation is roughly based on those exercises. What makes CUDA so powerful is that it employs massively parallel computation. The time complexity of the algorithm itself is \(\mathcal{O}(n)\) for computing one generation, where \(n = k^2\) and \(k\) is the population size – the X-axis in the graph below. But as we can see, the factor by which the time grows is pretty low.
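
To make the counting explicit (a back-of-the-envelope sketch, not the formal proof mentioned in the note below): every cell inspects its eight neighbours and does a constant amount of work, and there are \(k^2\) cells, so one generation costs

    \[k^2 \cdot 8 \cdot \mathcal{O}(1) = \mathcal{O}(k^2) = \mathcal{O}(n)\]

sequential work. Since the kernel runs one thread per cell, those \(n\) work items execute (up to hardware limits) in parallel, which is why the measured growth factor stays so low.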

Note: this is all unproven. A formal proof may follow some time (especially for the claim that the algorithm is \(\mathcal{O}(n)\)).

[Figure: Time behaviour – computation time vs. population size \(k\)]

It’s also pretty interesting to compare those results with the time required to do all this computation solely on the CPU. Using the A2 implementation I presented two posts ago, computing 4096 generations of a \(16\times 16\) population takes roughly 0.154s, while the CUDA implementation requires roughly 0.044s for the same work. I used an nVidia GeForce 8600 GT (rev a1) for the CUDA stuff and a 2 GHz Intel Core 2 Duo for the CPU version. This stuff really is fun, but there is one question left: what to do with all this power?
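
For the record, that is a speedup of roughly

    \[\frac{0.154\,\text{s}}{0.044\,\text{s}} \approx 3.5\]

and one would expect the gap to widen for larger populations, since the CPU version processes the cells one after another while the CUDA version runs one thread per cell.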
