Quick Play with the AMD Stream SDK

I mentioned in my last blog post that I was disappointed with the performance of Microsoft Accelerator, and wanted to play around with Brahma. I was going to do this sooner, but have been side-tracked with playing around with XNA on Windows Phone 7.

I downloaded the latest OpenCL version of Brahma, but had trouble with the nested loops and aggregation operations (force summations), so didn't get as far as I'd hoped. It’s a shame, as the concept of LINQ to GPU is a great one.

I then took a look at running the OpenCL NBody simulation from the Stream SDK. I couldn't get the simulation to run using the GPU despite trying various Catalyst versions, it failed with a runtime error message "This OpenCL build requires verison 1.4.879, version 1.4.696 installed", but in spite of this, I was impressed with the performance of using the Stream SDK, even running on the CPU.

Whereas my managed CPU-version of the nbody simulation achieved 5 fps (frames per second), (or 8 fps with the drawing disabled – as discussed earlier the WPF drawing code is slow), drawing 2000 bodies, the OpenCL SDK ran at 25 fps drawing 2048 bodies, i.e. a factor of 5 speedup. I didn't bother to parallelise my code but the theoretical maximum speedup on my dual core machine would obviously be a factor of 2, so that's a factor of 2.5 speedup using the Stream SDK on the same hardware.

I switched the Stream SDK NBody example to use the nBodyCPUReference() method to see whether it's slow because of the difference between managed and native code, and it runs at 5 fps compiled native on the CPU, i.e. in the same ballpark as the managed version. As it's not running on the GPU, the Stream version must be faster than the vanilla C++ version because it's making use of the processor's vector hardware, but I can't be bothered to manually code the SSE intrinsics to see if that's the case (but it might be cool to play around with Mono.SIMD if I get time).

Oh, I suppose I should talk about how the code looks - the guts of the algorithm doesn't look much different between the vanilla C++ and the OpenCL version, but there is a lot of hideous boilerplate/setup code different between the two. This is why it'd be great to get a workable managed library to hide all this (alternately, it'll be interesting to see whether C++ AMP abstracts away the OpenCL/DirectCompute complexity).

Labels: GPGPU