Monday 25 May 2015

sgemm, OpenCL

Yesterday I couldn't do much else so I played with some OpenCL code again. Just as my left foot was nearing recovery from gout ... I think I strained my right foot from too much walking or other activity and i'm immobile again. Argh.

With OpenCL still haven't managed to work out why I can't use SVM - I have a C test, a C test based on extracting all the relevant code from the BufferBandwidth sample (from amd sdk), and a C++ test based on the BufferBandwidth sample; they all crash as soon as I try to invoke a kernel against an SVM buffer, although BufferBandwidth runs fine. I even tried compiling linux 3.19.8 - I had to modify the catalyst driver a little bit to get it to build but I had it working for a bit, but suspend was broken and then I made one too many changes to the linux config and i couldn't get it to boot again and eventually just gave up. The linux config system is pretty shit and any changes force a full rebuild so i was getting sick of that. When I did have it running it made no difference to the SVM stuff anyway.

OTOH I did find that using CL_MEM_USE_HOST_PTR works in much the same way anyway (in terms of java usefulness) - even without mapping or unmapping the values are being updated on the via the GPU device, so with any luck map/unmap are just noops. I didn't really look too much further though.

What I looked at instead was implementing a basic matrix matrix multiply, i.e. lapack's sgemm. Not really for any need but just curiosity; how much effort is required vs the payoff.

My test case was a C=AxB where each is (1024,1024) with row-major order storage. CPU is a AMD A10-7850K (kaveri).

  java naive             20.5
  java copy col B         1.5
  java copy col B mt      0.50
  opencl cpu naive        6.5
  opencl cpu float4x4     0.49
  opencl gpu simple       0.26
  opencl gpu float4x4     0.045

  java ojalgo (double)    0.48
  java la4j (double)      1.7

  java copy stream        0.40
  java copy stream x4     0.30

java naive
This just implements the classic algorithm literally - i.e. for all rows of A, dot product of row by all columns of B, etc. The problem here is that each dot product scans a column of B in a potentially worst-case way in terms of cacheable memory access - this size hits that.
java copy col B
This just inverts the two outer loops so that it runs for all columns of B and then dot products that with all rows of A. It copies the current column of B in the outermost loop and so it only has to run once for every 2^20 accesses (in this case). Which is obviously worth it.
java copy col B mt
This replaces the outer loop with a IntStream.range(0, n).parallel().forEach(). It's not optimal memory-use wise but that makes little difference in this canned example (see the last couple of results). This is a trivial change and also easily worth it.
opencl * naive
This is a trivial opencl implementation that runs transposed with each work item calculating a single output value. The work size is 64,1,1 in each case. This shows that it isn't worth using OpenCL without a bit more effort on the algorithm.
opencl * float4x4
This is the most complex implementation whereby each work-item calculates a 4x4 output cell (calculates 4 rows at once). The number of columns and rows must be multiple of 4. It's basically just an unrolled loop using vector types; but the code is still quite straightforward. At least in this case - since the problem is embarrassingly parallel - the effort required is modest for the gains possible.

java copy stream
This replaces the outer loop of the copy col B loop with a custom non-gc-polluting spliterator over the columns of B. i.e. it allocates one row for each thread which is re-used for each call. This is moderately more work to set-up but the spliterator is re-usable. It's also possibly slightly misleading due to the way hotspot optimises callbacks.
java copy stream x4
Well what the hell - this unrolls the inner loop by 4x so only works on matrices with A_n a multiple of 4.

The opencl code is sub-optimal for the CPU case - something closer to the java implementation would make sense - I will try again at another time perhaps. I'm not that familiar with the compiler behavior or best processing model to use for the CPU driver but using vectors obviously helps. But if it's barely faster than Java there wont be much point. OTOH a 10x speedup using the kaveri GPU is a bit more interesting.

Sorry no code today - if i keep poking i might drop it into a zcl-samples thing later on. I'm sure there's plenty of (better) code out there in accelerated lapack libraries anyway.

Update: So before i posted this i came across the java matrix benchmark and the pretty simple 'java copy col b' is pretty close to ojAlgo running on this box. I only ran it on this same test and not the benchmark so i don't know if it's 'fast' on this machine although i imagine most of it's perf advantages in multiply in the benchmarks is from multi-threading. I also had some time to blow so I tried the row stream and row stream unrolled versions just out of curiosity.

ojalgo only seems to do double arrays, which doesn't make much difference with this problem size apart from double the storage space. I'm using a double accumulator for the dot product fwiw.

Update: Bored. Looked at la4j. It's maven only but a quick makefile fixed that abhorrence. Anyway it's just a tiny bit slower in this case and surprisingly only single-threaded (for something which is `current', this really is surprising). It's using the same algorithm as "java copy col B" but it uses 2D java arrays for it's dense matrix and creates a garbage copy of every column during operation rather than re-using the array. It looks like a pretty nice little library apart from a few odd looking decisions like these, especially the custom serialisation mechanism, and lack of threading. But there's a lot more to a matrix library than a multiply.

FWIW I didn't bother to include it in the above but on the weekend I also tried a direct ByteBuffer as storage. It takes a bit longer than the array backend for hotspot to optimise but it's quite close to the array version once it has. Or actually a bit faster in the mt case, for some strange reason.

No comments: