Saturday 10 September 2011

Masked Loops & LU Decomposition

Update 26/11/11: So it turns out the timing is completely bogus because the code not only had a typo in it - which probably allowed the loop to be optimised away - the algorithm was incorrect as well. So I have a feeling this is all just wrong now ... That's what I get for timing without validating first.

I have since re-coded this algorithm, and I think I have something that works. I resorted to building a code generator and manually unrolling all the loops using branchless code. For 4096 6x6 matrices it takes 28uS to perform the LU decomposition (they're the same matrix again).

Also see a follow-up on the LU decomposition problem.


I've been having trouble getting some code going for work so I thought i'd have a 'break' and visit porting some of the bits and pieces to OpenCL.

The LU decomposition seemed obvious.

Not because I really intended to use it, I thought i'd just try an almost direct translation to OpenCL to start with. All I did was have each thread work on a full solution independently. I used local memory for the column cache, but that was about it.

Fairly impressed - even disregarding all possible optimisations and memory issues, it only took 220uS (4096, 6x6 matrices). It can no doubt be made to run faster, but I was so surprised I almost considered just leaving it at that - relative to the other operations it is 'fast enough' ... but of course I didn't. (I was also solving the same matrix copied 4096 times, so it was 'best case' wrt thread divergence). I guess it isn't really so surprising though - apart from the poor memory access pattern the code path is consistent across threads.

I did a bit more work the serial version before tackling the parallel: e.g. by copying the whole matrix to LS first, and telling the compiler to unroll the loops I got it down to 118uS.

Anyway, along the way to parallising this I came across an interesting and not entirely obvious optimisation. It was to use the sort of result-masking you need when vectorising algorithms, but just for a simple loop.

Part of one inner loop the algorithm has this construct:
#define m 6
#define n 6

int kmax = min(i, j);
float s = 0.0f;

for (int k = 0; k < kmax ; k++) {
s += LU[i*n+k] * LU[k*n+j];
}

LU[i*n+j] = LU[i*n+j] -=s;

In this case, 6 threads working together on this problem and each are working on 1 row each. The i index is thus the matrix-local work-item, and j is the current column.

The problem with this code is the loop indexing is unknown at compile time, so the address calculations need to be done on the fly and the loop can't be unrolled very efficiently. And although the compiler could do what I do below, it seems a bridge too far at present.

So I changed the loop to this:
    int kmax = min(i, j);
float s = 0.0f;

for (int k = 0; k < n ; k++) {
float v = LU[i*n+k] * LU[k*n+j];

s += k < kmax ? v : 0;
}

LU[i*n+j] = LU[i*n+j] -=s;

And even though it appears to be doing more work - things just don't work like that on a GPU. Each 'thread' is still executing all parts of all loops anyway (or at best, sitting idle waiting for the longest running one to finish).

This simple change lead to a 25% improvement in the overall execution time in my test case.

My 'final' code executes in 38uS (I haven't verified it works! So this number might be nonsense!). And I still have further to go - the way I access memory isn't coalesced, I also have a lot of local-store bank conflicts to nut out.

So, assuming the code works, maybe that original 220uS wasn't so hot afterall.

No comments: