Monday 13 August 2012

ARM, NEON, etc

After a good few hours of experimentation I had some NEON code running. With all the OpenCL i've been doing for a while I kind of forgot some of the ways one writes SIMD code, and for a "RISC" cpu, a modern ARM has a large set of instructions to learn.

For example, whilst trying to build the 8 bits of the LBP code my first approach was to try to perform the 8 comparisons to form the bits in the code using a single instruction. The problem with this is that it takes a lot of fuffing about just to get the bytes in the right spot (vtbl helps a lot here), and even more fuffing about trying to turn those 8 results into 1x8 bit number. And even when you've done that you need to get 8 results before you write something out and it all just gets messy. Thinking about that jogged my memory on how one can stream data through a SIMD processor and instead produce 'n' independent results concurrently.

Which makes the whole thing a lot simpler - and allows one to re-use memory accesses as well without having to write custom extraction code for every position in the vector. For 6x64-bit memory accesses I can produce 8 output codes, and the code itself is fairly simple ...

Unfortunately, this only produces the 8 bit LBP codes, and I really want the u2 variant ... which requires a 256-byte lookup table. (you can't really do the lookup using NEON, and have to resort to ARM, and it's too slow moving the registers back, so it has to go through memory).

The obvious approach is to just run the whole lot twice, but it turns out that's pretty slow because of the slow memory on the beagleboard. e.g. my best LBP algorithm (which does 8 columns and 4 rows per inner loop) takes about 2.6ms to process a 512x512 image (simple C is about 12ms). And my best LBP to LBPu2 conversion routine takes about 2.5ms ... and together they end up 5.1ms.

So I took a slightly simpler LBP routine (does 8x1 results per loop, about 2.9ms) and tacked on the ARM code which processes the previous row's result after en-queuing all the NEON instructions (i.e. it does 8 lookups for the row above, whilst the current row calculates 8 results). This showed some promise, so I then took the ARM instructions and sprinkled them throughout the NEON ones so it spreads out the calculation a bit more/improves parallelism. Not bad - best effort is down to 3.9ms total. I only have a rough idea on the scheduling rules, I wish I had the static analysis tool for the SPU which showed dual-issue and delay slots.

I guess it's worth it - the C code that does the U2 lookup is about 13ms - but it didn't take a good few days to write and it's only 10 lines of code.

I also looked at a 'SIMD' like ARM version - it turns out you can do 4x 8-bit unsigned comparisons with a single instruction (uhsub8) - I got stuck working on that till late into the night one night (instead of watching tv), but then I figured it wasn't going to beat the NEON, so I couldn't be bothered debugging it. OTOH it would allow one to hook in the U2 lookup more easily, so it might actually be a win and worth looking into. The ARM code is very tight on registers though so i'm not sure it will fit.

I've done some testing with the classifier code as well - at first I thought I'd just wasted 10 hours on the NEON code because it was taking the same time as some simple C (the classifier is literally a 5 line double-loop). But then I realised I was running it 8x as many times as i should have. At least an 8x speedup is definitely not something to sniff at. And I still have the outer loop in C and with all the redundant function invocation overheads so there might be a bit more left to get. I haven't debugged it yet so i'm not sure it's working, but a 17x17 template on a 512x512 image is about 0.6s - this seems slow but this is much smaller than one would normally look for objects in such a source image, and at a more-realistic 4x this size it takes about 30ms.

There's still a lot of scaffolding before I can find out how it will actually run in practice, but it's looking ok so far. I kind of forgot how long it takes to write assembly code (at least, when one is learning the instruction set), but now I remember why I wasted so much of my youth on it - it is rather fun.

4 comments:

av500 said...

Hi notzed, we miss you in #beagle :)

or sign up on G+, quite a bunch of us embedded people there,

Regards,

av500

NotZed said...

Yeah I can't remember what happened - I think it was mostly work which dragged me in a different direction with the OpenCL stuff leaving no cycles for anything else. And just the fact I was working again - I think 2010 I had a welcome good few months break between contracts.

But now with doing android stuff on ARM machines, having a proper gnu/linux box to develop low-level stuff on becomes pretty handy (I wish I had've done it earlier - but I didn't have time, at the time).

I'm still on the bb mailing list, and glance at it occasionally but don't really have anything to add: I just haven't kept up.

av500 said...

I'm resharing all your ARM/NEON stuff on Google+ with the embeddded people there. Beats most of the "I made a LED blink with Python" posts :)

NotZed said...

No worries. I guess people have to start somewhere, and after seeing a gpu in action, scaling a greyscale bitmap probably isn't everyone's priority ;-)