Wednesday 15 August 2012

SIMD linear re-sampling

It's usually more efficient to do bi-linear re-sampling in two passes, as it then only requires 2 multiplies per output pixel (one lerp per pass) rather than 4 (one weight per source sample). So today I had a look at the X sampling problem in SIMD - re-sampling in Y is a trivial SIMD exercise.

I've done this before on the SPU, but I couldn't find my code again; in any event it wasn't so much the algorithm as the NEON translation that was the tricky bit.

A simple and efficient way to implement this re-sampling in C is to use 16.16 fixed-point values for the calculation (this is more than enough accuracy for screen-resolution images), so a straightforward implementation is something like the following:

typedef unsigned int uint;
typedef unsigned char ubyte;

float scale = (float)srcwidth / dstwidth;
uint xinc = (uint)(65536 * scale);
uint xoff = 0;
int x;

for (x=0;x<dstwidth;x++) {
   // pixel address
   uint sx = xoff >> 16;
   ubyte a = src[sx];
   ubyte b = src[sx+1];
   // pixel fraction (signed, so the lerp below is a signed multiply)
   int rx = xoff & 0xffff;

   dst[x] = (ubyte)(a + (((b - a) * rx) >> 16));

   xoff += xinc;
}

This assumes the scale factor (dstwidth/srcwidth) is >= 0.5 - i.e. no more than 2x minification - and bi-linear re-sampling breaks down beyond that anyway. For example, a 2x upscale from 256 wide to 512 gives xinc = 0x8000, so sx only advances every second output pixel.

For RGBA data, converting this to SIMD is straightforward: all the calculations are simply extended to 4-wide vectors (if that is the machine size). But for greyscale this doesn't work, because each source position needs to be loaded independently.
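
To make the contrast concrete, here is the scalar shape of the RGBA case (my own sketch, not from the original post - it reuses the 16.16 setup from the listing above). Each of the four channels of a pixel shares the same weight, so the channels map directly onto vector lanes:

for (x = 0; x < dstwidth; x++) {
   // byte address of the source pixel (4 bytes per pixel)
   uint sx = (xoff >> 16) * 4;
   // 8 bits of fraction is plenty for 8-bit pixels
   int rx = (xoff >> 8) & 0xff;
   for (int c = 0; c < 4; c++) {
      int a = src[sx + c];
      int b = src[sx + 4 + c];   // next pixel, same channel
      dst[x * 4 + c] = (ubyte)(a + (((b - a) * rx) >> 8));
   }
   xoff += xinc;
}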

The approach I came up with is to use a table lookup instruction to extract N consecutive pixel values into A and B vectors and then use per-lane scale factors. I also saw that this guy did it very differently - using a transpose so that the problem always becomes the easily-SIMD Y case.

Pseudo-code is something like this (using some opencl constructs):

float scale = (float)srcwidth / dstwidth;
uint xinc = (int)(65536 * scale);
uint4 xoff =  { xinc * 0, xinc * 1, xinc * 2, xinc * 3 };
int x;

for (x=0;x<dstwidth;x+=4) {
   // pixel address
   uint4 sx = xoff >> 16;
   ubyte8 d = vload8(src, sx.s0);
   // pixel fraction
   uint4 rx = xoff & 0xffff;
   // form lookup table
   uint4 ox = sx - sx.s0;

   ubyte4 a = lookup(d, ox);
   ubyte4 b = lookup(d, ox+1);

   vstore4(dst, x, (ubyte4)(a + (((b - a) * rx) >> 16)));

   xoff += xinc * 4;
}

Where lookup takes as its first argument an array of elements and as its second a list of indices, and returns the element of the array at each index. The limited lookup table suffices, since the limited scale range prevents out-of-range accesses.
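
In scalar form the lookup primitive is trivial (my own sketch - on NEON it maps to the vtbl instruction used in the assembly further down):

// scalar model of the table lookup: r[i] = d[ox[i]]
// (a single vtbl.8 on NEON)
static void lookup4(ubyte r[4], const ubyte d[8], const ubyte ox[4])
{
   for (int i = 0; i < 4; i++)
      r[i] = d[ox[i]];
}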

The astute reader will notice that this requires unaligned memory accesses for loading d, but with some small changes it could cope with force-aligned memory accesses, which is probably worth it. Actually the only change required is to use an aligned (masked) sx.s0 for the load and the ox calculation.
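
Something like this, in scalar form (my own sketch - the offsets stay small because of the limited scale range, so the lookup table still covers them):

// masked variant: align the load base down to 8 bytes and fold the
// remainder into the per-lane lookup offsets
uint base = (xoff >> 16) & ~7u;   // aligned load address into src
ubyte ox[4];
for (int i = 0; i < 4; i++)
   ox[i] = (ubyte)(((xoff + i * xinc) >> 16) - base);
// d is then loaded from src + base, which is always 8-byte aligned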

In the NEON code I mixed it up a bit - the sx.s0 calculation is performed in ARM code (as it's needed for indexing the loads) as well as sx being calculated in NEON. But the scaling itself (and rx) uses 8.8 fixed point so that 16-bit multiplies suffice - and 8 bits is enough precision for the pixel interpolation.

I tested scaling half a 512x512 image (i.e. 256x512) up by 2x in X to 512x512. Machine is a Beagleboard XM with default CPU clocking. I'm just using a C routine to call the assembly which processes one row at a time.

Routine     Time (ms)
C           9
NEON        2.9
NEON 2x     1.7
NEON 4x     1.5

The NEON code processes 8 pixels at a time (the minimum transfer size of a NEON load).

One problem with a straight conversion of the above code to NEON is that there are many dependent instructions, stalls, and no dual-issue opportunities. So the 2x and 4x versions process 2 and 4 rows of data concurrently and interleaved. Apart from the better scheduling, this also allows the addressing and rx calculations to be shared.
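
In scalar form the 2x idea looks something like this (my own sketch; row0/row1 and dst0/dst1 are names I've made up). The per-column sx and rx work is amortised over both rows:

// two rows with shared X addressing - in the NEON version the two
// streams also interleave, hiding instruction latencies
for (x = 0; x < dstwidth; x++) {
   uint sx = xoff >> 16;
   int rx = (xoff >> 8) & 0xff;
   int a0 = row0[sx], b0 = row0[sx + 1];
   int a1 = row1[sx], b1 = row1[sx + 1];
   dst0[x] = (ubyte)(a0 + (((b0 - a0) * rx) >> 8));
   dst1[x] = (ubyte)(a1 + (((b1 - a1) * rx) >> 8));
   xoff += xinc;
}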

I also tried the masked version with aligned reads and writes. I thought that gave better results - but it was only an error (I'd changed the scale factor). Although adding an appropriate alignment specifier to the writes did make a stable, measurable (if small) difference.

Y Scaling

I haven't looked at Y scaling but that is fairly straightforward. Actually, rather than performing a complete X scale pass and then a complete Y scale pass, it's better to scale the (Y, Y+1) rows into a double-row buffer and then form the output one row at a time. This can be extended to a pipeline of processing if you need to implement further passes and don't need all the data for a given pass - I used this in ImageZ to perform floating point compositing of integer data very fast in Java. It requires much less temporary memory, and should be faster because of the CPU cache (and see below about how slow memory is and how much the cache is needed). Also, since you're normally scaling up by some amount, Y(n+1) will probably equal Y(n)+1, so you can save some of the X scaling work as well.
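
A scalar sketch of that pipeline (my own illustration, assuming the X pass above is wrapped in a function scalex(dst, src, dstwidth, srcwidth) and row0/row1 are two dstwidth-sized scratch buffers):

uint yinc = (uint)(65536.0f * srcheight / dstheight);
uint yoff = 0;
int sy_last = -2;   // force a full fill on the first row

for (int y = 0; y < dstheight; y++) {
   int sy = yoff >> 16;
   int ry = (yoff >> 8) & 0xff;

   if (sy == sy_last + 1) {
      // moved down one source row: keep one X-scaled row, make one
      ubyte *t = row0; row0 = row1; row1 = t;
      scalex(row1, src + (sy + 1) * srcstride, dstwidth, srcwidth);
   } else if (sy != sy_last) {
      // jumped: fill both rows
      scalex(row0, src + sy * srcstride, dstwidth, srcwidth);
      scalex(row1, src + (sy + 1) * srcstride, dstwidth, srcwidth);
   }
   sy_last = sy;

   // the Y lerp itself is the trivially-SIMD part
   for (int x = 0; x < dstwidth; x++) {
      int a = row0[x], b = row1[x];
      dst[y * dststride + x] = (ubyte)(a + (((b - a) * ry) >> 8));
   }
   yoff += yinc;
}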

For this reason the 2x version is probably more useful even if it isn't the fastest. In fact it can just perform the Y scaling directly in-register and avoid the temporary buffer entirely. The overheads of calculating the same row more than once might be offset by not needing to pass intermediate results through memory.

I noticed that everything seems to take about 1ms + a bit. I wonder what the memcpy time is ... ok, just using memcpy on each row (512x512) takes over 1.2ms. Wow, that's slow memory. I will have to try this on the Mele (which I still haven't had time to play with much) and the Tegra 3 tablet when I get it back. I gotta say the machine is running like a pig - emacs (terminal mode) crawls on it over the network with constant pauses and delays - it feels worse than running it via a 56k modem on an Amiga 1200. Maybe something is up with the network (looks like it, the console seems a lot better) ... probably the old ADSL router, which is causing other issues besides (can't turn off the DHCP server, which is affecting a couple of machines).

mip-maps

So I mentioned earlier that the above scaling routines only handle scale factors >= 0.5. How does one go smaller? Either "do it properly" by implementing an "upfirdn" function (up-sample, filter, down-sample) - which is really just a 'sparse' convolution - or approximate it with a mip-map.
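
For reference, a minimal scalar model of upfirdn (my own sketch - the 8.8 tap scaling and the names are illustrative, not from the post). The trick is that the inserted zeros are never actually touched; only taps aligned with real samples contribute:

// upfirdn as a sparse convolution: conceptually insert (up-1) zeros
// between samples, convolve with taps h[], keep every down-th output
static void upfirdn(ubyte *dst, int dstlen,
                    const ubyte *src, int srclen,
                    const short *h, int nh, int up, int down)
{
   for (int i = 0; i < dstlen; i++) {
      int pos = i * down;   // position in the up-sampled space
      int acc = 0;
      for (int j = pos % up, k = pos / up; j < nh; j += up, k--)
         if (k >= 0 && k < srclen)
            acc += h[j] * src[k];
      // taps assumed 8.8 fixed point, each phase summing to 256
      dst[i] = (ubyte)(acc >> 8);
   }
}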

With the mip-map approximation you just pick the closest mipmap and then scale that down/up to fit the output. Although building a mipmap is pretty fast it is still an overhead, so it helps if you are going to use it more than once - which is the case with sliding-window object detectors, which must run at multiple scales.
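
Level selection is then just a couple of shifts (my own sketch): take the smallest level that is still at least as large as the target, so the final bi-linear step never minifies by more than 2x:

// pick the mipmap level to re-sample from
int level = 0;
while ((srcwidth >> (level + 1)) >= dstwidth)
   level++;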

Although there are better ways to build the mipmaps (i.e. using the upfirdn stuff), successive summation with a box filter works well enough for what I need. It can also be implemented rather neatly in NEON - I can perform the 1/2 and 1/4 scalings in the same routine (and there are enough registers to do 1/8th as well - but that requires 64-pixel-horizontal blocks).
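
In scalar form the box-filter reduction is just a rounded 2x2 average (my own sketch; boxhalf is a name I've made up):

// 2x2 box filter producing the 1/2 scale; run it on its own output
// for the 1/4 scale, and so on down the pyramid
static void boxhalf(ubyte *dst, const ubyte *src, int w, int h, int stride)
{
   for (int y = 0; y < h / 2; y++)
      for (int x = 0; x < w / 2; x++) {
         const ubyte *s = src + y * 2 * stride + x * 2;
         dst[y * (w / 2) + x] =
            (ubyte)((s[0] + s[1] + s[stride] + s[stride + 1] + 2) >> 2);
      }
}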

I didn't bother trying to implement it in C, but the NEON code I wrote can re-sample a 512x512 image into two scales - 256x256 and 128x128 - in under 1.3ms by processing 32x32 tiles. One then calls it successive times on the smallest result to build the rest, at least down to 8x8 (which is more than I need).
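
The successive calls amount to something like this (my sketch, reusing the hypothetical boxhalf above, with mip[] an array of pre-allocated levels):

// halve repeatedly from 512x512 down to 8x8
int w = 512, h = 512, l = 0;
while (w > 8) {
   boxhalf(mip[l + 1], mip[l], w, h, w);
   w /= 2; h /= 2; l++;
}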

I've ignored the edge cases for now ... and usually it's better just to arrange one's data so that there aren't any, i.e. by aligning the stride and allocating extra invalid rows. The only problem with this is getting data from/to external sources/destinations that may have their own alignment restrictions/policies. Obviously one wants to avoid a dumb memcpy, but usually there's some work that can be done anyway, and then the edge cases only need to be handled at these interfaces.
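
The padded allocation looks something like this (my own sketch - the numbers are illustrative, and it assumes malloc from <stdlib.h>):

// round the stride up and add a spare row above and below, so the
// SIMD loops never run off the edge of valid memory
int stride = (width + 15) & ~15;   // 16-byte aligned rows
ubyte *base = malloc((size_t)stride * (height + 2));
ubyte *img = base + stride;        // rows -1 and height are both valid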

Update: I didn't have much to do today whilst waiting for a meeting, so I mucked about with the hardware - tried to use a narcissus image, but that wouldn't boot, so I just changed to a faster SD card I bought on the way back from buying milk & beer sugars (gee, SD cards are cheap now), an ethernet cable, and a different switch - and after all that it's still slow in emacs ... But I also mucked about with the Mele and got my test harness running there. Most of the stuff is 1.5x to 2x faster than the Beagleboard XM - which is pretty nice. The mipmap code is about 4x faster!

Update 2: Late last night I poked at it again - I thought I had enough to put it all together, but then I realised I hadn't written the XY scaling. I tried adding Y interpolation to the Xx2 row scaler, but that just took twice as long to run as it ends up doing twice as many rows (stating the bleeding obvious). It still might be useful as it requires no temporary memory, but I will work on another approach using temporary row buffers which I know works well.

Update 3: Well, there's no particular reason I didn't include the NEON code ... so here it is. This is just the 1x version, which isn't particularly quick due to data stalls and conflicts - but it can obviously be extended by unrolling the loop and interleaving the operations.

        @
        @ Scale row in X.
        @

        @ r0 = src address
        @ r1 = dst address
        @ r2 = width of output
        @ r3 = scale factor, 16.16 fixed point

        .global scalex_neon
scalex_neon:
        stmdb sp!, {r4-r7, lr}
        
        @ copy over scale
        vdup.32         q2,r3

        @ xinc = { scale * 0, scale * 1, scale * 2, ... scale * 7 }
        vldr    d0,index
        vmovl.u8        q3,d0
        vmovl.u16       q0,d6
        vmovl.u16       q1,d7

        @ xoff [0 .. 3]
        vmul.u32        q0,q2,q0
        @ xoff [4 .. 7]
        vmul.u32        q1,q2,q1
        @ xinc = 8 * inc
        vshl.u32        q2,#3

        vmov.u8         d28,#1
        
        @ xoff(s) = 0
        mov     r4,#0

        @ Load this, next
1:      add     r5,r0,r4,lsr #16
        vld1.u8 { d16,d17 }, [r5]

        @ sx = xoff >> 16
        vshrn.u32       d18,q0,#16
        vshrn.u32       d19,q1,#16
        vdup.u16        d20,d18[0]

        @ mx = sx - sx[0]
        vsub.u16        d18,d20
        vsub.u16        d19,d20

        @ convert to 8 bits - this is lookup table for all 8 entries
        vmovn.u16       d18,q9

        @ a = src[sx]
        vtbl.8          d20,{d16,d17},d18
        @ b = src[sx+1]
        vadd.u8         d18,d28
        vtbl.8          d22,{d16,d17},d18

        @ rx = (xoff & 0xffff) >> 8
        @ (only need 8 bits for lerp)
        vmovn.u32       d24,q0
        vmovn.u32       d25,q1
        vshr.u16        q12,#8
        
        @ convert to short for arithmetic
        vmovl.u8        q10,d20
        vmovl.u8        q11,d22

        @ (b - a) * rx
        vsub.u16        q13,q11,q10
        vmul.u16        q13,q12

        @ (((b-a)*rx) >> 8) + a
        vshr.s16        q13,#8
        vadd.u16        q13,q10
        
        @ convert back to bytes
        vmovn.u16       d26,q13

        @ done
        vst1.u8 d26,[r1]!

        @ xoff += xinc*8
        add     r4,r3,lsl #3
        @ xoff += xinc
        vadd.u32        q0,q2
        vadd.u32        q1,q2

        @ for whole row
        subs    r2,#8
        bhi     1b
        
        ldmfd sp!, {r4-r7, pc}

        @ per-byte pixel offsets/multiplicands
index:  .byte   0,1,2,3,4,5,6,7
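
For completeness, a C prototype matching the register assignments in the comments above (the prototype itself is my addition; note the loop consumes 8 pixels per iteration, so width should be a positive multiple of 8):

// r0 = src, r1 = dst, r2 = output width, r3 = 16.16 scale factor
extern void scalex_neon(const ubyte *src, ubyte *dst,
                        unsigned int width, unsigned int xinc);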
