Author: | Wojciech Muła |
---|---|
Added on: | 2010-05-01 |
In-place reversing is performed by following algorithm:
// n - table size int i = 0; // index1 int j = n - 1; // index2 int c = n/2; // counter int x, y; while (c--) x = table[i]; y = table[j]; table[i] = y; table[j] = x; i++; j--; c--; }
We can use some CPU-specific instruction to speedup the algorithm.
Program can load a part of table into registers, then swap bytes in register using one or more instructions and then store register's content at new position. If the size of a table is not multiply of register size, then few bytes in the middle of a table have to be swapped using default algorithm.
Modified algorithm outline:
// n - table size // k - register size - k=4 for BSWAP and k=16 for PSHUFB int i = 0; // index1 int j = n - k - 1; // index2 !!! int c = n/(2*k); // counter !!! register char x[k], y[k]; // hardware registers while (c--) x = table[i]; // load k bytes y = table[j]; // ... hardware_swap(x); // reverse these bytes hardware_swap(y); // ... table[i] = y; // store k bytes table[j] = x; // .. i += k; j -= k; c--; } if (n % (2*k) != 0) swap rest byte-by-byte
Function hardware_swap(x) is equivalent to single instruction: BSWAP reg or PSHUFB xmm, const; SSE2 code requires more instructions:
movaps %xmm0, %xmm1 psrlw $8, %xmm0 psllw $8, %xmm1 por %xmm1, %xmm0 ; swap adjacent bytes pshufd $0x4e, %xmm0, %xmm0 ; swap order of dwords pshuflw $0x1b, %xmm0, %xmm0 ; swap order of words in lower pshufhw $0x1b, %xmm0, %xmm0 ; .. and higher half of register
Test program contains several procedures that implement the modified algorithm:
Important note: SSE implementations use safe MOVUPS instruction for transfers data from/to memory. This instruction is much slower than MOVAPS even if accessing aligned data!
Two scenarios of test were considered:
Procedures were tested on following computers:
Quick results discussion: