| Author: | Wojciech Muła |
|---|---|
| Added on: | 2010-05-01 |
In-place reversing is performed by following algorithm:
// n - table size
int i = 0; // index1
int j = n - 1; // index2
int c = n/2; // counter
int x, y;
while (c--)
x = table[i];
y = table[j];
table[i] = y;
table[j] = x;
i++;
j--;
c--;
}
We can use some CPU-specific instruction to speedup the algorithm.
Program can load a part of table into registers, then swap bytes in register using one or more instructions and then store register's content at new position. If the size of a table is not multiply of register size, then few bytes in the middle of a table have to be swapped using default algorithm.
Modified algorithm outline:
// n - table size
// k - register size - k=4 for BSWAP and k=16 for PSHUFB
int i = 0; // index1
int j = n - k - 1; // index2 !!!
int c = n/(2*k); // counter !!!
register char x[k], y[k]; // hardware registers
while (c--)
x = table[i]; // load k bytes
y = table[j]; // ...
hardware_swap(x); // reverse these bytes
hardware_swap(y); // ...
table[i] = y; // store k bytes
table[j] = x; // ..
i += k;
j -= k;
c--;
}
if (n % (2*k) != 0)
swap rest byte-by-byte
Function hardware_swap(x) is equivalent to single instruction: BSWAP reg or PSHUFB xmm, const; SSE2 code requires more instructions:
movaps %xmm0, %xmm1 psrlw $8, %xmm0 psllw $8, %xmm1 por %xmm1, %xmm0 ; swap adjacent bytes pshufd $0x4e, %xmm0, %xmm0 ; swap order of dwords pshuflw $0x1b, %xmm0, %xmm0 ; swap order of words in lower pshufhw $0x1b, %xmm0, %xmm0 ; .. and higher half of register
Test program contains several procedures that implement the modified algorithm:
Important note: SSE implementations use safe MOVUPS instruction for transfers data from/to memory. This instruction is much slower than MOVAPS even if accessing aligned data!
Two scenarios of test were considered:
Procedures were tested on following computers:
Quick results discussion: