Author: | Wojciech Muła |
---|---|
Added on: | 2016-01-06 |
Contents
The aim of this text is to show how simple procedure which change case of letters could be rewritten to SWAR version gaining significant boost. In the article method "to lower case" is explored, however the opposite conversion is very easy to derive.
To be honest I have no idea if changing latter case is crucial task in any problem. My knowledge and experiences suggest that the answer is "no", but who knows.
The basic version of the procedure reads one character (a byte) and then classifies it. If the character is an upper case ASCII letter then fast conversion could be used; if the character is an extended ASCII then the system procedure tolower is called.
Note: lower and uppercase letters in ASCII encoding differs only in single, 5th bit.
void to_lower_inplace(char* s, size_t n) { for (size_t j=0; j < n; j++) { if (s[j] >= 'A' && s[j] <= 'Z') { s[j] ^= (1 << 5); } else if (static_cast<unsigned char>(s[j]) >= '\x7f') { s[j] = tolower(s[j]); } } }
In SWAR approach a fast-path could be used if all character within a chunk are not from extended ASCII set. Then all character are classified if are uppercase or not, resulting in a mask having set certain bits. The last step is to xor the mask with the input chunk to selective flip the bit in upper-case letters.
bool is_ascii(const uint64_t chars); uint64_t to_lower_ascii_mask(const uint64_t chars); uint64_t to_lower_ascii(uint64_t chars); uint64_t to_lower_inplace_swar(char* input, size_t size) { char* s = input; { const size_t k = n / 8; for (size_t i=0; i < k; i++, s+=8) { uint64_t* chunk = reinterpret_cast<uint64_t*>(s); if (is_ascii(*chunk)) { *chunk = to_lower_ascii(*chunk); // [***] } else { to_lower_inplace(s, 8); } } } { const size_t k = n % 8; if (k) { to_lower_inplace(s, k); } } }
The first step (is_ascii) and the last step (to_lower_ascii) of the algorithm are simple. Is ASCII tests only if highest bits and to lower swaps 5th bit:
#define packed_byte(b) ((uint64_t(b) & 0xff) * 0x0101010101010101u) bool is_ascii(uint64_t chars) { return (chars & packed_byte(0x80)) == 0; } uint64_t to_lower_ascii(uint64_t chars) { // MSB in mask could be set const uint64_t mask = to_lower_ascii_mask(chars) >> 2; // change case (toggle 5th bit) const uint64_t result = chars ^ mask; return result; }
Vectorization of comparison s[j] >= 'A' && s[j] <= 'Z' is the most interesting part. The key is a vector compare for relation "greater than", i.e. x > contant; x = 0..127. The expression have to be rewritten:
s[j] >= 'A' && !(s[j] >= 'Z' - 1)
When we add a byte of value 128 - constant, then for x greater than constant result would be 128 + something; otherwise the result would be less than 128. In other words, the result of comparison is saved in the highest bit of a byte; lower bits contain garbage that have to be cleared later.
A comparison is expressed by single addition. Thus the full, rewritten expression requires: 2 additions, 1 negation, and 2 bit-ands:
A = s[j..j+8] + packed_byte(128 - 'A'); Z = s[j..j+8] + packed_byte(128 - 'Z' - 1); result = (A & ~Z) & packed_byte(0x80);
An observation: it's not possible that x is less than 'A' (A is false) and at the same time greater than 'Z' (Z is true). Thanks to that the last expression could be simplified to:
result = (A ^ Z) & packed_byte(0x80);
Final version requires:
The sample program is available at github.
The test program loads given file to a 100 MiB memory region and then run the selected procedure: a scalar one, or improved. Two files were examined:
file | size [B] | scalar time [s] | SWAR time [s] | speedup |
---|---|---|---|---|
386INTEL.txt | 878'279 | 0.1801 | 0.0573 | 3.13 |
sutra_surangama.tex | 665'859 | 0.2179 | 0.1720 | 1.27 |