Speeding up letter case conversion

Author: Wojciech Muła 2016-01-06

Introduction

The aim of this text is to show how simple procedure which change case of letters could be rewritten to SWAR version gaining significant boost. In the article method "to lower case" is explored, however the opposite conversion is very easy to derive.

To be honest I have no idea if changing latter case is crucial task in any problem. My knowledge and experiences suggest that the answer is "no", but who knows.

Base version

The basic version of the procedure reads one character (a byte) and then classifies it. If the character is an upper case ASCII letter then fast conversion could be used; if the character is an extended ASCII then the system procedure tolower is called.

Note: lower and uppercase letters in ASCII encoding differs only in single, 5th bit.

```void to_lower_inplace(char* s, size_t n) {

for (size_t j=0; j < n; j++) {

if (s[j] >= 'A' && s[j] <= 'Z') {

s[j] ^= (1 << 5);

} else if (static_cast<unsigned char>(s[j]) >= '\x7f') {

s[j] = tolower(s[j]);
}
}
}
```

Improvement — SWAR approach

In SWAR approach a fast-path could be used if all character within a chunk are not from extended ASCII set. Then all character are classified if are uppercase or not, resulting in a mask having set certain bits. The last step is to xor the mask with the input chunk to selective flip the bit in upper-case letters.

Algorithm

```bool     is_ascii(const uint64_t chars);
uint64_t to_lower_ascii(uint64_t chars);

uint64_t to_lower_inplace_swar(char* input, size_t size) {

char* s = input;

{
const size_t k = n / 8;
for (size_t i=0; i < k; i++, s+=8) {

uint64_t* chunk = reinterpret_cast<uint64_t*>(s);

if (is_ascii(*chunk)) {

*chunk = to_lower_ascii(*chunk); // [***]
} else {
to_lower_inplace(s, 8);
}
}
}
{
const size_t k = n % 8;

if (k) {
to_lower_inplace(s, k);
}
}
}
```

Trivia

The first step (is_ascii) and the last step (to_lower_ascii) of the algorithm are simple. Is ASCII tests only if highest bits and to lower swaps 5th bit:

```#define packed_byte(b) ((uint64_t(b) & 0xff) * 0x0101010101010101u)

bool is_ascii(uint64_t chars) {

return (chars & packed_byte(0x80)) == 0;
}

uint64_t to_lower_ascii(uint64_t chars) {

// MSB in mask could be set
const uint64_t mask = to_lower_ascii_mask(chars) >> 2; // change case (toggle 5th bit)
const uint64_t result =  chars ^ mask;

return result;
}
```

Vectorization of comparison s[j] >= 'A' && s[j] <= 'Z' is the most interesting part. The key is a vector compare for relation "greater than", i.e. x > contant; x = 0..127. The expression have to be rewritten:

```s[j] >= 'A' && !(s[j] >= 'Z' - 1)
```

When we add a byte of value 128 - constant, then for x greater than constant result would be 128 + something; otherwise the result would be less than 128. In other words, the result of comparison is saved in the highest bit of a byte; lower bits contain garbage that have to be cleared later.

A comparison is expressed by single addition. Thus the full, rewritten expression requires: 2 additions, 1 negation, and 2 bit-ands:

```A = s[j..j+8] + packed_byte(128 - 'A');
Z = s[j..j+8] + packed_byte(128 - 'Z' - 1);

result = (A & ~Z) & packed_byte(0x80);
```

An observation: it's not possible that x is less than 'A' (A is false) and at the same time greater than 'Z' (Z is true). Thanks to that the last expression could be simplified to:

```result = (A ^ Z) & packed_byte(0x80);
```

Final version requires:

• 1 bit-xor,
• 1 bit-and.

Experiments

The sample program is available at github.

The test program loads given file to a 100 MiB memory region and then run the selected procedure: a scalar one, or improved. Two files were examined:

• 386INTEL.txt — plain text; English documentation from Intel for their 80386 CPU;
• sutra_surangama.tex — TeX file, encoding is ISO-8859-2; Polish translation of Sutra Surangama, a Buddhist text.
file size [B] scalar time [s] SWAR time [s] speedup
386INTEL.txt 878'279 0.1801 0.0573 3.13
sutra_surangama.tex 665'859 0.2179 0.1720 1.27

Summary

• Speedup strongly depends on processed text. For English texts speedup is impressive, as SWAR path is used in most cases. However, for Polish text containing many national characters, the fast path isn't used very often.
• In line marked with ***, a chunk is always written back, even if is not modified. Experiments have shown that additional branch slows things down.
• Since the algorithm require branching in each iteration, SSE wouldn't help in my opinion.