Parsing decimal numbers — part 1: SWAR

Author:	Wojciech Muła
Added on:	2014-10-12

Contents

Introduction
SWAR
Sample code
See also

Introduction

Naive parsing of unsigned decimal numbers can be coded as:

uint32_t parse(char* s) {

    uint32_t result = 0;

    while (*s) {
        const uint8_t digit = *s++;

        result = result * 10 + (digit - '0');
    }

    return result;
}

The procedure parse does not check validity of a string nor check for overflow — these problems won't be discussed in this text.

Processing single letter require just 3 operations:

subtract,
addition,
multiplication by 10.

On x86 multiplication by 10 is cheap; multiplication x * 10 is equivalent to x << 3 + x << 1. Shift by 8 is coded with LEA (it's executed by addressing unit, not ALU), and shift by 1 is simple addition or another LEA.

SWAR

In SWAR version we'll process 4 letters at once. For shorter strings naive fallback have to be used or proper mask applied on dword.

The common phase of all SWAR versions is converting ASCII to digit values, this is simple subtraction.

input — 4 ASCII letters:

x  = packed_byte(| '1' | '2' | '7' | '0' |)

subtract vector of '0':

x' = packed_byte(| '1' | '2' | '7' | '0' |)
   - packed_byte(| '0' | '0' | '0' | '0' |)
   = packed_byte(|  1  |  2  |  7  |  0  |)

Improvement 1

Suppose we have just two digits x and y stored in separate bytes:

w  = packed_byte(|  0  |  0  |  x  |  y  |)

Value of w is 256*x + y, but we want to get 10*y + x. By multiplication w by 2561 we get this value in the higher byte. Magic? No, 2561 = 256 * 10 + 1, and:

1 — keeps x on the original position;
2560 — moves y multiplied by 10 on the second byte;
adds them together.

Proof: (256 ⋅ x + y) ⋅ (1 + 10256) = (256 ⋅ x + y) + (655360 ⋅ x + 256 ⋅ 10 ⋅ y). Now reject components less than 256 and greater than 65535: (x ⋅ 256) + (y ⋅ 2560) = 256 ⋅ (10 ⋅ y + x).

Full expression:

v = ((w * 2561) >> 8) & 0xff

The same trick could be used to join 2-digit values ab & cd stored on 1st and 3rd byte:

Z  = packed_byte(|   0  |  ab  |   0  |  cd  |)

The "magic" multiplier is 6553601, result is 100*cd + ab saved in the higher word.

Algorithm

input — 4 digits:

t1  = packed_byte(|      d       |      c       |      b       |      a       |)

form two words suitable to make 2-digits values:

t2a = packed_byte(|      0       |      0       |      b       |      a       |)
t2b = packed_byte(|      0       |      0       |      d       |      c       |)

form 2-digits values by multiplication t2{a,b} by 2561:

t3a = packed_byte(|      ?       |      ?       |  10*a + b    |      ?       |)
t3b = packed_byte(|      ?       |      ?       |  10*c + d    |      ?       |)

prepare dword from t3{a,b} to get 4-digits value:

t4  = packed_byte(|      0       |  10*a + b    |      0       |  10*c + d    |)

form 4-digit value by multiplication t3 by 6553601:

t5  = packed_word(| 1000*c + 100* d + 10*a + b  |              ?              |)

extract higher word from t5.

Sample implementation

#define packed32(b) (uint8_t(b) * 0x01010101)

uint32_t parse1(char* s) {

    const uint32_t input = *reinterpret_cast<uint32_t*>(s);

    const uint32_t t1 = input - packed32('0');

    const uint32_t t2a = t & 0xffff;
    const uint32_t t2b = t >> 16;

    const uint32_t t3a = ((t2a * (2561)) >> 8) & 0xff;
    const uint32_t t3b = ((t2b * (2561)) >> 8) & 0xff;

    const uint32_t t4  = t3a | (t3b << 16);

    return (word * (6553601) >> 16);
}

Improvement 2

Another approach to making 2-digit values:

input — 4 digits:

t1 = |      d      |      c      |      b      |      a      |

mask 1st and 3rd digit:

t2 = |      d      |      0      |      b      |      0      |

move one digit left and multiply by 10:
```
t3 = |      0      |      d*10   |      0      |      b*10   |
```
This can be done without multiplication. We calculate (x * 10) >> 8, multiplication is replaced by shifts ((x << 3) + (x << 1)) >> 8, after simplify shifts: (x >> (8-3)) + (x >> (8-1)).

add t1 and t3:

t4 = |      d      |  c + d*10   |      b      |  a + b*10   |

mask 1st & 3rd digits again:

t5 = |      0      |  c + d*10   |      0      |  a + b*10   |

convert t5 to 4-digits value.

Sample implementation

uint32_t parse2(char* s) {

    const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s));

    const uint32_t t1 = input - packed32('0');
    const uint32_t t2  = t1 & 0xff00ff00;
    const uint32_t t3  = (t2 >> (8-1)) + (t2 >> (8-3));
    const uint32_t t4  = t1 + t3;
    const uint32_t t5  = t4 & 0x00ff00ff;

    // reversed order, so multiplier is diffrent
    return (t5 * (100 + 65536)) >> 16;
}

Improvement 3

Another approach to making 2-digit values:

input — 4 digits:

t1 = |      d      |      c      |      b      |      a      |

multiply by 10:

t2 = |      d*10   |      c*10   |      b*10   |      a*10   |

move digits one position right (shift dword right by 8 bits):

t3 = |      0      |      d*10   |      c*10   |      b*10   |

add t2 and t3:

t4 = |      d*10   |  c + d*10   |  b + c*10   |  a + b*10   |

mask unneeded bytes 1 and 3:

t5 = |      0      |  c + d*10   |      0      |  a + b*10   |

Now word t5 could be converted as in the version 2.

Sample implementation

uint32_t parse3(char* s) {

    const uint32_t input = bswap(*reinterpret_cast<uint32_t*>(s));

    const uint32_t t1 = input - packed32('0');
    const uint32_t t2 = (t1 * 10) >> 8;

    const uint32_t t3 = t1 + t2;
    const uint32_t t4 = t3 & 0x00ff00ff;

    return (t4 * (100 + 65536)) >> 16;
}

Comparison

Number of operations required to convert 4 ASCII digits.

operation	naive	version 1	version 2	version 3
bswap			1	1
bit-or		1
bit-and		3	2	1
shift		5	3	2
add	4		2	1
sub	1	1	1	1
mult		3	1	1
mult-by-10	4			1
total	9	13	10	6

Sample code

Sample code is available on gitub.

Parsing decimal numbers — part 1: SWAR

Introduction

SWAR

Improvement 1

Algorithm

Sample implementation

Improvement 2

Sample implementation

Improvement 3

Sample implementation

Comparison

Sample code

See also