Parsing series of integers with SIMD

Author:Wojciech Muła
Added on:2018-04-19
Updated on:2020-07-05 (fixed a typo in conversion 2-digit numbers, thank to Alexander Vingardt), 2020-03-30 (added method decimal_digits_mask_version2 by Carl Witty), 2020-03-29 (fixed a typo, thanks to Carl Witty), 2018-04-30 (info about overflows), 2018-04-25 (added speed-up comparison table)

1   Introduction

While conversion from a string into an integer value is feasible with SIMD instructions, this application is unpractical. For typical cases, when a single value is parsed, scalar procedures — like the standard atoi or strtol — are faster than any fancy SSE code.

However, SIMD procedures can be really fast and convert in parallel several numbers. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits. Recently, I updated article about SSE parsing with the benchmark results. The speed-ups are really impressive, for example the SSSE3 parser is 7 to 9 times faster than a naive, scalar code.

The obvious question is how these powerful SIMD procedures can be used to convert real data? By real I mean possibly broken inputs that contain series of numbers of different length separated with characters from a predefined set.

In this text I try to answer that question; major contributions of this article are:

The article starts with unsigned conversion, because it is easier than a signed one. The signed conversion shares the core idea, it just adds some extra steps.

The text is accompanied with BSD-licensed software, that includes fully functional implementations alongside the programs which validate and benchmark the procedures.

2   Parser specification

The input is defined as follows:

For the input string "123; -52, +432424 -999; 1234568, +879", and three separators ',', ';' and ' ', an algorithm is expected to extract six integers: [123, -52, 432424, -999, 1234568, 879]. The order of converted numbers must be exactly the same as in the input string.

A simplified parser is also tested. This parser simply considers all non-digit and non-sign characters as delimiters.

3   SSE conversion capabilities update

SSE instructions work on 16-byte vectors. When the input numbers have the same count of digits and the digits of each number are placed — depending of the digits count — on bytes, words, double words or quad words of an SSE vector, then an SSE procedure converts them in parallel. Below are required layouts of vectors; the most significant digit is at index 0 of each subword.

The conversion is done gradually. We start with an input vector filled with ASCII digits '0'..'9'; each byte of a vector contains single character. For string "1234567890123456" we have:

in = ['1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'|'1'|'2'|'3'|'4'|'5'|'6']

A single subtract _mm_subs_epu8 (psubusb) converts from ASCII into numbers 0..9:

t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 1 | 2 | 3 | 4 | 5 | 6 ]

At this point we converted 16 one-digit numbers. Now, by merging adjacent numbers from t0, we may obtain two-digit numbers. SSSE3 introduced the instruction _mm_maddubs_epi16 (pmaddubsw) which multiplies vectors of bytes and adds intermediate 16-bit values producing a vector of 16-bit values. Thus, when we multiply t0 by a vector filled with weights (1, 10) we get eight two-digit numbers, each saved on a 16-bit field:

t1 = [   12  |  34   |  56   |  78   |  90   |  12   |  34   |  56   ]

Similarly, these numbers can be merged into four-digit numbers. SSE has the instruction _mm_madd_epi16 (pmaddwd) which performs similar operation as _mm_maddubs_epi16, but works on 16-bit values and yields vector of 32-bit numbers. The other vector is filled with weights (1, 100) and we get four four-digit numbers, each saved on a 32-bit field:

t2 = [     1234      |     5678      |     9012      |     3456      ]

Four-digit numbers fits into 16-bit values, we might convert the above result into a 16-bit vector using _mm_packus_epi32 (packusdw) and again apply the _mm_madd_epi16 instruction, this time with weights (1, 10000). Then we get two eight-digit numbers:

t3 = [           12345678            |           90123456            ]

To summarize: conversion from a 16-digit string into two unsigned eight-digit numbers requires only one subtract, three multiply-add instructions and one pack operation.

If we need narrower numbers we simply stop the above procedure at an earlier step; for example, for four-digit numbers the result is t2. Since the intermediate vectors have different element size (8-, 16-, 32-bit) a cast to other vector type might be needed. Widening can be done with the instructions pmovsx (intrinsics _mm_cvtepi16_epi32, _mm_cvtepi8_epi32, etc.); narrowing with the mentioned pack instruction which is also available in different variants.

The same conversion using a naive scalar procedure would require 16 subtracts, 14 multiplications and 14 additions. More advanced SWAR procedures would require fewer multiplications, but still not as few as the SIMD procedure.

3.1   Range checking new

The maximum possible value from SSE conversion procedure is ± 99999999. This value is smaller than the maximum that can hold a 32-bit number (231 − 1 = 2147483647) and also absolute value of the minimum (231 = 2147483648). It means that if an output collection uses at least 32-bit numbers, then SSE procedures never cause overflow (signed nor unsigned).

Obviously, for 16-bit outputs, overflow is possible. For such case an extra step would be required, but I do not discuss it in this text.

3.2   Sample implementation

Below is a sample SSE implementation which shows all described steps.

// 1. convert from ASCII '0' .. '9' to numbers 0 .. 9
const __m128i ascii0 = _mm_set1_epi8('0');
const __m128i t0 = _mm_subs_epu8(input, ascii0);

// 2. convert to 2-digit numbers
const __m128i mul_1_10 = _mm_setr_epi8(10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1);
const __m128i t1 = _mm_maddubs_epi16(t0, mul_1_10);

// 3. convert to 4-digit numbers
const __m128i mul_1_100   = _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1);
const __m128i t2 = _mm_madd_epi16(t1, mul_1_100);

// 4a. convert form 32-bit into 16-bit element vector
const __m128i t3 = _mm_packus_epi32(t2, t2);

// 4. convert to 8-digit numbers
const __m128i mul_1_10000 = _mm_setr_epi16(10000, 1, 10000, 1, 10000, 1, 10000, 1);
const __m128i t4 = _mm_madd_epi16(t3, mul_1_10000);

4   Parsing and conversions of unsigned numbers

4.1   Algorithm overview

The conversion algorithm works in a loop, in each iteration exactly one input vector is loaded from memory and processed.

  • The vector is validated.
  • Depending on the layout of the digits, the input is normalized to the format required by the SIMD procedures. At this step we choose: how many numbers from the input are converted and which SIMD procedure is suitable for conversion.
  • After conversion, the input pointer is advanced and the loop continues.

4.2   Normalizing input

Since SIMD procedures impose specific layout of digits within a vector, an arbitrary input has to be properly aligned. In order to do this, we need to identify spans of digits in the input and move each span onto certain subarrays of a vector. Let's assume at this point that the input contains either digits or separators (denoted with _).

For instance, the vector with one-digit numbers 1___2_3__4__5_6_ must be transformed into 123456__________. Then an SSE procedure will convert in parallel all the numbers into the array [1, 2, 3, 4, 5, 6]. Likewise, the vector with two-digit numbers _12__34___56_78_ must be transformed into 12345678________ and then the result will be [12, 34, 56, 78].

Let's consider a more complicated case, when a string has numbers with different count of digits. There are a few ways to convert the input _1_2_34_567_89__:

  • If we choose conversion of one-digit numbers, then just the two first spans can be converted — because we need to keep the order of numbers from the input. After normalization the input into 12______________ just two values [1, 2] will be produced. The input's tail _34_567_89__ remain untouched.
  • If we choose conversion of two-digit numbers, then the three first spans can be converted. Shorter numbers are completed with zeros, then normalized input is 010234__________. The result is [1, 2, 34]; this time a bit shorter input's tail _567_89__ remain untouched.
  • Finally, if we choose conversion of four-digit numbers, then the four first spans can be converted. Again, shorter numbers are completed with zeros, and normalized input is 0001000200340567. The result is [1, 2, 34, 567], but still the chunk's tail, i.e. _89__, is unprocessed.

We can see that in order to convert given span combination we need to know:

  1. How to shuffle bytes in the input vector?
  2. Which SIMD procedure can be used then?
  3. How many numbers converted by the SIMD procedure must be stored?

Obtaining this information seems to be quite complicated, especially when we look at the last example. Fortunately, all parameters can be precalculated. A span combination can be saved as a bit-pattern, where ones represent digits. For example, from vector _1_2_34_567_89__ we get the span pattern 0b0101011011101100 = 0x56ec. A span pattern is used to fetch a record from the precalculated array. The record contains following fields:

  • shuffle_digits — an array of 16 bytes, which is the argument for the instruction _mm_shuffle_epi8 (pshufb); the instruction moves bytes at certain positions;
  • conversion_routine — an enumeration that selects an SSE conversion procedure; for instance, it tells that shuffled input is an array of two-digit numbers;
  • element_count — the number of elements from the SSE conversion procedure that must be stored in the output collection.

The solution with a precalculated array is suitable only for SSE, as span patterns have 16 bits. In cases of AVX2 and AVX512, where vectors are wider, such a table would be simply too large, respectively 232 or 264 entries. Additionally, the AVX2 version of pshufb instruction works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to shuffle all inputs.

But AVX2 and AVX512 instructions still might be used in some parts of algorithms, especially in input validation.

4.3   Precalculating data

To build the whole precalculated table all 65536 span patterns have to be processed. Processing a pattern consists following steps:

  1. Determine digit spans. For pattern _ddd__d__ddd_dd_ we have four spans [(1,3), (6,6), (9,11), (14,15)]. But for pattern ____dddd___ddddd there is only one span [(4,7)]. The span at the vector's end might be continued in the next chunk, and we don't know how to convert it. However, we may assume that a span which starts at the vector's beginning is a new number and thus it is included into the span's list.
  2. Find the best conversion scheme for the spans. We systematically check which SSE conversion is possible, i.e. one-/two-/four-/eight-digit and how many spans each conversion can process — the procedure which processes the most items is selected. The output from this step are values for the following record fields:
    • element_size — 1, 2, 4, 8 or 0 if SSE conversion is not possible;
    • element_count — the numbers of converted spans;
    • total_skip — how many input bytes are processed; for pattern ____dddd________ we know that the whole input is processed (total_skip=16); for pattern ____d____ddddd_d we process only the first span at index 5, but we know that four non-digit characters after the span can be also skipped, thus total_skip is 9.
  3. Construct the shuffle pattern. Having the element size and spans positions the vector shuffle_digits is built. The characters from a span are mapped on the appropriate subarrays in the vector. Also zero completion is done in this step, as the instruction pshufb either copy given byte from the input vector or put zero if the index has got the most significant bit set.

4.3.1   Example

Let's consider span pattern _ddd__d__ddd_dd_. It can be converted using four-digit conversion, so the layout of normalized vector should be:

[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ]

There are four numbers a, b, c and d in the vector; note that the bytes at 0th index hold the most significant digits.

The first number in pattern has three digits at indices 1, 2 and 3. It means that a1 = 1, a2 = 2 and a3 = 3; value of a0 should be zeroed, so a0 = 0x80.

The second number has just one digit at index 6. Thus b3 = 6 and, since the rest digits have to be zero, b0 = b1 = b2 = 0x80.

The third number has three digits at indices 9, 10, 11. Thus c1 = 9, c2 = 10, c3 = 11, and c0 = 0x80.

The last, forth number, has got two digits at indices 14 and 15. Thus d2 = 14, d3 = 15 and the rest have to be zero, so d0 = d1 = 0x80.

The final content of the shuffle_digits vector is:

[ 80 |  1 |  2 |  3 | 80 | 80 | 80 |  6 | 80 |  9 | 10 | 11 | 80 | 80 | 14 | 15 ]

4.4   SSE algorithm outline

Below are major steps of the algorithm's loop, with snippets from the actual implementation.

  1. Load 16 characters into the input vector.
const __m128i input = _mm_loadu_si128((const __m128i*)data);
  1. Check for invalid characters (will be discussed later).
  1. Build the span pattern:
// t0 = input[i] in ['0'..'9] ? 0xff : 0x00
const __m128i  t0 = decimal_digits_mask(input);
const uint16_t span_mask = _mm_movemask_epi8(t0);
  1. Fetch the block info with the precalculate parameters.
const BlockInfo& bi = blocks[span_mask];
  1. From the block info get the shuffle pattern for _mm_shuffle_epi8.
const __m128i shuffle_digits = _mm_loadu_si128((const __m128i*)b.shuffle_digits);
const __m128i shuffled = _mm_shuffle_epi8(input, shuffle_digits);
  1. The shuffled vector is then issued to the proper conversion routine. The block info structure keeps the conversion kind (conversion_routine) and also the number of items to get from the SSE result (element_count). It is not always possible to use any SSE conversion procedure, thus a scalar code is needed to handle odd cases.
if (b.conversion_routine == Conversion::SSE1Digit) {

    convert_1digit(shuffled, b.element_count, output);

} else if (b.conversion_routine == Conversion::SSE2Digits) {

    convert_2digits(shuffled, b.element_count, output);

} else if (b.conversion_routine == Conversion::SSE4Digits) {

    convert_4digits(shuffled, b.element_count, output);

} else if (b.conversion_routine == Conversion::SSE8Digits) {

    convert_8digits(shuffled, b.element_count, output);

} else {

    // case unsupported, a scalar code is called here
}
  1. Advance the input pointer. The block info structure keeps how many bytes were actually covered.
data += b.total_skip;

4.5   Detecting invalid inputs

In case of unsigned inputs we need to classify input characters into three categories:

  • digits, i.e. chars '0' .. '9';
  • user-defined set of separators;
  • invalid numbers.

In practise we need two masks: one with positions of digits, another with positions of separators. If the or'ed mask has some zero elements, it means that there are invalid characters. Below is a schema:

const __m128i bytemask_digit = decimal_digits_mask(input);
const __m128i bytemask_sep   = matcher.get_mask(input);
const __m128i bytemask_valid = _mm_or_si128(bytemask_digit, bytemask_sep);

const uint16_t valid_mask = _mm_movemask_epi8(bytemask_valid);

if (valid_mask != 0xffff) {
    throw std::logic_error("wrong input");
}

4.5.1   Detecting digits update

The easiest algorithm implements expression c >= '0' and c <= '9':

__m128i decimal_digits_mask(const __m128i input) {
    const __m128i ascii0 = _mm_set1_epi8('0');
    const __m128i after_ascii9 = _mm_set1_epi8('9' + 1);

    const __m128i t0 = _mm_cmplt_epi8(input, ascii0); // t1 = (x < '0')
    const __m128i t1 = _mm_cmplt_epi8(input, after_ascii9); // t0 = (x <= '9')

    return _mm_andnot_si128(t0, t1); // x <= '9' and x >= '0'
}

Following C-expression requires only one subtraction and one comparison:

bool is_digit(char c) {

    const uint8_t x = c;
    return (x - '0') <= 9;
}

However, it requires unsigned comparison, which is not available in both SSE and AVX2. Only AVX512BW defines such instruction _mm512_cmple_epu8_mask (vcmpgtub).

uint64_t decimal_digits_mask(const __m128i input) {

    const __m512i ascii0 = _mm512_set1_epi8('0');
    const __m512i ascii9 = _mm512_set1_epi8('9');

    const __m512i t0 = _mm512_sub_epi8(input, ascii0);

    return __mm256_cmple_epu8_mask(t0, ascii9);
}

Fortunatelly, it's possible to use similar trick for signed bytes, we need just different constants. First, we subtract from the input byte value (ord('0') + 128) & 0xff = 176 & 0xff = 80. This moves codes of '0'..'9' to range -128..-119. Then it's sufficient to compare subtracted value with -118.

__m128i decimal_digits_mask_version2(const __m128i input) {

    const __m512i t0 = _mm_sub_epi8(input, _mm_set1_epi8(80));

    return _mm_cmplt_epi8(t0, _mm_set1_epi8(-118));
}

4.5.2   Detecting characters from set

4.5.2.1   SSE & AVX2

The most generic SSE implementation works in time proportional to the size of a separators set. Each character from the set is broadcasted into a vector and then compared with the input vector. The resulting byte mask is ored with the previous result.

__m128i get_mask(const __m128i& input, const __m128i& initial) const {
    __m128i result = initial;
    for (size_t i=0; i < n + 1; i++) {

        const __m128i mask = _mm_cmpeq_epi8(letters[i], input);
        result = _mm_or_si128(result, mask);
    }

    return result;
}

If the separators set is known at compilation time, then a compiler may unroll the loop.

4.5.2.2   SSE4.2

The SNTI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used in same cases. The instruction gets two arguments. One is an input vector. Another one is interpreted either as a set of individual characters (up to 16) or characters ranges (up to 8). The result is a mask for the input characters matching the set (or ranges). In the example below the set variant was used.

__m128i get_mask(const __m128i& input, const __m128i& initial) {

    const uint8_t mode = _SIDD_UBYTE_OPS
                       | _SIDD_CMP_EQUAL_ANY
                       | _SIDD_UNIT_MASK;

    return _mm_or_si128(initial, _mm_cmpestrm(set, set_size, input, 16, mode));
}

4.6   Caveats

  1. It is unlikely, and sometimes impossible, to process all digits from the input vector. In a single iteration one to 16 bytes are converted. The table below shows details — an important conclusion is that almost 95% of patterns process at least half of an input vector.
bytes processed patterns % cumulative %
0 17 0.03% 0.03%
2 32 0.05% 0.07%
3 48 0.07% 0.15%
4 176 0.27% 0.42%
5 360 0.55% 0.97%
6 542 0.83% 1.79%
7 693 1.06% 2.85%
8 1084 1.65% 4.50%
9 1620 2.47% 6.98%
10 2274 3.47% 10.45%
11 3057 4.66% 15.11%
12 4526 6.91% 22.02%
13 6193 9.45% 31.47%
14 8952 13.66% 45.13%
15 12890 19.67% 64.79%
16 23072 35.21% 100.00%
  1. However, the input validation is always done for the whole vector, all 16 bytes. This problem might be overcome, see Processing larger inputs.
  2. SSE units are underutilized. The longest one-digit number sequence is "1;2;3;4;5;6;7;8;". It contains eight numbers, while the SSE procedure is able to process 16 numbers. Likewise, the longest two-digit number sequence is "12;34;56;78;98;;". It has five numbers, but the SSE procedure is able to convert eight numbers. Similarly, the longest four-digit numbers sequence is "1234;5678;9123;;" — it has three four-digit numbers, the SSE procedure is able to convert four numbers.
SSE procedure usage in parser converted numbers average utilization
input size converted nums # % avg max
N/A --- 264 0.40% --- --- ---
1-digit 16 5799 8.85% 3.52 8 22%
2-digit 8 21241 32.41% 3.99 7 49%
4-digit 4 30924 47.18% 3.48 4 87%
8-digit 2 7308 11.15% 1.97 2 98%

5   Parsing and conversion of signed numbers

5.1   Algorithm

Similarly to the unsigned conversion, the signed conversion also requires a span pattern. But in this case also the sign characters '+' and '-' are included in the pattern. For this sample string:

input     = "_+123_-45_78_-9_"

the span pattern is:

span_mask = _dddd_ddd_dd_dd_

The conversion unfortunately requires two shuffles:

  1. Shuffle spans onto proper vector elements — this is exactly the same as in unsigned conversion.
  2. Populate the first span characters, i.e. possibly sign character, on the proper vector elements.

For input:

input =

[ '_' | '-' | '1' | '2' | '3' | '_' | '+' | '4' | '5' | '_' | '-' | '6' | '_' | '7' | '8' | '_' ]

The conversion processes four four-digit numbers. After shuffling the digit spans we have:

shuffled =

[ '-'   '1'   '2'   '3' | 0x00  '+'   '4'   '5' | 0x00  0x00  '-'   '6' | 0x00  0x00  '7'   '8' ]

and the vector of sign characters is:

shuffled_signs =

[ '-'   '-'   '-'   '-' | '+'   '+'   '+'   '+' | '-'   '-'   '-'   '-' | '7'   '7'   '7'   '7' ]

The shuffled_signs vector does not necessarily contain only '+' or '-', it might contain also digits, but this is not an error.

Then the conversion works as follows:

  • From the shuffled vector the sign characters are removed and ASCII characters are converted into values:

    [  0     1     2     3  |  0     0     4     5  |  0     0     0     6  |  0     0     7     8  ]
    
  • Thus we end up with a vector of unsigned numbers, and such an input can be converted by the already defined SSE procedures:

    [          123          |           45          |           6           |           78          ]
    
  • Finally, negate appropriate elements; we use a well known equation (~x + 1), which is expressed in the SSE code as (x xor (-1) - (-1)). The value -1 is obtained from comparing shuffled_signs with the '-':

    negated_mask = (shuffled_signs == packed_byte('-'))
    
    [ ff    ff    ff    ff  | 00    00    00    00  | ff    ff    ff    ff  | 00    00    00    00  ]
    

    or as vector of int32_t:

    [          -1           |           0           |          -1           |           0           ]
    

5.2   Implementation note

Masking of the sign characters in shuffled vector is done at no cost, as the instruction _mm_subs_epu8 (psubusb) is used to convert from ASCII to numeric values:

const __m128i ascii0 = _mm_set1_epi8('0');
const __m128i t0 = _mm_subs_epu8(input, ascii0);

The instruction performs subtract with saturation, i.e. calculates max(0, a-b). Since the ASCII codes of '-' and '+' are respectively 43 and 45 and the ASCII code of '0' is 48, then the result of such expression is zero for both sign characters.

5.3   Detecting invalid inputs

The characters '+' and '-' can be present only at the beginning of a digit span, i.e. we want to reject inputs like "++12" or "1234-,".

To properly validate the input we need four masks for:

  1. separators,
  2. digits,
  3. the character '-',
  4. the character '+'.

The first phase is similar to the unsigned conversion, i.e. all masks are or'ed together, then zero elements point at invalid characters.

The second phase checks if sign characters are placed properly. For each span combination we need to precalculate the mask of valid sign positions. Then, after or'ing the masks for the '+' and '-' we simply verify if they are placed on valid positions.

In practise the precalculated data has a bit-mask for invalid sign positions, thanks to that validation require just a simple and. The snippet below shows the schema of the second step.

const __m128i ascii_minus = _mm_set1_epi8('-');
const __m128i ascii_plus  = _mm_set1_epi8('+');

// ...

const __m128i bytemask_plus  = _mm_cmpeq_epi8(input, ascii_plus);
const __m128i bytemask_minus = _mm_cmpeq_epi8(input, ascii_minus);
const __m128i bytemask_sign  = _mm_or_si128(bytemask_plus, bytemask_minus);

// ...

const uint16_t sign_mask = _mm_movemask_epi8(bytemask_sign);
const uint16_t span_mask = _mm_movemask_epi8(bytemask_span);
const BlockInfo& bi = blocks[span_mask];
if (sign_mask & bi.invalid_sign_mask) {
    throw std::runtime_error("'+' or '-' at invalid position");
}

5.3.1   AVX512VBMI

AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b) which does a lookup in a 128-byte table using as the indices the seven lowest bits of each byte of a ZMM register. Invocation of the intrinsics function is weird:

const __m512i lookup_lo = ... // elements 0..63
const __m512i lookup_hi = ... // and 64..127 of the lookup
const __m512i transformed = _mm512_permutex2var_epi8(lookup_lo, input, lookup_hi);

If the set of valid character fits in the standard ASCII character set, i.e. just seven lowest bits are set, then the validation step might quickly reject extended ASCII (7th bit set) and then translate the input using single invocation of _mm512_permutex2var_epi8. When full 8-bit input is allowed, then the instruction has to be called twice with two halves of a 256-byte lookup.

An invocation (or two) of this instruction allows to classify at once all 64 input bytes into required categories:

  • separators — it doesn't matter how many separator chars are used;
  • digits;
  • and sign characters.

Category is encoded as a bit at certain position, in sample implementation following schema is used:

  • separator = 0x01,
  • digit = 0x80,
  • sign = 0x80 | 0x40.

If the transformed vector contains zero byte, it means there are invalid characters.

The span pattern is obtained from the most significant bits by the instruction _mm512_movepi8_mask (vpmovb2m); it works exactly the same way as pmovmaskb from SSE. The sign pattern is obtained by the instruction _mm512_test_epi8_mask (vptestmb), by testing bit 7th (mask 0x40).

Below is a snippet from the implementation:

const __m512i classes = _mm512_permutex2var_epi8(class_lo, input, class_hi);

if (_mm512_test_epi8_mask(classes, classes) != uint64_t(-1)) {
    throw std::logic_error("invalid character");
}

uint64_t span_mask64 = _mm512_movepi8_mask(classes);
uint64_t sign_mask64 = _mm512_test_epi8_mask(classes, _mm512_set1_epi8(int8_t(0x40)));

5.4   SSE algorithm outline

Below are major steps of the algorithm's loop, with snippets from the actual implementation.

  1. Load 16 characters into the input vector.
const __m128i input = _mm_loadu_si128((const __m128i*)data);
  1. Check for invalid characters (as described above).
  2. Build the span pattern and fetch the conversion parameters.
const __m128i ascii_minus    = _mm_set1_epi8('-');
const __m128i ascii_plus     = _mm_set1_epi8('+');

// ...
const __m128i bytemask_plus  = _mm_cmpeq_epi8(input, ascii_plus);
const __m128i bytemask_minus = _mm_cmpeq_epi8(input, ascii_minus);
const __m128i bytemask_sign  = _mm_or_si128(bytemask_plus, bytemask_minus);

const __m128i bytemask_digit = decimal_digits_mask(input);
const __m128i bytemask_span  = _mm_or_si128(bytemask_digit, bytemask_sign);

const uint16_t span_mask = _mm_movemask_epi8(bytemask_span);
const BlockInfo& bi = block_info[span_mask];
  1. Validate if the sign characters are properly placed.
const uint16_t sign_mask = _mm_movemask_epi8(bytemask_sign);
if (sign_mask & bi.invalid_sign_mask) {
    throw std::runtime_error("'+' or '-' at invalid position");
}
  1. If sign_mask is zero, choose faster unsigned conversion path.
  2. Otherwise prepare data for signed conversion.
const __m128i ascii_minus = _mm_set1_epi8('-');

const __m128i shuffle_digits = _mm_loadu_si128((const __m128i*)bi.shuffle_digits);
const __m128i shuffle_signs  = _mm_loadu_si128((const __m128i*)bi.shuffle_signs);

const __m128i shuffled       = _mm_shuffle_epi8(input, shuffle_digits);
const __m128i shuffled_signs = _mm_shuffle_epi8(input, shuffle_signs);
const __m128i negate_mask    = _mm_cmpeq_epi8(shuffled_signs, ascii_minus);
  1. And finally invoke proper conversion routine.
if (bi.conversion_routine == Conversion::SSE1Digit) {

    convert_1digit(shuffled, bi.element_count, output);

} else if (bi.conversion_routine == Conversion::SSE2Digits) {

    convert_2digits_signed(shuffled, negate_mask, bi.element_count, output);

} else if (bi.conversion_routine == Conversion::SSE4Digits) {

    convert_4digits_signed(shuffled, negate_mask, bi.element_count, output);

} else if (bi.conversion_routine == Conversion::SSE8Digits) {

    convert_8digits_signed(shuffled, negate_mask, bi.element_count, output);

} else {

    // scalar conversion
}

6   Processing larger inputs

As it was stated in Caveats section, a single step of algorithm validates the whole input vector but rarely converts all bytes from the input. Although detecting digits and sign characters might be considered cheap, we saw that validating against an arbitrary set of separators might be really time consuming.

The idea is to separate the validation from the conversion. In case of the unsigned conversion only the span_mask is needed; in case of the signed conversion we need also sign_mask.

In each iteration we process four 16-byte chunks. The result from this step are 64-bit masks. Then, we load 16 bytes into the input vector and use the lower 16-bit part of already calculated mask to get block info for the loaded code.

For unsigned conversion the main loop looks like this:

while (data < end) {

    const __m128i input = _mm_loadu_si128((const __m128i*)data);

    // validate input and calculate span_mask
    const uint16_t span_mask = get_span_mask(input);
    // obtain block info
    const BlockInfo& bi = block_info[span_mask];
    // convert
    convert(input, bi);

    // advance pointer
    data += bi.total_processed;
}

And after applying the proposed change:

while (data < end) {

    const __m128i input0 = _mm_loadu_si128((const __m128i*)(data + 0*16));
    const __m128i input1 = _mm_loadu_si128((const __m128i*)(data + 1*16));
    const __m128i input2 = _mm_loadu_si128((const __m128i*)(data + 2*16));
    const __m128i input3 = _mm_loadu_si128((const __m128i*)(data + 3*16));

    // validate input and calculate 4 x span_mask
    const uint16_t span_mask0 = get_span_mask(input0);
    const uint16_t span_mask1 = get_span_mask(input1);
    const uint16_t span_mask2 = get_span_mask(input2);
    const uint16_t span_mask3 = get_span_mask(input3);

    const uint64_t span_mask64 = (uint64_t(span_mask0) << (0*16))
                               | (uint64_t(span_mask1) << (1*16))
                               | (uint64_t(span_mask2) << (2*16))
                               | (uint64_t(span_mask3) << (3*16));

    char* ptr = data + 3*16;
    while (data < ptr) {
        const uint16_t span_mask = span_mask64 & 0xffff;
        // obtain block info
        const BlockInfo& bi = block_info[span_mask];

        // read input
        const __m128i input = load 16-byte from data

        // convert
        convert(input, bi);

        // advance pointer
        data += bi.total_processed;

        // shift the mas accordingly
        span_mask64 >>= bi.total_processed;
    }
}

The problem of this solution is multiple reading the input bytes. Moreover, the looping schema is more complicated; especially in practice, where also scalar fallback have to be considered (it is not shown here).

7   Scalar hybrid

The idea of separation the validation step from the actual conversion might also be applied for scalar parser. The validation, i.e. building a span_mask is done by SIMD code. Then, each byte of the span_mask is processed by a precompiled code for each of 256 cases.

There are two major advantages:

1. Scalar conversions are called with compile-time constants, thanks to that a compiler can perform as many optimizations as possible.

2. Digits spans that cross byte boundary can be processed seamlessly. When a span ends at the byte's end, a conversion routine converts all characters from the input's chunk and marks that the conversion is not completed.

When the next pattern is processed, the mark is checked: if the first bit of the current pattern is zero, it means that the previously converted number is ready and can be saved. If the first bit is one, it means that the first span is continuation and thus conversion must be carried on from the saved point.

7.1   Example

Let's convert following input 24-byte input (the space separates 8-byte chunks):

_12_3_45 6_7890_1 ___99___

The first chunk of input is _12_3_45, i.e. the span_pattern = 0b01101011 = 0x6b. The digit spans are [(1,2), (4,4), (6,7)], following code handles them:

*output++ = convert<2>(data + 1);
*output++ = convert<1>(data + 4);
val       = convert<2>(data + 6);
has_val   = true;

The template convert is parametrized by the number of digits to convert and gets a pointer to the first character. The important fact is such code doesn't need to validate pointers, sizes etc. All is known in the compile-time.

The next input's chunk is 6_7890_1, the digit spans are [(0,0), (2,5), (7,7)]. A variant of template convert that accepts two parameters simple starts conversion from the saved state (val).

if (has_val) {
    *output++ = continue<1>(data + 0, val);
} else {
    *output++ = convert<1>(data + 0);
}

*output++ = convert<4>(data + 2);
val       = convert<1>(data + 7);
has_val   = true;

The last, third chunk, is ___99___, and there is just one digit span [(3,4)]:

if (has_val) {
    *output++ = val;
    has_val = false;
}

*output++ = convert<2>(data + 3);
has_val   = false;

8   Appendix A — conversion of three-digit numbers

When the input vector contains an array of three-digit numbers, like "123;456;789;123;", we still use four-digit conversion routine. But we might utilize the fact that the most significant digit is always zero.

We use _mm_maddubs_epi16, but with weights 1, 10, 100, 0: two lower digits are converted and the most significant digit is properly scaled. Then simple horizontal add _mm_hadd_epi16 adds the intermediate results.

// t0 = [   0 |   5 |   7 |   9 |   0 |   1 |   2 |   3 |   0 |   9 |   2 |   5 |   0 |   3 |   2 |   7 ]
const __m128i ascii0  = _mm_set1_epi8('0');
const __m128i t0 = _mm_subs_epu8(input, ascii0);

// t1 = [    500    |     79    |    100    |    23     |    900    |    25     |    300    |    27     ]
const __m128i mul_all = _mm_setr_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1);
const __m128i t1 = _mm_maddubs_epi16(t0, mul_all);

// t2 = [    579    |    123    |    925    |   327     |    579    |    123    |    925    |   327     ]
const __m128i t2 = _mm_hadd_epi16(t1, t1);

9   Appendix B — conversion of two four-digit numbers

Where two four-digits are going to be converted, a faster approach can be used. Instead of storing each digit on separate bytes, digits are stored on 16-bit values. Then a single _mm_madd_epi16 is used to multiply all digits by weights 1, 10, 100, 1000. Then the result's halves are added with _mm_hadd_epi32 (phaddd).

// input = "12345678"

// t0 = [   0 |   1 |   0 |   2 |   0 |   3 |   0 |   4 |   0 |   5 |   0 |   6 |   0 |   7 |   0 |   8 ]
const __m128i ascii0  = _mm_set1_epi8('0');
const __m128i t0 = _mm_subs_epu8(input, ascii0);

// t0 = [         1 |         2 |         3 |         4 |         5 |         6 |         7 |         8 ]
//    * [      1000 |       100 |        10 |         1 |      1000 |       100 |        10 |         1 ]

// t1 = [         1200          |           34          |          5600         |          78           ]
const __m128i mul_all = _mm_setr_epi16(1000, 100, 10, 1, 1000, 100, 10, 1);
const __m128i t1 = _mm_madd_epi16(t0, mul_all);

// t2 = [         1234          |         5678          |         1234          |          5678         ]
const __m128i t2 = _mm_hadd_epi32(t1, t1);

10   Reference scalar procedures

There are two reference scalar implementations:

The handcrafted source:

template <typename INSERTER>
void parse_signed(const char* data, size_t size, const char* separators, INSERTER output) {

    enum State {
        Separator,
        Plus,
        Minus,
        Digit
    };

    State state = Separator;
    State prev = Separator;
    bool negative = false;
    int32_t number = 0;

    for (size_t i=0; i < size; i++) {
        const char c = data[i];
        if (c == '+') {
            state = Plus;
        } else if (c == '-') {
            state = Minus;
        } else if (c >= '0' && c <= '9') {
            state = Digit;
        } else if (contains(separators, c)) {
            state = Separator;
        } else {
            throw std::runtime_error("Wrong character (scalar)");
        }

        switch (state) {
            case Plus:
                if (prev != Separator) {
                    throw std::runtime_error("Invalid syntax ('+' follows a non-separator character)");
                }
                number = 0;
                negative = false;
                break;

            case Minus:
                if (prev != Separator) {
                    throw std::runtime_error("Invalid syntax ('-' follows a non-separator character)");
                }
                number = 0;
                negative = true;
                break;

            case Digit:
                if (prev == Separator) {
                    number = c - '0';
                    negative = false;
                } else {
                    number = 10*number + c - '0';
                }
                break;

            case Separator:
                if (prev == Digit) {
                    if (negative) {
                        *output = -number;
                    } else {
                        *output = number;
                    }
                } else if (prev != Separator) {
                    throw std::runtime_error("Invalid syntax ('-' or '+' not followed by any digit)");
                }
                break;
        } // switch

        prev = state;
    } // for

    if (state == Separator) {
        if (prev == Digit) {
            if (negative) {
                *output = -number;
            } else {
                *output = number;
            }
        } else if (prev != Separator) {
            throw std::runtime_error("Invalid syntax ('-' or '+' not followed by any digit)");
        }
    }
}

And implementation based on standard functions:

template <typename INSERTER>
void parse_signed(const char* data, size_t size, const char* separators, INSERTER output) {

    char* ptr = const_cast<char*>(data);
    char* end = ptr + size;

    char* endptr;
    while (true) {
        ptr += strspn(ptr, separators);
        if (ptr == end) {
            break;
        }

        errno = 0;
        const long val = std::strtol(ptr, &endptr, 10);

        // the following check comes from "man 3 strtol"
        if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN)) || (errno != 0 && val == 0)) {
            throw std::logic_error("invalid input");
        }

        if (endptr == ptr) {
            throw std::logic_error("no digits");
        }

        ptr = endptr;
        *output++ = val;
    }
}

11   Experiments

This part presents evaluation of algorithms in terms of running time (performance tests) and also some statistical measurements that might help in understanding run-time properties.

Following parameters of the input data are being changed:

11.1   SSE conversion — execution statistics

Caveats shows statistics of 1) average processed bytes and 2) SSE utilization. This section presents real numbers gathered during conversion of 1000000 bytes containing 1-8 digit signed numbers.

11.1.1   Overview

loops 90'790  
... different span patterns 1'315  
total numbers converted 162'228 100.00%
conversion handled by scalar fallbacks 13'601 8.38%
conversions done by SSE procedures 148'627 91.62%
... unsigned SSE parser 6'222 N/A
... signed SSE parser 142'405

As we see, most conversions were done by the SSE code; moreover, most conversions went through the signed parser, which is slower than unsigned counterpart.

Another important number, which may give a hint about possible cache misses to the block_info structure, is the count of different span patterns. It is 1315 (~2% of the whole table size, 65536), but only 120 of them were processed in 75% iterations. The next section shows this parameter in details.

11.1.2   SSE routines calls

Below are detailed statistics for SSE routines:

  • calls — how many times the procedure was called;
  • converted — how many numbers were parsed;
  • conv/call — average utilization.

Conversions 1-/2-/4-/8-digits are described in section SSE conversion capabilities, the 3-digit procedure in Appendix A — conversion of three-digit numbers.

procedure unsigned path signed path
calls converted conv/call calls converted conv/call
1-digit conversion 325 410 1.26 N/A N/A N/A
2-digit conversion 21 48 2.29 1949 3521 1.81
3-digit conversion 21 43 2.05 1409 2246 1.59
4-digit conversion 239 581 2.43 10569 26255 2.48
8-digit conversion 2887 5140 1.78 59769 110383 1.85

The table below shows how many bytes from the input vector were processed in single iteration. It is pretty close to average numbers showed in Caveats.

bytes processed patterns % cumulative %
0 0 0.00% 0.00%
1 0 0.00% 0.00%
2 197 0.26% 0.26%
3 583 0.76% 1.01%
4 773 1.00% 2.01%
5 1137 1.47% 3.48%
6 1841 2.39% 5.87%
7 2544 3.30% 9.17%
8 5305 6.87% 16.04%
9 9135 11.83% 27.87%
10 7163 9.28% 37.15%
11 8834 11.44% 48.60%
12 10102 13.09% 61.68%
13 9342 12.10% 73.79%
14 8221 10.65% 84.44%
15 6743 8.74% 93.17%
16 5269 6.83% 100.00%

11.2   SSE conversion — runtime analysis

The lookup table is quite large: in the sample implementation single record occupies 38 bytes, it gives almost 2,5 MiB. This section presents analysis of span_pattern usage during conversion of various inputs. Following parameters are collated:

  1. The number of different span patterns that appear during conversion. This is explained in details below.
  2. The number of CPU clocks per input byte (on Skylake); the results were directly get from Performance comparison.
  3. Both cache miss and branch miss ratios (the same Skylake); these results were obtained by running separate bin/benchmark-hwevents tool. The ratios should give a hint about real-workload penalties.

During a conversion a histogram is build. The histogram contains span_pattern and the number of its occurrences; the histogram is sorted by the frequency. Having such a sorted histogram it is easy to estimate how many distinct patterns appear in most iterations.

Let's analyse sample, made up, histogram:

# span pattern frequency cumulative
1 0xaaaa 1 1
2 0xbbbb 1 2
3 0x4444 1 3
4 0xdddd 1 4
5 0xeeee 2 6
6 0xffff 2 8
7 0x1111 3 11
8 0x3333 5 16
9 0x5555 10 26
10 0x7777 17 43

There are 10 different patterns that were used in 43 loops. We can read from the table that in 1/4 of iterations (11 of 43) were used even seven patterns. Most of processing uses just three patterns. It means that only a small portion of the large lookup is actually touched.

The tables in following sections have got data processed in this way. Each column contains number of distinct patterns that take part in given percentage of the iterations (25%, 50%, 75%, 95% and all, 100%).

The results for various input sizes are available in the repository, below are shown only 4kB and 64kB inputs.

11.2.1   Input size 4,096 bytes

parameters distinct span masks count cycles per byte branches cache references
  < 25% < 50% < 75% < 95% 100% min avg taken mispredicted ratio count missed ratio
Fixed length
1 digit, single separator character 32 44 50 52 53 3.095 3.165 26085 633 2.43% 1531 415 27.11%
2 digits, single separator character 12 16 18 19 20 3.305 3.400 25270 485 1.92% 1333 114 8.55%
3 digits, single separator character 5 7 8 8 9 2.533 2.584 21809 247 1.13% 954 63 6.60%
4 digits, single separator character 4 5 6 6 7 3.193 3.231 24145 336 1.39% 926 49 5.29%
5 digits, single separator character 2 3 4 4 5 2.648 2.680 20788 241 1.16% 1067 24 2.25%
6 digits, single separator character 2 4 4 4 5 2.312 2.347 18289 233 1.27% 915 19 2.08%
7 digits, single separator character 2 2 2 2 3 4.121 4.167 25682 313 1.22% 780 6 0.77%
8 digits, single separator character 1 3 3 3 4 5.495 5.592 44188 837 1.89% 836 25 2.99%
Uniform distribution
1 .. 1 digit, single separator character 42 54 60 62 63 3.162 3.288 25992 619 2.38% 1618 334 20.64%
1 .. 2 digits, single separator character 71 102 118 126 128 3.201 3.268 26561 465 1.75% 2118 655 30.93%
1 .. 3 digits, single separator character 72 112 134 146 148 2.891 2.941 24393 514 2.11% 2522 663 26.29%
1 .. 4 digits, single separator character 87 135 159 170 172 3.398 3.463 25831 906 3.51% 2480 667 26.90%
1 .. 5 digits, single separator character 92 136 163 177 179 3.495 3.553 25865 806 3.12% 2509 548 21.84%
1 .. 6 digits, single separator character 90 144 168 178 180 3.494 3.557 24943 639 2.56% 2591 627 24.20%
1 .. 7 digits, single separator character 86 125 143 151 153 3.259 3.324 23730 579 2.44% 2137 476 22.27%
1 .. 8 digits, single separator character 95 135 157 166 167 3.828 3.894 28964 859 2.97% 2074 318 15.33%
Gaussian distribution
max at 1 digit, single separator character 76 133 161 174 176 3.218 3.264 25990 485 1.87% 2658 799 30.06%
max at 2 digit, single separator character 77 129 157 170 173 3.122 3.189 24951 686 2.75% 2803 455 16.23%
max at 3 digit, single separator character 76 103 115 120 122 3.192 3.246 24812 799 3.22% 3038 256 8.43%
max at 4 digit, single separator character 43 57 63 65 66 3.134 3.215 23754 414 1.74% 1650 106 6.42%
max at 5 digit, single separator character 32 38 41 43 44 2.822 2.875 21380 346 1.62% 1320 93 7.05%
max at 6 digit, single separator character 25 30 33 34 35 2.887 2.935 21765 496 2.28% 1223 79 6.46%
max at 7 digit, single separator character 20 22 24 24 25 3.921 3.983 29234 794 2.72% 1096 40 3.65%
max at 8 digit, single separator character 11 12 13 14 15 4.825 4.887 37274 1015 2.72% 944 44 4.66%
Fixed length
1 digit, 1 .. 6 separator characters 65 131 197 223 227 3.331 3.404 18979 677 3.57% 3111 1581 50.82%
2 digits, 1 .. 6 separator characters 67 135 175 195 198 3.473 3.529 20588 799 3.88% 2858 1287 45.03%
3 digits, 1 .. 6 separator characters 68 119 147 157 158 2.747 2.806 18629 481 2.58% 2728 1150 42.16%
4 digits, 1 .. 6 separator characters 69 101 118 124 125 2.641 2.695 18672 488 2.61% 1919 613 31.94%
5 digits, 1 .. 6 separator characters 53 71 78 81 82 2.748 2.824 19009 530 2.79% 1638 372 22.71%
6 digits, 1 .. 6 separator characters 43 50 53 55 56 2.958 3.002 19989 450 2.25% 1304 208 15.95%
7 digits, 1 .. 6 separator characters 12 14 17 18 19 3.316 3.341 20881 300 1.44% 1058 52 4.91%
8 digits, 1 .. 6 separator characters 33 46 54 57 58 4.302 4.346 34811 728 2.09% 1377 206 14.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters 66 132 198 232 237 3.197 3.268 19319 685 3.55% 2972 1273 42.83%
1 .. 2 digits, 1 .. 6 separator characters 66 132 198 238 243 3.025 3.081 19752 842 4.26% 3186 1247 39.14%
1 .. 3 digits, 1 .. 6 separator characters 67 135 203 233 238 3.070 3.150 19453 743 3.82% 3357 1216 36.22%
1 .. 4 digits, 1 .. 6 separator characters 70 140 210 250 256 3.390 3.477 19640 897 4.57% 3586 1226 34.19%
1 .. 5 digits, 1 .. 6 separator characters 71 142 213 242 247 3.338 3.401 19580 823 4.20% 3170 940 29.65%
1 .. 6 digits, 1 .. 6 separator characters 71 143 207 233 237 3.288 3.375 19457 785 4.03% 2916 807 27.67%
1 .. 7 digits, 1 .. 6 separator characters 74 148 208 228 230 3.304 3.375 19268 761 3.95% 2985 904 30.28%
1 .. 8 digits, 1 .. 6 separator characters 79 159 209 227 230 3.692 3.766 23045 881 3.82% 2946 727 24.68%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters 66 133 200 244 249 3.154 3.216 19347 900 4.65% 3515 1060 30.16%
max at 2 digit, 1 .. 6 separator characters 67 135 203 241 246 3.155 3.229 19299 786 4.07% 3340 892 26.71%
max at 3 digit, 1 .. 6 separator characters 69 139 205 229 233 3.202 3.278 19382 804 4.15% 3188 661 20.73%
max at 4 digit, 1 .. 6 separator characters 70 141 180 198 201 2.914 2.978 18669 611 3.27% 2775 692 24.94%
max at 5 digit, 1 .. 6 separator characters 74 131 156 164 166 3.147 3.217 19276 586 3.04% 2291 552 24.09%
max at 6 digit, 1 .. 6 separator characters 70 93 101 105 106 3.196 3.257 20806 493 2.37% 1719 237 13.79%
max at 7 digit, 1 .. 6 separator characters 64 80 87 91 92 3.597 3.646 24974 622 2.49% 1599 251 15.70%
max at 8 digit, 1 .. 6 separator characters 56 74 82 87 88 3.994 4.055 29874 848 2.84% 1587 231 14.56%

11.2.2   Input size 65,536 bytes

parameters distinct span masks count cycles per byte branches cache references
  < 25% < 50% < 75% < 95% 100% min avg taken mispredicted ratio count missed ratio
Fixed length
1 digit, single separator character 61 74 81 84 85 3.619 3.655 393175 6529 1.66% 13469 974 7.23%
2 digits, single separator character 13 17 20 21 22 3.727 3.754 374164 3451 0.92% 11212 286 2.55%
3 digits, single separator character 4 6 8 8 9 2.839 2.852 328065 829 0.25% 7539 101 1.34%
4 digits, single separator character 4 5 6 6 7 3.495 3.507 361605 2429 0.67% 7148 89 1.25%
5 digits, single separator character 1 2 3 3 4 2.921 2.935 313419 849 0.27% 7335 119 1.62%
6 digits, single separator character 1 2 3 3 4 2.545 2.555 272684 741 0.27% 8704 58 0.67%
7 digits, single separator character 1 2 2 2 3 4.407 4.417 386099 2206 0.57% 5224 27 0.52%
8 digits, single separator character 2 3 4 4 5 7.094 7.135 666950 13453 2.02% 5192 75 1.44%
Uniform distribution
1 .. 1 digit, single separator character 59 72 80 82 83 3.627 3.693 393231 6582 1.67% 14582 973 6.67%
1 .. 2 digits, single separator character 236 280 299 308 310 3.908 3.935 397134 5214 1.31% 19011 2176 11.45%
1 .. 3 digits, single separator character 355 420 450 466 469 3.565 3.590 361819 4669 1.29% 21347 1858 8.70%
1 .. 4 digits, single separator character 416 493 523 537 540 5.329 5.387 391117 12854 3.29% 24169 1621 6.71%
1 .. 5 digits, single separator character 444 510 543 556 558 5.270 5.328 390345 11393 2.92% 19020 1568 8.24%
1 .. 6 digits, single separator character 423 492 525 535 537 4.589 4.633 372174 7872 2.12% 19917 1282 6.44%
1 .. 7 digits, single separator character 392 462 481 491 493 4.174 4.218 355907 6263 1.76% 15979 949 5.94%
1 .. 8 digits, single separator character 519 598 620 630 631 5.217 5.273 423108 11289 2.67% 16635 695 4.18%
Gaussian distribution
max at 1 digit, single separator character 481 587 630 645 648 4.150 4.189 395482 6382 1.61% 25467 1620 6.36%
max at 2 digit, single separator character 454 534 570 586 589 4.104 4.143 371930 6882 1.85% 25775 739 2.87%
max at 3 digit, single separator character 300 330 344 350 352 4.990 5.050 375961 11571 3.08% 15877 351 2.21%
max at 4 digit, single separator character 132 147 153 155 156 3.760 3.814 356129 4522 1.27% 10634 220 2.07%
max at 5 digit, single separator character 84 90 93 95 96 3.031 3.045 318232 1548 0.49% 8123 177 2.18%
max at 6 digit, single separator character 54 59 62 64 65 3.479 3.499 328097 6041 1.84% 8491 366 4.31%
max at 7 digit, single separator character 36 39 41 41 42 5.520 5.560 444157 11932 2.69% 5814 124 2.13%
max at 8 digit, single separator character 22 23 24 25 26 6.897 6.946 562307 15610 2.78% 5449 120 2.20%
Fixed length
1 digit, 1 .. 6 separator characters 1052 1626 1870 1950 1958 4.371 4.408 280108 7580 2.71% 34860 8289 23.78%
2 digits, 1 .. 6 separator characters 693 909 995 1026 1029 5.044 5.108 308039 12071 3.92% 26297 4481 17.04%
3 digits, 1 .. 6 separator characters 354 438 480 494 496 3.334 3.358 273155 3865 1.41% 17568 2647 15.07%
4 digits, 1 .. 6 separator characters 177 224 246 252 254 3.311 3.335 277306 4944 1.78% 10967 1361 12.41%
5 digits, 1 .. 6 separator characters 100 126 134 137 138 3.648 3.677 284172 7099 2.50% 7727 742 9.60%
6 digits, 1 .. 6 separator characters 69 77 80 82 83 3.675 3.703 297939 5210 1.75% 6470 402 6.21%
7 digits, 1 .. 6 separator characters 46 49 51 53 54 3.662 3.673 315611 2311 0.73% 6077 308 5.07%
8 digits, 1 .. 6 separator characters 30 45 54 58 59 5.735 5.782 532039 10908 2.05% 6292 438 6.96%
Uniform distribution
1 .. 1 digit, 1 .. 6 separator characters 1053 1643 1896 1973 1981 4.364 4.396 279251 7676 2.75% 34478 6144 17.82%
1 .. 2 digits, 1 .. 6 separator characters 1066 1792 2104 2212 2221 5.090 5.154 288339 12133 4.21% 39995 5781 14.45%
1 .. 3 digits, 1 .. 6 separator characters 1084 1802 2120 2226 2236 4.974 5.031 286026 10029 3.51% 44877 5187 11.56%
1 .. 4 digits, 1 .. 6 separator characters 1120 1808 2084 2173 2182 5.332 5.395 287345 11700 4.07% 42918 3871 9.02%
1 .. 5 digits, 1 .. 6 separator characters 1141 1713 1959 2031 2039 5.078 5.153 287785 10589 3.68% 40253 2950 7.33%
1 .. 6 digits, 1 .. 6 separator characters 1142 1617 1803 1856 1860 4.888 4.960 286913 10124 3.53% 35625 2229 6.26%
1 .. 7 digits, 1 .. 6 separator characters 1091 1503 1646 1678 1681 4.970 5.033 287420 11059 3.85% 30932 3218 10.40%
1 .. 8 digits, 1 .. 6 separator characters 1109 1476 1598 1619 1622 5.650 5.711 339596 14223 4.19% 28870 2243 7.77%
Gaussian distribution
max at 1 digit, 1 .. 6 separator characters 1070 1975 2405 2551 2565 5.389 5.456 285670 12734 4.46% 46840 6140 13.11%
max at 2 digit, 1 .. 6 separator characters 1094 1862 2174 2283 2293 5.125 5.177 288202 10818 3.75% 43779 3228 7.37%
max at 3 digit, 1 .. 6 separator characters 1021 1440 1610 1666 1672 4.927 4.986 284545 10542 3.70% 34647 1690 4.88%
max at 4 digit, 1 .. 6 separator characters 715 905 985 1010 1013 4.072 4.129 280545 8310 2.96% 20920 931 4.45%
max at 5 digit, 1 .. 6 separator characters 445 534 567 575 577 4.127 4.167 286377 8569 2.99% 14319 651 4.55%
max at 6 digit, 1 .. 6 separator characters 296 336 346 351 352 4.025 4.054 312734 6394 2.04% 11762 402 3.42%
max at 7 digit, 1 .. 6 separator characters 185 207 214 219 220 4.481 4.520 376654 8042 2.14% 7924 452 5.70%
max at 8 digit, 1 .. 6 separator characters 110 133 141 146 148 5.532 5.576 448317 12621 2.82% 7779 821 10.55%

11.3   Performance comparison

11.3.1   Tested procedures

Following procedures are tested:

  • scalar — the hand-coded validating scalar parser;
  • scalar (std) — also the hand-coded parser, but use only standard functions strtol and strspn;
  • scalar hybrid — the implementation of Scalar hybrid;
  • SSE — the implementation of the base SSE algorithm;
  • SSE (block) — the implementation of variant described in Processing larger inputs
  • SSE (simplified) — the SSE variant with relaxed delimiters definition.

11.3.2   Tests setup

There are two tests approaches:

  • Overall, where rather huge input data is parsed (a few megabytes) and the wall clock is used to measure the time of the whole procedure. Procedures are repeated several times (10, 100, 1000 or more), the minimum running time is noted.

    To run these tests type make report-overall.rst; the tests suite uses executable bin/benchmark.

    The results from various computers are available in the repository.

  • Microbenchmarks, where small inputs are parsed and exact number of CPU cycles is obtained. Procedures are also repeated many times (1000) and then minumum and average cycles to process a single input byte are calculated.

    To run these tests type make microbenchmarks.rst; the tests suite uses executable bin/benchmark-cpuclocks.

    The all results from various computers are available in the repository, below are only the measurements from the Skylake computer.

11.3.3   Core i7 results

Compiler: GCC 7.3.0

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz


The table below compares speedup to the reference procedure for various input sizes and data distributions. Detailed measurements for each size and different distribution parameters are presented in the subsequent sections.

  speedup over scalar procedure
  scalar (std) scalar (hybrid) SSE SSE (block) SSE (simplified)
size [B] distribution samples min avg max min avg max min avg max min avg max min avg max
1024 Gaussian distribution 16 0.45 0.87 1.29 1.65 2.36 3.26 1.53 3.21 5.44 1.66 3.16 4.97 1.58 3.37 5.78
1024 Fixed length 16 0.42 0.87 1.29 1.63 2.41 3.53 1.28 3.32 5.23 1.47 3.30 5.05 1.30 3.54 5.79
1024 Uniform distribution 16 0.40 0.86 1.32 1.64 2.41 3.36 2.04 3.70 5.54 2.22 3.53 4.97 2.13 3.88 5.79
4096 Gaussian distribution 16 0.39 0.69 1.24 1.69 3.01 4.67 1.91 4.56 7.21 2.13 4.88 7.84 2.01 4.77 7.68
4096 Fixed length 16 0.36 0.69 1.39 1.94 2.87 4.32 1.60 4.44 6.86 1.71 4.66 8.34 1.62 4.70 7.83
4096 Uniform distribution 16 0.43 0.78 1.38 2.07 3.24 4.55 3.19 5.33 7.40 3.36 5.61 7.87 3.27 5.61 7.79
65536 Gaussian distribution 16 0.45 0.77 1.30 1.40 1.74 2.13 1.73 4.46 5.88 1.76 4.57 6.00 1.77 4.60 6.25
65536 Fixed length 16 0.40 0.75 1.36 1.55 1.87 2.37 1.46 4.68 6.72 1.48 5.00 7.31 1.50 4.95 7.53
65536 Uniform distribution 16 0.55 0.86 1.36 1.38 1.75 2.34 3.32 4.98 6.39 3.29 5.01 6.52 3.43 5.20 6.89
102400 Gaussian distribution 16 0.46 0.77 1.26 1.38 1.71 2.03 1.78 4.28 5.66 1.81 4.44 5.81 1.82 4.42 6.00
102400 Fixed length 16 0.40 0.75 1.36 1.53 1.84 2.31 1.29 4.59 6.49 1.46 4.91 6.99 1.31 4.85 7.16
102400 Uniform distribution 16 0.54 0.86 1.38 1.35 1.71 2.32 3.22 4.78 6.38 3.21 4.83 6.43 3.31 4.97 6.85

11.3.3.1   Input size 4096 bytes

Gaussian distribution (max at 8 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 8.782 9.103 1.00
scalar (std) 22.874 23.586 0.38
scalar (hybrid) 5.005 5.148 1.75
SSE 4.898 4.974 1.79
SSE (block) 4.603 4.651 1.91
SSE (simplified) 4.687 4.730 1.87

Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.729 13.130 1.00
scalar (std) 17.501 18.150 0.73
scalar (hybrid) 4.067 4.273 3.13
SSE 4.002 4.069 3.18
SSE (block) 3.903 3.945 3.26
SSE (simplified) 3.914 3.970 3.25

Gaussian distribution (max at 7 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 8.836 9.191 1.00
scalar (std) 23.365 24.139 0.38
scalar (hybrid) 5.104 5.250 1.73
SSE 4.129 4.195 2.14
SSE (block) 3.476 3.518 2.54
SSE (simplified) 3.975 4.039 2.22

Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.646 14.115 1.00
scalar (std) 18.029 18.596 0.76
scalar (hybrid) 4.042 4.274 3.38
SSE 3.591 3.644 3.80
SSE (block) 3.294 3.334 4.14
SSE (simplified) 3.431 3.481 3.98

Gaussian distribution (max at 6 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 9.434 9.852 1.00
scalar (std) 24.766 25.547 0.38
scalar (hybrid) 5.571 5.728 1.69
SSE 2.833 2.882 3.33
SSE (block) 2.977 3.024 3.17
SSE (simplified) 2.738 2.786 3.45

Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.200 14.750 1.00
scalar (std) 17.992 18.563 0.79
scalar (hybrid) 4.091 4.330 3.47
SSE 3.207 3.266 4.43
SSE (block) 2.885 2.932 4.92
SSE (simplified) 3.058 3.113 4.64

Gaussian distribution (max at 5 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 10.384 10.864 1.00
scalar (std) 25.763 26.495 0.40
scalar (hybrid) 6.102 6.273 1.70
SSE 2.786 2.827 3.73
SSE (block) 2.710 2.744 3.83
SSE (simplified) 2.699 2.752 3.85

Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.168 15.622 1.00
scalar (std) 18.088 18.872 0.84
scalar (hybrid) 4.244 4.442 3.57
SSE 3.023 3.096 5.02
SSE (block) 2.864 2.934 5.30
SSE (simplified) 2.908 2.984 5.22

Gaussian distribution (max at 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.091 18.877 1.00
scalar (std) 34.484 35.258 0.52
scalar (hybrid) 7.037 7.275 2.57
SSE 3.160 3.213 5.72
SSE (block) 3.282 3.352 5.51
SSE (simplified) 3.120 3.194 5.80

Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.390 22.112 1.00
scalar (std) 18.566 19.088 1.15
scalar (hybrid) 5.160 5.286 4.15
SSE 3.067 3.126 6.97
SSE (block) 2.904 2.963 7.37
SSE (simplified) 2.961 3.022 7.22

Gaussian distribution (max at 4 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.966 12.550 1.00
scalar (std) 26.982 27.705 0.44
scalar (hybrid) 5.896 6.082 2.03
SSE 3.234 3.284 3.70
SSE (block) 3.026 3.074 3.95
SSE (simplified) 3.117 3.183 3.84

Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.371 16.967 1.00
scalar (std) 17.832 18.445 0.92
scalar (hybrid) 4.173 4.395 3.92
SSE 2.909 3.001 5.63
SSE (block) 2.760 2.813 5.93
SSE (simplified) 2.848 2.908 5.75

Gaussian distribution (max at 3 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.280 13.948 1.00
scalar (std) 29.238 30.157 0.45
scalar (hybrid) 6.225 6.450 2.13
SSE 3.225 3.309 4.12
SSE (block) 2.963 3.012 4.48
SSE (simplified) 3.238 3.286 4.10

Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.193 18.831 1.00
scalar (std) 18.642 19.287 0.98
scalar (hybrid) 4.119 4.246 4.42
SSE 3.126 3.190 5.82
SSE (block) 3.016 3.090 6.03
SSE (simplified) 3.000 3.048 6.06

Gaussian distribution (max at 2 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.942 16.733 1.00
scalar (std) 32.551 33.334 0.49
scalar (hybrid) 6.372 6.605 2.50
SSE 3.130 3.190 5.09
SSE (block) 3.041 3.121 5.24
SSE (simplified) 3.020 3.082 5.28

Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.088 20.800 1.00
scalar (std) 18.382 19.013 1.09
scalar (hybrid) 4.574 4.835 4.39
SSE 3.184 3.264 6.31
SSE (block) 3.018 3.069 6.66
SSE (simplified) 2.988 3.042 6.72

Uniform distribution (1 .. 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.677 19.206 1.00
scalar (std) 34.479 35.207 0.54
scalar (hybrid) 7.953 8.184 2.35
SSE 3.014 3.099 6.20
SSE (block) 3.117 3.172 5.99
SSE (simplified) 2.977 3.062 6.27

Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.981 22.621 1.00
scalar (std) 16.860 17.391 1.30
scalar (hybrid) 5.285 5.481 4.16
SSE 3.029 3.085 7.26
SSE (block) 2.942 3.000 7.47
SSE (simplified) 3.099 3.179 7.09

Uniform distribution (1 .. 2 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.695 18.444 1.00
scalar (std) 34.917 35.633 0.51
scalar (hybrid) 7.245 7.401 2.44
SSE 3.324 3.387 5.32
SSE (block) 3.323 3.398 5.33
SSE (simplified) 3.236 3.321 5.47

Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.164 21.860 1.00
scalar (std) 18.567 19.126 1.14
scalar (hybrid) 4.937 5.156 4.29
SSE 2.940 2.995 7.20
SSE (block) 2.924 2.975 7.24
SSE (simplified) 2.785 2.826 7.60

Uniform distribution (1 .. 3 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.385 17.254 1.00
scalar (std) 33.359 34.075 0.49
scalar (hybrid) 6.877 7.145 2.38
SSE 2.892 2.947 5.67
SSE (block) 3.088 3.158 5.31
SSE (simplified) 2.752 2.844 5.95

Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.530 21.221 1.00
scalar (std) 18.995 19.580 1.08
scalar (hybrid) 4.647 4.811 4.42
SSE 3.037 3.089 6.76
SSE (block) 2.941 2.992 6.98
SSE (simplified) 2.914 2.972 7.05

Uniform distribution (1 .. 4 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.926 15.613 1.00
scalar (std) 31.275 32.210 0.48
scalar (hybrid) 6.445 6.677 2.32
SSE 3.393 3.458 4.40
SSE (block) 3.126 3.184 4.77
SSE (simplified) 3.293 3.344 4.53

Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.047 19.920 1.00
scalar (std) 19.500 20.130 0.98
scalar (hybrid) 4.539 4.723 4.20
SSE 3.325 3.383 5.73
SSE (block) 3.132 3.193 6.08
SSE (simplified) 3.182 3.235 5.99

Uniform distribution (1 .. 5 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.748 14.405 1.00
scalar (std) 30.105 30.941 0.46
scalar (hybrid) 6.144 6.413 2.24
SSE 3.475 3.533 3.96
SSE (block) 3.171 3.238 4.34
SSE (simplified) 3.293 3.356 4.17

Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.297 19.010 1.00
scalar (std) 19.209 19.845 0.95
scalar (hybrid) 4.477 4.652 4.09
SSE 3.293 3.352 5.56
SSE (block) 3.135 3.188 5.84
SSE (simplified) 3.182 3.230 5.75

Uniform distribution (1 .. 6 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.912 13.548 1.00
scalar (std) 29.511 30.337 0.44
scalar (hybrid) 5.958 6.253 2.17
SSE 3.430 3.487 3.76
SSE (block) 3.201 3.265 4.03
SSE (simplified) 3.283 3.337 3.93

Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.288 17.956 1.00
scalar (std) 19.250 19.841 0.90
scalar (hybrid) 4.356 4.481 3.97
SSE 3.205 3.276 5.39
SSE (block) 2.990 3.046 5.78
SSE (simplified) 3.116 3.172 5.55

Uniform distribution (1 .. 7 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.974 12.552 1.00
scalar (std) 27.884 28.562 0.43
scalar (hybrid) 6.214 6.385 1.93
SSE 3.253 3.309 3.68
SSE (block) 3.057 3.100 3.92
SSE (simplified) 3.119 3.167 3.84

Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.583 17.366 1.00
scalar (std) 18.947 19.528 0.88
scalar (hybrid) 4.265 4.401 3.89
SSE 3.248 3.311 5.11
SSE (block) 2.989 3.048 5.55
SSE (simplified) 3.165 3.215 5.24

Uniform distribution (1 .. 8 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.637 12.165 1.00
scalar (std) 27.107 27.851 0.43
scalar (hybrid) 6.079 6.231 1.91
SSE 3.821 3.870 3.05
SSE (block) 3.492 3.550 3.33
SSE (simplified) 3.688 3.745 3.16

Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 15.739 16.589 1.00
scalar (std) 18.875 19.563 0.83
scalar (hybrid) 4.335 4.463 3.63
SSE 3.611 3.667 4.36
SSE (block) 3.337 3.400 4.72
SSE (simplified) 3.490 3.538 4.51

11.3.3.2   Input size 65536 bytes

Gaussian distribution (max at 8 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 11.852 11.971 1.00
scalar (std) 26.589 26.792 0.45
scalar (hybrid) 8.336 8.411 1.42
SSE 6.911 6.956 1.71
SSE (block) 6.921 6.990 1.71
SSE (simplified) 6.757 6.802 1.75

Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.008 17.110 1.00
scalar (std) 21.492 21.653 0.79
scalar (hybrid) 8.722 8.800 1.95
SSE 5.650 5.692 3.01
SSE (block) 5.679 5.733 2.99
SSE (simplified) 5.426 5.470 3.13

Gaussian distribution (max at 7 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 12.818 12.929 1.00
scalar (std) 27.335 27.609 0.47
scalar (hybrid) 9.246 9.315 1.39
SSE 5.513 5.542 2.33
SSE (block) 5.259 5.296 2.44
SSE (simplified) 5.364 5.405 2.39

Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.978 18.119 1.00
scalar (std) 21.924 22.109 0.82
scalar (hybrid) 9.205 9.286 1.95
SSE 4.537 4.591 3.96
SSE (block) 4.351 4.388 4.13
SSE (simplified) 4.352 4.407 4.13

Gaussian distribution (max at 6 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 13.772 13.862 1.00
scalar (std) 28.295 28.596 0.49
scalar (hybrid) 10.005 10.079 1.38
SSE 3.456 3.479 3.98
SSE (block) 3.358 3.385 4.10
SSE (simplified) 3.379 3.402 4.08

Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.374 19.538 1.00
scalar (std) 22.288 22.489 0.87
scalar (hybrid) 9.660 9.730 2.01
SSE 3.949 3.983 4.91
SSE (block) 3.726 3.751 5.20
SSE (simplified) 3.870 3.918 5.01

Gaussian distribution (max at 5 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 14.989 15.092 1.00
scalar (std) 29.851 30.089 0.50
scalar (hybrid) 10.750 10.842 1.39
SSE 2.933 2.945 5.11
SSE (block) 2.807 2.822 5.34
SSE (simplified) 2.866 2.878 5.23

Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 20.729 20.888 1.00
scalar (std) 22.536 22.718 0.92
scalar (hybrid) 10.225 10.311 2.03
SSE 4.032 4.086 5.14
SSE (block) 3.762 3.805 5.51
SSE (simplified) 3.985 4.043 5.20

Gaussian distribution (max at 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.489 24.804 1.00
scalar (std) 38.285 38.540 0.64
scalar (hybrid) 16.088 16.198 1.52
SSE 4.159 4.200 5.89
SSE (block) 4.107 4.142 5.96
SSE (simplified) 4.213 4.250 5.81

Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 28.265 28.549 1.00
scalar (std) 22.648 22.767 1.25
scalar (hybrid) 13.524 13.622 2.09
SSE 5.201 5.271 5.43
SSE (block) 5.452 5.513 5.18
SSE (simplified) 5.057 5.128 5.59

Gaussian distribution (max at 4 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.687 16.790 1.00
scalar (std) 31.153 31.398 0.54
scalar (hybrid) 11.835 11.913 1.41
SSE 3.590 3.614 4.65
SSE (block) 3.466 3.489 4.81
SSE (simplified) 3.496 3.525 4.77

Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.284 22.529 1.00
scalar (std) 22.476 22.653 0.99
scalar (hybrid) 10.916 11.005 2.04
SSE 4.032 4.081 5.53
SSE (block) 3.859 3.898 5.77
SSE (simplified) 3.953 3.991 5.64

Gaussian distribution (max at 3 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.121 19.291 1.00
scalar (std) 32.927 33.175 0.58
scalar (hybrid) 13.629 13.718 1.40
SSE 4.848 4.901 3.94
SSE (block) 4.780 4.825 4.00
SSE (simplified) 4.760 4.812 4.02

Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.281 24.611 1.00
scalar (std) 22.530 22.703 1.08
scalar (hybrid) 11.920 12.185 2.04
SSE 4.756 4.823 5.11
SSE (block) 4.781 4.835 5.08
SSE (simplified) 4.607 4.664 5.27

Gaussian distribution (max at 2 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.188 22.377 1.00
scalar (std) 36.158 36.461 0.61
scalar (hybrid) 15.232 15.326 1.46
SSE 4.058 4.108 5.47
SSE (block) 3.952 3.990 5.61
SSE (simplified) 4.061 4.110 5.46

Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 26.633 26.932 1.00
scalar (std) 22.723 22.869 1.17
scalar (hybrid) 13.162 13.267 2.02
SSE 5.097 5.174 5.23
SSE (block) 5.160 5.224 5.16
SSE (simplified) 4.941 5.008 5.39

Uniform distribution (1 .. 1 digit) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.963 23.203 1.00
scalar (std) 37.465 37.718 0.61
scalar (hybrid) 13.951 14.058 1.65
SSE 3.389 3.447 6.78
SSE (block) 3.593 3.643 6.39
SSE (simplified) 3.324 3.394 6.91

Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 27.116 27.518 1.00
scalar (std) 20.621 20.742 1.31
scalar (hybrid) 11.908 12.006 2.28
SSE 4.128 4.167 6.57
SSE (block) 4.179 4.214 6.49
SSE (simplified) 4.105 4.136 6.61

Uniform distribution (1 .. 2 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 23.675 24.277 1.00
scalar (std) 37.914 38.203 0.62
scalar (hybrid) 15.327 15.432 1.54
SSE 3.896 3.933 6.08
SSE (block) 3.870 3.901 6.12
SSE (simplified) 3.913 3.952 6.05

Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 27.811 28.100 1.00
scalar (std) 22.433 22.537 1.24
scalar (hybrid) 13.003 13.113 2.14
SSE 4.947 5.018 5.62
SSE (block) 5.201 5.265 5.35
SSE (simplified) 4.807 4.872 5.79

Uniform distribution (1 .. 3 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.512 22.742 1.00
scalar (std) 36.660 36.946 0.61
scalar (hybrid) 15.263 15.362 1.47
SSE 3.474 3.507 6.48
SSE (block) 3.456 3.480 6.51
SSE (simplified) 3.528 3.561 6.38

Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 26.841 27.155 1.00
scalar (std) 22.874 23.042 1.17
scalar (hybrid) 13.184 13.284 2.04
SSE 4.815 4.876 5.57
SSE (block) 4.952 5.022 5.42
SSE (simplified) 4.699 4.753 5.71

Uniform distribution (1 .. 4 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 21.123 21.329 1.00
scalar (std) 35.205 35.486 0.60
scalar (hybrid) 14.756 14.856 1.43
SSE 5.217 5.283 4.05
SSE (block) 5.053 5.106 4.18
SSE (simplified) 5.163 5.227 4.09

Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 25.877 26.129 1.00
scalar (std) 23.132 23.285 1.12
scalar (hybrid) 12.984 13.084 1.99
SSE 5.211 5.291 4.97
SSE (block) 5.283 5.356 4.90
SSE (simplified) 5.048 5.116 5.13

Uniform distribution (1 .. 5 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 19.780 19.978 1.00
scalar (std) 34.247 34.511 0.58
scalar (hybrid) 14.274 14.371 1.39
SSE 5.099 5.160 3.88
SSE (block) 4.920 4.975 4.02
SSE (simplified) 5.053 5.099 3.91

Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 24.749 25.022 1.00
scalar (std) 23.269 23.468 1.06
scalar (hybrid) 12.627 12.718 1.96
SSE 5.046 5.111 4.90
SSE (block) 4.999 5.126 4.95
SSE (simplified) 4.903 4.954 5.05

Uniform distribution (1 .. 6 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 18.656 18.857 1.00
scalar (std) 33.362 33.565 0.56
scalar (hybrid) 13.643 13.742 1.37
SSE 4.433 4.479 4.21
SSE (block) 4.282 4.319 4.36
SSE (simplified) 4.356 4.399 4.28

Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 23.975 24.304 1.00
scalar (std) 23.495 23.653 1.02
scalar (hybrid) 12.269 12.358 1.95
SSE 4.883 4.947 4.91
SSE (block) 4.715 4.770 5.08
SSE (simplified) 4.778 4.825 5.02

Uniform distribution (1 .. 7 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 17.632 17.799 1.00
scalar (std) 32.227 32.509 0.55
scalar (hybrid) 13.079 13.177 1.35
SSE 4.114 4.158 4.29
SSE (block) 3.976 4.028 4.43
SSE (simplified) 4.032 4.069 4.37

Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.878 23.116 1.00
scalar (std) 23.259 23.512 0.98
scalar (hybrid) 12.010 12.113 1.90
SSE 4.926 4.986 4.64
SSE (block) 4.670 4.735 4.90
SSE (simplified) 4.836 4.892 4.73

Uniform distribution (1 .. 8 digits) — single separator character

  cycles per one input byte speed-up
procedure min avg (min)
scalar 16.753 16.923 1.00
scalar (std) 31.309 31.604 0.54
scalar (hybrid) 12.522 12.615 1.34
SSE 5.181 5.232 3.23
SSE (block) 5.263 5.313 3.18
SSE (simplified) 5.089 5.133 3.29

Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters

  cycles per one input byte speed-up
procedure min avg (min)
scalar 22.070 22.400 1.00
scalar (std) 23.084 23.272 0.96
scalar (hybrid) 11.689 11.777 1.89
SSE 5.666 5.729 3.90
SSE (block) 5.833 5.886 3.78
SSE (simplified) 5.493 5.553 4.02

12   Conclusions

13   Acknowledgements

Nathan Kurz and Daniel Lemire gave incredibly valuable feedback and comments on text and experiments. Thank you!

14   See also

External links:

15   Source code

Source code is available on github.