Author: | Wojciech Muła |
---|---|
Added on: | 2018-04-19 |
Updated on: | 2020-07-05 (fixed a typo in conversion 2-digit numbers, thank to Alexander Vingardt), 2020-03-30 (added method decimal_digits_mask_version2 by Carl Witty), 2020-03-29 (fixed a typo, thanks to Carl Witty), 2018-04-30 (info about overflows), 2018-04-25 (added speed-up comparison table) |
While conversion from a string into an integer value is feasible with SIMD instructions, this application is unpractical. For typical cases, when a single value is parsed, scalar procedures — like the standard atoi or strtol — are faster than any fancy SSE code.
However, SIMD procedures can be really fast and convert in parallel several numbers. There is only one "but": the input data has to be regular and valid, i.e. the input string must contain only ASCII digits. Recently, I updated article about SSE parsing with the benchmark results. The speed-ups are really impressive, for example the SSSE3 parser is 7 to 9 times faster than a naive, scalar code.
The obvious question is how these powerful SIMD procedures can be used to convert real data? By real I mean possibly broken inputs that contain series of numbers of different length separated with characters from a predefined set.
In this text I try to answer that question; major contributions of this article are:
The article starts with unsigned conversion, because it is easier than a signed one. The signed conversion shares the core idea, it just adds some extra steps.
The text is accompanied with BSD-licensed software, that includes fully functional implementations alongside the programs which validate and benchmark the procedures.
Contents
The input is defined as follows:
For the input string "123; -52, +432424 -999; 1234568, +879", and three separators ',', ';' and ' ', an algorithm is expected to extract six integers: [123, -52, 432424, -999, 1234568, 879]. The order of converted numbers must be exactly the same as in the input string.
A simplified parser is also tested. This parser simply considers all non-digit and non-sign characters as delimiters.
SSE instructions work on 16-byte vectors. When the input numbers have the same count of digits and the digits of each number are placed — depending of the digits count — on bytes, words, double words or quad words of an SSE vector, then an SSE procedure converts them in parallel. Below are required layouts of vectors; the most significant digit is at index 0 of each subword.
16 one-digit unsigned numbers a .. p:
[ a0 | b0 | c0 | d0 | e0 | f0 | g0 | h0 | i0 | j0 | k0 | l0 | m0 | n0 | o0 | p0 ]
8 two-digit unsigned numbers a .. h:
[ a0 | a1 | b0 | b1 | c0 | c1 | d0 | d1 | e0 | e1 | f0 | f1 | g0 | g1 | h0 | h1 ] | input 0 | input 1 | input 2 | input 3 | input 4 | input 5 | input 6 | input 7 |
4 four-digit unsigned numbers a .. d:
[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ] | input 0 | input 1 | input 2 | input 3 |
and 2 eight-digit unsigned numbers a and b:
[ a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ] | input 0 | input 1 |
The conversion is done gradually. We start with an input vector filled with ASCII digits '0'..'9'; each byte of a vector contains single character. For string "1234567890123456" we have:
in = ['1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'|'1'|'2'|'3'|'4'|'5'|'6']
A single subtract _mm_subs_epu8 (psubusb) converts from ASCII into numbers 0..9:
t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 1 | 2 | 3 | 4 | 5 | 6 ]
At this point we converted 16 one-digit numbers. Now, by merging adjacent numbers from t0, we may obtain two-digit numbers. SSSE3 introduced the instruction _mm_maddubs_epi16 (pmaddubsw) which multiplies vectors of bytes and adds intermediate 16-bit values producing a vector of 16-bit values. Thus, when we multiply t0 by a vector filled with weights (1, 10) we get eight two-digit numbers, each saved on a 16-bit field:
t1 = [ 12 | 34 | 56 | 78 | 90 | 12 | 34 | 56 ]
Similarly, these numbers can be merged into four-digit numbers. SSE has the instruction _mm_madd_epi16 (pmaddwd) which performs similar operation as _mm_maddubs_epi16, but works on 16-bit values and yields vector of 32-bit numbers. The other vector is filled with weights (1, 100) and we get four four-digit numbers, each saved on a 32-bit field:
t2 = [ 1234 | 5678 | 9012 | 3456 ]
Four-digit numbers fits into 16-bit values, we might convert the above result into a 16-bit vector using _mm_packus_epi32 (packusdw) and again apply the _mm_madd_epi16 instruction, this time with weights (1, 10000). Then we get two eight-digit numbers:
t3 = [ 12345678 | 90123456 ]
To summarize: conversion from a 16-digit string into two unsigned eight-digit numbers requires only one subtract, three multiply-add instructions and one pack operation.
If we need narrower numbers we simply stop the above procedure at an earlier step; for example, for four-digit numbers the result is t2. Since the intermediate vectors have different element size (8-, 16-, 32-bit) a cast to other vector type might be needed. Widening can be done with the instructions pmovsx (intrinsics _mm_cvtepi16_epi32, _mm_cvtepi8_epi32, etc.); narrowing with the mentioned pack instruction which is also available in different variants.
The same conversion using a naive scalar procedure would require 16 subtracts, 14 multiplications and 14 additions. More advanced SWAR procedures would require fewer multiplications, but still not as few as the SIMD procedure.
The maximum possible value from SSE conversion procedure is ± 99999999. This value is smaller than the maximum that can hold a 32-bit number (231 − 1 = 2147483647) and also absolute value of the minimum (231 = 2147483648). It means that if an output collection uses at least 32-bit numbers, then SSE procedures never cause overflow (signed nor unsigned).
Obviously, for 16-bit outputs, overflow is possible. For such case an extra step would be required, but I do not discuss it in this text.
Below is a sample SSE implementation which shows all described steps.
// 1. convert from ASCII '0' .. '9' to numbers 0 .. 9 const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i t0 = _mm_subs_epu8(input, ascii0); // 2. convert to 2-digit numbers const __m128i mul_1_10 = _mm_setr_epi8(10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1, 10, 1); const __m128i t1 = _mm_maddubs_epi16(t0, mul_1_10); // 3. convert to 4-digit numbers const __m128i mul_1_100 = _mm_setr_epi16(100, 1, 100, 1, 100, 1, 100, 1); const __m128i t2 = _mm_madd_epi16(t1, mul_1_100); // 4a. convert form 32-bit into 16-bit element vector const __m128i t3 = _mm_packus_epi32(t2, t2); // 4. convert to 8-digit numbers const __m128i mul_1_10000 = _mm_setr_epi16(10000, 1, 10000, 1, 10000, 1, 10000, 1); const __m128i t4 = _mm_madd_epi16(t3, mul_1_10000);
The conversion algorithm works in a loop, in each iteration exactly one input vector is loaded from memory and processed.
Since SIMD procedures impose specific layout of digits within a vector, an arbitrary input has to be properly aligned. In order to do this, we need to identify spans of digits in the input and move each span onto certain subarrays of a vector. Let's assume at this point that the input contains either digits or separators (denoted with _).
For instance, the vector with one-digit numbers 1___2_3__4__5_6_ must be transformed into 123456__________. Then an SSE procedure will convert in parallel all the numbers into the array [1, 2, 3, 4, 5, 6]. Likewise, the vector with two-digit numbers _12__34___56_78_ must be transformed into 12345678________ and then the result will be [12, 34, 56, 78].
Let's consider a more complicated case, when a string has numbers with different count of digits. There are a few ways to convert the input _1_2_34_567_89__:
We can see that in order to convert given span combination we need to know:
Obtaining this information seems to be quite complicated, especially when we look at the last example. Fortunately, all parameters can be precalculated. A span combination can be saved as a bit-pattern, where ones represent digits. For example, from vector _1_2_34_567_89__ we get the span pattern 0b0101011011101100 = 0x56ec. A span pattern is used to fetch a record from the precalculated array. The record contains following fields:
The solution with a precalculated array is suitable only for SSE, as span patterns have 16 bits. In cases of AVX2 and AVX512, where vectors are wider, such a table would be simply too large, respectively 232 or 264 entries. Additionally, the AVX2 version of pshufb instruction works on lanes, i.e. 128-bit halves of a vector, thus it is impossible to shuffle all inputs.
But AVX2 and AVX512 instructions still might be used in some parts of algorithms, especially in input validation.
To build the whole precalculated table all 65536 span patterns have to be processed. Processing a pattern consists following steps:
Let's consider span pattern _ddd__d__ddd_dd_. It can be converted using four-digit conversion, so the layout of normalized vector should be:
[ a0 | a1 | a2 | a3 | b0 | b1 | b2 | b3 | c0 | c1 | c2 | c3 | d0 | d1 | d2 | d3 ]
There are four numbers a, b, c and d in the vector; note that the bytes at 0th index hold the most significant digits.
The first number in pattern has three digits at indices 1, 2 and 3. It means that a1 = 1, a2 = 2 and a3 = 3; value of a0 should be zeroed, so a0 = 0x80.
The second number has just one digit at index 6. Thus b3 = 6 and, since the rest digits have to be zero, b0 = b1 = b2 = 0x80.
The third number has three digits at indices 9, 10, 11. Thus c1 = 9, c2 = 10, c3 = 11, and c0 = 0x80.
The last, forth number, has got two digits at indices 14 and 15. Thus d2 = 14, d3 = 15 and the rest have to be zero, so d0 = d1 = 0x80.
The final content of the shuffle_digits vector is:
[ 80 | 1 | 2 | 3 | 80 | 80 | 80 | 6 | 80 | 9 | 10 | 11 | 80 | 80 | 14 | 15 ]
Below are major steps of the algorithm's loop, with snippets from the actual implementation.
const __m128i input = _mm_loadu_si128((const __m128i*)data);
// t0 = input[i] in ['0'..'9] ? 0xff : 0x00 const __m128i t0 = decimal_digits_mask(input); const uint16_t span_mask = _mm_movemask_epi8(t0);
const BlockInfo& bi = blocks[span_mask];
const __m128i shuffle_digits = _mm_loadu_si128((const __m128i*)b.shuffle_digits); const __m128i shuffled = _mm_shuffle_epi8(input, shuffle_digits);
if (b.conversion_routine == Conversion::SSE1Digit) { convert_1digit(shuffled, b.element_count, output); } else if (b.conversion_routine == Conversion::SSE2Digits) { convert_2digits(shuffled, b.element_count, output); } else if (b.conversion_routine == Conversion::SSE4Digits) { convert_4digits(shuffled, b.element_count, output); } else if (b.conversion_routine == Conversion::SSE8Digits) { convert_8digits(shuffled, b.element_count, output); } else { // case unsupported, a scalar code is called here }
data += b.total_skip;
In case of unsigned inputs we need to classify input characters into three categories:
In practise we need two masks: one with positions of digits, another with positions of separators. If the or'ed mask has some zero elements, it means that there are invalid characters. Below is a schema:
const __m128i bytemask_digit = decimal_digits_mask(input); const __m128i bytemask_sep = matcher.get_mask(input); const __m128i bytemask_valid = _mm_or_si128(bytemask_digit, bytemask_sep); const uint16_t valid_mask = _mm_movemask_epi8(bytemask_valid); if (valid_mask != 0xffff) { throw std::logic_error("wrong input"); }
The easiest algorithm implements expression c >= '0' and c <= '9':
__m128i decimal_digits_mask(const __m128i input) { const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i after_ascii9 = _mm_set1_epi8('9' + 1); const __m128i t0 = _mm_cmplt_epi8(input, ascii0); // t1 = (x < '0') const __m128i t1 = _mm_cmplt_epi8(input, after_ascii9); // t0 = (x <= '9') return _mm_andnot_si128(t0, t1); // x <= '9' and x >= '0' }
Following C-expression requires only one subtraction and one comparison:
bool is_digit(char c) { const uint8_t x = c; return (x - '0') <= 9; }
However, it requires unsigned comparison, which is not available in both SSE and AVX2. Only AVX512BW defines such instruction _mm512_cmple_epu8_mask (vcmpgtub).
uint64_t decimal_digits_mask(const __m128i input) { const __m512i ascii0 = _mm512_set1_epi8('0'); const __m512i ascii9 = _mm512_set1_epi8('9'); const __m512i t0 = _mm512_sub_epi8(input, ascii0); return __mm256_cmple_epu8_mask(t0, ascii9); }
Fortunatelly, it's possible to use similar trick for signed bytes, we need just different constants. First, we subtract from the input byte value (ord('0') + 128) & 0xff = 176 & 0xff = 80. This moves codes of '0'..'9' to range -128..-119. Then it's sufficient to compare subtracted value with -118.
__m128i decimal_digits_mask_version2(const __m128i input) { const __m512i t0 = _mm_sub_epi8(input, _mm_set1_epi8(80)); return _mm_cmplt_epi8(t0, _mm_set1_epi8(-118)); }
The most generic SSE implementation works in time proportional to the size of a separators set. Each character from the set is broadcasted into a vector and then compared with the input vector. The resulting byte mask is ored with the previous result.
__m128i get_mask(const __m128i& input, const __m128i& initial) const { __m128i result = initial; for (size_t i=0; i < n + 1; i++) { const __m128i mask = _mm_cmpeq_epi8(letters[i], input); result = _mm_or_si128(result, mask); } return result; }
If the separators set is known at compilation time, then a compiler may unroll the loop.
The SNTI instruction pcmpestrm (_mm_cmpestrm) from SSE4.2 can also be used in same cases. The instruction gets two arguments. One is an input vector. Another one is interpreted either as a set of individual characters (up to 16) or characters ranges (up to 8). The result is a mask for the input characters matching the set (or ranges). In the example below the set variant was used.
__m128i get_mask(const __m128i& input, const __m128i& initial) { const uint8_t mode = _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_UNIT_MASK; return _mm_or_si128(initial, _mm_cmpestrm(set, set_size, input, 16, mode)); }
bytes processed | patterns | % | cumulative % |
---|---|---|---|
0 | 17 | 0.03% | 0.03% |
2 | 32 | 0.05% | 0.07% |
3 | 48 | 0.07% | 0.15% |
4 | 176 | 0.27% | 0.42% |
5 | 360 | 0.55% | 0.97% |
6 | 542 | 0.83% | 1.79% |
7 | 693 | 1.06% | 2.85% |
8 | 1084 | 1.65% | 4.50% |
9 | 1620 | 2.47% | 6.98% |
10 | 2274 | 3.47% | 10.45% |
11 | 3057 | 4.66% | 15.11% |
12 | 4526 | 6.91% | 22.02% |
13 | 6193 | 9.45% | 31.47% |
14 | 8952 | 13.66% | 45.13% |
15 | 12890 | 19.67% | 64.79% |
16 | 23072 | 35.21% | 100.00% |
SSE procedure | usage in parser | converted numbers | average utilization | |||
---|---|---|---|---|---|---|
input size | converted nums | # | % | avg | max | |
N/A | --- | 264 | 0.40% | --- | --- | --- |
1-digit | 16 | 5799 | 8.85% | 3.52 | 8 | 22% |
2-digit | 8 | 21241 | 32.41% | 3.99 | 7 | 49% |
4-digit | 4 | 30924 | 47.18% | 3.48 | 4 | 87% |
8-digit | 2 | 7308 | 11.15% | 1.97 | 2 | 98% |
Similarly to the unsigned conversion, the signed conversion also requires a span pattern. But in this case also the sign characters '+' and '-' are included in the pattern. For this sample string:
input = "_+123_-45_78_-9_"
the span pattern is:
span_mask = _dddd_ddd_dd_dd_
The conversion unfortunately requires two shuffles:
For input:
input = [ '_' | '-' | '1' | '2' | '3' | '_' | '+' | '4' | '5' | '_' | '-' | '6' | '_' | '7' | '8' | '_' ]
The conversion processes four four-digit numbers. After shuffling the digit spans we have:
shuffled = [ '-' '1' '2' '3' | 0x00 '+' '4' '5' | 0x00 0x00 '-' '6' | 0x00 0x00 '7' '8' ]
and the vector of sign characters is:
shuffled_signs = [ '-' '-' '-' '-' | '+' '+' '+' '+' | '-' '-' '-' '-' | '7' '7' '7' '7' ]
The shuffled_signs vector does not necessarily contain only '+' or '-', it might contain also digits, but this is not an error.
Then the conversion works as follows:
From the shuffled vector the sign characters are removed and ASCII characters are converted into values:
[ 0 1 2 3 | 0 0 4 5 | 0 0 0 6 | 0 0 7 8 ]
Thus we end up with a vector of unsigned numbers, and such an input can be converted by the already defined SSE procedures:
[ 123 | 45 | 6 | 78 ]
Finally, negate appropriate elements; we use a well known equation (~x + 1), which is expressed in the SSE code as (x xor (-1) - (-1)). The value -1 is obtained from comparing shuffled_signs with the '-':
negated_mask = (shuffled_signs == packed_byte('-')) [ ff ff ff ff | 00 00 00 00 | ff ff ff ff | 00 00 00 00 ]
or as vector of int32_t:
[ -1 | 0 | -1 | 0 ]
Masking of the sign characters in shuffled vector is done at no cost, as the instruction _mm_subs_epu8 (psubusb) is used to convert from ASCII to numeric values:
const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i t0 = _mm_subs_epu8(input, ascii0);
The instruction performs subtract with saturation, i.e. calculates max(0, a-b). Since the ASCII codes of '-' and '+' are respectively 43 and 45 and the ASCII code of '0' is 48, then the result of such expression is zero for both sign characters.
The characters '+' and '-' can be present only at the beginning of a digit span, i.e. we want to reject inputs like "++12" or "1234-,".
To properly validate the input we need four masks for:
The first phase is similar to the unsigned conversion, i.e. all masks are or'ed together, then zero elements point at invalid characters.
The second phase checks if sign characters are placed properly. For each span combination we need to precalculate the mask of valid sign positions. Then, after or'ing the masks for the '+' and '-' we simply verify if they are placed on valid positions.
In practise the precalculated data has a bit-mask for invalid sign positions, thanks to that validation require just a simple and. The snippet below shows the schema of the second step.
const __m128i ascii_minus = _mm_set1_epi8('-'); const __m128i ascii_plus = _mm_set1_epi8('+'); // ... const __m128i bytemask_plus = _mm_cmpeq_epi8(input, ascii_plus); const __m128i bytemask_minus = _mm_cmpeq_epi8(input, ascii_minus); const __m128i bytemask_sign = _mm_or_si128(bytemask_plus, bytemask_minus); // ... const uint16_t sign_mask = _mm_movemask_epi8(bytemask_sign); const uint16_t span_mask = _mm_movemask_epi8(bytemask_span); const BlockInfo& bi = blocks[span_mask]; if (sign_mask & bi.invalid_sign_mask) { throw std::runtime_error("'+' or '-' at invalid position"); }
AVX512VBMI defines the powerful instruction _mm512_permutex2var_epi8 (vperm2b) which does a lookup in a 128-byte table using as the indices the seven lowest bits of each byte of a ZMM register. Invocation of the intrinsics function is weird:
const __m512i lookup_lo = ... // elements 0..63 const __m512i lookup_hi = ... // and 64..127 of the lookup const __m512i transformed = _mm512_permutex2var_epi8(lookup_lo, input, lookup_hi);
If the set of valid character fits in the standard ASCII character set, i.e. just seven lowest bits are set, then the validation step might quickly reject extended ASCII (7th bit set) and then translate the input using single invocation of _mm512_permutex2var_epi8. When full 8-bit input is allowed, then the instruction has to be called twice with two halves of a 256-byte lookup.
An invocation (or two) of this instruction allows to classify at once all 64 input bytes into required categories:
Category is encoded as a bit at certain position, in sample implementation following schema is used:
If the transformed vector contains zero byte, it means there are invalid characters.
The span pattern is obtained from the most significant bits by the instruction _mm512_movepi8_mask (vpmovb2m); it works exactly the same way as pmovmaskb from SSE. The sign pattern is obtained by the instruction _mm512_test_epi8_mask (vptestmb), by testing bit 7th (mask 0x40).
Below is a snippet from the implementation:
const __m512i classes = _mm512_permutex2var_epi8(class_lo, input, class_hi); if (_mm512_test_epi8_mask(classes, classes) != uint64_t(-1)) { throw std::logic_error("invalid character"); } uint64_t span_mask64 = _mm512_movepi8_mask(classes); uint64_t sign_mask64 = _mm512_test_epi8_mask(classes, _mm512_set1_epi8(int8_t(0x40)));
Below are major steps of the algorithm's loop, with snippets from the actual implementation.
const __m128i input = _mm_loadu_si128((const __m128i*)data);
const __m128i ascii_minus = _mm_set1_epi8('-'); const __m128i ascii_plus = _mm_set1_epi8('+'); // ... const __m128i bytemask_plus = _mm_cmpeq_epi8(input, ascii_plus); const __m128i bytemask_minus = _mm_cmpeq_epi8(input, ascii_minus); const __m128i bytemask_sign = _mm_or_si128(bytemask_plus, bytemask_minus); const __m128i bytemask_digit = decimal_digits_mask(input); const __m128i bytemask_span = _mm_or_si128(bytemask_digit, bytemask_sign); const uint16_t span_mask = _mm_movemask_epi8(bytemask_span); const BlockInfo& bi = block_info[span_mask];
const uint16_t sign_mask = _mm_movemask_epi8(bytemask_sign); if (sign_mask & bi.invalid_sign_mask) { throw std::runtime_error("'+' or '-' at invalid position"); }
const __m128i ascii_minus = _mm_set1_epi8('-'); const __m128i shuffle_digits = _mm_loadu_si128((const __m128i*)bi.shuffle_digits); const __m128i shuffle_signs = _mm_loadu_si128((const __m128i*)bi.shuffle_signs); const __m128i shuffled = _mm_shuffle_epi8(input, shuffle_digits); const __m128i shuffled_signs = _mm_shuffle_epi8(input, shuffle_signs); const __m128i negate_mask = _mm_cmpeq_epi8(shuffled_signs, ascii_minus);
if (bi.conversion_routine == Conversion::SSE1Digit) { convert_1digit(shuffled, bi.element_count, output); } else if (bi.conversion_routine == Conversion::SSE2Digits) { convert_2digits_signed(shuffled, negate_mask, bi.element_count, output); } else if (bi.conversion_routine == Conversion::SSE4Digits) { convert_4digits_signed(shuffled, negate_mask, bi.element_count, output); } else if (bi.conversion_routine == Conversion::SSE8Digits) { convert_8digits_signed(shuffled, negate_mask, bi.element_count, output); } else { // scalar conversion }
As it was stated in Caveats section, a single step of algorithm validates the whole input vector but rarely converts all bytes from the input. Although detecting digits and sign characters might be considered cheap, we saw that validating against an arbitrary set of separators might be really time consuming.
The idea is to separate the validation from the conversion. In case of the unsigned conversion only the span_mask is needed; in case of the signed conversion we need also sign_mask.
In each iteration we process four 16-byte chunks. The result from this step are 64-bit masks. Then, we load 16 bytes into the input vector and use the lower 16-bit part of already calculated mask to get block info for the loaded code.
For unsigned conversion the main loop looks like this:
while (data < end) { const __m128i input = _mm_loadu_si128((const __m128i*)data); // validate input and calculate span_mask const uint16_t span_mask = get_span_mask(input); // obtain block info const BlockInfo& bi = block_info[span_mask]; // convert convert(input, bi); // advance pointer data += bi.total_processed; }
And after applying the proposed change:
while (data < end) { const __m128i input0 = _mm_loadu_si128((const __m128i*)(data + 0*16)); const __m128i input1 = _mm_loadu_si128((const __m128i*)(data + 1*16)); const __m128i input2 = _mm_loadu_si128((const __m128i*)(data + 2*16)); const __m128i input3 = _mm_loadu_si128((const __m128i*)(data + 3*16)); // validate input and calculate 4 x span_mask const uint16_t span_mask0 = get_span_mask(input0); const uint16_t span_mask1 = get_span_mask(input1); const uint16_t span_mask2 = get_span_mask(input2); const uint16_t span_mask3 = get_span_mask(input3); const uint64_t span_mask64 = (uint64_t(span_mask0) << (0*16)) | (uint64_t(span_mask1) << (1*16)) | (uint64_t(span_mask2) << (2*16)) | (uint64_t(span_mask3) << (3*16)); char* ptr = data + 3*16; while (data < ptr) { const uint16_t span_mask = span_mask64 & 0xffff; // obtain block info const BlockInfo& bi = block_info[span_mask]; // read input const __m128i input = load 16-byte from data // convert convert(input, bi); // advance pointer data += bi.total_processed; // shift the mas accordingly span_mask64 >>= bi.total_processed; } }
The problem of this solution is multiple reading the input bytes. Moreover, the looping schema is more complicated; especially in practice, where also scalar fallback have to be considered (it is not shown here).
The idea of separation the validation step from the actual conversion might also be applied for scalar parser. The validation, i.e. building a span_mask is done by SIMD code. Then, each byte of the span_mask is processed by a precompiled code for each of 256 cases.
There are two major advantages:
1. Scalar conversions are called with compile-time constants, thanks to that a compiler can perform as many optimizations as possible.
2. Digits spans that cross byte boundary can be processed seamlessly. When a span ends at the byte's end, a conversion routine converts all characters from the input's chunk and marks that the conversion is not completed.
When the next pattern is processed, the mark is checked: if the first bit of the current pattern is zero, it means that the previously converted number is ready and can be saved. If the first bit is one, it means that the first span is continuation and thus conversion must be carried on from the saved point.
Let's convert following input 24-byte input (the space separates 8-byte chunks):
_12_3_45 6_7890_1 ___99___
The first chunk of input is _12_3_45, i.e. the span_pattern = 0b01101011 = 0x6b. The digit spans are [(1,2), (4,4), (6,7)], following code handles them:
*output++ = convert<2>(data + 1); *output++ = convert<1>(data + 4); val = convert<2>(data + 6); has_val = true;
The template convert is parametrized by the number of digits to convert and gets a pointer to the first character. The important fact is such code doesn't need to validate pointers, sizes etc. All is known in the compile-time.
The next input's chunk is 6_7890_1, the digit spans are [(0,0), (2,5), (7,7)]. A variant of template convert that accepts two parameters simple starts conversion from the saved state (val).
if (has_val) { *output++ = continue<1>(data + 0, val); } else { *output++ = convert<1>(data + 0); } *output++ = convert<4>(data + 2); val = convert<1>(data + 7); has_val = true;
The last, third chunk, is ___99___, and there is just one digit span [(3,4)]:
if (has_val) { *output++ = val; has_val = false; } *output++ = convert<2>(data + 3); has_val = false;
When the input vector contains an array of three-digit numbers, like "123;456;789;123;", we still use four-digit conversion routine. But we might utilize the fact that the most significant digit is always zero.
We use _mm_maddubs_epi16, but with weights 1, 10, 100, 0: two lower digits are converted and the most significant digit is properly scaled. Then simple horizontal add _mm_hadd_epi16 adds the intermediate results.
// t0 = [ 0 | 5 | 7 | 9 | 0 | 1 | 2 | 3 | 0 | 9 | 2 | 5 | 0 | 3 | 2 | 7 ] const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i t0 = _mm_subs_epu8(input, ascii0); // t1 = [ 500 | 79 | 100 | 23 | 900 | 25 | 300 | 27 ] const __m128i mul_all = _mm_setr_epi8(0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1, 0, 100, 10, 1); const __m128i t1 = _mm_maddubs_epi16(t0, mul_all); // t2 = [ 579 | 123 | 925 | 327 | 579 | 123 | 925 | 327 ] const __m128i t2 = _mm_hadd_epi16(t1, t1);
Where two four-digits are going to be converted, a faster approach can be used. Instead of storing each digit on separate bytes, digits are stored on 16-bit values. Then a single _mm_madd_epi16 is used to multiply all digits by weights 1, 10, 100, 1000. Then the result's halves are added with _mm_hadd_epi32 (phaddd).
// input = "12345678" // t0 = [ 0 | 1 | 0 | 2 | 0 | 3 | 0 | 4 | 0 | 5 | 0 | 6 | 0 | 7 | 0 | 8 ] const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i t0 = _mm_subs_epu8(input, ascii0); // t0 = [ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 ] // * [ 1000 | 100 | 10 | 1 | 1000 | 100 | 10 | 1 ] // t1 = [ 1200 | 34 | 5600 | 78 ] const __m128i mul_all = _mm_setr_epi16(1000, 100, 10, 1, 1000, 100, 10, 1); const __m128i t1 = _mm_madd_epi16(t0, mul_all); // t2 = [ 1234 | 5678 | 1234 | 5678 ] const __m128i t2 = _mm_hadd_epi32(t1, t1);
There are two reference scalar implementations:
The handcrafted source:
template <typename INSERTER> void parse_signed(const char* data, size_t size, const char* separators, INSERTER output) { enum State { Separator, Plus, Minus, Digit }; State state = Separator; State prev = Separator; bool negative = false; int32_t number = 0; for (size_t i=0; i < size; i++) { const char c = data[i]; if (c == '+') { state = Plus; } else if (c == '-') { state = Minus; } else if (c >= '0' && c <= '9') { state = Digit; } else if (contains(separators, c)) { state = Separator; } else { throw std::runtime_error("Wrong character (scalar)"); } switch (state) { case Plus: if (prev != Separator) { throw std::runtime_error("Invalid syntax ('+' follows a non-separator character)"); } number = 0; negative = false; break; case Minus: if (prev != Separator) { throw std::runtime_error("Invalid syntax ('-' follows a non-separator character)"); } number = 0; negative = true; break; case Digit: if (prev == Separator) { number = c - '0'; negative = false; } else { number = 10*number + c - '0'; } break; case Separator: if (prev == Digit) { if (negative) { *output = -number; } else { *output = number; } } else if (prev != Separator) { throw std::runtime_error("Invalid syntax ('-' or '+' not followed by any digit)"); } break; } // switch prev = state; } // for if (state == Separator) { if (prev == Digit) { if (negative) { *output = -number; } else { *output = number; } } else if (prev != Separator) { throw std::runtime_error("Invalid syntax ('-' or '+' not followed by any digit)"); } } }
And implementation based on standard functions:
template <typename INSERTER> void parse_signed(const char* data, size_t size, const char* separators, INSERTER output) { char* ptr = const_cast<char*>(data); char* end = ptr + size; char* endptr; while (true) { ptr += strspn(ptr, separators); if (ptr == end) { break; } errno = 0; const long val = std::strtol(ptr, &endptr, 10); // the following check comes from "man 3 strtol" if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN)) || (errno != 0 && val == 0)) { throw std::logic_error("invalid input"); } if (endptr == ptr) { throw std::logic_error("no digits"); } ptr = endptr; *output++ = val; } }
This part presents evaluation of algorithms in terms of running time (performance tests) and also some statistical measurements that might help in understanding run-time properties.
Following parameters of the input data are being changed:
Caveats shows statistics of 1) average processed bytes and 2) SSE utilization. This section presents real numbers gathered during conversion of 1000000 bytes containing 1-8 digit signed numbers.
loops | 90'790 | |
... different span patterns | 1'315 | |
total numbers converted | 162'228 | 100.00% |
conversion handled by scalar fallbacks | 13'601 | 8.38% |
conversions done by SSE procedures | 148'627 | 91.62% |
... unsigned SSE parser | 6'222 | N/A |
... signed SSE parser | 142'405 |
As we see, most conversions were done by the SSE code; moreover, most conversions went through the signed parser, which is slower than unsigned counterpart.
Another important number, which may give a hint about possible cache misses to the block_info structure, is the count of different span patterns. It is 1315 (~2% of the whole table size, 65536), but only 120 of them were processed in 75% iterations. The next section shows this parameter in details.
Below are detailed statistics for SSE routines:
Conversions 1-/2-/4-/8-digits are described in section SSE conversion capabilities, the 3-digit procedure in Appendix A — conversion of three-digit numbers.
procedure | unsigned path | signed path | ||||
---|---|---|---|---|---|---|
calls | converted | conv/call | calls | converted | conv/call | |
1-digit conversion | 325 | 410 | 1.26 | N/A | N/A | N/A |
2-digit conversion | 21 | 48 | 2.29 | 1949 | 3521 | 1.81 |
3-digit conversion | 21 | 43 | 2.05 | 1409 | 2246 | 1.59 |
4-digit conversion | 239 | 581 | 2.43 | 10569 | 26255 | 2.48 |
8-digit conversion | 2887 | 5140 | 1.78 | 59769 | 110383 | 1.85 |
The table below shows how many bytes from the input vector were processed in single iteration. It is pretty close to average numbers showed in Caveats.
bytes processed | patterns | % | cumulative % |
---|---|---|---|
0 | 0 | 0.00% | 0.00% |
1 | 0 | 0.00% | 0.00% |
2 | 197 | 0.26% | 0.26% |
3 | 583 | 0.76% | 1.01% |
4 | 773 | 1.00% | 2.01% |
5 | 1137 | 1.47% | 3.48% |
6 | 1841 | 2.39% | 5.87% |
7 | 2544 | 3.30% | 9.17% |
8 | 5305 | 6.87% | 16.04% |
9 | 9135 | 11.83% | 27.87% |
10 | 7163 | 9.28% | 37.15% |
11 | 8834 | 11.44% | 48.60% |
12 | 10102 | 13.09% | 61.68% |
13 | 9342 | 12.10% | 73.79% |
14 | 8221 | 10.65% | 84.44% |
15 | 6743 | 8.74% | 93.17% |
16 | 5269 | 6.83% | 100.00% |
The lookup table is quite large: in the sample implementation single record occupies 38 bytes, it gives almost 2,5 MiB. This section presents analysis of span_pattern usage during conversion of various inputs. Following parameters are collated:
During a conversion a histogram is build. The histogram contains span_pattern and the number of its occurrences; the histogram is sorted by the frequency. Having such a sorted histogram it is easy to estimate how many distinct patterns appear in most iterations.
Let's analyse sample, made up, histogram:
# | span pattern | frequency | cumulative |
---|---|---|---|
1 | 0xaaaa | 1 | 1 |
2 | 0xbbbb | 1 | 2 |
3 | 0x4444 | 1 | 3 |
4 | 0xdddd | 1 | 4 |
5 | 0xeeee | 2 | 6 |
6 | 0xffff | 2 | 8 |
7 | 0x1111 | 3 | 11 |
8 | 0x3333 | 5 | 16 |
9 | 0x5555 | 10 | 26 |
10 | 0x7777 | 17 | 43 |
There are 10 different patterns that were used in 43 loops. We can read from the table that in 1/4 of iterations (11 of 43) were used even seven patterns. Most of processing uses just three patterns. It means that only a small portion of the large lookup is actually touched.
The tables in following sections have got data processed in this way. Each column contains number of distinct patterns that take part in given percentage of the iterations (25%, 50%, 75%, 95% and all, 100%).
The results for various input sizes are available in the repository, below are shown only 4kB and 64kB inputs.
parameters | distinct span masks count | cycles per byte | branches | cache references | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
< 25% | < 50% | < 75% | < 95% | 100% | min | avg | taken | mispredicted | ratio | count | missed | ratio | |
Fixed length | |||||||||||||
1 digit, single separator character | 32 | 44 | 50 | 52 | 53 | 3.095 | 3.165 | 26085 | 633 | 2.43% | 1531 | 415 | 27.11% |
2 digits, single separator character | 12 | 16 | 18 | 19 | 20 | 3.305 | 3.400 | 25270 | 485 | 1.92% | 1333 | 114 | 8.55% |
3 digits, single separator character | 5 | 7 | 8 | 8 | 9 | 2.533 | 2.584 | 21809 | 247 | 1.13% | 954 | 63 | 6.60% |
4 digits, single separator character | 4 | 5 | 6 | 6 | 7 | 3.193 | 3.231 | 24145 | 336 | 1.39% | 926 | 49 | 5.29% |
5 digits, single separator character | 2 | 3 | 4 | 4 | 5 | 2.648 | 2.680 | 20788 | 241 | 1.16% | 1067 | 24 | 2.25% |
6 digits, single separator character | 2 | 4 | 4 | 4 | 5 | 2.312 | 2.347 | 18289 | 233 | 1.27% | 915 | 19 | 2.08% |
7 digits, single separator character | 2 | 2 | 2 | 2 | 3 | 4.121 | 4.167 | 25682 | 313 | 1.22% | 780 | 6 | 0.77% |
8 digits, single separator character | 1 | 3 | 3 | 3 | 4 | 5.495 | 5.592 | 44188 | 837 | 1.89% | 836 | 25 | 2.99% |
Uniform distribution | |||||||||||||
1 .. 1 digit, single separator character | 42 | 54 | 60 | 62 | 63 | 3.162 | 3.288 | 25992 | 619 | 2.38% | 1618 | 334 | 20.64% |
1 .. 2 digits, single separator character | 71 | 102 | 118 | 126 | 128 | 3.201 | 3.268 | 26561 | 465 | 1.75% | 2118 | 655 | 30.93% |
1 .. 3 digits, single separator character | 72 | 112 | 134 | 146 | 148 | 2.891 | 2.941 | 24393 | 514 | 2.11% | 2522 | 663 | 26.29% |
1 .. 4 digits, single separator character | 87 | 135 | 159 | 170 | 172 | 3.398 | 3.463 | 25831 | 906 | 3.51% | 2480 | 667 | 26.90% |
1 .. 5 digits, single separator character | 92 | 136 | 163 | 177 | 179 | 3.495 | 3.553 | 25865 | 806 | 3.12% | 2509 | 548 | 21.84% |
1 .. 6 digits, single separator character | 90 | 144 | 168 | 178 | 180 | 3.494 | 3.557 | 24943 | 639 | 2.56% | 2591 | 627 | 24.20% |
1 .. 7 digits, single separator character | 86 | 125 | 143 | 151 | 153 | 3.259 | 3.324 | 23730 | 579 | 2.44% | 2137 | 476 | 22.27% |
1 .. 8 digits, single separator character | 95 | 135 | 157 | 166 | 167 | 3.828 | 3.894 | 28964 | 859 | 2.97% | 2074 | 318 | 15.33% |
Gaussian distribution | |||||||||||||
max at 1 digit, single separator character | 76 | 133 | 161 | 174 | 176 | 3.218 | 3.264 | 25990 | 485 | 1.87% | 2658 | 799 | 30.06% |
max at 2 digit, single separator character | 77 | 129 | 157 | 170 | 173 | 3.122 | 3.189 | 24951 | 686 | 2.75% | 2803 | 455 | 16.23% |
max at 3 digit, single separator character | 76 | 103 | 115 | 120 | 122 | 3.192 | 3.246 | 24812 | 799 | 3.22% | 3038 | 256 | 8.43% |
max at 4 digit, single separator character | 43 | 57 | 63 | 65 | 66 | 3.134 | 3.215 | 23754 | 414 | 1.74% | 1650 | 106 | 6.42% |
max at 5 digit, single separator character | 32 | 38 | 41 | 43 | 44 | 2.822 | 2.875 | 21380 | 346 | 1.62% | 1320 | 93 | 7.05% |
max at 6 digit, single separator character | 25 | 30 | 33 | 34 | 35 | 2.887 | 2.935 | 21765 | 496 | 2.28% | 1223 | 79 | 6.46% |
max at 7 digit, single separator character | 20 | 22 | 24 | 24 | 25 | 3.921 | 3.983 | 29234 | 794 | 2.72% | 1096 | 40 | 3.65% |
max at 8 digit, single separator character | 11 | 12 | 13 | 14 | 15 | 4.825 | 4.887 | 37274 | 1015 | 2.72% | 944 | 44 | 4.66% |
Fixed length | |||||||||||||
1 digit, 1 .. 6 separator characters | 65 | 131 | 197 | 223 | 227 | 3.331 | 3.404 | 18979 | 677 | 3.57% | 3111 | 1581 | 50.82% |
2 digits, 1 .. 6 separator characters | 67 | 135 | 175 | 195 | 198 | 3.473 | 3.529 | 20588 | 799 | 3.88% | 2858 | 1287 | 45.03% |
3 digits, 1 .. 6 separator characters | 68 | 119 | 147 | 157 | 158 | 2.747 | 2.806 | 18629 | 481 | 2.58% | 2728 | 1150 | 42.16% |
4 digits, 1 .. 6 separator characters | 69 | 101 | 118 | 124 | 125 | 2.641 | 2.695 | 18672 | 488 | 2.61% | 1919 | 613 | 31.94% |
5 digits, 1 .. 6 separator characters | 53 | 71 | 78 | 81 | 82 | 2.748 | 2.824 | 19009 | 530 | 2.79% | 1638 | 372 | 22.71% |
6 digits, 1 .. 6 separator characters | 43 | 50 | 53 | 55 | 56 | 2.958 | 3.002 | 19989 | 450 | 2.25% | 1304 | 208 | 15.95% |
7 digits, 1 .. 6 separator characters | 12 | 14 | 17 | 18 | 19 | 3.316 | 3.341 | 20881 | 300 | 1.44% | 1058 | 52 | 4.91% |
8 digits, 1 .. 6 separator characters | 33 | 46 | 54 | 57 | 58 | 4.302 | 4.346 | 34811 | 728 | 2.09% | 1377 | 206 | 14.96% |
Uniform distribution | |||||||||||||
1 .. 1 digit, 1 .. 6 separator characters | 66 | 132 | 198 | 232 | 237 | 3.197 | 3.268 | 19319 | 685 | 3.55% | 2972 | 1273 | 42.83% |
1 .. 2 digits, 1 .. 6 separator characters | 66 | 132 | 198 | 238 | 243 | 3.025 | 3.081 | 19752 | 842 | 4.26% | 3186 | 1247 | 39.14% |
1 .. 3 digits, 1 .. 6 separator characters | 67 | 135 | 203 | 233 | 238 | 3.070 | 3.150 | 19453 | 743 | 3.82% | 3357 | 1216 | 36.22% |
1 .. 4 digits, 1 .. 6 separator characters | 70 | 140 | 210 | 250 | 256 | 3.390 | 3.477 | 19640 | 897 | 4.57% | 3586 | 1226 | 34.19% |
1 .. 5 digits, 1 .. 6 separator characters | 71 | 142 | 213 | 242 | 247 | 3.338 | 3.401 | 19580 | 823 | 4.20% | 3170 | 940 | 29.65% |
1 .. 6 digits, 1 .. 6 separator characters | 71 | 143 | 207 | 233 | 237 | 3.288 | 3.375 | 19457 | 785 | 4.03% | 2916 | 807 | 27.67% |
1 .. 7 digits, 1 .. 6 separator characters | 74 | 148 | 208 | 228 | 230 | 3.304 | 3.375 | 19268 | 761 | 3.95% | 2985 | 904 | 30.28% |
1 .. 8 digits, 1 .. 6 separator characters | 79 | 159 | 209 | 227 | 230 | 3.692 | 3.766 | 23045 | 881 | 3.82% | 2946 | 727 | 24.68% |
Gaussian distribution | |||||||||||||
max at 1 digit, 1 .. 6 separator characters | 66 | 133 | 200 | 244 | 249 | 3.154 | 3.216 | 19347 | 900 | 4.65% | 3515 | 1060 | 30.16% |
max at 2 digit, 1 .. 6 separator characters | 67 | 135 | 203 | 241 | 246 | 3.155 | 3.229 | 19299 | 786 | 4.07% | 3340 | 892 | 26.71% |
max at 3 digit, 1 .. 6 separator characters | 69 | 139 | 205 | 229 | 233 | 3.202 | 3.278 | 19382 | 804 | 4.15% | 3188 | 661 | 20.73% |
max at 4 digit, 1 .. 6 separator characters | 70 | 141 | 180 | 198 | 201 | 2.914 | 2.978 | 18669 | 611 | 3.27% | 2775 | 692 | 24.94% |
max at 5 digit, 1 .. 6 separator characters | 74 | 131 | 156 | 164 | 166 | 3.147 | 3.217 | 19276 | 586 | 3.04% | 2291 | 552 | 24.09% |
max at 6 digit, 1 .. 6 separator characters | 70 | 93 | 101 | 105 | 106 | 3.196 | 3.257 | 20806 | 493 | 2.37% | 1719 | 237 | 13.79% |
max at 7 digit, 1 .. 6 separator characters | 64 | 80 | 87 | 91 | 92 | 3.597 | 3.646 | 24974 | 622 | 2.49% | 1599 | 251 | 15.70% |
max at 8 digit, 1 .. 6 separator characters | 56 | 74 | 82 | 87 | 88 | 3.994 | 4.055 | 29874 | 848 | 2.84% | 1587 | 231 | 14.56% |
parameters | distinct span masks count | cycles per byte | branches | cache references | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
< 25% | < 50% | < 75% | < 95% | 100% | min | avg | taken | mispredicted | ratio | count | missed | ratio | |
Fixed length | |||||||||||||
1 digit, single separator character | 61 | 74 | 81 | 84 | 85 | 3.619 | 3.655 | 393175 | 6529 | 1.66% | 13469 | 974 | 7.23% |
2 digits, single separator character | 13 | 17 | 20 | 21 | 22 | 3.727 | 3.754 | 374164 | 3451 | 0.92% | 11212 | 286 | 2.55% |
3 digits, single separator character | 4 | 6 | 8 | 8 | 9 | 2.839 | 2.852 | 328065 | 829 | 0.25% | 7539 | 101 | 1.34% |
4 digits, single separator character | 4 | 5 | 6 | 6 | 7 | 3.495 | 3.507 | 361605 | 2429 | 0.67% | 7148 | 89 | 1.25% |
5 digits, single separator character | 1 | 2 | 3 | 3 | 4 | 2.921 | 2.935 | 313419 | 849 | 0.27% | 7335 | 119 | 1.62% |
6 digits, single separator character | 1 | 2 | 3 | 3 | 4 | 2.545 | 2.555 | 272684 | 741 | 0.27% | 8704 | 58 | 0.67% |
7 digits, single separator character | 1 | 2 | 2 | 2 | 3 | 4.407 | 4.417 | 386099 | 2206 | 0.57% | 5224 | 27 | 0.52% |
8 digits, single separator character | 2 | 3 | 4 | 4 | 5 | 7.094 | 7.135 | 666950 | 13453 | 2.02% | 5192 | 75 | 1.44% |
Uniform distribution | |||||||||||||
1 .. 1 digit, single separator character | 59 | 72 | 80 | 82 | 83 | 3.627 | 3.693 | 393231 | 6582 | 1.67% | 14582 | 973 | 6.67% |
1 .. 2 digits, single separator character | 236 | 280 | 299 | 308 | 310 | 3.908 | 3.935 | 397134 | 5214 | 1.31% | 19011 | 2176 | 11.45% |
1 .. 3 digits, single separator character | 355 | 420 | 450 | 466 | 469 | 3.565 | 3.590 | 361819 | 4669 | 1.29% | 21347 | 1858 | 8.70% |
1 .. 4 digits, single separator character | 416 | 493 | 523 | 537 | 540 | 5.329 | 5.387 | 391117 | 12854 | 3.29% | 24169 | 1621 | 6.71% |
1 .. 5 digits, single separator character | 444 | 510 | 543 | 556 | 558 | 5.270 | 5.328 | 390345 | 11393 | 2.92% | 19020 | 1568 | 8.24% |
1 .. 6 digits, single separator character | 423 | 492 | 525 | 535 | 537 | 4.589 | 4.633 | 372174 | 7872 | 2.12% | 19917 | 1282 | 6.44% |
1 .. 7 digits, single separator character | 392 | 462 | 481 | 491 | 493 | 4.174 | 4.218 | 355907 | 6263 | 1.76% | 15979 | 949 | 5.94% |
1 .. 8 digits, single separator character | 519 | 598 | 620 | 630 | 631 | 5.217 | 5.273 | 423108 | 11289 | 2.67% | 16635 | 695 | 4.18% |
Gaussian distribution | |||||||||||||
max at 1 digit, single separator character | 481 | 587 | 630 | 645 | 648 | 4.150 | 4.189 | 395482 | 6382 | 1.61% | 25467 | 1620 | 6.36% |
max at 2 digit, single separator character | 454 | 534 | 570 | 586 | 589 | 4.104 | 4.143 | 371930 | 6882 | 1.85% | 25775 | 739 | 2.87% |
max at 3 digit, single separator character | 300 | 330 | 344 | 350 | 352 | 4.990 | 5.050 | 375961 | 11571 | 3.08% | 15877 | 351 | 2.21% |
max at 4 digit, single separator character | 132 | 147 | 153 | 155 | 156 | 3.760 | 3.814 | 356129 | 4522 | 1.27% | 10634 | 220 | 2.07% |
max at 5 digit, single separator character | 84 | 90 | 93 | 95 | 96 | 3.031 | 3.045 | 318232 | 1548 | 0.49% | 8123 | 177 | 2.18% |
max at 6 digit, single separator character | 54 | 59 | 62 | 64 | 65 | 3.479 | 3.499 | 328097 | 6041 | 1.84% | 8491 | 366 | 4.31% |
max at 7 digit, single separator character | 36 | 39 | 41 | 41 | 42 | 5.520 | 5.560 | 444157 | 11932 | 2.69% | 5814 | 124 | 2.13% |
max at 8 digit, single separator character | 22 | 23 | 24 | 25 | 26 | 6.897 | 6.946 | 562307 | 15610 | 2.78% | 5449 | 120 | 2.20% |
Fixed length | |||||||||||||
1 digit, 1 .. 6 separator characters | 1052 | 1626 | 1870 | 1950 | 1958 | 4.371 | 4.408 | 280108 | 7580 | 2.71% | 34860 | 8289 | 23.78% |
2 digits, 1 .. 6 separator characters | 693 | 909 | 995 | 1026 | 1029 | 5.044 | 5.108 | 308039 | 12071 | 3.92% | 26297 | 4481 | 17.04% |
3 digits, 1 .. 6 separator characters | 354 | 438 | 480 | 494 | 496 | 3.334 | 3.358 | 273155 | 3865 | 1.41% | 17568 | 2647 | 15.07% |
4 digits, 1 .. 6 separator characters | 177 | 224 | 246 | 252 | 254 | 3.311 | 3.335 | 277306 | 4944 | 1.78% | 10967 | 1361 | 12.41% |
5 digits, 1 .. 6 separator characters | 100 | 126 | 134 | 137 | 138 | 3.648 | 3.677 | 284172 | 7099 | 2.50% | 7727 | 742 | 9.60% |
6 digits, 1 .. 6 separator characters | 69 | 77 | 80 | 82 | 83 | 3.675 | 3.703 | 297939 | 5210 | 1.75% | 6470 | 402 | 6.21% |
7 digits, 1 .. 6 separator characters | 46 | 49 | 51 | 53 | 54 | 3.662 | 3.673 | 315611 | 2311 | 0.73% | 6077 | 308 | 5.07% |
8 digits, 1 .. 6 separator characters | 30 | 45 | 54 | 58 | 59 | 5.735 | 5.782 | 532039 | 10908 | 2.05% | 6292 | 438 | 6.96% |
Uniform distribution | |||||||||||||
1 .. 1 digit, 1 .. 6 separator characters | 1053 | 1643 | 1896 | 1973 | 1981 | 4.364 | 4.396 | 279251 | 7676 | 2.75% | 34478 | 6144 | 17.82% |
1 .. 2 digits, 1 .. 6 separator characters | 1066 | 1792 | 2104 | 2212 | 2221 | 5.090 | 5.154 | 288339 | 12133 | 4.21% | 39995 | 5781 | 14.45% |
1 .. 3 digits, 1 .. 6 separator characters | 1084 | 1802 | 2120 | 2226 | 2236 | 4.974 | 5.031 | 286026 | 10029 | 3.51% | 44877 | 5187 | 11.56% |
1 .. 4 digits, 1 .. 6 separator characters | 1120 | 1808 | 2084 | 2173 | 2182 | 5.332 | 5.395 | 287345 | 11700 | 4.07% | 42918 | 3871 | 9.02% |
1 .. 5 digits, 1 .. 6 separator characters | 1141 | 1713 | 1959 | 2031 | 2039 | 5.078 | 5.153 | 287785 | 10589 | 3.68% | 40253 | 2950 | 7.33% |
1 .. 6 digits, 1 .. 6 separator characters | 1142 | 1617 | 1803 | 1856 | 1860 | 4.888 | 4.960 | 286913 | 10124 | 3.53% | 35625 | 2229 | 6.26% |
1 .. 7 digits, 1 .. 6 separator characters | 1091 | 1503 | 1646 | 1678 | 1681 | 4.970 | 5.033 | 287420 | 11059 | 3.85% | 30932 | 3218 | 10.40% |
1 .. 8 digits, 1 .. 6 separator characters | 1109 | 1476 | 1598 | 1619 | 1622 | 5.650 | 5.711 | 339596 | 14223 | 4.19% | 28870 | 2243 | 7.77% |
Gaussian distribution | |||||||||||||
max at 1 digit, 1 .. 6 separator characters | 1070 | 1975 | 2405 | 2551 | 2565 | 5.389 | 5.456 | 285670 | 12734 | 4.46% | 46840 | 6140 | 13.11% |
max at 2 digit, 1 .. 6 separator characters | 1094 | 1862 | 2174 | 2283 | 2293 | 5.125 | 5.177 | 288202 | 10818 | 3.75% | 43779 | 3228 | 7.37% |
max at 3 digit, 1 .. 6 separator characters | 1021 | 1440 | 1610 | 1666 | 1672 | 4.927 | 4.986 | 284545 | 10542 | 3.70% | 34647 | 1690 | 4.88% |
max at 4 digit, 1 .. 6 separator characters | 715 | 905 | 985 | 1010 | 1013 | 4.072 | 4.129 | 280545 | 8310 | 2.96% | 20920 | 931 | 4.45% |
max at 5 digit, 1 .. 6 separator characters | 445 | 534 | 567 | 575 | 577 | 4.127 | 4.167 | 286377 | 8569 | 2.99% | 14319 | 651 | 4.55% |
max at 6 digit, 1 .. 6 separator characters | 296 | 336 | 346 | 351 | 352 | 4.025 | 4.054 | 312734 | 6394 | 2.04% | 11762 | 402 | 3.42% |
max at 7 digit, 1 .. 6 separator characters | 185 | 207 | 214 | 219 | 220 | 4.481 | 4.520 | 376654 | 8042 | 2.14% | 7924 | 452 | 5.70% |
max at 8 digit, 1 .. 6 separator characters | 110 | 133 | 141 | 146 | 148 | 5.532 | 5.576 | 448317 | 12621 | 2.82% | 7779 | 821 | 10.55% |
Following procedures are tested:
There are two tests approaches:
Overall, where rather huge input data is parsed (a few megabytes) and the wall clock is used to measure the time of the whole procedure. Procedures are repeated several times (10, 100, 1000 or more), the minimum running time is noted.
To run these tests type make report-overall.rst; the tests suite uses executable bin/benchmark.
The results from various computers are available in the repository.
Microbenchmarks, where small inputs are parsed and exact number of CPU cycles is obtained. Procedures are also repeated many times (1000) and then minumum and average cycles to process a single input byte are calculated.
To run these tests type make microbenchmarks.rst; the tests suite uses executable bin/benchmark-cpuclocks.
The all results from various computers are available in the repository, below are only the measurements from the Skylake computer.
Compiler: GCC 7.3.0
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
The table below compares speedup to the reference procedure for various input sizes and data distributions. Detailed measurements for each size and different distribution parameters are presented in the subsequent sections.
speedup over scalar procedure | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
scalar (std) | scalar (hybrid) | SSE | SSE (block) | SSE (simplified) | |||||||||||||
size [B] | distribution | samples | min | avg | max | min | avg | max | min | avg | max | min | avg | max | min | avg | max |
1024 | Gaussian distribution | 16 | 0.45 | 0.87 | 1.29 | 1.65 | 2.36 | 3.26 | 1.53 | 3.21 | 5.44 | 1.66 | 3.16 | 4.97 | 1.58 | 3.37 | 5.78 |
1024 | Fixed length | 16 | 0.42 | 0.87 | 1.29 | 1.63 | 2.41 | 3.53 | 1.28 | 3.32 | 5.23 | 1.47 | 3.30 | 5.05 | 1.30 | 3.54 | 5.79 |
1024 | Uniform distribution | 16 | 0.40 | 0.86 | 1.32 | 1.64 | 2.41 | 3.36 | 2.04 | 3.70 | 5.54 | 2.22 | 3.53 | 4.97 | 2.13 | 3.88 | 5.79 |
4096 | Gaussian distribution | 16 | 0.39 | 0.69 | 1.24 | 1.69 | 3.01 | 4.67 | 1.91 | 4.56 | 7.21 | 2.13 | 4.88 | 7.84 | 2.01 | 4.77 | 7.68 |
4096 | Fixed length | 16 | 0.36 | 0.69 | 1.39 | 1.94 | 2.87 | 4.32 | 1.60 | 4.44 | 6.86 | 1.71 | 4.66 | 8.34 | 1.62 | 4.70 | 7.83 |
4096 | Uniform distribution | 16 | 0.43 | 0.78 | 1.38 | 2.07 | 3.24 | 4.55 | 3.19 | 5.33 | 7.40 | 3.36 | 5.61 | 7.87 | 3.27 | 5.61 | 7.79 |
65536 | Gaussian distribution | 16 | 0.45 | 0.77 | 1.30 | 1.40 | 1.74 | 2.13 | 1.73 | 4.46 | 5.88 | 1.76 | 4.57 | 6.00 | 1.77 | 4.60 | 6.25 |
65536 | Fixed length | 16 | 0.40 | 0.75 | 1.36 | 1.55 | 1.87 | 2.37 | 1.46 | 4.68 | 6.72 | 1.48 | 5.00 | 7.31 | 1.50 | 4.95 | 7.53 |
65536 | Uniform distribution | 16 | 0.55 | 0.86 | 1.36 | 1.38 | 1.75 | 2.34 | 3.32 | 4.98 | 6.39 | 3.29 | 5.01 | 6.52 | 3.43 | 5.20 | 6.89 |
102400 | Gaussian distribution | 16 | 0.46 | 0.77 | 1.26 | 1.38 | 1.71 | 2.03 | 1.78 | 4.28 | 5.66 | 1.81 | 4.44 | 5.81 | 1.82 | 4.42 | 6.00 |
102400 | Fixed length | 16 | 0.40 | 0.75 | 1.36 | 1.53 | 1.84 | 2.31 | 1.29 | 4.59 | 6.49 | 1.46 | 4.91 | 6.99 | 1.31 | 4.85 | 7.16 |
102400 | Uniform distribution | 16 | 0.54 | 0.86 | 1.38 | 1.35 | 1.71 | 2.32 | 3.22 | 4.78 | 6.38 | 3.21 | 4.83 | 6.43 | 3.31 | 4.97 | 6.85 |
Gaussian distribution (max at 8 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 8.782 | 9.103 | 1.00 |
scalar (std) | 22.874 | 23.586 | 0.38 |
scalar (hybrid) | 5.005 | 5.148 | 1.75 |
SSE | 4.898 | 4.974 | 1.79 |
SSE (block) | 4.603 | 4.651 | 1.91 |
SSE (simplified) | 4.687 | 4.730 | 1.87 |
Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.729 | 13.130 | 1.00 |
scalar (std) | 17.501 | 18.150 | 0.73 |
scalar (hybrid) | 4.067 | 4.273 | 3.13 |
SSE | 4.002 | 4.069 | 3.18 |
SSE (block) | 3.903 | 3.945 | 3.26 |
SSE (simplified) | 3.914 | 3.970 | 3.25 |
Gaussian distribution (max at 7 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 8.836 | 9.191 | 1.00 |
scalar (std) | 23.365 | 24.139 | 0.38 |
scalar (hybrid) | 5.104 | 5.250 | 1.73 |
SSE | 4.129 | 4.195 | 2.14 |
SSE (block) | 3.476 | 3.518 | 2.54 |
SSE (simplified) | 3.975 | 4.039 | 2.22 |
Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.646 | 14.115 | 1.00 |
scalar (std) | 18.029 | 18.596 | 0.76 |
scalar (hybrid) | 4.042 | 4.274 | 3.38 |
SSE | 3.591 | 3.644 | 3.80 |
SSE (block) | 3.294 | 3.334 | 4.14 |
SSE (simplified) | 3.431 | 3.481 | 3.98 |
Gaussian distribution (max at 6 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 9.434 | 9.852 | 1.00 |
scalar (std) | 24.766 | 25.547 | 0.38 |
scalar (hybrid) | 5.571 | 5.728 | 1.69 |
SSE | 2.833 | 2.882 | 3.33 |
SSE (block) | 2.977 | 3.024 | 3.17 |
SSE (simplified) | 2.738 | 2.786 | 3.45 |
Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.200 | 14.750 | 1.00 |
scalar (std) | 17.992 | 18.563 | 0.79 |
scalar (hybrid) | 4.091 | 4.330 | 3.47 |
SSE | 3.207 | 3.266 | 4.43 |
SSE (block) | 2.885 | 2.932 | 4.92 |
SSE (simplified) | 3.058 | 3.113 | 4.64 |
Gaussian distribution (max at 5 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 10.384 | 10.864 | 1.00 |
scalar (std) | 25.763 | 26.495 | 0.40 |
scalar (hybrid) | 6.102 | 6.273 | 1.70 |
SSE | 2.786 | 2.827 | 3.73 |
SSE (block) | 2.710 | 2.744 | 3.83 |
SSE (simplified) | 2.699 | 2.752 | 3.85 |
Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.168 | 15.622 | 1.00 |
scalar (std) | 18.088 | 18.872 | 0.84 |
scalar (hybrid) | 4.244 | 4.442 | 3.57 |
SSE | 3.023 | 3.096 | 5.02 |
SSE (block) | 2.864 | 2.934 | 5.30 |
SSE (simplified) | 2.908 | 2.984 | 5.22 |
Gaussian distribution (max at 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.091 | 18.877 | 1.00 |
scalar (std) | 34.484 | 35.258 | 0.52 |
scalar (hybrid) | 7.037 | 7.275 | 2.57 |
SSE | 3.160 | 3.213 | 5.72 |
SSE (block) | 3.282 | 3.352 | 5.51 |
SSE (simplified) | 3.120 | 3.194 | 5.80 |
Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.390 | 22.112 | 1.00 |
scalar (std) | 18.566 | 19.088 | 1.15 |
scalar (hybrid) | 5.160 | 5.286 | 4.15 |
SSE | 3.067 | 3.126 | 6.97 |
SSE (block) | 2.904 | 2.963 | 7.37 |
SSE (simplified) | 2.961 | 3.022 | 7.22 |
Gaussian distribution (max at 4 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.966 | 12.550 | 1.00 |
scalar (std) | 26.982 | 27.705 | 0.44 |
scalar (hybrid) | 5.896 | 6.082 | 2.03 |
SSE | 3.234 | 3.284 | 3.70 |
SSE (block) | 3.026 | 3.074 | 3.95 |
SSE (simplified) | 3.117 | 3.183 | 3.84 |
Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.371 | 16.967 | 1.00 |
scalar (std) | 17.832 | 18.445 | 0.92 |
scalar (hybrid) | 4.173 | 4.395 | 3.92 |
SSE | 2.909 | 3.001 | 5.63 |
SSE (block) | 2.760 | 2.813 | 5.93 |
SSE (simplified) | 2.848 | 2.908 | 5.75 |
Gaussian distribution (max at 3 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.280 | 13.948 | 1.00 |
scalar (std) | 29.238 | 30.157 | 0.45 |
scalar (hybrid) | 6.225 | 6.450 | 2.13 |
SSE | 3.225 | 3.309 | 4.12 |
SSE (block) | 2.963 | 3.012 | 4.48 |
SSE (simplified) | 3.238 | 3.286 | 4.10 |
Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.193 | 18.831 | 1.00 |
scalar (std) | 18.642 | 19.287 | 0.98 |
scalar (hybrid) | 4.119 | 4.246 | 4.42 |
SSE | 3.126 | 3.190 | 5.82 |
SSE (block) | 3.016 | 3.090 | 6.03 |
SSE (simplified) | 3.000 | 3.048 | 6.06 |
Gaussian distribution (max at 2 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.942 | 16.733 | 1.00 |
scalar (std) | 32.551 | 33.334 | 0.49 |
scalar (hybrid) | 6.372 | 6.605 | 2.50 |
SSE | 3.130 | 3.190 | 5.09 |
SSE (block) | 3.041 | 3.121 | 5.24 |
SSE (simplified) | 3.020 | 3.082 | 5.28 |
Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.088 | 20.800 | 1.00 |
scalar (std) | 18.382 | 19.013 | 1.09 |
scalar (hybrid) | 4.574 | 4.835 | 4.39 |
SSE | 3.184 | 3.264 | 6.31 |
SSE (block) | 3.018 | 3.069 | 6.66 |
SSE (simplified) | 2.988 | 3.042 | 6.72 |
Uniform distribution (1 .. 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.677 | 19.206 | 1.00 |
scalar (std) | 34.479 | 35.207 | 0.54 |
scalar (hybrid) | 7.953 | 8.184 | 2.35 |
SSE | 3.014 | 3.099 | 6.20 |
SSE (block) | 3.117 | 3.172 | 5.99 |
SSE (simplified) | 2.977 | 3.062 | 6.27 |
Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.981 | 22.621 | 1.00 |
scalar (std) | 16.860 | 17.391 | 1.30 |
scalar (hybrid) | 5.285 | 5.481 | 4.16 |
SSE | 3.029 | 3.085 | 7.26 |
SSE (block) | 2.942 | 3.000 | 7.47 |
SSE (simplified) | 3.099 | 3.179 | 7.09 |
Uniform distribution (1 .. 2 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.695 | 18.444 | 1.00 |
scalar (std) | 34.917 | 35.633 | 0.51 |
scalar (hybrid) | 7.245 | 7.401 | 2.44 |
SSE | 3.324 | 3.387 | 5.32 |
SSE (block) | 3.323 | 3.398 | 5.33 |
SSE (simplified) | 3.236 | 3.321 | 5.47 |
Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.164 | 21.860 | 1.00 |
scalar (std) | 18.567 | 19.126 | 1.14 |
scalar (hybrid) | 4.937 | 5.156 | 4.29 |
SSE | 2.940 | 2.995 | 7.20 |
SSE (block) | 2.924 | 2.975 | 7.24 |
SSE (simplified) | 2.785 | 2.826 | 7.60 |
Uniform distribution (1 .. 3 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.385 | 17.254 | 1.00 |
scalar (std) | 33.359 | 34.075 | 0.49 |
scalar (hybrid) | 6.877 | 7.145 | 2.38 |
SSE | 2.892 | 2.947 | 5.67 |
SSE (block) | 3.088 | 3.158 | 5.31 |
SSE (simplified) | 2.752 | 2.844 | 5.95 |
Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.530 | 21.221 | 1.00 |
scalar (std) | 18.995 | 19.580 | 1.08 |
scalar (hybrid) | 4.647 | 4.811 | 4.42 |
SSE | 3.037 | 3.089 | 6.76 |
SSE (block) | 2.941 | 2.992 | 6.98 |
SSE (simplified) | 2.914 | 2.972 | 7.05 |
Uniform distribution (1 .. 4 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.926 | 15.613 | 1.00 |
scalar (std) | 31.275 | 32.210 | 0.48 |
scalar (hybrid) | 6.445 | 6.677 | 2.32 |
SSE | 3.393 | 3.458 | 4.40 |
SSE (block) | 3.126 | 3.184 | 4.77 |
SSE (simplified) | 3.293 | 3.344 | 4.53 |
Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.047 | 19.920 | 1.00 |
scalar (std) | 19.500 | 20.130 | 0.98 |
scalar (hybrid) | 4.539 | 4.723 | 4.20 |
SSE | 3.325 | 3.383 | 5.73 |
SSE (block) | 3.132 | 3.193 | 6.08 |
SSE (simplified) | 3.182 | 3.235 | 5.99 |
Uniform distribution (1 .. 5 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.748 | 14.405 | 1.00 |
scalar (std) | 30.105 | 30.941 | 0.46 |
scalar (hybrid) | 6.144 | 6.413 | 2.24 |
SSE | 3.475 | 3.533 | 3.96 |
SSE (block) | 3.171 | 3.238 | 4.34 |
SSE (simplified) | 3.293 | 3.356 | 4.17 |
Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.297 | 19.010 | 1.00 |
scalar (std) | 19.209 | 19.845 | 0.95 |
scalar (hybrid) | 4.477 | 4.652 | 4.09 |
SSE | 3.293 | 3.352 | 5.56 |
SSE (block) | 3.135 | 3.188 | 5.84 |
SSE (simplified) | 3.182 | 3.230 | 5.75 |
Uniform distribution (1 .. 6 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.912 | 13.548 | 1.00 |
scalar (std) | 29.511 | 30.337 | 0.44 |
scalar (hybrid) | 5.958 | 6.253 | 2.17 |
SSE | 3.430 | 3.487 | 3.76 |
SSE (block) | 3.201 | 3.265 | 4.03 |
SSE (simplified) | 3.283 | 3.337 | 3.93 |
Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.288 | 17.956 | 1.00 |
scalar (std) | 19.250 | 19.841 | 0.90 |
scalar (hybrid) | 4.356 | 4.481 | 3.97 |
SSE | 3.205 | 3.276 | 5.39 |
SSE (block) | 2.990 | 3.046 | 5.78 |
SSE (simplified) | 3.116 | 3.172 | 5.55 |
Uniform distribution (1 .. 7 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.974 | 12.552 | 1.00 |
scalar (std) | 27.884 | 28.562 | 0.43 |
scalar (hybrid) | 6.214 | 6.385 | 1.93 |
SSE | 3.253 | 3.309 | 3.68 |
SSE (block) | 3.057 | 3.100 | 3.92 |
SSE (simplified) | 3.119 | 3.167 | 3.84 |
Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.583 | 17.366 | 1.00 |
scalar (std) | 18.947 | 19.528 | 0.88 |
scalar (hybrid) | 4.265 | 4.401 | 3.89 |
SSE | 3.248 | 3.311 | 5.11 |
SSE (block) | 2.989 | 3.048 | 5.55 |
SSE (simplified) | 3.165 | 3.215 | 5.24 |
Uniform distribution (1 .. 8 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.637 | 12.165 | 1.00 |
scalar (std) | 27.107 | 27.851 | 0.43 |
scalar (hybrid) | 6.079 | 6.231 | 1.91 |
SSE | 3.821 | 3.870 | 3.05 |
SSE (block) | 3.492 | 3.550 | 3.33 |
SSE (simplified) | 3.688 | 3.745 | 3.16 |
Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 15.739 | 16.589 | 1.00 |
scalar (std) | 18.875 | 19.563 | 0.83 |
scalar (hybrid) | 4.335 | 4.463 | 3.63 |
SSE | 3.611 | 3.667 | 4.36 |
SSE (block) | 3.337 | 3.400 | 4.72 |
SSE (simplified) | 3.490 | 3.538 | 4.51 |
Gaussian distribution (max at 8 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 11.852 | 11.971 | 1.00 |
scalar (std) | 26.589 | 26.792 | 0.45 |
scalar (hybrid) | 8.336 | 8.411 | 1.42 |
SSE | 6.911 | 6.956 | 1.71 |
SSE (block) | 6.921 | 6.990 | 1.71 |
SSE (simplified) | 6.757 | 6.802 | 1.75 |
Gaussian distribution (max at 8 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.008 | 17.110 | 1.00 |
scalar (std) | 21.492 | 21.653 | 0.79 |
scalar (hybrid) | 8.722 | 8.800 | 1.95 |
SSE | 5.650 | 5.692 | 3.01 |
SSE (block) | 5.679 | 5.733 | 2.99 |
SSE (simplified) | 5.426 | 5.470 | 3.13 |
Gaussian distribution (max at 7 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 12.818 | 12.929 | 1.00 |
scalar (std) | 27.335 | 27.609 | 0.47 |
scalar (hybrid) | 9.246 | 9.315 | 1.39 |
SSE | 5.513 | 5.542 | 2.33 |
SSE (block) | 5.259 | 5.296 | 2.44 |
SSE (simplified) | 5.364 | 5.405 | 2.39 |
Gaussian distribution (max at 7 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.978 | 18.119 | 1.00 |
scalar (std) | 21.924 | 22.109 | 0.82 |
scalar (hybrid) | 9.205 | 9.286 | 1.95 |
SSE | 4.537 | 4.591 | 3.96 |
SSE (block) | 4.351 | 4.388 | 4.13 |
SSE (simplified) | 4.352 | 4.407 | 4.13 |
Gaussian distribution (max at 6 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 13.772 | 13.862 | 1.00 |
scalar (std) | 28.295 | 28.596 | 0.49 |
scalar (hybrid) | 10.005 | 10.079 | 1.38 |
SSE | 3.456 | 3.479 | 3.98 |
SSE (block) | 3.358 | 3.385 | 4.10 |
SSE (simplified) | 3.379 | 3.402 | 4.08 |
Gaussian distribution (max at 6 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.374 | 19.538 | 1.00 |
scalar (std) | 22.288 | 22.489 | 0.87 |
scalar (hybrid) | 9.660 | 9.730 | 2.01 |
SSE | 3.949 | 3.983 | 4.91 |
SSE (block) | 3.726 | 3.751 | 5.20 |
SSE (simplified) | 3.870 | 3.918 | 5.01 |
Gaussian distribution (max at 5 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 14.989 | 15.092 | 1.00 |
scalar (std) | 29.851 | 30.089 | 0.50 |
scalar (hybrid) | 10.750 | 10.842 | 1.39 |
SSE | 2.933 | 2.945 | 5.11 |
SSE (block) | 2.807 | 2.822 | 5.34 |
SSE (simplified) | 2.866 | 2.878 | 5.23 |
Gaussian distribution (max at 5 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 20.729 | 20.888 | 1.00 |
scalar (std) | 22.536 | 22.718 | 0.92 |
scalar (hybrid) | 10.225 | 10.311 | 2.03 |
SSE | 4.032 | 4.086 | 5.14 |
SSE (block) | 3.762 | 3.805 | 5.51 |
SSE (simplified) | 3.985 | 4.043 | 5.20 |
Gaussian distribution (max at 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.489 | 24.804 | 1.00 |
scalar (std) | 38.285 | 38.540 | 0.64 |
scalar (hybrid) | 16.088 | 16.198 | 1.52 |
SSE | 4.159 | 4.200 | 5.89 |
SSE (block) | 4.107 | 4.142 | 5.96 |
SSE (simplified) | 4.213 | 4.250 | 5.81 |
Gaussian distribution (max at 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 28.265 | 28.549 | 1.00 |
scalar (std) | 22.648 | 22.767 | 1.25 |
scalar (hybrid) | 13.524 | 13.622 | 2.09 |
SSE | 5.201 | 5.271 | 5.43 |
SSE (block) | 5.452 | 5.513 | 5.18 |
SSE (simplified) | 5.057 | 5.128 | 5.59 |
Gaussian distribution (max at 4 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.687 | 16.790 | 1.00 |
scalar (std) | 31.153 | 31.398 | 0.54 |
scalar (hybrid) | 11.835 | 11.913 | 1.41 |
SSE | 3.590 | 3.614 | 4.65 |
SSE (block) | 3.466 | 3.489 | 4.81 |
SSE (simplified) | 3.496 | 3.525 | 4.77 |
Gaussian distribution (max at 4 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.284 | 22.529 | 1.00 |
scalar (std) | 22.476 | 22.653 | 0.99 |
scalar (hybrid) | 10.916 | 11.005 | 2.04 |
SSE | 4.032 | 4.081 | 5.53 |
SSE (block) | 3.859 | 3.898 | 5.77 |
SSE (simplified) | 3.953 | 3.991 | 5.64 |
Gaussian distribution (max at 3 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.121 | 19.291 | 1.00 |
scalar (std) | 32.927 | 33.175 | 0.58 |
scalar (hybrid) | 13.629 | 13.718 | 1.40 |
SSE | 4.848 | 4.901 | 3.94 |
SSE (block) | 4.780 | 4.825 | 4.00 |
SSE (simplified) | 4.760 | 4.812 | 4.02 |
Gaussian distribution (max at 3 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.281 | 24.611 | 1.00 |
scalar (std) | 22.530 | 22.703 | 1.08 |
scalar (hybrid) | 11.920 | 12.185 | 2.04 |
SSE | 4.756 | 4.823 | 5.11 |
SSE (block) | 4.781 | 4.835 | 5.08 |
SSE (simplified) | 4.607 | 4.664 | 5.27 |
Gaussian distribution (max at 2 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.188 | 22.377 | 1.00 |
scalar (std) | 36.158 | 36.461 | 0.61 |
scalar (hybrid) | 15.232 | 15.326 | 1.46 |
SSE | 4.058 | 4.108 | 5.47 |
SSE (block) | 3.952 | 3.990 | 5.61 |
SSE (simplified) | 4.061 | 4.110 | 5.46 |
Gaussian distribution (max at 2 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 26.633 | 26.932 | 1.00 |
scalar (std) | 22.723 | 22.869 | 1.17 |
scalar (hybrid) | 13.162 | 13.267 | 2.02 |
SSE | 5.097 | 5.174 | 5.23 |
SSE (block) | 5.160 | 5.224 | 5.16 |
SSE (simplified) | 4.941 | 5.008 | 5.39 |
Uniform distribution (1 .. 1 digit) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.963 | 23.203 | 1.00 |
scalar (std) | 37.465 | 37.718 | 0.61 |
scalar (hybrid) | 13.951 | 14.058 | 1.65 |
SSE | 3.389 | 3.447 | 6.78 |
SSE (block) | 3.593 | 3.643 | 6.39 |
SSE (simplified) | 3.324 | 3.394 | 6.91 |
Uniform distribution (1 .. 1 digit) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 27.116 | 27.518 | 1.00 |
scalar (std) | 20.621 | 20.742 | 1.31 |
scalar (hybrid) | 11.908 | 12.006 | 2.28 |
SSE | 4.128 | 4.167 | 6.57 |
SSE (block) | 4.179 | 4.214 | 6.49 |
SSE (simplified) | 4.105 | 4.136 | 6.61 |
Uniform distribution (1 .. 2 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 23.675 | 24.277 | 1.00 |
scalar (std) | 37.914 | 38.203 | 0.62 |
scalar (hybrid) | 15.327 | 15.432 | 1.54 |
SSE | 3.896 | 3.933 | 6.08 |
SSE (block) | 3.870 | 3.901 | 6.12 |
SSE (simplified) | 3.913 | 3.952 | 6.05 |
Uniform distribution (1 .. 2 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 27.811 | 28.100 | 1.00 |
scalar (std) | 22.433 | 22.537 | 1.24 |
scalar (hybrid) | 13.003 | 13.113 | 2.14 |
SSE | 4.947 | 5.018 | 5.62 |
SSE (block) | 5.201 | 5.265 | 5.35 |
SSE (simplified) | 4.807 | 4.872 | 5.79 |
Uniform distribution (1 .. 3 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.512 | 22.742 | 1.00 |
scalar (std) | 36.660 | 36.946 | 0.61 |
scalar (hybrid) | 15.263 | 15.362 | 1.47 |
SSE | 3.474 | 3.507 | 6.48 |
SSE (block) | 3.456 | 3.480 | 6.51 |
SSE (simplified) | 3.528 | 3.561 | 6.38 |
Uniform distribution (1 .. 3 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 26.841 | 27.155 | 1.00 |
scalar (std) | 22.874 | 23.042 | 1.17 |
scalar (hybrid) | 13.184 | 13.284 | 2.04 |
SSE | 4.815 | 4.876 | 5.57 |
SSE (block) | 4.952 | 5.022 | 5.42 |
SSE (simplified) | 4.699 | 4.753 | 5.71 |
Uniform distribution (1 .. 4 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 21.123 | 21.329 | 1.00 |
scalar (std) | 35.205 | 35.486 | 0.60 |
scalar (hybrid) | 14.756 | 14.856 | 1.43 |
SSE | 5.217 | 5.283 | 4.05 |
SSE (block) | 5.053 | 5.106 | 4.18 |
SSE (simplified) | 5.163 | 5.227 | 4.09 |
Uniform distribution (1 .. 4 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 25.877 | 26.129 | 1.00 |
scalar (std) | 23.132 | 23.285 | 1.12 |
scalar (hybrid) | 12.984 | 13.084 | 1.99 |
SSE | 5.211 | 5.291 | 4.97 |
SSE (block) | 5.283 | 5.356 | 4.90 |
SSE (simplified) | 5.048 | 5.116 | 5.13 |
Uniform distribution (1 .. 5 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 19.780 | 19.978 | 1.00 |
scalar (std) | 34.247 | 34.511 | 0.58 |
scalar (hybrid) | 14.274 | 14.371 | 1.39 |
SSE | 5.099 | 5.160 | 3.88 |
SSE (block) | 4.920 | 4.975 | 4.02 |
SSE (simplified) | 5.053 | 5.099 | 3.91 |
Uniform distribution (1 .. 5 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 24.749 | 25.022 | 1.00 |
scalar (std) | 23.269 | 23.468 | 1.06 |
scalar (hybrid) | 12.627 | 12.718 | 1.96 |
SSE | 5.046 | 5.111 | 4.90 |
SSE (block) | 4.999 | 5.126 | 4.95 |
SSE (simplified) | 4.903 | 4.954 | 5.05 |
Uniform distribution (1 .. 6 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 18.656 | 18.857 | 1.00 |
scalar (std) | 33.362 | 33.565 | 0.56 |
scalar (hybrid) | 13.643 | 13.742 | 1.37 |
SSE | 4.433 | 4.479 | 4.21 |
SSE (block) | 4.282 | 4.319 | 4.36 |
SSE (simplified) | 4.356 | 4.399 | 4.28 |
Uniform distribution (1 .. 6 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 23.975 | 24.304 | 1.00 |
scalar (std) | 23.495 | 23.653 | 1.02 |
scalar (hybrid) | 12.269 | 12.358 | 1.95 |
SSE | 4.883 | 4.947 | 4.91 |
SSE (block) | 4.715 | 4.770 | 5.08 |
SSE (simplified) | 4.778 | 4.825 | 5.02 |
Uniform distribution (1 .. 7 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 17.632 | 17.799 | 1.00 |
scalar (std) | 32.227 | 32.509 | 0.55 |
scalar (hybrid) | 13.079 | 13.177 | 1.35 |
SSE | 4.114 | 4.158 | 4.29 |
SSE (block) | 3.976 | 4.028 | 4.43 |
SSE (simplified) | 4.032 | 4.069 | 4.37 |
Uniform distribution (1 .. 7 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.878 | 23.116 | 1.00 |
scalar (std) | 23.259 | 23.512 | 0.98 |
scalar (hybrid) | 12.010 | 12.113 | 1.90 |
SSE | 4.926 | 4.986 | 4.64 |
SSE (block) | 4.670 | 4.735 | 4.90 |
SSE (simplified) | 4.836 | 4.892 | 4.73 |
Uniform distribution (1 .. 8 digits) — single separator character
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 16.753 | 16.923 | 1.00 |
scalar (std) | 31.309 | 31.604 | 0.54 |
scalar (hybrid) | 12.522 | 12.615 | 1.34 |
SSE | 5.181 | 5.232 | 3.23 |
SSE (block) | 5.263 | 5.313 | 3.18 |
SSE (simplified) | 5.089 | 5.133 | 3.29 |
Uniform distribution (1 .. 8 digits) — 1 .. 6 separator characters
cycles per one input byte | speed-up | ||
---|---|---|---|
procedure | min | avg | (min) |
scalar | 22.070 | 22.400 | 1.00 |
scalar (std) | 23.084 | 23.272 | 0.96 |
scalar (hybrid) | 11.689 | 11.777 | 1.89 |
SSE | 5.666 | 5.729 | 3.90 |
SSE (block) | 5.833 | 5.886 | 3.78 |
SSE (simplified) | 5.493 | 5.553 | 4.02 |
The vectorized algorithm can be faster from 2 to 5 times than the scalar base.
Although SSE speed-up depends on the count of digits in numbers, negative impact is visible when the numbers are separated by more than one character. This is caused by greater number of distinct span patterns: for "Gaussian distribution — max at 1 digit, single separator character" the number of distinct patterns is approximately 1000; for the same distribution, but variable number of separator, this value raises almost 28 times, to 28000.
It means touching more memory locations within the lookup table and, as a result, more cache misses. Also amount of branch mispredictions is greater due to processing greater variety of patterns.
Scalar hybrid performs surprisingly well, it can be even 2 times faster than the baseline. However, its compile time is terribly long.
Nathan Kurz and Daniel Lemire gave incredibly valuable feedback and comments on text and experiments. Thank you!
External links:
Internal links:
Source code is available on github.