Author: | Wojciech Muła |
---|---|
Added on: | 2023-04-09 |
Contents
Just for recap, an IPv4 address written in the textual form consists four decimal numbers separated by the dot character. Each number represents an octet (byte), that is in range from 0 to 255. Here are some examples: "10.1.1.12", "127.0.0.1", "255.255.255.0". An IPv4 address is stored in the big-endian byte order.
Parsing IPv4 addresses seems to be a trivial task. For example, the Go builtin module netip implements parsing with full validation in 35 lines. The inet_pton specialisation for IPv4 addresses, that can be found in Glibc, spans approx 40 lines of plain C code. For completeness, their full sources were put in appendix.
These two procedures share the same schema of parsing and validation:
I bet 99.999% of ever written IPv4 parsers use similar schema. And that is pretty hard to see any obvious inefficiency in this approach. Additionally, when we take into account that compilers are getting smarter and smarter, we may assume that a compiler would do a decent job for us.
However, it's possible to make the conversion faster, using SIMD instructions. The best solution is two-three times faster than a reference scalar procedure.
The actual C++ code snippets with SSE instructions are used to illustrate solution. Full source code is available. They include a SWAR implementation and also different SSE variants.
Since SIMD instructions process many bytes in parallel, we can approach the parsing problem from a slightly different direction.
With the help of SIMD instructions we may quickly get a 15-bit mask, where each 1 denotes the dot, and 0 a digit byte. The following table shows all 81 valid patterns.
length in bytes | # of patterns | masks (binary) | masks (hex) |
---|---|---|---|
7 | 1 | 0101010 | 002a |
8 | 4 | 01001010, 01010100, 00101010, 01010010 | 004a, 0054, 002a, 0052 |
9 | 10 | 001001010, 010101000, 001010010, 010010100, 010010010, 001010100, 000101010, 010001010, 010100100, 010100010 | 004a, 00a8, 0052, 0094, 0092, 0054, 002a, 008a, 00a4, 00a2 |
10 | 16 | 0100010010, 0100010100, 0100100100, 0010100100, 0010010010, 0010100010, 0010010100, 0001001010, 0100101000, 0101001000, 0010001010, 0001010010, 0010101000, 0001010100, 0100100010, 0101000100 | 0112, 0114, 0124, 00a4, 0092, 00a2, 0094, 004a, 0128, 0148, 008a, 0052, 00a8, 0054, 0122, 0144 |
11 | 19 | 00010001010, 01010001000, 01000100010, 00010010010, 00100100100, 00100010100, 00010100010, 01000100100, 00100010010, 00101001000, 01000101000, 00100100010, 01001000100, 00100101000, 00010010100, 01001001000, 00010100100, 00101000100, 00010101000 | 008a, 0288, 0222, 0092, 0124, 0114, 00a2, 0224, 0112, 0148, 0228, 0122, 0244, 0128, 0094, 0248, 00a4, 0144, 00a8 |
12 | 16 | 001000100100, 010010001000, 010001001000, 000100100100, 000101000100, 000100010100, 000100101000, 001001000100, 001000100010, 010001000100, 000100010010, 001010001000, 001001001000, 001000101000, 000101001000, 000100100010 | 0224, 0488, 0448, 0124, 0144, 0114, 0128, 0244, 0222, 0444, 0112, 0288, 0248, 0228, 0148, 0122 |
13 | 10 | 0001001000100, 0010010001000, 0010001001000, 0100010001000, 0001010001000, 0010001000100, 0001001001000, 0001000100100, 0001000100010, 0001000101000 | 0244, 0488, 0448, 0888, 0288, 0444, 0248, 0224, 0222, 0228 |
14 | 4 | 00010001000100, 00100010001000, 00010010001000, 00010001001000 | 0444, 0888, 0488, 0448 |
15 | 1 | 000100010001000 | 0888 |
total | 81 |
In the first step we reject inputs that are too short or too long.
result sse_parse_ipv4(const std::string& ipv4) { result res; res.ipv4 = 0; res.err = 0; const size_t n = ipv4.size(); if (n < minlen_ipv4) { res.err = errTooShort; return res; } if (n > maxlen_ipv4) { res.err = errTooLong; return res; } // ... }
Next, we load the input string into an SSE register. In the following code there's the assumption we can read past the end of string. In the terms of C++, we assume s.capacity() >= 16. In the real-world applications it's not always hold, thus we need to use a masked load.
const __m128i input = _mm_loadu_si128((const __m128i*)ipv4.data());
An address "195.42.3.0", having 10 bytes, has the following byte layout in an SSE register:
? ? . ? ? ? 0 . 3 . 2 4 . 5 9 1 ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │3f│3f│2e│3f│3f│3f│30│2e│33│2e│32│34│2e│35│39│31│ = input └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ^^ ^^ byte 15 byte 0
Then we locate the dots. Please note that we have garbage bytes past the string end, thus there's need to have a bitmask for valid positions.
uint16_t mask = 0xffff; mask <<= n; mask = ~mask; uint16_t dotmask; { const __m128i dot = _mm_set1_epi8('.'); const __m128i t0 = _mm_cmpeq_epi8(input, dot); dotmask = _mm_movemask_epi8(t0); dotmask &= mask; }
? ? . ? ? ? 0 . 3 . 2 4 . 5 9 1 ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │3f│3f│2e│3f│3f│3f│30│2e│33│2e│32│34│2e│35│39│31│ = input └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │00│00│ff│00│00│00│00│ff│00│ff│00│00│ff│00│00│00│ = t0 └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ^^ ^^ ^^ ^^ 0x2148 = 0b0010000101001000 = dotmask 0b0000001111111111 = mask 0b0000000101001000 = dotmask & mask
At this point we can count the number of dots (using cheap POPCNT). Yet, that is not necessary, as the wrong dot-patterns can be detected later. It depends on the method we choose for the dot mask processing.
const auto k = __builtin_popcount(dotmask); if (k != 3) { if (k > 3) { res.err = errTooManyFields; } else { res.err = errTooFewFields; } return res; }
Now, we check if all other bytes are ASCII digits. It requires just one SIMD subtraction and compare.
const __m128i ascii0 = _mm_set1_epi8(-128 + '0'); const __m128i rangedigits = _mm_set1_epi8(-128 + ('9' - '0' + 1)); const __m128i t1 = _mm_sub_epi8(input, ascii0); const __m128i t2 = _mm_cmplt_epi8(t1, rangedigits); uint16_t less = _mm_movemask_epi8(t2); less &= mask // trim to string length less |= dotmask // combine with dot positions if (less != mask) { res.err = errWrongCharacter; return res; }
There are different ways the dot bitmask can be used to select the parsing path.
The first, and quite obvious solution, is to sequentially scan the bitmask. Finding the first bit set is a cheap instruction (TZCNT). We set artificial 1 after the last field, thanks to that we use the same code to handle all four fields. The number of digits has to be in range 1 to 3, otherwise the input is wrong.
This schema can be applied literally, and is shown in details the section Scalar conversion. We iterate over fields, read the number of digits and use code specialised to convert and validate the given number of digits.
It is better, as we do not have to deal with numbers of an unknown length. We may do it faster, by selecting specialised conversion of four fields, in other words unroll the loop described earlier for given lengths of fields. As we learned, there are only 81 different length combinations to consider. For example, the address "123.45.6.7" has to be converted by code path that would parse-and-validate 3-digit number, 2-digit number and two single digits.
The only remaining issue is how to get the id/address of conversion path from the dot bitmask. We have a number of possibilities:
An easier way is to treat the lengths as digits of a number in the base 4. It means that we simply store the lengths as two-bit subfields of an eight-bit number. In other words map a bitmask into 256 distinct values, with spare slots (for any l0, 1, 2, 3 = 0).
We may treat the dot bitmask as a natural number. The maximum value then is either 2184, but with three repetitions (due to three possible lengths of the last field). If we want to get unique numbers, we have to either add 1 after the last field, or combine bitmask with a 4-bit input length. Either way, that leads to max number ~30,000 — it's too big to have reasonable lookup table. Thus we have the following possibilities:
The common trait of these methods is that we do not have to check the number of dots. Such wrong patterns will not match anything.
There are two hashing functions that were designed during the research:
A quite complex function, having the maximum value 137. This is the hash used by function "SSE (v5)".
uint16_t w0 = 31 * uint16_t((dotmask >> 9) & 0xff); uint16_t w1 = 3 * uint16_t((dotmask >> 2) & 0x3f); uint16_t w2 = 19 * uint16_t((dotmask >> 8) & 0xff); uint16_t hashcode = n + w0 + w1 - w2;
The simpler hash has maximum much bigger: 1228. This function is used by function "SSE (v7)".
const uint16_t hashcode = (dotmask >> 5) ^ (dotmask & 0x03ff);
In scalar conversion we do not take advantage of having data already in an SSE register However, we know upfront that all characters are valid and there are exactly three dots. Thus we have only to validate number ranges.
This is the procedure "SSE" from sample code.
Before we start iterating over dots, we add artificial one to the dot bitmask, after the last character. Thanks to that we can handle all fields in exactly the same way.
dotmask |= uint16_t(1) << ipv4.size();
In the next step we repeat the code four times. In each iteration we find the position of the first 1 in the dot mask. It tells us how many digits we have to convert. This is the place where we do additional validation: we check if the number of digits is 1, 2 or 3.
Depending on the number of digits we use a specialised code path. Each path converts a number, validate its range and update the output word accordingly. We let the compiler to do its best.
In each step we also shift out the processed bits from the dot mask.
const uint8_t* data = (const uint8_t*)ipv4.data(); for (int i=0; i < 4; i++) { const int n = __builtin_ctz(dotmask); switch (n) { case 1: res.byte[i] = (data[0] - '0'); data += 2; dotmask >>= 2; break; case 2: { const uint32_t tmp = 10 * (data[0] - '0') + (data[1] - '0'); if (tmp < 10) { res.err = errLeadingZeros; return res; } res.byte[i] = tmp; } data += 3; dotmask >>= 3; break; case 3: { const uint32_t tmp = 100 * (data[0] - '0') + 10 * (data[1] - '0') + (data[2] - '0'); if (tmp > 0xff) { res.err = errTooBig; return res; } if (tmp < 100) { res.err = errLeadingZeros; return res; } res.byte[i] = tmp; } data += 4; dotmask >>= 4; break; default: if (n > 0) { res.err = errTooManyDigits; } else { res.err = errEmptyField; } return res; } }
While we can learn upfront what are widths of all fields, we utilize this fact. For each possible valid combination we pre-compute a specialised path that converts given combination of numbers.
This is the procedure "SSE (v2)" from sample code.
As in the previous solution, we add an artificial one to the dot mask, just after the last character.
dotmask |= uint16_t(1) << ipv4.size();
Convert from byte mask into 8-bit code. Although there are only 81 valid dot masks, they are stored as variable-length codes. We iterator over fields and encode their lengths on two bits, forming a byte. For our sample IPv4 "195.42.3.0" we have lengths [3, 2, 1, 1] and it's encoded as [0b11, 0b10, 0b01, 0b01] => 0b11100101 = 0xe5.
In parallel to code building, we validate if all fields have valid lengths (1, 2 or 3 chars).
const uint8_t* byte = (const uint8_t*)ipv4.data(); uint8_t code = 0; for (int i=0; i < 4; i++) { const uint8_t n = __builtin_ctz(dotmask); if (n == 0) { res.err = errEmptyField; return res; } if (n > 3) { res.err = errTooManyDigits; return res; } code = (code << 2) | n; dotmask >>= n + 1; }
The next step is to use the code value to choose specialised converter. It's a huge switch generated by a program; here we show just a single case.
uint32_t val0; uint32_t val1; uint32_t val2; uint32_t val3; switch (code) { // ... case 0xe5: // [3, 2, 1, 1] tmp = (data[0] - '0'); // 3-digit number tmp = 10 * tmp + (data[1] - '0'); tmp = 10 * tmp + (data[2] - '0'); if (tmp < 100) { res.err = errLeadingZeros; break; } if (tmp > 255) { res.err = errTooBig; break; } res.byte[0] = tmp; res.byte[1] = 10 * (data[4] - '0') + (data[5] - '0'); // 2-byte number if (res.byte[1] < 10) { res.err = errLeadingZeros; break; } res.byte[2] = (data[7] - '0'); // single-digit number res.byte[3] = (data[9] - '0'); // single-digit number break; // ... }
The approach is almost exactly as the previous one. The only difference is in how we choose the parsing specialisation.
We're doing that in two steps. First, we use the input length. This already limits the number of possibilities (see the table above). Next, we match the bitpattern as is, without any transcoding.
This is the procedure "SSE (v3)" from sample code.
Below is shown the part of switch that handles pattern [3, 2, 1, 1].
uint32_t tmp; switch (ipv4.size()) { case 7: // 1 case(s) // ... case 8: // 4 case(s) // ... case 9: // 10 case(s) // ... case 10: // 16 case(s) switch (dotmask & 0x3ff) { // ... case 0x148: // [3, 2, 1, 1] tmp = (byte[0] - '0'); tmp = 10 * tmp + (byte[1] - '0'); tmp = 10 * tmp + (byte[2] - '0'); if (tmp < 100) { res.err = errLeadingZeros; break; } if (tmp > 255) { res.err = errTooBig; break; } res.byte[0] = tmp; res.byte[1] = 10 * (byte[4] - '0') + (byte[5] - '0'); if (res.byte[1] < 10) { res.err = errLeadingZeros; break; } res.byte[2] = (byte[7] - '0'); res.byte[3] = (byte[9] - '0'); break; // ... } break; // ... }
Since the digit validation and dot-pattern generation is done by SIMD code, getting back to the scalar domain seems not to be optimal.
We can use SIMD to perform full conversion. From the dot bitmask we determine how to layout the input bytes to make them processable by SIMD instructions. This is done by single PSHUFB, then we pick a proper procedure.
Unlike the scalar unrolled code, we do not have to enumerate all 81 possibilities. We roughly classify an IPv4 address as:
In fact we may enumerate eight classes (for instance: only 2-digit numbers, only 2- or 3-digit numbers), but it's not clear if such fragmentation has much sense. Using the maximum numeric field length is good enough.
A SIMD conversion procedure requires the following parameters:
These parameters fit in 20 bytes. When we express the dot mask as a base-3 number or hash function #1, we may have a single array of parameters, having — respectively — 81 or 137 entries.
Procedures "SSE (v5)" and "SSE (v6)" follow the pattern:
hashcode = hash(dotmask) if (hashcode > max_hashcode) { // return error } spec = lookup[hashcode] if (dotmask != spec.expected_dotmask) { // return error } switch (spec.max_length) { case 1: res.ipv4 = conversion_max1(spec.pshufb_pattern) return res case 2: res.ipv4 = conversion_max2(spec.pshufb_pattern) return res case 3: res.ipv4 = conversion_max3(spec.pshufb_pattern) return res default: // return error }
However, when we use hash function #2 which max value is 1228, we use two tables. The first lookup encodes transition from the hash value into length pattern. It should be number from 0 to 80 plus one special value for "not found". The second table of size 82 has actual parameters.
Procedure "SSE (v7)" follows the pattern:
hashcode = hash(dotmask) if (hashcode > max_hashcode) { // return error } id = lookup1[hashcode] if (id == not_found_value) { // return error } spec = lookup2[id] if (dotmask != spec.expected_dotmask) { // return error } switch (spec.max_length) { case 1: res.ipv4 = conversion_max1(spec.pshufb_pattern) return res case 2: res.ipv4 = conversion_max2(spec.pshufb_pattern) return res case 3: res.ipv4 = conversion_max3(spec.pshufb_pattern) return res default: // return error }
In this case there's no need to validate anything, just pick the bytes from even indices. PSHUFB layouts the input digits in the lower 32-bit word of an SSE register and we simply copy them with single instruction.
const __m128i pattern = _mm_setr_epi8(0, 2, 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); const __m128i t1 = _mm_shuffle_epi8(t0, pattern); res.ipv4 = _mm_cvtsi128_si32(t1) & 0x0f0f0f0f;
In this case we have only to check if the 2-digit fields do not contain ASCII '0' on their most significant position. We lay digits in two 32-bit words. The lower one contains the most significant ASCII digits, or value 0 for 1-digit fields. The higher 32-bit word contains the least significant digits.
Let's see how we would parse address "12.3.45.6".
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │..│..│..│..│..│..│..│36│2e│35│34│2e│33│2e│32│31│ = input └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ^ ^ byte 15 byte 0
First, we split the input in the described way.
const __m128i t1 = _mm_shuffle_epi8(t0, pattern);
┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ unused 64 bits │ │ │ │ │ 0│34│ 0│31│ = lower 32-bit word └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ unused 64 bits │36│35│33│32│ │ │ │ │ = higher 32-bit word └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
The actual byte layout is:
┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ unused 64 bits │36│35│33│32│ 0│34│ 0│31│ = t1 └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
Next, compare the t1 vector with ASCII '0' (hex 0x30). We are interested only in the lowest 32-bit result.
const __m128i ascii0 = _mm_set1_epi8('0'); const __m128i t2 = _mm_cmpeq_epi8(t1, ascii0);
┌───────────────────────────────────┬──┬──┬──┬──┐ │ unused 96 bits │30│30│30│30│ = ascii0 └───────────────────────────────────┴──┴──┴──┴──┘ ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ unused 64 bits │36│35│33│32│ 0│34│ 0│31│ = t1 └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌───────────────────────────────────┬──┬──┬──┬──┐ │ unused 96 bits │00│00│00│00│ = t2 = cmp(ascii0, t1) └───────────────────────────────────┴──┴──┴──┴──┘
As the next step, we extract that lowest word from t2. If it is not zero, that means there is at least one 2-digit field having the leading zero.
const uint64_t msd = _mm_cvtsi128_si64(t2); if (msd != 0) { res.err = errLeadingZeros; return res; }
Once we know the input is valid, we extract the lower 64 bits from t1 and get back to scalar code. There is simply too few data to use SIMD instructions for that.
const uint64_t ascii = _mm_cvtsi128_si64(t1); const uint64_t w01 = ascii & 0x0f0f0f0f0f0f0f0f; const uint32_t w0 = w01 >> 32; const uint32_t w1 = w01 & 0xfffffffflu;
┌──┬──┬──┬──┬──┬──┬──┬──┐ │36│35│33│32│ 0│34│ 0│31│ = ascii └──┴──┴──┴──┴──┴──┴──┴──┘ ┌──┬──┬──┬──┬──┬──┬──┬──┐ │ 6│ 5│ 3│ 2│ 0│ 4│ 0│ 1│ = w01 └──┴──┴──┴──┴──┴──┴──┴──┘ ┌──┬──┬──┬──┐ │ 6│ 5│ 3│ 2│ = w0 └──┴──┴──┴──┘ ┌──┬──┬──┬──┐ │ 0│28│ 0│ a│ = w1 └──┴──┴──┴──┘ 40 10 decimal
Finally, we convert the individual digits into actual values. We may freely multiply by 10 and add, as we will never get an overflow; the SWAR solution works fine.
res.ipv4 = 10 * w1 + w0;
┌──┬──┬──┬──┐ │ 6│2d│ 3│ c│ = IPv4 └──┴──┴──┴──┘
In this case we split input into four sets of digits.
Let's analyse parsing of "127.45.6.222".
2 2 2 . 6 . 5 4 . 7 2 1 ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │..│..│..│..│32│32│32│2e│36│2e│35│34│2e│37│32│31│ = input └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ^ ^ byte 15 byte 0
We split the input into four subwords:
const __m128i t1 = _mm_shuffle_epi8(t0, pattern);
127.45.6.234 │││ ││ │ │││ ││└──┴─┴───┴── digit #0 [7, 5, 6, 4] ││ │ ││ │└──┴─────┴─── digit #1 [2, 4, -, 3] │ │ └────────┴──── digit #2 [1, -, -, 2]
┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │ │37│ │35│ │36│ │34│ = digit #0 └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │32│ │34│ │ 0│ │33│ │ = digit #1 └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┐ │ │31│ │ 0│ │ 0│ │32│ │ = digit #2 └──┴──┴──┴──┴──┴──┴──┴──┴───────────────────────┘ ┌──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┐ │31│ │34│ │ 0│ │32│ │ │ = most significant digits └──┴──┴──┴──┴──┴──┴──┴──┴───────────────────────┘
The actual layout after PSHUFB:
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │31│31│34│ 0│ 0│ 0│32│32│32│37│34│35│ 0│36│33│34│ = t1 └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Check if any significant digit is ASCII '0'. We only use the results for odd bytes from the higher half of word t1, thus masking with 0xAA00.
const __m128i ascii0 = _mm_set1_epi8('0'); // 0x30 const __m128i eq0 = _mm_cmpeq_epi8(t1, ascii0); const uint16_t res = _mm_movemask_epi8(eq0); if ((res & 0xaa00) != 0) { // report leading zeros }
Convert from ASCII to digits. It's sufficient to subtract with saturation value 0x30 (ASCII '0').
const __m128i t2 = _mm_subs_epu8(t1, ascii0);
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ 1│ 1│ 4│ 0│ 0│ 0│ 2│ 2│ 2│ 7│ 4│ 5│ 0│ 6│ 3│ 4│ = t2 └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
Then, we calculate 10 * digit #1 + digit #0 in the lower half, and 100 * digit #2. To do this, we use instruction PMADDUBSW that multiplies horizontally bytes and add adjacent results into 16-bit values.
The multipliers for most significant digits from upper half are set to zero, thanks to that we do not need any extra masking.
const __m128i weights = _mm_setr_epi8( 10, 1, 10, 1, 10, 1, 10, 1, // 10 = 0x0a 100, 0, 100, 0, 100, 0, 100) // 100 = 0x64 const __m128i t3 = _mm_maddubs_epi16(t2, weights)
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ 0│64│ 0│64│ 0│64│ 0│64│ a│ 1│ a│ 1│ a│ 1│ a│ 1│ = weights └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ 1│ 1│ 4│ 0│ 0│ 0│ 2│ 2│ 2│ 7│ 4│ 5│ 0│ 6│ 3│ 4│ = t2 └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 100│ 0│ 0│ 200│ 27│ 45│ 6│ 34│ = t3 = maddubs(t2, weights) └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ [shown in decimal]
Now we have to combine these partial results. Simply shift right the t3 by 8 bytes (with PALIGNR) and add with the original constant of t3.
const __m128i t4 = _mm_alignr_epi8(t3, 8); const __m128i t5 = _mm_add_epi16(t4);
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 100│ 0│ 0│ 200│ 27│ 45│ 6│ 34│ = t3 └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌───────────────────────┬─────┬─────┬─────┬─────┐ │ unused │ 100│ 0│ 0│ 200│ = t4 └───────────────────────┴─────┴─────┴─────┴─────┘ ┌───────────────────────┬─────┬─────┬─────┬─────┐ │ unused │ 127│ 45│ 6│ 234│ = t5 = t3 + t4 └───────────────────────┴─────┴─────┴─────┴─────┘ 0x3F 0x2D 0x06 0xEA
The almost final result is stored in t4 on 16-bit values. We have to check if all are not greater than 255. It's done by a regular comparison.
const __m128i gt = _mm_cmpgt_epi16(t4, _mm_set1_epi16(255)); const uint16_t gtmask = _mm_movemask_epi8(gt); if ((gtmask & 0x00ff) != 0) { // report too big fields }
Finally, we need to convert from uint16 to uint8. Since we know the values in t4 fit in a byte, pack with unsigned saturation is sufficient.
const __m128i t6 = _mm_packus_epi16(t5, t5); ipv4 = _mm_cvtsi128_si32(t6);
As the baseline we selected the best result from scalar procedures: go, glibc and naive.
procedure | comment |
---|---|
go | A translation into C++ the Go procedure. |
glibc | A C procedure from Glibc with a thin wrapping for C++. |
naive | A from scratch implementation of sequential parsing, as described in introduction. |
naive (no validation) | It converts only valid IPv4 addresses; used to estimate the lower bound of the naive implementation. |
SSE | It uses the SIMD validation, and scalar sequential conversion. |
SSE (v2) | It uses the SIMD validation, and unrolled scalar conversion with parsing schema based on a single switch. The switch statement takes lengths pattern encoded as an 8-bit values. |
SSE (v3) | It uses the SIMD validation, and unrolled scalar conversion with parsing schema using nested switches. The outer switch uses the input length, nested switches are based on explicit dot mask values. |
SSE (v4) | It uses both the SIMD validation, and SIMD number conversion with nested switches: the outer one uses the input length, nested ones are based on explicit dot mask values. |
SSE (v5) | It uses both the SIMD validation, and SIMD number conversion with parsing schema selection using complex hash function |
SSE (v6) | It uses both SIMD validation, and SIMD number conversion with parsing schema selection using base-3 numbers. |
SSE (v7) | It uses both SIMD validation, and SIMD number conversion with parsing schema selection using simpler hash function. |
CPU: | Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz |
---|---|
Compiler: | g++-9 (Ubuntu 9.3.0-10ubuntu2~16.04) 9.3.0 |
procedure | best | avg. | speedup | |
---|---|---|---|---|
[cycles] | [cycles] | |||
naive | 79.962 | 80.001 | 0.97 | ████████████████▍ |
glibc | 77.893 | 78.201 | 1.00 | ████████████████▊ |
go | 85.522 | 85.918 | 0.91 | ███████████████▎ |
naive (no validation) | 48.328 | 48.458 | 1.61 | ███████████████████████████▏ |
SWAR | 84.235 | 84.258 | 0.92 | ███████████████▌ |
SSE | 47.295 | 47.344 | 1.65 | ███████████████████████████▋ |
SSE (v2) | 56.745 | 56.792 | 1.37 | ███████████████████████ |
SSE (v3) | 40.779 | 40.823 | 1.91 | ████████████████████████████████▏ |
SSE (v4) | 31.922 | 31.938 | 2.44 | █████████████████████████████████████████ |
SSE (v5) | 28.127 | 28.139 | 2.77 | ██████████████████████████████████████████████▋ |
SSE (v6) | 34.278 | 34.287 | 2.27 | ██████████████████████████████████████▎ |
SSE (v7) | 26.239 | 26.251 | 2.97 | ██████████████████████████████████████████████████ |
SSE (v8) | 33.645 | 33.655 | 2.32 | ██████████████████████████████████████▉ |
CPU: | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz |
---|---|
Compiler: | gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-14) |
procedure | best | avg. | speedup | |
---|---|---|---|---|
[cycles] | [cycles] | |||
naive | 42.217 | 42.601 | 1.00 | ███████████████████▊ |
glibc | 58.327 | 58.717 | 0.72 | ██████████████▎ |
go | 51.779 | 51.914 | 0.82 | ████████████████▏ |
naive (no validation) | 32.414 | 32.491 | 1.30 | █████████████████████████▊ |
SWAR | 54.670 | 54.718 | 0.77 | ███████████████▎ |
SSE | 28.496 | 28.550 | 1.48 | █████████████████████████████▎ |
SSE (v2) | 38.016 | 38.108 | 1.11 | █████████████████████▉ |
SSE (v3) | 23.586 | 23.711 | 1.79 | ███████████████████████████████████▍ |
SSE (v4) | 16.716 | 16.778 | 2.53 | ██████████████████████████████████████████████████ |
SSE (v5) | 19.385 | 19.438 | 2.18 | ███████████████████████████████████████████ |
SSE (v6) | 25.097 | 25.129 | 1.68 | █████████████████████████████████▎ |
SSE (v7) | 17.041 | 17.110 | 2.48 | █████████████████████████████████████████████████ |
SSE (v8) | 23.761 | 23.793 | 1.78 | ███████████████████████████████████▏ |
All programs are available on Github.
func parseIPv4(s string) (ip Addr, err error) { var fields [4]uint8 var val, pos int var digLen int // number of digits in current octet for i := 0; i < len(s); i++ { if s[i] >= '0' && s[i] <= '9' { if digLen == 1 && val == 0 { return Addr{}, parseAddrError{in: s, msg: "IPv4 field has octet with leading zero"} } val = val*10 + int(s[i]) - '0' digLen++ if val > 255 { return Addr{}, parseAddrError{in: s, msg: "IPv4 field has value >255"} } } else if s[i] == '.' { if i == 0 || i == len(s)-1 || s[i-1] == '.' { return Addr{}, parseAddrError{in: s, msg: "IPv4 field must have at least one digit", at: s[i:]} } if pos == 3 { return Addr{}, parseAddrError{in: s, msg: "IPv4 address too long"} } fields[pos] = uint8(val) pos++ val = 0 digLen = 0 } else { return Addr{}, parseAddrError{in: s, msg: "unexpected character", at: s[i:]} } } if pos < 3 { return Addr{}, parseAddrError{in: s, msg: "IPv4 address too short"} } fields[3] = uint8(val) return AddrFrom4(fields), nil }
static int inet_pton4 (const char *src, const char *end, unsigned char *dst) { int saw_digit, octets, ch; unsigned char tmp[NS_INADDRSZ], *tp; saw_digit = 0; octets = 0; *(tp = tmp) = 0; while (src < end) { ch = *src++; if (ch >= '0' && ch <= '9') { unsigned int new = *tp * 10 + (ch - '0'); if (saw_digit && *tp == 0) return 0; if (new > 255) return 0; *tp = new; if (! saw_digit) { if (++octets > 4) return 0; saw_digit = 1; } } else if (ch == '.' && saw_digit) { if (octets == 4) return 0; *++tp = 0; saw_digit = 0; } else return 0; } if (octets < 4) return 0; memcpy (dst, tmp, NS_INADDRSZ); return 1; }