SIMD-ized faster parse of IPv4 addresses

Author:Wojciech Muła
Added on:2023-04-09

Contents

Introduction

Just for recap, an IPv4 address written in the textual form consists four decimal numbers separated by the dot character. Each number represents an octet (byte), that is in range from 0 to 255. Here are some examples: "10.1.1.12", "127.0.0.1", "255.255.255.0". An IPv4 address is stored in the big-endian byte order.

Parsing IPv4 addresses seems to be a trivial task. For example, the Go builtin module netip implements parsing with full validation in 35 lines. The inet_pton specialisation for IPv4 addresses, that can be found in Glibc, spans approx 40 lines of plain C code. For completeness, their full sources were put in appendix.

These two procedures share the same schema of parsing and validation:

  1. Read the input string byte by byte.
  2. When the current byte is an ASCII digit ('0' ... '9') add it to the current octet. If the octet value becomes larger than 255 or the leading zero was detected, then report an error.
  3. When the current byte is the dot ('.'), check if we read at least one digit. If not, it's an error (for example ".1.1.20" or "192..0.12").
  4. When the current byte is not a digit or the dot, report an error.
  5. At the very end, check if exactly four octets were read.

I bet 99.999% of ever written IPv4 parsers use similar schema. And that is pretty hard to see any obvious inefficiency in this approach. Additionally, when we take into account that compilers are getting smarter and smarter, we may assume that a compiler would do a decent job for us.

However, it's possible to make the conversion faster, using SIMD instructions. The best solution is two-three times faster than a reference scalar procedure.

The actual C++ code snippets with SSE instructions are used to illustrate solution. Full source code is available. They include a SWAR implementation and also different SSE variants.

SIMD improvements

Since SIMD instructions process many bytes in parallel, we can approach the parsing problem from a slightly different direction.

  1. First we load all the data. Since a valid IPv4 string contains from 7 to 15 bytes, such string fits in a single 16-byte SSE register.
  2. Then, we can quickly validate all bytes at once.
    • As the first substep, we determine positions of the decimal dots ('.', hex code 0x2e). There have to be exactly three dots, so this trait may be used to early filter out garbage inputs.
    • As the second substep, we check if all remaining characters are digits ('0'...'9', hex codes from 0x30 to 0x39).
  3. The positions of dots determine the further parsing. Moreover, not all positions all valid: the decimal numbers have to span from 1 to 3 digits. Our goal is to transform the position of dots into "something" that can be an input to the subsequent step — actual number conversion.
  4. The last step is conversion of numeric fields, with additional validation regarding their ranges: numbers cannot excess the byte range (0..255, for example "1.2.300.4". We need also to reject fields with leading zeros (like "01.2.3.4"). Conversion can be done by a scalar or SIMD code. Both variants are shown in this article.

With the help of SIMD instructions we may quickly get a 15-bit mask, where each 1 denotes the dot, and 0 a digit byte. The following table shows all 81 valid patterns.

Valid dot patterns
length in bytes # of patterns masks (binary) masks (hex)
7 1 0101010 002a
8 4 01001010, 01010100, 00101010, 01010010 004a, 0054, 002a, 0052
9 10 001001010, 010101000, 001010010, 010010100, 010010010, 001010100, 000101010, 010001010, 010100100, 010100010 004a, 00a8, 0052, 0094, 0092, 0054, 002a, 008a, 00a4, 00a2
10 16 0100010010, 0100010100, 0100100100, 0010100100, 0010010010, 0010100010, 0010010100, 0001001010, 0100101000, 0101001000, 0010001010, 0001010010, 0010101000, 0001010100, 0100100010, 0101000100 0112, 0114, 0124, 00a4, 0092, 00a2, 0094, 004a, 0128, 0148, 008a, 0052, 00a8, 0054, 0122, 0144
11 19 00010001010, 01010001000, 01000100010, 00010010010, 00100100100, 00100010100, 00010100010, 01000100100, 00100010010, 00101001000, 01000101000, 00100100010, 01001000100, 00100101000, 00010010100, 01001001000, 00010100100, 00101000100, 00010101000 008a, 0288, 0222, 0092, 0124, 0114, 00a2, 0224, 0112, 0148, 0228, 0122, 0244, 0128, 0094, 0248, 00a4, 0144, 00a8
12 16 001000100100, 010010001000, 010001001000, 000100100100, 000101000100, 000100010100, 000100101000, 001001000100, 001000100010, 010001000100, 000100010010, 001010001000, 001001001000, 001000101000, 000101001000, 000100100010 0224, 0488, 0448, 0124, 0144, 0114, 0128, 0244, 0222, 0444, 0112, 0288, 0248, 0228, 0148, 0122
13 10 0001001000100, 0010010001000, 0010001001000, 0100010001000, 0001010001000, 0010001000100, 0001001001000, 0001000100100, 0001000100010, 0001000101000 0244, 0488, 0448, 0888, 0288, 0444, 0248, 0224, 0222, 0228
14 4 00010001000100, 00100010001000, 00010010001000, 00010001001000 0444, 0888, 0488, 0448
15 1 000100010001000 0888
total 81    

SIMD validation and preprocessing

  1. In the first step we reject inputs that are too short or too long.

    result sse_parse_ipv4(const std::string& ipv4) {
        result res;
        res.ipv4 = 0;
        res.err = 0;
    
        const size_t n = ipv4.size();
        if (n < minlen_ipv4) {
            res.err = errTooShort;
            return res;
        }
        if (n > maxlen_ipv4) {
            res.err = errTooLong;
            return res;
        }
    
        // ...
    }
    
  2. Next, we load the input string into an SSE register. In the following code there's the assumption we can read past the end of string. In the terms of C++, we assume s.capacity() >= 16. In the real-world applications it's not always hold, thus we need to use a masked load.

    const __m128i input = _mm_loadu_si128((const __m128i*)ipv4.data());
    

    An address "195.42.3.0", having 10 bytes, has the following byte layout in an SSE register:

      ?  ?  .  ?  ?  ?  0  .  3  .  2  4  .  5  9  1
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │3f│3f│2e│3f│3f│3f│30│2e│33│2e│32│34│2e│35│39│31│ = input
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
     ^^                                           ^^
     byte 15                                  byte 0
    
  3. Then we locate the dots. Please note that we have garbage bytes past the string end, thus there's need to have a bitmask for valid positions.

    uint16_t mask = 0xffff;
    mask <<= n;
    mask = ~mask;
    
    uint16_t dotmask;
    {
        const __m128i dot = _mm_set1_epi8('.');
        const __m128i t0 = _mm_cmpeq_epi8(input, dot);
        dotmask = _mm_movemask_epi8(t0);
        dotmask &= mask;
    }
    
      ?  ?  .  ?  ?  ?  0  .  3  .  2  4  .  5  9  1
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │3f│3f│2e│3f│3f│3f│30│2e│33│2e│32│34│2e│35│39│31│ = input
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │00│00│ff│00│00│00│00│ff│00│ff│00│00│ff│00│00│00│ = t0
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
           ^^             ^^    ^^       ^^
    
                          0x2148 = 0b0010000101001000 = dotmask
                                   0b0000001111111111 = mask
                                   0b0000000101001000 = dotmask & mask
    
  4. At this point we can count the number of dots (using cheap POPCNT). Yet, that is not necessary, as the wrong dot-patterns can be detected later. It depends on the method we choose for the dot mask processing.

    const auto k = __builtin_popcount(dotmask);
    if (k != 3) {
        if (k > 3) {
            res.err = errTooManyFields;
        } else {
            res.err = errTooFewFields;
        }
        return res;
    }
    
  5. Now, we check if all other bytes are ASCII digits. It requires just one SIMD subtraction and compare.

    const __m128i ascii0 = _mm_set1_epi8(-128 + '0');
    const __m128i rangedigits = _mm_set1_epi8(-128 + ('9' - '0' + 1));
    
    const __m128i t1 = _mm_sub_epi8(input, ascii0);
    const __m128i t2 = _mm_cmplt_epi8(t1, rangedigits);
    
    uint16_t less = _mm_movemask_epi8(t2);
    less &= mask     // trim to string length
    less |= dotmask  // combine with dot positions
    
    if (less != mask) {
        res.err = errWrongCharacter;
        return res;
    }
    

Selecting parsing schema

There are different ways the dot bitmask can be used to select the parsing path.

The first, and quite obvious solution, is to sequentially scan the bitmask. Finding the first bit set is a cheap instruction (TZCNT). We set artificial 1 after the last field, thanks to that we use the same code to handle all four fields. The number of digits has to be in range 1 to 3, otherwise the input is wrong.

This schema can be applied literally, and is shown in details the section Scalar conversion. We iterate over fields, read the number of digits and use code specialised to convert and validate the given number of digits.

It is better, as we do not have to deal with numbers of an unknown length. We may do it faster, by selecting specialised conversion of four fields, in other words unroll the loop described earlier for given lengths of fields. As we learned, there are only 81 different length combinations to consider. For example, the address "123.45.6.7" has to be converted by code path that would parse-and-validate 3-digit number, 2-digit number and two single digits.

The only remaining issue is how to get the id/address of conversion path from the dot bitmask. We have a number of possibilities:

Hash functions

There are two hashing functions that were designed during the research:

  1. A quite complex function, having the maximum value 137. This is the hash used by function "SSE (v5)".

    uint16_t w0 = 31 * uint16_t((dotmask >> 9) & 0xff);
    uint16_t w1 =  3 * uint16_t((dotmask >> 2) & 0x3f);
    uint16_t w2 = 19 * uint16_t((dotmask >> 8) & 0xff);
    uint16_t hashcode = n + w0 + w1 - w2;
    
  1. The simpler hash has maximum much bigger: 1228. This function is used by function "SSE (v7)".

    const uint16_t hashcode = (dotmask >> 5) ^ (dotmask & 0x03ff);
    

Number conversions and validation

Scalar conversions

In scalar conversion we do not take advantage of having data already in an SSE register However, we know upfront that all characters are valid and there are exactly three dots. Thus we have only to validate number ranges.

Sequential conversion

This is the procedure "SSE" from sample code.

  1. Before we start iterating over dots, we add artificial one to the dot bitmask, after the last character. Thanks to that we can handle all fields in exactly the same way.

    dotmask |= uint16_t(1) << ipv4.size();
    
  2. In the next step we repeat the code four times. In each iteration we find the position of the first 1 in the dot mask. It tells us how many digits we have to convert. This is the place where we do additional validation: we check if the number of digits is 1, 2 or 3.

    Depending on the number of digits we use a specialised code path. Each path converts a number, validate its range and update the output word accordingly. We let the compiler to do its best.

    In each step we also shift out the processed bits from the dot mask.

    const uint8_t* data = (const uint8_t*)ipv4.data();
    for (int i=0; i < 4; i++) {
        const int n = __builtin_ctz(dotmask);
    
        switch (n) {
            case 1:
                res.byte[i] = (data[0] - '0');
    
                data += 2;
                dotmask >>= 2;
                break;
    
            case 2:
                {
                    const uint32_t tmp = 10 * (data[0] - '0') + (data[1] - '0');
                    if (tmp < 10) {
                        res.err = errLeadingZeros;
                        return res;
                    }
                    res.byte[i] = tmp;
                }
                data += 3;
                dotmask >>= 3;
                break;
    
            case 3:
                {
                    const uint32_t tmp = 100 * (data[0] - '0') + 10 * (data[1] - '0') + (data[2] - '0');
                    if (tmp > 0xff) {
                        res.err = errTooBig;
                        return res;
                    }
                    if (tmp < 100) {
                        res.err = errLeadingZeros;
                        return res;
                    }
    
                    res.byte[i] = tmp;
                }
    
                data += 4;
                dotmask >>= 4;
                break;
    
            default:
                if (n > 0) {
                    res.err = errTooManyDigits;
                } else {
                    res.err = errEmptyField;
                }
                return res;
        }
    }
    

Unrolled conversion

While we can learn upfront what are widths of all fields, we utilize this fact. For each possible valid combination we pre-compute a specialised path that converts given combination of numbers.

This is the procedure "SSE (v2)" from sample code.

  1. As in the previous solution, we add an artificial one to the dot mask, just after the last character.

    dotmask |= uint16_t(1) << ipv4.size();
    
  2. Convert from byte mask into 8-bit code. Although there are only 81 valid dot masks, they are stored as variable-length codes. We iterator over fields and encode their lengths on two bits, forming a byte. For our sample IPv4 "195.42.3.0" we have lengths [3, 2, 1, 1] and it's encoded as [0b11, 0b10, 0b01, 0b01] => 0b11100101 = 0xe5.

    In parallel to code building, we validate if all fields have valid lengths (1, 2 or 3 chars).

    const uint8_t* byte = (const uint8_t*)ipv4.data();
    uint8_t code = 0;
    for (int i=0; i < 4; i++) {
        const uint8_t n = __builtin_ctz(dotmask);
        if (n == 0) {
            res.err = errEmptyField;
            return res;
        }
        if (n > 3) {
            res.err = errTooManyDigits;
            return res;
        }
    
        code = (code << 2) | n;
        dotmask >>= n + 1;
    }
    
  3. The next step is to use the code value to choose specialised converter. It's a huge switch generated by a program; here we show just a single case.

    uint32_t val0;
    uint32_t val1;
    uint32_t val2;
    uint32_t val3;
    switch (code) {
        // ...
        case 0xe5: // [3, 2, 1, 1]
            tmp = (data[0] - '0');                                  // 3-digit number
            tmp = 10 * tmp + (data[1] - '0');
            tmp = 10 * tmp + (data[2] - '0');
            if (tmp < 100) { res.err = errLeadingZeros; break; }
            if (tmp > 255) { res.err = errTooBig; break; }
            res.byte[0] = tmp;
            res.byte[1]  = 10 * (data[4] - '0') + (data[5] - '0');  // 2-byte number
            if (res.byte[1] < 10) { res.err = errLeadingZeros; break; }
            res.byte[2] = (data[7] - '0');                          // single-digit number
            res.byte[3] = (data[9] - '0');                          // single-digit number
            break;
        // ...
    }
    

Unrolled conversion (group by input length)

The approach is almost exactly as the previous one. The only difference is in how we choose the parsing specialisation.

We're doing that in two steps. First, we use the input length. This already limits the number of possibilities (see the table above). Next, we match the bitpattern as is, without any transcoding.

This is the procedure "SSE (v3)" from sample code.

Below is shown the part of switch that handles pattern [3, 2, 1, 1].

uint32_t tmp;
switch (ipv4.size()) {
    case 7: // 1 case(s)
            // ...
    case 8: // 4 case(s)
            // ...
    case 9: // 10 case(s)
            // ...
    case 10: // 16 case(s)
        switch (dotmask & 0x3ff) {
            // ...
            case 0x148: // [3, 2, 1, 1]
                tmp = (byte[0] - '0');
                tmp = 10 * tmp + (byte[1] - '0');
                tmp = 10 * tmp + (byte[2] - '0');
                if (tmp < 100) { res.err = errLeadingZeros; break; }
                if (tmp > 255) { res.err = errTooBig; break; }
                res.byte[0] = tmp;
                res.byte[1]  = 10 * (byte[4] - '0') + (byte[5] - '0');
                if (res.byte[1] < 10) { res.err = errLeadingZeros; break; }
                res.byte[2] = (byte[7] - '0');
                res.byte[3] = (byte[9] - '0');
                break;
            // ...
        }
        break;
    // ...
}

SIMD conversion

Since the digit validation and dot-pattern generation is done by SIMD code, getting back to the scalar domain seems not to be optimal.

We can use SIMD to perform full conversion. From the dot bitmask we determine how to layout the input bytes to make them processable by SIMD instructions. This is done by single PSHUFB, then we pick a proper procedure.

Unlike the scalar unrolled code, we do not have to enumerate all 81 possibilities. We roughly classify an IPv4 address as:

  1. having only single-digit numbers (like "1.2.3.4");
  2. having 1- and 2-digit numbers (like "11.2.33.44");
  3. having 1-, 2- and 3-digit numbers (like: "1.222.111.44").

In fact we may enumerate eight classes (for instance: only 2-digit numbers, only 2- or 3-digit numbers), but it's not clear if such fragmentation has much sense. Using the maximum numeric field length is good enough.

Selecting procedure

A SIMD conversion procedure requires the following parameters:

  • Pattern for the PSHUFB instruction — to layout input in the format required by a conversion procedure.
  • Procedure number — a proper conversion procedure id (in practise maximum length of numeric field).
  • The exact dot mask, that is compared with the current dot mask to filter out false negatives.

These parameters fit in 20 bytes. When we express the dot mask as a base-3 number or hash function #1, we may have a single array of parameters, having — respectively — 81 or 137 entries.

Procedures "SSE (v5)" and "SSE (v6)" follow the pattern:

hashcode = hash(dotmask)
if (hashcode > max_hashcode) {
    // return error
}
spec = lookup[hashcode]
if (dotmask != spec.expected_dotmask) {
    // return error
}

switch (spec.max_length) {
    case 1:
        res.ipv4 = conversion_max1(spec.pshufb_pattern)
        return res
    case 2:
        res.ipv4 = conversion_max2(spec.pshufb_pattern)
        return res
    case 3:
        res.ipv4 = conversion_max3(spec.pshufb_pattern)
        return res
    default:
        // return error
}

However, when we use hash function #2 which max value is 1228, we use two tables. The first lookup encodes transition from the hash value into length pattern. It should be number from 0 to 80 plus one special value for "not found". The second table of size 82 has actual parameters.

Procedure "SSE (v7)" follows the pattern:

hashcode = hash(dotmask)
if (hashcode > max_hashcode) {
    // return error
}
id = lookup1[hashcode]
if (id == not_found_value) {
    // return error
}
spec = lookup2[id]
if (dotmask != spec.expected_dotmask) {
    // return error
}

switch (spec.max_length) {
    case 1:
        res.ipv4 = conversion_max1(spec.pshufb_pattern)
        return res
    case 2:
        res.ipv4 = conversion_max2(spec.pshufb_pattern)
        return res
    case 3:
        res.ipv4 = conversion_max3(spec.pshufb_pattern)
        return res
    default:
        // return error
}

Conversion of 1-digit numbers

In this case there's no need to validate anything, just pick the bytes from even indices. PSHUFB layouts the input digits in the lower 32-bit word of an SSE register and we simply copy them with single instruction.

const __m128i pattern = _mm_setr_epi8(0, 2, 4, 6, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);
const __m128i t1 = _mm_shuffle_epi8(t0, pattern);
res.ipv4 = _mm_cvtsi128_si32(t1) & 0x0f0f0f0f;

Conversion of 1- and 2-digit numbers

In this case we have only to check if the 2-digit fields do not contain ASCII '0' on their most significant position. We lay digits in two 32-bit words. The lower one contains the most significant ASCII digits, or value 0 for 1-digit fields. The higher 32-bit word contains the least significant digits.

  1. Let's see how we would parse address "12.3.45.6".

    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │..│..│..│..│..│..│..│36│2e│35│34│2e│33│2e│32│31│ = input
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
     ^                                             ^
     byte 15                                  byte 0
    
  2. First, we split the input in the described way.

    const __m128i t1 = _mm_shuffle_epi8(t0, pattern);
    
    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │    unused 64 bits     │  │  │  │  │ 0│34│ 0│31│ = lower 32-bit word
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │    unused 64 bits     │36│35│33│32│  │  │  │  │ = higher 32-bit word
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    

    The actual byte layout is:

    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │    unused 64 bits     │36│35│33│32│ 0│34│ 0│31│ = t1
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    
  3. Next, compare the t1 vector with ASCII '0' (hex 0x30). We are interested only in the lowest 32-bit result.

    const __m128i ascii0 = _mm_set1_epi8('0');
    const __m128i t2     = _mm_cmpeq_epi8(t1, ascii0);
    
    ┌───────────────────────────────────┬──┬──┬──┬──┐
    │            unused 96 bits         │30│30│30│30│ = ascii0
    └───────────────────────────────────┴──┴──┴──┴──┘
    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │    unused 64 bits     │36│35│33│32│ 0│34│ 0│31│ = t1
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌───────────────────────────────────┬──┬──┬──┬──┐
    │            unused 96 bits         │00│00│00│00│ = t2 = cmp(ascii0, t1)
    └───────────────────────────────────┴──┴──┴──┴──┘
    
  4. As the next step, we extract that lowest word from t2. If it is not zero, that means there is at least one 2-digit field having the leading zero.

    const uint64_t msd = _mm_cvtsi128_si64(t2);
    if (msd != 0) {
        res.err = errLeadingZeros;
        return res;
    }
    
  5. Once we know the input is valid, we extract the lower 64 bits from t1 and get back to scalar code. There is simply too few data to use SIMD instructions for that.

    const uint64_t ascii = _mm_cvtsi128_si64(t1);
    const uint64_t w01   = ascii & 0x0f0f0f0f0f0f0f0f;
    const uint32_t w0    = w01 >> 32;
    const uint32_t w1    = w01 & 0xfffffffflu;
    
    ┌──┬──┬──┬──┬──┬──┬──┬──┐
    │36│35│33│32│ 0│34│ 0│31│ = ascii
    └──┴──┴──┴──┴──┴──┴──┴──┘
    ┌──┬──┬──┬──┬──┬──┬──┬──┐
    │ 6│ 5│ 3│ 2│ 0│ 4│ 0│ 1│ = w01
    └──┴──┴──┴──┴──┴──┴──┴──┘
    
                ┌──┬──┬──┬──┐
                │ 6│ 5│ 3│ 2│ = w0
                └──┴──┴──┴──┘
                ┌──┬──┬──┬──┐
                │ 0│28│ 0│ a│ = w1
                └──┴──┴──┴──┘
                    40    10 decimal
    
  6. Finally, we convert the individual digits into actual values. We may freely multiply by 10 and add, as we will never get an overflow; the SWAR solution works fine.

    res.ipv4 = 10 * w1 + w0;
    
    ┌──┬──┬──┬──┐
    │ 6│2d│ 3│ c│ = IPv4
    └──┴──┴──┴──┘
    

Conversion of 1-, 2- and 3-digit numbers

In this case we split input into four sets of digits.

  1. Let's analyse parsing of "127.45.6.222".

                  2  2  2  .  6  .  5  4  .  7  2  1
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │..│..│..│..│32│32│32│2e│36│2e│35│34│2e│37│32│31│ = input
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
     ^                                             ^
     byte 15                                  byte 0
    
  2. We split the input into four subwords:

    • digits #0;
    • digits #1 or zero for single-digit numbers;
    • digits #2 or zero for 1- and 2-digit numbers;
    • most significant ASCII digits of 2- and 3-digit numbers, 0 otherwise.
    const __m128i t1 = _mm_shuffle_epi8(t0, pattern);
    
    127.45.6.234
    │││ ││ │ │││
    ││└──┴─┴───┴── digit #0 [7, 5, 6, 4]
    ││  │    ││
    │└──┴─────┴─── digit #1 [2, 4, -, 3]
    │        │
    └────────┴──── digit #2 [1, -, -, 2]
    
    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │                       │  │37│  │35│  │36│  │34│ = digit #0
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌───────────────────────┬──┬──┬──┬──┬──┬──┬──┬──┐
    │                       │32│  │34│  │ 0│  │33│  │ = digit #1
    └───────────────────────┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┐
    │  │31│  │ 0│  │ 0│  │32│                       │ = digit #2
    └──┴──┴──┴──┴──┴──┴──┴──┴───────────────────────┘
    ┌──┬──┬──┬──┬──┬──┬──┬──┬───────────────────────┐
    │31│  │34│  │ 0│  │32│  │                       │ = most significant digits
    └──┴──┴──┴──┴──┴──┴──┴──┴───────────────────────┘
    

    The actual layout after PSHUFB:

    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │31│31│34│ 0│ 0│ 0│32│32│32│37│34│35│ 0│36│33│34│ = t1
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
    
  3. Check if any significant digit is ASCII '0'. We only use the results for odd bytes from the higher half of word t1, thus masking with 0xAA00.

    const __m128i ascii0 = _mm_set1_epi8('0'); // 0x30
    const __m128i eq0    = _mm_cmpeq_epi8(t1, ascii0);
    const uint16_t res   = _mm_movemask_epi8(eq0);
    if ((res & 0xaa00) != 0) {
        // report leading zeros
    }
    
  4. Convert from ASCII to digits. It's sufficient to subtract with saturation value 0x30 (ASCII '0').

    const __m128i t2 = _mm_subs_epu8(t1, ascii0);
    
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │ 1│ 1│ 4│ 0│ 0│ 0│ 2│ 2│ 2│ 7│ 4│ 5│ 0│ 6│ 3│ 4│ = t2
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
    
  5. Then, we calculate 10 * digit #1 + digit #0 in the lower half, and 100 * digit #2. To do this, we use instruction PMADDUBSW that multiplies horizontally bytes and add adjacent results into 16-bit values.

    The multipliers for most significant digits from upper half are set to zero, thanks to that we do not need any extra masking.

    const __m128i weights = _mm_setr_epi8(
        10, 1, 10, 1, 10, 1, 10, 1,     //  10 = 0x0a
        100, 0, 100, 0, 100, 0, 100)    // 100 = 0x64
    
    const __m128i t3 = _mm_maddubs_epi16(t2, weights)
    
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │ 0│64│ 0│64│ 0│64│ 0│64│ a│ 1│ a│ 1│ a│ 1│ a│ 1│ = weights
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
    │ 1│ 1│ 4│ 0│ 0│ 0│ 2│ 2│ 2│ 7│ 4│ 5│ 0│ 6│ 3│ 4│ = t2
    └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │  100│    0│    0│  200│   27│   45│    6│   34│ = t3 = maddubs(t2, weights)
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘        [shown in decimal]
    
  6. Now we have to combine these partial results. Simply shift right the t3 by 8 bytes (with PALIGNR) and add with the original constant of t3.

    const __m128i t4 = _mm_alignr_epi8(t3, 8);
    const __m128i t5 = _mm_add_epi16(t4);
    
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │  100│    0│    0│  200│   27│   45│    6│   34│ = t3
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌───────────────────────┬─────┬─────┬─────┬─────┐
    │        unused         │  100│    0│    0│  200│ = t4
    └───────────────────────┴─────┴─────┴─────┴─────┘
    ┌───────────────────────┬─────┬─────┬─────┬─────┐
    │        unused         │  127│   45│    6│  234│ = t5 = t3 + t4
    └───────────────────────┴─────┴─────┴─────┴─────┘
                              0x3F  0x2D  0x06  0xEA
    
  7. The almost final result is stored in t4 on 16-bit values. We have to check if all are not greater than 255. It's done by a regular comparison.

    const __m128i  gt     = _mm_cmpgt_epi16(t4, _mm_set1_epi16(255));
    const uint16_t gtmask = _mm_movemask_epi8(gt);
    if ((gtmask & 0x00ff) != 0) {
        // report too big fields
    }
    
  8. Finally, we need to convert from uint16 to uint8. Since we know the values in t4 fit in a byte, pack with unsigned saturation is sufficient.

    const __m128i t6 = _mm_packus_epi16(t5, t5);
    ipv4 = _mm_cvtsi128_si32(t6);
    

Benchmarks

As the baseline we selected the best result from scalar procedures: go, glibc and naive.

Tested procedures

Tested procedures
procedure comment
go A translation into C++ the Go procedure.
glibc A C procedure from Glibc with a thin wrapping for C++.
naive A from scratch implementation of sequential parsing, as described in introduction.
naive (no validation) It converts only valid IPv4 addresses; used to estimate the lower bound of the naive implementation.
SSE It uses the SIMD validation, and scalar sequential conversion.
SSE (v2) It uses the SIMD validation, and unrolled scalar conversion with parsing schema based on a single switch. The switch statement takes lengths pattern encoded as an 8-bit values.
SSE (v3) It uses the SIMD validation, and unrolled scalar conversion with parsing schema using nested switches. The outer switch uses the input length, nested switches are based on explicit dot mask values.
SSE (v4) It uses both the SIMD validation, and SIMD number conversion with nested switches: the outer one uses the input length, nested ones are based on explicit dot mask values.
SSE (v5) It uses both the SIMD validation, and SIMD number conversion with parsing schema selection using complex hash function
SSE (v6) It uses both SIMD validation, and SIMD number conversion with parsing schema selection using base-3 numbers.
SSE (v7) It uses both SIMD validation, and SIMD number conversion with parsing schema selection using simpler hash function.

Results from SkyLake machine

CPU:Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Compiler:g++-9 (Ubuntu 9.3.0-10ubuntu2~16.04) 9.3.0
procedure best avg. speedup  
  [cycles] [cycles]    
naive 79.962 80.001 0.97 ████████████████▍
glibc 77.893 78.201 1.00 ████████████████▊
go 85.522 85.918 0.91 ███████████████▎
naive (no validation) 48.328 48.458 1.61 ███████████████████████████▏
SWAR 84.235 84.258 0.92 ███████████████▌
SSE 47.295 47.344 1.65 ███████████████████████████▋
SSE (v2) 56.745 56.792 1.37 ███████████████████████
SSE (v3) 40.779 40.823 1.91 ████████████████████████████████▏
SSE (v4) 31.922 31.938 2.44 █████████████████████████████████████████
SSE (v5) 28.127 28.139 2.77 ██████████████████████████████████████████████▋
SSE (v6) 34.278 34.287 2.27 ██████████████████████████████████████▎
SSE (v7) 26.239 26.251 2.97 ██████████████████████████████████████████████████
SSE (v8) 33.645 33.655 2.32 ██████████████████████████████████████▉

Results from IceLake machine

CPU:Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Compiler:gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-14)
procedure best avg. speedup  
  [cycles] [cycles]    
naive 42.217 42.601 1.00 ███████████████████▊
glibc 58.327 58.717 0.72 ██████████████▎
go 51.779 51.914 0.82 ████████████████▏
naive (no validation) 32.414 32.491 1.30 █████████████████████████▊
SWAR 54.670 54.718 0.77 ███████████████▎
SSE 28.496 28.550 1.48 █████████████████████████████▎
SSE (v2) 38.016 38.108 1.11 █████████████████████▉
SSE (v3) 23.586 23.711 1.79 ███████████████████████████████████▍
SSE (v4) 16.716 16.778 2.53 ██████████████████████████████████████████████████
SSE (v5) 19.385 19.438 2.18 ███████████████████████████████████████████
SSE (v6) 25.097 25.129 1.68 █████████████████████████████████▎
SSE (v7) 17.041 17.110 2.48 █████████████████████████████████████████████████
SSE (v8) 23.761 23.793 1.78 ███████████████████████████████████▏

Source

All programs are available on Github.

Appendix

Appendix A — the Go source

func parseIPv4(s string) (ip Addr, err error) {
    var fields [4]uint8
    var val, pos int
    var digLen int // number of digits in current octet
    for i := 0; i < len(s); i++ {
        if s[i] >= '0' && s[i] <= '9' {
            if digLen == 1 && val == 0 {
                return Addr{}, parseAddrError{in: s, msg: "IPv4 field has octet with leading zero"}
            }
            val = val*10 + int(s[i]) - '0'
            digLen++
            if val > 255 {
                return Addr{}, parseAddrError{in: s, msg: "IPv4 field has value >255"}
            }
        } else if s[i] == '.' {
            if i == 0 || i == len(s)-1 || s[i-1] == '.' {
                return Addr{}, parseAddrError{in: s, msg: "IPv4 field must have at least one digit", at: s[i:]}
            }
            if pos == 3 {
                return Addr{}, parseAddrError{in: s, msg: "IPv4 address too long"}
            }
            fields[pos] = uint8(val)
            pos++
            val = 0
            digLen = 0
        } else {
            return Addr{}, parseAddrError{in: s, msg: "unexpected character", at: s[i:]}
        }
    }
    if pos < 3 {
        return Addr{}, parseAddrError{in: s, msg: "IPv4 address too short"}
    }
    fields[3] = uint8(val)
    return AddrFrom4(fields), nil
}

Appendix B — the Glibc source

static int inet_pton4 (const char *src, const char *end, unsigned char *dst)
{
  int saw_digit, octets, ch;
  unsigned char tmp[NS_INADDRSZ], *tp;

  saw_digit = 0;
  octets = 0;
  *(tp = tmp) = 0;
  while (src < end)
    {
      ch = *src++;
      if (ch >= '0' && ch <= '9')
        {
          unsigned int new = *tp * 10 + (ch - '0');

          if (saw_digit && *tp == 0)
            return 0;
          if (new > 255)
            return 0;
          *tp = new;
          if (! saw_digit)
            {
              if (++octets > 4)
                return 0;
              saw_digit = 1;
            }
        }
      else if (ch == '.' && saw_digit)
        {
          if (octets == 4)
            return 0;
          *++tp = 0;
          saw_digit = 0;
        }
      else
        return 0;
    }
  if (octets < 4)
    return 0;
  memcpy (dst, tmp, NS_INADDRSZ);
  return 1;
}