Modern perfect hashing for strings

Author: Wojciech Muła
Added on: 2023-04-30
Updated on: 2023-04-02 (grammar corrections)


Introduction

Looking up a string in a static set is a common problem we encounter when parsing textual formats. Such sets are often the keywords of a programming language or protocol.

Parsing HTTP verbs appeared to be fastest when using a compile-time trie: a series of nested switch statements. I could not believe that a perfect hash function would not be better, and that led to a novel hashing approach based on the instruction PEXT (Parallel Bits Extract).

Briefly, when constructing a perfect hash function, we look for the smallest set of input bytes that then becomes the input to some function that combines them into a single value. The instruction PEXT quickly constructs any n-bit subword from a 64-bit word; its latency is 3 CPU cycles on current processors. This allows us to extend the scheme to look for the smallest subset of bits. The n-bit word is then the input to a function that translates it into the desired value.

Instead of something like:

func hash(s []byte) uint64 {
    // read bytes at indices a, b, c
    // and push forward
    return hash_bytes(s[a], s[b], s[c])
}

We have:

func hash(s []byte) uint64 {
    // read bytes at indices d and e, and form a temp value
    tmp := (uint64(s[d]) << 8) | uint64(s[e])

    // from the temp value (bytes d and e) extract the crucial bits
    val := PEXT(tmp, precomputed_mask)

    return hash_uint64(val)
}

Please note that, depending on the string set, the number of bytes read in both schemes can vary. It is not a rule that a bit-level hash function touches fewer bytes than a byte-level hash function.

Apart from the above hashing scheme, this text also describes constructing a compile-time hash table and a compile-time switch.

All source code is available.

PEXT recap

The instruction PEXT takes two arguments: an input word and a mask. Bits of the input word at positions where the mask is 1 are copied, in order, to the least significant bits of the output. For example:

word:   0010101011010111
mask:   0011100100100010
masked: __101__0__0___1_
PEXT:   __________101001
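The same example expressed in C++ (a minimal sketch; _pext_u64 comes from <immintrin.h> and requires the BMI2 extension, so compile with -mbmi2):

#include <cstdint>
#include <cstdio>
#include <immintrin.h>

int main() {
    const uint64_t word = 0b0010101011010111;
    const uint64_t mask = 0b0011100100100010;
    // extracts the six masked bits and packs them into the low bits
    const uint64_t result = _pext_u64(word, mask);
    printf("%#llx\n", (unsigned long long)result); // prints 0x29, i.e. 0b101001
    return 0;
}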

Constructing a perfect hash function

Short recap: a perfect hash function maps every key from a static set to a distinct index; a minimal perfect hash function additionally maps n keys onto exactly the indices 0..n-1.

While minimal perfect hash functions are hard to find, perfect hash functions are quite easy to compute. We will describe two methods:

Open addressing

When we build a generic hash table implementation, collisions are unavoidable. The most common solution to this problem is using some kind of container (a linked list or balanced tree) to keep all objects sharing the same hash value.

A less common strategy is open addressing. We store values directly in the table. If there is a collision — that is, the given index is already allocated — we try another index. There are several approaches to pick a new index; it simply may be the next index (called linear probing), but there are of course more sophisticated procedures (quadratic probing, double hashing, etc.).
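As an illustration, here is a minimal sketch of computing the probe sequence for linear probing (the helper name probe is made up for this example):

#include <cstddef>
#include <cstdint>

// linear probing: on the i-th collision we simply try the next slot
size_t probe(uint64_t hash, size_t attempt, size_t N) {
    return (hash + attempt) % N;
}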

In our case, where we have a static set of strings, we use the simplest solution. Our goal is to find the minimum table size N at which the number of collisions k is as small as possible, preferably zero.

The input parameters are: 1) a set of strings and 2) a hash function; to avoid calling the hash function over and over, we cache the hash value of each string.

The main procedure is responsible for setting the maximum number of collisions; it starts from 1 and increases it until we succeed. The required number depends on the quality of the hash function, but often it is just 2 or 3 collisions. The Go implementation of the procedure is shown below.

Then, for some range of table sizes, we check the actual number of collisions, and we report the minimum table size.

func computerHashParameters(keywords []Keyword, hash func([]byte) uint64) (size int, collisions int) {
    hashes := make([]uint64, len(keywords))
    for i := range keywords {
        hashes[i] = hash(keywords[i].word)
    }
    collisions = 1
    for {
        size = hashtableaux(hashes, collisions)
        if size > 0 {
            fmt.Printf("size = %5d, maxcollisions = %d\n", size, collisions)
            return
        }
        collisions += 1
    }
}

Below is the implementation of the helper function. It checks table sizes starting from the size of the input set up to ten times that size; the multiplier of 10 appeared to be a good upper bound.

func hashtableaux(hashes []uint64, maxcollisions int) int {
    n := len(hashes)
    const sizeOverhead = 10

outer:
    for N := n; N < sizeOverhead*n; N++ {
        table := make(map[uint64]int)
        for _, h := range hashes {
            idx := h % uint64(N)
            table[idx] += 1
            if table[idx] > maxcollisions {
                continue outer
            }
        }
        return N
    }

    return -1
}

Now, having the parameters N and k, we can finally write the lookup procedure. The following C++ procedure looks up the Go keywords.

int lookup_go_hash1(std::string_view s) {
    const uint64_t idx = (hash1(s) % 38) * 2;
    static std::string_view lookup[76] = {
        "chan", // 836 (0x344)
        "", "", "",
        "case", // 800 (0x320)
        "for", // 648 (0x288)
        "", "",
        "continue", // 1600 (0x640)
        "", "", "",
        "defer", // 1070 (0x42e)
        "func", // 804 (0x324)
        "", "", "", "",
        "package", // 1491 (0x5d3)
        "",
        "else", // 808 (0x328)
        "go", // 428 (0x1ac)
        "const", // 1075 (0x433)
        "range", // 1075 (0x433)
        "var", // 696 (0x2b8)
        "", "", "",
        "return", // 1344 (0x540)
        "", "", "", "", "",
        "map", // 663 (0x297)
        "",
        "select", // 1386 (0x56a)
        "struct", // 1386 (0x56a)
        "", "",
        "goto", // 856 (0x358)
        "", "", "",
        "switch", // 1314 (0x522)
        "", "", "",
        "fallthrough", // 2266 (0x8da)
        "", "", "", "", "", "", "", "", "", "", "",
        "default", // 1512 (0x5e8)
        "interface", // 1854 (0x73e)
        "", "",
        "type", // 868 (0x364)
        "", "", "",
        "if", // 414 (0x19e)
        "import", // 1326 (0x52e)
        "", "", "", "",
        "break", // 1025 (0x401)
        "", };
    static int values[76] = {
        2, // 836 (0x344)
        -1, -1, -1,
        1, // 800 (0x320)
        9, // 648 (0x288)
        -1, -1,
        4, // 1600 (0x640)
        -1, -1, -1,
        6, // 1070 (0x42e)
        10, // 804 (0x324)
        -1, -1, -1, -1,
        17, // 1491 (0x5d3)
        -1,
        7, // 808 (0x328)
        11, // 428 (0x1ac)
        3, // 1075 (0x433)
        18, // 1075 (0x433)
        24, // 696 (0x2b8)
        -1, -1, -1,
        19, // 1344 (0x540)
        -1, -1, -1, -1, -1,
        16, // 663 (0x297)
        -1,
        20, // 1386 (0x56a)
        21, // 1386 (0x56a)
        -1, -1,
        12, // 856 (0x358)
        -1, -1, -1,
        22, // 1314 (0x522)
        -1, -1, -1,
        8, // 2266 (0x8da)
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        5, // 1512 (0x5e8)
        15, // 1854 (0x73e)
        -1, -1,
        23, // 868 (0x364)
        -1, -1, -1,
        13, // 414 (0x19e)
        14, // 1326 (0x52e)
        -1, -1, -1, -1,
        0, // 1025 (0x401)
        -1,
    };
    for (int i=0; i < 2; i++) {
        if (lookup[idx + i] == s) {
            return values[idx + i];
        }
    }
    return -1;
}

First we have a call to hash1 on the input string. Its result is taken modulo 38 — that is the value N found by the above algorithm. Likewise, the magic constant 2 is the parameter k, the number of collisions we have to handle.

The lookup procedure contains two auxiliary tables of size N * k = 76: one for strings and another for values; the not-found value is -1. (Note that we could use a single array of structures; however, it's hard to tell what's better for typically small arrays.) After computing the index idx, we simply run the following check twice:

if (lookup[idx + i] == s) {
    return values[idx + i];
}

Please note that all parameters of the hash table are known at compile time, thus a compiler can get rid of the division and apply additional optimizations.

Below is the assembly code of the procedure, compiled for Skylake architecture with GCC 12.2.0.

<lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)>:
    push   %r14
    push   %r13
    mov    %rsi,%r13
    push   %r12
    push   %rbp
    mov    %rdi,%rbp
    push   %rbx

    movsbq (%rsi),%rax                  // f := first char
    movsbq -0x1(%rdi,%rsi,1),%rcx       // l := last char
    add    %rax,%rcx                    // their sum
    imul   %rdi,%rcx                    // (f + l) * s.size()

    movabs $0xd79435e50d79435f,%rax     //
    mul    %rcx                         //
    shr    $0x5,%rdx                    //
    imul   $0x26,%rdx,%rax              //
    sub    %rax,%rcx                    //
    mov    %rcx,%rdx                    //
    lea    (%rcx,%rcx,1),%r12           //
    shl    $0x5,%rdx                    // rdx = ((f + l) * s.size()) % 38

    // The for loop. GCC split the loop -- it quickly compares the length
    // of lookup[idx + i] with s.size() (rbp) and jumps to the byte
    // comparison only if the lengths are equal.
    lea    0x267b47(%rip),%rax          // rax = lookup address
    lea    (%rdx,%rax,1),%rbx           // rbx = &lookup[idx]
    lea    0x2(%r12),%r14
    mov    0x8(%rbx),%rdi
    cmp    (%rbx),%rbp                  // jump to full compare only if lengths are equal
    je     <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x78>
    inc    %r12
    add    $0x10,%rbx
    cmp    %r14,%r12
    jne    <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x52>
    pop    %rbx
    pop    %rbp
    pop    %r12
    pop    %r13
    mov    $0xffffffff,%eax
    pop    %r14
    ret
    nopl   (%rax)

    // check if bytes are equal
    test   %rbp,%rbp
    je     <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x8c>
    mov    %rbp,%rdx
    mov    %r13,%rsi
    call   <memcmp@plt>
    test   %eax,%eax
    jne    <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)+0x5b>
    pop    %rbx
    pop    %rbp
    lea    0x1f56bb(%rip),%rax        # <lookup_go_hash1(std::basic_string_view<char, std::char_traits<char> >)::values>
    mov    (%rax,%r12,4),%eax
    pop    %r12
    pop    %r13
    pop    %r14
    ret

Hash functions

The following hash functions were tested:

  • hash1 — sum of the first and last char multiplied by the string length;
  • hash2 — sum of the first and last char bit-xor'ed with the string length;
  • hash3 — first, second and last characters combined with the string length: ((f << 8) | s) + l + (n * 0x111);
  • hash_sum0 — sum of all bytes;
  • hash_sumN — sum of all bytes plus the string length;
  • hash_djb2 — Bernstein's hash;
  • hash_sdb — sdbm hash.

Splitting by input length

I observed that the string length is a decent discriminator; see my earlier text STL: map with string as key — access speedup. In most cases we obtain the length of a string in constant time; for instance, this C++ function simplifies to a few CPU instructions.

#include <cstdint>
#include <string_view>

uint64_t len(std::string_view s) {
   return s.size();
}
$ g++ -O2 -S test.cpp
$ cat test.s
    ...
    movq    %rdi, %rax
    ret
    ...

Taking this into account, we first split the set of strings into subsets of strings having the same length. These subsets are usually small, and we can fine-tune looking up in each subset independently. One common optimization present in the sample code is using a plain if-ladder of comparisons when a subset is small (has at most two or three elements).
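For instance, a lookup among the two-letter Go keywords reduces to a plain if-ladder; this is a sketch (the function name is made up, and the returned values match the earlier hash-table example):

#include <string_view>

int lookup_go_len2(std::string_view s) {
    // with only two candidates, direct comparisons beat any hashing
    if (s == "go") return 11;
    if (s == "if") return 13;
    return -1;
}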

Additionally, working with same-length strings enables more low-level optimizations:

  • The length of the string is usually known at compile time.
  • Thus, bounds checks can be eliminated.
  • Testing the equality of two strings from a subset is a simple memcmp, and this procedure may be further inlined and optimized by a compiler.

Using the minimum number of bits

Note: this method was inspired by the hashing procedure found in GNU gperf.

The presented method requires input strings to have the same length. We guarantee that property by pre-classifying the input string by its length, as described in the previous section.

gperf uses the following observation: not all input characters contribute to the uniqueness of the set. In most cases we may discard many input characters, and the modified set still contains unique items.

For instance, the five-letter Go keywords are break, const, defer, and range. The last character alone is sufficient to distinguish them:

break => ____k
const => ____t
defer => ____r
range => ____e

In the case of six-letter keywords, we need two characters, as shown below. The letter 't' repeats at the last position, but the fourth character makes this set unique.

import => ___o_t
return => ___u_n
select => ___e_t
struct => ___u_t
switch => ___t_h

The major problem is that we somehow need to combine these unique characters into a unique and relatively small value. gperf adds the characters after passing them through an auxiliary lookup table, like: lookup[s[0]] + lookup[s[5]].
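For illustration, a gperf-style hash for the five-letter Go keywords could look like the sketch below; the assoc values here are hand-picked, whereas gperf derives such a table automatically:

#include <array>
#include <cstdint>
#include <string_view>

int hash_gperf_like(std::string_view s) {
    // association table: maps the distinguishing (last) character
    // to a small, unique index: k -> 0, t -> 1, r -> 2, e -> 3
    static const auto assoc = [] {
        std::array<uint8_t, 256> t{};
        t['t'] = 1; t['r'] = 2; t['e'] = 3; // 'k' stays 0
        return t;
    }();
    return assoc[static_cast<uint8_t>(s[4])];
}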

Similarly to discarding individual bytes, we may discard individual bits, ending up with a bitmask that can be used to extract the significant bits. To do this quickly, we use the instruction PEXT.

Before we dig into details, let us re-examine the examples. For five-letter Go keywords we have:

01100010.01110010.01100101.01100001.01101011 => ________.________.________.________.______11
01100011.01101111.01101110.01110011.01110100 => ________.________.________.________.______00
01100100.01100101.01100110.01100101.01110010 => ________.________.________.________.______10
01110010.01100001.01101110.01100111.01100101 => ________.________.________.________.______01

And for six-letter ones:

01101001.01101101.01110000.01101111.01110010.01110100 => ________.________.________.___0____._______0._____1__
01110010.01100101.01110100.01110101.01110010.01101110 => ________.________.________.___1____._______0._____1__
01110011.01100101.01101100.01100101.01100011.01110100 => ________.________.________.___0____._______1._____1__
01110011.01110100.01110010.01110101.01100011.01110100 => ________.________.________.___1____._______1._____1__
01110011.01110111.01101001.01110100.01100011.01101000 => ________.________.________.___1____._______1._____0__

Thus, in the first case we need just 2 bits, and in the second case only 3 bits. The 2-bit subset contains all possible values (00, 01, 11, 10), while the 3-bit subset is not full (001, 101, 011, 111, 110).

A skeleton of C++ procedure that uses this approach is:

...
/* 1 */ const uint64_t word = load_bytes(s);
/* 2 */ const uint64_t idx  = _pext_u64(word, mask);
/* 3 */ if (memcmp(lookup[idx], s.data(), size) == 0) {
            return value[idx];
        }
...

Once we know which bits are significant, we need to find which bytes have to be loaded (1); a byte is the minimum amount of data we can transfer. As we can see, for five-letter keywords it is sufficient to load just one byte, but for six-letter keywords we need three bytes; in practice it is still a single load, done on a 32-bit entity.

Then, we extract the bits using the instruction PEXT (2) — these bits form an n-bit index.

We need auxiliary tables of size 2^n, having exactly the same meaning as in hashing. We compare the input string with the keyword (3), and if they are equal, we return the associated value.

This is a snippet from the procedure that looks up the Go keywords:

int lookup_go_pext(std::string_view s) {
    switch (s.size()) {
        // ...
        case 5: {
            static char lookup[4][5] = {
                {'b', 'r', 'e', 'a', 'k'},
                {'r', 'a', 'n', 'g', 'e'},
                {'d', 'e', 'f', 'e', 'r'},
                {'c', 'o', 'n', 's', 't'},
            };
            static int value[4] = {
                0,
                18,
                6,
                3,
            };
            const uint8_t w2 = s[4];
            const uint64_t idx = _pext_u64(w2, 0x14);
            if (memcmp(lookup[idx], s.data(), sizeof(lookup[0])) == 0) {
                return value[idx];
            }
        }
        break;
        case 6: {
            static char lookup[8][6] = {
                {}, // no match
                {'s', 'w', 'i', 't', 'c', 'h'},
                {}, // no match
                {'r', 'e', 't', 'u', 'r', 'n'},
                {'s', 'e', 'l', 'e', 'c', 't'},
                {'s', 't', 'r', 'u', 'c', 't'},
                {'i', 'm', 'p', 'o', 'r', 't'},
                {}, // no match
            };
            static int value[8] = {
                -1,
                22,
                -1,
                19,
                20,
                21,
                14,
                -1,
            };
            uint32_t w3 = 0;
            memcpy(&w3, &s[2], 4);
            const uint64_t idx = _pext_u64(w3, 0x10101000);
            if (memcmp(lookup[idx], s.data(), sizeof(lookup[0])) == 0) {
                return value[idx];
            }
        }
        break;
        // ...
    }
    return -1;
}

Variable-length set of words

Having strings of the same length is not a strict requirement. We can always pad the past-the-end characters with some easy-to-compute values; a zero-padding sketch follows the list below. The following values were tested during the writing of this text:

  • zero (0x00),
  • the first character,
  • the last character,
  • the input length,
  • the first/last character combined with the input length.
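A minimal sketch of the zero-padding variant (the helper load_padded is made up for this example; it assumes strings of at most eight bytes):

#include <cstdint>
#include <cstring>
#include <string_view>

uint64_t load_padded(std::string_view s) {
    uint8_t buf[8] = {0};                 // past-the-end bytes become 0x00
    const size_t n = s.size() < 8 ? s.size() : 8;
    memcpy(buf, s.data(), n);
    uint64_t word;
    memcpy(&word, buf, sizeof(word));     // assemble the final 64-bit word
    return word;
}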

Depending on the chosen method and the string set, the outcome differs. Including the length and one of the characters seems to help. However, pre-classifying inputs by their length appeared to be easier and faster, thus this topic is not covered here.

Finding the mask

Finding the mask is surprisingly straightforward. We start with a full mask and iterate over all its bits. For each bit, we clear it and then check whether the set of masked words is still unique. If so, we keep the bit cleared; otherwise we revert the change.

Below are the two procedures that compute the mask.

func computeMask(words []Keyword) (mask []byte, err error) {
    max := 0
    for _, w := range words {
        if n := len(w.word); n > max {
            max = n
        }
    }

    mask = make([]byte, max)
    for i := range mask {
        mask[i] = 0xff
    }

    ok := checksamesize(words, mask)
    if !ok {
        err = fmt.Errorf("set of words is not unique")
        return
    }

    for byteIdx := 0; byteIdx < max; byteIdx++ {
        for bitIdx := 0; bitIdx < 8; bitIdx++ {
            old := mask[byteIdx]
            mask[byteIdx] &= ^(1 << bitIdx)
            if !checksamesize(words, mask) {
                mask[byteIdx] = old
            }
        }
    }

    return
}

func checksamesize(words []Keyword, mask []byte) bool {
    M := make(map[string]struct{})
    word := make([]byte, len(mask))
    for i := range words {
        for j := range mask {
            word[j] = words[i].word[j] & mask[j]
        }

        s := string(word)
        _, ok := M[s]
        if ok {
            return false
        }
        M[s] = struct{}{}
    }

    return true
}

Loads

As the examples above have shown, often a single load is sufficient, as the significant bits fit in an 8-, 16-, 32-, or 64-bit word.

However, sometimes more loads are needed. In the generic case we concatenate the results of individual loads, forming a 32- or 64-bit word. Below is an example of such code, where we combine 8-bit and 16-bit input words into a single 32-bit word, which is the final argument to PEXT. Note that we only need to concatenate the values at runtime; the mask for PEXT is known at compile time.

uint32_t w1;
uint16_t w2 = 0;
memcpy(&w2, &s[0], 2);
w1 = uint32_t(w2);
const uint8_t w3 = s[2];
w1 |= (uint32_t(w3) << 16);
const uint64_t idx = _pext_u64(w1, 0x1e1104);

Another possibility arises when we have exactly two loads and their masks do not collide, that is, their bitwise AND is zero. Then we can merge the two words; sample code for such a case is shown below.

uint16_t w7 = 0;
memcpy(&w7, &s[10], 2);
const uint8_t w8 = s[2];
const uint16_t w9 = (uint16_t(w7) & 0x410) | (uint16_t(w8) & 0x4);
const uint64_t idx = _pext_u64(w9, 0x414);

Compile-time switches

To complete the picture, we present a better alternative to compile-time trie expansion. As the first step, the method splits the input by length. Then we build a switch statement for each subset of same-length strings; each subset is processed separately.

Like in the trie approach, we use a single character to select further processing. But since we deal with a subset of strings of the same size, we can pick a character at any index. We select an index that splits the subset into the greatest number of smaller subsets, ideally into single-element subsets (singletons). The process is repeated recursively; we only need to remember which indices were already used.

For example, the six-letter keywords of Java can be fully classified by the second character. Five-letter keywords are classified by the last character, except for one case (the letter 't'), where more tests are needed.

Below is the corresponding fragment of the lookup procedure.

int lookup_java_split(std::string_view s) {
    switch (s.size()) {
        // ...
        case 5: {
            switch (s[4]) {
                case 'w': { // 0x77
                    if (s == "throw") {
                        return 4;
                    }
                }
                break;
                case 'k': { // 0x6b
                    if (s == "break") {
                        return 10;
                    }
                }
                break;
                case 't': { // 0x74
                    switch (s[0]) {
                        case 's': { // 0x73
                            if (s == "short") {
                                return 23;
                            }
                        }
                        break;
                        case 'f': { // 0x66
                            if (s == "float") {
                                return 36;
                            }
                        }
                        break;
                        case 'c': { // 0x63
                            if (s == "const") {
                                return 39;
                            }
                        }
                        break;
                    }
                }
                break;
                case 'h': { // 0x68
                    if (s == "catch") {
                        return 25;
                    }
                }
                break;
                case 'l': { // 0x6c
                    if (s == "final") {
                        return 26;
                    }
                }
                break;
                case 'r': { // 0x72
                    if (s == "super") {
                        return 33;
                    }
                }
                break;
                case 'e': { // 0x65
                    if (s == "while") {
                        return 34;
                    }
                }
                break;
                case 's': { // 0x73
                    if (s == "class") {
                        return 35;
                    }
                }
                break;
            }
        }
        break;
        case 6: {
            switch (s[1]) {
                case 'h': { // 0x68
                    if (s == "throws") {
                        return 9;
                    }
                }
                break;
                case 'o': { // 0x6f
                    if (s == "double") {
                        return 11;
                    }
                }
                break;
                case 'm': { // 0x6d
                    if (s == "import") {
                        return 12;
                    }
                }
                break;
                case 'u': { // 0x75
                    if (s == "public") {
                        return 13;
                    }
                }
                break;
                case 'e': { // 0x65
                    if (s == "return") {
                        return 18;
                    }
                }
                break;
                case 't': { // 0x74
                    if (s == "static") {
                        return 28;
                    }
                }
                break;
                case 'a': { // 0x61
                    if (s == "native") {
                        return 37;
                    }
                }
                break;
                case 'w': { // 0x77
                    if (s == "switch") {
                        return 38;
                    }
                }
                break;
            }
        }
        break;
        // ...
    }
    return -1;
}

Experiments

Input sets

The input sets contain mostly keywords from programming languages.

Input sets
name            number of words  notes
ada-pred                     54  gperf tests
ada-res                      63  gperf tests
ada                          63  gperf tests
adadefs                      54  gperf tests
c                            32  gperf tests
countries                   192  English names of countries
cplusplus                   106  gperf tests
cpp                          47  gperf tests
go                           25
gpc                          35  gperf tests
irc                          45  gperf tests
java                         47  gperf tests
jscript                      53  gperf tests
modula                       53  gperf tests
modula2                      40  gperf tests
modula3                      53  gperf tests
pascal                       36  gperf tests
sneller                     240  SQL keywords and some functions from Sneller
snellerbuiltin              119  builtin functions from Sneller
states                       50  USA state names

Input data

The input words were repeated in random order to produce an approximately 1 MB block of text, not counting newline characters.

Keywords were mutated by changing a single letter (at a random position) into a random character or digit. The chance that a keyword was added intact was set to 0%, 25%, 50%, or 75%. This is what we call the "valid words density".

Tested procedures

Tested procedures
procedure   comments
pext        PEXT-based, bit-level function
split       Compile-time switches
hash1       Compile-time hash table with hash1 function
hash2       Compile-time hash table with hash2 function
hash3       Compile-time hash table with hash3 function
hash_sum0   Compile-time hash table with hash_sum0 function
hash_sumN   Compile-time hash table with hash_sumN function
hash_djb2   Compile-time hash table with hash_djb2 function
hash_sdb    Compile-time hash table with hash_sdb function

Skylake timings

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
Compiler: g++-9 (Ubuntu 9.3.0-10ubuntu2~16.04) 9.3.0

The numbers are given in CPU cycles; the speed-up columns are relative to the pext procedure.

                                valid words density
procedure   0%     speed-up   25%    speed-up   50%    speed-up   75%    speed-up
dataset ada
pext 33.20 1.000 41.47 1.000 45.67 1.000 39.57 1.000
split 61.81 0.537 67.50 0.614 67.69 0.675 65.62 0.603
hash1 75.38 0.440 77.63 0.534 73.33 0.623 64.66 0.612
hash2 72.78 0.456 76.73 0.540 72.23 0.632 64.01 0.618
hash3 55.63 0.597 67.57 0.614 68.79 0.664 63.79 0.620
hash_sum0 47.63 0.697 60.64 0.684 69.10 0.661 69.96 0.566
hash_sumN 50.68 0.655 61.91 0.670 70.02 0.652 68.14 0.581
hash_sdb 39.91 0.832 50.90 0.815 61.03 0.748 56.61 0.699
hash_djb2 41.22 0.805 51.66 0.803 60.32 0.757 55.12 0.718
dataset ada-pred
pext 37.41 1.000 45.55 1.000 46.60 1.000 41.09 1.000
split 63.65 0.588 68.54 0.665 66.71 0.699 62.59 0.657
hash1 87.09 0.430 88.23 0.516 84.02 0.555 70.16 0.586
hash2 88.20 0.424 89.78 0.507 79.61 0.585 70.49 0.583
hash3 64.07 0.584 64.80 0.703 60.81 0.766 49.78 0.825
hash_sum0 49.99 0.748 62.74 0.726 70.37 0.662 71.98 0.571
hash_sumN 52.95 0.706 64.19 0.710 75.90 0.614 74.72 0.550
hash_sdb 49.55 0.755 61.30 0.743 72.63 0.642 66.59 0.617
hash_djb2 46.79 0.800 58.65 0.777 66.92 0.696 64.21 0.640
dataset ada-res
pext 33.28 1.000 41.21 1.000 45.13 1.000 39.77 1.000
split 65.65 0.507 71.41 0.577 67.72 0.666 69.91 0.569
hash1 74.25 0.448 76.05 0.542 73.33 0.615 62.88 0.632
hash2 73.24 0.454 78.54 0.525 72.33 0.624 65.06 0.611
hash3 54.56 0.610 66.09 0.623 69.05 0.654 62.02 0.641
hash_sum0 47.62 0.699 58.82 0.700 72.94 0.619 68.58 0.580
hash_sumN 50.34 0.661 63.12 0.653 70.00 0.645 65.14 0.610
hash_sdb 40.05 0.831 51.05 0.807 61.46 0.734 56.13 0.709
hash_djb2 41.53 0.801 52.13 0.790 60.56 0.745 56.34 0.706
dataset adadefs
pext 37.85 1.000 45.68 1.000 46.70 1.000 41.30 1.000
split 62.59 0.605 67.59 0.676 67.50 0.692 63.26 0.653
hash1 90.23 0.419 91.55 0.499 82.56 0.566 72.14 0.573
hash2 85.91 0.441 86.50 0.528 82.34 0.567 67.33 0.613
hash3 66.51 0.569 67.66 0.675 58.06 0.804 52.81 0.782
hash_sum0 48.92 0.774 60.24 0.758 74.57 0.626 68.25 0.605
hash_sumN 53.62 0.706 66.31 0.689 73.98 0.631 78.04 0.529
hash_sdb 50.43 0.750 62.91 0.726 72.20 0.647 64.81 0.637
hash_djb2 45.90 0.824 58.03 0.787 69.26 0.674 62.68 0.659
dataset c
pext 31.99 1.000 40.50 1.000 44.03 1.000 38.06 1.000
split 50.58 0.632 55.28 0.733 58.78 0.749 53.15 0.716
hash1 61.26 0.522 62.69 0.646 58.98 0.746 49.40 0.770
hash2 61.59 0.519 62.57 0.647 53.50 0.823 48.16 0.790
hash3 51.56 0.620 58.94 0.687 56.15 0.784 45.65 0.834
hash_sum0 50.07 0.639 62.08 0.652 69.28 0.635 67.87 0.561
hash_sumN 45.62 0.701 58.30 0.695 66.23 0.665 61.98 0.614
hash_sdb 42.02 0.761 52.80 0.767 61.86 0.712 52.40 0.726
hash_djb2 40.45 0.791 51.62 0.784 58.87 0.748 55.34 0.688
dataset countries
pext 36.51 1.000 45.13 1.000 47.62 1.000 41.73 1.000
split 97.77 0.373 103.25 0.437 102.72 0.464 94.96 0.439
hash1 132.88 0.275 129.03 0.350 116.18 0.410 100.02 0.417
hash2 129.50 0.282 126.89 0.356 118.28 0.403 99.82 0.418
hash3 92.53 0.395 97.42 0.463 88.54 0.538 79.79 0.523
hash_sum0 62.07 0.588 78.25 0.577 86.10 0.553 88.92 0.469
hash_sumN 63.41 0.576 80.48 0.561 84.76 0.562 89.56 0.466
hash_sdb 50.96 0.716 64.70 0.698 76.18 0.625 76.97 0.542
hash_djb2 63.47 0.575 74.61 0.605 87.23 0.546 86.02 0.485
dataset cplusplus
pext 33.82 1.000 42.58 1.000 46.32 1.000 40.12 1.000
split 67.07 0.504 72.82 0.585 72.88 0.636 71.07 0.565
hash1 109.82 0.308 109.25 0.390 101.32 0.457 90.27 0.444
hash2 96.47 0.351 97.92 0.435 92.32 0.502 81.82 0.490
hash3 86.25 0.392 91.60 0.465 87.83 0.527 78.80 0.509
hash_sum0 51.95 0.651 66.86 0.637 76.79 0.603 80.85 0.496
hash_sumN 57.29 0.590 70.77 0.602 81.49 0.568 87.37 0.459
hash_sdb 44.25 0.764 56.91 0.748 67.76 0.684 59.27 0.677
hash_djb2 45.77 0.739 56.94 0.748 66.75 0.694 62.54 0.642
dataset cpp
pext 31.13 1.000 39.15 1.000 43.77 1.000 37.27 1.000
split 60.01 0.519 64.33 0.609 61.03 0.717 61.17 0.609
hash1 79.09 0.394 80.43 0.487 79.58 0.550 64.79 0.575
hash2 77.41 0.402 79.82 0.490 71.89 0.609 64.33 0.579
hash3 52.87 0.589 58.08 0.674 54.88 0.798 45.60 0.817
hash_sum0 47.12 0.661 60.08 0.652 67.08 0.653 65.49 0.569
hash_sumN 48.65 0.640 62.08 0.631 68.92 0.635 67.31 0.554
hash_sdb 42.10 0.740 52.74 0.742 61.94 0.707 55.81 0.668
hash_djb2 40.41 0.770 50.86 0.770 58.90 0.743 55.69 0.669
dataset go
pext 34.09 1.000 42.02 1.000 44.33 1.000 38.98 1.000
split 53.48 0.638 57.68 0.729 53.66 0.826 52.37 0.744
hash1 77.18 0.442 80.18 0.524 73.74 0.601 67.21 0.580
hash2 75.88 0.449 79.44 0.529 77.16 0.575 64.39 0.605
hash3 50.11 0.680 60.97 0.689 52.59 0.843 49.99 0.780
hash_sum0 45.55 0.748 55.76 0.754 64.22 0.690 59.05 0.660
hash_sumN 43.86 0.777 54.65 0.769 60.94 0.728 56.83 0.686
hash_sdb 44.85 0.760 56.65 0.742 65.52 0.677 55.36 0.704
hash_djb2 47.05 0.725 57.45 0.731 65.85 0.673 59.28 0.658
dataset gpc
pext 31.63 1.000 39.37 1.000 43.57 1.000 37.99 1.000
split 54.04 0.585 58.34 0.675 60.28 0.723 55.08 0.690
hash1 67.64 0.468 77.23 0.510 77.72 0.561 67.73 0.561
hash2 56.41 0.561 61.56 0.640 55.65 0.783 49.95 0.761
hash3 36.41 0.869 51.94 0.758 53.33 0.817 44.76 0.849
hash_sum0 48.02 0.659 59.74 0.659 68.07 0.640 69.51 0.547
hash_sumN 48.15 0.657 59.59 0.661 67.78 0.643 64.33 0.591
hash_sdb 39.74 0.796 51.98 0.757 60.72 0.718 55.70 0.682
hash_djb2 45.31 0.698 56.10 0.702 62.26 0.700 56.72 0.670
dataset irc
pext 30.92 1.000 39.45 1.000 43.61 1.000 35.90 1.000
split 61.98 0.499 66.63 0.592 67.25 0.649 63.39 0.566
hash1 91.91 0.336 91.33 0.432 84.69 0.515 75.90 0.473
hash2 73.70 0.420 76.54 0.515 75.02 0.581 61.11 0.587
hash3 49.63 0.623 58.92 0.670 50.82 0.858 46.29 0.776
hash_sum0 52.71 0.587 65.81 0.600 73.05 0.597 67.20 0.534
hash_sumN 57.54 0.537 71.23 0.554 75.28 0.579 70.94 0.506
hash_sdb 41.56 0.744 52.76 0.748 61.69 0.707 56.09 0.640
hash_djb2 41.20 0.751 52.06 0.758 62.05 0.703 56.49 0.636
dataset java
pext 32.81 1.000 41.97 1.000 44.86 1.000 40.05 1.000
split 61.79 0.531 63.11 0.665 67.20 0.668 60.71 0.660
hash1 84.10 0.390 87.53 0.479 80.63 0.556 70.61 0.567
hash2 79.72 0.412 79.53 0.528 74.78 0.600 61.47 0.652
hash3 54.22 0.605 61.39 0.684 52.91 0.848 48.92 0.819
hash_sum0 50.32 0.652 64.39 0.652 73.70 0.609 79.53 0.504
hash_sumN 51.68 0.635 64.23 0.654 70.61 0.635 72.76 0.550
hash_sdb 43.18 0.760 55.42 0.757 64.60 0.694 57.78 0.693
hash_djb2 43.35 0.757 53.60 0.783 64.25 0.698 59.08 0.678
dataset jscript
pext 32.62 1.000 41.07 1.000 45.48 1.000 39.26 1.000
split 55.46 0.588 60.59 0.678 63.60 0.715 57.62 0.681
hash1 76.26 0.428 78.06 0.526 76.60 0.594 63.68 0.617
hash2 74.21 0.440 75.27 0.546 67.39 0.675 60.76 0.646
hash3 53.13 0.614 60.49 0.679 57.44 0.792 49.12 0.799
hash_sum0 50.57 0.645 62.13 0.661 71.23 0.639 71.21 0.551
hash_sumN 53.58 0.609 64.87 0.633 76.94 0.591 76.87 0.511
hash_sdb 46.41 0.703 57.42 0.715 68.15 0.667 58.71 0.669
hash_djb2 44.39 0.735 54.69 0.751 63.35 0.718 59.26 0.663
dataset modula
pext 32.74 1.000 40.76 1.000 45.30 1.000 39.60 1.000
split 60.07 0.545 64.43 0.633 63.79 0.710 61.26 0.646
hash1 69.22 0.473 71.27 0.572 66.81 0.678 58.11 0.682
hash2 71.44 0.458 76.04 0.536 71.22 0.636 62.85 0.630
hash3 43.89 0.746 56.34 0.723 51.91 0.873 45.92 0.862
hash_sum0 48.84 0.670 60.48 0.674 69.19 0.655 69.94 0.566
hash_sumN 50.78 0.645 61.46 0.663 70.09 0.646 70.17 0.564
hash_sdb 44.64 0.733 55.30 0.737 63.62 0.712 57.84 0.685
hash_djb2 42.33 0.773 52.38 0.778 61.94 0.731 56.49 0.701
dataset modula2
pext 31.87 1.000 40.22 1.000 44.70 1.000 38.37 1.000
split 58.01 0.549 62.07 0.648 57.86 0.773 58.33 0.658
hash1 69.44 0.459 74.37 0.541 67.05 0.667 60.74 0.632
hash2 69.66 0.458 78.43 0.513 77.69 0.575 65.48 0.586
hash3 41.98 0.759 56.11 0.717 51.12 0.874 47.33 0.811
hash_sum0 45.71 0.697 57.02 0.705 71.58 0.624 64.89 0.591
hash_sumN 53.05 0.601 63.66 0.632 74.12 0.603 68.24 0.562
hash_sdb 45.22 0.705 55.33 0.727 63.21 0.707 59.78 0.642
hash_djb2 40.88 0.780 51.74 0.777 61.84 0.723 56.80 0.676
dataset modula3
pext 32.15 1.000 40.28 1.000 45.30 1.000 39.02 1.000
split 56.67 0.567 61.46 0.655 63.68 0.711 58.65 0.665
hash1 70.24 0.458 73.43 0.549 70.79 0.640 60.96 0.640
hash2 71.59 0.449 76.39 0.527 69.77 0.649 63.72 0.612
hash3 43.70 0.736 56.68 0.711 55.22 0.820 46.28 0.843
hash_sum0 50.16 0.641 61.09 0.659 68.90 0.657 70.24 0.556
hash_sumN 50.20 0.640 61.30 0.657 69.35 0.653 69.73 0.560
hash_sdb 40.66 0.791 51.76 0.778 61.05 0.742 51.66 0.755
hash_djb2 40.96 0.785 51.04 0.789 59.26 0.764 56.20 0.694
dataset pascal
pext 32.61 1.000 40.32 1.000 44.18 1.000 38.97 1.000
split 56.47 0.578 61.20 0.659 58.03 0.761 58.26 0.669
hash1 66.03 0.494 75.19 0.536 70.14 0.630 65.05 0.599
hash2 57.36 0.569 59.72 0.675 56.49 0.782 46.41 0.840
hash3 39.59 0.824 55.37 0.728 51.04 0.865 48.82 0.798
hash_sum0 42.79 0.762 52.50 0.768 59.01 0.749 54.74 0.712
hash_sumN 49.75 0.656 60.97 0.661 67.10 0.658 66.58 0.585
hash_sdb 45.78 0.712 56.05 0.719 63.83 0.692 56.62 0.688
hash_djb2 41.33 0.789 52.38 0.770 62.20 0.710 55.77 0.699
dataset sneller
pext 36.32 1.000 45.18 1.000 48.15 1.000 41.59 1.000
split 84.98 0.427 91.96 0.491 91.27 0.528 85.14 0.489
hash1 111.37 0.326 109.84 0.411 101.95 0.472 87.03 0.478
hash2 107.71 0.337 108.25 0.417 99.64 0.483 84.69 0.491
hash3 85.25 0.426 91.38 0.494 86.67 0.556 76.52 0.543
hash_sum0 63.46 0.572 79.08 0.571 87.85 0.548 90.40 0.460
hash_sumN 68.17 0.533 81.57 0.554 90.65 0.531 92.30 0.451
hash_sdb 50.37 0.721 64.13 0.705 77.16 0.624 80.36 0.518
hash_djb2 67.26 0.540 80.46 0.562 92.01 0.523 94.85 0.438
dataset snellerbuiltin
pext 39.24 1.000 47.75 1.000 49.16 1.000 43.15 1.000
split 67.34 0.583 72.82 0.656 70.72 0.695 70.04 0.616
hash1 118.59 0.331 116.16 0.411 109.08 0.451 89.63 0.481
hash2 123.17 0.319 120.14 0.397 105.14 0.468 90.62 0.476
hash3 105.75 0.371 109.16 0.437 102.81 0.478 87.78 0.492
hash_sum0 62.24 0.630 74.92 0.637 84.30 0.583 86.59 0.498
hash_sumN 59.27 0.662 70.96 0.673 82.69 0.594 81.20 0.531
hash_sdb 53.43 0.734 68.37 0.698 81.32 0.604 74.38 0.580
hash_djb2 59.08 0.664 72.51 0.658 83.62 0.588 86.05 0.501
dataset states
pext 37.22 1.000 45.84 1.000 47.53 1.000 41.11 1.000
split 66.25 0.562 71.06 0.645 69.24 0.686 64.75 0.635
hash1 88.44 0.421 88.77 0.516 80.99 0.587 67.90 0.605
hash2 95.52 0.390 95.42 0.480 87.17 0.545 73.57 0.559
hash3 66.76 0.557 67.49 0.679 60.97 0.779 52.47 0.784
hash_sum0 48.00 0.775 62.88 0.729 69.39 0.685 64.14 0.641
hash_sumN 51.32 0.725 63.15 0.726 68.72 0.692 64.42 0.638
hash_sdb 45.26 0.822 59.31 0.773 71.25 0.667 60.47 0.680
hash_djb2 48.32 0.770 59.57 0.770 70.71 0.672 64.98 0.633

Summary

  • The PEXT-based hash is the fastest.
  • Fast hashes (hash1, hash2, hash3) indeed do not touch all input bytes, but their quality is poor and, as a result, their performance is not good.
  • Surprisingly, summing all input bytes is comparable to the more complex djb2 or sdbm hashes.

Sources

All sources are available on GitHub: https://github.com/WojciechMula/toys/tree/master/lookup-in-string-set.