AVX-512 conflict detection without resolving conflicts

Author:	Wojciech Muła
Added on:	2023-05-06

Contents

Introduction
Core AVX-512 instructions
Histogram
Experiment results
Source code

Introduction

One of the hardest problem in SIMD is dealing with non-continuous data accesses, that appear pretty common. Data structures based on indices, like graphs or trees, are a good example. CPU vendors introduced instructions GATHER and SCATTER to address these needs. A gather instruction builds a SIMD vector from N values loaded from N addresses. A scatter instruction stores N values from a SIMD vector at N addresses. Both instructions allow repeated indices.

Repeated indices are the real issue if an algorithm uses the SCATTER — that is, it either sets or updates values. Then we need to define how to handle repeated stores. To solve that particular problem AVX-512 introduced a complex instruction called conflict detection. The instruction builds a vector containing masks that mark repeated values in the input vector.

Intel proposed a pattern that uses the gather, scatter and conflict detection instructions to efficiently handle repeated indices. It is described in the freely available "Intel® 64 and IA-32 Architectures Optimization Reference Manual", in chapter "18.16.1 Vectorization with Conflict Detection". The problem of calculating a histogram is used there as an example.

The core of Intel's approach is a conflict resolution loop in which the repeated values are aggregated into a single element. The number of iterations varies, and depends on data: it is 0 to 4, when we process 16-item vectors (32-bit elements).

We propose a modified approach, that avoids any additional looping at the cost of additional storage. It is faster 1.4 times than the Intel algorithm when the input size is larger than 100,000 items.

The text contains a recap of AVX-512 instructions, a detailed overview of the Intel algorithm, the presentation of our procedure, and evaluation results.

All source codes are available.

Core AVX-512 instructions

Before we start, let's see how the crucial AVX-512 instructions are defined. We are focused on instruction variants processing 32-bit items; there are also 64-bit counterparts.

GATHER — vectorized load of N elements from given indices; its scalar implementation for 32-bit items is below.

type vector [16]uint32

func vec_gather(mem []uint32, indices vector) (v vector) {
    for i := range indices {
        v[i] = mem[indices[i]]
    }
    return
}

SCATTER — vectorized store N elements at given indices. It is important that the hardware implementation performs stores in the order of items in the vector. For example if all indices are equal, the last, 15th value will be stored.
```
func vec_scatter(values vector, mem []uint32, indices vector) {
    for i := range indices {
        mem[indices[i]] = values[i]
    }
}
```

CONFLICTDETECTION — build N bitmasks, where each bitmask points out which previous elements are equal to the i-th element.

The following example shows conflict detection for an 8-element vector.

    7        6        5        4        3        2        1      item 0
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│     123│     123│       2│     123│       2│       2│     123│       2│ = input vector
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│01100010│00010010│00001101│00000010│00000101│00000001│00000000│00000000│ = conflicts mask
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
 │        │                 │                          │
 │        │                 │                          the first occurrence of 123
 │        │                 │
 │        │                 └─ value 123 in range [0:3] is present at indices: 1
 │        └─────────────────── value 123 in range [0:5] is present at indices: 1 and 4
 └──────────────────────────── value 123 in range [0:6] is present at indices: 1, 4 and 6

This procedure implements conflict detection for a 16-element vector.

func vec_conflicts(v vector) (res vector) {
    for i := 0; i < 16; i++ {
        mask := uint32(0)
        bit := uint32(1)

        x := v[i]
        for j := 0; j < i; j++ {
            if x == v[j] {
                mask |= bit
            }
            bit <<= 1
        }
        res[i] = mask
    }
    return
}

Histogram

For the sake of completeness, there is a Go implementation of byte histogram procedure. We assume that input[i] < 256.

func histogramReference(input []uint32, output []uint32) {
    for _, idx := range input {
        output[idx] += 1
    }
}

Reference Intel algorithm

In this section we show the reference implementation of Intel solution for histogram computation. The implementation presented in the article is the Go. The assembly translation is available in sources.

func HistogramIntelReference(indices []uint32, output []uint32) {
    mem := output

    zeros := vec_set1(0)
    neg1 := vec_set1(0xffffffff)
    for len(indices) > 0 {
        chunk := vec_load(indices)
        conflicts := vec_conflicts(chunk)

        increment := vec_set1(1)

        anyConflict := vec_not_equals(conflicts, zeros)
        if anyConflict != 0 {
            position := vec_lzcnt(conflicts)
            position = vec_sub(vec_set1(31), position)

            active := anyConflict
            for {
                conflicted := vec_permute_masked(increment, position, active)
                conflicted = vec_mask(conflicted, active)

                increment = vec_add_masked(increment, conflicted, active)

                position = vec_permute_masked(position, position, active)

                active = vec_not_equals(position, neg1)
                if active == 0 {
                    break
                }
            }
        }

        oldValues := vec_gather(mem, chunk)
        newValues := vec_add(oldValues, increment)
        vec_scatter(newValues, mem, chunk)

        indices = indices[16:]
    }
}

The main loop of the algorithm works as follows.

First we load 16 indices. For simplicity we assume that we already have these bytes expanded into 32-bit values.

In our example we have a set of only three values: {1, 22, 33}.

chunk := vec_load(indices)

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
dword 15                                                                                  dword 0

Next, we perform the conflict detection on the loaded indices.

conflicts := vec_conflicts(chunk)

The contents of conflicts is presented in binary.

conflicts[0]  = 00000000_00000000   // 1
conflicts[1]  = 00000000_00000001   // 1
conflicts[2]  = 00000000_00000011   // 1
conflicts[3]  = 00000000_00000000   // 22
conflicts[4]  = 00000000_00001000   // 22
conflicts[5]  = 00000000_00011000   // 22
conflicts[6]  = 00000000_00111000   // 22
conflicts[7]  = 00000000_00000111   // 1
conflicts[8]  = 00000000_10000111   // 1
conflicts[9]  = 00000000_00000000   // 33
conflicts[10] = 00000010_00000000   // 33
conflicts[11] = 00000001_10000111   // 1
conflicts[12] = 00000110_00000000   // 33
conflicts[13] = 00001001_10000111   // 1
conflicts[14] = 00000000_01111000   // 22
conflicts[15] = 01000000_01111000   // 22

We initialize the increment vector to all decimal ones. If there are no conflicts, that is a ready to use increment value.

increment := vec_set1(1)

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│ = icrement
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

When all words of conflict vector are zeros, it means there are no conflicts. We may skip a conflict resolution loop, and increment counters (step 5). But in our example there are repeated indices, thus we are entering the conflict resolution loop.

anyConflict := vec_not_equals(conflicts, zeros)

The prerequisite to handle the repeated values is computing a permutation vector. The permutation vector contains the indices to the previous occurrence of the given element. The index is -1 when the i-th element is the first occurrence of a value.

This calculation is possible with LZCNT (Leading Zeros Count) instruction. The previous index equals to 31 - LZCNT(conflicts[i]); if there is no conflict (conflicts[i] = 0), then the result is -1.

position := vec_lzcnt(conflicts)
position = vec_sub(vec_set1(31), position)

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

The next step is to set active elements. We are processing only the repeated items — those having the previous element (position[i] != -1). Initially, the active mask equals to the conflicts mask.

active := anyConflict

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     1     1     1     1     1     1     0     1     1     1     1     1     0     1     1     0  = active

                                                                             1111_1101_1111_0110  = active

We repeat this step until active mask is not zero.

We use the position vector to pull previous values of indices to the vector conflicted, zeroing non-active elements (lines 1 and 2). Then we update only the active elements of increment (line 3) — in our case it is a single, masked addition. Finally, we pull the previous positions of the position vector itself (line 4), and update active mask accordingly (lines 5 to 8).

/* 1 */ conflicted := vec_permute_masked(increment, position, active)
/* 2 */ conflicted = vec_mask(conflicted, active)

/* 3 */ increment = vec_add_masked(increment, conflicted, active)

/* 4 */ position = vec_permute_masked(position, position, active)

/* 5 */ active = vec_not_equals(position, neg1)
/* 6 */ if active == 0 {
/* 7 */     break
/* 8 */ }

Let's see in details how these vectors change across iterations. Because the increment vector is full of ones, we named elements of this vector as i{index}. In the first iteration we have:

   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    1│    1│    1│    1│    1│    1│    0│    1│    1│    1│    1│    1│    0│    1│    1│    0│ = conflicted
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   i14    i6   i11   i10    i8    i9          i7    i2    i5    i4    i3          i1    i0

After updating the increment vector (line 3) its content is:

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    2│    2│    2│    2│    2│    2│    1│    2│    2│    2│    2│    2│    1│    2│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
  +i14   +i6  +i11  +i10   +i8   +i9         +i7   +i2   +i5   +i4   +i3         +i1   +i0

The permutation of position yields the following vector. Note that now we have twice as much -1 values.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    6│    5│    8│    9│    7│   -1│   -1│    2│    1│    4│    3│   -1│   -1│    0│   -1│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     1     1     1     1     1     0     0     1     1     1     1     0     0     1     0     0  = active
                                                                              1111_1001_1110_0100 = active

In the second iteration we have:

   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    2│    2│    2│    2│    2│    2│    1│    2│    2│    2│    2│    2│    1│    2│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    6│    5│    8│    9│    7│   -1│   -1│    2│    1│    4│    3│   -1│   -1│    0│   -1│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    2│    2│    2│    1│    2│    0│    0│    2│    2│    2│    1│    0│    0│    1│    0│    0│ = conflicted
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    i6    i5    i8    i9    i7                i2    i1    i4    i3                i0

After updating the increment vector (line 3) its content is:

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    4│    4│    4│    3│    4│    2│    1│    4│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
   +i6   +i5   +i8   +i9   +i7               +i2   +i1   +i4   +i3               +i0

The permutation of position yields the following vector.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    4│    3│    2│   -1│    1│   -1│   -1│    0│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     1     1     1     0     1     0     0     1     0     0     0     0     0     0     0     0  = active
                                                                              1110_1001_0000_0000 = active

In the third iteration we have:

   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    4│    4│    4│    3│    4│    2│    1│    4│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    4│    3│    2│   -1│    1│   -1│   -1│    0│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    2│    1│    3│    0│    2│    0│    0│    1│    0│    0│    0│    0│    0│    0│    0│    0│ = conflicted
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    i4    i3    i2          i1                i0

After updating the increment vector (line 3) its content is:

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    6│    5│    7│    3│    6│    2│    1│    5│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
   +i4   +i3   +i2         +i1               +i0

The permutation of position yields the vector full of -1. This means active mask is zero, and we finished the conflict resolution.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘

Before moving to the final step, let's look closer on increment collated with the input vector chunk.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│    6│    5│    7│    3│    6│    2│    1│    5│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     │           │     │
     │           │     └─ the last occurrence of 33: 3 is the count
     │           └─────── the last occurrence of  1: 7 is the count
     └─────────────────── the last occurrence of 22: 6 is the count

The last step is histogram update. First, we gather the counters for the bytes we loaded in the step 1 (line 1). Then, we update all counters with the increment vector calculated earlier (line 2). Finally, we scatter the updated values back (line 3).

The real counts of repeated values are stored at position corresponding the last occurrence of given value. The vector newValues contains garbage for earlier occurrences of a repeated value. It is not a problem, as the scatter instruction guarantees that the value written to the memory will be the correct, last one.
```
/* 1 */ oldValues := vec_gather(mem, chunk)
/* 2 */ newValues := vec_add(oldValues, increment)
/* 3 */ vec_scatter(newValues, mem, chunk)
```
Note: due to high latency of the gather instructions, in the assembly code the gather instruction is issued before the conflict resolution loop. Here we put it closer just to make the algorithm more readable.

Downsides of Intel approach

The core algorithm has to be repeated twice -- in the conflict resolution loop and in the main loop. The histogram algorithm requires just a simple addition, thus that is not that visible. But when we need more operations, it becomes more significant.
Likewise, when the algorithm needs more than one variable, then all variables have to be permuted in the same way we permute the increment vector.
Unpredictable number of iterations.
Underuse of available resources: in each iteration we process fewer and fewer items. The following table shows how many active elements were processed in the conflict resolution loop for various inputs.

input characteristics	all elements	active elements	active [%]
constant value	40960	31360	76.56
random values from 0 to 3	31088	14754	47.46
random values from 0 to 14	19280	4888	25.35
random values from 0 to 63	9776	1155	11.81
random bytes	4000	294	7.35

Alternative algorithm

The solution we propose completely rids of the conflict resolution loop. The key observation is that in a single iteration of the main loop we have up to 16 conflicts. Each input item might have assigned the unique repetition id (from 0 to 15) within the subset of the same values.

By extending the algorithm state by the factor of 16, we can update without conflicts all values at once. While the original algorithm updates counters at indices chunk[i], we update a sub-counters at chunk[i] * 16 + subbucket, where subbucket is a number from 0 to 15.

After the main loop completes, we finalize computations by merging these sixteen sub-states into one.

Let's look how this approach would assign subbucket numbers in our example input vector.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     │     │     |     │     │     │     │     │     │     │     │     │     │     │     │     │
     │     │     6     │     5     │     │     4     3     │     │     │     │     2     1     0
     5     4           │           │     │                 3     2     1     0
                       2           1     0

The following Go code shows how the histogram problem can be solved with this new approach.

func HistogramV2Reference(indices []uint32, output []uint32) {
    mem := output

    for len(indices) > 0 {
        chunk := vec_load(indices)
        conflicts := vec_conflicts(chunk)

        subbuckets := vec_lzcnt(conflicts)
        subbuckets = vec_sub(vec_set1(32), subbuckets)

        offsets := vec_sll(chunk, 4)
        offsets = vec_add(offsets, subbuckets) // 16 * index + subbucket

        oldValues := vec_gather(mem, offsets)
        newValues := vec_add(oldValues, vec_set1(1))
        vec_scatter(newValues, mem, offsets)

        indices = indices[16:]
    }

    // finalize
    for i := 0; i < bins; i++ {
        pos := i * 16
        sum := uint32(0)
        for j := 0; j < 16; j++ {
            sum += output[pos+j]
        }

        output[i] = sum
    }
}

Experiment results

The assembler implementations of the Intel and new approach were tested using the standard Go benchmarking tool. Various inputs were tested:

constant value,
random values from 0 to 3,
random values from 0 to 14,
random values from 0 to 64,
random bytes,
all of above sorted.

CPU: Xeon W-2104 Processor

size [32-bit words]	Intel algorithm [MB/s]	Alternative algorithm [MB/s]	Speedup
constant value
1,024	860.61	227.23	0.26
5,120	869.37	656.92	0.76
10,240	870.25	861.90	0.99
102,400	867.46	1182.99	1.36
512,000	850.11	1221.05	1.44
1,048,576	849.96	1226.56	1.44
2,097,152	838.58	1206.70	1.44
5,242,880	786.06	1118.72	1.42
10,485,760	777.66	1117.22	1.44
sorted values from 0 to 3
1,024	860.67	227.21	0.26
5,120	869.26	657.08	0.76
10,240	870.47	861.83	0.99
102,400	866.41	1177.93	1.36
512,000	849.94	1220.74	1.44
1,048,576	849.05	1226.69	1.44
2,097,152	835.50	1200.06	1.44
5,242,880	784.94	1115.10	1.42
10,485,760	779.75	1114.16	1.43
sorted values from 0 to 14
1,024	860.75	227.22	0.26
5,120	869.33	657.22	0.76
10,240	862.62	849.14	0.98
102,400	868.60	1190.50	1.37
512,000	848.46	1219.15	1.44
1,048,576	850.39	1228.37	1.44
2,097,152	834.96	1198.63	1.44
5,242,880	782.96	1114.30	1.42
10,485,760	778.32	1112.89	1.43
sorted values from 0 to 63
1,024	860.72	227.22	0.26
5,120	869.31	641.37	0.74
10,240	870.51	861.93	0.99
102,400	863.03	1175.73	1.36
512,000	851.23	1224.23	1.44
1,048,576	851.38	1226.58	1.44
2,097,152	832.73	1194.05	1.43
5,242,880	782.41	1119.39	1.43
10,485,760	779.38	1112.67	1.43
sorted bytes
1,024	860.77	227.22	0.26
5,120	869.25	657.24	0.76
10,240	870.45	861.94	0.99
102,400	863.78	1180.33	1.37
512,000	849.81	1223.83	1.44
1,048,576	849.93	1225.23	1.44
2,097,152	831.93	1194.59	1.44
5,242,880	785.46	1116.25	1.42
10,485,760	778.15	1110.47	1.43
constant value
1,024	860.73	227.22	0.26
5,120	869.30	657.21	0.76
10,240	859.60	849.05	0.99
102,400	862.47	1180.54	1.37
512,000	849.13	1223.81	1.44
1,048,576	850.00	1227.04	1.44
2,097,152	835.06	1199.79	1.44
5,242,880	785.40	1117.14	1.42
10,485,760	779.17	1114.66	1.43
random values from 0 to 3
1,024	860.73	227.21	0.26
5,120	844.43	657.25	0.78
10,240	870.48	861.93	0.99
102,400	863.60	1180.64	1.37
512,000	850.54	1221.42	1.44
1,048,576	850.45	1227.06	1.44
2,097,152	834.29	1199.90	1.44
5,242,880	785.20	1117.01	1.42
10,485,760	777.80	1112.16	1.43
random values from 0 to 14
1,024	860.67	227.23	0.26
5,120	869.34	657.10	0.76
10,240	870.50	861.89	0.99
102,400	864.86	1182.81	1.37
512,000	846.93	1220.55	1.44
1,048,576	848.47	1227.36	1.45
2,097,152	834.09	1199.69	1.44
5,242,880	784.80	1118.62	1.43
10,485,760	779.74	1113.79	1.43
random values from 0 to 63
1,024	860.77	227.24	0.26
5,120	869.31	656.93	0.76
10,240	860.87	849.00	0.99
102,400	867.53	1182.82	1.36
512,000	849.01	1220.70	1.44
1,048,576	849.53	1226.27	1.44
2,097,152	832.92	1198.14	1.44
5,242,880	785.25	1116.44	1.42
10,485,760	779.28	1112.75	1.43
random bytes
1,024	860.73	227.21	0.26
5,120	869.36	641.30	0.74
10,240	860.29	861.87	1.00
102,400	868.43	1190.52	1.37
512,000	846.36	1223.07	1.45
1,048,576	849.26	1228.86	1.45
2,097,152	834.57	1198.15	1.44
5,242,880	786.23	1120.11	1.42
10,485,760	779.94	1116.79	1.43

Source code

All sources are available on Github: https://github.com/WojciechMula/toys/tree/master/avx512-conflict-detection