AVX-512 conflict detection without resolving conflicts

Author: Wojciech Muła
Added on:2023-05-06

Contents

Introduction

One of the hardest problem in SIMD is dealing with non-continuous data accesses, that appear pretty common. Data structures based on indices, like graphs or trees, are a good example. CPU vendors introduced instructions GATHER and SCATTER to address these needs. A gather instruction builds a SIMD vector from N values loaded from N addresses. A scatter instruction stores N values from a SIMD vector at N addresses. Both instructions allow repeated indices.

Repeated indices are the real issue if an algorithm uses the SCATTER — that is, it either sets or updates values. Then we need to define how to handle repeated stores. To solve that particular problem AVX-512 introduced a complex instruction called conflict detection. The instruction builds a vector containing masks that mark repeated values in the input vector.

Intel proposed a pattern that uses the gather, scatter and conflict detection instructions to efficiently handle repeated indices. It is described in the freely available "Intel® 64 and IA-32 Architectures Optimization Reference Manual", in chapter "18.16.1 Vectorization with Conflict Detection". The problem of calculating a histogram is used there as an example.

The core of Intel's approach is a conflict resolution loop in which the repeated values are aggregated into a single element. The number of iterations varies, and depends on data: it is 0 to 4, when we process 16-item vectors (32-bit elements).

We propose a modified approach, that avoids any additional looping at the cost of additional storage. It is faster 1.4 times than the Intel algorithm when the input size is larger than 100,000 items.

The text contains a recap of AVX-512 instructions, a detailed overview of the Intel algorithm, the presentation of our procedure, and evaluation results.

All source codes are available.

Core AVX-512 instructions

Before we start, let's see how the crucial AVX-512 instructions are defined. We are focused on instruction variants processing 32-bit items; there are also 64-bit counterparts.

  1. GATHER — vectorized load of N elements from given indices; its scalar implementation for 32-bit items is below.

    type vector [16]uint32
    
    func vec_gather(mem []uint32, indices vector) (v vector) {
        for i := range indices {
            v[i] = mem[indices[i]]
        }
        return
    }
    
  2. SCATTER — vectorized store N elements at given indices. It is important that the hardware implementation performs stores in the order of items in the vector. For example if all indices are equal, the last, 15th value will be stored.

    func vec_scatter(values vector, mem []uint32, indices vector) {
        for i := range indices {
            mem[indices[i]] = values[i]
        }
    }
    
  3. CONFLICTDETECTION — build N bitmasks, where each bitmask points out which previous elements are equal to the i-th element.

    The following example shows conflict detection for an 8-element vector.

        7        6        5        4        3        2        1      item 0
    ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
    │     123│     123│       2│     123│       2│       2│     123│       2│ = input vector
    └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
    ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
    │01100010│00010010│00001101│00000010│00000101│00000001│00000000│00000000│ = conflicts mask
    └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
     │        │                 │                          │
     │        │                 │                          the first occurrence of 123
     │        │                 │
     │        │                 └─ value 123 in range [0:3] is present at indices: 1
     │        └─────────────────── value 123 in range [0:5] is present at indices: 1 and 4
     └──────────────────────────── value 123 in range [0:6] is present at indices: 1, 4 and 6
    

    This procedure implements conflict detection for a 16-element vector.

    func vec_conflicts(v vector) (res vector) {
        for i := 0; i < 16; i++ {
            mask := uint32(0)
            bit := uint32(1)
    
            x := v[i]
            for j := 0; j < i; j++ {
                if x == v[j] {
                    mask |= bit
                }
                bit <<= 1
            }
            res[i] = mask
        }
        return
    }
    

Histogram

For the sake of completeness, there is a Go implementation of byte histogram procedure. We assume that input[i] < 256.

func histogramReference(input []uint32, output []uint32) {
    for _, idx := range input {
        output[idx] += 1
    }
}

Reference Intel algorithm

In this section we show the reference implementation of Intel solution for histogram computation. The implementation presented in the article is the Go. The assembly translation is available in sources.

func HistogramIntelReference(indices []uint32, output []uint32) {
    mem := output

    zeros := vec_set1(0)
    neg1 := vec_set1(0xffffffff)
    for len(indices) > 0 {
        chunk := vec_load(indices)
        conflicts := vec_conflicts(chunk)

        increment := vec_set1(1)

        anyConflict := vec_not_equals(conflicts, zeros)
        if anyConflict != 0 {
            position := vec_lzcnt(conflicts)
            position = vec_sub(vec_set1(31), position)

            active := anyConflict
            for {
                conflicted := vec_permute_masked(increment, position, active)
                conflicted = vec_mask(conflicted, active)

                increment = vec_add_masked(increment, conflicted, active)

                position = vec_permute_masked(position, position, active)

                active = vec_not_equals(position, neg1)
                if active == 0 {
                    break
                }
            }
        }

        oldValues := vec_gather(mem, chunk)
        newValues := vec_add(oldValues, increment)
        vec_scatter(newValues, mem, chunk)

        indices = indices[16:]
    }
}

The main loop of the algorithm works as follows.

  1. First we load 16 indices. For simplicity we assume that we already have these bytes expanded into 32-bit values.

    In our example we have a set of only three values: {1, 22, 33}.

    chunk := vec_load(indices)
    
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    dword 15                                                                                  dword 0
    
  2. Next, we perform the conflict detection on the loaded indices.

    conflicts := vec_conflicts(chunk)
    

    The contents of conflicts is presented in binary.

    conflicts[0]  = 00000000_00000000   // 1
    conflicts[1]  = 00000000_00000001   // 1
    conflicts[2]  = 00000000_00000011   // 1
    conflicts[3]  = 00000000_00000000   // 22
    conflicts[4]  = 00000000_00001000   // 22
    conflicts[5]  = 00000000_00011000   // 22
    conflicts[6]  = 00000000_00111000   // 22
    conflicts[7]  = 00000000_00000111   // 1
    conflicts[8]  = 00000000_10000111   // 1
    conflicts[9]  = 00000000_00000000   // 33
    conflicts[10] = 00000010_00000000   // 33
    conflicts[11] = 00000001_10000111   // 1
    conflicts[12] = 00000110_00000000   // 33
    conflicts[13] = 00001001_10000111   // 1
    conflicts[14] = 00000000_01111000   // 22
    conflicts[15] = 01000000_01111000   // 22
    
  3. We initialize the increment vector to all decimal ones. If there are no conflicts, that is a ready to use increment value.

    increment := vec_set1(1)
    
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│ = icrement
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    
  1. When all words of conflict vector are zeros, it means there are no conflicts. We may skip a conflict resolution loop, and increment counters (step 5). But in our example there are repeated indices, thus we are entering the conflict resolution loop.

    anyConflict := vec_not_equals(conflicts, zeros)
    
    1. The prerequisite to handle the repeated values is computing a permutation vector. The permutation vector contains the indices to the previous occurrence of the given element. The index is -1 when the i-th element is the first occurrence of a value.

      This calculation is possible with LZCNT (Leading Zeros Count) instruction. The previous index equals to 31 - LZCNT(conflicts[i]); if there is no conflict (conflicts[i] = 0), then the result is -1.

      position := vec_lzcnt(conflicts)
      position = vec_sub(vec_set1(31), position)
      
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
      
    2. The next step is to set active elements. We are processing only the repeated items — those having the previous element (position[i] != -1). Initially, the active mask equals to the conflicts mask.

      active := anyConflict
      
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
           1     1     1     1     1     1     0     1     1     1     1     1     0     1     1     0  = active
      
                                                                                   1111_1101_1111_0110  = active
      
    3. We repeat this step until active mask is not zero.

      We use the position vector to pull previous values of indices to the vector conflicted, zeroing non-active elements (lines 1 and 2). Then we update only the active elements of increment (line 3) — in our case it is a single, masked addition. Finally, we pull the previous positions of the position vector itself (line 4), and update active mask accordingly (lines 5 to 8).

      /* 1 */ conflicted := vec_permute_masked(increment, position, active)
      /* 2 */ conflicted = vec_mask(conflicted, active)
      
      /* 3 */ increment = vec_add_masked(increment, conflicted, active)
      
      /* 4 */ position = vec_permute_masked(position, position, active)
      
      /* 5 */ active = vec_not_equals(position, neg1)
      /* 6 */ if active == 0 {
      /* 7 */     break
      /* 8 */ }
      

      Let's see in details how these vectors change across iterations. Because the increment vector is full of ones, we named elements of this vector as i{index}. In the first iteration we have:

         i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│    1│ = increment
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │   14│    6│   11│   10│    8│    9│   -1│    7│    2│    5│    4│    3│   -1│    1│    0│   -1│ = position
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
      ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
      │    1│    1│    1│    1│    1│    1│    0│    1│    1│    1│    1│    1│    0│    1│    1│    0│ = conflicted
      └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
         i14    i6   i11   i10    i8    i9          i7    i2    i5    i4    i3          i1    i0
      

    After updating the increment vector (line 3) its content is:

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    2│    2│    2│    2│    2│    2│    1│    2│    2│    2│    2│    2│    1│    2│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
       i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
      +i14   +i6  +i11  +i10   +i8   +i9         +i7   +i2   +i5   +i4   +i3         +i1   +i0
    

    The permutation of position yields the following vector. Note that now we have twice as much -1 values.

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    6│    5│    8│    9│    7│   -1│   -1│    2│    1│    4│    3│   -1│   -1│    0│   -1│   -1│ = position
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
         1     1     1     1     1     0     0     1     1     1     1     0     0     1     0     0  = active
                                                                                  1111_1001_1110_0100 = active
    

    In the second iteration we have:

       i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    2│    2│    2│    2│    2│    2│    1│    2│    2│    2│    2│    2│    1│    2│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    6│    5│    8│    9│    7│   -1│   -1│    2│    1│    4│    3│   -1│   -1│    0│   -1│   -1│ = position
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    2│    2│    2│    1│    2│    0│    0│    2│    2│    2│    1│    0│    0│    1│    0│    0│ = conflicted
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
        i6    i5    i8    i9    i7                i2    i1    i4    i3                i0
    

    After updating the increment vector (line 3) its content is:

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    4│    4│    4│    3│    4│    2│    1│    4│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
       i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
       +i6   +i5   +i8   +i9   +i7               +i2   +i1   +i4   +i3               +i0
    

    The permutation of position yields the following vector.

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    4│    3│    2│   -1│    1│   -1│   -1│    0│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
         1     1     1     0     1     0     0     1     0     0     0     0     0     0     0     0  = active
                                                                                  1110_1001_0000_0000 = active
    

    In the third iteration we have:

       i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    4│    4│    4│    3│    4│    2│    1│    4│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    4│    3│    2│   -1│    1│   -1│   -1│    0│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    2│    1│    3│    0│    2│    0│    0│    1│    0│    0│    0│    0│    0│    0│    0│    0│ = conflicted
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
        i4    i3    i2          i1                i0
    

    After updating the increment vector (line 3) its content is:

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    6│    5│    7│    3│    6│    2│    1│    5│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
       i15   i14   i13   i12   i11   i10    i9    i8    i7    i6    i5    i4    i3    i2    i1    i0
       +i4   +i3   +i2         +i1               +i0
    

    The permutation of position yields the vector full of -1. This means active mask is zero, and we finished the conflict resolution.

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│   -1│ = position
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    

    Before moving to the final step, let's look closer on increment collated with the input vector chunk.

    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
    ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
    │    6│    5│    7│    3│    6│    2│    1│    5│    4│    4│    3│    2│    1│    3│    2│    1│ = increment
    └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
         │           │     │
         │           │     └─ the last occurrence of 33: 3 is the count
         │           └─────── the last occurrence of  1: 7 is the count
         └─────────────────── the last occurrence of 22: 6 is the count
    
  1. The last step is histogram update. First, we gather the counters for the bytes we loaded in the step 1 (line 1). Then, we update all counters with the increment vector calculated earlier (line 2). Finally, we scatter the updated values back (line 3).

    The real counts of repeated values are stored at position corresponding the last occurrence of given value. The vector newValues contains garbage for earlier occurrences of a repeated value. It is not a problem, as the scatter instruction guarantees that the value written to the memory will be the correct, last one.

    /* 1 */ oldValues := vec_gather(mem, chunk)
    /* 2 */ newValues := vec_add(oldValues, increment)
    /* 3 */ vec_scatter(newValues, mem, chunk)
    

    Note: due to high latency of the gather instructions, in the assembly code the gather instruction is issued before the conflict resolution loop. Here we put it closer just to make the algorithm more readable.

Downsides of Intel approach

  • The core algorithm has to be repeated twice -- in the conflict resolution loop and in the main loop. The histogram algorithm requires just a simple addition, thus that is not that visible. But when we need more operations, it becomes more significant.
  • Likewise, when the algorithm needs more than one variable, then all variables have to be permuted in the same way we permute the increment vector.
  • Unpredictable number of iterations.
  • Underuse of available resources: in each iteration we process fewer and fewer items. The following table shows how many active elements were processed in the conflict resolution loop for various inputs.
input characteristics all elements active elements active [%]
constant value 40960 31360 76.56
random values from 0 to 3 31088 14754 47.46
random values from 0 to 14 19280 4888 25.35
random values from 0 to 63 9776 1155 11.81
random bytes 4000 294 7.35

Alternative algorithm

The solution we propose completely rids of the conflict resolution loop. The key observation is that in a single iteration of the main loop we have up to 16 conflicts. Each input item might have assigned the unique repetition id (from 0 to 15) within the subset of the same values.

By extending the algorithm state by the factor of 16, we can update without conflicts all values at once. While the original algorithm updates counters at indices chunk[i], we update a sub-counters at chunk[i] * 16 + subbucket, where subbucket is a number from 0 to 15.

After the main loop completes, we finalize computations by merging these sixteen sub-states into one.

Let's look how this approach would assign subbucket numbers in our example input vector.

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│   22│   22│    1│   33│    1│   33│   33│    1│    1│   22│   22│   22│   22│    1│    1│    1│ = chunk
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
     │     │     |     │     │     │     │     │     │     │     │     │     │     │     │     │
     │     │     6     │     5     │     │     4     3     │     │     │     │     2     1     0
     5     4           │           │     │                 3     2     1     0
                       2           1     0

The following Go code shows how the histogram problem can be solved with this new approach.

func HistogramV2Reference(indices []uint32, output []uint32) {
    mem := output

    for len(indices) > 0 {
        chunk := vec_load(indices)
        conflicts := vec_conflicts(chunk)

        subbuckets := vec_lzcnt(conflicts)
        subbuckets = vec_sub(vec_set1(32), subbuckets)

        offsets := vec_sll(chunk, 4)
        offsets = vec_add(offsets, subbuckets) // 16 * index + subbucket

        oldValues := vec_gather(mem, offsets)
        newValues := vec_add(oldValues, vec_set1(1))
        vec_scatter(newValues, mem, offsets)

        indices = indices[16:]
    }

    // finalize
    for i := 0; i < bins; i++ {
        pos := i * 16
        sum := uint32(0)
        for j := 0; j < 16; j++ {
            sum += output[pos+j]
        }

        output[i] = sum
    }
}

Experiment results

The assembler implementations of the Intel and new approach were tested using the standard Go benchmarking tool. Various inputs were tested:

CPU: Xeon W-2104 Processor

size [32-bit words] Intel algorithm [MB/s] Alternative algorithm [MB/s] Speedup
constant value
1,024 860.61 227.23 0.26
5,120 869.37 656.92 0.76
10,240 870.25 861.90 0.99
102,400 867.46 1182.99 1.36
512,000 850.11 1221.05 1.44
1,048,576 849.96 1226.56 1.44
2,097,152 838.58 1206.70 1.44
5,242,880 786.06 1118.72 1.42
10,485,760 777.66 1117.22 1.44
sorted values from 0 to 3
1,024 860.67 227.21 0.26
5,120 869.26 657.08 0.76
10,240 870.47 861.83 0.99
102,400 866.41 1177.93 1.36
512,000 849.94 1220.74 1.44
1,048,576 849.05 1226.69 1.44
2,097,152 835.50 1200.06 1.44
5,242,880 784.94 1115.10 1.42
10,485,760 779.75 1114.16 1.43
sorted values from 0 to 14
1,024 860.75 227.22 0.26
5,120 869.33 657.22 0.76
10,240 862.62 849.14 0.98
102,400 868.60 1190.50 1.37
512,000 848.46 1219.15 1.44
1,048,576 850.39 1228.37 1.44
2,097,152 834.96 1198.63 1.44
5,242,880 782.96 1114.30 1.42
10,485,760 778.32 1112.89 1.43
sorted values from 0 to 63
1,024 860.72 227.22 0.26
5,120 869.31 641.37 0.74
10,240 870.51 861.93 0.99
102,400 863.03 1175.73 1.36
512,000 851.23 1224.23 1.44
1,048,576 851.38 1226.58 1.44
2,097,152 832.73 1194.05 1.43
5,242,880 782.41 1119.39 1.43
10,485,760 779.38 1112.67 1.43
sorted bytes
1,024 860.77 227.22 0.26
5,120 869.25 657.24 0.76
10,240 870.45 861.94 0.99
102,400 863.78 1180.33 1.37
512,000 849.81 1223.83 1.44
1,048,576 849.93 1225.23 1.44
2,097,152 831.93 1194.59 1.44
5,242,880 785.46 1116.25 1.42
10,485,760 778.15 1110.47 1.43
constant value
1,024 860.73 227.22 0.26
5,120 869.30 657.21 0.76
10,240 859.60 849.05 0.99
102,400 862.47 1180.54 1.37
512,000 849.13 1223.81 1.44
1,048,576 850.00 1227.04 1.44
2,097,152 835.06 1199.79 1.44
5,242,880 785.40 1117.14 1.42
10,485,760 779.17 1114.66 1.43
random values from 0 to 3
1,024 860.73 227.21 0.26
5,120 844.43 657.25 0.78
10,240 870.48 861.93 0.99
102,400 863.60 1180.64 1.37
512,000 850.54 1221.42 1.44
1,048,576 850.45 1227.06 1.44
2,097,152 834.29 1199.90 1.44
5,242,880 785.20 1117.01 1.42
10,485,760 777.80 1112.16 1.43
random values from 0 to 14
1,024 860.67 227.23 0.26
5,120 869.34 657.10 0.76
10,240 870.50 861.89 0.99
102,400 864.86 1182.81 1.37
512,000 846.93 1220.55 1.44
1,048,576 848.47 1227.36 1.45
2,097,152 834.09 1199.69 1.44
5,242,880 784.80 1118.62 1.43
10,485,760 779.74 1113.79 1.43
random values from 0 to 63
1,024 860.77 227.24 0.26
5,120 869.31 656.93 0.76
10,240 860.87 849.00 0.99
102,400 867.53 1182.82 1.36
512,000 849.01 1220.70 1.44
1,048,576 849.53 1226.27 1.44
2,097,152 832.92 1198.14 1.44
5,242,880 785.25 1116.44 1.42
10,485,760 779.28 1112.75 1.43
random bytes
1,024 860.73 227.21 0.26
5,120 869.36 641.30 0.74
10,240 860.29 861.87 1.00
102,400 868.43 1190.52 1.37
512,000 846.36 1223.07 1.45
1,048,576 849.26 1228.86 1.45
2,097,152 834.57 1198.15 1.44
5,242,880 786.23 1120.11 1.42
10,485,760 779.94 1116.79 1.43

Source code

All sources are available on Github: https://github.com/WojciechMula/toys/tree/master/avx512-conflict-detection