Author: | Wojciech Muła |
---|---|
Added on: | 2023-05-06 |
Contents
One of the hardest problem in SIMD is dealing with non-continuous data accesses, that appear pretty common. Data structures based on indices, like graphs or trees, are a good example. CPU vendors introduced instructions GATHER and SCATTER to address these needs. A gather instruction builds a SIMD vector from N values loaded from N addresses. A scatter instruction stores N values from a SIMD vector at N addresses. Both instructions allow repeated indices.
Repeated indices are the real issue if an algorithm uses the SCATTER — that is, it either sets or updates values. Then we need to define how to handle repeated stores. To solve that particular problem AVX-512 introduced a complex instruction called conflict detection. The instruction builds a vector containing masks that mark repeated values in the input vector.
Intel proposed a pattern that uses the gather, scatter and conflict detection instructions to efficiently handle repeated indices. It is described in the freely available "Intel® 64 and IA-32 Architectures Optimization Reference Manual", in chapter "18.16.1 Vectorization with Conflict Detection". The problem of calculating a histogram is used there as an example.
The core of Intel's approach is a conflict resolution loop in which the repeated values are aggregated into a single element. The number of iterations varies, and depends on data: it is 0 to 4, when we process 16-item vectors (32-bit elements).
We propose a modified approach, that avoids any additional looping at the cost of additional storage. It is faster 1.4 times than the Intel algorithm when the input size is larger than 100,000 items.
The text contains a recap of AVX-512 instructions, a detailed overview of the Intel algorithm, the presentation of our procedure, and evaluation results.
All source codes are available.
Before we start, let's see how the crucial AVX-512 instructions are defined. We are focused on instruction variants processing 32-bit items; there are also 64-bit counterparts.
GATHER — vectorized load of N elements from given indices; its scalar implementation for 32-bit items is below.
type vector [16]uint32 func vec_gather(mem []uint32, indices vector) (v vector) { for i := range indices { v[i] = mem[indices[i]] } return }
SCATTER — vectorized store N elements at given indices. It is important that the hardware implementation performs stores in the order of items in the vector. For example if all indices are equal, the last, 15th value will be stored.
func vec_scatter(values vector, mem []uint32, indices vector) { for i := range indices { mem[indices[i]] = values[i] } }
CONFLICTDETECTION — build N bitmasks, where each bitmask points out which previous elements are equal to the i-th element.
The following example shows conflict detection for an 8-element vector.
7 6 5 4 3 2 1 item 0 ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐ │ 123│ 123│ 2│ 123│ 2│ 2│ 123│ 2│ = input vector └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘ ┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐ │01100010│00010010│00001101│00000010│00000101│00000001│00000000│00000000│ = conflicts mask └────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘ │ │ │ │ │ │ │ the first occurrence of 123 │ │ │ │ │ └─ value 123 in range [0:3] is present at indices: 1 │ └─────────────────── value 123 in range [0:5] is present at indices: 1 and 4 └──────────────────────────── value 123 in range [0:6] is present at indices: 1, 4 and 6
This procedure implements conflict detection for a 16-element vector.
func vec_conflicts(v vector) (res vector) { for i := 0; i < 16; i++ { mask := uint32(0) bit := uint32(1) x := v[i] for j := 0; j < i; j++ { if x == v[j] { mask |= bit } bit <<= 1 } res[i] = mask } return }
For the sake of completeness, there is a Go implementation of byte histogram procedure. We assume that input[i] < 256.
func histogramReference(input []uint32, output []uint32) { for _, idx := range input { output[idx] += 1 } }
In this section we show the reference implementation of Intel solution for histogram computation. The implementation presented in the article is the Go. The assembly translation is available in sources.
func HistogramIntelReference(indices []uint32, output []uint32) { mem := output zeros := vec_set1(0) neg1 := vec_set1(0xffffffff) for len(indices) > 0 { chunk := vec_load(indices) conflicts := vec_conflicts(chunk) increment := vec_set1(1) anyConflict := vec_not_equals(conflicts, zeros) if anyConflict != 0 { position := vec_lzcnt(conflicts) position = vec_sub(vec_set1(31), position) active := anyConflict for { conflicted := vec_permute_masked(increment, position, active) conflicted = vec_mask(conflicted, active) increment = vec_add_masked(increment, conflicted, active) position = vec_permute_masked(position, position, active) active = vec_not_equals(position, neg1) if active == 0 { break } } } oldValues := vec_gather(mem, chunk) newValues := vec_add(oldValues, increment) vec_scatter(newValues, mem, chunk) indices = indices[16:] } }
The main loop of the algorithm works as follows.
First we load 16 indices. For simplicity we assume that we already have these bytes expanded into 32-bit values.
In our example we have a set of only three values: {1, 22, 33}.
chunk := vec_load(indices)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 22│ 22│ 1│ 33│ 1│ 33│ 33│ 1│ 1│ 22│ 22│ 22│ 22│ 1│ 1│ 1│ = chunk └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ dword 15 dword 0
Next, we perform the conflict detection on the loaded indices.
conflicts := vec_conflicts(chunk)
The contents of conflicts is presented in binary.
conflicts[0] = 00000000_00000000 // 1 conflicts[1] = 00000000_00000001 // 1 conflicts[2] = 00000000_00000011 // 1 conflicts[3] = 00000000_00000000 // 22 conflicts[4] = 00000000_00001000 // 22 conflicts[5] = 00000000_00011000 // 22 conflicts[6] = 00000000_00111000 // 22 conflicts[7] = 00000000_00000111 // 1 conflicts[8] = 00000000_10000111 // 1 conflicts[9] = 00000000_00000000 // 33 conflicts[10] = 00000010_00000000 // 33 conflicts[11] = 00000001_10000111 // 1 conflicts[12] = 00000110_00000000 // 33 conflicts[13] = 00001001_10000111 // 1 conflicts[14] = 00000000_01111000 // 22 conflicts[15] = 01000000_01111000 // 22
We initialize the increment vector to all decimal ones. If there are no conflicts, that is a ready to use increment value.
increment := vec_set1(1)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ = icrement └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
When all words of conflict vector are zeros, it means there are no conflicts. We may skip a conflict resolution loop, and increment counters (step 5). But in our example there are repeated indices, thus we are entering the conflict resolution loop.
anyConflict := vec_not_equals(conflicts, zeros)
The prerequisite to handle the repeated values is computing a permutation vector. The permutation vector contains the indices to the previous occurrence of the given element. The index is -1 when the i-th element is the first occurrence of a value.
This calculation is possible with LZCNT (Leading Zeros Count) instruction. The previous index equals to 31 - LZCNT(conflicts[i]); if there is no conflict (conflicts[i] = 0), then the result is -1.
position := vec_lzcnt(conflicts) position = vec_sub(vec_set1(31), position)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 14│ 6│ 11│ 10│ 8│ 9│ -1│ 7│ 2│ 5│ 4│ 3│ -1│ 1│ 0│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
The next step is to set active elements. We are processing only the repeated items — those having the previous element (position[i] != -1). Initially, the active mask equals to the conflicts mask.
active := anyConflict
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 14│ 6│ 11│ 10│ 8│ 9│ -1│ 7│ 2│ 5│ 4│ 3│ -1│ 1│ 0│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 = active 1111_1101_1111_0110 = active
We repeat this step until active mask is not zero.
We use the position vector to pull previous values of indices to the vector conflicted, zeroing non-active elements (lines 1 and 2). Then we update only the active elements of increment (line 3) — in our case it is a single, masked addition. Finally, we pull the previous positions of the position vector itself (line 4), and update active mask accordingly (lines 5 to 8).
/* 1 */ conflicted := vec_permute_masked(increment, position, active) /* 2 */ conflicted = vec_mask(conflicted, active) /* 3 */ increment = vec_add_masked(increment, conflicted, active) /* 4 */ position = vec_permute_masked(position, position, active) /* 5 */ active = vec_not_equals(position, neg1) /* 6 */ if active == 0 { /* 7 */ break /* 8 */ }
Let's see in details how these vectors change across iterations. Because the increment vector is full of ones, we named elements of this vector as i{index}. In the first iteration we have:
i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 14│ 6│ 11│ 10│ 8│ 9│ -1│ 7│ 2│ 5│ 4│ 3│ -1│ 1│ 0│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 1│ 1│ 1│ 1│ 1│ 1│ 0│ 1│ 1│ 1│ 1│ 1│ 0│ 1│ 1│ 0│ = conflicted └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i14 i6 i11 i10 i8 i9 i7 i2 i5 i4 i3 i1 i0
After updating the increment vector (line 3) its content is:
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 2│ 2│ 2│ 2│ 2│ 2│ 1│ 2│ 2│ 2│ 2│ 2│ 1│ 2│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 +i14 +i6 +i11 +i10 +i8 +i9 +i7 +i2 +i5 +i4 +i3 +i1 +i0The permutation of position yields the following vector. Note that now we have twice as much -1 values.
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 6│ 5│ 8│ 9│ 7│ -1│ -1│ 2│ 1│ 4│ 3│ -1│ -1│ 0│ -1│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ 1 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 = active 1111_1001_1110_0100 = activeIn the second iteration we have:
i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 2│ 2│ 2│ 2│ 2│ 2│ 1│ 2│ 2│ 2│ 2│ 2│ 1│ 2│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 6│ 5│ 8│ 9│ 7│ -1│ -1│ 2│ 1│ 4│ 3│ -1│ -1│ 0│ -1│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 2│ 2│ 2│ 1│ 2│ 0│ 0│ 2│ 2│ 2│ 1│ 0│ 0│ 1│ 0│ 0│ = conflicted └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i6 i5 i8 i9 i7 i2 i1 i4 i3 i0After updating the increment vector (line 3) its content is:
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 4│ 4│ 4│ 3│ 4│ 2│ 1│ 4│ 4│ 4│ 3│ 2│ 1│ 3│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 +i6 +i5 +i8 +i9 +i7 +i2 +i1 +i4 +i3 +i0The permutation of position yields the following vector.
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 4│ 3│ 2│ -1│ 1│ -1│ -1│ 0│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 = active 1110_1001_0000_0000 = activeIn the third iteration we have:
i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 4│ 4│ 4│ 3│ 4│ 2│ 1│ 4│ 4│ 4│ 3│ 2│ 1│ 3│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 4│ 3│ 2│ -1│ 1│ -1│ -1│ 0│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 2│ 1│ 3│ 0│ 2│ 0│ 0│ 1│ 0│ 0│ 0│ 0│ 0│ 0│ 0│ 0│ = conflicted └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i4 i3 i2 i1 i0After updating the increment vector (line 3) its content is:
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 6│ 5│ 7│ 3│ 6│ 2│ 1│ 5│ 4│ 4│ 3│ 2│ 1│ 3│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ i15 i14 i13 i12 i11 i10 i9 i8 i7 i6 i5 i4 i3 i2 i1 i0 +i4 +i3 +i2 +i1 +i0The permutation of position yields the vector full of -1. This means active mask is zero, and we finished the conflict resolution.
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ -1│ = position └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘Before moving to the final step, let's look closer on increment collated with the input vector chunk.
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 22│ 22│ 1│ 33│ 1│ 33│ 33│ 1│ 1│ 22│ 22│ 22│ 22│ 1│ 1│ 1│ = chunk └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 6│ 5│ 7│ 3│ 6│ 2│ 1│ 5│ 4│ 4│ 3│ 2│ 1│ 3│ 2│ 1│ = increment └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ │ │ │ │ │ └─ the last occurrence of 33: 3 is the count │ └─────── the last occurrence of 1: 7 is the count └─────────────────── the last occurrence of 22: 6 is the count
The last step is histogram update. First, we gather the counters for the bytes we loaded in the step 1 (line 1). Then, we update all counters with the increment vector calculated earlier (line 2). Finally, we scatter the updated values back (line 3).
The real counts of repeated values are stored at position corresponding the last occurrence of given value. The vector newValues contains garbage for earlier occurrences of a repeated value. It is not a problem, as the scatter instruction guarantees that the value written to the memory will be the correct, last one.
/* 1 */ oldValues := vec_gather(mem, chunk) /* 2 */ newValues := vec_add(oldValues, increment) /* 3 */ vec_scatter(newValues, mem, chunk)
Note: due to high latency of the gather instructions, in the assembly code the gather instruction is issued before the conflict resolution loop. Here we put it closer just to make the algorithm more readable.
input characteristics | all elements | active elements | active [%] |
---|---|---|---|
constant value | 40960 | 31360 | 76.56 |
random values from 0 to 3 | 31088 | 14754 | 47.46 |
random values from 0 to 14 | 19280 | 4888 | 25.35 |
random values from 0 to 63 | 9776 | 1155 | 11.81 |
random bytes | 4000 | 294 | 7.35 |
The solution we propose completely rids of the conflict resolution loop. The key observation is that in a single iteration of the main loop we have up to 16 conflicts. Each input item might have assigned the unique repetition id (from 0 to 15) within the subset of the same values.
By extending the algorithm state by the factor of 16, we can update without conflicts all values at once. While the original algorithm updates counters at indices chunk[i], we update a sub-counters at chunk[i] * 16 + subbucket, where subbucket is a number from 0 to 15.
After the main loop completes, we finalize computations by merging these sixteen sub-states into one.
Let's look how this approach would assign subbucket numbers in our example input vector.
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │ 22│ 22│ 1│ 33│ 1│ 33│ 33│ 1│ 1│ 22│ 22│ 22│ 22│ 1│ 1│ 1│ = chunk └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ │ │ | │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ 6 │ 5 │ │ 4 3 │ │ │ │ 2 1 0 5 4 │ │ │ 3 2 1 0 2 1 0
The following Go code shows how the histogram problem can be solved with this new approach.
func HistogramV2Reference(indices []uint32, output []uint32) { mem := output for len(indices) > 0 { chunk := vec_load(indices) conflicts := vec_conflicts(chunk) subbuckets := vec_lzcnt(conflicts) subbuckets = vec_sub(vec_set1(32), subbuckets) offsets := vec_sll(chunk, 4) offsets = vec_add(offsets, subbuckets) // 16 * index + subbucket oldValues := vec_gather(mem, offsets) newValues := vec_add(oldValues, vec_set1(1)) vec_scatter(newValues, mem, offsets) indices = indices[16:] } // finalize for i := 0; i < bins; i++ { pos := i * 16 sum := uint32(0) for j := 0; j < 16; j++ { sum += output[pos+j] } output[i] = sum } }
The assembler implementations of the Intel and new approach were tested using the standard Go benchmarking tool. Various inputs were tested:
size [32-bit words] | Intel algorithm [MB/s] | Alternative algorithm [MB/s] | Speedup |
---|---|---|---|
constant value | |||
1,024 | 860.61 | 227.23 | 0.26 |
5,120 | 869.37 | 656.92 | 0.76 |
10,240 | 870.25 | 861.90 | 0.99 |
102,400 | 867.46 | 1182.99 | 1.36 |
512,000 | 850.11 | 1221.05 | 1.44 |
1,048,576 | 849.96 | 1226.56 | 1.44 |
2,097,152 | 838.58 | 1206.70 | 1.44 |
5,242,880 | 786.06 | 1118.72 | 1.42 |
10,485,760 | 777.66 | 1117.22 | 1.44 |
sorted values from 0 to 3 | |||
1,024 | 860.67 | 227.21 | 0.26 |
5,120 | 869.26 | 657.08 | 0.76 |
10,240 | 870.47 | 861.83 | 0.99 |
102,400 | 866.41 | 1177.93 | 1.36 |
512,000 | 849.94 | 1220.74 | 1.44 |
1,048,576 | 849.05 | 1226.69 | 1.44 |
2,097,152 | 835.50 | 1200.06 | 1.44 |
5,242,880 | 784.94 | 1115.10 | 1.42 |
10,485,760 | 779.75 | 1114.16 | 1.43 |
sorted values from 0 to 14 | |||
1,024 | 860.75 | 227.22 | 0.26 |
5,120 | 869.33 | 657.22 | 0.76 |
10,240 | 862.62 | 849.14 | 0.98 |
102,400 | 868.60 | 1190.50 | 1.37 |
512,000 | 848.46 | 1219.15 | 1.44 |
1,048,576 | 850.39 | 1228.37 | 1.44 |
2,097,152 | 834.96 | 1198.63 | 1.44 |
5,242,880 | 782.96 | 1114.30 | 1.42 |
10,485,760 | 778.32 | 1112.89 | 1.43 |
sorted values from 0 to 63 | |||
1,024 | 860.72 | 227.22 | 0.26 |
5,120 | 869.31 | 641.37 | 0.74 |
10,240 | 870.51 | 861.93 | 0.99 |
102,400 | 863.03 | 1175.73 | 1.36 |
512,000 | 851.23 | 1224.23 | 1.44 |
1,048,576 | 851.38 | 1226.58 | 1.44 |
2,097,152 | 832.73 | 1194.05 | 1.43 |
5,242,880 | 782.41 | 1119.39 | 1.43 |
10,485,760 | 779.38 | 1112.67 | 1.43 |
sorted bytes | |||
1,024 | 860.77 | 227.22 | 0.26 |
5,120 | 869.25 | 657.24 | 0.76 |
10,240 | 870.45 | 861.94 | 0.99 |
102,400 | 863.78 | 1180.33 | 1.37 |
512,000 | 849.81 | 1223.83 | 1.44 |
1,048,576 | 849.93 | 1225.23 | 1.44 |
2,097,152 | 831.93 | 1194.59 | 1.44 |
5,242,880 | 785.46 | 1116.25 | 1.42 |
10,485,760 | 778.15 | 1110.47 | 1.43 |
constant value | |||
1,024 | 860.73 | 227.22 | 0.26 |
5,120 | 869.30 | 657.21 | 0.76 |
10,240 | 859.60 | 849.05 | 0.99 |
102,400 | 862.47 | 1180.54 | 1.37 |
512,000 | 849.13 | 1223.81 | 1.44 |
1,048,576 | 850.00 | 1227.04 | 1.44 |
2,097,152 | 835.06 | 1199.79 | 1.44 |
5,242,880 | 785.40 | 1117.14 | 1.42 |
10,485,760 | 779.17 | 1114.66 | 1.43 |
random values from 0 to 3 | |||
1,024 | 860.73 | 227.21 | 0.26 |
5,120 | 844.43 | 657.25 | 0.78 |
10,240 | 870.48 | 861.93 | 0.99 |
102,400 | 863.60 | 1180.64 | 1.37 |
512,000 | 850.54 | 1221.42 | 1.44 |
1,048,576 | 850.45 | 1227.06 | 1.44 |
2,097,152 | 834.29 | 1199.90 | 1.44 |
5,242,880 | 785.20 | 1117.01 | 1.42 |
10,485,760 | 777.80 | 1112.16 | 1.43 |
random values from 0 to 14 | |||
1,024 | 860.67 | 227.23 | 0.26 |
5,120 | 869.34 | 657.10 | 0.76 |
10,240 | 870.50 | 861.89 | 0.99 |
102,400 | 864.86 | 1182.81 | 1.37 |
512,000 | 846.93 | 1220.55 | 1.44 |
1,048,576 | 848.47 | 1227.36 | 1.45 |
2,097,152 | 834.09 | 1199.69 | 1.44 |
5,242,880 | 784.80 | 1118.62 | 1.43 |
10,485,760 | 779.74 | 1113.79 | 1.43 |
random values from 0 to 63 | |||
1,024 | 860.77 | 227.24 | 0.26 |
5,120 | 869.31 | 656.93 | 0.76 |
10,240 | 860.87 | 849.00 | 0.99 |
102,400 | 867.53 | 1182.82 | 1.36 |
512,000 | 849.01 | 1220.70 | 1.44 |
1,048,576 | 849.53 | 1226.27 | 1.44 |
2,097,152 | 832.92 | 1198.14 | 1.44 |
5,242,880 | 785.25 | 1116.44 | 1.42 |
10,485,760 | 779.28 | 1112.75 | 1.43 |
random bytes | |||
1,024 | 860.73 | 227.21 | 0.26 |
5,120 | 869.36 | 641.30 | 0.74 |
10,240 | 860.29 | 861.87 | 1.00 |
102,400 | 868.43 | 1190.52 | 1.37 |
512,000 | 846.36 | 1223.07 | 1.45 |
1,048,576 | 849.26 | 1228.86 | 1.45 |
2,097,152 | 834.57 | 1198.15 | 1.44 |
5,242,880 | 786.23 | 1120.11 | 1.42 |
10,485,760 | 779.94 | 1116.79 | 1.43 |
All sources are available on Github: https://github.com/WojciechMula/toys/tree/master/avx512-conflict-detection