Author: Wojciech Muła 2016-09-14

# Problem

Given a SIMD register (128, 256 or 512 bits wide), we want to set all bits at or above a given position k and clear all bits below it; k ranges from 0 to the register's width.

Of course a lookup table could be used, but that's not very interesting (well, maybe a little).

# SIMD program

Treat the register as a set of chunks, where a chunk could be a word, a double word, etc. Let chunk_size be the number of bits in a chunk; then n = k / chunk_size (integer division) is the index of the chunk that contains bit k.

All chunks above n have to be filled and all chunks below n cleared. The only exception is the n-th chunk, which must be filled partially.
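
The chunk rule can be modelled in plain scalar code before turning to SIMD. The following is only a sketch: the function name `set_bits_from_scalar`, the constants `CHUNK_BITS` and `CHUNK_COUNT`, and the use of an array of eight 32-bit words to stand in for a 256-bit register are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BITS  32  /* assumed chunk size: one double word */
#define CHUNK_COUNT 8   /* 8 x 32 bits models a 256-bit register */

/* Hypothetical scalar model: out[i] plays the role of the i-th chunk. */
static void set_bits_from_scalar(uint32_t out[CHUNK_COUNT], size_t k) {
    assert(k <= CHUNK_BITS * CHUNK_COUNT);

    const size_t n     = k / CHUNK_BITS;  /* chunk containing bit k */
    const size_t shift = k % CHUNK_BITS;  /* position of bit k inside it */

    for (size_t i = 0; i < CHUNK_COUNT; i++) {
        if (i > n) {
            out[i] = UINT32_MAX;            /* above n: filled */
        } else if (i == n) {
            out[i] = UINT32_MAX << shift;   /* n-th chunk: filled partially */
        } else {
            out[i] = 0;                     /* below n: cleared */
        }
    }
}
```

For k equal to the register's width, n is one past the last chunk, so no branch fills anything and the result is all zeros.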

Algorithm:

1. Prepare constants (examples for k = 71).
```
const size_t chunk_size = 32;
const size_t n     = k / chunk_size;    // n = 2
const size_t shift = k % chunk_size;    // shift = 7

const __m256i chunk_numbers = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
const __m256i chunk         = _mm256_set1_epi32(n);
```
1. Fill chunks above n.
```
//                7          6           5          4          3          2          1          0
// tmp1    = [0xffffffff|0xffffffff|0xffffffff|0xffffffff|0xffffffff|0x00000000|0x00000000|0x00000000]
const __m256i tmp1 = _mm256_cmpgt_epi32(chunk_numbers, chunk);
```
1. Fill the n-th chunk.
```
// tmp2    = [0x00000000|0x00000000|0x00000000|0x00000000|0x00000000|0xffffffff|0x00000000|0x00000000]
const __m256i tmp2 = _mm256_cmpeq_epi32(chunk_numbers, chunk);

// tmp3[2] = 0b11111111_11111111_11111111_10000000 = 0xffffff80
const __m256i tmp3 = _mm256_slli_epi32(tmp2, shift);
```
1. Merge results.
```
// result  = [0xffffffff|0xffffffff|0xffffffff|0xffffffff|0xffffffff|0xffffff80|0x00000000|0x00000000]
const __m256i result = _mm256_or_si256(tmp1, tmp3);
```
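
The four steps can be put together into a single function. This is a minimal sketch, not the code from the repository, and it rests on several assumptions. The name `set_bits_from_avx2` and the `TARGET_AVX2` macro are hypothetical. The result is stored into a plain `uint32_t` array so it can be inspected easily. The `target("avx2")` attribute is a GCC/Clang extension; with other compilers, enable AVX2 through the compiler's own options. Because `shift` is a runtime value here, the sketch uses `_mm256_sll_epi32`, which takes the shift count in a register, instead of `_mm256_slli_epi32`.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))  /* assumption: GCC/Clang */
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: writes the 256-bit mask as eight 32-bit chunks. */
TARGET_AVX2
static void set_bits_from_avx2(uint32_t out[8], size_t k) {
    assert(k <= 256);

    /* 1. Prepare constants. */
    const size_t chunk_size = 32;
    const int n     = (int)(k / chunk_size);
    const int shift = (int)(k % chunk_size);

    const __m256i chunk_numbers = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i chunk         = _mm256_set1_epi32(n);

    /* 2. Fill chunks above n. */
    const __m256i tmp1 = _mm256_cmpgt_epi32(chunk_numbers, chunk);

    /* 3. Fill the n-th chunk, then shift zeros in from the bottom. */
    const __m256i tmp2 = _mm256_cmpeq_epi32(chunk_numbers, chunk);
    const __m256i tmp3 = _mm256_sll_epi32(tmp2, _mm_cvtsi32_si128(shift));

    /* 4. Merge results and store them (unaligned store, any buffer works). */
    const __m256i result = _mm256_or_si256(tmp1, tmp3);
    _mm256_storeu_si256((__m256i *)out, result);
}
```

Calling `set_bits_from_avx2(out, 71)` should leave `out[2]` equal to `0xffffff80`, with lower chunks zero and higher chunks `0xffffffff`, matching the worked example above. The code must run on a CPU that supports AVX2.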

# Sample program

The GitHub repository contains an example program with tests.