Sections

You will find here a whole lot of SIMD (i.e. SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512).

Some of Polish texts also contain IMO interesting stuff — you can use Google Translate or at least take a look at source codes.

- Detecting intersection of convex polygons in 2D [2013-09-15]
- Another, slightly naive approach

AVX512 implementation of JPEG zigzag transformation [2018-11-04]

SSSE3/SSE4: alpha blending — operator over [2016-03-03]

16bpp/15bpp to 32bpp pixel conversions — different methods [2016-03-06]

SSSE3: PMADDUBSW and image crossfading [2016-05-03]

SSE: modify 32bpp images with lookup tables [2016-03-04]

See the papers I co-authored:

- Efficient Computation of Positional Population Counts Using SIMD Instructions
with
**Marcus D. R. Klarqvist**and**Daniel Lemire** - Faster Population Counts Using AVX2 Instructions
with
**Daniel Lemire**and**Nathan Kurz**

AVX512 8-bit positional population count procedure [2019-12-31]

- Population count using XOP instructions [2016-12-16]
- Use instruction
`VPTERNB`from AMD XOP can help a little. - Speeding up bit-parallel population count [2015-04-13]
- Nearly 50% faster than naive version.
- SSSE3: fast population count [2016-11-27]
- Population count on large bitstring could be sevaral times faster
than lookup-based approach. And an AVX2 implementation is faster
than dedicated
`popcnt`instruction.

AVX512VBMI — remove spaces from text [2019-01-05]

- SIMDized check which bytes are in a set [2018-10-18]
- Functions like
`isspace`with SSE/AVX2/AVX512 instructions. - SIMD-friendly algorithms for substring searching [2017-04-29]
- Faster
`strstr`with SIMD instructions (SSE4, AVX2, AVX512, ARM Neon). - SIMD-friendly Rabin-Karp modification [2013-03-11]
- As the title states. Speedup over
`strstr`is around 3-4 times! - SSE4 string search — modification of Karp-Rabin algorithm [2008-05-27]
- Acceleration of
`strstr`with SSE4 instruction`MPSADBW`. - Speeding up letter case conversion [2016-01-06]
- SWAR swap case could be 3 times faster than scalar version for English texts

- Parsing series of integers with SIMD [2020-03-30]
- Parse multiple decimal integers separated by arbitrary number of delimiters can be really fast with SSE.
- SWAR check if all chars are digits [2018-03-25]
- We have a string and want to check if all its characters are ASCII digits.
- Conversion numbers to octal representation [2014-10-02]
- SWAR, SSE and BMI2 conversions.
- Conversion numbers to hexadecimal representation [2016-03-07]
- SWAR, SSE and BMI2 conversions.
- Conversion numbers to binary representation [2016-11-20]
- Branchless conversion. Keywords: SWAR, SSE, BMI2.
- Using SSE to convert from hexadecimal ASCII to number [2014-10-22]
- SSE procedure can convert 16- and 32-digits inputs producing 8- and 16-bytes results
- Parsing decimal numbers — part 1: SWAR [2014-10-12]
- SWAR techniques to multiply digits in parallel
- Parsing decimal numbers — part 2: SSE [2018-03-22]
- SSE procedure is able to convert two 8-digit numbers. The main instruction used in
converting to numeric value is
`PMADDWD`. - Using PEXT to convert from hexadecimal ASCII to number [2014-10-09]
- not an obvious use of the new instruction from BMI2
- Using PEXT to convert from binary ASCII to number [2014-10-06]
- nice example of use of the new instruction from BMI2
- SSE: conversion integers to decimal representation [2018-04-25]
- Fast conversion integers to decimal representation using SSE instructions.

SSSE3: printing hex values [2016-03-07]

Use AVX512 to calculate binomial coefficient [2020-03-21]

- Use AVX512 Galois field affine transformation for bit shuffling [2020-01-19]
- Arbitrary bit-shuffling within bytes or bit-matrix transposition using
`VGF2P8AFFINEQB`and properly selected constants. - SIMDization of switch statements [2019-02-03]
- some switch statements can be vectorized
- SIMDized counting byte in byte stream [2019-01-29]
- SSE/AVX2/AVX512 can help a lot with
`std::count` - SIMDized sum of all bytes in the array — part 2: signed bytes [2018-11-18]
`std::accumulate(array, array + size, int32_t(0))`can be 2.5 times faster- SIMDized sum of all bytes in the array [2018-10-24]
`std::accumulate(array, array + size, uint32_t(0))`can be up to six times faster- Finding index of the minimum value using SIMD instructions [2018-10-03]
- faster
`std::distance(v.begin(), std::min_element(v.begin(), v.end()))`; compilers can't do this (yet) - Is sorted using SIMD instructions [2018-04-18]
- faster
`std::is_sorted` - Intersection of ordered sets [2018-03-14]
- a study of a special case (SIMD approach included)

AVX512 — first bit set in a large array [2016-12-21]

- Implementing byte-wise lookup table with PSHUFB [2016-03-13]
- A spin off from base64 decoding research. Pretty straightforward use of
`pshufb`. - SIMD-ized searching in unique constant dictionary [2016-05-03]
- "The problem: there is a
**ordered dictionary**containing only**unique**keys. Dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2)." - SSE4.1: PHMINPOSUW — insertion sort [2008-08-03]
- Unusual application of PHMINPOSUW instruction

Software emulation of PDEP instruction [2014-09-23]

- SSE: trie lookup speedup [2013-09-30]
- Different methods of walking along paths in tries.

Speedup reversing table of bytes [2010-05-01]

STL: map with string as key — access speedup [2010-04-03]

- Speeding up multiple vector operations using SIMD [2018-11-14]
- rewriting a nested loop can boost k-means algorithm performance

See the papers by **Daniel Lemire** and me:

- Base64 encoding and decoding at almost the speed of a memory copy
- Faster Base64 Encoding and Decoding Using AVX2 Instructions.

ARM Neon and Base64 encoding & decoding [2017-01-07]

- AVX512F base64 coding and decoding [2018-11-24]
- AVX512F is not as powerful as AVX512BW, but base64 is feasible. And two times faster than AVX2 code.
- Base64 encoding & decoding using AVX512BW instructions [2018-12-08]
- AVX512BW makes some SIMD parts of base64 really easy; AVX512VBMI and AVX512VL make extremely easy.
- Base64 encoding with SIMD instructions [2017-01-25]
- SSE code could by more than 2 times faster than lookup-based scalar code.
- Base64 decoding with SIMD instructions [2017-01-25]
- SSE code could by more than 2 times faster than lookup-based scalar code.
- Base64 encoding — implementation study [2015-12-27]
- This could be done slightly faster.

Interesting resources: Bit Twiddling Hacks (Sean Eron Anderson), Low Level Bit Hacks You Absolutely Must Know (Peteris Krumins), Bit twiddling (Jari Komppa).

- Detecting bit patterns with series of zeros followed by ones [2016-01-16]
- As the title states. Surprisingly it is quite easy.
- Building a bitmask [2016-09-14]
- "There is an array of 32-bit integers and a key — specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items." Four approaches (SIMD included) to solve this real-world problem.
- SSE/AVX2: Generating mask where n leading (trailing) bytes are set [2015-03-21]
- Three methods, one the best.
- Mask for zero/non-zero bytes [2014-03-09]
- Another bit twiddling hack.

Conditionally fill word (for limited set of input values) [2014-10-01]

- Is power of two — BMI1 version [2018-03-11]
- use of the BLSR instruction
- Determining if an integer is a power of 2 — part 2 [2014-10-01]
- another use of a bit-trick

Average of two unsigned integers [2012-07-02]

Branchless set mask if value greater or how to print hex values [2010-06-09]

Determining if an integer is a power of 2 [2010-04-11]

Branchless signum [2010-04-01]

Fill word with selected bit [2010-04-01]

Transpose bits in byte using SIMD instructions [2010-03-31]

- SIMD bit mask [2016-09-14]
- fill a SIMD register starting from k-th bit

Keywords: FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512

- Byte-wise alignr in AVX512F [2016-10-16]
- AVX512F has got alignr which works on 32-bit word. The article shows how to do long shifts at byte level
- Sorting an AVX512 register [2016-10-08]
- Vectorized sorting of AVX512 register (or its portion, or more registers).
- AVX512: ternary functions evaluation [2018-11-04]
- I felt in love. Wow,
`vpternlog`is my second favourite instruction just after`pshufb`. - SIMD: detecting a bit pattern [2015-03-22]
- Trying to solve specific problem, different approaches are shown.
- Scalar version of SSE move mask instruction [2014-03-16]
- How to emulate instruction
**pmovmskb**. - Penalties of errors in SSE floating point calculations [2018-11-26]
- Special floating-point values (for example denormalized) slow down computations.

SSE4: PTEST & strlen [2007-09-08]

SSE4: greater/less or equal relations for unsigned bytes/words [2015-04-03]

Integer log 10 of an unsigned integer — SIMD version [2014-03-09]

SSE: conversion uint32 to float [2008-06-18]

PABSQ — absolute value of two singed 64-bit numbers [2008-06-08]

SSE/AVX: absolute value of difference of unsigned integers [2018-03-11]

Convert float to int without FPU/SSE [2013-12-27]

Calculate floor value without FPU/SSE instruction [2013-12-29]

Floating point tricks [2008-06-15]

- Fast conversion of floating-point values to string [2015-12-29]
- Up to 15 times faster than
`sprintf`

Speeding up LIKE '%text%' queries (at least in PostgeSQL) [2013-11-03]

PostgreSQL — faster reads from static tables [2013-11-03]

- Interpolation search revisited [2014-09-25]
- Interpolation search is quite interesting algorithm, however its properties make it unsuitable for most applications.
- C++ bitset vs array [2014-03-22]
- Is std::bitset always better? (Spoiler:
**no**). - Encoding array of unsigned integers [2013-12-01]
- Easily compress sequence of integers, how some special properties of sequence could be used to improve compression.
- Efficient trie representation [2011-03-31]
- Details of different structures to store trie.
- Traversing trees without stack [2011-02-17]
- Const memory complexity, i.e. O(1).
- Join locate databases [2008-12-03]
- Join locate database files without reencoding files.