Sections

You will find here a whole lot of SIMD (i.e. SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512).

Some of Polish texts also contain IMO interesting stuff — you can use Google Translate or at least take a look at source codes.

- Detecting intersection of convex polygons in 2D [2013-09-15]
- Another, slightly naive approach

AVX512 implementation of JPEG zigzag transformation [2018-11-04]

SSSE3/SSE4: alpha blending — operator over [2016-03-03]

16bpp/15bpp to 32bpp pixel conversions — different methods [2016-03-06]

SSSE3: PMADDUBSW and image crossfading [2016-05-03]

SSE: modify 32bpp images with lookup tables [2016-03-04]

See the paper by Daniel Lemire, Nathan Kurz and me: Faster Population Counts using AVX2 Instructions

- Population count using XOP instructions [2016-12-16]
- Use instruction
`VPTERNB`from AMD XOP can help a little. - Speeding up bit-parallel population count [2015-04-13]
- Nearly 50% faster than naive version.
- SSSE3: fast population count [2016-11-27]
- Population count on large bitstring could be sevaral times faster
than lookup-based approach. And an AVX2 implementation is faster
than dedicated
`popcnt`instruction.

- SIMDized check which bytes are in a set [2018-10-18]
- Functions like
`isspace`with SSE/AVX2/AVX512 instructions. - SIMD-friendly algorithms for substring searching [2017-04-29]
- Faster
`strstr`with SIMD instructions (SSE4, AVX2, AVX512, ARM Neon). - SIMD-friendly Rabin-Karp modification [2013-03-11]
- As the title state. Speedup over
`strstr`is around 3-4 times! - SSE4 string search — modification of Karp-Rabin algorithm [2008-05-27]
- Acceleration of
`strstr`with SSE4 instruction`MPSADBW`. - Speeding up letter case conversion [2016-01-06]
- SWAR swap case could be 3 times faster than scalar version for English texts

- Parsing series of integers with SIMD [2018-04-19]
- Parse multiple decimal integers separated by arbitrary number of delimiters can be really fast with SSE.
- SWAR check if all chars are digits [2018-03-25]
- We have a string and want to check if all its characters are ASCII digits.
- Conversion numbers to octal representation [2014-10-02]
- SWAR, SSE and BMI2 conversions.
- Conversion numbers to hexadecimal representation [2016-03-07]
- SWAR, SSE and BMI2 conversions.
- Conversion numbers to binary representation [2016-11-20]
- Branchless conversion. Keywords: SWAR, SSE, BMI2.

Using SSE to convert from hexadecimal ASCII to number [2014-10-22]

Parsing decimal numbers — part 1: SWAR [2014-10-12]

Parsing decimal numbers — part 2: SSE [2018-03-22]

Using PEXT to convert from hexadecimal ASCII to number [2014-10-09]

Using PEXT to convert from binary ASCII to number [2014-10-06]

- SSE: conversion integers to decimal representation [2018-04-25]
- Fast conversion integers to decimal representation using SSE instructions.

SSSE3: printing hex values [2016-03-07]

Is sorted using SIMD instructions [2018-04-18]

- Intersection of ordered sets [2018-03-14]
- a study of a special case (SIMD approach included)

AVX512 — first bit set in a large array [2016-12-21]

- Implementing byte-wise lookup table with PSHUFB [2016-03-13]
- A spin off from base64 decoding research. Pretty straightforward use of pshufb.
- SIMD-ized searching in unique constant dictionary [2016-05-03]
- "The problem: there is a
**ordered dictionary**containing only**unique**keys. Dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2)." - SSE4.1: PHMINPOSUW — insertion sort [2008-08-03]
- Unusual application of PHMINPOSUW instruction
- SSE: trie lookup speedup [2013-09-30]
- Different methods of walking along paths in tries.

Speedup reversing table of bytes [2010-05-01]

STL: map with string as key — access speedup [2010-04-03]

See the paper by Daniel Lemire and me: Faster Base64 Encoding and Decoding Using AVX2 Instructions.

ARM Neon and Base64 encoding & decoding [2017-01-07]

- AVX512F base64 coding and decoding [2017-01-25]
- AVX512F is not as powerful as AVX512BW, but base64 is feasible. And two times faster than AVX2 code.
- Base64 encoding & decoding using AVX512BW instructions [2018-11-05]
- AVX512BW makes some SIMD parts of base64 really easy; AVX512VBMI and AVX512VL make extremely easy.
- Base64 encoding with SIMD instructions [2017-01-25]
- SSE code could by more than 2 times faster than lookup-based scalar code.
- Base64 decoding with SIMD instructions [2017-01-25]
- SSE code could by more than 2 times faster than lookup-based scalar code.
- Base64 encoding — implementation study [2015-12-27]
- This could be done slightly faster.

Interesting resources: Bit Twiddling Hacks (Sean Eron Anderson), Low Level Bit Hacks You Absolutely Must Know (Peteris Krumins), Bit twiddling (Jari Komppa).

- Detecting bit patterns with series of zeros followed by ones [2016-01-16]
- As the title states. Surprisingly it is quite easy.
- Building a bitmask [2016-09-14]
- "There is an array of 32-bit integers and a key — specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items." Four approaches (SIMD included) to solve this real-world problem.
- SSE/AVX2: Generating mask where n leading (trailing) bytes are set [2015-03-21]
- Three methods, one the best.
- Mask for zero/non-zero bytes [2014-03-09]
- Another bit twiddling hack.

Conditionally fill word (for limited set of input values) [2014-10-01]

Determining if an integer is a power of 2 — part 2 [2014-10-01]

Average of two unsigned integers [2012-07-02]

Branchless set mask if value greater or how to print hex values [2010-06-09]

Determining if an integer is a power of 2 [2010-04-11]

Branchless signum [2010-04-01]

Fill word with selected bit [2010-04-01]

Transpose bits in byte using SIMD instructions [2010-03-31]

Keywords: FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512

- Byte-wise alignr in AVX512F [2016-10-16]
- AVX512F has got alignr which works on 32-bit word. The article shows how to do long shifts at byte level
- Sorting an AVX512 register [2016-10-08]
- Vectorized sorting of AVX512 register (or its portion, or more registers).
- AVX512: ternary functions evaluation [2018-11-04]
- I felt in love. Wow,
`vpternlog`is my second favourite instruction just after`pshufb`. - SIMD: detecting a bit pattern [2015-03-22]
- Trying to solve specific problem, different approaches are shown.
- Scalar version of SSE move mask instruction [2014-03-16]
- How to emulate instruction
**pmovmskb**. - Penalties of errors in SSE floating point calculations [2014-01-26]
- Special floating-point values (for example denormalized) slow down computations.

SSE4: PTEST & strlen [2007-09-08]

SSE4: greater/less or equal relations for unsigned bytes/words [2015-04-03]

Integer log 10 of an unsigned integer — SIMD version [2014-03-09]

SSE: conversion uint32 to float [2008-06-18]

PABSQ — absolute value of two singed 64-bit numbers [2008-06-08]

Convert float to int without FPU/SSE [2013-12-27]

Calculate floor value without FPU/SSE instruction [2013-12-29]

Floating point tricks [2008-06-15]

- Fast conversion of floating-point values to string [2015-12-29]
- Up to 15 times faster than sprintf

Speeding up LIKE '%text%' queries (at least in PostgeSQL) [2013-11-03]

PostgreSQL — faster reads from static tables [2013-11-03]

- Interpolation search revisited [2014-09-25]
- Interpolation search is quite interesting algorithm, however its properties make it unsuitable for most applications.
- C++ bitset vs array [2014-03-22]
- Is std::bitset always better? (Spoiler:
**no**). - Encoding array of unsigned integers [2013-12-01]
- Easily compress sequence of integers, how some special properties of sequence could be used to improve compression.
- Efficient trie representation [2011-03-31]
- Details of different structures to store trie.
- Traversing trees without stack [2011-02-17]
- Const memory complexity, i.e. O(1).
- Join locate databases [2008-12-03]
- Join locate database files without reencoding files.