You will find here a whole lot of SIMD (i.e. SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512).

Some of Polish texts also contain IMO interesting stuff — you can use Google Translate or at least take a look at source codes.

Computer graphics

Detecting intersection of convex polygons in 2D [2013-09-15]
Another, slightly naive approach

AVX512 implementation of JPEG zigzag transformation [2018-11-04]

SSSE3/SSE4: alpha blending — operator over [2016-03-03]

16bpp/15bpp to 32bpp pixel conversions — different methods [2016-03-06]

SSSE3: PMADDUBSW and image crossfading [2016-05-03]

SSE: modify 32bpp images with lookup tables [2016-03-04]

Population count new

See the papers I co-authored:

AVX512 8-bit positional population count procedure [2019-12-31] new

Population count using XOP instructions [2016-12-16]
Use instruction VPTERNB from AMD XOP can help a little.
Speeding up bit-parallel population count [2015-04-13]
Nearly 50% faster than naive version.
SSSE3: fast population count [2016-11-27]
Population count on large bitstring could be sevaral times faster than lookup-based approach. And an AVX2 implementation is faster than dedicated popcnt instruction.

String processing

AVX512VBMI — remove spaces from text [2019-01-05]

SIMDized check which bytes are in a set [2018-10-18]
Functions like isspace with SSE/AVX2/AVX512 instructions.
SIMD-friendly algorithms for substring searching [2017-04-29]
Faster strstr with SIMD instructions (SSE4, AVX2, AVX512, ARM Neon).
SIMD-friendly Rabin-Karp modification [2013-03-11]
As the title states. Speedup over strstr is around 3-4 times!
SSE4 string search — modification of Karp-Rabin algorithm [2008-05-27]
Acceleration of strstr with SSE4 instruction MPSADBW.
Speeding up letter case conversion [2016-01-06]
SWAR swap case could be 3 times faster than scalar version for English texts

Number conversion and printing update

Parsing series of integers with SIMD [2020-07-05] update
Parse multiple decimal integers separated by arbitrary number of delimiters can be really fast with SSE.
SWAR check if all chars are digits [2018-03-25]
We have a string and want to check if all its characters are ASCII digits.
Conversion numbers to octal representation [2014-10-02]
SWAR, SSE and BMI2 conversions.
Conversion numbers to hexadecimal representation [2016-03-07]
SWAR, SSE and BMI2 conversions.
Conversion numbers to binary representation [2016-11-20]
Branchless conversion. Keywords: SWAR, SSE, BMI2.
Using SSE to convert from hexadecimal ASCII to number [2014-10-22]
SSE procedure can convert 16- and 32-digits inputs producing 8- and 16-bytes results
Parsing decimal numbers — part 1: SWAR [2014-10-12]
SWAR techniques to multiply digits in parallel
Parsing decimal numbers — part 2: SSE [2018-03-22]
SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD.
Using PEXT to convert from hexadecimal ASCII to number [2014-10-09]
not an obvious use of the new instruction from BMI2
Using PEXT to convert from binary ASCII to number [2014-10-06]
nice example of use of the new instruction from BMI2
SSE: conversion integers to decimal representation [2018-04-25]
Fast conversion integers to decimal representation using SSE instructions.

SSSE3: printing hex values [2016-03-07]

Various algorithms new

Use AVX512 to calculate binomial coefficient [2020-03-21] new

Use AVX512 Galois field affine transformation for bit shuffling [2020-01-19] new
Arbitrary bit-shuffling within bytes or bit-matrix transposition using VGF2P8AFFINEQB and properly selected constants.
SIMDization of switch statements [2019-02-03]
some switch statements can be vectorized
SIMDized counting byte in byte stream [2019-01-29]
SSE/AVX2/AVX512 can help a lot with std::count
SIMDized sum of all bytes in the array — part 2: signed bytes [2018-11-18]
std::accumulate(array, array + size, int32_t(0)) can be 2.5 times faster
SIMDized sum of all bytes in the array [2018-10-24]
std::accumulate(array, array + size, uint32_t(0)) can be up to six times faster
Finding index of the minimum value using SIMD instructions [2018-10-03]
faster std::distance(v.begin(), std::min_element(v.begin(), v.end())); compilers can't do this (yet)
Is sorted using SIMD instructions [2018-04-18]
faster std::is_sorted
Intersection of ordered sets [2018-03-14]
a study of a special case (SIMD approach included)

AVX512 — first bit set in a large array [2016-12-21]

Implementing byte-wise lookup table with PSHUFB [2016-03-13]
A spin off from base64 decoding research. Pretty straightforward use of pshufb.
SIMD-ized searching in unique constant dictionary [2016-05-03]
"The problem: there is a ordered dictionary containing only unique keys. Dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2)."
SSE4.1: PHMINPOSUW — insertion sort [2008-08-03]
Unusual application of PHMINPOSUW instruction

Software emulation of PDEP instruction [2014-09-23]

SSE: trie lookup speedup [2013-09-30]
Different methods of walking along paths in tries.

Speedup reversing table of bytes [2010-05-01]

STL: map with string as key — access speedup [2010-04-03]

Speeding up multiple vector operations using SIMD [2018-11-14]
rewriting a nested loop can boost k-means algorithm performance

Base64 algorithm new

See the papers by Daniel Lemire and me:

ARM Neon and Base64 encoding & decoding [2017-01-07]

AVX512F base64 coding and decoding [2018-11-24]
AVX512F is not as powerful as AVX512BW, but base64 is feasible. And two times faster than AVX2 code.
Base64 encoding & decoding using AVX512BW instructions [2018-12-08]
AVX512BW makes some SIMD parts of base64 really easy; AVX512VBMI and AVX512VL make extremely easy.
Base64 encoding with SIMD instructions [2017-01-25]
SSE code could by more than 2 times faster than lookup-based scalar code.
Base64 decoding with SIMD instructions [2017-01-25]
SSE code could by more than 2 times faster than lookup-based scalar code.
Base64 encoding — implementation study [2015-12-27]
This could be done slightly faster.

Bit Twiddling

Interesting resources: Bit Twiddling Hacks (Sean Eron Anderson), Low Level Bit Hacks You Absolutely Must Know (Peteris Krumins), Bit twiddling (Jari Komppa).

Detecting bit patterns with series of zeros followed by ones [2016-01-16]
As the title states. Surprisingly it is quite easy.
Building a bitmask [2016-09-14]
"There is an array of 32-bit integers and a key — specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items." Four approaches (SIMD included) to solve this real-world problem.
SSE/AVX2: Generating mask where n leading (trailing) bytes are set [2015-03-21]
Three methods, one the best.
Mask for zero/non-zero bytes [2014-03-09]
Another bit twiddling hack.

Conditionally fill word (for limited set of input values) [2014-10-01]

Is power of two — BMI1 version [2018-03-11]
use of the BLSR instruction
Determining if an integer is a power of 2 — part 2 [2014-10-01]
another use of a bit-trick

Average of two unsigned integers [2012-07-02]

Branchless set mask if value greater or how to print hex values [2010-06-09]

Determining if an integer is a power of 2 [2010-04-11]

Branchless signum [2010-04-01]

Fill word with selected bit [2010-04-01]

Transpose bits in byte using SIMD instructions [2010-03-31]

SIMD bit mask [2016-09-14]
fill a SIMD register starting from k-th bit

Assembler x86

Keywords: FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512

Byte-wise alignr in AVX512F [2016-10-16]
AVX512F has got alignr which works on 32-bit word. The article shows how to do long shifts at byte level
Sorting an AVX512 register [2016-10-08]
Vectorized sorting of AVX512 register (or its portion, or more registers).
AVX512: ternary functions evaluation [2018-11-04]
I felt in love. Wow, vpternlog is my second favourite instruction just after pshufb.
SIMD: detecting a bit pattern [2015-03-22]
Trying to solve specific problem, different approaches are shown.
Scalar version of SSE move mask instruction [2014-03-16]
How to emulate instruction pmovmskb.
Penalties of errors in SSE floating point calculations [2018-11-26]
Special floating-point values (for example denormalized) slow down computations.

SSE4: PTEST & strlen [2007-09-08]

SSE4: greater/less or equal relations for unsigned bytes/words [2015-04-03]

Integer log 10 of an unsigned integer — SIMD version [2014-03-09]

SSE: conversion uint32 to float [2008-06-18]

PABSQ — absolute value of two singed 64-bit numbers [2008-06-08]

SSE/AVX: absolute value of difference of unsigned integers [2018-03-11]

Floating-point numbers

Convert float to int without FPU/SSE [2013-12-27]

Calculate floor value without FPU/SSE instruction [2013-12-29]

Floating point tricks [2008-06-15]

Fast conversion of floating-point values to string [2015-12-29]
Up to 15 times faster than sprintf


Speeding up LIKE '%text%' queries (at least in PostgeSQL) [2013-11-03]

PostgreSQL — faster reads from static tables [2013-11-03]


Interpolation search revisited [2014-09-25]
Interpolation search is quite interesting algorithm, however its properties make it unsuitable for most applications.
C++ bitset vs array [2014-03-22]
Is std::bitset always better? (Spoiler: no).
Encoding array of unsigned integers [2013-12-01]
Easily compress sequence of integers, how some special properties of sequence could be used to improve compression.
Efficient trie representation [2011-03-31]
Details of different structures to store trie.
Traversing trees without stack [2011-02-17]
Const memory complexity, i.e. O(1).
Join locate databases [2008-12-03]
Join locate database files without reencoding files.