You will find here a whole lot of SIMD (i.e. SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512).

Some of Polish texts also contain IMO interesting stuff — you can use Google Translate or at least take a look at source codes.

Computer graphics

Detecting intersection of convex polygons in 2D [2013-09-15]
Another, slightly naive approach

SSSE3/SSE4: alpha blending — operator over [2016-03-03]

16bpp/15bpp to 32bpp pixel conversions — different methods [2016-03-06]

SSSE3: PMADDUBSW and image crossfading [2016-05-03]

SSE: modify 32bpp images with lookup tables [2016-03-04]

Population count

See the paper by Daniel Lemire, Nathan Kurz and me: Faster Population Counts using AVX2 Instructions

Population count using XOP instructions [2016-12-16]
Use instruction VPTERNB from AMD XOP can help a little.
Speeding up bit-parallel population count [2015-04-13]
Nearly 50% faster than naive version.
SSSE3: fast population count [2016-11-27]
Population count on large bitstring could be sevaral times faster than lookup-based approach. And an AVX2 implementation is faster than dedicated popcnt instruction.

String processing update

SIMD-friendly algorithms for substring searching [2017-04-29] update
Faster strstr with SIMD instructions (SSE4, AVX2, AVX512, ARM Neon).
SIMD-friendly Rabin-Karp modification [2013-03-11]
As the title state. Speedup over strstr is around 3-4 times!
SSE4 string search — modification of Karp-Rabin algorithm [2008-05-27]
Acceleration of strstr with SSE4 instruction MPSADBW.
Speeding up letter case conversion [2016-01-06]
SWAR swap case could be 3 times faster than scalar version for English texts

Number conversion and printing

SWAR check if all chars are digits [2016-12-21]
We have a string and want to check if all its characters are ASCII digits.
Conversion numbers to octal representation [2014-10-02]
SWAR, SSE and BMI2 conversions.
Conversion numbers to hexadecimal representation [2016-03-07]
SWAR, SSE and BMI2 conversions.
Conversion numbers to binary representation [2016-11-20]
Branchless conversion. Keywords: SWAR, SSE, BMI2.

Using SSE to convert from hexadecimal ASCII to number [2014-10-22]

Parsing decimal numbers — part 1: SWAR [2014-10-12]

Parsing decimal numbers — part 2: SSE [2014-10-15]

Using PEXT to convert from hexadecimal ASCII to number [2014-10-09]

Using PEXT to convert from binary ASCII to number [2014-10-06]

SSE: conversion integers to decimal representation [2011-10-21]
Fast conversion integers to decimal representation using SSE instructions.

SSSE3: printing hex values [2016-03-07]

Various algorithms

AVX512 — first bit set in a large array [2016-12-21]

Implementing byte-wise lookup table with PSHUFB [2016-03-13]
A spin off from base64 decoding research. Pretty straightforward use of pshufb.
SIMD-ized searching in unique constant dictionary [2016-05-03]
"The problem: there is a ordered dictionary containing only unique keys. Dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2)."
SSE4.1: PHMINPOSUW — insertion sort [2008-08-03]
Unusual application of PHMINPOSUW instruction
SSE: trie lookup speedup [2013-09-30]
Different methods of walking along paths in tries.

Speedup reversing table of bytes [2010-05-01]

STL: map with string as key — access speedup [2010-04-03]

Base64 algorithm update

See the paper by Daniel Lemire and me: Faster Base64 Encoding and Decoding Using AVX2 Instructions.

ARM Neon and Base64 encoding & decoding [2017-01-07]

AVX512F base64 coding and decoding [2017-01-25]
AVX512F is not as powerful as AVX512BW, but base64 is feasible. And two times faster than AVX2 code.
Base64 encoding & decoding using AVX512BW instructions [2017-05-01] update
AVX512BW makes some SIMD parts of base64-encoding really easy. And AVX512VL makes encoding extremely easy.
Base64 encoding with SIMD instructions [2017-01-25]
SSE code could by more than 2 times faster than lookup-based scalar code.
Base64 decoding with SIMD instructions [2017-01-25]
SSE code could by more than 2 times faster than lookup-based scalar code.
Base64 encoding — implementation study [2015-12-27]
This could be done slightly faster.

Bit Twiddling

Interesting resources: Bit Twiddling Hacks (Sean Eron Anderson), Low Level Bit Hacks You Absolutely Must Know (Peteris Krumins), Bit twiddling (Jari Komppa).

Detecting bit patterns with series of zeros followed by ones [2016-01-16]
As the title states. Surprisingly it is quite easy.
Building a bitmask [2016-09-14]
"There is an array of 32-bit integers and a key — specific value. The result have to be a bit vector with bits set on these position where the key is equal to array items." Four approaches (SIMD included) to solve this real-world problem.
SSE/AVX2: Generating mask where n leading (trailing) bytes are set [2015-03-21]
Three methods, one the best.
Mask for zero/non-zero bytes [2014-03-09]
Another bit twiddling hack.

Conditionally fill word (for limited set of input values) [2014-10-01]

Determining if an integer is a power of 2 — part 2 [2014-10-01]

Average of two unsigned integers [2012-07-02]

Branchless set mask if value greater or how to print hex values [2010-06-09]

Determining if an integer is a power of 2 [2010-04-11]

Branchless signum [2010-04-01]

Fill word with selected bit [2010-04-01]

Transpose bits in byte using SIMD instructions [2010-03-31]

Assembler x86

Keywords: FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX512

Byte-wise alignr in AVX512F [2016-10-16]
AVX512F has got alignr which works on 32-bit word. The article shows how to do long shifts at byte level
Sorting an AVX512 register [2016-10-08]
Vectorized sorting of AVX512 register (or its portion, or more registers).
AVX512: ternary functions evaluation [2016-09-04]
I felt in love. Wow, vpternlog is my second favourite instruction just after pshufb.
SIMD: detecting a bit pattern [2015-03-22]
Trying to solve specific problem, different approaches are shown.
Scalar version of SSE move mask instruction [2014-03-16]
How to emulate instruction pmovmskb.
Penalties of errors in SSE floating point calculations [2014-01-26]
Special floating-point values (for example denormalized) slow down computations.

SSE4: PTEST & strlen [2007-09-08]

SSE4: greater/less or equal relations for unsigned bytes/words [2015-04-03]

Integer log 10 of an unsigned integer — SIMD version [2014-03-09]

SSE: conversion uint32 to float [2008-06-18]

PABSQ — absolute value of two singed 64-bit numbers [2008-06-08]

Floating-point numbers

Convert float to int without FPU/SSE [2013-12-27]

Calculate floor value without FPU/SSE instruction [2013-12-29]

Floating point tricks [2008-06-15]

Fast conversion of floating-point values to string [2015-12-29]
Up to 15 times faster than sprintf


Speeding up LIKE '%text%' queries (at least in PostgeSQL) [2013-11-03]

PostgreSQL — faster reads from static tables [2013-11-03]


Interpolation search revisited [2014-09-25]
Interpolation search is quite interesting algorithm, however its properties make it unsuitable for most applications.
C++ bitset vs array [2014-03-22]
Is std::bitset always better? (Spoiler: no).
Encoding array of unsigned integers [2013-12-01]
Easily compress sequence of integers, how some special properties of sequence could be used to improve compression.
Efficient trie representation [2011-03-31]
Details of different structures to store trie.
Traversing trees without stack [2011-02-17]
Const memory complexity, i.e. O(1).
Join locate databases [2008-12-03]
Join locate database files without reencoding files.