Articles

2025

[2025-02-02] Change case of UTF-32-encoded strings
[2025-01-19] LoongArch64 subjective highlights
[2025-01-18] SIMD binary heap operations
[2025-01-18] AVX512: printing u64 as binary
[2025-01-12] Drawing trees
[2025-01-07] Building full-text search in JavaScript
[2025-01-05] SIMD parallel bits deposit/extract
[2025-01-03] Dividing unsigned 16-bit numbers

2024

[2024-12-21] Dividing unsigned 8-bit numbers
[2024-11-11] Myriad sequences of RISC-V code
[2024-11-09] RISC-V Vector Extension overview

2023

[2023-11-20] Simple suggestions using popcount
[2023-05-06] AVX-512 conflict detection without resolving conflicts — faster approach to handle repeated indices
[2023-04-30] Modern perfect hashing for strings — use of PEXT helps
[2023-04-09] SIMD-ized faster parse of IPv4 addresses — 2-3 times faster than scalar version
[2023-03-06] SWAR find any byte from set
[2023-02-06] AVX512: finding first byte in lanes
[2023-02-05] Finding lowest common ancestor of two nodes
[2023-02-05] Faster fractional exponents — x^{½ + ¼ + ⅛} can be calculated quite fast
[2023-02-05] Converting binary fraction to ratio
[2023-01-31] AVX512: count trailing zeros — unexpected use of popcount
[2023-01-21] AVX512: check if value belongs to a set
[2023-01-19] AVX512: generating constants
[2023-01-06] AVX512: histogram of sixteen nibbles — a solution to a very specific problem

2022

[2022-01-31] Faster hack — I explore an expression that appeared very strange at first sight
[2022-01-29] Fast parsing HTTP verbs
[2022-01-29] DDoS-ed by a service running on AWS
[2022-01-24] AVX512VBMI2 and packed varuint format
[2022-01-17] Parsing hex numbers with validation

2021

[2021-12-22] Bit test and reset vs compilers
[2021-11-23] Conversion uint32 into decimal without division nor multiplication
[2021-03-11] How to check if any word is zero — I was curious what neat tricks compilers use for conditions like (a == 0 or b == 0 or c == 0). They do not offer anything fancy.
[2021-02-17] Autovectorization status in MSVC in 2021
[2021-02-14] Counting byte in byte stream with AVX512BW instructions
[2021-02-02] How to detect if all bytes in SIMD register are the same?
[2021-01-18] Autovectorization status in GCC & Clang in 2021

2020

[2020-03-21] Use AVX512 to calculate binomial coefficient — category: weird
[2020-01-19] Use AVX512 Galois field affine transformation for bit shuffling — arbitrary bit-shuffling within bytes or bit-matrix transposition using VGF2P8AFFINEQB and properly selected constants

2019

[2019-12-31] AVX512 8-bit positional population count procedure
[2019-02-03] SIMDization of switch statements — some switch statements can be vectorized
[2019-02-03] Malloc internal memory fragmentation footprint
[2019-02-02] Auto-vectorization status in GCC, Clang, ICC and MSVC
[2019-01-29] SIMDized counting byte in byte stream — SSE/AVX2/AVX512 can help a lot
[2019-01-23] std::function and overloaded functions — I encountered some minor problems
[2019-01-08] pyahocorasick stabilisation story — how fixing a bug helped improve the Python extension in general
[2019-01-07] C++ — how to read a file into a string
[2019-01-05] AVX512VBMI — remove spaces from text

2018

[2018-11-24] Python — file modification time perils
[2018-11-18] SIMDized sum of all bytes in the array — part 2: signed bytes — std::accumulate(array, array + size, int32_t(0)) can be 2.5 times faster
[2018-11-18] How many uops are there? — are all SIMD instructions mapped on simple uops?
[2018-11-18] A short report from code::dive 2018
[2018-11-14] Speeding up multiple vector operations using SIMD — rewriting a nested loop can boost k-means algorithm performance
[2018-10-28] SIMD — why you shouldn't use static vector constants
[2018-10-24] SIMDized sum of all bytes in the array — std::accumulate(array, array + size, 0) can be six times faster
[2018-10-18] SIMDized check which bytes are in a set — functions like isspace with SSE/AVX2/AVX512 instructions
[2018-10-03] Finding index of the minimum value using SIMD instructions — compilers can't do this (yet)
[2018-05-18] AVX512 mask registers support in compilers — there is still room for improvement
[2018-05-13] AVX512 implementation of JPEG zigzag transformation — AVX512 code is 14 times faster than scalar transformation and almost 2 times faster than SSE
[2018-04-28] Be careful with directory_iterator — C++17 std::filesystem::directory_iterator is weird
[2018-04-19] Parsing series of integers with SIMD — parsing multiple decimal integers separated by an arbitrary number of delimiters can be really fast with SSE
[2018-04-14] Accidental recursion — when you copy-paste trivial code but forget about a detail
[2018-04-11] Is sorted using SIMD instructions — faster std::is_sorted
[2018-03-26] When lock does not lock — C++ story
[2018-03-16] An awful part of C++17 — reading the C++ standard may lead to cardiac arrest
[2018-03-14] Intersection of ordered sets — a study of a special case (SIMD approach included)
[2018-03-11] SSE/AVX: absolute value of difference of unsigned integers
[2018-03-11] Is power of two — BMI1 version — use of the BLSR instruction

2017

[2017-11-26] A short report from code::dive 2017
[2017-01-07] ARM Neon and Base64 encoding & decoding

2016

[2016-12-22] AVX512 — first bit set in a large array
[2016-12-21] SWAR check if all chars are digits — we have a string and want to check if all its characters are ASCII digits
[2016-12-16] Population count using XOP instructions — using the VPTERNB instruction from AMD XOP can help a little
[2016-11-28] SIMD-friendly algorithms for substring searching — faster strstr with SIMD instructions (SSE4, AVX2, AVX512, ARM Neon)
[2016-10-23] What does AVX512 conflict detection do? — that's the question
[2016-10-16] Detecting bit patterns with series of zeros followed by ones — surprisingly it is quite easy
[2016-10-16] Byte-wise alignr in AVX512F — AVX512F has alignr, which works on 32-bit words; the article shows how to do long shifts at byte level
[2016-10-08] GNU std::string::find is very slow — it might be 10 times slower than strstr
[2016-10-08] Sorting an AVX512 register — vectorized sorting of AVX512 register (or its portion, or more registers)
[2016-09-17] AVX512F base64 coding and decoding — AVX512F is not as powerful as AVX512BW, but base64 is feasible. And two times faster than AVX2 code
[2016-09-14] SIMD bit mask — fill a SIMD register starting from k-th bit
[2016-09-14] Building a bitmask — given an array of 32-bit integers and a key (a specific value), the result has to be a bit vector with bits set at the positions where the array items are equal to the key
[2016-04-03] Base64 encoding & decoding using AVX512BW instructions — AVX512BW makes some SIMD parts of base64 really easy; AVX512VBMI and AVX512VL make them extremely easy
[2016-03-13] Implementing byte-wise lookup table with PSHUFB — a spin off from base64 decoding research, pretty straightforward use of pshufb
[2016-01-17] Base64 decoding with SIMD instructions — SSE code could be more than 2 times faster than lookup-based scalar code
[2016-01-12] Base64 encoding with SIMD instructions — SSE code could be more than 2 times faster than lookup-based scalar code
[2016-01-06] Speeding up letter case conversion — SWAR swap case could be 3 times faster than scalar version for English texts

2015

[2015-12-29] Fast conversion of floating-point values to string — up to 15 times faster than sprintf
[2015-12-27] Base64 encoding — implementation study — this could be done slightly faster
[2015-12-13] Benefits from the obsession — integers are evil
[2015-11-28] Implicit conversion — the enemy
[2015-11-22] Another C++ nasty feature — a picture from a bugs party
[2015-11-15] Short report from code::dive 2015 — the conference for C++ programmers, Wrocław, Poland. I was there
[2015-10-25] Boolean function to the rescue — think outside the box and... use the most obvious solution
[2015-05-25] Tricky mistake — our brains are not good at type inference
[2015-04-13] Speeding up bit-parallel population count — nearly 50% faster than naive version
[2015-04-08] SIMD-ized searching in unique constant dictionary — there is an ordered dictionary containing only unique keys, the dictionary is read-only, and keys are 32-bit (SSE) or 64-bit (AVX2)
[2015-03-22] SIMD: detecting a bit pattern — trying to solve a specific problem, different approaches are shown
[2015-03-22] Compiler warnings are your future errors — a tale about an unnoticed warning and unhappy consequences
[2015-03-22] AVX512: ternary functions evaluation — I fell in love. Wow, vpternlog is my second favourite instruction just after pshufb
[2015-03-21] SSE/AVX2: Generating mask where n leading (trailing) bytes are set — three methods, one the best
[2015-03-21] Not everything in AVX2 is 256-bit — sad news for AVX2 users

2014

[2014-10-22] Using SSE to convert from hexadecimal ASCII to number — SSE procedure can convert 16- and 32-digit inputs producing 8- and 16-byte results
[2014-10-15] Parsing decimal numbers — part 2: SSE — SSE procedure is able to convert two 8-digit numbers. The main instruction used in converting to numeric value is PMADDWD
[2014-10-12] Parsing decimal numbers — part 1: SWAR — SWAR techniques to multiply digits in parallel
[2014-10-09] Using PEXT to convert from hexadecimal ASCII to number — not an obvious use of the new instruction from BMI2
[2014-10-06] Using PEXT to convert from binary ASCII to number — nice example of use of the new instruction from BMI2
[2014-10-02] Conversion numbers to octal representation
[2014-10-01] Determining if an integer is a power of 2 — part 2
[2014-10-01] Conditionally fill word (for limited set of input values) — how to implement (x != 0) ? -1 : 0?
[2014-09-30] Small win over compiler
[2014-09-25] Interpolation search revisited — interpolation search is quite an interesting algorithm, however its properties make it unsuitable for most applications
[2014-09-23] Software emulation of PDEP — experiments with PDEP emulation
[2014-09-21] Conversion numbers to hexadecimal representation — SWAR, SSE and BMI2 conversions
[2014-09-11] Conversion numbers to binary representation — branchless conversion: SWAR, BMI, and SSE variants
[2014-03-22] C++ bitset vs array — is std::bitset always better? (spoiler: no)
[2014-03-19] Quick and dirty ad-hoc git hosting
[2014-03-19] Is const-correctness paranoia good?
[2014-03-16] Scalar version of SSE move mask instruction — how to emulate instruction PMOVMSKB
[2014-03-11] SIMD-friendly Rabin-Karp modification — speedup over strstr is around 3-4 times
[2014-03-11] C++ standard inaccuracy
[2014-03-09] Integer log 10 of an unsigned integer — SIMD version
[2014-03-09] Mask for zero/non-zero bytes
[2014-03-09] GCC — asm goto
[2014-03-03] Slow-paths in GNU libc strstr
[2014-01-26] Penalties of errors in SSE floating point calculations — special floating-point values (for example denormalized) slow down computations
[2014-01-01] x86 - ISA where 80% of instructions are unimportant

2013

[2013-12-30] I accidentally created an infinite loop
[2013-12-29] Calculate floor value without FPU/SSE instruction
[2013-12-27] Convert float to int without FPU/SSE
[2013-12-25] fopen a directory
[2013-12-12] x86 extensions are useless
[2013-12-07] Problems with PDO for PostgreSQL on 32-bit machines
[2013-11-23] Encoding array of unsigned integers — how to easily compress a sequence of integers, and how some special properties of the sequence could be used to improve compression
[2013-11-07] FBSTP — the most complex instruction in x86 ISA
[2013-11-04] Short story about PostgreSQL SUM function
[2013-11-02] PostgreSQL — faster reads from static tables — better performance when a table is rarely modified and most queries use the same column in WHERE or ORDER BY clauses
[2013-10-06] PostgreSQL: printf in PL/pgSQL
[2013-09-30] SSE: trie lookup speedup — different methods of walking along paths in tries
[2013-09-15] Detecting intersection of convex polygons in 2D — another, slightly naive approach
[2013-09-01] PHP quirk

2012

[2012-07-02] Average of two unsigned integers
[2012-05-25] Speeding up LIKE '%text%' queries (at least in PostgreSQL) — a few milliseconds instead of seconds? Possible!

2011

[2011-10-21] SSE: conversion integers to decimal representation — fast conversion of integers to decimal representation using SSE instructions
[2011-04-11] Traversing DAGs
[2011-04-09] DAWG as dictionary? Yes!
[2011-04-08] Python: C extensions — sequence-like object
[2011-03-26] Efficient trie representation — details of different structures to store trie
[2011-02-26] Python: test if object is iterable
[2011-02-17] Traversing tree without stack — constant memory complexity, i.e. O(1)

2010

[2010-06-09] Branchless set mask if value greater or how to print hex values
[2010-05-01] Speedup reversing table of bytes
[2010-04-11] Determining if an integer is a power of 2
[2010-04-08] Branchless conditional exchange
[2010-04-03] STL: map with string as key — access speedup
[2010-04-01] Fill word with selected bit
[2010-04-01] Branchless signum
[2010-03-31] Transpose bits in byte using SIMD instructions
[2010-03-30] PostgreSQL: get selected rows with given order

2008

[2008-12-03] Join locate databases — join locate database files without reencoding files
[2008-08-03] SSE4.1: PHMINPOSUW — insertion sort — unusual application of PHMINPOSUW instruction
[2008-06-21] SSSE3: PMADDUBSW and image crossfading
[2008-06-18] SSE: conversion uint32 to float
[2008-06-15] Floating point tricks
[2008-06-08] RDTSC on Core2
[2008-06-08] PABSQ — absolute value of two signed 64-bit numbers
[2008-06-07] GCC asm constraints
[2008-06-03] SSSE3/SSE4: alpha blending — operator over
[2008-06-02] SSE4: greater/less or equal relations for unsigned bytes/words
[2008-06-01] 16bpp/15bpp to 32bpp pixel conversions — different methods
[2008-06-01] SSE: modify 32bpp images with lookup tables
[2008-05-27] SSE4 string search — modification of Karp-Rabin algorithm — acceleration of strstr with SSE4 instruction MPSADBW
[2008-05-24] SSSE3: fast popcount — population count on a large bitstring could be several times faster than a lookup-based approach. And an AVX2 implementation is faster than the dedicated popcnt instruction.
[2008-04-29] SSSE3: printing hex values