Author:Wojciech Muła

Main page

Table of contents


SIMD — why you shouldn't use static vector constants [2018-10-28]

... in a C++ code

SIMDized sum of all bytes in the array [2018-10-24]

std::accumulate(array, array + size, 0) can be six times faster

Finding index of the minimum value using SIMD instructions [2018-10-03]

Compilers can't do this (yet)

AVX512 mask registers support in compilers [2018-05-18]

There is still room for improvement

Be careful with directory_iterator [2018-04-28]

It appears that directory_iterator is odd

Is sorted using SIMD instructions [2018-04-11]

Is sorted using SIMD instructions

When lock does not lock — C++ story [2018-03-26]

When lock does not lock — C++ story

An awful part of C++17 [2018-03-16]

Reading C++ standard may lead to cardiac arrest

Is power of two — BMI1 version [2018-03-11]

Use of the BLSR instruction

SSE/AVX: absolute value of difference of unsigned integers [2018-03-11]

Saturated arithmetic for the rescue


A short report from code::dive 2017 [2017-11-26]

I was there again


GNU std::string::find is very slow [2016-10-16]

It might be 10 times slower than strstr

SIMD bit mask [2016-09-14]

Fill a SIMD register starting from k-th bit

Implementing byte-wise lookup table with PSHUFB [2016-03-13]

A spin off from base64 decoding research. Pretty straightforward use of pshufb.

Base64 decoding with SIMD instructions [2016-01-17]

SSE code could by more than 2 times faster than lookup-based scalar code.

Base64 encoding with SIMD instructions [2016-01-12]

SSE code could by more than 2 times faster than lookup-based scalar code.

Speeding up letter case conversion [2016-01-06]

SWAR swap case could be 3 times faster than scala version for English texts


Fast conversion of floating-point values to string [2015-12-29]

Up to 15 times faster than sprintf

Base64 encoding — implementation study [2015-12-27]

This could be done slightly faster

Benefits from the obsession [2015-12-13]

Integers are evil

Implicit conversion — the enemy [2015-11-28]

Read more

Another C++ nasty feature [2015-11-22]

A picture from a bugs party

Short report from code::dive 2015 [2015-11-15]

The conference for C++ programmers, Wrocław, Poland. I was there

Boolean function for the rescue [2015-10-25]

Think out of the box and... use the most obvious solution

Tricky mistake [2015-05-25]

Our brains are not good at type inference

Compiler warnings are your future errors [2015-03-22]

A tale about unnoticed warning and unhappy consequences

Not everything in AVX2 is 256-bit [2015-03-21]

Sad news for AVX2 users


Parsing decimal numbers — part 1: SWAR

SWAR techniques to multiply digits in parallel

Using PEXT to convert from hexadecimal ASCII to number

Not obvious use of the new instruction from BMI2

Using PEXT to convert from binary ASCII to number

Nice example of use of the new instruction from BMI2

Conditionally fill word (for limited set of input values)

How to implement (x != 0) ? -1 : 0

Determining if an integer is a power of 2 — part 2

Another use of a bit-trick

Small win over compiler [2014-09-30]

There are some places where a low-level programmer can beat a compiler. Consider this simple code:

#include <stdint.h>

uint32_t bsr(uint32_t x) {
    // xor, because this builtin returns 31 - bsr(x)
    return __builtin_clz(x) ^ 31;

uint32_t min1(uint32_t x) {
    if (x != 0) {
        return bsr(x) + 1;
    } else {
        return 1;

Function min1 is compiled to (GCC 4.8 with flag -O3):

    movl    4(%esp), %edx
    movl    $1, %eax
    testl   %edx, %edx
    je  .L3
    bsrl    %edx, %eax
    addl    $1, %eax
    rep ret

There is a condtional jump, not very good. When we rewrite the function:

uint32_t min2(uint32_t x) {
    return bsr(x | 1) + 1;

Result is this nice branchless code:

    movl    4(%esp), %eax
    orl $1, %eax
    bsrl    %eax, %eax
    addl    $1, %eax

Conculsion: it's worth to check a compiler output. Sometimes.

Slow-paths in GNU libc strstr [2014-03-03]

I've observed that some patterns issued to strstr cause significant slowdown.

Sample program kill-strstr.c executes strstr(data, pattern), where data is a large string (16MB) filled with the character '?'. Patterns are read from the command line.

On my machine following times were recorded:

1. searching string 'johndoe'...
                time: 0.032
2. searching string '??????????????????a'...
                time: 0.050
3. searching string '??????????????????????????????a'...
                time: 0.049
4. searching string '???????????????????????????????a'...
                time: 0.274
5. searching string '??????????????????????????????a?'...
                time: 0.356
6. searching string '??????????????????????????????a??????????????????????????????'...
                time: 0.396
  1. Slowdown is visible in case 4 (5 times slower than pattern 3). Pattern has 32 characters, and contains '?', except the last char.
  2. Even bigger slowdown occurs in case 5 (7 times slower). This pattern also contains 32 chars, but the position of the single letter 'a' is last but one.
  3. Similar slowdown occurs in case 6 (nearly 8 times slower). In this pattern single letter 'a' is surrounded by thirty '?'.

Integer log 10 of an unsigned integer — SIMD version

SIMD version of a bit-trick

GCC - asm goto [2014-03-09]

Starting from GCC 4.5 the asm statement has new form: asm goto. A programmer can use any label from a C/C++ program, however a output from this block is not allowed.

Using an asm block in an old form requires more instructions and an additional storage:

uint32_t bit_set;
asm (
       "bt %2, %%eax  \n"
       "setc %%al  \n"
       "movzx %%al, %%eax \n"
       : "=a" (bit_set)
       : "a" (number), "r" (bit)
       : "cc"

if (bit_set)
       goto has_bit;

Above code is compiled to:

80483f6:       0f a3 d8                bt     %ebx,%eax
80483f9:       0f 92 c0                setb   %al
80483fc:       0f b6 c0                movzbl %al,%eax
80483ff:       85 c0                   test   %eax,%eax
8048401:       74 16                   je     8048419

With asm goto the same task could be writting shorter and easier:

asm goto (
       "bt %1, %0 \n"
       "jc %l[has_bit] \n"

       : /* no output */
       : "r" (number), "r" (bit)
       : "cc"
       : has_bit // name of label

Complete demo is available. See also GCC documentation about Extended Asm.

C++ standard inaccuracy [2014-03-11]

First we read:

21.4.1 basic_string general requirements [string.require]


3 No erase() or pop_back() member function shall throw any exceptions.

... few pages later: basic_string::erase [string::erase]

basic_string<charT,traits, Allocator>& erase(size_type pos = 0, size_type n = npos);


2 Throws: out_of_range if pos > size().

Quick and dirty ad-hoc git hosting [2014-03-19]

Recently I needed to synchronize my local repository with a remote machine, just for full backup. It's really simple if you have standard Linux tools (Cygwin works too, of course).

  1. in a working directory run:

    $ pwd
    #         ^^^^^^^
    $ git update-server-info
  2. in the parent directory start HTTP server:

    $ cd ..
    $ pwd
    $ python -m SimpleHTTPServer
    Serving HTTP on port 8000 ...
  3. on a remote machine clone/pull/whatever:

    $ git clone http://your_ip:8000/project/.git

Step 1 have to be executed manually when the local repository has changed.

Is const-correctness paranoia good? [2014-03-19]

Yes, definitely. Lets see this simple example:

$ cat test.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;
$ g++ -c test.cpp
$ g++ -c -Wall test.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;

Only when we turn on the warnings, a compiler tell us about a possible error. Making the parameter const shows us error:

$ cat test2.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;
$ g++ -c test.cpp
test2.cpp: In function ‘int test(int)’:
test2.cpp:2:8: error: assignment of read-only parameter ‘x’
  if (x = 1)

All input parameters should be const, all write-once variables serving as a parameters for some computations should be also const.

Penalties of errors in SSE floating point calculations

If your application slows down significantly it might be an SSE issue

x86 - ISA where 80% of instructons are unimportant [2014-01-01]

Few years ago I counted instructions from many Linux binaries — 2014 is a good year to repeated this experiment and see that nothing has changed.

I use 32-bit Debian, my installation has been updated few months ago. All files from /usr/bin and all *.so files from /usr/lib was disassembled with objdump (5050 files were processed). Instructions were grouped simply by mnemonic name. Taking into account all addressing and encoding modes would be overkill. I've published script that does the job.

Short summary:

  • The number of distinct decoded instructions is around 650. Since objdump use AT&T syntax, same opcodes are seen under different mnemonics, for example mov is saved as movw, movb, movl depending on argument size.
  • Total number of x86 instructions is around 750. Read: one hundred instructions never appeared in the binaries.
  • There are 81 instructions used just once. For example quite useful CMPPD.
  • There are 22 instructions used twice. For example MFENCE — no one aware of memory ordering?
  • There are 15 instructions used three times. For example BTC, but bit manipulating operations are useless.
  • 81 plus 22 plus 15 is 118. Another hundred of useless stuff.

Lets look at top 15 rows from detailed results, i.e. instruction with frequency grater than 1%:

  • The total count of these instructions is 87.84% of all instructions (almost all, isn't it?).
  • The most frequently used instruction is data transfer (mov/movl) — 42%
  • Control flow instructions (call/ret/jmp) — 13%.
  • Conditions (cmp/test/condition jumps: je/jne) — 10%.
  • Basic arithmetic (add/sub/lea) — 12%
  • Simple stack operations (push/pop) — 6%

Very interesting observation is that conditions are mostly based on je/jne, i.e. jump if zero/jump if not zero.

The first FPU instruction appears at 28-th position. The first integer SSE appears at 167-th position. And the first SSE instruction operating on packed floats appear at 315-th position.

Detailed results

Whole table as txt file.

instruction count %
mov 5934098 37.63%
call 1414355 8.97%
lea 1071501 6.79%
movl 760677 4.82%
push 655921 4.16%
jmp 611540 3.88%
add 560517 3.55%
je 490250 3.11%
test 475899 3.02%
pop 441608 2.80%
sub 366228 2.32%
cmp 326379 2.07%
jne 264110 1.67%
nop 242356 1.54%
ret 238569 1.51%
xor 148194 0.94%
movzbl 122730 0.78%
and 88863 0.56%
xchg 66885 0.42%
cmpl 64907 0.41%
movzwl 64589 0.41%
movb 57247 0.36%
or 52138 0.33%
shl 50908 0.32%
cmpb 50152 0.32%
jle 41083 0.26%
leave 39923 0.25%
fldl 37428 0.24%
fstpl 37368 0.24%
shr 36503 0.23%
jbe 32866 0.21%
ja 32333 0.21%
sar 30917 0.20%
flds 29672 0.19%
subl 27636 0.18%
setne 27626 0.18%
testb 27420 0.17%
addl 25906 0.16%
imul 25569 0.16%
jg 24796 0.16%
fstp 24349 0.15%
fxch 23464 0.15%
js 21550 0.14%
fstps 21248 0.13%
sbb 16607 0.11%
inc 16200 0.10%
lock 16049 0.10%
jae 14825 0.09%
sahf 14765 0.09%
dec 14276 0.09%
fnstsw 14026 0.09%
sete 13902 0.09%
movw 13895 0.09%
adc 13640 0.09%
jb 12467 0.08%
jl 11700 0.07%
repz 11178 0.07%
fldcw 11110 0.07%
jge 11019 0.07%
movswl 10816 0.07%
fildl 8852 0.06%
cmpw 7601 0.05%
jns 7490 0.05%
fldz 7331 0.05%
fmul 7229 0.05%
out 7203 0.05%
not 7028 0.04%
movsbl 6720 0.04%
in 6503 0.04%
fld 6309 0.04%
faddp 6254 0.04%
fstl 5760 0.04%
fucom 5753 0.04%
neg 5725 0.04%
fucompp 5354 0.03%
rep 5059 0.03%
fmuls 5039 0.03%
pushl 4430 0.03%
jp 4424 0.03%
fnstcw 4400 0.03%
fld1 4176 0.03%
fmulp 4133 0.03%
orl 3927 0.02%
fadds 3789 0.02%
movq 3779 0.02%
fistpl 3709 0.02%
cltd 3597 0.02%
fmull 3313 0.02%
stos 3298 0.02%
lret 3183 0.02%
scas 3103 0.02%
lods 3066 0.02%
cwtl 3064 0.02%
fadd 2852 0.02%
fucomp 2678 0.02%
orb 2481 0.02%
fildll 2418 0.02%
andl 2379 0.02%
setb 2337 0.01%
andb 2263 0.01%
552 rows more...    


I accidentally created an infinite loop [2013-12-30]

I needed to iterate through all values of 32-bit unsigned integer, so I wrote:

#include <stdint.h>
for (uint32_t i=0; i <= UINT32_MAX; i++) {
   // whatever

Is it ok? No, because the value of uint32_t will never exceed UINT32_MAX = 0xffffffff. Of course we can use larger types, like uint64_t, but on 32-bit machines this requires some additional instructions. For example gcc 4.7 compiled following code:

void loop1(void (*fun)()) {
        for (uint64_t i=0; i <= UINT32_MAX; i++) {


00000000 :
   0:   57                      push   %edi
   1:   bf 01 00 00 00          mov    $0x1,%edi
   6:   56                      push   %esi
   7:   31 f6                   xor    %esi,%esi
   9:   53                      push   %ebx
   a:   8b 5c 24 10             mov    0x10(%esp),%ebx
   e:   66 90                   xchg   %ax,%ax
  10:   ff d3                   call   *%ebx
  12:   83 c6 ff                add    $0xffffffff,%esi
  15:   83 d7 ff                adc    $0xffffffff,%edi
  18:   89 f8                   mov    %edi,%eax
  1a:   09 f0                   or     %esi,%eax
  1c:   75 f2                   jne    10
  1e:   5b                      pop    %ebx
  1f:   5e                      pop    %esi
  20:   5f                      pop    %edi
  21:   c3                      ret

TBH, I have no idea why such a weird sequence has been generated (add, adc, or, jnz). The simplest and portable solution is to detect a wrap-around of 32-bit value after increment:

uint32_t i=0;
while (1) {
        // loop body

        i += 1;
        if (i == 0) // wrap-around

In an assembly code it's even simpler, because a CPU sets the carry flag:

         xor %eax, %eax
         ; loop body

         add $1, %eax
         jnc loop

fopen a directory [2013-12-25]

It's not clear how function fopen applied to a directory should behave, manual pages don't say anything about this. So, our common sense fail — at least when use a standard library shipped with GCC, beacuse fopen returns a valid handle.

Discussion on stackoverflow pointed that fseek or ftell` would fail. But on my system it's not true, ftell(f, 0, SEEK_END) returns the size of an opened directory.

Only when we trying to read data using fread or fgetc the errno variable is set to EISDIR error code.

Here is output from simple test program:

$ ./a.out ~
testing '/home/wojtek'...
fopen: Success [errno=0]
fseek: Success [errno=0]
fseek result: 0
ftell: Success [errno=0]
ftell result: 24576
feof: Success [errno=0]
feof result: 0 (EOF=no)
fgetc: Is a directory [errno=21]
fgetc result: -1 (EOF=yes)
fread: Is a directory [errno=21]
fread result: 0

x86 extensions are useless [2013-12-12]

Intel announced new extension to SSE: instructions accelerating calculating hashes SHA-1 and SHA256. As everything else added recently to the x86 ISA, these new instructions address special cases of "something". The number of instructions, encoding modes, etc. is increasing, but do not help in general.

Let see what sha1msg1 xmm1, xmm2 does (type of arguments is packed dword):

result[0] := xmm1[0] xor xmm1[2]
result[1] := xmm1[1] xor xmm1[3]
result[2] := xmm1[2] xor xmm2[0]
result[3] := xmm1[3] xor xmm2[1]
  • The logical operation "xor" is hardcoded. Why can't we use "or", "and", "not and"? These operations are already present in ISA.
  • Indices to xmm1 and xmm2 are hardcoded too. The instruction pshufd accepts an immediate argument (1 byte) to select permutation, why sha1msg1 couldn't be feed with 2 bytes allowing a programmer to select any permutations of arguments?
  • Sources of operators are also hardcoded. Why not use another immediate (1 byte) to select sources, for example 00b = xmm1/xmm1, 01b = xmm1/xmm2, 10b = xmm2/xmm1, 11b = xmm2/xmm2.

Such generic instruction would be saved as generic_op xmm1, xmm2, imm_1, imm_2, imm_3 and execute following algorithm:

for i := 0 to 3 do
        arg1_indice := imm_1[2*i:2*i + 1]
        arg2_indice := imm_2[2*i:2*i + 1]

        if imm_3[2*i] = 1 then
                arg1 := xmm1
                arg1 := xmm2
        end if

        if imm_3[2*i + 1] = 1 then
                arg2 := xmm2
                arg2 := xmm1
        end if

        result[i] := arg1[arg1_indice] op arg2[arg2_indice]
end for

Then sha1msg1 is just a special case:

generic_xor xmm1, xmm2, 0b11100100, 0b01001110, 0b01010000

Maybe this example is "too generic", too complex, and would be hard to express in hardware. I just wanted to show that we will get shine new instructions useful in few cases. Compilers can vectorize loops and make use of SSE, but SHA is used in drivers, OS and is encapsulated in libraries — sha1msg1 and friends will never appear in ordinary programs.

FBSTP — the most complex instruction in x86 ISA [2013-11-07]

As documentation says FBSTP "Store BCD Integer and Pop".

FBSTP saves integer part of floating point value as BCD number. Instruction expects an 11-bytes buffer, the first byte is reserved for a sign, rest are BCD digits.

Disadvantages of this instruction:

  1. BCD output — conversion to ASCII is needed.
  2. All digits are saved, including leading zeros.
  3. Limited range of numbers, not all valid double values are converted. There is limit to ~18 digits.
  4. Most important: FBST is very, very slow.

Sample sources

Sources are available at github.

Program fbst_tests.c converts a number to a string, remove leading zeros, and detects errors:

$ ./test 0 12 5671245 -143433 334535 4543985349054 999999999999999999999
printf => 0.000000
FBSTP  => 0
printf => 12.000000
FBSTP  => 12
printf => 5671245.000000
FBSTP  => 5671245
printf => -143433.000000
FBSTP  => -143433
printf => 334535.000000
FBSTP  => 334535
printf => 4543985349054.000000
FBSTP  => 4543985349054
printf => 10000000000000000000000.000000
FBSTP  => NaN/overflow

Program fbst_speed.c compares instruction FBSTP with simple implementation of itoa. There are no formatting, BCD to ASCII conversion, etc. Numbers from 1 to 10,000,000 are converted.

Results from quite old Pentium M:

... 2.285 s
simple itoa...
... 0.589 s

and recent Core i7:

... 2.165 s
simple itoa...
... 0.419 s

There is no difference! FBSTP is just 5% faster on Core i7.

Problems with PDO for PostgreSQL on 32-bit machines [2013-12-07]

The size of an integer in PHP depends on machine, it has 32 bits on 32-bit architectures, and 64 on 64-bit architectures.

On 32-bit machines PDO for PostgreSQL always convert bigint numbers returned by a server to string. Never casts to integer even if value of bigint would fit in 32-bit signed integer.

Type bigint is returned for example by COUNT and SUM functions.

On 64-bit machines there is no such problem because PHP integer is the same as bigint.

Short story about PostgreSQL SUM function [2013-11-04]

Here is a simple PostgreSQL type:

        id    integer,
        total bigint

and a simple query wrapped in a stored procedure:

        RETURNS SETOF foo_t
        LANGUAGE "SQL"
AS $$
        SELECT id, SUM(some_column) FROM some_table GROUP BY id;

Now, we want to sum everything:

        RETURNS bigint -- same as
        LANGUAGE "SQL"
AS $$
        SELECT SUM(total) FROM group_foo();

And we have an error about types inconsistency!

This is caused by SUM function — in PostgreSQL there are many variants of this function, as the db engine supports function name overriding (sounds familiar for C++ guys). There are following variants in PostgreSQL 9.1:

$ \df sum
                                                 List of functions
   Schema   | Name | Result data type | Argument data types | Type
 pg_catalog | sum  | numeric          | bigint              | agg
 pg_catalog | sum  | double precision | double precision    | agg
 pg_catalog | sum  | bigint           | integer             | agg
 pg_catalog | sum  | interval         | interval            | agg
 pg_catalog | sum  | money            | money               | agg
 pg_catalog | sum  | numeric          | numeric             | agg
 pg_catalog | sum  | real             | real                | agg
 pg_catalog | sum  | bigint           | smallint            | agg

Smaller types are promoted: from integer we get bigint, from bigint we get numeric, and so on.

PostgreSQL: printf in PL/pgSQL [2013-10-06]

PostgreSQL wiki has entry about sprintf — is is quite simple approach (and isn't marked as immutable). The main drawback is iterating over all chars of a format string. Here is a version that use strpos to locate % in the format string, and it's faster around 2 times:

CREATE OR REPLACE FUNCTION printf2(fmt text, variadic args anyarray) RETURNS text
      argcnt  int  := 1;
      head    text := ”;     -- result
      tail    text := fmt;    -- unprocessed part
      k       int;
         k := strpos(tail, '%');
         IF k = 0 THEN
            -- no more '%'
            head := head || tail;
            IF substring(tail, k+1, 1) = '%' THEN
               -- escape sequence '%%'
               head := head || substring(tail, 1, k);
               tail := substring(tail, k+2);
               -- insert argument
               head := head || substring(tail, 1, k-1) || COALESCE(args[argcnt]::text, ”);
               tail := substring(tail, k+1);
               argcnt := argcnt + 1;
            END IF;
         END IF;
      END LOOP;

      RETURN head;

PHP quirk [2013-09-01]

PHP is very funny language. Here we are a simple class, without a constructor:

class Foo

$foo = new Foo($random, $names, $are, $not, $detected);
echo "ok!\n";

One can assume that interpreter will detect undeclared variables, but as their names state this doesn't happen (PHP versions 5.3..5.5):

$ php foo1.php

When the class Foo have the constructor:

class Foo
        public function __construct() { }

$foo = new Foo($random, $names, $are, $not, $detected);

everything works as expected:

$ php foo2.php
PHP Notice:  Undefined variable: random in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: names in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: are in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: not in /home/wojtek/foo2.php on line 10
PHP Notice:  Undefined variable: detected in /home/wojtek/foo2.php on line 10


Average of two unsigned integers [2012-07-02]

The famous error in Bentley's binary search code was caused by an integer overflow in following code:

unsinged a, b, c;

c = (a + b)/2

First the sum is evaluated, then divided (or shifted), but the sum could exceed the integer range. In unsafe languages, like Java, C or C++, such errors are not detected, while Ada would trigger an exception.

The safest expression, even in unsafe languages, looks like this:

unsigned a, b, c;
unsigned LSB_carry;

LSB_carry = a & b & 1;
c = a/2 + b/2 + LSB_carry;

We sum up all-but-the-lowest bits, then adjust this sum with carry from lowest bit (a & b & 1). This expression involves two shifts, 2 additions, and two ands, require also additional storage for carry.

Another approach detects an overflow without accessing to hardware registers:

unsigned a, b, c;       // sizeof(unsigned) = 4;
unsigned sum;

sum = a + b;
if (sum < a) { // or sum < b
        // overflow, combine "lost" carry bit from the highest bit
        c = (sum/2) | 0x8000000;
} else {
        c = sum/2;

This require on sum, one shift and conditionally or. The condition could be expressed branchlessly:

unsigned a, b, c;
unsigned sum;
unsigned MSB_carry;

sum = a + b;
MSB_carry = (unsigned)(sum < x); // 1 or 0

c = sum/2 | (MSB_carry << 31);

Sample program

Following python script varifies both methods (for 8-bit integers)

# Calcualate averge of two unsigned numbers
# without access to carry from MSB
# Wojciech Mula
# 2012-07-02

bits = 8        # unsigned width
mask = (1 << bits) - 1
MSB  = (1 << (bits - 1))

def base(a, b):
        return (a + b)/2

def safe1(a, b):
        LSB_carry = (a & b & 1)
        return (a >> 1) + (b >> 1) + LSB_carry

def safe2(a, b):
        sum = (a + b) & mask
        if sum < a:
                return (sum >> 1) | MSB
                return sum >> 1

def safe3(a, b):
        sum = (a + b) & mask
        MSB_carry = int(sum < a)

        return (sum >> 1) | (MSB_carry << (bits - 1))

def main():
        n = 1 << bits
        for a in xrange(n):
                for b in xrange(n):
                        ref = base(a, b)
                        r1  = safe1(a, b)
                        r2  = safe2(a, b)
                        r3  = safe3(a, b)

                        assert ref == r1
                        assert ref == r2
                        assert ref == r3

if __name__ == '__main__':


Regular expressions derivatives [2011-04-11]

Have you ever heard about "regular expression derivatives"? No? Me too. This is very interseting idea, that have pratical usage in pattern matching.

Matthew Might wrote a sample program using the idea: A regular expression matcher in Scheme using derivatives. There is also a detailed paper on this topic: Regular-expression derivatives re-examined by Owens, Reppy and Turon.

Traversing DAGs [2011-04-11]

If a DAG has got one component, then the simplest traversing method is depth-first-search, which could be easily implemented recursively (using an implicit stack).

struct DAGNode {
        // user data
        bool    visited;        // after construction = 0

void DFS_aux(DAGNode* node, const bool val) {
        if (node->visited != val) {
                // visit node

                node->visited = val;
                for (n in node.connected)
                        DFS_aux(n, val)

void DFS(DAGNode node) {
        static val = true;

        DFS_aux(node, val);
        val = not val;

On every call of DFS() the variable val is switched, and visited member is marked alternately with true or false.

There is just one problem — what if a traversing method stop execution before visiting all nodes? Of course in such situation we have to visit the DAG twice: on first pass reset (possibly many times) visited member to false, and then visit once each node.

But usually bool have at least 8 bits, so numbers could be used instead of boolean values 0 or 1. On each call of DFS() a reference number is incremented, thanks to that even if previous call stopped in the middle, the procedure will work correctly.

The only moment when visited markers have to be cleared is wrapping a reference numbers to zero. This happen every 256 calls if 8-bit values used; for wider counters (16, 32 bits) max value is greater.

void DFS(DAGNode node) {
        static unsigned int val = 1;
        if (val == 0) {
                // set visited member of all nodes to 0
                val += 1;

        DFS_aux(node, val);
        val += 1;

DAWG as dictionary? Yes! [2011-04-09]

If you read Wikipedia entry about DAWG, then you find following sentence:

Because the terminal nodes of a DAWG can be reached by multiple paths, a DAWG is not suitable for storing auxiliary information relating to each path, e.g. a word's frequency in the English language. A trie would be more useful in such a case.

This isn't true!

There is a quite simple algorithm, that allow to perform two-way minimal perfect hashing (MPH), i.e. convert any path representing a word to a unique number, or back — a number to a path (word). Values lie in the range 1 .. n, where n is the number of distinct words saved in a DAWG.

The algorithm is described in Applications of Finite Automata Representing Large Vocabularies, by Claudio Lucchiesi and Tomasz Kowaltowski (preprint is freely available somewhere online).

The main part of the algorithm is assigning to each node the number of reachable words from a node; this can be easily done in one pass. Then these numbers are used to perform perfect hashing. Hashing algorithm is fast and simple, translation from pseudocode presented in the paper is straightforward.

Algorithm requires additional memory for numbers in each node and a table of size n to implement dictionary lookups.

I've updated pyDAWG to support MPH.

Python: C extensions - sequence-like object [2001-04-08]

If a class has to support standard len() function or operator in, then must be a sequence-like. This requires a variable of type PySequenceMethods, that store addresses of proper functions. Finally the address of this structure have to be assigned to tp_as_sequence member of the main PyTypeObject variable.

Here is a sample code:

static PySequenceMethods class_seq;

static PyTypeObject class_type_dsc = {

classmeth_len(PyObject* self) {
        if (not error)
                return sequence_size;
                return -1;

classmeth_contains(PyObject* self, PyObject* value) {
        if (not error) {
                if (value in self)
                        return 1;
                        return 0;
                return -1;

PyInit_module() {
        class_seq.sq_length   = classmeth_len;
        class_seq.sq_contains = classmeth_contains;

        class_type_dsc.tp_as_sequence = &class_seq;


Python: test if object is iterable [2011-02-26]

def isiterable(obj)
                return True
        except TypeError:
                return False


Branchless set mask if value greater or how to print hex values [9.06.2010]

Suppose we need to get a mask when a nonnegative argument is greater then some constant value; in other words, we want to evaluate following expression:

if x > const_n then
   mask := 0xffffffff;
   mask := 0x00000000;

Portable branchless solution:

  • choose a magic number M := (1 << (k-1)) - 1 - n, where k is a bit position, for example 31 if we operate on 32-bit words
  • calculate R := x + M
  • k-th bit of R is set if x > n
  • fill mask with this bit - see note Fill word with selected bit

The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because the carry bit propagates through series of ones to the k-th position of the result.

Real world example — branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).

; input:    eax - hex digit
; output:   eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx

        andl 0xf, %eax
        leal 0x7ffffff6(%eax), %ebx     ; MSB(ebx)=1 when eax >= 10
        sarl $31, %ebx                  ; ebx - mask
        andl  $7, %ebx                  ; ebx = 7 when eax >= 10 (for A-F letters)
        ;andl $39, %ebx                 ; ebx = 39 when eax >= 10 (for a-f letters)
        leal '0'(%eax, %ebx), %eax      ; eax = '0' + eax + ebx => ASCII letter

It is also possible to convert 4 hex digits in parallel using similar algorithm, but the input data have to be correctly prepared. Moreover generating mask requires 3 instructions and one extra register (in a scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for a SIMD code, where similar code transforms more bytes at once.

; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx

        leal 0x76767676(%eax), %ebx        ; MSB of each byte is set when corresponding eax byte is >= 10
                                           ; (here: 0x7f - 9 = 0x76)
        andl $0x80808080, %ebx
        movl %ebx, %ecx
        shrl    $7, %ebx
        subl %ebx, %ecx                    ; ecx - byte-wise mask
        ;andl $0x07070707, %ecx            ; for ASCII letters A-F
        andl $0x27272727, %ecx             ; for ASCII letters a-f
        leal 0x30303030(%eax, %ecx), %eax  ; ecx - four ascii letters

See also: SSSE3: printing hex values (weird use of PSHUFB instruction)

Determining if an integer is a power of 2 [11.04.2010]

Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:

; input/ouput: eax
; destroys: ebx

        test    %eax,  %eax     ; x == 0?
        jz      1f

        leal -1(%eax), %ebx     ; ebx := x-1
        test    %eax,  %ebx     ; ZF  := (eax & ebx == 0)

        setz     %al
        movzx    %al, %eax       ; eax := ZF

We can use also BSF and BSR instructions, which determine position of first and last bit=1, respectively. If a number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.

; input/output: eax
; destroys: ebx, edx

        bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
        jz      1f
        bsr     %eax, %edx      ; edx := MSB's position

        cmp     %ebx, %edx      ; ZF  := (ebx = edx)

        setz    %al
        movzx   %al, %eax       ; eax := ZF

Brenchless conditional exchange [8.04.2010]

Suppose we have to exchange (or just move) two registers A and B:

  1. C := A xor B
  2. C := 0 if condition is not true
  3. A := A xor C
  4. B := B xor C

If C is 0, then A and B left unchanged, else A and B are swapped. If only a conditional move from B to A is needed, then step 4th have to be skipped.

Here is a sample x86 code, where condition is value of CF:

sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4

Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.

See also XOR linked list.

STL: map with string as key — access speedup [3.04.2010]

The idea is quite simple: we do not have a single stl::map<string, something>, but a vector of maps, indexed with O(1) time — each map stores keys sharing certain properties. Drawback: additional memory.

I've tested following grouping schemes:

  1. the length of string,
  2. the first letter of string (one level trie),
  3. both length and the first letter.

Third is the fastest — around 60% faster then plain std::map from GCC (red-black tree).

Tests: my program read plain text (I've used The Illiad from, text is split into words (~190000) and then each words is inserted into a dictionary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from Cygwin).

data struct running time [ms] speedup [%]
min avg max min avg max
std::map 269 287 355 100 100 100
first char 218 241 395 81 84 111
length 218 240 345 81 84 97
len./char 165 172 207 61 60 58
std::map 295 322 483 100 100 100
first char 243 263 460 82 82 95
length 238 248 292 80 77 60
len./char 184 190 241 62 60 50

Download test program.

Branchless signum [1.04.2010]

Problem: calculate value of sign(x):

  • -1 when x < 0
  • 0 when x = 0,
  • +1 when x > 0.

My solution do not involve any hardware specific things like ALU flags nor special instructions — just plain AND, OR, shifts.

; input: eax = X

movl %eax, %ebx
sarl $31, %eax  // eax = -1 if X less then zero, 0 otherwise

andl $0x7fffffff, %ebx
addl $0x7fffffff, %ebx // MSB is set if any lower bits were set
shrl $31, $ebx  // eax = +1 if X greater then zero, 0 otherwise

orl %ebx, %eax  // eax = result

C99 implementation:

int32_t sign(int32_t x) {
        int32_t y;
        y = (x & 0x7fffffff) + 0x7fffffff;
        return (x >> 31) | ((uint32_t)y >> 31);

Fill word with selected bit [1.04.2010]

This is continuation of the subproblem from the previous post: we have a word (byte, dword, whatever) and want to fill it with the selected bit.

The most general algorithm

  1. mask bit
    [10111010] => [00010000]
  2. clone word
    a=[00010000], b=[00010000]
  3. shift bit in first word to MSB, and to LSB in second word
    a=[10000000], b=[00000001]
  4. subtract c = a - b
  5. add missing MSB c = c OR a
uint32_t fill1(uint32_t a, int bit) {
        uint32_t b;

        b = a = a & (1 << bit);

        a <<= 31 - bit;
        b >>= bit;

        return (a - b) | a;

If arithmetic shifts are supported

  1. shift bit to MSB
  2. arithmetic shift right
uint32_t fill2(uint32_t a, int bit) {
        a <<= 31 - bit;
        return (int32_t)(a) >> 31;

386 processor specific

On processors 386+ we can clone the carry flag (CF) with sbb reg, reg and with bt reg, reg copy the selected bit from a reg to CF.

uint32_t fill386(uint32_t a, int bit) {
        uint32_t result;
        __asm__ __volatile__ (
                "bt  %1, %0\n"
                "sbb %0, %0\n"
                : "=r" (result)
                : "r" (bit), "0" (a)
        return result;

Transpose bits in byte using SIMD instructions [31.03.2010]

Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.

Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):

  1. fill vector with given byte
    [11010001] => [11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
  2. leave one bit per byte
  3. perform desired transposition ("move" bits around)
  4. perform horizontal OR of all bytes

Here is my old MMX code (polish text); below SSE/SSE5 implementation details.

Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.

# 1.1
movd       %eax, %xmm0
shufps    $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0

# 1.2
pxor      %xmm1, %xmm1
movd       %eax, %xmm0
pshufb    %xmm1, %xmm0

Ad 2. Simple pand with mask packed_qword(0x8040201008040201).

pand  MASK1, %xmm0

Ad 3. If plain SSE instructions are supported this step requires some work. First, each bit is populated to fill the whole byte (using pcmpeq — we get negated result), then mask bits on desired positons.

SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount — so in this case just one instruction is needed.

pcmpeq  %xmm1, %xmm0
pandn   MASK2, %xmm0    # pandn - to negate

# SSE5
protb    ROT, %xmm0, %xmm0

Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers halves). If one register is full of zeros, we get sum of bytes from other register.

psadbw  %xmm1, %xmm0
movd    %xmm0, %eax

Depending on instruction set, three (SSE) or two (SSE5) additional tables are needed.

PostgreSQL: get selected rows with given order [30.03.2010]

Suppose that a database stores some kind of a dictionary and an user picks some items, but wants to keep the order. For example the dictionary has entries with id=0..10, and the user picked 9, 2, 4 and 0. This simple query does the job (query splitted):

foo = SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord FROM generate_series(1, 4) AS i
SELECT * FROM dictionary INNER JOIN (foo) ON ORDER BY foo.ord


Latency of PUSHA and POPA on Core2 [25.06.2008]

With simple program following times were measured:

  • pusha — 9-10 cycles
  • sequence pusha, popa — 14 cycles

Linux scheduler: int sched_getcpu() [21.06.2008]

Function defined in scched.h returns the number of a CPU that runs calling thread or process.

SSE: conversion uint32 to float [18.06.2008]

There is no such instruction — CVTDQ2PS converts signed 32-bit ints. Solution: first zero the MSB, such number is never negative in U2, so mentioned instruction could be used. Then add 232 if the MSB was set.

float    CONST[4]     SIMD_ALIGN = packed_float((float)((uint32_t)(1 << 31))); /* 2^31 */
uint32_t MASK_0_30[4] SIMD_ALIGN = packed_dword(0x7fffffff);
uint32_t MASK_31[4]   SIMD_ALIGN = packed_dword(0x80000000);

void convert_uint32_float(uint32_t in[4], float out[4]) {
    __asm__ volatile (
    "movdqu   (%%eax), %%xmm0  \n"
    "movdqa    %%xmm0, %%xmm1  \n"

    "pand   MASK_0_30, %%xmm0  \n" // xmm0 - mask MSB bit - never less then zero in U2
    "cvtdq2ps  %%xmm0, %%xmm0  \n" // convert this value to float

    "psrad        $32, %%xmm1  \n" // populate MSB in higher word (enough to mask CONST)
    "pand       CONST, %%xmm1  \n" // xmm1 = MSB set ? float(2^31) : float(0)

    "addps     %%xmm1, %%xmm0  \n" // add 2^31 if MSB set

    "movdqu    %%xmm0, (%%ebx) \n"

    : /* no output */
    : "a" (in),
      "b" (out)

See a sample implementation.

PABSQ — absolute value of two singed 64-bit numbers [8.06.2008]

Branch-less x86 code:

movl  %eax, %ebx
sarl   $32, %ebx        ; fill ebx with sign bit
xorl  %ebx, %eax        ; negate eax (if negative)
subl  %ebx, %eax        ; increment eax by 1 (if negative)


pshufd $0b11110101, %xmm0, %xmm1        ; populate dwords 3 and 1
psrad   $32, %xmm1      ; fill quad words with sign bit
pxor  %xmm1, %xmm0      ; negate (if negative)
psubq %xmm1, %xmm0      ; increment (if negative)

RDTSC on Core2 [8.06.2008]

RDTSC is incremented with bus-clock cycles, and then multiplied by core-clock/bus-clock ratio. From programmer view, RDTSC counter is incremented by value greater then 1, for example on C2D E8200 it is 8.

Latency of RDTSC in Pentium4 is about 60-120 cycles, on AMD CPU around 6 cycles.

GCC asm constraints [7.06.2008]

Read-write variables

        : "+a" (var)

Read-only variables, registers are clobbered

This won't work, GCC complains:

        : /* no output */
        : "a" (var)
        : "eax"

We can declare a temporary variable, and treat it as read-write:

int tmp_var = var;
        : "+a" (tmp_var)

If there are more registers, or var shouldn't be changed, then we can declare a common dummy variable:

int dummy __attribute__((unused));
        : "=a" (dummy),
          "=b" (dummy),
          "=c" (dummy)
        : "a" (var_or_value1),
          "b" (var_or_value2),
          "c" (var_or_value2)