SIMD — why you shouldn't use static vector constants

Author:	Wojciech Muła
Added on:	2018-10-28
Updated:	2018-10-29

Introduction

When work with SSE/AVX2/AVX512 it's virtually impossible not to use some vector constants, which are defined by _mm_set_epi32 or similar intrinsic functions.

If your program is written in C++ NEVER EVER use static const for such constants. Why? From what I can gather, a compiler treats vector types not as PODs (Plain-Old-Data), but as fully-featured classes that have to be constructed and destructed by some additional code.

I checked this on GCC 7.3.0 from Debian, and then confirmed on GCC 8.2.0 and Clang 7.0.0 on godbolt.org.

Details

Let's consider these trivial functions:

#include <immintrin.h>

__m128i increment(__m128i input) {

    const __m128i consts = _mm_set_epi32(1, 2, 3, 4);

    return _mm_add_epi32(input, consts);
}

__m128i increment_static(__m128i input) {

    static const __m128i consts = _mm_set_epi32(1, 2, 3, 4);

    return _mm_add_epi32(input, consts);
}

GCC 7.3.0 was invoked with options -msse4.1 -O3, and assembly for increment is as simple as we may expect:

_Z9incrementDv2_x:
.LFB4762:
    paddd    .LC0(%rip), %xmm0
    ret

...

.LC0:
    .long    4
    .long    3
    .long    2
    .long    1

There is only paddd instruction with a memory argument, which points to data at label .LC0. The compiler figured out that values we read are plain integers.

Now, let's look at the assembly generated for increment_static:

_Z16increment_staticDv2_x:
.LFB4762:
    movzbl  _ZGVZ16increment_staticDv2_xE6consts(%rip), %eax
    testb   %al, %al
    je      .L2
    movdqa  _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
    paddd   %xmm1, %xmm0
    ret
.L2:
    leaq    _ZGVZ16increment_staticDv2_xE6consts(%rip), %rdi
    subq    $24, %rsp
    movaps  %xmm0, (%rsp)
    call    __cxa_guard_acquire@PLT
    testl    %eax, %eax
    movdqa  (%rsp), %xmm0
    jne     .L4
    movdqa  _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
    addq    $24, %rsp
    paddd   %xmm1, %xmm0
    ret
.L4:
    movdqa  .LC1(%rip), %xmm1
    leaq    _ZGVZ16increment_staticDv2_xE6consts(%rip), %rdi
    movaps  %xmm0, (%rsp)
    movaps  %xmm1, _ZZ16increment_staticDv2_xE6consts(%rip)
    call    __cxa_guard_release@PLT
    movdqa  .LC0(%rip), %xmm1
    movdqa  (%rsp), %xmm0
    addq    $24, %rsp
    paddd   %xmm1, %xmm0
    ret
.LC0:
    .long   4
    .long   3
    .long   2
    .long   1

The sub-procedures labelled with .L2 and .L4 deal with static initialization of values stored as four longs at .LCO. The details of this are not that important.

The performance problem with this code is that upon each call to the procedure there's a test whether a vector was initialized or not:

_Z16increment_staticDv2_x:
    movzbl   _ZGVZ16increment_staticDv2_xE6consts(%rip), %eax  <<< HERE
    testb    %al, %al                                          <<< HERE
    je      .L2                                                <<< and HERE

    movdqa   _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
    paddd    %xmm1, %xmm0

To make things worse, for every static vector used in a procedure similar code is generated.

Example of performance problem

I spotted this problem when profiling procedures that execute dozens of various vectors shuffles. There are a lot of auxilary vector constats, procedure #1 uses 36, and procedure #2 — 22 constants. I know, it is not a typical case.

With static const we have:

#1 : 94.000 cycle/op (best)  102.627 cycle/op (avg)
#2 : 70.000 cycle/op (best)   78.737 cycle/op (avg)

Without static, just const:

#1 : 52.000 cycle/op (best)   57.248 cycle/op (avg)
#2 : 40.000 cycle/op (best)   45.390 cycle/op (avg)

Conclusions

I'm not a C++ language lawyer, but suspect that such behaviour of the compilers has roots in the language standard.

Update 2018-10-29

Matt pointed out that this is implementation of "magic constants" (standard, explanation); it can be disabled with -fno-threadsafe-statics flag.

But still I don't understand why the C++ compilers treat vector types as they were non-PODs, while they surely are PODs; std::is_pod evaluates to true for all vector types (__m128, __m128d, __m128i).