Author: | Wojciech Muła |
---|---|
Added on: | 2018-10-28 |
Updated: | 2018-10-29 |
When working with SSE/AVX2/AVX512 it's virtually impossible not to use some vector constants, which are defined by _mm_set_epi32 or similar intrinsic functions.
If your program is written in C++ NEVER EVER use static const for such constants. Why? From what I can gather, a compiler treats vector types not as PODs (Plain-Old-Data), but as fully-featured classes that have to be constructed and destructed by some additional code.
I checked this on GCC 7.3.0 from Debian, and then confirmed on GCC 8.2.0 and Clang 7.0.0 on godbolt.org.
Let's consider these trivial functions:
    #include <immintrin.h>

    __m128i increment(__m128i input) {
        const __m128i consts = _mm_set_epi32(1, 2, 3, 4);
        return _mm_add_epi32(input, consts);
    }

    __m128i increment_static(__m128i input) {
        static const __m128i consts = _mm_set_epi32(1, 2, 3, 4);
        return _mm_add_epi32(input, consts);
    }
GCC 7.3.0 was invoked with options -msse4.1 -O3, and assembly for increment is as simple as we may expect:
    _Z9incrementDv2_x:
    .LFB4762:
            paddd   .LC0(%rip), %xmm0
            ret
    ...
    .LC0:
            .long   4
            .long   3
            .long   2
            .long   1

There is only a single paddd instruction with a memory argument, which points to the data at label .LC0. The compiler figured out that the values we read are plain integers.
Now, let's look at the assembly generated for increment_static:
    _Z16increment_staticDv2_x:
    .LFB4762:
            movzbl  _ZGVZ16increment_staticDv2_xE6consts(%rip), %eax
            testb   %al, %al
            je      .L2
            movdqa  _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
            paddd   %xmm1, %xmm0
            ret
    .L2:
            leaq    _ZGVZ16increment_staticDv2_xE6consts(%rip), %rdi
            subq    $24, %rsp
            movaps  %xmm0, (%rsp)
            call    __cxa_guard_acquire@PLT
            testl   %eax, %eax
            movdqa  (%rsp), %xmm0
            jne     .L4
            movdqa  _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
            addq    $24, %rsp
            paddd   %xmm1, %xmm0
            ret
    .L4:
            movdqa  .LC1(%rip), %xmm1
            leaq    _ZGVZ16increment_staticDv2_xE6consts(%rip), %rdi
            movaps  %xmm0, (%rsp)
            movaps  %xmm1, _ZZ16increment_staticDv2_xE6consts(%rip)
            call    __cxa_guard_release@PLT
            movdqa  .LC0(%rip), %xmm1
            movdqa  (%rsp), %xmm0
            addq    $24, %rsp
            paddd   %xmm1, %xmm0
            ret
    .LC0:
            .long   4
            .long   3
            .long   2
            .long   1

The sub-procedures labelled .L2 and .L4 deal with the static initialization of the values stored as four longs at .LC0. The details of this are not that important.
The performance problem with this code is that upon each call to the procedure there's a test whether a vector was initialized or not:
    _Z16increment_staticDv2_x:
            movzbl  _ZGVZ16increment_staticDv2_xE6consts(%rip), %eax   <<< HERE
            testb   %al, %al                                            <<< HERE
            je      .L2                                                 <<< and HERE
            movdqa  _ZZ16increment_staticDv2_xE6consts(%rip), %xmm1
            paddd   %xmm1, %xmm0
To make things worse, for every static vector used in a procedure similar code is generated.
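As a rough mental model, a guarded function-local static behaves somewhat like the hand-written double-checked initialization below. This is only a sketch under stated assumptions: Vec4, consts_ready, consts_lock, consts_storage, get_consts and increment_guarded are hypothetical names, a plain struct stands in for __m128i, and a std::mutex stands in for the __cxa_guard_acquire/__cxa_guard_release pair. The real runtime mechanism differs in detail.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for __m128i: four 32-bit lanes, lowest lane first.
struct Vec4 { int v[4]; };

// Hypothetical hand-written counterparts of what the compiler emits:
static std::atomic<bool> consts_ready{false};  // plays the role of the guard variable
static std::mutex        consts_lock;          // plays the role of __cxa_guard_acquire/release
static Vec4              consts_storage;       // plays the role of the static's storage

const Vec4& get_consts() {
    // This check runs on EVERY call, even long after initialization is done.
    if (!consts_ready.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> hold(consts_lock);
        if (!consts_ready.load(std::memory_order_relaxed)) {
            consts_storage = Vec4{{4, 3, 2, 1}};  // same lane order as .LC0
            consts_ready.store(true, std::memory_order_release);
        }
    }
    return consts_storage;
}

Vec4 increment_guarded(Vec4 input) {
    const Vec4& c = get_consts();
    for (int i = 0; i < 4; ++i) {
        input.v[i] += c.v[i];
    }
    return input;
}
```

With 36 such constants in one procedure, the model above implies 36 separate flag checks per call, which is consistent with the slowdown described next.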
I spotted this problem when profiling procedures that execute dozens of various vector shuffles. There are a lot of auxiliary vector constants: procedure #1 uses 36 and procedure #2 uses 22. I know, it is not a typical case.
With static const we have:
    #1 :   94.000 cycle/op (best)  102.627 cycle/op (avg)
    #2 :   70.000 cycle/op (best)   78.737 cycle/op (avg)
Without static, just const:
    #1 :   52.000 cycle/op (best)   57.248 cycle/op (avg)
    #2 :   40.000 cycle/op (best)   45.390 cycle/op (avg)
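The change between the two measurements is mechanical: every static const vector becomes a plain const local. Here is a minimal sketch of that pattern. The function add_and_mask and its two constants are hypothetical and are not one of the profiled procedures; only SSE2 intrinsics are used.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical kernel with two vector constants, both plain const locals.
// No guard variable is needed, so the compiler is free to fold each
// constant into a read-only memory operand.
__m128i add_and_mask(__m128i input) {
    const __m128i offsets = _mm_set_epi32(1, 2, 3, 4);  // lanes (low to high): 4, 3, 2, 1
    const __m128i mask    = _mm_set1_epi32(0xff);       // keep the low byte of each lane
    return _mm_and_si128(_mm_add_epi32(input, offsets), mask);
}
```

Whether a given compiler really emits a single memory operand per constant is best confirmed by inspecting the generated assembly, as done for increment above.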
I'm not a C++ language lawyer, but I suspect that this behaviour of the compilers has its roots in the language standard.
Update 2018-10-29
Matt pointed out that this is the implementation of "magic statics", i.e. thread-safe initialization of function-local statics (standard, explanation); it can be disabled with the -fno-threadsafe-statics flag.
But I still don't understand why the C++ compilers treat vector types as if they were non-PODs, while they surely are PODs; std::is_pod evaluates to true for all vector types (__m128, __m128d, __m128i).
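The POD observation can be checked at compile time. The sketch below assumes an x86 target with GCC or Clang. It uses std::is_trivial and std::is_standard_layout, which together define a POD type, because std::is_pod itself is deprecated since C++20. The helper name is_pod_like is hypothetical. GCC may warn about ignored attributes on the template argument.

```cpp
#include <immintrin.h>
#include <type_traits>

// Hypothetical helper: a POD type is one that is both trivial and standard-layout.
template <typename T>
constexpr bool is_pod_like() {
    return std::is_trivial<T>::value && std::is_standard_layout<T>::value;
}

// Expected to hold, given that std::is_pod reports true for these types.
static_assert(is_pod_like<__m128>(),  "__m128 should be trivial and standard-layout");
static_assert(is_pod_like<__m128d>(), "__m128d should be trivial and standard-layout");
static_assert(is_pod_like<__m128i>(), "__m128i should be trivial and standard-layout");
```

So the guard is not a consequence of the type itself. A plausible reading is that the initializer, a call to _mm_set_epi32, is not treated as a constant expression for the static, which forces dynamic initialization.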