Author: | Wojciech Muła |
---|---|
Added on: | 2019-02-02 |
Updated on: | 2021-02-18 (MSVC didn't vectorize accumulate_custom_epi8, noticed by Harold Aptroot; my mistake), 2019-02-04 (results for ICC and MSVC) |
Update 2021-02-17: please check the newest status of GCC & Clang, and MSVC.
The term "auto-vectorization" means the ability of a compiler to transform given scalar algorithm into vectorized one, i.e. express dominating operation(s) using SIMD instruction.
I'm sure nobody would argue that auto-vectorization is as important as scalar optimizations performed by compilers. Now vectorization is a must.
From what I can gather, one of the first commonly used compiler of C/C++ which automatically vectorized code was Intel compiler; by the time, luckily for us, GCC and Clang caught up.
In this comparison we consider only two targets:
A few basic algorithm available in C++ algorithm library were picked. For each algorithm two specialisations were created: one operating on int8_t values and int32_t values. (Efficient use of float or doubles would require -ffast-math and while it's not always allowed, we sticked to integers.)
I used compilers installed on amazing godbolt.org. The compilers versions:
GCC, Clang:
ICC:
MSVC:
algorithm | procedure | GCC | Clang | ICC | MSVC | ||||
---|---|---|---|---|---|---|---|---|---|
AVX2 | AVX512 | AVX2 | AVX512 | AVX2 | AVX512 | AVX2 | AVX512 | ||
accumulate — custom | accumulate_custom_epi8 | no | no | no | no | no | no | no | no[1] |
accumulate_custom_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
accumulate — default | accumulate_epi8 | yes | yes | yes | yes | no | yes | yes | no[1] |
accumulate_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
all_of | all_of_epi8 | no | no | no | no | no | no | no | no |
all_of_epi32 | no | no | no | no | no | no | no | no | |
any_of | any_of_epi8 | no | no | no | no | no | no | no | no |
any_of_epi32 | no | no | no | no | no | no | no | no | |
copy | copy_epi8 | no | no | no | no | no | no | no | no |
copy_epi32 | no | no | no | no | no | no | no | no | |
copy_if | copy_if_epi8 | no | no | no | no | no | no | no | no |
copy_if_epi32 | no | no | no | no | no | no | no | no | |
count | count_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] |
count_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
count_if | count_if_epi8 | yes | yes | yes | yes | yes | yes | no | no |
count_if_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
fill | fill_epi8 | no | no | no | no | yes | yes | no | no |
fill_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
find | find_epi8 | no | no | no | no | no | no | no[2] | no[2] |
find_epi32 | no | no | no | no | no | no | no | no | |
find_if | find_if_epi8 | no | no | no | no | no | no | no | no |
find_if_epi32 | no | no | no | no | no | no | no | no | |
is_sorted | is_sorted_epi8 | no | no | no | no | no | no | no | no |
is_sorted_epi32 | no | no | no | no | no | no | no | no | |
none_of | none_of_epi8 | no | no | no | no | no | no | no | no |
none_of_epi32 | no | no | no | no | no | no | no | no | |
remove | remove_epi8 | no | no | no | no | no | no | no | no |
remove_epi32 | no | no | no | no | no | no | no | no | |
remove_if | remove_if_epi8 | no | no | no | no | no | no | no | no |
remove_if_epi32 | no | no | no | no | no | no | no | no | |
replace | replace_epi8 | no | yes | no[3] | yes | yes | yes | no | no |
replace_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
replace_if | replace_if_epi8 | no | yes | no | yes | yes | yes | no | no |
replace_if_epi32 | yes | no | no | no | yes | yes | no | no | |
reverse | reverse_epi8 | yes | yes | no | no | no | no | no[4] | no[4] |
reverse_epi32 | yes | yes | no | no | no | no | no[5] | no[5] | |
transform — abs | transform_abs_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] |
transform_abs_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
transform — increment | transform_inc_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] |
transform_inc_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
transform — negation | transform_neg_epi8 | yes | yes | yes | yes | yes | yes | no | no |
transform_neg_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
unique | unique_epi8 | no | no | no | no | no | no | no | no |
unique_epi32 | no | no | no | no | no | no | no | no |
[1] | (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) emits AVX2 code |
[2] | (1, 2) calls memchr |
[3] | loads input's chunk into a vector register, but all comparisons and stores are scalar |
[4] | (1, 2) calls ___std_reverse_trivially_swappable_1 |
[5] | (1, 2) calls ___std_reverse_trivially_swappable_4 |
A few notes:
In most compilers std::fill for int8_t ends up with memset invocation.
MSVC does not emit AVX512 instruction, only AVX2.
I was wondering why std::replace for int8_t cannot be vectorized for AVX2 targets. Following code would do the job:
m = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('/')); x = _mm256_blendv_epi8(x, _mm256_set1_epi8('_'), m);
Travis Downs pointed out that such code behaves differently than original replace, as it always overwrites all bytes, regardless comparison results. To make things more interesting, ICC vectorizes std::replace using the above approach, which seems to be a compiler bug.
However, if we rewrite replace algorithm, into form where each byte is always overwritten, then compilers GCC and clang vectorize it for the AVX2 target:
for (size_t i=0; i < size; i++) bytes[i] = bytes[i] == '/' ? '_' : bytes[i];
All implementations are availabe at github.