| Author: | Wojciech Muła | 
|---|---|
| Added on: | 2019-02-02 | 
| Updated on: | 2021-02-18 (MSVC didn't vectorize accumulate_custom_epi8, noticed by Harold Aptroot; my mistake), 2019-02-04 (results for ICC and MSVC) | 
Update 2021-02-17: please check the newest status of GCC & Clang, and MSVC.
The term "auto-vectorization" means the ability of a compiler to transform given scalar algorithm into vectorized one, i.e. express dominating operation(s) using SIMD instruction.
I'm sure nobody would argue that auto-vectorization is as important as scalar optimizations performed by compilers. Now vectorization is a must.
From what I can gather, one of the first commonly used compiler of C/C++ which automatically vectorized code was Intel compiler; by the time, luckily for us, GCC and Clang caught up.
In this comparison we consider only two targets:
A few basic algorithm available in C++ algorithm library were picked. For each algorithm two specialisations were created: one operating on int8_t values and int32_t values. (Efficient use of float or doubles would require -ffast-math and while it's not always allowed, we stuck to integers.)
I used compilers installed on amazing godbolt.org. The compilers versions:
GCC, Clang:
ICC:
MSVC:
| algorithm | procedure | GCC | Clang | ICC | MSVC | ||||
|---|---|---|---|---|---|---|---|---|---|
| AVX2 | AVX512 | AVX2 | AVX512 | AVX2 | AVX512 | AVX2 | AVX512 | ||
| accumulate — custom | accumulate_custom_epi8 | no | no | no | no | no | no | no | no[1] | 
| accumulate_custom_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
| accumulate — default | accumulate_epi8 | yes | yes | yes | yes | no | yes | yes | no[1] | 
| accumulate_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
| all_of | all_of_epi8 | no | no | no | no | no | no | no | no | 
| all_of_epi32 | no | no | no | no | no | no | no | no | |
| any_of | any_of_epi8 | no | no | no | no | no | no | no | no | 
| any_of_epi32 | no | no | no | no | no | no | no | no | |
| copy | copy_epi8 | no | no | no | no | no | no | no | no | 
| copy_epi32 | no | no | no | no | no | no | no | no | |
| copy_if | copy_if_epi8 | no | no | no | no | no | no | no | no | 
| copy_if_epi32 | no | no | no | no | no | no | no | no | |
| count | count_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] | 
| count_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
| count_if | count_if_epi8 | yes | yes | yes | yes | yes | yes | no | no | 
| count_if_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
| fill | fill_epi8 | no | no | no | no | yes | yes | no | no | 
| fill_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
| find | find_epi8 | no | no | no | no | no | no | no[2] | no[2] | 
| find_epi32 | no | no | no | no | no | no | no | no | |
| find_if | find_if_epi8 | no | no | no | no | no | no | no | no | 
| find_if_epi32 | no | no | no | no | no | no | no | no | |
| is_sorted | is_sorted_epi8 | no | no | no | no | no | no | no | no | 
| is_sorted_epi32 | no | no | no | no | no | no | no | no | |
| none_of | none_of_epi8 | no | no | no | no | no | no | no | no | 
| none_of_epi32 | no | no | no | no | no | no | no | no | |
| remove | remove_epi8 | no | no | no | no | no | no | no | no | 
| remove_epi32 | no | no | no | no | no | no | no | no | |
| remove_if | remove_if_epi8 | no | no | no | no | no | no | no | no | 
| remove_if_epi32 | no | no | no | no | no | no | no | no | |
| replace | replace_epi8 | no | yes | no[3] | yes | yes | yes | no | no | 
| replace_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
| replace_if | replace_if_epi8 | no | yes | no | yes | yes | yes | no | no | 
| replace_if_epi32 | yes | no | no | no | yes | yes | no | no | |
| reverse | reverse_epi8 | yes | yes | no | no | no | no | no[4] | no[4] | 
| reverse_epi32 | yes | yes | no | no | no | no | no[5] | no[5] | |
| transform — abs | transform_abs_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] | 
| transform_abs_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
| transform — increment | transform_inc_epi8 | yes | yes | yes | yes | yes | yes | yes | no[1] | 
| transform_inc_epi32 | yes | yes | yes | yes | yes | yes | yes | no[1] | |
| transform — negation | transform_neg_epi8 | yes | yes | yes | yes | yes | yes | no | no | 
| transform_neg_epi32 | yes | yes | yes | yes | yes | yes | no | no | |
| unique | unique_epi8 | no | no | no | no | no | no | no | no | 
| unique_epi32 | no | no | no | no | no | no | no | no | |
| [1] | (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) emits AVX2 code | 
| [2] | (1, 2) calls memchr | 
| [3] | loads input's chunk into a vector register, but all comparisons and stores are scalar | 
| [4] | (1, 2) calls ___std_reverse_trivially_swappable_1 | 
| [5] | (1, 2) calls ___std_reverse_trivially_swappable_4 | 
A few notes:
In most compilers std::fill for int8_t ends up with memset invocation.
MSVC does not emit AVX512 instruction, only AVX2.
I was wondering why std::replace for int8_t cannot be vectorized for AVX2 targets. Following code would do the job:
m = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('/')); x = _mm256_blendv_epi8(x, _mm256_set1_epi8('_'), m);
Travis Downs pointed out that such code behaves differently than original replace, as it always overwrites all bytes, regardless comparison results. To make things more interesting, ICC vectorizes std::replace using the above approach, which seems to be a compiler bug.
However, if we rewrite replace algorithm, into form where each byte is always overwritten, then compilers GCC and clang vectorize it for the AVX2 target:
for (size_t i=0; i < size; i++) bytes[i] = bytes[i] == '/' ? '_' : bytes[i];
All implementations are available at github.