Author: | Wojciech Muła |
---|---|
Added on: | 2021-01-18 |
Almost two years ago I did an in-depth comparison of the autovectorization abilities of popular compilers: GCC, clang, ICC and MSVC. In this text only GCC and clang are considered, as I don't see any new versions of ICC or MSVC on godbolt.org (drop me a line if I got lost in the multitude of compiler versions). Update 2021-02-17: MSVC 19.28 status.
The question is: "what has changed between GCC 9 and GCC 10, and between clang 9 and clang 11?".
In this comparison we consider only two targets: AVX2 and AVX512.
A few basic algorithms available in the C++ standard library were picked.
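To make the procedure names used in the table concrete, here is a minimal sketch of what such procedures might look like. The signatures, the bodies and the constant 42 are assumptions made for illustration only; they are not copied from the author's implementations.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical shape of a "default accumulate" procedure: sums 32-bit elements.
int32_t accumulate_epi32(const int32_t* data, size_t n) {
    return std::accumulate(data, data + n, int32_t(0));
}

// Hypothetical shape of a "count" procedure: counts bytes equal to 42
// (the searched value is an assumption).
size_t count_epi8(const int8_t* data, size_t n) {
    return static_cast<size_t>(std::count(data, data + n, int8_t(42)));
}

// Hypothetical shape of an "all_of" procedure: the loop may exit early,
// which is the kind of control flow compilers tend to struggle with.
bool all_of_epi32(const int32_t* data, size_t n) {
    return std::all_of(data, data + n, [](int32_t x) { return x == 42; });
}
```

Each procedure is a thin wrapper around a single standard algorithm, so whether the generated loop uses SIMD instructions depends on the compiler alone.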
algorithm | procedure | GCC 9 AVX2 | GCC 9 AVX512 | GCC 10 AVX2 | GCC 10 AVX512 | Clang 9 AVX2 | Clang 9 AVX512 | Clang 11 AVX2 | Clang 11 AVX512 |
---|---|---|---|---|---|---|---|---|---|
accumulate — custom | accumulate_custom_epi8 | no | no | no | no | no | no | no | no |
accumulate — custom | accumulate_custom_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
accumulate — default | accumulate_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
accumulate — default | accumulate_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
all_of | all_of_epi8 | no | no | no | no | no | no | no | no |
all_of | all_of_epi32 | no | no | no | no | no | no | no | no |
any_of | any_of_epi8 | no | no | no | no | no | no | no | no |
any_of | any_of_epi32 | no | no | no | no | no | no | no | no |
copy | copy_epi8 | no | no | no | no[1] | no | no | no | no |
copy | copy_epi32 | no | no | no | no[1] | no | no | no | no |
copy_if | copy_if_epi8 | no | no | no | no[1] | no | no | no | no |
copy_if | copy_if_epi32 | no | no | no | no[1] | no | no | no | no |
count | count_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
count | count_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
count_if | count_if_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
count_if | count_if_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
fill | fill_epi8 | no | no | no[2] | no[2] | no | no | no[2] | no[2] |
fill | fill_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
find | find_epi8 | no | no | no | no | no | no | no | no |
find | find_epi32 | no | no | no | no | no | no | no | no |
find_if | find_if_epi8 | no | no | no | no | no | no | no | no |
find_if | find_if_epi32 | no | no | no | no | no | no | no | no |
is_sorted | is_sorted_epi8 | no | no | no | no | no | no | no | no |
is_sorted | is_sorted_epi32 | no | no | no | no | no | no | no | no |
none_of | none_of_epi8 | no | no | no | no | no | no | no | no |
none_of | none_of_epi32 | no | no | no | no | no | no | no | no |
remove | remove_epi8 | no | no | no | no | no | no | no | no |
remove | remove_epi32 | no | no | no | no | no | no | no | no |
remove_if | remove_if_epi8 | no | no | no | no | no | no | no | no |
remove_if | remove_if_epi32 | no | no | no | no | no | no | no | no |
replace | replace_epi8 | no | yes | no | yes | no[3] | yes | yes | yes |
replace | replace_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
replace_if | replace_if_epi8 | no | yes | no | yes | no | yes | no | no |
replace_if | replace_if_epi32 | yes | no | yes | yes | no | no | no | no |
reverse | reverse_epi8 | yes | yes | yes | yes | no | no | no | no |
reverse | reverse_epi32 | yes | yes | yes | yes | no | no | no | no |
transform — abs | transform_abs_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
transform — abs | transform_abs_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
transform — increment | transform_inc_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
transform — increment | transform_inc_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
transform — negation | transform_neg_epi8 | yes | yes | yes | yes | yes | yes | yes | yes |
transform — negation | transform_neg_epi32 | yes | yes | yes | yes | yes | yes | yes | yes |
unique | unique_epi8 | no | no | no | no | no | no | no | no |
unique | unique_epi32 | no | no | no | no | no | no | no | no |
[1] | SIMD instructions present, but not in the main loop |
[2] | calls memset |
[3] | loads input's chunk into a vector register, but all comparisons and stores are scalar |
The answer to the initial question is pretty sad: there is no progress.
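GCC 10 learnt how to vectorize replace_if_epi32 for AVX512 targets. On the clang side, the table shows clang 11 losing the ability to vectorize replace_if_epi8 for AVX512, while it now vectorizes replace_epi8 for AVX2. These are the only changes.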
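For reference, a replace_if procedure over 32-bit elements could be as simple as the sketch below. The predicate (negative values) and the replacement value (zero) are assumptions chosen for illustration, not the author's choice. A loop of this shape is a conditional store, which a vectorizer can express as a comparison followed by a blend or a masked store.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical body: replaces every negative element with zero.
// Both the predicate and the replacement value are assumed.
void replace_if_epi32(int32_t* data, size_t n) {
    std::replace_if(data, data + n,
                    [](int32_t x) { return x < 0; },
                    int32_t(0));
}
```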
BTW, it's worth noting that MSVC gained support for AVX512 in 2020.
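Tomasz Duda showed that the following C++ code is nicely vectorized by clang 9 and newer:

    #include <cstddef>
    #include <cstdint>

    bool is_sorted3(int32_t* a, size_t n) {
        size_t i = 0;
        if (n > 4) {
            // Main loop: test four adjacent pairs per iteration. The bitwise OR
            // evaluates all four comparisons without short-circuiting.
            for (/**/; i < n - 4; i += 4) {
                if ((a[i] > a[i + 1]) | (a[i + 1] > a[i + 2]) |
                    (a[i + 2] > a[i + 3]) | (a[i + 3] > a[i + 4])) {
                    return false;
                }
            }
        }
        // Scalar tail: check the remaining adjacent pairs.
        for (/**/; i + 1 < n; i++) {
            if (a[i] > a[i + 1])
                return false;
        }
        return true;
    }
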
I purposely called it "vectorization", not "autovectorization", as the main loop of the algorithm has to be manually adjusted to let the compiler discover the vectorization opportunity. Personally, I'd reserve the term "autovectorization" for procedures that don't need such hints from the programmer.
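To illustrate the difference, the sketch below puts the unadjusted form next to the manually restructured one. The body of is_sorted_epi32 is an assumption about what the procedure from the table might look like, and same_result is a hypothetical helper added only to check that both forms agree on a given input.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Unadjusted form (assumed body): a plain call to the standard algorithm.
bool is_sorted_epi32(const int32_t* a, size_t n) {
    return std::is_sorted(a, a + n);
}

// Manually restructured form, as given in the listing by Tomasz Duda.
bool is_sorted3(int32_t* a, size_t n) {
    size_t i = 0;
    if (n > 4) {
        for (/**/; i < n - 4; i += 4) {
            if ((a[i] > a[i + 1]) | (a[i + 1] > a[i + 2]) |
                (a[i + 2] > a[i + 3]) | (a[i + 3] > a[i + 4])) {
                return false;
            }
        }
    }
    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }
    return true;
}

// Hypothetical helper: true when both forms give the same answer.
bool same_result(int32_t* a, size_t n) {
    return is_sorted_epi32(a, n) == is_sorted3(a, n);
}
```

Both functions compute the same predicate; only the second one spells out the four-pairs-per-iteration structure that the compiler then turns into vector comparisons.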
All implementations are available on GitHub.