Autovectorization status in GCC & Clang in 2021

Author: Wojciech Muła
Added on: 2021-01-18

Introduction

Almost two years ago I did an in-depth comparison of the autovectorization abilities of popular compilers: GCC, clang, ICC and MSVC. This text considers only GCC and clang, as I don't see any new versions of ICC or MSVC on godbolt.org (drop me a line if I got lost in the multitude of compiler versions). Update 2021-02-17: MSVC 19.28 status.

The question is: what has changed between GCC 9 and GCC 10, and between clang 9 and clang 11?

Comparison

In this comparison we consider only two targets:

  1. AVX2,
  2. AVX512 with a broad set of extensions enabled (AVX512BW, AVX512VL, AVX512VBMI, etc.; the exact list is given under "Compiler flags").

A few basic algorithms available in the C++ algorithm library were picked.
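
The procedure names in the results table (accumulate_epi32, transform_inc_epi8, and so on) suggest thin wrappers around standard algorithms. Here is a minimal sketch of what two of them might look like. The signatures and bodies are assumptions made for illustration, not the benchmarked code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical wrapper: sums 32-bit integers with the default operator+.
int32_t accumulate_epi32(const int32_t* data, size_t n) {
    return std::accumulate(data, data + n, int32_t(0));
}

// Hypothetical wrapper: adds 1 to every byte, in place.
void transform_inc_epi8(int8_t* data, size_t n) {
    std::transform(data, data + n, data,
                   [](int8_t x) { return int8_t(x + 1); });
}
```

One way to check such a file is to compile it with -O3 -mavx2 -S and look for ymm registers in the main loop of the generated assembly.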

Compiler versions

  • Debian GCC 10.2.1 20210108
  • Debian clang version 11.0.1-2

Compiler flags

  • AVX2: -O3 -mavx2
  • AVX512: -O3 -mavx512f -mavx512dq -mavx512bw -mavx512vbmi -mavx512vbmi2 -mavx512vl

Results

                                                  GCC 9           GCC 10          Clang 9         Clang 11
algorithm               procedure                 AVX2    AVX512  AVX2    AVX512  AVX2    AVX512  AVX2    AVX512
accumulate (custom)     accumulate_custom_epi8    no      no      no      no      no      no      no      no
                        accumulate_custom_epi32   yes     yes     yes     yes     yes     yes     yes     yes
accumulate (default)    accumulate_epi8           yes     yes     yes     yes     yes     yes     yes     yes
                        accumulate_epi32          yes     yes     yes     yes     yes     yes     yes     yes
all_of                  all_of_epi8               no      no      no      no      no      no      no      no
                        all_of_epi32              no      no      no      no      no      no      no      no
any_of                  any_of_epi8               no      no      no      no      no      no      no      no
                        any_of_epi32              no      no      no      no      no      no      no      no
copy                    copy_epi8                 no      no      no      no[1]   no      no      no      no
                        copy_epi32                no      no      no      no[1]   no      no      no      no
copy_if                 copy_if_epi8              no      no      no      no[1]   no      no      no      no
                        copy_if_epi32             no      no      no      no[1]   no      no      no      no
count                   count_epi8                yes     yes     yes     yes     yes     yes     yes     yes
                        count_epi32               yes     yes     yes     yes     yes     yes     yes     yes
count_if                count_if_epi8             yes     yes     yes     yes     yes     yes     yes     yes
                        count_if_epi32            yes     yes     yes     yes     yes     yes     yes     yes
fill                    fill_epi8                 no      no      no[2]   no[2]   no      no      no[2]   no[2]
                        fill_epi32                yes     yes     yes     yes     yes     yes     yes     yes
find                    find_epi8                 no      no      no      no      no      no      no      no
                        find_epi32                no      no      no      no      no      no      no      no
find_if                 find_if_epi8              no      no      no      no      no      no      no      no
                        find_if_epi32             no      no      no      no      no      no      no      no
is_sorted               is_sorted_epi8            no      no      no      no      no      no      no      no
                        is_sorted_epi32           no      no      no      no      no      no      no      no
none_of                 none_of_epi8              no      no      no      no      no      no      no      no
                        none_of_epi32             no      no      no      no      no      no      no      no
remove                  remove_epi8               no      no      no      no      no      no      no      no
                        remove_epi32              no      no      no      no      no      no      no      no
remove_if               remove_if_epi8            no      no      no      no      no      no      no      no
                        remove_if_epi32           no      no      no      no      no      no      no      no
replace                 replace_epi8              no      yes     no      yes     no[3]   yes     yes     yes
                        replace_epi32             yes     yes     yes     yes     yes     yes     yes     yes
replace_if              replace_if_epi8           no      yes     no      yes     no      yes     no      no
                        replace_if_epi32          yes     no      yes     yes     no      no      no      no
reverse                 reverse_epi8              yes     yes     yes     yes     no      no      no      no
                        reverse_epi32             yes     yes     yes     yes     no      no      no      no
transform (abs)         transform_abs_epi8        yes     yes     yes     yes     yes     yes     yes     yes
                        transform_abs_epi32       yes     yes     yes     yes     yes     yes     yes     yes
transform (increment)   transform_inc_epi8        yes     yes     yes     yes     yes     yes     yes     yes
                        transform_inc_epi32       yes     yes     yes     yes     yes     yes     yes     yes
transform (negation)    transform_neg_epi8        yes     yes     yes     yes     yes     yes     yes     yes
                        transform_neg_epi32       yes     yes     yes     yes     yes     yes     yes     yes
unique                  unique_epi8               no      no      no      no      no      no      no      no
                        unique_epi32              no      no      no      no      no      no      no      no

[1] SIMD instructions present, but not in the main loop
[2] calls memset
[3] loads a chunk of the input into a vector register, but all comparisons and stores are scalar
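
Several of the "no" rows above (all_of, any_of, none_of, find, find_if, is_sorted) have one thing in common: the loop may stop before reaching the end of the input, whereas count_if always visits every element. This may be part of why they are not vectorized, although the comparison itself does not investigate the cause. The sketch below contrasts the two loop shapes. The signatures and the "negative value" predicate are illustrative assumptions, not taken from the benchmarked sources:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical predicate, chosen only for illustration.
static bool is_negative(int8_t x) { return x < 0; }

// Always scans the whole range; reported as vectorized in the table.
size_t count_if_epi8(const int8_t* data, size_t n) {
    return static_cast<size_t>(std::count_if(data, data + n, is_negative));
}

// May return at the first match; reported as not vectorized in the table.
bool any_of_epi8(const int8_t* data, size_t n) {
    return std::any_of(data, data + n, is_negative);
}
```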

Summary

The answer to the initial question is pretty sad: there is no progress.

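GCC learnt how to vectorize replace_if_epi32 for AVX512 targets. At the same time, according to the table, clang lost the ability to vectorize replace_if_epi8 for AVX512, while it now vectorizes replace_epi8 for AVX2. These are the only changes.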

BTW, it's worth noting that MSVC gained support for AVX512 in 2020.

Vectorization of is_sorted

Tomasz Duda showed that the following C++ code is nicely vectorized by clang 9 and newer:

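#include <cstddef>
#include <cstdint>

bool is_sorted3(int32_t* a, size_t n) {
    size_t i = 0;
    if (n > 4) {
        // Main loop: check four adjacent pairs per iteration. The bitwise |
        // (instead of ||) evaluates all four comparisons with no short-circuit.
        for (/**/; i < n - 4; i += 4) {
            if ((a[i] > a[i + 1]) | (a[i + 1] > a[i + 2]) | (a[i + 2] > a[i + 3]) | (a[i + 3] > a[i + 4])) {
                return false;
            }
        }
    }
    // Tail: check the remaining pairs one at a time.
    for (/**/; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }
    return true;
}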

I purposely called it "vectorization", not "autovectorization", as the main loop of the algorithm has to be manually adjusted to let the compiler discover the vectorization opportunity. Personally, I'd reserve the term "autovectorization" for procedures that don't need such hints from a programmer.
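
For contrast, the version one would hope a compiler vectorizes on its own is a plain wrapper over std::is_sorted, which the table reports as not vectorized. The sketch below places such a wrapper next to the hand-unrolled variant and checks that both agree. The name is_sorted_epi32 comes from the table, but its signature, the const qualifiers and the helper agree are assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Assumed shape of the benchmarked procedure: a thin wrapper.
bool is_sorted_epi32(const int32_t* a, size_t n) {
    return std::is_sorted(a, a + n);
}

// The manually unrolled variant, repeated so the sketch is self-contained.
bool is_sorted3(const int32_t* a, size_t n) {
    size_t i = 0;
    if (n > 4) {
        for (; i < n - 4; i += 4) {
            if ((a[i] > a[i + 1]) | (a[i + 1] > a[i + 2]) |
                (a[i + 2] > a[i + 3]) | (a[i + 3] > a[i + 4])) {
                return false;
            }
        }
    }
    for (; i + 1 < n; i++) {
        if (a[i] > a[i + 1])
            return false;
    }
    return true;
}

// Hypothetical helper: true when both implementations give the same answer.
bool agree(const int32_t* a, size_t n) {
    return is_sorted_epi32(a, n) == is_sorted3(a, n);
}
```

Running agree over prefixes of both sorted and unsorted inputs, including lengths below five, exercises the main loop and the scalar tail separately.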

Source code

All implementations are available on GitHub.