Auto-vectorization status in GCC, Clang, ICC and MSVC

Author:Wojciech Muła
Added on:2019-02-02
Updated on:2019-02-04 (results for ICC and MSVC)

Contents

Introduction

The term "auto-vectorization" means the ability of a compiler to transform given scalar algorithm into vectorized one, i.e. express dominating operation(s) using SIMD instruction.

I'm sure nobody would argue that auto-vectorization is as important as scalar optimizations performed by compilers. Now vectorization is a must.

From what I can gather, one of the first commonly used compiler of C/C++ which automatically vectorized code was Intel compiler; by the time, luckily for us, GCC and Clang caught up.

Comparison

In this comparison we consider only two targets:

  1. AVX2,
  2. AVX512 with all possible extensions (AVX512BW, AVX512VL, AVX512VBMI, etc.)

A few basic algorithm available in C++ algorithm library were picked. For each algorithm two specialisations were created: one operating on int8_t values and int32_t values. (Efficient use of float or doubles would require -ffast-math and while it's not always allowed, we sticked to integers.)

Compiler versions

I used compilers installed on amazing godbolt.org. The compilers versions:

  • g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental)
  • clang version 9.0.0 (trunk 352942)
  • icc (ICC) 19.0.1.144 20181018
  • Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27023.1 for x86

Compiler flags

GCC, Clang:

  • AVX2: -O3 -mavx2
  • AVX512: -O3 -mavx512f -mavx512dq -mavx512bw -mavx512vbmi -mavx512vbmi2 -mavx512vl

ICC:

  • AVX2: -O3 -axAVX2 -xAVX2
  • AVX512: -O3 -mavx512f -mavx512dq -mavx512bw -mavx512vbmi -mavx512vbmi2 -mavx512vl

MSVC:

  • AVX2: /O2 /arch:AVX2
  • AVX512: /O2 /arch:AVX512

Results

algorithm procedure GCC Clang ICC MSVC
    AVX2 AVX512 AVX2 AVX512 AVX2 AVX512 AVX2 AVX512
accumulate — default accumulate_custom_epi8 no no no no no no yes no[1]
  accumulate_custom_epi32 yes yes yes yes yes yes yes no[1]
accumulate — custom accumulate_epi8 yes yes yes yes no yes yes no[1]
  accumulate_epi32 yes yes yes yes yes yes yes no[1]
all_of all_of_epi8 no no no no no no no no
  all_of_epi32 no no no no no no no no
any_of any_of_epi8 no no no no no no no no
  any_of_epi32 no no no no no no no no
copy copy_epi8 no no no no no no no no
  copy_epi32 no no no no no no no no
copy_if copy_if_epi8 no no no no no no no no
  copy_if_epi32 no no no no no no no no
count count_epi8 yes yes yes yes yes yes yes no[1]
  count_epi32 yes yes yes yes yes yes yes no[1]
count_if count_if_epi8 yes yes yes yes yes yes no no
  count_if_epi32 yes yes yes yes yes yes no no
fill fill_epi8 no no no no yes yes no no
  fill_epi32 yes yes yes yes yes yes no no
find find_epi8 no no no no no no no[2] no[2]
  find_epi32 no no no no no no no no
find_if find_if_epi8 no no no no no no no no
  find_if_epi32 no no no no no no no no
is_sorted is_sorted_epi8 no no no no no no no no
  is_sorted_epi32 no no no no no no no no
none_of none_of_epi8 no no no no no no no no
  none_of_epi32 no no no no no no no no
remove remove_epi8 no no no no no no no no
  remove_epi32 no no no no no no no no
remove_if remove_if_epi8 no no no no no no no no
  remove_if_epi32 no no no no no no no no
replace replace_epi8 no yes no[3] yes yes yes no no
  replace_epi32 yes yes yes yes yes yes no no
replace_if replace_if_epi8 no yes no yes yes yes no no
  replace_if_epi32 yes no no no yes yes no no
reverse reverse_epi8 yes yes no no no no no[4] no[4]
  reverse_epi32 yes yes no no no no no[5] no[5]
transform — abs transform_abs_epi8 yes yes yes yes yes yes yes no[1]
  transform_abs_epi32 yes yes yes yes yes yes yes no[1]
transform — increment transform_inc_epi8 yes yes yes yes yes yes yes no[1]
  transform_inc_epi32 yes yes yes yes yes yes yes no[1]
transform — negation transform_neg_epi8 yes yes yes yes yes yes no no
  transform_neg_epi32 yes yes yes yes yes yes no no
unique unique_epi8 no no no no no no no no
  unique_epi32 no no no no no no no no
[1](1, 2, 3, 4, 5, 6, 7, 8, 9, 10) emits AVX2 code
[2](1, 2) calls memchr
[3]loads input's chunk into a vector register, but all comparisons and stores are scalar
[4](1, 2) calls ___std_reverse_trivially_swappable_1
[5](1, 2) calls ___std_reverse_trivially_swappable_4

A few notes:

  • In most compilers std::fill for int8_t ends up with memset invocation.

  • MSVC does not emit AVX512 instruction, only AVX2.

  • I was wondering why std::replace for int8_t cannot be vectorized for AVX2 targets. Following code would do the job:

    m = _mm256_cmpeq_epi8(x, _mm256_set1_epi8('/'));
    x = _mm256_blendv_epi8(x, _mm256_set1_epi8('_'), m);
    

    Travis Downs pointed out that such code behaves differently than original replace, as it always overwrites all bytes, regardless comparison results. To make things more interesting, ICC vectorizes std::replace using the above approach, which seems to be a compiler bug.

    However, if we rewrite replace algorithm, into form where each byte is always overwritten, then compilers GCC and clang vectorize it for the AVX2 target:

    for (size_t i=0; i < size; i++)
        bytes[i] = bytes[i] == '/' ? '_' : bytes[i];
    

See also

Source code

All implementations are availabe at github.