Autovectorization status in MSVC in 2021

Author: Wojciech Muła
Added on: 2021-02-17
Updated on: 2021-02-18 (MSVC 19.6 didn't autovectorize accumulate_custom_epi8, my mistake; noticed by Harold Aptroot)

Introduction

This year I re-checked the status of autovectorization in the latest GCC and Clang. MSVC was omitted because I didn't see any new version of this compiler on godbolt. More precisely, I didn't expect any difference between version 19.28 and version 19.16, which I tested two years ago.

Harold Aptroot pointed out that there are some differences in code generated for the AVX2 target. Additionally, in 2020 MSVC started to support AVX512. These two reasons forced me to recheck MSVC too.

Comparison

In this comparison we consider two targets:

  1. AVX2,
  2. AVX512 with all possible extensions (AVX512BW, AVX512VL, AVX512VBMI, etc.)

A few basic algorithms from the C++ standard library were picked.
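
To make the naming used in the results concrete: each tested procedure is a thin wrapper around one standard algorithm, instantiated for 8-bit (epi8) and 32-bit (epi32) elements. The following is only a minimal sketch of what such wrappers might look like. The signatures and the custom operation (a bitwise OR) are assumptions made for illustration, not code copied from the repository.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical wrapper: default accumulate (a plain sum) over 32-bit elements.
int32_t accumulate_epi32(const std::vector<int32_t>& v) {
    return std::accumulate(v.begin(), v.end(), int32_t{0});
}

// Hypothetical wrapper: accumulate with a user-supplied binary operation
// over 8-bit elements. The bitwise OR is an assumption of this sketch.
int8_t accumulate_custom_epi8(const std::vector<int8_t>& v) {
    return std::accumulate(v.begin(), v.end(), int8_t{0},
        [](int8_t acc, int8_t x) { return static_cast<int8_t>(acc | x); });
}
```

The question asked of each compiler is whether the loop hidden inside such a wrapper ends up as SIMD code in the main loop of the generated assembly.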

Compiler versions

  • Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29333 for x86 (version from 2021)
  • Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27023.1 for x86 (version from 2019)

For the sake of completeness, GCC and Clang results are also included. Please refer to the article dedicated to these compilers.

  • Debian GCC 10.2.1 20210108
  • Debian clang version 11.0.1-2

MSVC compiler flags

  • AVX2: /O2 /arch:AVX2
  • AVX512: /O2 /arch:AVX512
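
As a rough illustration of how these flags might be used to inspect a single procedure, the sketch below puts one wrapper in its own file. The file name and the function body are hypothetical. The extra switches in the comment (/c, /FA for an assembly listing, /Qvec-report:2 for the vectorizer's report) are additions for inspection only and are not part of the flags listed above.

```cpp
// Hypothetical file: transform_inc.cpp
//
// Possible MSVC invocations for the two targets:
//
//   cl /c /O2 /arch:AVX2   /FA /Qvec-report:2 transform_inc.cpp
//   cl /c /O2 /arch:AVX512 /FA /Qvec-report:2 transform_inc.cpp
//
// /FA writes transform_inc.asm next to the object file, so the main loop
// can be checked for SIMD instructions by hand.
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: increments every element in place.
void transform_inc_epi32(std::vector<int32_t>& v) {
    std::transform(v.begin(), v.end(), v.begin(),
                   [](int32_t x) { return x + 1; });
}
```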

Results

| algorithm | procedure | MSVC 19.28.29333 AVX2 | MSVC 19.28.29333 AVX512 | MSVC 19.16.27023.1 AVX2 | MSVC 19.16.27023.1 AVX512 | GCC 10 AVX2 | GCC 10 AVX512 | Clang 11 AVX2 | Clang 11 AVX512 |
|---|---|---|---|---|---|---|---|---|---|
| accumulate (custom) | accumulate_custom_epi8 | no | no | no | --- | no | no | no | no |
| | accumulate_custom_epi32 | yes | yes | yes | --- | yes | yes | yes | yes |
| accumulate (default) | accumulate_epi8 | yes | yes | yes | --- | yes | yes | yes | yes |
| | accumulate_epi32 | yes | yes | yes | --- | yes | yes | yes | yes |
| all_of | all_of_epi8 | no | no | no | --- | no | no | no | no |
| | all_of_epi32 | no | no | no | --- | no | no | no | no |
| any_of | any_of_epi8 | no | no | no | --- | no | no | no | no |
| | any_of_epi32 | no | no | no | --- | no | no | no | no |
| copy | copy_epi8 | no | no | no | --- | no | no[1] | no | no |
| | copy_epi32 | no | no | no | --- | no | no[1] | no | no |
| copy_if | copy_if_epi8 | no | no | no | --- | no | no[1] | no | no |
| | copy_if_epi32 | no | no | no | --- | no | no[1] | no | no |
| count | count_epi8 | yes | yes | yes | --- | yes | yes | yes | yes |
| | count_epi32 | yes | yes | yes | --- | yes | yes | yes | yes |
| count_if | count_if_epi8 | no | no | no | --- | yes | yes | yes | yes |
| | count_if_epi32 | no | no | no | --- | yes | yes | yes | yes |
| fill | fill_epi8 | no[2] | no[2] | no | --- | no[2] | no[2] | no[2] | no[2] |
| | fill_epi32 | no[3] | no[3] | no | --- | yes | yes | yes | yes |
| find | find_epi8 | no[4] | no[4] | no[4] | --- | no | no | no | no |
| | find_epi32 | no | no | no | --- | no | no | no | no |
| find_if | find_if_epi8 | no | no | no | --- | no | no | no | no |
| | find_if_epi32 | no | no | no | --- | no | no | no | no |
| is_sorted | is_sorted_epi8 | no | no | no | --- | no | no | no | no |
| | is_sorted_epi32 | no | no | no | --- | no | no | no | no |
| none_of | none_of_epi8 | no | no | no | --- | no | no | no | no |
| | none_of_epi32 | no | no | no | --- | no | no | no | no |
| remove | remove_epi8 | no | no | no | --- | no | no | no | no |
| | remove_epi32 | no | no | no | --- | no | no | no | no |
| remove_if | remove_if_epi8 | no | no | no | --- | no | no | no | no |
| | remove_if_epi32 | no | no | no | --- | no | no | no | no |
| replace | replace_epi8 | no | no | no | --- | no | yes | yes | yes |
| | replace_epi32 | no | no | no | --- | yes | yes | yes | yes |
| replace_if | replace_if_epi8 | no | no | no | --- | no | yes | no | no |
| | replace_if_epi32 | no | no | no | --- | yes | yes | no | no |
| reverse | reverse_epi8 | no[5] | no[5] | no[5] | --- | yes | yes | no | no |
| | reverse_epi32 | no[6] | no[6] | no[6] | --- | yes | yes | no | no |
| transform (abs) | transform_abs_epi8 | yes | yes | yes | --- | yes | yes | yes | yes |
| | transform_abs_epi32 | yes | yes | yes | --- | yes | yes | yes | yes |
| transform (increment) | transform_inc_epi8 | yes | yes | yes | --- | yes | yes | yes | yes |
| | transform_inc_epi32 | yes | yes | yes | --- | yes | yes | yes | yes |
| transform (negation) | transform_neg_epi8 | yes | yes | no | --- | yes | yes | yes | yes |
| | transform_neg_epi32 | yes | yes | no | --- | yes | yes | yes | yes |
| unique | unique_epi8 | no | no | no | --- | no | no | no | no |
| | unique_epi32 | no | no | no | --- | no | no | no | no |

[1] SIMD instructions present, but not in the main loop
[2] calls memset
[3] emits rep stosd
[4] calls memchr
[5] calls ___std_reverse_trivially_swappable_1
[6] calls ___std_reverse_trivially_swappable_4
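
The count and count_if rows show how small a change in the source can flip the outcome: both versions of MSVC vectorized count but not count_if, while GCC and Clang handled both. A minimal sketch of such a pair follows. The searched value 42 and the predicate are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: counts elements equal to a fixed value.
std::ptrdiff_t count_epi8(const std::vector<int8_t>& v) {
    return std::count(v.begin(), v.end(), int8_t{42});
}

// Hypothetical wrapper: counts elements satisfying a predicate.
// The predicate (equal to 42 or to -1) is an assumption of this sketch.
std::ptrdiff_t count_if_epi8(const std::vector<int8_t>& v) {
    return std::count_if(v.begin(), v.end(),
                         [](int8_t x) { return x == 42 || x == -1; });
}
```

Both functions reduce to a loop that compares each element and adds to a counter, so the difference in the table reflects how each compiler copes with the predicate, not a difference in the shape of the loop.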

Summary

  1. First of all, kudos to the MSVC team for bringing AVX512 to the Windows world.
  2. Unfortunately, there's no big progress in autovectorization. The only thing MSVC has learnt since version 19.16 is how to vectorize a transform with the negation operation.
  3. The MSVC optimizer correctly detected the reverse algorithm and inserted calls to the already optimized library functions ___std_reverse_trivially_swappable_{1,4}.
  4. The set of algorithms MSVC can autovectorize is smaller than the set GCC and Clang can handle.
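
For reference, the two cases singled out in points 2 and 3 correspond to procedures along the following lines. This is a sketch under the assumption that the tested wrappers operate in place on a std::vector; the actual signatures may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: negates every element in place. According to the
// results table, MSVC 19.28 vectorizes this loop while 19.16 did not.
void transform_neg_epi32(std::vector<int32_t>& v) {
    std::transform(v.begin(), v.end(), v.begin(),
                   [](int32_t x) { return -x; });
}

// Hypothetical wrapper: reverses the vector in place. MSVC does not emit
// an inline SIMD loop here; it calls a library routine instead
// (see footnotes [5] and [6]).
void reverse_epi32(std::vector<int32_t>& v) {
    std::reverse(v.begin(), v.end());
}
```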

Source code

All implementations are available on GitHub.