Autovectorization status in MSVC in 2021

Author:	Wojciech Muła
Added on:	2021-02-17
Updated on:	2021-02-18 (MSVC 19.6 didn't autovectorize `accumulate_custom_epi8`, my mistake; noticed by Harold Aptroot)

Contents

Introduction
Comparison
Summary
Source code

Introduction

This year I re-checked the status of autovectorization in the latest GCC and Clang. MSVC was omitted because I didn't see any new version of this compiler on godbolt. More precisely, I didn't believe there would be any difference between version 19.28 and version 19.16, which I tested two years ago.

Harold Aptroot pointed out that there are some differences in code generated for the AVX2 target. Additionally, in 2020 MSVC started to support AVX512. These two reasons forced me to recheck MSVC too.

Comparison

In this comparison we consider two targets:

AVX2,
AVX512 with all possible extensions (AVX512BW, AVX512VL, AVX512VBMI, etc.)

A few basic algorithms available in the C++ algorithm library were picked.
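To give a feel for what is being compiled, each procedure in the results table is presumably a thin wrapper around a single standard algorithm, instantiated for 8-bit and 32-bit elements. The sketch below is an assumption modelled on the procedure names in the table, not the code from the repository; the signatures and the use of `std::vector` are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical shape of two tested procedures; the real signatures
// in the author's repository may differ.

// "accumulate - default": plain sum of all elements.
int32_t accumulate_epi32(const std::vector<int32_t>& v) {
    return std::accumulate(v.begin(), v.end(), int32_t{0});
}

// "transform - negation": negate every element in place.
void transform_neg_epi32(std::vector<int32_t>& v) {
    std::transform(v.begin(), v.end(), v.begin(),
                   [](int32_t x) { return -x; });
}
```

A "yes" in the table then means that the main loop of such a function was compiled into SIMD instructions for the given target.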

Compiler versions

Microsoft (R) C/C++ Optimizing Compiler Version 19.28.29333 for x86 (version from 2021)
Microsoft (R) C/C++ Optimizing Compiler Version 19.16.27023.1 for x86 (version from 2019)

For the sake of completeness, GCC and Clang results are also included. Please refer to the article dedicated to these compilers.

Debian GCC 10.2.1 20210108
Debian clang version 11.0.1-2

MSVC compiler flags

AVX2: /O2 /arch:AVX2
AVX512: /O2 /arch:AVX512

Results

algorithm	procedure	MSVC 19.28.29333		MSVC 19.16.27023.1		GCC 10		Clang 11
		AVX2	AVX512	AVX2	AVX512	AVX2	AVX512	AVX2	AVX512
accumulate — custom	accumulate_custom_epi8	no	no	no	---	no	no	no	no
	accumulate_custom_epi32	yes	yes	yes	---	yes	yes	yes	yes
accumulate — default	accumulate_epi8	yes	yes	yes	---	yes	yes	yes	yes
	accumulate_epi32	yes	yes	yes	---	yes	yes	yes	yes
all_of	all_of_epi8	no	no	no	---	no	no	no	no
	all_of_epi32	no	no	no	---	no	no	no	no
any_of	any_of_epi8	no	no	no	---	no	no	no	no
	any_of_epi32	no	no	no	---	no	no	no	no
copy	copy_epi8	no	no	no	---	no	no[1]	no	no
	copy_epi32	no	no	no	---	no	no[1]	no	no
copy_if	copy_if_epi8	no	no	no	---	no	no[1]	no	no
	copy_if_epi32	no	no	no	---	no	no[1]	no	no
count	count_epi8	yes	yes	yes	---	yes	yes	yes	yes
	count_epi32	yes	yes	yes	---	yes	yes	yes	yes
count_if	count_if_epi8	no	no	no	---	yes	yes	yes	yes
	count_if_epi32	no	no	no	---	yes	yes	yes	yes
fill	fill_epi8	no[2]	no[2]	no	---	no[2]	no[2]	no[2]	no[2]
	fill_epi32	no[3]	no[3]	no	---	yes	yes	yes	yes
find	find_epi8	no[4]	no[4]	no[4]	---	no	no	no	no
	find_epi32	no	no	no	---	no	no	no	no
find_if	find_if_epi8	no	no	no	---	no	no	no	no
	find_if_epi32	no	no	no	---	no	no	no	no
is_sorted	is_sorted_epi8	no	no	no	---	no	no	no	no
	is_sorted_epi32	no	no	no	---	no	no	no	no
none_of	none_of_epi8	no	no	no	---	no	no	no	no
	none_of_epi32	no	no	no	---	no	no	no	no
remove	remove_epi8	no	no	no	---	no	no	no	no
	remove_epi32	no	no	no	---	no	no	no	no
remove_if	remove_if_epi8	no	no	no	---	no	no	no	no
	remove_if_epi32	no	no	no	---	no	no	no	no
replace	replace_epi8	no	no	no	---	no	yes	yes	yes
	replace_epi32	no	no	no	---	yes	yes	yes	yes
replace_if	replace_if_epi8	no	no	no	---	no	yes	no	no
	replace_if_epi32	no	no	no	---	yes	yes	no	no
reverse	reverse_epi8	no[5]	no[5]	no[5]	---	yes	yes	no	no
	reverse_epi32	no[6]	no[6]	no[6]	---	yes	yes	no	no
transform — abs	transform_abs_epi8	yes	yes	yes	---	yes	yes	yes	yes
	transform_abs_epi32	yes	yes	yes	---	yes	yes	yes	yes
transform — increment	transform_inc_epi8	yes	yes	yes	---	yes	yes	yes	yes
	transform_inc_epi32	yes	yes	yes	---	yes	yes	yes	yes
transform — negation	transform_neg_epi8	yes	yes	no	---	yes	yes	yes	yes
	transform_neg_epi32	yes	yes	no	---	yes	yes	yes	yes
unique	unique_epi8	no	no	no	---	no	no	no	no
	unique_epi32	no	no	no	---	no	no	no	no

[1]	SIMD instructions present, but not in the main loop

[2]	calls `memset`

[3]	emits `rep stosd`

[4]	calls `memchr`

[5]	calls `___std_reverse_trivially_swappable_1`

[6]	calls `___std_reverse_trivially_swappable_4`
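Footnotes [2] and [4] describe cases where the byte-sized algorithm was replaced by a call to a C runtime routine rather than compiled into an inline SIMD loop. The sketch below places an assumed form of the two procedures next to hand-written equivalents built on `memset` and `memchr`. The `_libc` functions and all signatures are hypothetical and serve only to show what such a substitution amounts to.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed form of the tested procedure: std::fill over bytes.
void fill_epi8(std::vector<int8_t>& v, int8_t value) {
    std::fill(v.begin(), v.end(), value);
}

// Hand-written equivalent: what a call to memset boils down to.
void fill_epi8_libc(std::vector<int8_t>& v, int8_t value) {
    if (!v.empty())
        std::memset(v.data(), static_cast<unsigned char>(value), v.size());
}

// Assumed form of the tested procedure: index of the first match,
// or v.size() if the value is absent.
std::size_t find_epi8(const std::vector<int8_t>& v, int8_t value) {
    return static_cast<std::size_t>(
        std::find(v.begin(), v.end(), value) - v.begin());
}

// Hand-written equivalent based on memchr, which compares bytes
// as unsigned char.
std::size_t find_epi8_libc(const std::vector<int8_t>& v, int8_t value) {
    if (v.empty())
        return 0;
    const void* p = std::memchr(v.data(),
                                static_cast<unsigned char>(value),
                                v.size());
    if (p == nullptr)
        return v.size();
    return static_cast<std::size_t>(
        static_cast<const int8_t*>(p) - v.data());
}
```

Because runtime routines like these are typically optimized by hand, a "no" with one of these footnotes does not necessarily mean a slow scalar loop; it only means the compiler did not vectorize the loop itself.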

Summary

First of all, kudos to the MSVC team for bringing AVX512 to the Windows world.
Unfortunately, there has been no big progress in autovectorization. The only new thing MSVC has learnt is how to vectorize a transform with the negation operation.
The MSVC optimizer correctly detected the reverse algorithm and inserted calls to the already optimized library functions ___std_reverse_trivially_swappable_{1,4}.
The set of algorithms MSVC can autovectorize is smaller than the set GCC and Clang can handle.
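The gap shows up in the count and count_if rows of the results table: all four compilers vectorize `std::count`, while in these results only GCC and Clang vectorize `std::count_if`. A minimal sketch of such a pair follows; the signatures and the predicate are assumptions made for illustration and may not match the repository.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counting occurrences of a fixed value: reported as vectorized
// by every compiler in the table.
std::size_t count_epi32(const std::vector<int32_t>& v, int32_t value) {
    return static_cast<std::size_t>(std::count(v.begin(), v.end(), value));
}

// Counting with a predicate (hypothetical one: "is positive"):
// reported as vectorized by GCC and Clang only.
std::size_t count_if_epi32(const std::vector<int32_t>& v) {
    return static_cast<std::size_t>(
        std::count_if(v.begin(), v.end(),
                      [](int32_t x) { return x > 0; }));
}
```

With MSVC, one way to investigate such cases is to add `/Qvec-report:2` to the flags listed earlier, which asks the compiler to report loops it vectorized and reason codes for loops it did not.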

Source code

All implementations are available on GitHub.