Dividing unsigned 16-bit numbers

Author:	Wojciech Muła
Added on:	2025-01-03

Contents

Introduction
Experiment results
- Ryzen 7
- AlderLake
- Icelake
- Skylake-X
Source code

Introduction

This is a follow-up text for Dividing unsigned 8-bit numbers. We checked if dividing 16-bit unsigned numbers is also feasible for SIMD instructions.

Apart from obvious path, where we use floating-point division (DIVPS), 8-bit numbers could also utilize the approximate reciprocal instruction RCPPS.

Unfortunately, for 16-bit numbers the latter instruction cannot be used directly. To properly divide 16-bit integers we need to perform a single step of the Newton-Raphson algorithm, which kills performance.

Experiment results

Summary:

Suprisingly, AVX-512 implementations are not faster than AVX2 ones.
Unrolling doesn't help.

All benchmark programs were compiled with -O3 -march=native options on each machine separately.

Tested procedures
Procedure	Comments
`scalar`	plain 16-bit division
`AVX2`	division followed by integer casting with truncation (CVTTPS2DQ)
`AVX2 (x2)`	the base AVX2 procedure unrolled twice
`AVX2 (x4)`	the base AVX2 procedure unrolled fourfold
`AVX2 (rcp)`	multiplication by approximate reciprocal followed by a single step of the Newton-Raphson algorithm
`AVX512`	division followed by integer casting with truncation (CVTTPS2DQ)
`AVX512 (x2)`	the base AVX-512 procedure unrolled twice
`AVX512 (x4)`	the base AVX-512 procedure unrolled fourfold

Ryzen 7

Compiler: gcc (Debian 14.1.0-5) 14.1.0
CPU: AMD Ryzen 7 7730U with Radeon Graphics

procedure	time in cycles per byte		speed-up
	average	best
scalar	1.783	1.759	1.0	`████▏`
AVX2	0.208	0.182	9.7	`████████████████████████████████████████`
AVX2 (x2)	0.201	0.185	9.5	`███████████████████████████████████████▎`
AVX2 (x4)	0.215	0.188	9.4	`██████████████████████████████████████▋`
AVX2 (rcp)	0.323	0.302	5.8	`████████████████████████`

AlderLake

Compiler: gcc (Debian 13.2.0-25) 13.2.0
CPU: 12th Gen Intel(R) Core(TM) i7-1255U

procedure	time in cycles per byte		speed-up
	average	best
scalar	3.714	3.368	1.0	`████▍`
AVX2	0.507	0.389	8.7	`██████████████████████████████████████▋`
AVX2 (x2)	0.477	0.384	8.8	`███████████████████████████████████████▏`
AVX2 (x4)	0.518	0.376	9.0	`████████████████████████████████████████`
AVX2 (rcp)	0.531	0.440	7.7	`██████████████████████████████████▏`

Icelake

Compiler: gcc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2)
CPU: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz

procedure	time in cycles per byte		speed-up
	average	best
scalar	6.032	6.014	1.0	`██▋`
AVX2	0.630	0.628	9.6	`█████████████████████████▌`
AVX2 (x2)	0.628	0.626	9.6	`█████████████████████████▌`
AVX2 (x4)	0.404	0.401	15.0	`████████████████████████████████████████`
AVX2 (rcp)	0.501	0.491	12.2	`████████████████████████████████▋`
AVX512	0.431	0.428	14.1	`█████████████████████████████████████▍`
AVX512 (x4)	0.407	0.405	14.8	`███████████████████████████████████████▌`
AVX512 (x2)	0.415	0.413	14.6	`██████████████████████████████████████▊`

Skylake-X

CPU: Intel(R) Xeon(R) W-2104 CPU @ 3.20GHz
Compiler: gcc (GCC) 11.2.0

procedure	time in cycles per byte		speed-up
	average	best
scalar	8.116	8.092	1.0	`█████▊`
AVX2	1.235	1.212	6.7	`██████████████████████████████████████▍`
AVX2 (x2)	1.236	1.219	6.6	`██████████████████████████████████████▏`
AVX2 (x4)	1.234	1.210	6.7	`██████████████████████████████████████▍`
AVX2 (rcp)	1.394	1.368	5.9	`██████████████████████████████████`
AVX512	1.197	1.183	6.8	`███████████████████████████████████████▎`
AVX512 (x4)	1.196	1.181	6.9	`███████████████████████████████████████▍`
AVX512 (x2)	1.185	1.164	7.0	`████████████████████████████████████████`

Source code

Sample implementation is available at GitHub.