Penalties of errors in SSE floating point calculations

Author:	Wojciech Muła
Added on:	2014-01-26
Updated on:	2018-11-26 (measurements from Skylake and SkylakeX)

Contents

Introduction
Enabling exceptions
Flush to zero
Denormalized numbers
Source code

Introduction

SSE provides a little-known control register called MXCSR. This register plays three roles:

It controls calculations:
1. the "flush to zero" flag,
2. the "denormals are zeros" flag,
3. the rounding mode (not covered in this text).
It allows floating-point exceptions to be masked or unmasked.
It saves information about floating-point errors. These flags are sticky, i.e. the programmer is responsible for clearing them.
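The three roles map onto separate bit fields of the register. The sketch below reads MXCSR with the `_mm_getcsr` intrinsic and prints each field. The `MXCSR_*` constants and the `print_mxcsr` helper are hypothetical names introduced here for illustration; the bit positions follow the layout documented in the Intel manuals.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* _mm_getcsr */

/* Hypothetical names for the MXCSR bit fields. */
#define MXCSR_FLAGS_MASK 0x003fu /* bits 0-5:   sticky error flags     */
#define MXCSR_DAZ        0x0040u /* bit 6:      denormals are zeros    */
#define MXCSR_MASKS_MASK 0x1f80u /* bits 7-12:  exception mask bits    */
#define MXCSR_ROUND_MASK 0x6000u /* bits 13-14: rounding mode          */
#define MXCSR_FTZ        0x8000u /* bit 15:     flush to zero          */

/* Print every field of the current MXCSR value. */
void print_mxcsr(void) {
    const unsigned int csr = _mm_getcsr();

    printf("MXCSR               = 0x%04x\n", csr);
    printf("flush to zero       = %d\n", (csr & MXCSR_FTZ) != 0);
    printf("denormals are zeros = %d\n", (csr & MXCSR_DAZ) != 0);
    printf("rounding mode       = %u\n", (csr & MXCSR_ROUND_MASK) >> 13);
    printf("exception masks     = 0x%02x\n", (csr & MXCSR_MASKS_MASK) >> 7);
    printf("error flags         = 0x%02x\n", csr & MXCSR_FLAGS_MASK);
}
```

In a freshly started process all six mask bits are normally set and both control flags are clear.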

Possible errors in SSE floating point calculations are:

division by zero,
underflow,
overflow,
operations on denormalized values,
invalid operations, like the square root of a negative number or division of zero by zero.

Enabling exceptions

By default all floating-point exceptions in SSE are masked, i.e. errors are not converted into hardware exceptions. When an exception is unmasked and the corresponding error occurs, the standard SIGFPE signal is raised.
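Unmasking can be done from C without inline assembly. This is a minimal sketch, assuming a compiler that ships `<xmmintrin.h>`; the helper name `unmask_sse_exceptions` is hypothetical, while `_MM_GET_EXCEPTION_MASK`, `_MM_SET_EXCEPTION_MASK` and the `_MM_MASK_*` constants come from that header.

```c
#include <xmmintrin.h>

/* Hypothetical helper: unmask the exceptions selected by `which`
   (a combination of _MM_MASK_* constants) and return the previous
   mask, so that the caller can restore it later. */
unsigned int unmask_sse_exceptions(unsigned int which) {
    const unsigned int old_mask = _MM_GET_EXCEPTION_MASK();

    /* A cleared mask bit means the exception is unmasked. */
    _MM_SET_EXCEPTION_MASK(old_mask & ~which);
    return old_mask;
}
```

After `unmask_sse_exceptions(_MM_MASK_DIV_ZERO)` an SSE division by zero raises SIGFPE instead of quietly returning infinity; passing the returned value to `_MM_SET_EXCEPTION_MASK` restores the previous behaviour.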

Important: even when errors are masked, an erroneous situation slows calculations down significantly. So if a program slows down for no apparent reason, the cause may be an error in SSE-related code, for example loading "random" values into XMM registers.

Error flags in the MXCSR are always updated, regardless of which exceptions are reported.
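The flags can therefore be used to detect errors while all exceptions stay masked. A minimal sketch, assuming the intrinsics from `<xmmintrin.h>`; the function name is hypothetical and the `volatile` variables are there only to stop the compiler from evaluating the division at compile time.

```c
#include <xmmintrin.h>

/* Hypothetical helper: divide 1.0 by 0.0 using a single SSE instruction
   and report whether the sticky "division by zero" flag got set. */
int division_by_zero_sets_flag(void) {
    volatile float one = 1.0f, zero = 0.0f;
    volatile float sink;

    _MM_SET_EXCEPTION_STATE(0);  /* clear all sticky error flags */

    sink = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(one), _mm_set_ss(zero)));
    (void)sink;

    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DIV_ZERO) != 0;
}
```

The flag stays set after the function returns, until something clears it with `_MM_SET_EXCEPTION_STATE(0)`.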

Flush to zero

The "flush to zero" flag forces a result of 0 on underflow or denormal errors and, more importantly, these errors then have no impact on calculation speed.
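The flag can be switched with the `_MM_SET_FLUSH_ZERO_MODE` macro from `<xmmintrin.h>`. The sketch below is an illustration only; the function name `tiny_product` is hypothetical. It multiplies FLT_MIN by 0.5, a product too small for a normalized float, with the flag on or off.

```c
#include <float.h>
#include <xmmintrin.h>

/* Hypothetical helper: compute FLT_MIN * 0.5 with the "flush to zero"
   flag set or cleared, then restore the previous MXCSR value. */
float tiny_product(int flush_to_zero) {
    const unsigned int saved = _mm_getcsr();
    volatile float tiny = FLT_MIN, half = 0.5f;  /* volatile: no constant folding */
    volatile float result;

    _MM_SET_FLUSH_ZERO_MODE(flush_to_zero ? _MM_FLUSH_ZERO_ON
                                          : _MM_FLUSH_ZERO_OFF);
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(half)));
    _mm_setcsr(saved);

    return result;
}
```

With the flag cleared the function returns a tiny non-zero value; with the flag set it returns 0.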

For example, underflow occurs in the sample loop, because it multiplies FLT_MIN by FLT_MIN (FLT_MIN = 2^{−126}); the product, 2^{−252}, is far too small to be represented in single precision.

float min_floats[4] = packed_float(FLT_MIN);

void mulps_in_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
                "movups  min_floats, %%xmm0\n"
                "1:\n"
                "movaps  %%xmm0, %%xmm1\n"
                "mulps   %%xmm1, %%xmm1\n"
                "loop   1b\n"
                : "=c" (dummy)
                : "c"  (iterations)
                : "xmm0", "xmm1" // registers modified by the block
        );
}

Completion time of the above loop [in seconds]
architecture	default settings	flush to zero	speed-up
Core2	0.796	0.023	30 x
Skylake	0.019	0.019	---
SkylakeX	0.028	0.022	1.25 x

Denormalized numbers

A denormalized floating-point number is a very small number of the form (0 + fraction)⋅2^{−126}. Such a value appears, for example, when we divide FLT_MIN by 2.
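In the binary encoding of a single-precision number this means an exponent field of zero and a non-zero fraction field. The check below is a minimal sketch assuming the usual IEEE 754 32-bit layout of `float`; the name `is_denormal` is hypothetical.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if x is a denormalized number, 0 otherwise.
   Assumes float is an IEEE 754 single-precision value. */
int is_denormal(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* reinterpret without aliasing issues */

    const uint32_t exponent = (bits >> 23) & 0xff;  /* 8-bit exponent field  */
    const uint32_t fraction = bits & 0x7fffff;      /* 23-bit fraction field */

    return exponent == 0 && fraction != 0;
}
```

For instance `is_denormal(FLT_MIN / 2.0f)` is true, while `is_denormal(FLT_MIN)` and `is_denormal(0.0f)` are false.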

There is a little problem: if the result of an operation on normalized numbers is a denormalized value, it's not an SSE error. An error is reported only when one of the operands is already denormalized.

So, where is the problem? If a result is denormalized, speed is noticeably degraded, but we can't detect the point where the denormalization occurred. It can be detected only when the denormalized value is used in a subsequent calculation.
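The delayed detection can be observed through the sticky denormal flag. This is a sketch assuming the intrinsics from `<xmmintrin.h>` and default MXCSR settings; the function name is hypothetical. It records the flag after producing a denormalized value and again after consuming it.

```c
#include <float.h>
#include <xmmintrin.h>

/* Hypothetical helper: compute FLT_MIN * 0.5 and then add 0.0 to the product,
   storing the state of the sticky denormal flag after each step. */
void denormal_flag_after_each_step(int *after_mul, int *after_add) {
    volatile float tiny = FLT_MIN, half = 0.5f, zero = 0.0f;
    volatile float sink;

    _MM_SET_EXCEPTION_STATE(0);  /* clear all sticky error flags */

    /* normalized * normalized => denormalized result, no error reported */
    __m128 product = _mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(half));
    sink = _mm_cvtss_f32(product);
    *after_mul = (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DENORM) != 0;

    /* denormalized + 0.0 => an operand is denormalized, error reported */
    __m128 sum = _mm_add_ss(product, _mm_set_ss(zero));
    sink = _mm_cvtss_f32(sum);
    *after_add = (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_DENORM) != 0;

    (void)sink;
}
```

The flag is still clear after the multiplication and becomes set only after the addition.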

MXCSR has the "denormals are zeros" flag, which makes every denormalized operand be treated as 0, but does not prevent an operation on normalized values from producing a denormalized result.

Let's summarize this with the following program:

first FLT_MIN is multiplied by 0.5, resulting in a denormalized value;
then this value is added to 0.
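Both halves of that statement can be checked by looking at raw bit patterns. A minimal sketch, assuming a CPU that supports the flag; `MXCSR_DAZ`, `float_bits` and `daz_demo` are hypothetical names, and the bit value 0x0040 matches `_MM_DENORMALS_ZERO_ON` from `<pmmintrin.h>`.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

#define MXCSR_DAZ 0x0040u  /* hypothetical name for bit 6 of MXCSR */

/* Return the raw bit pattern of a float. */
static uint32_t float_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Hypothetical helper: with "denormals are zeros" set, compute
   FLT_MIN * 0.5 and then product + 0.0; return both bit patterns. */
void daz_demo(uint32_t *product_bits, uint32_t *sum_bits) {
    const unsigned int saved = _mm_getcsr();
    volatile float tiny = FLT_MIN, half = 0.5f, zero = 0.0f;
    volatile float product, sum;

    _mm_setcsr(saved | MXCSR_DAZ);
    product = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(half)));
    sum     = _mm_cvtss_f32(_mm_add_ss(_mm_set_ss(product), _mm_set_ss(zero)));
    _mm_setcsr(saved);  /* restore previous settings */

    *product_bits = float_bits(product);
    *sum_bits     = float_bits(sum);
}
```

The product still has a zero exponent field and a non-zero fraction, i.e. it is denormalized, while the sum is exactly 0 because its denormalized operand was treated as zero.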

float tiny_value[4]    = packed_float(FLT_MIN);
float large_divisor[4] = packed_float(0.5);
float final_value[4];

void test_loop() {
        const int32_t iterations = 10000000;
        uint32_t dummy;

        __asm__ __volatile__(
                "1:\n"
                "movups tiny_value,    %%xmm0\n"
                "movups large_divisor, %%xmm1\n"
                "pxor   %%xmm2, %%xmm2\n"
                "mulps  %%xmm1, %%xmm0\n" // FLT_MIN * 0.5 => denormalized number
                "addps  %%xmm2, %%xmm0\n" // denormalized + 0.0 => denormal exception
                "loop   1b\n"
                "movups %%xmm0, final_value\n"
                : "=c" (dummy)
                : "c" (iterations)
                : "xmm0", "xmm1", "xmm2", "memory" // registers and memory modified by the block
        );
}

With default settings the final value is denormalized (5.877472e-39).
With the "denormals are zeros" flag set the result is zero.
With the "flush to zero" flag set the final value is also zero.

Completion time in seconds (speed-up relative to default settings)
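These three outcomes can be reproduced without inline assembly. The sketch below is not the test program used for the measurements, only a single-iteration illustration written with intrinsics; the `MODE_*` constants and the function name are hypothetical.

```c
#include <float.h>
#include <xmmintrin.h>

/* Hypothetical names for the MXCSR control bits used here. */
#define MODE_DEFAULT 0x0000u
#define MODE_DAZ     0x0040u  /* denormals are zeros */
#define MODE_FTZ     0x8000u  /* flush to zero       */

/* Compute (FLT_MIN * 0.5) + 0.0 with the given control bits set. */
float final_value_in_mode(unsigned int mode_bits) {
    const unsigned int saved = _mm_getcsr();
    volatile float tiny = FLT_MIN, half = 0.5f, zero = 0.0f;
    volatile float result;

    _mm_setcsr((saved & ~(MODE_DAZ | MODE_FTZ)) | mode_bits);
    __m128 x = _mm_mul_ss(_mm_set_ss(tiny), _mm_set_ss(half));
    x = _mm_add_ss(x, _mm_set_ss(zero));
    result = _mm_cvtss_f32(x);
    _mm_setcsr(saved);  /* restore previous settings */

    return result;
}
```

Calling it with `MODE_DEFAULT` yields FLT_MIN / 2, while `MODE_DAZ` and `MODE_FTZ` both yield 0.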
architecture	default settings	denormals are zeros	flush to zero
Core2	1.841	0.858 (2x)	0.121 (15x)
Skylake	0.440	0.400	0.033 (13x)
SkylakeX	0.524	0.526	0.039 (13x)

Source code

The test programs are available.