Revision as of 20:34, 3 June 2013

This page is about which scalar floating-point arithmetic instructions fail to execute in constant time (independently of their operands) on various CPUs. We also study some multiple-instructions constructs such as math functions.

Terminology

x86 architecture

Instruction sets

In the x86 architecture, there are two competing floating-point arithmetic instruction sets: the legacy x87 instruction set, and SSE.

While SSE offers SIMD instructions, it also provides a full replacement for old x87 scalar (MIMD) instructions. We are only concerned with scalar instructions in this document.

Here we use SSE to generically refer to all versions of SSE instruction sets. The original SSE instruction set (hereafter SSE1) was introduced in the Pentium III and only handles floats. The subsequent SSE2 instruction set, introduced in the Pentium 4, added support for doubles. We are not concerned with SSE3 and newer instruction sets in this document.

Handling of denormals

x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals:

FTZ (Flush To Zero) causes any denormal result to be flushed to zero.
DAZ (Denormals Are Zero) causes any denormal operand to be handled as if it were zero.

Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ. For example, the Pentium M has SSE2 but does not have DAZ. It is reported that some early Pentium 4 are in the same case.

x86-64 architecture

The only way in which x86-64 differs from x86-32 in these matters, is that SSE2 is unconditionally available on x86-64. Hereafter, we let x86 refer generically to x86-32 or x86-64.

ARM architecture

Instruction sets

ARM has one hardware, scalar, floating-point instruction set: VFP.

NEON is purely a SIMD instruction set, so it doesn't concern us here.

Many ARM CPUs do not have NEON, and some ARM CPUs do not even have VFP. On these, software emulation of floating-point arithmetic is used. For our purposes, this can be considered another separate instruction set.

Handling of denormals

ARM has a single flag, Flush To Zero (FZ) which plays the role of both of x86's flags, FTZ and DAZ.

Getting experimental data

Methodology

The code performing the measurements is on github.

For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over all sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is not a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation.

The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. TODO: experimentally verify that the timings fit a normal law.

The reason to compare these normal laws to the overall median, not the overall mean, is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, all actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.

Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend over one cache line boundary).

Running the code

You can either build on github the source code or use prebuilt binaries. Note that only Linux is supported at the moment. These are statically-linked binaries, so they should run on any Linux distribution.

Recorded results

Test results so far are recorded there.

Analysis

Here are the conclusions that we can draw from our experimental results so far.

The boring parts

Let's start by listing a few things that "everybody knows" for which we got an experimental confirmation.

x87 instructions are highly non-constant-time

x86 code using x87 floating-point instructions exhibits highly non-constant time behavior, especially but not only with denormals and NaNs, and there doesn't appear to be anything to do about it.

Here is a sample run.

ARM software floating-point emulation is highly non-constant-time

ARM code relying on software floating-point emulation should not expect any kind of constant-time characteristics, even on regular normal finite values.

Here is a sample run.

Math functions are highly non-constant-time

Note that for math functions, we just use std:: math functions and rely on the compiler to do the right thing. It is often the case that there is no instruction for a given math function and the compiler or standard library has to implement these manually, using basic arithmetic operations. That is at least the case with SSE.

Our results show that on all architectures and instruction sets we tried, math functions are highly non-constant-time.

Divisions are highly non-constant-time, even with integer types

As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.

Integer addition, subtraction and multiplication are universally constant-time

But again, note that integer division isn't, and nor are any floating-point operations.

The hopefully interesting parts

Let's now get to the findings that we made, that contradict what we thought we knew.

Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions

Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.

An example is a Pentium-4-era Celeron exhibiting NaN-related slow behavior with SSE2 instructions, specifically on floating-point multiplications where one of the operands is NaN. See the alert about float multiplication and about double multiplication in this log.

Some recent AMD CPUs have abnormally fast comparisons with NaN values, even with SSE2 instructions

This is another way in why the notion that NaN issues are all solved by switching to SSE2, is not quite true.

A 8-core AMD FX processor exhibits abnormally fast single-precision floating-point equality comparisons (ucomiss / ucomisd instructions) when one of the operands is NaN. See the alerts about float equality comparison in this log.

The SSE instructions being used there are ucomiss and ucomisd.

This is particularly worrying because:

This shows that we have to watch out not just for abnormally slow but also for abnormally fast operations.
This show that Intel and AMD are not making completely parallel progress on these issues.
Equality comparison is what we would use if we wanted to manually avoid NaN values. So having this not run in constant time means that the idea of manually avoiding specific "bad" values may not be feasible.

A significant mass of SSE2-capable CPUs do not have DAZ

Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.

This turns out not to be the case even for Intel CPUs: the Pentium M supports SSE2 but evidently doesn't support DAZ.

It is also reported that some early Pentium 4 CPUs are in the same case.

TODO: investigate more CPUs from the early 2000's, in particular some AMD CPUs like the Sempron, Athlon XP and Athlon 64 series.

@@ Line 55: / Line 55: @@
 == Running the code ==
-You can either build [https://github.com/bjacob/time-math on github the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries].
+You can either build [https://github.com/bjacob/time-math on github the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment. These are statically-linked binaries, so they should run on any Linux distribution.
 == Recorded results ==

User:Bjacob/ArithmeticTimingDifferences: Difference between revisions

Revision as of 20:34, 3 June 2013

Contents

Terminology

x86 architecture

Instruction sets

Handling of denormals

x86-64 architecture

ARM architecture

Instruction sets

Handling of denormals

Getting experimental data

Methodology

Running the code

Recorded results

Analysis

The boring parts

x87 instructions are highly non-constant-time

ARM software floating-point emulation is highly non-constant-time

Math functions are highly non-constant-time

Divisions are highly non-constant-time, even with integer types

Integer addition, subtraction and multiplication are universally constant-time

The hopefully interesting parts

Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions

Some recent AMD CPUs have abnormally fast comparisons with NaN values, even with SSE2 instructions

A significant mass of SSE2-capable CPUs do not have DAZ

Navigation menu

User:Bjacob/ArithmeticTimingDifferences: Difference between revisions

Revision as of 20:34, 3 June 2013

Terminology

x86 architecture

Instruction sets

Handling of denormals

x86-64 architecture

ARM architecture

Instruction sets

Handling of denormals

Getting experimental data

Methodology

Running the code

Recorded results

Analysis

The boring parts

x87 instructions are highly non-constant-time

ARM software floating-point emulation is highly non-constant-time

Math functions are highly non-constant-time

Divisions are highly non-constant-time, even with integer types

Integer addition, subtraction and multiplication are universally constant-time

The hopefully interesting parts

Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions

Some recent AMD CPUs have abnormally fast comparisons with NaN values, even with SSE2 instructions

A significant mass of SSE2-capable CPUs do not have DAZ

Navigation menu

Search