User:Bjacob/ArithmeticTimingDifferences
This page is about which scalar floating-point arithmetic instructions fail to execute in constant time (independently of their operands) on various CPUs. We also study some multiple-instructions constructs such as math functions.
Terminology
x86 architecture
Instruction sets
In the x86 architecture, there are two competing floating-point arithmetic instruction sets: the legacy x87 instruction set, and SSE.
While SSE offers SIMD instructions, it also provides a full replacement for old x87 scalar (MIMD) instructions. We are only concerned with scalar instructions in this document.
Here we use SSE to generically refer to all versions of SSE instruction sets. The original SSE instruction set (hereafter SSE1) was introduced in the Pentium III and only handles floats. The subsequent SSE2 instruction set, introduced in the Pentium 4, added support for doubles. We are not concerned with SSE3 and newer instruction sets in this document.
Handling of denormals
x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals:
- FTZ (Flush To Zero) causes any denormal result to be flushed to zero.
- DAZ (Denormals Are Zero) causes any denormal operand to be handled as if it were zero.
Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ. For example, the Pentium M has SSE2 but does not have DAZ. It is reported that some early Pentium 4 are in the same case.
x86-64 architecture
The only way in which x86-64 differs from x86-32 in these matters, is that SSE2 is unconditionally available on x86-64. Hereafter, we let x86 refer generically to x86-32 or x86-64.
ARM architecture
Instruction sets
ARM has one hardware, scalar, floating-point instruction set: VFP.
NEON is purely a SIMD instruction set, so it doesn't concern us here.
Many ARM CPUs do not have NEON, and some ARM CPUs do not even have VFP. On these, software emulation of floating-point arithmetic is used. For our purposes, this can be considered another separate instruction set.
Handling of denormals
ARM has a single flag, Flush To Zero (FZ) which plays the role of both of x86's flags, FTZ and DAZ.
Getting experimental data
Methodology
The code performing the measurements is on github.
For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over all sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is not a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation.
The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. We verified on a few results that the timings' density plot looked plausibly similar to a Gaussian curve.
The reason to compare these normal laws to the overall median, not the overall mean, is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, all actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.
Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend over one cache line boundary).
Caveats
Equality comparisons
One of the types of operation that we time, "equality comparison", requires some details to be explained. For a given scalar type, by "equality comparison" we mean this:
scalar equality_comparison(scalar x, scalar y) { return scalar(x == y); }
With integer types, this gets implemented with conditional moves and thus consistently runs in constant time. With floating-point types, conditional moves are not enough, for lack of floating-point variants. The compiler could still decide to compute (x==y) as a boolean with conditional move and then cast it to the floating-point type, but at least GCC 4.6 doesn't do that on x86. Instead, GCC 4.6 decided to use conditional jumps on x86. This means that the results of the floating-point "equality comparison" operation are always reported as non-constant-time as there is at least a timing difference between the (x==y) and (x!=y) cases. These reports are ignored in the analysis below.
Below, when we report significant timing differences in equality comparisons, we implicitly mean besides the above-discussed issue. For example, if we find that there are specific timing differences when one of the operands is NaN, that can't be explained by the above issue, and is thus an inherent non-constant-time characteristic of the floating-point equality comparison instruction that was used.
Running the code
You can either build on github the source code or use prebuilt binaries. Note that only Linux is supported at the moment.
The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they don't have any dependency, not even on the standard library.
On x86-32 machines, note that it is only worth trying out SSE2-capable CPUs. Most of the x86-32 binaries have 'sse2' in their filename, indicating that they require support for SSE2.
On x86-64 machines, it is very much worth running both the x86-64 and the x86-32 binaries.
On ARM machines, note that the binaries require ARMv7.
Recorded results
Test results so far are recorded there.
Analysis
Here are the conclusions that we can draw from our experimental results so far.
The boring parts
Let's start by listing a few things that "everybody knows" for which we got an experimental confirmation.
x87 instructions are highly non-constant-time
x86 code using x87 floating-point instructions exhibits highly non-constant time behavior, especially but not only with denormals and NaNs, and there doesn't appear to be anything to do about it.
Here is a sample run.
ARM software floating-point emulation is highly non-constant-time
ARM code relying on software floating-point emulation should not expect any kind of constant-time characteristics, even on regular normal finite values.
Here is a sample run.
Math functions are highly non-constant-time
Note that for math functions, we just use std:: math functions and rely on the compiler to do the right thing. It is often the case that there is no instruction for a given math function and the compiler or standard library has to implement these manually, using basic arithmetic operations. That is at least the case with SSE.
Our results show that on all architectures and instruction sets we tried, math functions are highly non-constant-time.
Divisions are highly non-constant-time, even with integer types
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.
Integer addition, subtraction and multiplication are universally constant-time
But again, note that integer division isn't, and nor are any floating-point operations.
The hopefully interesting parts
Let's now get to the findings that we made, that contradict what we thought we knew.
Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions
Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.
An example is a Pentium-4-era Celeron exhibiting NaN-related slow behavior with SSE2 instructions, specifically on floating-point multiplications where one of the operands is NaN. See the alert about float multiplication and about double multiplication in this log.
Some Intel CPUs have abnormally slow comparisons with NaN and denormals, even with SSE2, DAZ and FTZ
This is another way in which the notion that NaN issues are all solved by switching to SSE2, is not quite true.
The SSE instructions being used there are ucomiss and ucomisd.
A Intel Core 2 exhibits abnormally slow floating-point equality comparisons, both with floats and doubles, whenever NaN or denormals are involved --- even with SSE2, DAZ and FTZ. See the alerts about float and double equality comparison in this log.
Timing differences in equality comparisons are worrisome because equality comparison is what we would naturally use if we wanted to manually avoid NaN values. So having this not run in constant time means that the idea of manually avoiding specific "bad" values may be difficult.
Some AMD CPUs have abnormally fast comparisons with NaN, even with SSE2 instructions
This one is also about equality comparisons, but in a different way. A 8-core AMD FX processor exhibits abnormally fast single-precision floating-point equality comparisons (ucomiss / ucomisd instructions) when one of the operands is NaN. See the alerts about float equality comparison in this log.
This is its own kind of worrying because:
- This shows that we have to watch out not just for abnormally slow but also for abnormally fast operations.
- This show that Intel and AMD are not making completely parallel progress on these issues.
A significant mass of SSE2-capable CPUs do not have DAZ
Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.
This turns out not to be the case even for Intel CPUs: the Pentium M supports SSE2 but evidently doesn't support DAZ.
It is also reported that some early Pentium 4 CPUs are in the same case.
TODO: investigate more CPUs from the early 2000's, in particular some AMD CPUs like the Sempron, Athlon XP and Athlon 64 series.