User:Bjacob/ArithmeticTimingDifferences: Difference between revisions

 
(17 intermediate revisions by the same user not shown)
Line 15: Line 15:
=== Handling of denormals ===
=== Handling of denormals ===


x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals:
x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals by SSE instructions:
* FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero.
* FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero.
* DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero.
* DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero.


Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ. For example, the Pentium M has SSE2 but does not have DAZ. It is reported that some early Pentium 4 are in the same case.
 
Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ, see below the section about that.


== x86-64 architecture ==
== x86-64 architecture ==
Line 47: Line 48:
For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over ''all'' sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is ''not'' a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation.
For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over ''all'' sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is ''not'' a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation.


The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. TODO: experimentally verify that the timings fit a normal law.
The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. We verified on a few results that the timings' density plot looked plausibly similar to a Gaussian curve.


The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.
The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.


Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend over one cache line boundary).
Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend across one cache line boundary).
 
== Caveats ==
 
=== Equality comparisons ===
 
One of the types of operation that we time, "equality comparison", requires some details to be explained. For a given scalar type, by "equality comparison" we mean this:
 
  scalar equality_comparison(scalar x, scalar y)
  {
    return scalar(x == y);
  }
 
With integer types, this gets implemented with conditional moves and thus consistently runs in constant time. With floating-point types, conditional moves are not enough, for lack of floating-point variants. The compiler could still decide to compute (x==y) as a boolean with conditional move and then cast it to the floating-point type, but at least GCC 4.6 doesn't do that on x86. Instead, GCC 4.6 decided to use conditional jumps on x86. This means that the results of the floating-point "equality comparison" operation are always reported as non-constant-time as there is at least a timing difference between the (x==y) and (x!=y) cases. These reports are ignored in the analysis below.
 
Below, when we report significant timing differences in equality comparisons, we implicitly mean ''besides the above-discussed issue''. For example, if we find that there are specific timing differences when one of the operands is NaN, that can't be explained by the above issue, and is thus an inherent non-constant-time characteristic of the floating-point equality comparison instruction that was used.


== Running the code ==
== Running the code ==


You can either build [https://github.com/bjacob/time-math on github the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment. These are statically-linked binaries, so they should run on any Linux distribution.
You can either build [https://github.com/bjacob/time-math the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment.
 
The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they don't have any dependency, not even on the standard library.
 
On x86-32 machines, note that it is only worth trying out SSE2-capable CPUs. Most of the x86-32 binaries have 'sse2' in their filename, indicating that they require support for SSE2.
 
On x86-64 machines, it is very much worth running ''both'' the x86-64 and the x86-32 binaries.
 
On ARM machines, note that the binaries require ARMv7.


== Recorded results ==
== Recorded results ==
Line 91: Line 115:
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.


=== Integer addition, subtraction and multiplication are universally constant-time ===
=== Integer addition, subtraction and multiplication are almost universally constant-time ===


But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations.
But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations.
See below for a strange exception, where int64 subtraction is not constant time on some 32bit CPUs.


== The hopefully interesting parts ==
== The hopefully interesting parts ==
Line 103: Line 129:
Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.
Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.


An example is a [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/ Pentium-4-era Celeron] exhibiting NaN-related slow behavior with SSE2 instructions, specifically on floating-point multiplications where one of the operands is NaN. See the alert about float multiplication and about double multiplication in [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/time-math-x86-32-sse2-DAZ-FTZ.txt this log].
In particular, this is not true for a wide range of Intel CPUs based on the Netburst microarchitecture, which was popular until the mid-2000s. At least Intel family 15 models 2, 3 and 4 are affected:
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-2.53GHz-family-15-model-2-stepping-9-dell-dimension2400-gavin/time-math-x86-32-sse2-DAZ-FTZ.txt this Pentium 4], family 15 model 2
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ-FTZ.txt this Pentium 4], family 15 model 3
* [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/ this Celeron], family 15 model 2
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/time-math-x86-64-DAZ-FTZ.txt this Pentium D], family 15 model 4. This set of results shows that the problem exists even in 64bit mode.
 
 
=== A significant mass of SSE2-capable CPUs do not have DAZ ===
 
Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.


=== Some Intel CPUs have abnormally slow comparisons with NaN and denormals, even with SSE2, DAZ and FTZ ===
This turns out not to be the case even for Intel CPUs: this [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ]. As a result, timings are all over the place, even on just additions, as soon as denormals are involved.


This is another way in which the notion that NaN issues are all solved by switching to SSE2, is not quite true.
It is also reported that some early Pentium 4 CPUs were in the same case. Conversely, not ''all'' Pentium M's are affected (here is an [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.6GHz-family-6-model-13-stepping-6-dell-inspiron600m-gavin/ unaffected one]).
 
=== Timing differences in equality comparisons ===


The SSE instructions being used there are ucomiss and ucomisd.
The SSE instructions being used there are ucomiss and ucomisd.


A [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/ Intel Core 2] exhibits abnormally ''slow'' floating-point equality comparisons, both with floats and doubles, whenever NaN or denormals are involved --- even with SSE2, DAZ and FTZ. See the alerts about float and double equality comparison in [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/time-math-x86-64-DAZ-FTZ.txt this log].
The first problem is denormals: equality comparisons are just as affected as other operations by denormals.


Timing differences in equality comparisons are worrisome because equality comparison is what we would naturally use if we wanted to manually avoid NaN values. So having this not run in constant time means that the idea of manually avoiding specific "bad" values may be difficult.
The next problem is that many CPUs have abnormally ''slow'' or ''fast'' equality comparisons when the operands are non-finite.


=== Some AMD CPUs have abnormally ''fast'' comparisons with NaN, even with SSE2 instructions ===
Specifically:
* Recent Intel CPUs tend to have abnormally slow comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/intel-core-i7-2860QM-2.5GHz-family-6-model-42-stepping-7-thinkpad-w520/ Core i7], [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/ Core 2].
* Recent AMD CPUs, low-end and older Intel CPUs, and Intel Atom CPUs have abnormally fast comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ AMD FX], [http://people.mozilla.org/~bjacob/ta/results/amd-athlon-ii-x4-635-family-16-model-5-stepping-2-gkwong/ Athlon II], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-b940-2GHz-family-6-model-42-stepping-7-gerv/ Pentium B940], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ Pentium 4], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/ Pentium D], [http://people.mozilla.org/~bjacob/ta/results/intel-atom-d510-1.66GHz-family-6-model-28-stepping-10-jvehent/ Intel Atom].
* Some VIA CPUs have abnormally fast comparisons with NaN and Inf. See [http://people.mozilla.org/~bjacob/ta/results/via-nano-u2250-1.6GHz-family-6-model-15-stepping-3-jvehent/ this VIA Nano].


This one is also about equality comparisons, but in a different way. A [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ 8-core AMD FX processor] exhibits abnormally ''fast'' single-precision floating-point equality comparisons (ucomiss / ucomisd instructions) when one of the operands is NaN. See the alerts about float equality comparison in [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/time-math-x86-64-DAZ-FTZ.txt this log].
=== Some CPUs have surprising timing differences on 64-bit integer arithmetic using 32-bit integer arithmetic instructions, when DAZ is enabled ===


This is its own kind of worrying because:
On x86-32, our int64 subtraction benchmark compiles to this assembly:
* This shows that we have to watch out not just for abnormally ''slow'' but also for abnormally ''fast'' operations.
* This show that Intel and AMD are not making completely parallel progress on these issues.


=== A significant mass of SSE2-capable CPUs do not have DAZ ===
  movl (%ebx), %eax
  movl 4(%ebx), %edx
  movl 8(%ebx), %esi
  movl 12(%ebx), %edi
  subl %esi, %eax
  sbbl %edi, %edx
  movl %eax, 16(%ebx)
  movl %edx, 20(%ebx)
 
On [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ one Pentium 4 CPU] that we used, this does not quite run in constant time, with timing differences of up to 25% and about 3 sigma from the overall median; and the weirdest part is that this only happens when DAZ is enabled! This is of course particularly surprising as DAZ is not supposed to have any relevance to integer arithmetic.
 
Here are the results with DAZ, showing slower-than-normal operation when the right hand side of the subtraction is INT64_MIN, with the slowest case being INT64_MIN - INT64_MIN:
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ.txt With DAZ only;]
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ-FTZ.txt With DAZ and FTZ.]
 
 
On the other hand, here are the results without DAZ, showing no such timing difference with INT64_MIN operands:
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2.txt Without DAZ or FTZ;]
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-FTZ.txt With FTZ only.]
 
 
It is worth noting that this only happens on that particular Pentium 4 CPU (family 15 model 3) and not on other Netburst-based CPUs that we tested.


Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.
=== Float to int64 conversion is non-constant-time on some CPUs ===


This turns out not to be the case even for Intel CPUs: the [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ].
FILL ME


It is also reported that some early Pentium 4 CPUs are in the same case.
=== std::max<int64> is non-constant-time on some CPUs ===


TODO: investigate more CPUs from the early 2000's, in particular some AMD CPUs like the Sempron, Athlon XP and Athlon 64 series.
FILL ME
Confirmed users
753

edits