User:Bjacob/ArithmeticTimingDifferences: Difference between revisions

 
(9 intermediate revisions by the same user not shown)
Line 15: Line 15:
=== Handling of denormals ===
=== Handling of denormals ===


x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals:
x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals by SSE instructions:
* FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero.
* FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero.
* DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero.
* DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero.


Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ. For example, the Pentium M has SSE2 but does not have DAZ. It is reported that some early Pentium 4 are in the same case.
 
Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ, see below the section about that.


== x86-64 architecture ==
== x86-64 architecture ==
Line 51: Line 52:
The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.
The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.


Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend over one cache line boundary).
Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend across one cache line boundary).


== Caveats ==
== Caveats ==
Line 70: Line 71:
== Running the code ==
== Running the code ==


You can either build [https://github.com/bjacob/time-math on github the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment.
You can either build [https://github.com/bjacob/time-math the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment.


The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they don't have any dependency, not even on the standard library.
The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they don't have any dependency, not even on the standard library.
Line 114: Line 115:
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.


=== Integer addition, subtraction and multiplication are universally constant-time ===
=== Integer addition, subtraction and multiplication are almost universally constant-time ===


But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations.
But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations.
See below for a strange exception, where int64 subtraction is not constant time on some 32bit CPUs.


== The hopefully interesting parts ==
== The hopefully interesting parts ==
Line 132: Line 135:
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/time-math-x86-64-DAZ-FTZ.txt this Pentium D], family 15 model 4. This set of results shows that the problem exists even in 64bit mode.
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/time-math-x86-64-DAZ-FTZ.txt this Pentium D], family 15 model 4. This set of results shows that the problem exists even in 64bit mode.


=== Some Intel CPUs have abnormally slow comparisons with NaN and denormals, even with SSE2, DAZ and FTZ ===


This is another way in which the notion that NaN issues are all solved by switching to SSE2, is not quite true.
=== A significant mass of SSE2-capable CPUs do not have DAZ ===


The SSE instructions being used there are ucomiss and ucomisd.
Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.


A [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/ Intel Core 2] exhibits abnormally ''slow'' floating-point equality comparisons, both with floats and doubles, whenever NaN or denormals are involved --- even with SSE2, DAZ and FTZ. See the alerts about float and double equality comparison in [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/time-math-x86-64-DAZ-FTZ.txt this log].
This turns out not to be the case even for Intel CPUs: this [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ]. As a result, timings are all over the place, even on just additions, as soon as denormals are involved.


Timing differences in equality comparisons are worrisome because equality comparison is what we would naturally use if we wanted to manually avoid NaN values. So having this not run in constant time means that the idea of manually avoiding specific "bad" values may be difficult.
It is also reported that some early Pentium 4 CPUs were in the same case. Conversely, not ''all'' Pentium M's are affected (here is an [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.6GHz-family-6-model-13-stepping-6-dell-inspiron600m-gavin/ unaffected one]).


=== Some AMD CPUs have abnormally ''fast'' comparisons with NaN, even with SSE2 instructions ===
=== Timing differences in equality comparisons ===


This one is also about equality comparisons, but in a different way. A [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ 8-core AMD FX processor] exhibits abnormally ''fast'' single-precision floating-point equality comparisons (ucomiss / ucomisd instructions) when one of the operands is NaN. See the alerts about float equality comparison in [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/time-math-x86-64-DAZ-FTZ.txt this log].
The SSE instructions being used there are ucomiss and ucomisd.
 
This is its own kind of worrying because:
* This shows that we have to watch out not just for abnormally ''slow'' but also for abnormally ''fast'' operations.
* This show that Intel and AMD are not making completely parallel progress on these issues.


=== A significant mass of SSE2-capable CPUs do not have DAZ ===
The first problem is denormals: equality comparisons are just as affected as other operations by denormals.


Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.
The next problem is that many CPUs have abnormally ''slow'' or ''fast'' equality comparisons when the operands are non-finite.


This turns out not to be the case even for Intel CPUs: this [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ]. As a result, timings are all over the place, even on just additions, as soon as denormals are involved.
Specifically:
 
* Recent Intel CPUs tend to have abnormally slow comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/intel-core-i7-2860QM-2.5GHz-family-6-model-42-stepping-7-thinkpad-w520/ Core i7], [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/ Core 2].
It is also reported that some early Pentium 4 CPUs were in the same case. Conversely, not ''all'' Pentium M's are affected (here is an [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.6GHz-family-6-model-13-stepping-6-dell-inspiron600m-gavin/ unaffected one]).
* Recent AMD CPUs, low-end and older Intel CPUs, and Intel Atom CPUs have abnormally fast comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ AMD FX], [http://people.mozilla.org/~bjacob/ta/results/amd-athlon-ii-x4-635-family-16-model-5-stepping-2-gkwong/ Athlon II], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-b940-2GHz-family-6-model-42-stepping-7-gerv/ Pentium B940], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ Pentium 4], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/ Pentium D], [http://people.mozilla.org/~bjacob/ta/results/intel-atom-d510-1.66GHz-family-6-model-28-stepping-10-jvehent/ Intel Atom].
* Some VIA CPUs have abnormally fast comparisons with NaN and Inf. See [http://people.mozilla.org/~bjacob/ta/results/via-nano-u2250-1.6GHz-family-6-model-15-stepping-3-jvehent/ this VIA Nano].


=== Some CPUs have surprising timing differences on 64-bit integer arithmetic using 32-bit integer arithmetic instructions, when DAZ is enabled ===
=== Some CPUs have surprising timing differences on 64-bit integer arithmetic using 32-bit integer arithmetic instructions, when DAZ is enabled ===
Line 171: Line 170:
   movl %edx, 20(%ebx)
   movl %edx, 20(%ebx)


On [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ one Pentium 4 CPU] that we used, this does not quite run in constant time, with timing differences of up to 25% and above 3 sigma from the overall median; and the weirdest part is that this only happens when DAZ is enabled! This is of course particularly surprising as DAZ is not supposed to have any relevance to integer arithmetic.
On [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ one Pentium 4 CPU] that we used, this does not quite run in constant time, with timing differences of up to 25% and about 3 sigma from the overall median; and the weirdest part is that this only happens when DAZ is enabled! This is of course particularly surprising as DAZ is not supposed to have any relevance to integer arithmetic.


Here are the results with DAZ, showing slower-than-normal operation when the right hand side of the subtraction is INT64_MIN, with the slowest case being INT64_MIN - INT64_MIN:
Here are the results with DAZ, showing slower-than-normal operation when the right hand side of the subtraction is INT64_MIN, with the slowest case being INT64_MIN - INT64_MIN:
Line 184: Line 183:


It is worth noting that this only happens on that particular Pentium 4 CPU (family 15 model 3) and not on other Netburst-based CPUs that we tested.
It is worth noting that this only happens on that particular Pentium 4 CPU (family 15 model 3) and not on other Netburst-based CPUs that we tested.
=== Float to int64 conversion is non-constant-time on some CPUs ===
FILL ME
=== std::max<int64> is non-constant-time on some CPUs ===
FILL ME
Confirmed users
753

edits