Confirmed users
753
edits
(17 intermediate revisions by the same user not shown) | |||
Line 15: | Line 15: | ||
=== Handling of denormals === | === Handling of denormals === | ||
x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals: | x86 CPUs have two distinct flags that can optionally be enabled to enable non-IEEE-compliant handling of denormals by SSE instructions: | ||
* FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero. | * FTZ (Flush To Zero) causes any denormal ''result'' to be flushed to zero. | ||
* DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero. | * DAZ (Denormals Are Zero) causes any denormal ''operand'' to be handled as if it were zero. | ||
Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ | |||
Not all CPUs have these flags. Even CPUs with SSE2 instructions can fail to support DAZ, see below the section about that. | |||
== x86-64 architecture == | == x86-64 architecture == | ||
Line 47: | Line 48: | ||
For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over ''all'' sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is ''not'' a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation. | For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over ''all'' sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is ''not'' a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation. | ||
The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. | The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. We verified on a few results that the timings' density plot looked plausibly similar to a Gaussian curve. | ||
The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers. | The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers. | ||
Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend | Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend across one cache line boundary). | ||
== Caveats == | |||
=== Equality comparisons === | |||
One of the types of operation that we time, "equality comparison", requires some details to be explained. For a given scalar type, by "equality comparison" we mean this: | |||
scalar equality_comparison(scalar x, scalar y) | |||
{ | |||
return scalar(x == y); | |||
} | |||
With integer types, this gets implemented with conditional moves and thus consistently runs in constant time. With floating-point types, conditional moves are not enough, for lack of floating-point variants. The compiler could still decide to compute (x==y) as a boolean with conditional move and then cast it to the floating-point type, but at least GCC 4.6 doesn't do that on x86. Instead, GCC 4.6 decided to use conditional jumps on x86. This means that the results of the floating-point "equality comparison" operation are always reported as non-constant-time as there is at least a timing difference between the (x==y) and (x!=y) cases. These reports are ignored in the analysis below. | |||
Below, when we report significant timing differences in equality comparisons, we implicitly mean ''besides the above-discussed issue''. For example, if we find that there are specific timing differences when one of the operands is NaN, that can't be explained by the above issue, and is thus an inherent non-constant-time characteristic of the floating-point equality comparison instruction that was used. | |||
== Running the code == | == Running the code == | ||
You can either build [https://github.com/bjacob/time-math | You can either build [https://github.com/bjacob/time-math the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries]. Note that only Linux is supported at the moment. | ||
The binaries should run on any Linux-based system: GNU/Linux, Android, Firefox OS. They are statically linked, so they don't have any dependency, not even on the standard library. | |||
On x86-32 machines, note that it is only worth trying out SSE2-capable CPUs. Most of the x86-32 binaries have 'sse2' in their filename, indicating that they require support for SSE2. | |||
On x86-64 machines, it is very much worth running ''both'' the x86-64 and the x86-32 binaries. | |||
On ARM machines, note that the binaries require ARMv7. | |||
== Recorded results == | == Recorded results == | ||
Line 91: | Line 115: | ||
As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types. | As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types. | ||
=== Integer addition, subtraction and multiplication are universally constant-time === | === Integer addition, subtraction and multiplication are almost universally constant-time === | ||
But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations. | But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations. | ||
See below for a strange exception, where int64 subtraction is not constant time on some 32bit CPUs. | |||
== The hopefully interesting parts == | == The hopefully interesting parts == | ||
Line 103: | Line 129: | ||
Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality. | Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality. | ||
In particular, this is not true for a wide range of Intel CPUs based on the Netburst microarchitecture, which was popular until the mid-2000s. At least Intel family 15 models 2, 3 and 4 are affected: | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-2.53GHz-family-15-model-2-stepping-9-dell-dimension2400-gavin/time-math-x86-32-sse2-DAZ-FTZ.txt this Pentium 4], family 15 model 2 | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ-FTZ.txt this Pentium 4], family 15 model 3 | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/ this Celeron], family 15 model 2 | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/time-math-x86-64-DAZ-FTZ.txt this Pentium D], family 15 model 4. This set of results shows that the problem exists even in 64bit mode. | |||
=== A significant mass of SSE2-capable CPUs do not have DAZ === | |||
Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is. | |||
This turns out not to be the case even for Intel CPUs: this [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ]. As a result, timings are all over the place, even on just additions, as soon as denormals are involved. | |||
It is also reported that some early Pentium 4 CPUs were in the same case. Conversely, not ''all'' Pentium M's are affected (here is an [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.6GHz-family-6-model-13-stepping-6-dell-inspiron600m-gavin/ unaffected one]). | |||
=== Timing differences in equality comparisons === | |||
The SSE instructions being used there are ucomiss and ucomisd. | The SSE instructions being used there are ucomiss and ucomisd. | ||
The first problem is denormals: equality comparisons are just as affected as other operations by denormals. | |||
The next problem is that many CPUs have abnormally ''slow'' or ''fast'' equality comparisons when the operands are non-finite. | |||
Specifically: | |||
* Recent Intel CPUs tend to have abnormally slow comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/intel-core-i7-2860QM-2.5GHz-family-6-model-42-stepping-7-thinkpad-w520/ Core i7], [http://people.mozilla.org/~bjacob/ta/results/intel-core-2-2.66GHz-family-6-model-15-stepping-11-st3fan/ Core 2]. | |||
* Recent AMD CPUs, low-end and older Intel CPUs, and Intel Atom CPUs have abnormally fast comparisons with NaN. Examples: [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ AMD FX], [http://people.mozilla.org/~bjacob/ta/results/amd-athlon-ii-x4-635-family-16-model-5-stepping-2-gkwong/ Athlon II], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-b940-2GHz-family-6-model-42-stepping-7-gerv/ Pentium B940], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ Pentium 4], [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-d-3GHz-family-15-model-4-stepping-7-dell-dimension9150-gavin/ Pentium D], [http://people.mozilla.org/~bjacob/ta/results/intel-atom-d510-1.66GHz-family-6-model-28-stepping-10-jvehent/ Intel Atom]. | |||
* Some VIA CPUs have abnormally fast comparisons with NaN and Inf. See [http://people.mozilla.org/~bjacob/ta/results/via-nano-u2250-1.6GHz-family-6-model-15-stepping-3-jvehent/ this VIA Nano]. | |||
=== Some CPUs have surprising timing differences on 64-bit integer arithmetic using 32-bit integer arithmetic instructions, when DAZ is enabled === | |||
On x86-32, our int64 subtraction benchmark compiles to this assembly: | |||
movl (%ebx), %eax | |||
movl 4(%ebx), %edx | |||
movl 8(%ebx), %esi | |||
movl 12(%ebx), %edi | |||
subl %esi, %eax | |||
sbbl %edi, %edx | |||
movl %eax, 16(%ebx) | |||
movl %edx, 20(%ebx) | |||
On [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/ one Pentium 4 CPU] that we used, this does not quite run in constant time, with timing differences of up to 25% and about 3 sigma from the overall median; and the weirdest part is that this only happens when DAZ is enabled! This is of course particularly surprising as DAZ is not supposed to have any relevance to integer arithmetic. | |||
Here are the results with DAZ, showing slower-than-normal operation when the right hand side of the subtraction is INT64_MIN, with the slowest case being INT64_MIN - INT64_MIN: | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ.txt With DAZ only;] | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-DAZ-FTZ.txt With DAZ and FTZ.] | |||
On the other hand, here are the results without DAZ, showing no such timing difference with INT64_MIN operands: | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2.txt Without DAZ or FTZ;] | |||
* [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-4-3.2GHz-family-15-model-3-stepping-3-tycho/time-math-x86-32-sse2-FTZ.txt With FTZ only.] | |||
It is worth noting that this only happens on that particular Pentium 4 CPU (family 15 model 3) and not on other Netburst-based CPUs that we tested. | |||
=== Float to int64 conversion is non-constant-time on some CPUs === | |||
FILL ME | |||
=== std::max<int64> is non-constant-time on some CPUs === | |||
FILL ME |