User:Bjacob/ArithmeticTimingDifferences: Difference between revisions

User:Bjacob/ArithmeticTimingDifferences (view source)

Revision as of 18:55, 3 June 2013

6,913 bytes added , 3 June 2013

no edit summary

Bjacob

Confirmed users

753

edits

@@ Line 39: / Line 39: @@
 ARM has a single flag, Flush To Zero (FZ) which plays the role of both of x86's flags, FTZ and DAZ.
-= Methodology =
+= Getting experimental data =
+== Methodology ==
 The code performing the measurements is [https://github.com/bjacob/time-math on github].
+For a given operation (e.g. float multiplication) and each set of operands (e.g. 1e-40f * 1.0f) we run the operation N times, and evaluate the mean timing and standard deviation. We then see how the overall median time for this operation, over ''all'' sets of operands, compares to that. If it is more than two standard deviations away, that means that the overall median timing is ''not'' a plausible timing for this particular set of operands, and therefore, this particular set of operands exhibits abnormal timings for this operation.
+The underlying assumption, behind counting in standard deviations, is that for a given operation and set of operands, timings should follow a normal (a.k.a. Gaussian) law. This is reasonable because a given computer operation is likely to work like a series of sub-operations that can randomly either run in optimal time or take a little bit longer; so that the central limit theorem would produce a normal law for the sum of the timings of the sub-operations. TODO: experimentally verify that the timings fit a normal law.
+The reason to compare these normal laws to the overall ''median'', not the overall ''mean'', is that the overall mean may fall in between the actual "normal" time range and some abnormal, much higher values obtained for certain operands. In that case, ''all'' actual timings would fall far away from the mean, which would cause all timings to be reported as abnormal! This problem is averted by using the median instead, which is by definition an actual timing, and is most stable in the face of outliers.
+Times are measured by clock_gettime using CLOCK_THREAD_CPUTIME_ID. In other words, this is per-thread CPU time. This should be as resilient to perturbations as it gets, especially as the core benchmarking loop is free of I/O or syscalls (no malloc) and relies on only at most two cache lines (it only accesses a few bytes on the stack, and in the worst case they would extend over one cache line boundary).
+== Running the code ==
+You can either build [https://github.com/bjacob/time-math on github the source code] or use [http://people.mozilla.org/~bjacob/ta/bin/ prebuilt binaries].
+== Recorded results ==
+Test results so far are recorded [http://people.mozilla.org/~bjacob/ta/results/ there].
+= Analysis =
+Here are the conclusions that we can draw from our [http://people.mozilla.org/~bjacob/ta/results/ experimental results] so far.
+== The boring parts ==
+Let's start by listing a few things that "everybody knows" for which we got an experimental confirmation.
+=== x87 is highly non-constant-time ===
+x86 code using x87 floating-point instructions exhibits highly non-constant time behavior, especially but not only with denormals and NaNs, and there doesn't appear to be anything to do about it.
+Here is [http://people.mozilla.org/~bjacob/ta/results/intel-core-i7-2860QM-2.5GHz-family-6-model-42-stepping-7-thinkpad-w520/time-math-x86-32-x87.txt a sample run].
+=== ARM software floating-point emulation is highly non-constant-time ===
+ARM code relying on software floating-point emulation should not expect any kind of constant-time characteristics, even on regular normal finite values.
+Here is [http://people.mozilla.org/~bjacob/ta/results/arm-cortex-a8-nexus-s/time-math-arm-soft.txt a sample run].
+=== Math functions are highly non-constant-time ===
+Note that for math functions, we just use std:: math functions and rely on the compiler to do the right thing. It is often the case that there is no instruction for a given math function and the compiler or standard library has to implement these manually, using basic arithmetic operations. That is at least the case with SSE.
+Our results show that on all architectures and instruction sets we tried, math functions are highly non-constant-time.
+=== Divisions are highly non-constant-time, even with integer types ===
+As expected, divisions are not constant-time whatsoever, not even on finite values, and not even on integer types.
+=== Integer addition, subtraction and multiplication are universally constant-time ===
+But again, note that integer ''division'' isn't, and nor are any ''floating-point'' operations.
+== The hopefully interesting parts ==
+Let's now get to the findings that we made, that contradict what we thought we knew.
+=== Some Intel CPUs have slow multiplications with NaN values, even with SSE2 instructions ===
+Conventional "wisdom" is that while NaN's are slow with x87, they aren't anymore with SSE instructions. That turned out not to be true in full generality.
+An example is a [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/ Pentium-4-era Celeron] exhibiting NaN-related slow behavior with SSE2 instructions, specifically on floating-point multiplications where one of the operands is NaN. See the alert about float multiplication and about double multiplication in [http://people.mozilla.org/~bjacob/ta/results/intel-celeron-1.8GHz-family-15-model-2-stepping-7-toshiba-satellite/time-math-x86-32-sse2-DAZ-FTZ.txt this log].
+=== Some recent AMD CPUs have slow float comparisons with NaN values, even with SSE2 instructions ===
+This is another way in why the notion that NaN issues are all solved by switching to SSE2, is not quite true.
+A [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/ 8-core AMD FX processor] exhibits slow single-precision floating-point equality comparisons (ucomiss / ucomisd instructions) when one of the operands is NaN. See the alerts about float equality comparison in [http://people.mozilla.org/~bjacob/ta/results/amd-fx-8150-family-21-model-1-stepping-2-benwa-bulldozer/time-math-x86-64-DAZ-FTZ.txt this log].
+The SSE instructions being used there are ucomiss and ucomisd.
+This is particularly worrying because:
+* This show that Intel and AMD are not making completely parallel progress on these issues.
+* Equality comparison is what we would use if we wanted to manually avoid NaN values. So having this not run in constant time means that the idea of manually avoiding specific "bad" values may not be feasible.
+=== A significant mass of SSE2-capable CPUs do not have DAZ ===
+Conventional "wisdom" is that DAZ is available nearly everywhere SSE2 is.
+This turns out not to be the case even for Intel CPUs: the [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/ Pentium M] supports SSE2 but evidently [http://people.mozilla.org/~bjacob/ta/results/intel-pentium-m-1.4GHz-family-6-model-9-stepping-5-toshiba-tecra/time-math-x86-32-sse2-DAZ-FTZ.txt doesn't support DAZ].
+It is also reported that some early Pentium 4 CPUs are in the same case.
+TODO: investigate more CPUs from the early 2000's, in particular some AMD CPUs like the Sempron, Athlon XP and Athlon 64 series.