ConcurrencyTesting
The Mozilla platform is becoming increasingly concurrent --- it already has multi-threaded networking and image-decoding code (right?), and will soon have JavaScript "worker threads." Multi-threaded parsing is also being implemented. All of this calls for more rigorous testing for concurrency bugs, in the same way that regression and performance tests are currently run.
This page aims to provide a fairly high-level overview of concurrency testing in general, strategies that seem likely to work on the Mozilla code base, coding practices that improve concurrency testing, and particular tools that could work for Mozilla (or if necessary, proposals for in-house tools).
Overview of Concurrency Testing
There are several general strategies, roughly listed by their applicability to Mozilla's code base.
Static analysis
Static analyses enforce invariants by checking only a program's source code. One type of invariant is that all threads follow a locking protocol (lock before unlock, etc.). With additional programmer-provided annotations, static analyses can discover more bugs, or weed out false positives. An example annotation:
lock l; int x LOCKEDBY(l);
which means that 'x' must only be accessed when the current thread holds the lock 'l'.
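As a purely illustrative sketch (not a real Mozilla API), suppose LOCKEDBY expands to nothing for the compiler but is recognized by a static checker; the names gCacheLock and gCacheSize are made up:

 #include "prlock.h"

 // Hypothetical: the checker verifies that every access to a variable marked
 // LOCKEDBY(l) happens while l is held; the compiler just ignores the macro.
 #define LOCKEDBY(l)

 static PRLock* gCacheLock;                      // protects gCacheSize
 static int gCacheSize LOCKEDBY(gCacheLock);

 void GrowCache(int delta)
 {
   PR_Lock(gCacheLock);
   gCacheSize += delta;                          // OK: gCacheLock is held
   PR_Unlock(gCacheLock);
 }

 void BuggyGrowCache(int delta)
 {
   gCacheSize += delta;                          // flagged: gCacheLock not held
 }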
Almost all static analyses are either sound (no "false negatives," i.e. no missed errors), complete (no "false positives," i.e. all errors are "real"), or neither --- but they are almost never both sound and complete. A difficulty of analyzing C++ code is in designing a (non-trivial) sound analysis that both understands pointers and isn't made completely useless by too many false positives. So pretty much any (non-trivial) static concurrency analysis for C++ has to choose between soundness and usefulness. Luckily, many "trivial" static analyses are quite useful --- type checking is one.
Dynamic analysis
Dynamic analyses usually instrument a binary to record memory accesses, lock acquisitions, and so forth. Then the instrumented binary is run as normal to gather data, and desirable properties are checked during execution (or afterwards). One such property is: there are no data races, i.e., unordered writes/reads to a global memory location.
Most dynamic analyses use one of two algorithms: "happens-before" and "lockset." Loosely, the "happens-before" algorithm checks that there is a (partial) order on accesses to shared memory. The order is determined by several types of events; for example, if the main thread accesses variable X, then spawns a child thread which also accesses X, then those two accesses of X are definitely ordered --- the child thread can't touch X until it has been spawned. Thus spawning a thread is one way to establish an order on memory accesses.
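For instance (a minimal sketch using NSPR threads; gX, Child, and the race are contrived for illustration), a happens-before checker orders the parent's first write before the child's access because the thread spawn sits between them, but it finds no such ordering for the parent's second write:

 #include "prthread.h"

 static int gX;   // shared, intentionally unprotected

 static void Child(void* /*arg*/)
 {
   gX++;          // ordered after the parent's "gX = 1": the spawn happens before this
 }

 int main()
 {
   gX = 1;        // happens before the child's access (the thread doesn't exist yet)

   PRThread* t = PR_CreateThread(PR_USER_THREAD, Child, NULL,
                                 PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                 PR_JOINABLE_THREAD, 0);

   gX = 2;        // no ordering with the child's gX++ --- reported as a race

   PR_JoinThread(t);
   return 0;
 }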
The "lockset" algorithm, again roughly speaking, keeps a set of locks for each shared memory location (L[v]). Initially, each L[v] contains all locks in the program. As the program runs, each spawned thread t track of the locks it owns, in a set L[t]. Each time a thread t accesses a shared memory location v, L[v] is to set to the intersection of L[v] and L[t]. If L[v] ever becomes empty, then there's a potential bug! This is an incomplete and potentially dense description, but consider the following cases that illustrate the algorithm:
- t holds no locks (L[t]={}) and accesses memory location v
- L[v] = L[v].intersect ({}) // t accesses v
- L[v] == {}
- Bug!
- t1 holds lock l1 (L[t1]={l1}), t2 holds lock l2 (L[t2]={l2}), and both access memory location v.
- L[v] = L[v].intersect ({ l1 }) // t1 accesses v
- L[v] == { l1 } // OK
- L[v] = L[v].intersect ({ l2 }) // t2 accesses v
- L[v] == { }
- Bug!
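A sketch of the bookkeeping behind those cases (illustrative C++; in a real tool this runs inside the instrumentation, and the names here are made up):

 #include <algorithm>
 #include <iterator>
 #include <map>
 #include <set>

 typedef int LockId;
 typedef int ThreadId;
 typedef void* Address;

 static std::map<ThreadId, std::set<LockId> > gLocksHeld;    // L[t]
 static std::map<Address,  std::set<LockId> > gCandidates;   // L[v]

 static void ReportPotentialRace(ThreadId t, Address v) { /* log it */ }

 // Called by the instrumentation on every access to shared location v by thread t.
 void OnAccess(ThreadId t, Address v, const std::set<LockId>& allLocks)
 {
   if (gCandidates.find(v) == gCandidates.end())
     gCandidates[v] = allLocks;                  // initially L[v] contains all locks

   const std::set<LockId>& lv = gCandidates[v];
   const std::set<LockId>& lt = gLocksHeld[t];
   std::set<LockId> meet;
   std::set_intersection(lv.begin(), lv.end(), lt.begin(), lt.end(),
                         std::inserter(meet, meet.begin()));
   gCandidates[v] = meet;                        // L[v] = intersection of L[v] and L[t]

   if (gCandidates[v].empty())
     ReportPotentialRace(t, v);                  // no lock consistently protects v
 }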
Both these analyses (and basically all other dynamic analyses) suffer from the problem that they can only catch bugs that are exposed by a particular execution of the instrumented program. That is, it's possible for a tool to report different bugs on different runs of the program. This can be ameliorated by running the program multiple times. One benefit over static analyses, especially for C++ code, is that dynamic analyses have perfectly precise information about what pointers actually point to --- they only need to dereference the pointer.
Record and replay debugging
A program (or system) is run, while the recorder stores enough information about the program's (system's) execution to enable deterministic replay of the entire execution. Usually what needs to be recorded is all sources of nondeterminism, including, but not limited to, potentially racy system calls (e.g. read/write), nondeterministic scheduling decisions, random input, nondeterministic memory accesses, ... Any dynamic analysis described above can then be run at leisure on the recorded trace, in addition to more sophisticated analyses like exploring alternative scheduling decisions, etc.
The recording is usually done with a virtual machine, although it can also be done with binary instrumentation or translation.
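As a toy illustration of what "recording a source of nondeterminism" means (entirely hypothetical; real recorders work at the VM, syscall, or instruction level), a wrapper around read() could log its results while recording and feed them back during replay:

 #include <stdio.h>
 #include <unistd.h>

 static bool gReplaying = false;   // hypothetical mode switch
 static FILE* gLog = NULL;         // opened elsewhere, for writing or reading

 // Wrap a nondeterministic call: record its result, or replay the recorded one.
 ssize_t LoggedRead(int fd, void* buf, size_t count)
 {
   if (gReplaying) {
     size_t n = 0;
     fread(&n, sizeof(n), 1, gLog);     // replay exactly what was observed
     fread(buf, 1, n, gLog);
     return (ssize_t)n;
   }
   ssize_t n = read(fd, buf, count);    // record: do the real call and log it
   if (n >= 0) {
     size_t sz = (size_t)n;
     fwrite(&sz, sizeof(sz), 1, gLog);
     fwrite(buf, 1, sz, gLog);
   }
   return n;
 }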
Model checking
Loosely speaking, model checkers attempt to check all possible "executions" of a "model." Most program models can be thought of as state machines: the "state" is the value of all variables at some point during execution, and "transitions" are program statements executed by a particular thread, changing one state into another. Note that there may be more than one transition possible from a particular program state --- this means that more than one thread is runnable at that state.
To make the model-checking problem tenable, most model checkers require that models have only finite state. Then, the checker starts with the program's "start state" and does a complete exploration of the model's state space --- this isn't anything fancy, just a depth/breadth-first search of a graph starting from the "start state" node. Loosely speaking again, a model checker finds an error when some "assertion" fails at a particular program state. (Other types of errors are beyond the scope of this document.)
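The exhaustive search itself is ordinary graph search, as in this sketch (hypothetical: State stands in for a snapshot of all variables, Successors() for "each runnable thread takes one step," and AssertionHolds() for the property being checked):

 #include <queue>
 #include <set>

 struct State {
   int snapshot;   // stand-in for the values of all variables
   bool operator<(const State& o) const { return snapshot < o.snapshot; }
 };

 // Hypothetical hooks supplied by a real checker; trivial stubs here.
 static std::set<State> Successors(const State&) { return std::set<State>(); }
 static bool AssertionHolds(const State&) { return true; }

 // Breadth-first exploration of the model's state space from the start state.
 bool CheckModel(const State& start)
 {
   std::set<State> visited;
   std::queue<State> work;
   work.push(start);
   visited.insert(start);

   while (!work.empty()) {
     State s = work.front();
     work.pop();
     if (!AssertionHolds(s))
       return false;                    // a reachable state violates the assertion
     std::set<State> next = Successors(s);
     for (std::set<State>::const_iterator it = next.begin(); it != next.end(); ++it) {
       if (visited.insert(*it).second)  // not visited before
         work.push(*it);
     }
   }
   return true;                         // every reachable state satisfies the assertion
 }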
Model checkers are awesome when they can be used, but they need lots of memory and CPU time to check non-trivial programs. However, there are many clever optimizations that can shrink the number of program states that need to be explored. There are also heuristics for exploring program states (e.g., breadth- vs. depth-first) that increase the likelihood of finding errors within an allotted amount of CPU time and memory.
Formal methods (verification)
Proofs using process calculi and the like. Bless the soul who attempts this on any part of Mozilla code!
Writing Testable Code
- Use assertions The very first thing we need is the ability to write an NS_ASSERTION that a particular lock or monitor is held. That's dead simple, but we can't do it today because of NSPR issues, I think. Then we can sprinkle those through our getters and setters and near field accesses to approximate static annotations that a particular field is protected by a particular lock (see the sketch after this list). This would help us a lot, today. In the future maybe we can write static annotations to that effect and use static analysis to verify that either the right lock is held or we're on the right thread, or at least that there is an assertion that will perform the correct check, but I think that's quite far out. (roc)
- Don't use PR_Monitor/PR_Lock We probably should also stop using PR_Monitor/PR_Lock directly and wrap them in nsLock/nsMonitor. See https://bugzilla.mozilla.org/show_bug.cgi?id=482938. (roc)
- Use annotations This is future work, but stay tuned. Annotations can act as a rich type system, more easily allowing compilers and analyzers to check statically what assertions check at run time.
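A minimal sketch of the lock-held assertion from the first bullet, assuming a hypothetical wrapper class that remembers its owning thread (plain PRLock doesn't expose this, which is the NSPR issue mentioned above); none of these names are real Mozilla APIs except NS_ASSERTION and the NSPR calls:

 #include "prlock.h"
 #include "prthread.h"
 #include "nsDebug.h"     // NS_ASSERTION

 // Hypothetical wrapper: tracks which thread currently owns the lock.
 class AssertableLock
 {
 public:
   AssertableLock() : mLock(PR_NewLock()), mOwner(NULL) {}
   ~AssertableLock() { PR_DestroyLock(mLock); }

   void Lock()   { PR_Lock(mLock); mOwner = PR_GetCurrentThread(); }
   void Unlock() { mOwner = NULL; PR_Unlock(mLock); }

   bool IsHeldByCurrentThread() const { return mOwner == PR_GetCurrentThread(); }

 private:
   PRLock* mLock;
   PRThread* mOwner;
 };

 // A getter could then approximate the "x is protected by l" annotation:
 //   NS_ASSERTION(mCacheLock.IsHeldByCurrentThread(),
 //                "mCacheSize must only be read with mCacheLock held");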
Testing Strategies
These methods are not mutually exclusive --- they're quite complementary in fact.
- Static concurrency analyses are performed along with our other static analyses (outparams, et al.), hopefully at compile time.
- Dynamic checkers (possibly multiple ones) run a normally-built Firefox on a suite of tests, a la regression tests/reftests (the tests can be shared). The tests could be run multiple times.
- Model checkers run on small, specially-constructed test harnesses of critical code --- for example, a small driver program that AddRefs/Releases k objects from t threads (a sketch of such a driver follows this list).
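A driver of that shape might be as small as the following sketch (hypothetical: MyObject stands in for whatever refcounted class is under test, here given a trivial atomic refcount so the example is self-contained):

 #include "pratom.h"
 #include "prthread.h"

 // Stand-in for the class under test.
 class MyObject
 {
 public:
   MyObject() : mRefCnt(1) {}
   void AddRef()  { PR_AtomicIncrement(&mRefCnt); }
   void Release() { if (PR_AtomicDecrement(&mRefCnt) == 0) delete this; }
 private:
   PRInt32 mRefCnt;
 };

 static const int kNumObjects = 4;   // "k"
 static const int kNumThreads = 3;   // "t"
 static MyObject* gObjects[kNumObjects];

 static void Hammer(void* /*arg*/)
 {
   for (int i = 0; i < kNumObjects; ++i) {
     gObjects[i]->AddRef();
     // ... exercise the object ...
     gObjects[i]->Release();
   }
 }

 int main()
 {
   for (int i = 0; i < kNumObjects; ++i)
     gObjects[i] = new MyObject();

   PRThread* threads[kNumThreads];
   for (int i = 0; i < kNumThreads; ++i)
     threads[i] = PR_CreateThread(PR_USER_THREAD, Hammer, NULL,
                                  PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                  PR_JOINABLE_THREAD, 0);
   for (int i = 0; i < kNumThreads; ++i)
     PR_JoinThread(threads[i]);

   for (int i = 0; i < kNumObjects; ++i)
     gObjects[i]->Release();   // drop the harness's own reference
   return 0;
 }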
Tools
Roughly in order of how easily they can be applied to Mozilla code.
- Helgrind (valgrind tool): dynamic (happens-before) checker
- + should work out-of-the-box
- - Linux only (maybe on OS X now?)
- + GPL (mostly)
- - false positives
- Chris Pearce's VMWare-based, record-and-replay debugger
- + already working on FF code
- - Windows only (just at the moment?)
- - not an analysis tool per se: can only find crashes/assertion failures (but can replay them)
- CHESS: "iterative context-bounding" model checker
- + should work out-of-the-box
- - Windows only
- - license is a problem
- + no false positives
- (schach --- a prototype of an iterative context-bounding checker for linux --- is being investigated by bent and cjones)
- ReVirt (or whatever the parallel version is called): record and replay debugger
- + should work out of the box
- - linux only (last cjones heard)
- + GPL
- - performs well on less-active systems; unknown performance on an active Firefox
- - not an analysis tool per se: can only find crashes/assertion failures (but can replay them)
- SharC: static and dynamic checker
- - requires work to run on Mozilla
- + should start using their sharing annotations whether or not we run the SharC tool
- - pthread only (as far as cjones knows)
- - only works for C code (as far as cjones knows)
- + BSD licensed
- - false positives
- Hybrid checker: dynamic (happens-before and lockset)
- - might require significant work (no C++ impl exists, as far as cjones knows)
- + or, could be implemented as a less-significant modification of helgrind
- - false positives
- RaceFuzzer: something like a hybrid between a dynamic checker and model checker
- - requires significant work
- - Java only (and no public release, as far as cjones knows)
- - technique probably won't scale to a program of Firefox's size
- + no false positives
- SPIN: model checker
- - resource requirements can be high, to put it mildly
- - "good" models have to be hand-written in Promela, an "obscure" modeling language, to put it mildly
- +/- can automatically generate models from C code, but the models are pretty crappy
- + license looks OK
- + no false positives
Plan (TODOs)
Roughly in order of importance.
If you take one, please mark it as taken.
- (??) fix "NSPR issues" preventing the assertion that locks/monitors are held (is there a bug open already?)
- (??) add sharing assertions to existing concurrent code
- (see bug 482938) "replace" PR_Lock/PR_Monitor with nsLock/nsMonitor
- (QA?) create set of test cases for dynamic analyses
- same as regression tests?
- (??) integrate testing into build system and/or tinderbox, wherever appropriate
- (XXX/cjones: are we already doing regression testing on multi-core machines?)
- evaluate tools above (except CHESS, for legal reasons)
- (??) (list the tool you want to check here)
- if none of these suffices, write one
- (??) decide on desired static analyses
- use SharC, other tool, or write them with treehydra
- (??) add sharing annotations to existing concurrent code
- (??) find least amount of platform code needed to usefully support non-trivial modules (such as worker threads)
- create separate build target for each concurrent module? (like unit tests)
- attempt to model-check these targets?
- integrate with tinderbox?