Platform/Uptime
Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.
David Bryant's internal email announcement said: "The overall responsibility to hit our stability target is something that falls on the entire platform organization, and in terms of priority, stability is second only to shipping e10s."
Goals
The basics
We can break Uptime's goals down into four sub-goals.
- Reduce the Release crash rate, and keep it down. This gives a better experience for ~100M desktop / ~7M mobile users, and makes them less likely to switch to a different browser.
- Reduce the Beta crash rate, and keep it down. This gives a better experience for ~2M desktop / ~150k mobile users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goal #1.
- Reduce the Aurora crash rate, and keep it down. This gives a better experience for ~100k desktop / ~2k mobile users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1 and #2.
- Reduce the Nightly crash rate, and keep it down. This gives a better experience for ~30k desktop / ~1k mobile users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1, #2 and #3.
(Channel ADI numbers are approximate, taken from http://arewestableyet.com on Apr 26.)
These sub-goals are listed in order of priority. But note that the relative priorities do not match the relative user counts. E.g. even though Release has ~3000x as many users as Nightly, goal #1 is not 3000x more important than goal #4. The higher-numbered goals, as well as being (decreasingly) virtuous in their own right, partially subsume the lower-numbered goals, because a crash fixed or avoided in an earlier channel is one that doesn't show up in a later channel. (The later goals don't fully subsume the earlier goals because some crashes don't show up in significant numbers until we get to the larger populations of the later channels.)
There are multiple strategies (tools, processes, systems) to achieve these goals -- see below -- but those strategies should not be confused with the goals themselves.
Definition of "crash rate"
For a long time the crash rate metric was "crashes per 100 ADIs" (active daily installations, a count derived from daily blocklist pings). That metric has various shortcomings (in particular, it says nothing about how long each installation is actually used) and is in the process of being replaced with "crashes per 1000 hours of usage".
Not all crashes are equal. For example: persistent start-up crashes are terrible; with e10s enabled, content process crashes are not as bad as chrome process crashes; plugin crashes may be less important. It would be possible to adjust the per-1000-hours-of-usage measurements by weighting different types of crashes differently. However, any such formula would be arbitrary and not clearly better than the plain metric. Therefore, we will use the plain metric for defining "crash rate", and that is the value we want to get down.
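As a worked illustration of the metric, here is a minimal sketch; the crash count and usage-hour figures are made up, and the real values come from crash ingestion and telemetry.

 // A worked example of the "crashes per 1000 hours of usage" metric.
 // The crash count and usage hours below are hypothetical.
 #include <cstdint>
 #include <cstdio>

 double CrashesPer1000Hours(uint64_t aCrashCount, double aUsageHours) {
   return double(aCrashCount) / (aUsageHours / 1000.0);
 }

 int main() {
   // e.g. 70,000 crashes over 50,000,000 usage hours: prints 1.40
   printf("%.2f\n", CrashesPer1000Hours(70000, 50000000.0));
   return 0;
 }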
Having said that, when considering a problem as complex as browser crashes, any single metric will have shortcomings and should not be the only thing paid attention to. In particular, we should build systems and processes (such as automated alerts) that help us give attention to the worst kinds of crashes.
Target crash rate
There are eight crash rates to track: {Release, Beta, Aurora, Nightly} x {desktop, mobile}.
The actual target values are TBD, and cannot be decided until the new per-1000-hours-of-usage metric has fully replaced the old per-100-ADIs metric and we know what our current values are.
All about crashes
We use "crashes" to refer to several kinds of events.
- Uncontrolled crashes: most commonly due to bad memory accesses, which are caused by bad pointers. Also sometimes illegal instructions, division-by-zero, etc.
- Controlled aborts: e.g. due to MOZ_CRASH, MOZ_RELEASE_ASSERT, or Rust panics (see the sketch after this list).
- OOMs: running out of virtual or physical memory. These usually present as controlled aborts but contain additional annotations about the amount of memory being allocated.
- Hangs: when a plugin or content process stops responding for too long, we sometimes collect a dump from that process and then kill it.
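For illustration, a minimal sketch of a controlled abort in Gecko-style C++. The function, its limit, and its invariants are hypothetical; MOZ_RELEASE_ASSERT and MOZ_CRASH are the MFBT macros mentioned above.

 #include <stddef.h>              // size_t
 #include <stdint.h>              // uint8_t
 #include "mozilla/Assertions.h"  // MOZ_RELEASE_ASSERT, MOZ_CRASH

 static const size_t kMaxChunkLength = 1 << 20;  // hypothetical limit

 // Hypothetical function: abort in a controlled way instead of crashing in an
 // uncontrolled way later.
 void ProcessChunk(const uint8_t* aData, size_t aLength) {
   // Abort deliberately rather than dereference a bad pointer further on.
   MOZ_RELEASE_ASSERT(aData, "ProcessChunk() requires non-null data");

   if (aLength > kMaxChunkLength) {
     // A state we would rather crash on than let it silently corrupt memory.
     MOZ_CRASH("ProcessChunk(): chunk length exceeds the allowed maximum");
   }
   // ... process the chunk ...
 }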
Crashes are grouped by process type:
- "main" process crashes halt Firefox execution.
- When a content process crashes, Firefox itself stays open but some or all tabs must be reloaded.
- When an NPAPI plugin or Gecko media plugin crashes, the web page stays alive but some functionality may be broken, and the user may need to reload pages that used that plugin.
There are multiple different causes of crashes, multiple ways they can be identified and addressed, and multiple ways we can improve our processes for doing so.
A word about coverage
For all of the following paths to a bug report (except static analysis and code inspection), coverage affects how effective the path is. There are several kinds of coverage.
- Code coverage: e.g. which lines of code are we not testing?
- Web feature coverage: e.g. which web features are we not testing?
- Site coverage: e.g. which important websites are we not testing?
- Hardware configuration coverage: e.g. which important gfx card/driver combinations are we not testing?
- Firefox configuration coverage: e.g. which add-ons or non-default options are we not testing?
Paths to a bug report
Almost without exception, for a crash to be fixed we must have a bug report for it. There are multiple paths from a crash to a bug report.
User crash report (automated)
Steps
- A "normal" user hits a crash.
- The crash reporter catches the crash.
- The user consents to crash report submission.
- The crash report is received and processed.
- The crash report is clustered into a signature and ranked.
- A Mozillian triages the signature and files a bug report.
Pros and cons
- [+] This is the only path that gives us a clear idea of how frequent the crash is among our user population.
- [+] Crash reports on Nightly provide fast feedback and small regression windows.
- [-] This is a long, complex path with numerous points of potential failure.
- [-] Submission rates vary, and are low for content, NPAPI and GMPlugin crashes.
- [-] Clustering and ranking are imperfect. Sometimes crash reports with distinct causes get conflated into one signature. Sometimes crash reports with a single cause get separate signatures (people sometimes spot this and link multiple signatures to a single bug report). The ranking does not consider crash severity.
- [-] Some crashes only manifest frequently in later release channels, where they affect more people, and are harder to address due to late feedback and large regression windows. Crashes that only manifest significantly on Release may require chemspills.
- [-] Crash reports frequently lack enough info to diagnose/address. (E.g. no steps to reproduce, and it's obvious that a pointer was bad, but unclear how it got that way.)
- [-] Crash report data can be unreliable. (E.g. incorrect crash addresses and stack traces.)
Possible improvements
- Increase the number of users on earlier channels, especially Nightly.
- Ideas for Nightly from Marcia Knous.
- A blog post encouraging people to switch to Nightly.
- Improve earlier channel populations to be more representative of the Release population.
- Increase crash report submission rate.
- Always submit non-sensitive data, and make only the sensitive parts (e.g. minidumps) optional.
- bug 1269998 Prompt users with pending crash reports to submit them.
- bug 1270553 Allow users to opt-in to auto submitting crash data.
- Provide feedback to users when their crash report leads to a fix, which improves engagement.
- Improve the clustering of crash reports.
- Improve crash signatures by skipping/ignoring generic stack frames (see the sketch after this list).
- Ignore minor differences in graphics driver versions.
- bug 1268029 Use jit classifier to change signature for jit crashes. (More improvements to JIT crash reports probably still required after that, though.)
- Distinguish GC crashes better. (We already have an IsGarbageCollecting annotation; we could use that.)
- Use something more sophisticated than the crash signature, which is mostly based on stack traces.
- Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs).
- bug 1274345 Skip the graphics drivers signatures in the report, letting us cluster on the first calls for which we have the symbols, rather than just dealing with the DLL offsets. See the two reports in this bug for an example.
- Improve ranking/prioritization of crash clusters.
- Take into account crash severity as well as frequency.
- Use Crystal Ball (or other means) to identify how minor crashes on early release channels might become major crashes in later channels.
- Improve triaging of crash reports.
- Ensure sufficient manual triage occurs.
- Nightly has coverage for every build (see the roster below).
- Aurora: ?
- Beta: release drivers?
- Release: release drivers?
- Do closer triage of classes of crash reports that are likely to be fixable (e.g. aborts).
- A search for all MOZ_CRASH crashes.
- Increase the amount of automation in triage.
- Use Clouseau to automatically detect new crashes and trace them back to recent changes around the location of the crash, and possibly notify whoever most recently changed that code.
- Map crashes to Bugzilla components - bug 1277337
- Generate consistent signatures across crash-stats/fuzzing/bughunter/automated tests. Socorro API for generating signatures? bug 828452
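To illustrate the signature/skip-list ideas above, here is a rough sketch of clustering on the first non-generic frame. It is not Socorro's real implementation, and the frame prefixes listed are hypothetical.

 #include <string>
 #include <vector>

 // Generic frames (allocators, assertion helpers, unsymbolized driver DLLs,
 // ...) are skipped so that crashes cluster on the first interesting frame.
 // These prefixes are illustrative, not Socorro's real configuration.
 static const std::vector<std::string> kSkipPrefixes = {
   "arena_malloc", "moz_xmalloc", "NS_DebugBreak", "ig9icd64.dll@"
 };

 static bool ShouldSkip(const std::string& aFrame) {
   for (const std::string& prefix : kSkipPrefixes) {
     if (aFrame.compare(0, prefix.size(), prefix) == 0) {
       return true;
     }
   }
   return false;
 }

 // Build a signature from the top of the stack, ignoring generic frames.
 std::string MakeSignature(const std::vector<std::string>& aStack) {
   for (const std::string& frame : aStack) {
     if (!ShouldSkip(frame)) {
       return frame;
     }
   }
   return aStack.empty() ? std::string("EMPTY") : aStack[0];
 }

The real system builds multi-frame signatures and normalizes frame names; the skip list is just the part the improvements above are about.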
User crash report (manual)
Steps
- A developer, QA person or advanced user hits a crash.
- That person files a bug report.
Pros and cons
- [+/-] We may or may not get clear steps to reproduce.
- [+] We usually get a crash report as well.
Possible improvements
- Improve coverage of web features and sites by providing test exercises for QA people and keen Nightly users to run regularly.
Crash on automation
Steps
- An automated non-TreeHerder job (e.g. a fuzzer, BugHunter) hits a crash.
- A monitoring person (e.g. fuzzer, QA person) files a bug report.
Note: many tests are performed regularly on TreeHerder and are cause for immediate patch backout if they crash. Such crashes do not match this path, but instead fall under the category of prevention (see below).
Pros and cons
- [+] We have a test case...
- [-] ...but reproduction may be difficult or unreliable.
Possible improvements
- Improve code coverage with more tests.
- Improve fuzzing coverage.
- Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
- Increase hardware available for fuzzing, for both greater throughput and better hardware configuration coverage.
- Increase fuzzing coverage of non-default options.
- Improve gtests, which are a good starting point for fuzzing.
- Improve reproducibility of these crashes.
- Use rr to record crashes so they can be played back reliably.
Analysis (automated)
Steps
- A non-TreeHerder dynamic analysis job (e.g. Valgrind, ASan, TSan, UBSan) or static analysis job (e.g. Coverity) detects a possible crash defect.
- A monitoring person files a bug report.
Note: some dynamic and static analysis jobs are performed regularly on TreeHerder and are cause for immediate patch backout if they fail. Crash defects found this way do not match this path, but instead fall under the category of prevention (see below).
Pros and cons
- [+] The defect is usually well-identified and actionable.
- [-] Not always clear how often the defect will cause crashes, though this depends on the tool. (E.g. ASan is clearer than TSan.)
Possible improvements
- Ensure sufficient triaging occurs.
- sewardj periodically runs Mochitests under Valgrind and files bugs for the defects he finds.
- Coverity triage: SoftVision is working on it; bug 1230156 is the tracking bug.
- bug 929478 Make TSan (ThreadSanitizer) usable with Firefox. (And eventually run TSan on automation in some fashion?)
- Try out UBSan?
- Run mochitests under Valgrind on automation twice a day as a Tier 2 job.
- Improve dynamic analysis coverage by using it in tandem with automation tools (e.g. fuzzing, BugHunter).
- Implement more internal verification ("extended assertions"), e.g. verify complex data structures, triggered manually (by users) or programmatically (by fuzzers). See the sketch after this list.
- bug 1276097 Add a bytecode sanity check
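As a sketch of what an "extended assertion" might look like (the data structure and its invariant are hypothetical; MOZ_DIAGNOSTIC_ASSERT is the existing MFBT macro):

 #include "mozilla/Assertions.h"  // MOZ_DIAGNOSTIC_ASSERT

 // Hypothetical doubly-linked list node.
 struct Node {
   Node* mPrev;
   Node* mNext;
 };

 // An "extended assertion": walk the whole structure and verify its invariants.
 // Too expensive to run on every operation, so it would be triggered on demand,
 // manually by a developer or programmatically by a fuzzer.
 void VerifyList(Node* aHead) {
   Node* prev = nullptr;
   for (Node* node = aHead; node; node = node->mNext) {
     MOZ_DIAGNOSTIC_ASSERT(node->mPrev == prev, "list back-link is broken");
     prev = node;
   }
 }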
Analysis (manual)
Steps
- A developer detects a possible crash defect via code inspection.
- The developer files a bug report.
Pros and cons
- [+] The defect is usually well-identified and actionable.
- [-] Not always clear if the defect will cause a crash or how likely it is.
Possible improvements
- ?
Paths from bug report to a diagnosis
As mentioned above, different kinds of bug report data have different likelihoods of leading to a diagnosis.
- High: Reliable test case or steps to reproduce.
- High: Well-identified defect, e.g. from analysis.
- Lower: Unreliable test case or steps to reproduce.
- Lower: A crash report alone.
Also, the following things increase the likelihood that a crash bug report will lead to a diagnosis.
- A needinfo request of an appropriate developer.
- Data indicating that the crash has high severity or frequency.
- Nagging.
Possible improvements
- Improve developer understanding of what's in a crash report.
- Improve documentation about crash reports.
- bug 1275172 Fix incorrect crash addresses on 64-bit
- bug 1275799 When viewing a crash report, add explanations of field meanings via tooltips
- Training?
- https://air.mozilla.org/a-talk-about-understanding-crash-reports/
- https://support.mozilla.org/en-US/kb/helping-crashes
- Improve data in crash reports.
- Don't omit frame pointers?
- bug 1268029 Use jit classifier to change signature for jit crashes.
- Improve data in GC crash reports.
- Improve data in shutdown hangs.
- Include unresolved Promises (and other expected callbacks).
- Include the source location (file & line number) of the Firefox-provided JS script being executed.
- Better identify threads on all platforms, including thread pools.
- Increase use of deliberate, informative aborts (e.g. MOZ_DIAGNOSTIC_ASSERT); see the sketch after this list.
- bug 1250687 Consider MiniDumpWithPrivateWriteCopyMemory for Windows minidumps
- bug 1251395 Include non-JIT, executable, private pages in crash dumps
- bug 1280442 Detect differences between live memory and the original binary.
- Identify correlations for crash signatures, e.g. this crash occurs more frequently or always when certain things are true.
- Increase the amount of tracking and nagging in crash bugs, e.g. from EPMs.
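As a sketch of the "improve data" and "informative aborts" ideas above, assuming the usual nsExceptionHandler.h annotation API; the function and the annotation key are hypothetical.

 #include "mozilla/Assertions.h"  // MOZ_DIAGNOSTIC_ASSERT
 #include "nsExceptionHandler.h"  // CrashReporter::AnnotateCrashReport
 #include "nsString.h"

 // Hypothetical example: record extra state before an operation that is known
 // to hang or crash in the wild, so the resulting reports are actionable.
 void StartShutdownStep(const nsACString& aStepName) {
   // "ShutdownStep" is a made-up annotation key, for illustration only.
   CrashReporter::AnnotateCrashReport(NS_LITERAL_CSTRING("ShutdownStep"),
                                      aStepName);

   // Prefer an informative, deliberate abort over a mysterious crash later.
   MOZ_DIAGNOSTIC_ASSERT(!aStepName.IsEmpty(), "shutdown step has no name");
 }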
Addressing a diagnosed crash
Once the cause of a crash in a bug report has been diagnosed, it can be addressed. The possible ways to address a crash -- and to prevent similar crashes in the future -- depend on what kind of defect it is.
A defect in our code
Ways to address
- Fix the defect.
- Back out the change that introduced the defect.
- Disable the code or feature containing the defect.
- Recover and continue? (Difficult in general.)
Ways to mitigate
- Recover, e.g. restart e10s content processes.
Ways to prevent (by making certain defects less likely or impossible)
- Reimplement C and C++ components in JS or Rust.
- bug 1135640 Add Rust compiler support.
- Eliminate crash-prone code patterns (see the sketch after this list).
- Low-level, e.g. use smart pointers.
- bug 1272203 Add mozilla::NotNull to MFBT.
- bug 1277368 Introduce mozilla::Result<T, E> for fallible return values.
- High-level, e.g. disallow binary extensions.
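A rough sketch of the low-level pattern changes above, using mozilla::NotNull (bug 1272203) and mozilla::Result (bug 1277368). The parser, its error type, and the exact Result API shown are illustrative only, since Result was still being settled at the time.

 #include <stddef.h>           // size_t
 #include <stdint.h>           // uint8_t
 #include "mozilla/NotNull.h"  // mozilla::NotNull
 #include "mozilla/Result.h"   // mozilla::Result, Ok, Err

 using mozilla::Err;
 using mozilla::NotNull;
 using mozilla::Ok;
 using mozilla::Result;

 // Hypothetical error type for a hypothetical parser.
 enum class ParseError { Truncated, BadMagic };

 // NotNull documents (and enforces) that aData can never be null, ruling out a
 // class of null-dereference crashes at this boundary; Result makes the
 // fallibility visible in the type so callers cannot silently ignore it.
 Result<Ok, ParseError> ParseFrame(NotNull<const uint8_t*> aData, size_t aLen) {
   if (aLen < 4) {
     return Err(ParseError::Truncated);
   }
   if (*aData != 0x7F) {  // hypothetical magic byte
     return Err(ParseError::BadMagic);
   }
   // ... parse the rest of the frame ...
   return Ok();
 }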
Ways to prevent (by detecting and backing defective code out immediately)
- Use compiler annotations and warnings more (see the sketch after this list).
- bug 1268766 Use MOZ_MUST_USE everywhere.
- bug 1272513 Enable -Wshadow warnings.
- Use more static analysis in TreeHerder jobs.
- Add more checks to the clang static analysis job.
- Add more dynamic analysis jobs to TreeHerder.
- bug 1030826 Support AddressSanitizer builds on Windows with clang-cl.
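A small sketch of the MOZ_MUST_USE idea, assuming the MFBT macro from mozilla/Attributes.h; the function here is hypothetical.

 #include <stddef.h>              // size_t
 #include "mozilla/Attributes.h"  // MOZ_MUST_USE

 // Hypothetical fallible operation; the annotation makes the compiler warn if
 // a caller ignores the return value, so unchecked failures are caught before
 // they turn into crashes.
 MOZ_MUST_USE bool ReserveBuffer(size_t aBytes) {
   return aBytes <= 1024 * 1024;  // pretend larger reservations fail
 }

 void Caller() {
   ReserveBuffer(4096);         // compiler warning: result ignored
   if (!ReserveBuffer(4096)) {  // OK: the failure case is handled
     // Fall back to a smaller buffer, or abandon the operation.
   }
 }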
A defect in something we rely on
Some crashes are caused by defects in things we depend on: the OS, gfx drivers, CPUs (e.g. the infamous AMD bug), RAM, disks.
Ways to address
- In general, we cannot fix or prevent these defects.
- Avoid the defect.
- Disable hardware acceleration in the presence of buggy drivers.
- Rewrite code that tickles the AMD bug.
- Tolerate
- Handle gfx driver resets.
- Detect and inform user
- Detect if Firefox is misinstalled (e.g. perform checksums on files) and ask the user to reinstall. (bug 1270232 and bug 1194856 are bugs where a bad install might be the cause.)
- Detect bad memory/CPU and inform user? (Or tolerate?)
- We have evidence that 15--20% of JIT crashes might be due to a particular buggy AMD CPU, and another 23--37% might be due to bit flips caused by bad RAM.
- See also this paper about memory errors being not uncommon.
- bug 995652 Run memtest from the crash reporter.
- bug 1270554 Run memtest continuously on the live browser.
- Overclocking may cause some crashes?
- Detect bad disk and inform user? (Or tolerate?)
- bug 1274428 Mark crashes that happen at invalid instruction pointers?
- bug 1274628 Annotate crashes where the code around the instruction pointer is not the code that's in the binary we ship.
A defective interaction
Some crashes are caused by defective interactions with an external entity (e.g. malware and/or anti-virus software).
Ways to address
- Prevent
- Windows 10 has better blocking, at least for content processes?
- Tolerate?
- Evangelism
- bug 1229252 startup crashes in nss3.dll@0x1e5b60 | GetFileInfo, due to software from ESET (eOppMonitor.dll)
- Inform user?
OOMs
OOMs are a bit different from other crashes in terms of cause, though no different in terms of their effect on users.
Ways to address
- Reducing memory usage in general helps.
- "Large" OOMs: make large allocations fallible, especially those whose size is controlled by user input or web content.
- "Small" OOMs: for Windows users, migrate them to 64-bit Firefox where possible.
Data Sources and Useful Links
- Crash Stats: entry point to crash reports.
- Platform/Uptime/Logs has some more specific links useful when analyzing crash reports.
- ActiveData (about) (try it)
- Individual unit test results
- Buildbot steps, and MozHarness steps with success/failure
- Bugzilla API (about)
- Bugzilla ES Cluster (about) easy-to-query historical data
Administration
Things to be decided.
- Regular meetings: content and time TBD
- Progress tracking: high-level stability measurements, bug lists, etc. Links to those.
- Communication:
- IRC: #uptime
- Email: projectuptime-team@mozilla.com
- Vidyo: ProjectUptime
Crash stats checking rosters
Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed here. The current roster for checking Nightly build crashes:
- Monday (Australian time): njn analyzes Friday's build.
- Monday (US East time): marcia analyzes Saturday's build.
- Tuesday (Taiwan time): ting analyzes Sunday's build.
- Wednesday (Taiwan time): kanru analyzes Monday's build.
- Wednesday (US East time): jchen analyzes Tuesday's build.
- Thursday (US West time): mccr8 analyzes Wednesday's build.
- Friday (US West time): dbaron analyzes Thursday's build.
A rough guide to Nightly crash analysis. We keep notes on what we do.