Revision as of 05:09, 3 August 2016

Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.

David Bryant's internal email announcement said: "The overall responsibility to hit our stability target is something that falls on the entire platform organization, and in terms of priority, stability is second only to shipping e10s."

Goals

The basics

We can break Uptime's goals down into four sub-goals.

Reduce the Release crash rate, and keep it down. This gives a better experience for ~100M/7M users, and makes them less likely to switch to a different browser.
Reduce the Beta crash rate, and keep it down. This gives a better experience for ~2M/150k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goal #1.
Reduce the Aurora crash rate, and keep it down. This gives a better experience for ~100k/2k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1 and #2.
Reduce the Nightly crash rate, and keep it down. This gives a better experience for ~30k/1k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1, #2 and #3.

(Channel ADI numbers are approximate, given as desktop/mobile, and were taken from http://arewestableyet.com on Apr 26.)

These sub-goals are listed in order of priority. But note that the relative priorities do not match the relative user counts. E.g. even though Release has ~3000x as many users as Nightly, goal #1 is not 3000x more important than goal #4. The higher-numbered goals, as well as being (decreasingly) virtuous in their own right, partially subsume the lower-numbered goals, because a crash fixed or avoided in an earlier channel is one that doesn't show up in a later channel. (The later goals don't fully subsume the earlier goals because some crashes don't show up in significant numbers until we get to the larger populations of the later channels.)

There are multiple strategies (tools, processes, systems) to achieve these goals -- see below -- but those strategies should not be confused with the goals themselves.

Definition of "crash rate"

For a long time the crash rate metric was "crashes per 100 ADIs". This has various shortcomings and is in the process of being replaced with "crashes per 1000 hours of usage".

Not all crashes are equal. For example: persistent start-up crashes are terrible; with e10s enabled, content process crashes are not as bad as chrome process crashes; plugin crashes may be less important. It would be possible to adjust the per-1000-hours-of-usage measurements by weighting different types of crashes differently. However, any such formula would be arbitrary and not clearly better than the plain metric. Therefore, we will use the plain metric for defining "crash rate", and that is the value we want to get down.

Having said that, when considering a problem as complex as browser crashes, any single metric will have shortcomings and should not be the only thing paid attention to. In particular, we should build systems and processes (such as automated alerts) that help us give attention to the worst kinds of crashes.
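
The two plain metrics discussed in this section can be sketched as follows. This is only an illustration of the arithmetic: the function names and the sample numbers are made up for this example and do not come from any Mozilla dashboard or tool.

```python
def crashes_per_100_adis(crashes, adis):
    """Old metric: crashes per 100 active daily installs (ADIs)."""
    if adis <= 0:
        raise ValueError("adis must be positive")
    return 100.0 * crashes / adis


def crashes_per_1000_usage_hours(crashes, usage_hours):
    """New metric: crashes per 1000 hours of usage."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return 1000.0 * crashes / usage_hours


# Made-up inputs, purely to show the calculation.
old_rate = crashes_per_100_adis(crashes=1_500, adis=100_000)
new_rate = crashes_per_1000_usage_hours(crashes=1_500, usage_hours=250_000)
print(old_rate, new_rate)
```

The same crash count gives different values under the two metrics because the denominators measure different things (installs versus hours of use), which is why targets cannot be carried over from one metric to the other.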

Target crash rate

There are eight crash rates to track: {Release, Beta, Aurora, Nightly} x {desktop, mobile}.

The actual target values are TBD, and cannot be decided until the new per-1000-hours-of-usage metric has fully replaced the old per-100-ADIs metric and we know what our current values are.
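
As a small sketch, the eight rates are the cross product of the channels and the two products. The dictionary below is a hypothetical placeholder structure, with None standing in for targets that are still TBD.

```python
from itertools import product

CHANNELS = ("Release", "Beta", "Aurora", "Nightly")
PRODUCTS = ("desktop", "mobile")

# One slot per (channel, product) pair; None marks a target that is still TBD.
targets = {pair: None for pair in product(CHANNELS, PRODUCTS)}

for channel, prod in targets:
    print(channel, prod, targets[(channel, prod)])
```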

All about crashes

We use "crashes" to refer to several kinds of events.

Uncontrolled crashes: most commonly due to bad memory accesses, which are caused by bad pointers. Also sometimes illegal instructions, division-by-zero, etc.
Controlled aborts: MOZ_CRASH, MOZ_RELEASE_ASSERT, Rust panics, etc.
OOMs: running out of virtual or physical memory, or page file. Can be caused by Firefox using too much memory, or other processes on the system using too much memory. These usually present as controlled aborts.
Hangs: sometimes when a plugin or content process stops responding for too long, we will collect a dump from that process and then kill it. Most common at shutdown.

Crashes are grouped by process type:

When the "main" (chrome) process crashes, Firefox terminates.
When a content process crashes, Firefox itself stays open but some or all tabs must be reloaded.
When an NPAPI plugin or Gecko media plugin crashes, the web page will stay alive but some functionality may be broken and require the user to reload pages which used that plugin.

There are multiple different causes of crashes, and multiple ways they can be identified and addressed.

A word about coverage

For all of the following paths to a bug report (except static analysis and code inspection), coverage affects how effective the path is. There are several kinds of coverage.

Code coverage: e.g. which lines of code are we not testing?
Web feature coverage: e.g. which web features are we not testing?
Site coverage: e.g. which important websites are we not testing?
Hardware configuration coverage: e.g. which important gfx card/drivers combinations are we not testing?
Firefox configuration coverage: e.g. which add-ons or non-default options are we not testing?

Paths to a bug report

Almost without exception, for a crash to be fixed we must have a bug report for it. There are multiple paths from a crash to a bug report.

User crash report (automated)

Steps

A "normal" user hits a crash.
The crash reporter catches the crash.
The user consents to crash report submission.
The crash report is created.
The crash report is submitted, received and processed.
The crash report is clustered into a signature and ranked.
A Mozillian triages the signature and files a bug report.

Pros and cons

[+] This is the only path that gives us a clear idea of how frequent the crash is among our user population.
[+] Crash reports on Nightly provide fast feedback and small regression windows.
[-] This is a long, complex path with numerous points of potential failure.
[-] Submission rates vary, and are low for content, NPAPI and GMPlugin crashes.
[-] Clustering and ranking are imperfect. Sometimes crash reports with distinct causes get conflated into one signature. Sometimes crash reports with a single cause get separate signatures (people sometimes spot this and link multiple signatures to a single bug report). The ranking does not consider crash severity.
[-] Some crashes only manifest frequently in later release channels, where they affect more people, and are harder to address due to late feedback and large regression windows. Crashes that only manifest significantly on Release may require chemspills.
[-] Crash reports frequently lack enough info to diagnose/address. (E.g. no steps to reproduce, and it's obvious that a pointer was bad, but unclear how it got that way.) Or the info is present but easy to overlook.
[-] Crash report data can be unreliable. (E.g. incorrect crash addresses and stack traces.)
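
The clustering and ranking step above can be sketched roughly as follows. This is a deliberately simplified stand-in, not how Socorro actually generates signatures: the rule "join the top three frames", the report layout, and the frame names are all assumptions made for this example.

```python
from collections import Counter


def signature(frames, depth=3):
    """Build a signature from the top `depth` stack frames (simplified rule)."""
    return " | ".join(frames[:depth])


def rank_signatures(reports):
    """Cluster reports by signature and rank clusters by report count."""
    counts = Counter(signature(report["frames"]) for report in reports)
    return counts.most_common()


# Hypothetical crash reports; the frame names are invented.
reports = [
    {"frames": ["Widget::Paint", "Layout::Reflow", "main"]},
    {"frames": ["Parser::Next", "Parser::Run", "main"]},
    {"frames": ["Widget::Paint", "Layout::Reflow", "main"]},
]
ranked = rank_signatures(reports)
for sig, count in ranked:
    print(count, sig)
```

Even this toy version shows the weaknesses listed above: two crashes with different root causes but the same top frames collapse into one signature, and the ranking looks only at counts, not severity.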

User crash report (manual)

Steps

A developer, QA person or advanced user hits a crash.
That person files a bug report.

Pros and cons

[+/-] We may or may not get clear steps to reproduce.
[+] We usually get a crash report as well.

Crash on automation

Steps

An automated non-TreeHerder job crashes, e.g. fuzzers, BugHunter.
A monitoring person (e.g. fuzzer, QA person) files a bug report.

Note: many tests are performed regularly on TreeHerder and are cause for immediate patch backout if they crash. Such crashes do not match this path because no bug report is filed.

Pros and cons

[+] We have a test case...
[-] ...but reproduction may be difficult or unreliable.

Analysis

Steps

A non-TreeHerder dynamic analysis job (e.g. Valgrind, ASan, TSan, UBSan) or static analysis job (e.g. Coverity) detects a possible crash defect. Or, a developer identifies a possible crash defect via code inspection.
A monitoring person files a bug report.

Note: some dynamic and static analysis jobs are performed regularly on TreeHerder and are cause for immediate patch backout if they fail. Crash defects found this way do not match this path because no bug report is filed.

Pros and cons

[+] The defect is usually well-identified and actionable.
[-] Not always clear how often the defect will cause crashes, though this depends on the analysis kind. (E.g. ASan is clearer than TSan.)

Paths from bug report to a diagnosis

As mentioned above, different kinds of bug report data have different likelihoods of leading to a diagnosis.

High: Reliable test case or steps to reproduce.
High: Well-identified defect, e.g. from analysis.
Lower: Unreliable test case or steps to reproduce.
Lower: Crash report.

Also, the following things increase the likelihood that a crash bug report will lead to a diagnosis.

A needinfo request of an appropriate developer.
Evidence that the crash has high severity or frequency.
Nagging.

Addressing a diagnosed crash

Once the cause of a crash in a bug report has been diagnosed, it can be addressed. The possible ways to address a crash -- and to prevent similar crashes in the future -- depend on what kind of defect it is.

A defect in our code

Many crashes are due to defects in our code. Ways to address include: fix the defect; back out the defect; disable the defect; recover and continue (e.g. e10s content process restarting).

It's also possible to prevent entire classes of defects by using better code patterns and/or languages, or tools (e.g. dynamic analysis with sufficient coverage).

A defect in something we rely on

Some crashes are caused by defects in things we depend on: OS, gfx drivers, CPUs (e.g. the infamous AMD bug; overclocking can cause problems), RAM, disks. E.g. bug 1270232 and bug 1194856 are bugs where a bad installation might be the cause.

In general, we cannot fix or prevent these defects. Ways to address include: avoid the defect; detect and tolerate the defect; detect the defect and inform the user.

A defective interaction

Some crashes are caused by defective interactions with other software such as malware and/or anti-virus products. Sometimes these can be blocked, and sometimes we must rely on evangelism.

OOMs

OOMs are different to other crashes in terms of cause, though not any different in terms of the effect on users. In general, anything that significantly reduces memory usage helps, as does making large allocations fallible, especially those whose size is controlled by user input or web content. Using 64-bit builds also prevents address space exhaustion.
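
The idea of a fallible allocation can be illustrated with a short sketch. Firefox itself is written largely in C++ and Rust, so this Python version only shows the pattern: a request whose size comes from untrusted input returns a failure value the caller must handle, rather than aborting the process. The function name and the size cap are hypothetical.

```python
MAX_BUFFER_BYTES = 64 * 1024 * 1024  # hypothetical cap for this example


def try_allocate(size):
    """Return a zeroed buffer of `size` bytes, or None if the request fails."""
    # Reject sizes that are invalid or above the cap before touching memory.
    if not isinstance(size, int) or size < 0 or size > MAX_BUFFER_BYTES:
        return None
    try:
        return bytearray(size)
    except MemoryError:
        # The allocation itself failed; report failure instead of crashing.
        return None


buffer = try_allocate(1024)
if buffer is None:
    print("allocation failed; degrade gracefully")
else:
    print("allocated", len(buffer), "bytes")
```

The caller decides what to do on failure (skip the image, truncate the data, show an error), which turns a potential OOM crash into a recoverable condition.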

Tasks

Tasks are divided into numerous categories. Each category has one or both of (a) a meta-bug and a table showing the dependent bugs, and (b) a list.
Bugs indicate well-defined tasks.
List elements indicate (a) tasks that do not yet have a bug filed, (b) tasks that are not well specified or whose benefit is unclear (these often have a trailing '?'), and (c) broad topics best represented by a link to another page (such as a project tracking page).
The goal is for every category to end up with a meta-bug and table, and as few list elements as possible.
Not every bug that could be included in these tables should be. This is about tracking sizeable pieces of work, rather than enumerating every single related thing that has been done.
Some bugs are included in the tables in more than one section, because they are relevant to more than one section.
bug 1289677 is a top-level tracking bug that is blocked by all the category-level meta-bugs.

Crash rate tracking

This category is about the tracking of crash rates, using both crash report and telemetry data.

Switch from the per-100-ADI metric to the per-1000-hours metric
Better dashboards?
Track other metrics that may provide insight into how we can improve things?

Pre-release coverage

This category is about the usage and coverage (code, feature, site, hardware, and configuration coverage) of users on pre-release channels, especially Nightly.

bug 1280394 [meta] Increase the number of Nightly users [pascal, marcia]
Provide test exercises for QA people and keen Nightly users to run regularly [marcia]
Improve earlier channel populations to be more representative of the Release population [lonnen?]

Crash report creation

This category is about the creation and contents of raw crash reports as collected on the client.

Bugs blocking bug 1289663: [Uptime] Crash report creation

Full Query

ID	Priority	Summary	Keywords	Assigned to
1250687	P3	Consider MiniDumpWithPrivateWriteCopyMemory for Windows minidumps		Carl Corcoran [:ccorcoran]
1251395	P3	Include non-JIT, executable, private pages in crash dumps
1277448	--	unify MozCrashReason and AbortMessage crash annotations (and use __func__ in them and for assertions)
1280469	P2	[meta] Client-side stack walking	meta	Gabriele Svelto [:gsvelto]
1286802	--	Add heap regions of the crash context to minidump (Windows)		Cervantes Yu [:cyu] [:cervantes]
1295918	--	Include JS stacks in crash report
1295934	--	Add a crash report annotation when we hit DoneStartingUp		Nicholas Nethercote [inactive]
1309573	--	Many places (including all of SpiderMonkey) cannot set the crash reason in crash reports		Emanuel Hoogeveen [:ehoogeveen]
1334027	--	Add unloaded modules and process/thread data to minidumps		Ting-Yu Chou [:ting] (away)
1337688	--	Remove NIGHTLY_BUILD wrapping if the increased size from adding unloaded modules and process/thread data to minidumps is acceptable
1351277	--	Add heap regions of the crash context to minidump (MacOS and Linux)
1488404	--	Consider including heap data in minidump for beta release channel

12 Total; 12 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Include JS execution info (stacks? recent JS file/functions we called into? unresolved promises?)
Include more URL data (URLS of all open tabs? recent URLs in crashing tab? non-anonymized memory reports?); might require policy/legal approval
Don't omit frame pointers?
Better identify threads of all platforms, including threadpools

Crash report submission

This category is about increasing the crash report submission rate.

Bugs blocking bug 1289671: [Uptime] Crash report submission

Full Query

ID	Priority	Summary	Keywords	Assigned to
1269998	P1	Prompt users with pending crash reports to submit them		Brad Lassey [:blassey] (use needinfo?)
1270553	--	Allow users to opt-in to auto submitting crash data
1280469	P2	[meta] Client-side stack walking	meta	Gabriele Svelto [:gsvelto]
1287178	--	Refactor unsubmitted crash report handling and allow users to always send backlogged crash reports	dev-doc-needed	Mike Conley (:mconley) (:⚙️)
1333125	P4	Improve HTTP proxy support in the crashreporter client and pingsender

5 Total; 5 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Always submit non-sensitive data, and make only the sensitive data part (e.g. minidumps) optional
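
A minimal sketch of that split, assuming a crash report is a flat mapping of fields. The field names and the choice of which fields count as sensitive are hypothetical and would in practice need policy/legal input.

```python
# Hypothetical set of fields treated as sensitive in this example.
SENSITIVE_FIELDS = {"minidump", "url", "email", "comments"}


def build_payload(report, user_consented):
    """Always include non-sensitive fields; add sensitive ones only with consent."""
    if user_consented:
        return dict(report)
    return {key: value for key, value in report.items()
            if key not in SENSITIVE_FIELDS}


report = {
    "signature": "Widget::Paint",   # invented values
    "channel": "nightly",
    "minidump": b"\x00\x01",
    "url": "https://example.com/",
}
without_consent = build_payload(report, user_consented=False)
with_consent = build_payload(report, user_consented=True)
print(sorted(without_consent), sorted(with_consent))
```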

Crash report handling

This category is about how crash reports are processed, clustered, analyzed, and triaged once they are received by Socorro.

Bugs blocking bug 1289676: [Uptime] Crash report handling

Full Query

ID	Priority	Summary	Assigned to
974420	--	Addresses >128TB are displayed as 0xffffffffffffffff on crash-stats
977778	P3	Allow users to request get-minidump-instructions report on-demand
1268029	--	Use jit classifier to change signature for jit crashes	[DEACTIVATED] Adrian Gaudebert
1273657	--	[tracker] Publish public crash stats to the data platform	Peter Bengtsson [:peterbe]
1274345	--	Add support for skipping a dll in the signature	[DEACTIVATED] Adrian Gaudebert
1274428	--	Mark crashes that happen at invalid instruction pointers?
1274628	P3	Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
1277337	P3	Use hg.mozilla.org to map crashes to bug components by way of source files when possible
1291173	--	Show important info from memory reports in crash-stats	Nicholas Nethercote [inactive]
1297966	--	Show the new "StartupCrash" annotation in the crash report page	Marco Castelluccio [:marco]
1305888	--	Add the new CPU microcode annotation to SuperSearch	[DEACTIVATED] Adrian Gaudebert
1306891	--	Integrate correlation results from https://mozilla.github.io/stab-crashes/correlations.html on crash-stats	Marco Castelluccio [:marco]
1308474	--	Add the new StartupCrash annotation to SuperSearch	[DEACTIVATED] Adrian Gaudebert
1308476	--	Replace the current heuristic for startup crashes by using the new StartupCrash annotation	[DEACTIVATED] Adrian Gaudebert

14 Total; 14 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Clouseau: Automatically identify changesets that cause regressions [calixte]
crash-correlations: Identify correlations for crash signatures [mcastellucio]
Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs)

Note: a cross-variate analysis of FHR data, by Brendan Colloran, may have useful techniques.

Crash cluster ranking

This category is about how each cluster of crash reports (e.g. those with the same signature) is prioritized, whether by frequency or other means.

Take into account crash severity as well as frequency
Use Crystal Ball (or other means) to identify how minor crashes on early release channels might become major crashes in later channels
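
One possible shape for a severity-aware ranking is sketched below. The weights are arbitrary placeholders chosen for this example (the "Definition of crash rate" section explains why any such formula is arbitrary), and the cluster tuples are invented.

```python
# Hypothetical severity weights per crash kind; not an agreed-upon scale.
SEVERITY_WEIGHT = {"startup": 5.0, "chrome": 3.0, "content": 1.0, "plugin": 0.5}


def rank_clusters(clusters):
    """Rank (signature, kind, count) clusters by weight * count, highest first."""
    scored = [(sig, SEVERITY_WEIGHT[kind] * count) for sig, kind, count in clusters]
    return sorted(scored, key=lambda item: item[1], reverse=True)


clusters = [
    ("sig-a", "content", 400),
    ("sig-b", "startup", 100),
    ("sig-c", "plugin", 600),
]
ranked = rank_clusters(clusters)
for sig, score in ranked:
    print(sig, score)
```

With these made-up weights, the start-up crash outranks the plugin crash even though the plugin crash is six times as frequent, which is the kind of reordering this task is meant to explore.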

Crash report comprehensibility

This category is about making the contents of crash reports easier to understand, via better presentation and documentation.

Bugs blocking bug 1289675: [Uptime] Crash report comprehensibility

Full Query

ID	Priority	Summary	Assigned to
1275799	--	Add descriptions to crash report fields	[DEACTIVATED] Adrian Gaudebert
1288309	--	Improve documentation about individual crash reports	Nicholas Nethercote [inactive]
1288310	--	Improve documentation about analyzing clusters of crash reports	Nicholas Nethercote [inactive]

3 Total; 3 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Fuzzing

This category is about fuzzing and similar automatic test exploration tools such as BugHunter.

bug 828452 Generate consistent signatures across crash-stats/fuzzing/BugHunter/automated tests (Socorro API for generating signatures?)
bug 1289194 Add LibFuzzer support for testing xul code
Better isolate components so they can be fuzzed more easily (like the JS shell)
Increase hardware available for fuzzing, for both greater throughput and better hardware configuration coverage
Increase fuzzing coverage of non-default options
Improve gtests, which are a good starting point for fuzzing
Run BugHunter with common antivirus software [tomcat]

Dynamic analysis

This category is about the use of dynamic analysis tools.

Bugs blocking bug 1289646: [Uptime] Dynamic analysis

Full Query

ID	Priority	Summary	Keywords	Assigned to
929478	--	[meta] Make TSan (ThreadSanitizer) usable with Firefox	meta, sec-want
1030826	--	Support AddressSanitizer builds on Windows with clang-cl	sec-want
1280637	--	Add a TSan-enabled TaskCluster JS shell job		Steve Fink [:sfink] [:s:]
1284975	P2	[meta] Make SpiderMonkey clean on UBSan	meta
1288596	--	Make SM-tc(msan) a first class test job		Steve Fink [:sfink] [:s:]
1288993	--	Run valgrind-mochitest twice a day as a Tier 2 job		Joel Maher ( :jmaher ) (UTC -8)
1289994	P3	[meta] Use Application Verifier	meta
1291954	P3	Make SM(tsan) a tier 1 build		Steve Fink [:sfink] [:s:]

8 Total; 8 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Static analysis

This category is about the use of static analysis tools.

Bugs blocking bug 1289660: [Uptime] Static analysis

Full Query

ID	Priority	Summary	Keywords	Assigned to
1230156	P3	[meta] Coverity Static Analysis fixes	coverity, meta
1272513	--	Enable -Wshadow warnings		Chris Peterson [:cpeterson]

2 Total; 2 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Add more checks to the clang static analysis job [sledru, etc?]

Low-level defect prevention and detection

This category covers low-level changes we can make to the code to prevent entire classes of defects, such as using smart pointers and compiler annotations, and also changes we can make to detect defects, such as adding assertions and internal consistency checks.

Bugs blocking bug 1289662: [Uptime] Low-level defect prevention and detection

Full Query

ID	Priority	Summary	Keywords	Assigned to
1268766	--	Use [[nodiscard]] everywhere
1272203	--	Add mozilla::NotNull to MFBT		Nicholas Nethercote [inactive]
1276097	P5	Add a bytecode sanity check
1277368	P3	[meta] Use mozilla::Result<T, E> for fallible return values in the JS engine	meta, triage-deferred	Jan de Mooij [:jandem]

4 Total; 4 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Something like v8's --debug-heap, which checks the GC heap

High-level defect prevention

This category covers high-level changes we can make to the code to avoid entire classes of defects, such as architectural or language changes.

Oxidation: integrate Rust code into Firefox [larsberg, etc.]
Ideas to prevent shutdown hangs [nical]

Defective software

This category is about actively tolerating or responding to defective software (OS, drivers).

Disable hardware acceleration in the presence of buggy gfx drivers
Handle gfx driver resets

Defective hardware

This category is about detecting and tolerating defective hardware: CPUs, memory, disks, etc.

Bugs blocking bug 1289666: [Uptime] Defective hardware

Full Query

ID	Priority	Summary	Keywords	Assigned to
995652	P3	Run memtest from the crash reporter
1270554	--	Run memtest continuously on the live browser
1274428	--	Mark crashes that happen at invalid instruction pointers?
1274628	P3	Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
1281759	P3	Work around mysterious AMD JIT crashes	crash	Jan de Mooij [:jandem]
1293188	P3	Crash in EnterBaseline (can be defective hardware [Mem or VGA])	crash, triage-deferred
1293996	P1	Crash in adapt_probs	crash, regression	Jean-Yves Avenard [:jya]
1317253	P5	Best-effort detection of faulty memory at page-request time

8 Total; 8 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Detect if Firefox is mis-installed (e.g. perform checksums on files) and ask the user to reinstall.
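
The checksum idea could look roughly like this. It is a sketch under stated assumptions: it presumes a manifest mapping relative file paths to expected SHA-256 hex digests ships with the build, which is hypothetical, and the function names are invented for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_files(install_dir, manifest):
    """Return relative paths that are missing or whose hash differs."""
    bad = []
    for relative_path, expected in sorted(manifest.items()):
        candidate = Path(install_dir) / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(relative_path)
    return bad
```

If find_corrupt_files returns a non-empty list, the installation does not match what was shipped, and the user could be asked to reinstall.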

Malware, etc.

This category is about preventing malware, anti-virus, and other third-party code from interfering with Firefox.

Windows 10 has better blocking, at least for content processes? (e.g. a DLL whitelist)
Reduce export space of xpcom symbols

OOMs

This category is about avoiding and tolerating OOM crashes.

Bugs blocking bug 1291174: [Uptime] OOMs

Full Query

ID	Priority	Summary	Assigned to
1291068	--	Large-scale analysis of OOM crash reports with ContainsMemoryReport=1
1291173	--	Show important info from memory reports in crash-stats	Nicholas Nethercote [inactive]
1299747	--	Create a tool to track 64k unaligned virtual memory allocation	Ting-Yu Chou [:ting] (away)

3 Total; 3 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Increase usage of 64-bit Firefox on Windows: Firefox/win64 [cpeterson, etc.]
Discuss common OOM cases with partners [harald?]
MemShrink

Data Sources and Useful Links

Crash Stats: entry point to crash reports.
Platform/Uptime/Logs has some more specific links useful when analyzing crash reports.
Data mining
- ActiveData: a giant queryable database of unit test results, Buildbot steps, and MozHarness steps with success/failure outcomes
- Bugzilla REST API
- Bugzilla ElasticSearch Cluster: easy-to-query historical data
Test code coverage
Fuzzing Explained - A Guide for Software Developers
Crash rate dashboards
- ADI & Telemetry Crash Rate (Release) [harald]
- Telemetry Crash Rate (Beta) [harald]
- Crashes per 1000 usage hours (Beta, DevEd, Nightly) [bsmedberg]

Participants

The grouping below is by time zone.

Taiwan

Kan-Ru Chen, platform engineering
Ting-Yu Chou, platform engineering

Australia (Eastern)

Nicholas Nethercote, platform engineering

North America (Pacific)

Andrew McCreight, platform engineering
Anthony Vaughn, project management
Chris Lonnen, Socorro
David Baron, platform engineering

North America (Eastern)

Jim Chen, platform engineering
Kyle Lahnakoski, data visualization/analysis
Marcia Knous, Release Management
Milan Sreckovic, platform engineering

Europe (Western)

Marco Castelluccio, crash report analysis and automation

Europe (Central)

Calixte Denizet, crash report analysis and automation
Julian Seward, dynamic analysis

Administration

Regular meetings: content and time TBD.

Progress tracking: high-level stability measurements, bug lists, etc., TBD.

Communication:

IRC: #uptime
Email: projectuptime-team@mozilla.com
Vidyo: ProjectUptime

Crash stats checking rosters

Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed here. The current roster for checking Nightly build crashes:

Monday (Australian time): njn analyzes Friday's build.
Monday (US East time): marcia analyzes Saturday's build.
Tuesday (Taiwan time): ting analyzes Sunday's build.
Wednesday (Taiwan time): kanru analyzes Monday's build.
Wednesday (US East time): jchen analyzes Tuesday's build.
Thursday (US West time): mccr8 analyzes Wednesday's build.
Friday (US West time): jchen analyzes Thursday's build.

A rough guide to Nightly crash analysis. We keep notes on what we do.


Platform/Uptime: Difference between revisions
