Platform/Uptime
Uptime's mission is to not only reduce the crash rate of Firefox (desktop and mobile) but also to keep that crash rate down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.
Goals Brainstorm
We can break Uptime's mission statement down into four sub-goals.
1 Reduce the Release crash rate, and keep it down.
- This gives a better experience for ~100M/7M users.
- Which makes them less likely to switch to a different browser.
2 Reduce the Beta crash rate, and keep it down.
- This gives a better experience for ~2M/150k users.
- Which makes them less likely to switch to a different browser or a later channel.
- This goal partially subsumes goal #1.
3 Reduce the Aurora crash rate, and keep it down.
- This gives a better experience for ~100k/2k users.
- Which makes them less likely to switch to a different browser or a later channel.
- This goal partially subsumes goals #1 and #2.
4 Reduce the Nightly crash rate, and keep it down.
- This gives a better experience for ~30k/1k users.
- Which makes them less likely to switch to a different browser or a later channel.
- This goal partially subsumes goals #1, #2 and #3.
(Channel ADI numbers are approximate and given as desktop/mobile, taken from http://arewestableyet.com on Apr 26.)
These sub-goals are listed in order of priority. But note that the relative priorities do not match the relative user counts. E.g. even though Release has over 1000x as many users as Nightly, goal #1 is certainly not 1000x more important than goal #4. The higher-numbered goals, as well as being (decreasingly) virtuous in their own right, partially subsume the lower-numbered goals, because a crash fixed or avoided in an earlier channel is one that doesn't show up in a later channel. (The higher-numbered goals don't fully subsume the lower-numbered goals because some crashes don't show up in significant numbers until we get to the larger populations of the later channels.)
There are multiple strategies (tools, processes, systems) to achieve these goals, but those strategies should not be confused with the goals themselves.
Specifics to Consider
What is the definition of a "crash"?
- Definitely: Obvious "true" crashes, e.g. segmentation fault.
- Definitely: Deliberate aborts, e.g. MOZ_CRASH.
- Probably: Hangs.
- Not sure: Plugins?
What is the definition of "crash rate"?
- http://arewestableyet.com has the following, all measured per 100 ADI:
- browser + content processes
- browser crashes
- startup crashes
- content crashes
- plugin crashes
- plugin hangs
- unclear which of these are subsets of or overlap others
- possible weightings:
- weight start-up crashes more heavily than other crashes?
- if persistent, they almost guarantee a lost user
- can we distinguish "occasional start-up" crashes from "persistent start-up" crashes?
- e10s: weight chrome process crashes more heavily than content process crashes?
- content process crashes don't take down the whole browser
- weight plugin crashes less?
- it's more obvious that it's not Firefox's fault
- that's if we are counting them at all?
- weight shutdown hangs/crashes less?
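To make the weighting idea concrete, here is a minimal sketch of a per-100-ADI rate and a weighted combination of crash kinds. The weight values and the function names are hypothetical, chosen only for illustration; nothing above settles on actual weights. The sketch also assumes the crash kinds are disjoint, which is one of the open questions listed above.

```python
# Hypothetical weights, for illustration only; no values have been decided.
EXAMPLE_WEIGHTS = {
    "startup": 2.0,   # weighted more heavily: persistent ones can lose a user
    "browser": 1.0,
    "content": 0.5,   # with e10s, does not take down the whole browser
    "plugin": 0.25,   # more obviously not Firefox's fault
}


def rate_per_100_adi(crashes, adi):
    """Crashes per 100 active daily installations (ADI)."""
    if adi <= 0:
        raise ValueError("ADI must be positive")
    return 100.0 * crashes / adi


def weighted_rate_per_100_adi(crash_counts, adi, weights):
    """Weighted sum of per-kind crash rates; unlisted kinds get weight 1.0.

    Assumes the kinds in crash_counts do not overlap.
    """
    return sum(
        weights.get(kind, 1.0) * rate_per_100_adi(count, adi)
        for kind, count in crash_counts.items()
    )


if __name__ == "__main__":
    counts = {"startup": 10, "browser": 100, "content": 200, "plugin": 40}
    print(weighted_rate_per_100_adi(counts, 10_000, EXAMPLE_WEIGHTS))
```

A single weighted number is easier to track over time, but it hides which kind of crash moved, so the unweighted per-kind rates are probably still worth keeping alongside it.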
What are the target crash rates?
- Eight of them: {Release, Beta, Aurora, Nightly} x {desktop, mobile}
- No idea what actual values to choose here
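As a sketch of how the eight targets might be tracked once values exist, the following enumerates the channel/platform pairs and flags any pair whose measured rate exceeds its target. The helper name and the use of `None` for "not yet chosen" are assumptions made for this example; no target values are implied.

```python
from itertools import product

CHANNELS = ("Release", "Beta", "Aurora", "Nightly")
PLATFORMS = ("desktop", "mobile")

# None marks a target that has not been chosen yet.
targets = {(channel, platform): None
           for channel, platform in product(CHANNELS, PLATFORMS)}


def unmet_targets(measured, targets):
    """Return (channel, platform) pairs whose measured rate exceeds a set target.

    Pairs with no target yet (None) are skipped rather than reported.
    """
    return sorted(
        key for key, target in targets.items()
        if target is not None and measured.get(key, 0.0) > target
    )
```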
Strategies for improvement
Reactive
Reactive strategies are those that help us better identify, diagnose and fix crash-prone code once it has shipped in a Firefox build (from Nightly through to Release). Reactive strategies are based around crash reports. The following is a list of ideas.
- Improve manual inspection of crash reports; ensure all significant crashes on all release channels are checked in a timely fashion.
- The Nightly roster is below.
- Improve automated analysis of crash reports.
- Use Crystal Ball to identify and predict correlations between crashes on release and earlier channels.
- Use Clouseau to automatically detect new crashes and trace them back to recent changes around the location of the crash and notify whoever most recently changed that code.
- High-level analysis to identify which crash kinds are most common.
- Increase the number of users on Nightly, Aurora and Beta, because higher numbers of crash reports from earlier channels make it more likely that new regressions are caught earlier.
- Ideas for Nightly from Marcia Knous.
- Improve crash report aggregation and presentation, to make it easier to identify important crashes.
- Improve crash reports that have little actionable data, e.g. many JIT and GC crashes.
- bug 1268029 should help somewhat with JIT crashes.
- Ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs).
- Gather more runtime data at the time of crash, e.g.:
- Unresolved Promises (and other expected callbacks),
- file and position in the JS script being interpreted, at least for Firefox-provided scripts,
- Better identify threads on all platforms, including thread pools.
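The "detect new crashes and notify whoever last changed the code" idea can be sketched roughly as follows. This is not how Clouseau is implemented; it is a simplified illustration, and the function names, the signature strings, and the `last_changed_by` mapping are all hypothetical.

```python
def new_signatures(previous_builds, latest_build):
    """Signatures seen in the latest build but in none of the previous ones.

    previous_builds: iterable of collections of signature strings
    latest_build: collection of signature strings
    """
    seen = set()
    for signatures in previous_builds:
        seen |= set(signatures)
    return sorted(set(latest_build) - seen)


def people_to_notify(signature_files, last_changed_by):
    """Map each signature to whoever last changed the file it crashed in.

    signature_files: signature -> source file path of the crashing frame
    last_changed_by: source file path -> most recent author (assumed to come
    from version control history)
    """
    return {
        signature: last_changed_by.get(path, "unknown")
        for signature, path in signature_files.items()
    }
```

In practice a brand-new signature can also be an old crash whose top frame moved, so a real tool would likely need fuzzier matching than exact set difference.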
Also, a non-trivial fraction of crashes may be due to users having faulty hardware.
- Run a memory check upon crashing, probably based on some heuristic such as crash frequency, and inform the user if they have faulty memory.
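One possible form of the crash-frequency heuristic is sketched below: trigger the memory check only when a user has crashed at least a given number of times within a recent window. The window length and threshold are placeholder assumptions, not values proposed anywhere above.

```python
from datetime import datetime, timedelta


def should_run_memory_check(crash_times, now,
                            window=timedelta(days=7), threshold=3):
    """True if at least `threshold` crashes occurred within `window` before `now`.

    crash_times: iterable of datetime objects, one per recorded crash.
    Crashes timestamped after `now` are ignored.
    """
    recent = [t for t in crash_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold
```

Keeping the check rare matters because a memory test is slow, and most users who crash once do not have faulty hardware.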
Proactive
Proactive strategies are those that help us prevent crash-prone code from shipping. The following is a list of possibilities.
- Improve test coverage.
- Improve fuzzing coverage.
- Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
- Record non-reproducible fuzz crashes in rr so they can be played back reliably.
- Extend use of static analysis.
- Compiler warnings
- bug 1268766 is about adding MOZ_MUST_USE annotations throughout the code.
- clang's static analysis
- Coverity
- Extend use of dynamic analysis.
- ASan: Already used in automation on Linux. bug 1030826 is for extending that to Windows.
- TSan: bug 929478 is the meta-bug tracking progress.
- Valgrind:
- Eliminate crash-prone code patterns.
- Low-level, e.g. replace raw pointers with smart pointers such as UniquePtr.
- High-level, e.g. disallow binary extensions.
- Implement more internal verification ("extended assertions"), e.g. verify complex data structures such as compiler IR.
- Reimplement existing C and C++ components in less crash-prone languages (e.g. JavaScript, Rust).
- bug 1135640 is about adding support for Rust code.
- Better utilize available OS protection against malware (largely a Windows-only issue).
- Windows 10 has better protection, at least for content processes?
- Recover from crash-causing events and continue. Difficult in general, may be possible in restricted cases?
- Use data from users (telemetry, crash reports, etc.) to determine gaps in coverage in Nightly/Aurora/Beta, and then use that to drive additional testing, channel recruitment, etc. Types of coverage:
- Web feature coverage (e.g. are we not testing web workers enough?)
- Site coverage (e.g. are we not testing Facebook enough?)
- Hardware configuration coverage (e.g. are we not testing particular graphics card/driver combinations enough?)
- Firefox configuration coverage (e.g. are we not testing particular add-ons or non-standard options enough?)
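The coverage-gap idea above can be illustrated with a small sketch that compares how common a configuration is on Release with how common it is on a pre-release channel. The input format (configuration name mapped to fraction of users), the `min_ratio` cutoff, and the function name are assumptions for this example; real telemetry data would need its own extraction step.

```python
def coverage_gaps(release_share, prerelease_share, min_ratio=0.5):
    """Configurations that are under-represented on a pre-release channel.

    Both arguments map a configuration name to its fraction of users (0..1).
    A configuration is reported when its pre-release share is less than
    `min_ratio` times its Release share. Results are sorted with the
    largest gap (smallest ratio) first.
    """
    gaps = []
    for config, release in release_share.items():
        if release <= 0:
            continue  # nothing to compare against
        ratio = prerelease_share.get(config, 0.0) / release
        if ratio < min_ratio:
            gaps.append((config, ratio))
    return sorted(gaps, key=lambda item: item[1])
```

The same comparison applies to any of the coverage types listed: web features, sites, hardware, or Firefox configuration, as long as each can be reduced to a share of users per channel.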
Administration
Things to be decided.
- Regular meetings: content and time TBD
- Progress tracking: high-level stability measurements, bug lists, etc. Links to those.
- Communication:
- IRC: #uptime
- Email: projectuptime-team@mozilla.com
- Vidyo: ProjectUptime
Crash stats checking rosters
Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed here. The current roster for checking Nightly build crashes:
- Sunday (late): njn analyzes Friday's and Saturday's builds
- Monday (late): njn analyzes Sunday's and Monday's builds
- Wednesday: jchen analyzes Tuesday's builds
- Thursday: mccr8 analyzes Wednesday's builds
- Friday: dbaron analyzes Thursday's builds
We keep notes on what we do.