Platform/Uptime: Difference between revisions
Nnethercote (talk | contribs) |
(→Reactive: added 'Gather more runtime data' suggestion) |
Revision as of 02:06, 4 May 2016
Uptime is a project that aims to improve Firefox's stability, i.e. reduce its crash rate.
Uptime is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.
Strategies for improvement
Reactive
Reactive strategies are those that help us better identify, diagnose and fix crash-prone code once it has shipped in a Firefox build (from Nightly through to Release). Reactive strategies are based around crash reports. The following is a list of ideas.
- Improve manual inspection of crash reports; ensure all significant crashes on all release channels are checked in a timely fashion.
- The Nightly roster is below.
- Improve automated analysis of crash reports.
- Use Crystal Ball to identify and predict correlations between crashes on release and earlier channels.
- Use Clouseau to automatically detect new crashes, trace them back to recent changes around the crash location, and notify whoever most recently changed that code.
- High-level analysis to identify which crash kinds are most common.
- Increase the number of users on Nightly, Aurora and Beta, because higher numbers of crash reports from earlier channels make it more likely that new regressions are caught earlier.
- Ideas for Nightly from Marcia Knous.
- Improve crash report aggregation and presentation, to make it easier to identify important crashes.
- Improve crash reports that have little actionable data, e.g. many JIT and GC crashes.
- bug 1268029 should help somewhat with JIT crashes.
- Ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs).
- Gather more runtime data at the time of crash, e.g.:
- Unresolved Promises (and other expected callbacks),
- File and position in the JS script being interpreted, at least for Firefox-provided scripts,
- Better identification of threads on all platforms, including thread pools.
Also, a non-trivial fraction of crashes may be due to users having faulty hardware.
- Run a memory check upon crashing, probably based on some heuristic such as crash frequency, and inform the user if they have faulty memory.
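A crash-frequency heuristic of the kind mentioned above might look like the following minimal sketch. Everything in it is an assumption made for illustration: the function name `should_run_memory_check`, the 7-day window, and the threshold of 3 crashes are hypothetical, not values chosen by the project.

```python
from datetime import datetime, timedelta


def should_run_memory_check(crash_times, now, window=timedelta(days=7), threshold=3):
    """Return True if at least `threshold` crashes fall within `window` before `now`.

    The name and the default window/threshold are hypothetical placeholders.
    """
    # Keep only crashes that happened in the recent window (and not in the future).
    recent = [t for t in crash_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


# Illustrative data: crashes 1, 2, 5 and 30 days before "now".
now = datetime(2016, 5, 4)
crashes = [now - timedelta(days=d) for d in (1, 2, 5, 30)]
print(should_run_memory_check(crashes, now))
```

With the sample data, three crashes fall inside the 7-day window, so the check would be triggered. A real heuristic would likely also weigh crash signatures, since memory faults tend to produce scattered, unrelated crash locations.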
Proactive
Proactive strategies are those that help us prevent crash-prone code from shipping. The following is a list of possibilities.
- Improve test coverage.
- Improve fuzzing coverage.
- Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
- Record non-reproducible fuzz crashes in rr so they can be played back reliably.
- Extend use of static analysis.
- Compiler warnings
- bug 1268766 is about adding MOZ_MUST_USE annotations throughout the code.
- clang's static analysis
- Coverity
- Extend use of dynamic analysis.
- ASan: Already used on automation with Linux. bug 1030826 is for extending that to Windows.
- TSan: bug 929478 is the meta-bug tracking progress.
- Valgrind:
- Eliminate crash-prone code patterns.
- Low-level, e.g. replace raw pointers with smart pointers such as UniquePtr.
- High-level, e.g. disallow binary extensions.
- Implement more internal verification ("extended assertions"), e.g. verify complex data structures such as a compiler IR.
- Reimplement existing C and C++ components in less crash-prone languages (e.g. JavaScript, Rust).
- bug 1135640 is about adding support for Rust code.
- Better utilize available OS protection against malware (largely a Windows-only issue).
- Windows 10 has better protection, at least for content processes?
- Recover from crash-causing events and continue. Difficult in general, may be possible in restricted cases?
- Use data from users (telemetry, crash reports, etc.) to determine gaps in coverage in Nightly/Aurora/Beta, and then use that to drive additional testing, channel recruitment, etc. Types of coverage:
- Web feature coverage (e.g. are we not testing web workers enough?)
- Site coverage (e.g. are we not testing Facebook enough?)
- Hardware configuration coverage (e.g. are we not testing particular graphics card/driver combinations enough?)
- Firefox configuration coverage (e.g. are we not testing particular add-ons or non-standard options enough?)
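The coverage-gap idea in the last item could be approximated by comparing how often something is exercised on Release with how often it is exercised on a pre-release channel. The sketch below is hypothetical: the function name `coverage_gaps`, the input shape (a mapping from a feature name to the fraction of sessions that use it), the `min_ratio` cutoff, and the sample numbers are all assumptions for illustration, not real telemetry.

```python
def coverage_gaps(release_usage, prerelease_usage, min_ratio=0.5):
    """List features exercised much less on a pre-release channel than on Release.

    Both arguments map a feature name to the fraction of sessions that use it.
    A feature is reported when its pre-release usage is below `min_ratio` times
    its Release usage. Results are sorted with the largest gap first.
    """
    gaps = []
    for feature, release_share in release_usage.items():
        if release_share <= 0:
            continue  # Nothing to cover if Release users never hit this feature.
        ratio = prerelease_usage.get(feature, 0.0) / release_share
        if ratio < min_ratio:
            gaps.append((feature, ratio))
    return sorted(gaps, key=lambda item: item[1])


# Made-up usage shares, purely for illustration.
release = {"web workers": 0.40, "WebGL": 0.20, "add-on X": 0.10}
nightly = {"web workers": 0.35, "WebGL": 0.05}
for feature, ratio in coverage_gaps(release, nightly):
    print(f"{feature}: pre-release usage is {ratio:.0%} of Release usage")
```

The same comparison applies to any of the coverage types listed above (sites, hardware configurations, Firefox configurations) by changing what the dictionary keys represent.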
Administration
Things to be decided.
- Regular meetings: time, frequency, content?
- Progress tracking: high-level stability measurements, bug lists, etc. Links to those.
- Communication: IRC, email, other?
Crash stats checking rosters
Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed here. The current roster for checking Nightly build crashes:
- Sunday (late): njn analyzes Friday's and Saturday's builds
- Monday (late): njn analyzes Sunday's and Monday's builds
- Wednesday: jchen analyzes Tuesday's builds
- Thursday: mccr8 analyzes Wednesday's builds
- Friday: dbaron analyzes Thursday's builds
We keep notes on what we do.