Platform/Uptime: Difference between revisions

From MozillaWiki

Revision as of 02:06, 4 May 2016

Uptime is a project that aims to improve Firefox's stability, i.e. reduce its crash rate.

Uptime is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.

Strategies for improvement

Reactive

Reactive strategies are those that help us better identify, diagnose and fix crash-prone code once it has shipped in a Firefox build (from Nightly through to Release). Reactive strategies are based around crash reports. The following is a list of ideas.

  • Improve manual inspection of crash reports; ensure all significant crashes on all release channels are checked in a timely fashion.
    • The Nightly roster is below.
  • Improve automated analysis of crash reports.
    • Use Crystal Ball to identify and predict correlations between crashes on release and earlier channels.
    • Use Clouseau to automatically detect new crashes, trace them back to recent changes near the crash location, and notify whoever most recently changed that code.
    • High-level analysis to identify which crash kinds are most common.
  • Increase the number of users on Nightly, Aurora and Beta, because higher numbers of crash reports from earlier channels make it more likely that new regressions are caught earlier.
  • Improve crash report aggregation and presentation, to make it easier to identify important crashes.
  • Improve crash reports that have little actionable data, e.g. many JIT and GC crashes.
  • Ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs).
  • Gather more runtime data at the time of crash, e.g.:
    • Unresolved Promises (and other expected callbacks),
    • File and position in the JS script being interpreted, at least for Firefox-provided scripts,
    • Better identification of threads on all platforms, including thread pools.
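
As a rough illustration of the "high-level analysis" idea above, the following sketch ranks crash kinds by how often they occur. It assumes each crash report is a dictionary with a `signature` field; that field name, the sample signatures and the function name are hypothetical and do not reflect any particular crash-stats API.

```python
from collections import Counter


def top_crash_signatures(reports, n=3):
    """Return the n most common signatures as (signature, count, share) tuples."""
    counts = Counter(report["signature"] for report in reports)
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so no division by zero occurs.
    return [(sig, count, count / total) for sig, count in counts.most_common(n)]


# Hypothetical sample data; real crash reports carry many more fields.
reports = [
    {"signature": "example_gc_crash", "channel": "nightly"},
    {"signature": "example_oom", "channel": "beta"},
    {"signature": "example_gc_crash", "channel": "release"},
    {"signature": "example_jit_crash", "channel": "release"},
    {"signature": "example_gc_crash", "channel": "release"},
]

for signature, count, share in top_crash_signatures(reports):
    print(f"{signature}: {count} reports ({share:.0%})")
```

A custom analysis job could start from a ranking like this and then break the top signatures down by channel or build.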

Also, a non-trivial fraction of crashes may be due to users having faulty hardware.

  • Run a memory check upon crashing, probably based on some heuristic such as crash frequency, and inform the user if they have faulty memory.
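
One possible shape for the crash-frequency heuristic mentioned above is sketched below. The seven-day window, the threshold of three crashes and the function name are all assumptions chosen for illustration; the page does not specify a heuristic.

```python
from datetime import datetime, timedelta


def should_run_memory_check(crash_times, now, window=timedelta(days=7), threshold=3):
    """Return True if at least `threshold` crashes fall within `window` before `now`.

    The window and threshold defaults are illustrative assumptions, not tuned values.
    """
    recent = [t for t in crash_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


# Hypothetical crash history: three crashes this week, one a month ago.
now = datetime(2016, 5, 4)
crash_times = [now - timedelta(days=d) for d in (1, 2, 3, 30)]
print(should_run_memory_check(crash_times, now))
```

A user who crashes this often would be a candidate for the memory check; a user with a single old crash would not.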

Proactive

Proactive strategies are those that help us prevent crash-prone code from shipping. The following is a list of possibilities.

  • Improve test coverage.
  • Improve fuzzing coverage.
    • Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
    • Record non-reproducible fuzz crashes in rr so they can be played back reliably.
  • Extend use of static analysis.
    • Compiler warnings
      • bug 1268766 is about adding MOZ_MUST_USE annotations throughout the code.
    • clang's static analysis
    • Coverity
  • Extend use of dynamic analysis.
    • ASan: already used in automation on Linux. bug 1030826 is for extending that to Windows.
    • TSan: bug 929478 is the meta-bug tracking progress.
    • Valgrind:
  • Eliminate crash-prone code patterns.
    • Low-level, e.g. replace raw pointers with smart pointers such as UniquePtr.
    • High-level, e.g. disallow binary extensions.
  • Implement more internal verification ("extended assertions"), e.g. verify complex data structures such as compiler IR.
  • Reimplement existing C and C++ components in less crash-prone languages (e.g. JavaScript, Rust).
  • Better utilize available OS protection against malware (largely a Windows-only issue).
    • Windows 10 has better protection, at least for content processes?
  • Recover from crash-causing events and continue. Difficult in general, may be possible in restricted cases?
  • Use data from users (telemetry, crash reports, etc.) to determine gaps in coverage in Nightly/Aurora/Beta, and then use that to drive additional testing, channel recruitment, etc. Types of coverage:
    • Web feature coverage (e.g. are we not testing web workers enough?)
    • Site coverage (e.g. are we not testing Facebook enough?)
    • Hardware configuration coverage (e.g. are we not testing particular graphics card/driver combinations enough?)
    • Firefox configuration coverage (e.g. are we not testing particular add-ons or non-standard options enough?)
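
The coverage-gap idea in the last item could be approached by comparing how common a configuration is among release users with how common it is among pre-release users. The sketch below assumes user counts per configuration are available as plain dictionaries; the configuration names, the counts and the `min_ratio` cutoff are hypothetical.

```python
def coverage_gaps(release_counts, prerelease_counts, min_ratio=0.5):
    """Return configurations that are under-represented on pre-release channels.

    A configuration is flagged when its pre-release share is below
    min_ratio times its release share (min_ratio is an assumed cutoff).
    """
    release_total = sum(release_counts.values())
    prerelease_total = sum(prerelease_counts.values())
    if release_total == 0 or prerelease_total == 0:
        return []
    gaps = []
    for config, count in release_counts.items():
        release_share = count / release_total
        prerelease_share = prerelease_counts.get(config, 0) / prerelease_total
        if prerelease_share < min_ratio * release_share:
            gaps.append((config, release_share, prerelease_share))
    # Largest release share first: these gaps affect the most users.
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Hypothetical user counts per graphics configuration.
release = {"gpu_a": 500, "gpu_b": 300, "gpu_c": 200}
prerelease = {"gpu_a": 80, "gpu_b": 5, "gpu_c": 15}

for config, release_share, prerelease_share in coverage_gaps(release, prerelease):
    print(f"{config}: {release_share:.0%} of release, {prerelease_share:.0%} of pre-release")
```

The same comparison applies to the other coverage types listed, with web features, sites or add-ons in place of graphics configurations.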

Administration

Things to be decided.

  • Regular meetings: time, frequency, content?
  • Progress tracking: high-level stability measurements, bug lists, etc. Links to those.
  • Communication: IRC, email, other?

Crash stats checking rosters

Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed here. The current roster for checking Nightly build crashes:

  • Sunday (late): njn analyzes Friday's and Saturday's builds
  • Monday (late): njn analyzes Sunday's and Monday's builds
  • Wednesday: jchen analyzes Tuesday's builds
  • Thursday: mccr8 analyzes Wednesday's builds
  • Friday: dbaron analyzes Thursday's builds

We keep notes on what we do.