Platform/Uptime: Difference between revisions

'''Project Uptime ran from April 2016 to June 2017.'''
''Preventing and fixing crashes remains an important task, and there is ongoing work there. But this work is no longer being coordinated under Project Uptime. This page is being kept in place as a historical record.''
Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.  
David Bryant's internal email announcement said: "The overall responsibility to hit our stability target is something that falls on the entire platform organization, and in terms of priority, stability is second only to shipping e10s."


= Goals =
The actual target values are TBD, and cannot be decided until the new per-1000-hours-of-usage metric has fully replaced the old per-100-ADIs metric and we know what our current values are.
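
As a rough illustration of the two metrics named above, the following sketch computes crashes per 1000 hours of usage and crashes per 100 ADIs (active daily installs) from raw counts. The function names and the sample numbers are hypothetical and are not taken from any Mozilla dashboard.

```python
def crashes_per_1000_usage_hours(crash_count: int, usage_hours: float) -> float:
    """Crash rate normalised by how long the browser was actually used."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return crash_count * 1000.0 / usage_hours


def crashes_per_100_adis(crash_count: int, adi_count: int) -> float:
    """Older-style crash rate normalised by the number of active installs."""
    if adi_count <= 0:
        raise ValueError("adi_count must be positive")
    return crash_count * 100.0 / adi_count


if __name__ == "__main__":
    # Hypothetical sample: 42 crashes over 12,000 usage hours.
    print(crashes_per_1000_usage_hours(42, 12_000))
    # Hypothetical sample: 150 crashes across 10,000 ADIs.
    print(crashes_per_100_adis(150, 10_000))
```

The usage-hours metric does not change when the same users simply browse for longer, whereas the per-ADI figure rises with heavier use, which is one reason the two cannot be compared directly.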


= All about crashes =
 
We use "crashes" to refer to several kinds of events.
* Uncontrolled crashes: most commonly due to bad memory accesses, which are caused by bad pointers. Also sometimes illegal instructions, division-by-zero, etc.
* Controlled aborts: MOZ_CRASH, MOZ_RELEASE_ASSERT, Rust panics, etc.
* OOMs: running out of virtual or physical memory, or page file. Can be caused by Firefox using too much memory, or other processes on the system using too much memory. These usually present as controlled aborts.
* Hangs: sometimes when a plugin or content process stops responding for too long, we will collect a dump from that process and then kill it. Most common at shutdown.
 
Crashes are grouped by process type:
* When the "main" (chrome) process crashes, Firefox terminates.
* When a content process crashes, Firefox itself stays open but some or all tabs must be reloaded.
* When an NPAPI plugin or Gecko media plugin crashes, the web page will stay alive but some functionality may be broken and require the user to reload pages which used that plugin.


There are multiple different causes of crashes, and multiple ways they can be identified and addressed.


== A word about coverage ==
== Paths to a bug report ==


Almost without exception, for a crash to be fixed we must have a bug report for it. There are multiple paths from a crash to a bug report.


=== User crash report (automated) ===
* The crash reporter catches the crash.
* The user consents to crash report submission.
* The crash report is created.
* The crash report is submitted, received and processed.
* The crash report is clustered into a signature and ranked.
* A Mozillian triages the signature and files a bug report.

* [-] This is a long, complex path with numerous points of potential failure.
* [-] Submission rates vary, and are low for content, NPAPI and GMPlugin crashes.
* [-] Clustering and ranking are imperfect. Sometimes crash reports with distinct causes get conflated into one signature. Sometimes crash reports with a single cause get separate signatures (people sometimes spot this and link multiple signatures to a single bug report). The ranking does not consider crash severity.
* [-] Some crashes only manifest frequently in later release channels, where they affect more people, and are harder to address due to late feedback and large regression windows. Crashes that only manifest significantly on Release may require chemspills.
* [-] Crash reports frequently lack enough info to diagnose/address. (E.g. no steps to reproduce, and it's obvious that a pointer was bad, but unclear how it got that way.) Or the info is present but easy to overlook.
* [-] Crash report data can be unreliable. (E.g. incorrect crash addresses and stack traces.)
Possible improvements
* Increase the number of users on earlier channels, especially Nightly.
** [https://public.etherpad-mozilla.org/p/nightlyideas Ideas for Nightly] from Marcia Knous.
** [https://blog.mozilla.org/nnethercote/2016/05/19/i-want-more-users-on-the-nightly-channel/ A blog post encouraging people to switch to Nightly].
** Improve earlier channel populations to be more representative of the Release population.
* Increase crash report submission rate.
** Always submit non-sensitive data, and make only the sensitive data part (e.g. minidumps) optional.
** {{Bug|1269998}} Prompt users with pending crash reports to submit them.
** {{Bug|1270553}} Allow users to opt-in to auto submitting crash data.
* Improve the clustering of crash reports.
** Improve crash signatures by skipping/ignoring generic stack frames.
** Ignore minor differences in graphics driver versions.
** {{bug|1268029}} Use jit classifier to change signature for jit crashes. (More improvements to JIT crash reports probably still required after that, though.)
** Distinguish GC crashes better.
** Use something more sophisticated than the crash signature, which is mostly based on stack traces.
** Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs).
* Improve ranking of crash clusters.
** Take into account crash severity.
** Use [https://github.com/vvuk/crystalball Crystal Ball] to identify how minor crashes on early release channels might become major crashes in later channels.
** {{bug|1274345}} Skip the graphics drivers signatures in the report, letting us cluster on the first calls for which we have the symbols, rather than just dealing with the DLL offsets.  See the two reports in this bug for an example.
* Improve triaging of crash reports.
** Ensure sufficient manual triage occurs.
*** Nightly has coverage for every build (see the roster below).
*** Aurora: ?
*** Beta: release drivers?
*** Release: release drivers?
** Increase the amount of automation in triage.
*** Use Clouseau to automatically detect new crashes and trace them back to recent changes around the location of the crash. And possibly notify whoever most recently changed that code.
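
The clustering ideas above (skipping generic stack frames, and skipping graphics driver frames that appear only as DLL offsets) can be sketched as follows. This is a minimal illustration, not how crash-stats builds signatures: the skip list, the frame names in the usage example, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical skip list: frames too generic to tell crashes apart.
GENERIC_FRAMES = {"abort", "raise", "mozalloc_abort"}

# Frames without symbols look like "module.dll@0x1a2b3"; the offset changes
# between driver versions, which splits one cause across many signatures.
UNSYMBOLICATED = re.compile(r"^[\w.\-]+@0x[0-9a-fA-F]+$")


def make_signature(frames, max_frames=3):
    """Join the first few informative frames, topmost frame first."""
    kept = [
        frame
        for frame in frames
        if frame not in GENERIC_FRAMES and not UNSYMBOLICATED.match(frame)
    ]
    if not kept:
        # Nothing informative survived, so fall back to the raw frames.
        return " | ".join(frames[:max_frames]) or "EMPTY"
    return " | ".join(kept[:max_frames])


def cluster_reports(reports):
    """Count crash reports per signature; each report is a list of frames."""
    return Counter(make_signature(frames) for frames in reports)
```

With this approach, two reports that differ only in the driver offset of their top frame fall into the same cluster, keyed on the first frames for which symbols are available.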


=== User crash report (manual) ===
* [+/-] We may or may not get clear steps to reproduce.
* [+] We usually get a crash report as well.
Possible improvements
* Improve coverage of web features and sites by providing test exercises for QA people and keen Nightly users to run regularly.


=== Crash on automation ===
* A monitoring person (e.g. fuzzer, QA person) files a bug report.


Note: many tests are performed regularly on TreeHerder and are cause for immediate patch backout if they crash. Such crashes do not match this path because no bug report is filed.


Pros and cons
* [-] ...but reproduction may be difficult or unreliable.


Possible improvements
* Improve code coverage with more tests.
** http://people.mozilla.org/~sledru/reports/coverage-2015/
* Improve fuzzing coverage.
** Better isolate components so they can be fuzzed more easily (e.g. the JS shell).
** Increase hardware available for fuzzing, for both greater throughput and better hardware configuration coverage.
** Increase fuzzing coverage of non-default options.
** Improve gtests, which are a good starting point for fuzzing.
* Improve reproducibility of these crashes.
** Use rr to record crashes so they can be played back reliably.

=== Analysis ===


Steps
* A non-TreeHerder dynamic analysis job (e.g. Valgrind, ASan, TSan, UBSan) or static analysis job (e.g. Coverity) detects a possible crash defect. Or, a developer identifies a possible crash defect via code inspection.
* A monitoring person files a bug report.

Note: some dynamic and static analysis jobs are performed regularly on TreeHerder and are cause for immediate patch backout if they fail. Crash defects found this way do not match this path because no bug report is filed.

Pros and cons
* [+] The defect is usually well-identified and actionable.
* [-] Not always clear how often the defect will cause crashes, though this depends on the analysis kind. (E.g. ASan is clearer than TSan.)

Possible improvements
* Ensure sufficient triaging occurs.
** sewardj periodically runs Mochitests under Valgrind and files found bugs.
** Coverity triage: ?
* {{Bug|929478}} Make TSan (ThreadSanitizer) usable with Firefox. (And eventually run TSan on automation in some fashion?)
* Try out UBSan?
* Run mochitests under Valgrind on automation twice a day as a Tier 2 job.
* Improve dynamic analysis coverage by using it in tandem with automation tools (e.g. fuzzing, BugHunter).
* Implement more internal verification ("extended assertions"), e.g. verify complex data structures, triggered manually (by users) or programmatically (by fuzzers).


== Paths from bug report to a diagnosis ==
* High: Well-identified defect, e.g. from analysis.
* Lower: Unreliable test case or steps to reproduce.
* Lower: Crash report.

Also, the following things increase the likelihood that a crash bug report will lead to a diagnosis.
* A needinfo request of an appropriate developer.
* Evidence that the crash has high severity or frequency.
* Nagging.

Possible improvements
* Improve data in crash reports.
** {{bug|1268029}} Use jit classifier to change signature for jit crashes.
** Improve data in GC crash reports.
** Include unresolved Promises (and other expected callbacks).
** Include source location (file & line number) in the Firefox-provided JS script being executed.
** Better identify threads of all platforms, including threadpools.
** Increase use of deliberate, informative aborts (e.g. MOZ_DIAGNOSTIC_ASSERT).
* Increase amount of tracking of and nagging in crash bugs, e.g. from EPMs.


== Addressing a diagnosed crash ==


Once the cause of a crash in a bug report has been diagnosed, it can be addressed. The possible ways to address a crash -- and to prevent similar crashes in the future -- depend on what kind of defect it is.


=== A defect in our code ===

Many crashes are due to defects in our code. Ways to address include: fix the defect; back out the defect; disable the defect; recover and continue (e.g. e10s content process restarting).

It's also possible to prevent entire classes of defects by using better code patterns and/or [[Oxidation|languages]], or tools (e.g. dynamic analysis with sufficient coverage).

Ways to prevent (by making them impossible)
* Reimplement C and C++ components in JS or Rust.
** {{bug|1135640}} Add rust compiler support.
* Eliminate crash-prone code patterns.
** Low-level, e.g. use smart pointers.
*** {{bug|1272203}} Add mozilla::NotNull to MFBT.
** High-level, e.g. disallow binary extensions.

Ways to prevent (by detecting and backing defective code out immediately)
* Use compiler annotations and warnings more.
** {{bug|1268766}} Use MOZ_MUST_USE everywhere.
** Use -Wshadow and related options more (in combination with warnings-as-errors).
* Use more static analysis in TreeHerder jobs.
** Add more checks to the clang static analysis job.
* Add more dynamic analysis jobs to TreeHerder.
** {{bug|1030826}} Support AddressSanitizer builds on Windows with clang-cl.


=== A defect in something we rely on ===

Some crashes are caused by defects in things we depend on: OS, gfx drivers, CPUs (e.g. the infamous AMD bug; overclocking can cause problems), RAM, disks. E.g. {{bug|1270232}} and {{bug|1194856}} are bugs where a bad installation might be the cause.

In general, we cannot fix or prevent these defects. Ways to address include: avoid the defect; detect and tolerate the defect; detect the defect and inform the user.
* Avoid the defect.
** Disable hardware acceleration in the presence of buggy drivers.
** Rewrite code that tickles the AMD bug.
* Tolerate the defect.
** Handle gfx driver resets.
* Detect the defect and inform the user.
** Detect if Firefox is misinstalled (e.g. perform checksums on files) and ask the user to reinstall.
** Detect bad memory and inform user? (Or tolerate?)
*** We have [https://bugzilla.mozilla.org/show_bug.cgi?id=1034706#c44 evidence] that 15--20% of JIT crashes might be due to a particular buggy AMD CPU, and another 23--37% might be due to bit flips caused by bad RAM.
*** {{Bug|995652}} Run memtest from the crash reporter.
*** {{Bug|1270554}} Run memtest continuously on the live browser.


=== A defective interaction ===

Some crashes are caused by defective interactions with other software such as malware and/or anti-virus products. Sometimes these can be blocked, and sometimes we must rely on evangelism.
* Prevent
** Windows 10 has better blocking, at least for content processes?
* Tolerate?
* Evangelism
** {{bug|1229252}} startup crashes in nss3.dll@0x1e5b60 | GetFileInfo, due to software from ESET (eOppMonitor.dll)
* Inform user?
 
=== OOMs ===
 
OOMs are different to other crashes in terms of cause, though not any different in terms of the effect on users. In general, anything that [[Performance/MemShrink|significantly reduces memory usage]] helps, as does
making large allocations fallible, especially those whose size is controlled by user input or web content. Using [[Firefox/win64|64-bit builds]] also prevents address space exhaustion.
 
= OKRs =
 
[https://en.wikipedia.org/wiki/OKR OKRs] are "objectives and key results". Within the following sections, each sub-section represents an objective, and the bullet points within that sub-section represent key results. Some objective sub-sections have additional supporting information (marked with '''Extra:''') such as a table of relevant bug reports.
 
These OKRs are for H2 2016.
 
== In progress ==
 
These are OKRs that have people actively working on them now.
 
=== Establish crash rate metrics and targets ===
* Verify that the crashes-per-1000-hour metric is reliable [digitarald]
* Define and choose target crash rate(s) [digitarald, njn]
* Implement dashboard(s) for easy monitoring [digitarald, ddurst]
 
=== Improve pre-release coverage ===
* Get the number of Nightly users growing again [pascal, marcia]
* <s>Provide test exercises for QA people and keen Nightly users to run regularly [marcia]</s> (One and Done has been [https://blog.mozilla.org/webqa/2016/09/28/one-and-done-is-being-decommissioned-october-2016/ decommissioned].)
* ? Improve earlier channel populations to be more representative of the Release population [lonnen?]
 
'''Extra:''' {{Bug|1280394}} [tracking] Nightly adoption bugs
 
=== Improve analysis of crash reports ===
* [https://libraries.io/github/calixteman/clouseau Clouseau]: Automatically identify changesets that cause regressions [calixte]
* {{mdone|}} [https://github.com/marco-c/crashcorrelations crash-correlations]: Identify correlations for crash signatures [mcastelluccio]
* ? Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs) [ekyle?]
 
'''Extra:''' Bugs blocking {{bug|1289676}}: [Uptime] Crash report handling
<bugzilla>
{
  "blocked": "1289676",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
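
The query embedded in the block above can also be expressed as a plain URL. The sketch below only builds that URL and performs no network request. It assumes the Bugzilla REST search endpoint at <code>/rest/bug</code> accepts the same <code>blocked</code> and <code>include_fields</code> parameters as the wiki block; the REST documentation names this relation <code>blocks</code>, so the parameter name may need adjusting. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the REST bug search on bugzilla.mozilla.org.
BUGZILLA_REST_BUG_URL = "https://bugzilla.mozilla.org/rest/bug"

DEFAULT_FIELDS = ("id", "priority", "summary", "keywords", "assigned_to")


def blocking_bugs_url(tracking_bug_id: int, fields=DEFAULT_FIELDS) -> str:
    """Build a search URL for the bugs that block a given tracking bug."""
    params = {
        # Mirrors the "blocked" key used in the wiki block (an assumption
        # for the REST API, which documents this relation as "blocks").
        "blocked": str(tracking_bug_id),
        "include_fields": ",".join(fields),
    }
    return f"{BUGZILLA_REST_BUG_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(blocking_bugs_url(1289676))
```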
'''Extra:''' A [https://docs.google.com/presentation/d/1j-w1Mxgh7xQBPa57gdx_PSP2yf9yMJh8Oqu_7izReAY/edit#slide=id.p cross-variate analysis of FHR data], by Brendan Colloran, which may have useful techniques.
 
=== {{mdone|}} Improve understanding of OOM causes ===
* {{mdone|}} Do a large-scale analysis of memory reports from OOM crashes [njn]
* {{mdone|}} Show important data from memory reports in crash-stats [njn]
* ? Discuss common OOM cases with partners [digitarald?]
 
'''Extra:''' Bugs blocking {{bug|1291174}}: [Uptime] OOMs
<bugzilla>
{
  "blocked": "1291174",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
=== Catch more crashes with static and dynamic analysis ===
* Add additional, regular ASan, TSan, and Valgrind jobs to TreeHerder [jseward, sfink, tcole]
* Experiment with TSan, UBSan and Application Verifier [jseward, tcole, cervantes]
* Reduce the number of outstanding Coverity issues to zero and keep it there [bogdan]
* ? Add more checks to the clang static analysis job [sledru?]
 
'''Extra:''' Bugs blocking {{bug|1289660}}: [Uptime] Static analysis
<bugzilla>
{
  "blocked": "1289660",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
'''Extra:''' Bugs blocking {{bug|1289646}}: [Uptime] Dynamic analysis
<bugzilla>
{
  "blocked": "1289646",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
=== {{mdone|}} Improve crash report comprehensibility ===
* {{mdone|}} Improve presentation of crash reports on crash-stats [njn, adngdb]
* {{mdone|}} Improve crash report documentation [njn]
 
'''Extra:''' Bugs blocking {{bug|1289675}}: [Uptime] Crash report comprehensibility
<bugzilla>
{
  "blocked": "1289675",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
=== Prevent common C++ defects with better types and annotations ===
* Finish mozilla::Result<T,E> and use in SpiderMonkey [jandem]
* Use MOZ_MUST_USE outside SpiderMonkey [njn, wcpan]
 
'''Extra:''' Bugs blocking {{bug|1289662}}: [Uptime] Low-level defect prevention and detection
<bugzilla>
{
  "blocked": "1289662",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
 
== H2 2016 Backlog ==
 
These are unassigned OKRs that would be helpful and are higher priority.
 
=== Refine how crash clusters are ranked ===
* Take into account crash severity and/or user distribution as well as frequency
* Use [https://github.com/vvuk/crystalball Crystal Ball] (or other means) to identify how pre-Release crashes will manifest on Release
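
A minimal sketch of ranking by severity as well as frequency is shown below. It uses the process type as a stand-in for severity, on the assumption that a main-process crash (which terminates Firefox) is worse than a content or plugin crash. The weights, class name and sample signatures are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity weights per process type.
SEVERITY_WEIGHT = {"main": 3.0, "content": 2.0, "plugin": 1.0}


@dataclass
class CrashCluster:
    signature: str
    crash_count: int
    process_type: str


def score(cluster: CrashCluster) -> float:
    """Frequency scaled by how disruptive this kind of crash is assumed to be."""
    return cluster.crash_count * SEVERITY_WEIGHT.get(cluster.process_type, 1.0)


def rank_clusters(clusters):
    """Return the clusters ordered from highest to lowest score."""
    return sorted(clusters, key=score, reverse=True)


if __name__ == "__main__":
    ranked = rank_clusters([
        CrashCluster("sig_plugin", 250, "plugin"),
        CrashCluster("sig_content", 120, "content"),
        CrashCluster("sig_main", 100, "main"),
    ])
    print([c.signature for c in ranked])
```

Ranking by frequency alone would put the plugin cluster first; with the weights above the less frequent main-process cluster moves to the top.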
 
'''Extra:''' Bugs blocking {{bug|1309114}}: [Uptime] Crash cluster ranking
<bugzilla>
{
  "blocked": "1309114",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
'''Extra:''' Relevant research papers.
* http://ieeexplore.ieee.org/document/5711013/
* http://swat.polymtl.ca/~foutsekh/docs/Khomh-WCRE-entropy.pdf
 
=== Reduce shutdown hang rate ===
* ?
 
'''Extra:''' [[Gecko:Shutdown_issues]]
 
=== Detect if Firefox is misinstalled ===
* Implement an installation check mechanism
* Provide guidance to the user when the check fails
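
One possible shape for such an installation check is to compare the files on disk against a manifest of expected checksums, as sketched below. The manifest format, the choice of SHA-256, and the function names are assumptions made for illustration; nothing here reflects an actual Firefox mechanism.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so that large libraries need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged_files(install_dir: Path, manifest: dict) -> list:
    """Return the manifest entries that are missing or whose checksum differs."""
    damaged = []
    for relative_name, expected_hash in manifest.items():
        candidate = install_dir / relative_name
        if not candidate.is_file() or sha256_of(candidate) != expected_hash:
            damaged.append(relative_name)
    return sorted(damaged)


if __name__ == "__main__":
    # Demonstration with a throwaway directory and a hypothetical manifest.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "good.bin").write_bytes(b"abc")
        manifest = {
            "good.bin": hashlib.sha256(b"abc").hexdigest(),
            "missing.bin": hashlib.sha256(b"xyz").hexdigest(),
        }
        print(find_damaged_files(root, manifest))
```

A non-empty result would be the point at which the user is advised to reinstall.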
 
=== Prevent crashes caused by malware and anti-virus on Windows ===
* Utilize Windows 10 blocking protection for content processes
* Reduce export space of xpcom symbols [bsmedberg]
* ? Provide an API for A/V so they don't need to use dangerous tricks
 
'''Extra:''' {{Bug|1299187}} Tracking: stop exporting symbols such that 3rd-party software cannot use XPCOM effectively
 
=== Catch more crashes with fuzzing ===
* Better isolate components so they can be fuzzed more easily (like the JS shell)
* Increase fuzzing coverage of non-default options
* Improve gtests, which are a good starting point for fuzzing
* {{mdone|}} [https://intranet.mozilla.org/Auto-tools/Systems/Inventory/Applications/Bughunter/TopSiteTests Run BugHunter with common antivirus software] [tomcat]
 
'''Extra:''' {{bug|1289194}} Add LibFuzzer support for testing xul code
 
== Future Backlog ==
 
These are unassigned OKRs that would be helpful and are lower priority.
 
=== Detect when crashes are caused by faulty RAM ===
* Run some kind of live RAM test in the browser
* Annotate crash reports when faulty RAM is detected
 
'''Extra:''' Bugs blocking {{bug|1289666}}: [Uptime] Defective hardware
<bugzilla>
{
  "blocked": "1289666",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
=== Improve crash report generation on the client ===
* Record additional interesting pages of memory
* Include more JS execution info (stacks; recently called JS file/functions; unresolved promises)
* Include more URL data (URLs of all open tabs; recent URLs of the crashing tab; non-anonymized memory reports); might require policy/legal approval
* Don't omit frame pointers?
* Better identify threads of all platforms, including threadpools
 
'''Extra:''' Bugs blocking {{bug|1289663}}: [Uptime] Crash report creation
<bugzilla>
{
  "blocked": "1289663",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
=== Increase crash report submission rate ===
* Submit more crash reports from content process crashes
* Always submit non-sensitive data, and make only the sensitive data part (e.g. minidumps) optional


'''Extra:''' Bugs blocking {{bug|1289671}}: [Uptime] Crash report submission
<bugzilla>
{
  "blocked": "1289671",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>


= Data Sources and Useful Links =
* [https://crash-stats.mozilla.com/home/product/Firefox Crash Stats]: entry point to crash reports.
* [[Platform/Uptime/Logs]] has some more specific links useful when analyzing crash reports.
* Data mining
** [[Auto-tools/Projects/ActiveData|ActiveData]]: a giant [http://activedata.allizom.org/tools/query.html queryable] database of unit test results, Buildbot steps, and MozHarness steps with success/failure outcomes
** [[Bugzilla:REST_API|Bugzilla REST API]]
** [[BMO/ElasticSearch|Bugzilla ElasticSearch Cluster]]: easy-to-query historical data
* [http://people.mozilla.org/~sledru/reports/coverage-2015/ Test code coverage]
* [https://docs.google.com/document/d/14nnlEiyl80qyirhIqzCHCDz9zo0R72wDorvmB8jctgs/ Fuzzing Explained - A Guide for Software Developers]
* Crash rate dashboards
** [https://crash-analysis.mozilla.com/release-mgmt/crash-report-tools/longtermgraph/ Crash rate history]: old ADI-based view
** [https://telemetry.mozilla.org/crashes/ Stability dashboard]: new view, intended to replace arewestableyet.com [chutten]
** [https://health.graphics/crashes ADI & Telemetry Crash Rate (Release)] [harald]
** [https://health.graphics/crashes/beta Telemetry Crash Rate (Beta)] [harald]
** [https://bsmedberg.github.io/telemetry-dashboard/crashes/office-dashboard.html Crashes per 1000 usage hours (Beta, DevEd, Nightly)] [bsmedberg]
** {{bug|1324528}} - Bug about removing old dashboards.
** {{bug|1324526}} - Bug about the new dashboards we want.
* [https://dataviz.mozilla.org/views/PlatformVersionFirefoxADI/DesktopADIbyPlatform Firefox ADI dashboard] (requires Tableau/dataviz privileges to view)
 
= Participants =
 
The grouping below is by time zone.
 
Taiwan
* Kan-Ru Chen, platform engineering
* Ting-Yu Chou, platform engineering
* Cervantes Yu, platform engineering
* Wei-Cheng Pan, platform engineering
 
Australia (Eastern)
* Nicholas Nethercote, platform engineering
 
North America (Pacific)
* Andrew McCreight, platform engineering
* Lonnen, Socorro
* David Baron, platform engineering
 
North America (Eastern)
* Jim Chen, platform engineering
* Kyle Lahnakoski, data visualization/analysis
* Marcia Knous, Release Management
* Milan Sreckovic, platform engineering
* Benjamin Smedberg, Firefox engineering
* David Durst, Firefox engineering
 
Europe (Western)
* Marco Castelluccio, crash report analysis and automation
 
Europe (Central)
* Calixte Denizet, crash report analysis and automation
* Julian Seward, dynamic analysis
* Sylvestre Ledru, static analysis, release management & stability
* Gabriele Svelto, Firefox engineering
 
Europe (Eastern)
* Andi Bogdan Postelnicu, static analysis
 
= Meetings =
 
We held meetings for several months, then switched to email updates because we deemed that to be a more effective use of everybody's time.
 
Here are the minutes of the meetings we have had.


* [https://public.etherpad-mozilla.org/p/uptime20170410 2017-04-10 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170327 2017-03-27 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170313 2017-03-13 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170227 2017-02-27 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170213 2017-02-13 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161219 2016-12-19 minutes]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_general Hawaii minutes: general]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_windows Hawaii minutes: Windows third-party crashes]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_generation Hawaii minutes: client-side crash report generation]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_analysis Hawaii minutes: crash report analysis]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_metrics Hawaii minutes: crash metrics]
* [https://public.etherpad-mozilla.org/p/uptime20161121 2016-11-21 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161107 2016-11-07 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161024 2016-10-24 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161010 2016-10-10 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160926 2016-09-26 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160912 2016-09-12 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160829 2016-08-29 minutes]


Here is the meeting minutes template.

* [https://public.etherpad-mozilla.org/p/uptime-template Minutes template]


= Communication channels =

* IRC: #uptime
* Vidyo: [https://v.mozilla.com/flex.html?roomdirect.html&key=tGTDjguBXn29Ldaww7BCeVhp4M Uptime]
* Email: There are two email lists, both of which are low-volume.
** Public: The [https://mail.mozilla.org/listinfo/stability stability] list. This list should be used for the vast majority of discussion. Anybody can join this list.
** Private: <code>projectuptime-team@mozilla.com</code>. This is only for private discussion, e.g. discussions relating to commercially-sensitive information. Please email nnethercote if you want to join this list.

= Nightly crash triage =

''This documentation has been moved to [[NightlyCrashTriage]].''

== Crash stats checking rosters ==

Nightly builds are produced at 3am each day (California time). Each build's crashes can be viewed [http://dbaron.org/mozilla/crashes-by-build here].

The current roster for checking Nightly build crashes:
* Monday (Australian time): njn analyzes Friday and Saturday's builds.
* Tuesday (Taiwan time): ting analyzes Sunday's build.
* Wednesday (Taiwan time): kanru analyzes Monday's build.
* Wednesday (US East time): jchen analyzes Tuesday's build.
* Thursday (US West time): mccr8 analyzes Wednesday's build.
* Friday (US West time): dbaron analyzes Thursday's build.

See the [[Platform/Uptime/NightlyCrashAnalysis|rough guide to Nightly crash analysis]]. We keep [[Platform/Uptime/Logs|notes]] on what we do.

Latest revision as of 22:23, 5 May 2023

Project Uptime ran from April 2016 to June 2017. Preventing and fixing crashes remains an important task, and there is ongoing work there. But this work is no longer being coordinated under Project Uptime. This page is being kept in place as a historical record.

Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.

David Bryant's internal email announcement said: "The overall responsibility to hit our stability target is something that falls on the entire platform organization, and in terms of priority, stability is second only to shipping e10s."

Goals

The basics

We can break Uptime's goals down into four sub-goals.

  1. Reduce the Release crash rate, and keep it down. This gives a better experience for ~100M/7M users, and makes them less likely to switch to a different browser.
  2. Reduce the Beta crash rate, and keep it down. This gives a better experience for ~2M/150k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goal #1.
  3. Reduce the Aurora crash rate, and keep it down. This gives a better experience for ~100k/2k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1 and #2.
  4. Reduce the Nightly crash rate, and keep it down. This gives a better experience for ~30k/1k users, and makes them less likely to switch to a different browser or a later channel. This goal partially subsumes goals #1, #2 and #3.

(Channel ADI numbers are approximate, taken from http://arewestableyet.com on Apr 26.)

These sub-goals are listed in order of priority. But note that the relative priorities do not match the relative user counts. E.g. even though Release has ~3000x as many users as Nightly, goal #1 is not 3000x more important than goal #4. The higher-numbered goals, as well as being (decreasingly) virtuous in their own right, partially subsume the lower-numbered goals, because a crash fixed or avoided in an earlier channel is one that doesn't show up in a later channel. (The later goals don't fully subsume the earlier goals because some crashes don't show up in significant numbers until we get to the larger populations of the later channels.)

There are multiple strategies (tools, processes, systems) to achieve these goals -- see below -- but those strategies should not be confused with the goals themselves.

Definition of "crash rate"

For a long time the crash rate metric was "crashes per 100 ADIs". This has various shortcomings and is in the process of being replaced with "crashes per 1000 hours of usage".

Not all crashes are equal. For example: persistent start-up crashes are terrible; with e10s enabled, content process crashes are not as bad as chrome process crashes; plugin crashes may be less important. It would be possible to adjust the per-1000-hours-of-usage measurements by weighting different types of crashes differently. However, any such formula would be arbitrary and not clearly better than the plain metric. Therefore, we will use the plain metric for defining "crash rate", and that is the value we want to get down.
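
As a rough illustration of the two metrics and of the weighting idea discussed above, here is a minimal Python sketch. The function names, crash kinds and weights are assumptions made for this example and are not taken from any Mozilla tool; real ADI and usage-hour figures would come from telemetry.

```python
def crashes_per_100_adis(crashes: int, adis: int) -> float:
    """Old metric: crashes per 100 active daily installations (ADIs)."""
    if adis <= 0:
        raise ValueError("adis must be positive")
    return 100.0 * crashes / adis


def crashes_per_1000_usage_hours(crashes: int, usage_hours: float) -> float:
    """New, plain metric: crashes per 1000 hours of usage."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return 1000.0 * crashes / usage_hours


def weighted_crashes_per_1000_usage_hours(crashes_by_kind, weights, usage_hours):
    """Weighted variant: each crash kind counts for weights[kind] crashes.

    Kinds missing from `weights` count as 1.0. The weights are arbitrary,
    which is exactly the objection raised against this variant.
    """
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    weighted = sum(weights.get(kind, 1.0) * count
                   for kind, count in crashes_by_kind.items())
    return 1000.0 * weighted / usage_hours


# Hypothetical sample: 100 crashes over 40,000 usage hours and 10,000 ADIs.
crashes_by_kind = {"main": 30, "content": 50, "plugin": 20}
hypothetical_weights = {"main": 1.0, "content": 0.5, "plugin": 0.25}
total = sum(crashes_by_kind.values())

print(crashes_per_100_adis(total, 10_000))
print(crashes_per_1000_usage_hours(total, 40_000))
print(weighted_crashes_per_1000_usage_hours(
    crashes_by_kind, hypothetical_weights, 40_000))
```

Changing any entry in the hypothetical weights changes the weighted result while the plain per-1000-hours value stays the same, which is why the plain value is the simpler quantity to agree on and track.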

Having said that, when considering a problem as complex as browser crashes, any single metric will have shortcomings and should not be the only thing paid attention to. In particular, we should build systems and processes (such as automated alerts) that help us give attention to the worst kinds of crashes.

Target crash rate

There are eight crash rates to track: {Release, Beta, Aurora, Nightly} x {desktop, mobile}.

The actual target values are TBD, and cannot be decided until the new per-1000-hours-of-usage metric has fully replaced the old per-100-ADIs metric and we know what our current values are.

All about crashes

We use "crashes" to refer to several kinds of events.

  • Uncontrolled crashes: most commonly due to bad memory accesses, which are caused by bad pointers. Also sometimes illegal instructions, division-by-zero, etc.
  • Controlled aborts: MOZ_CRASH, MOZ_RELEASE_ASSERT, Rust panics, etc.
  • OOMs: running out of virtual or physical memory, or page file. Can be caused by Firefox using too much memory, or other processes on the system using too much memory. These usually present as controlled aborts.
  • Hangs: sometimes when a plugin or content process stops responding for too long, we will collect a dump from that process and then kill it. Most common at shutdown.

Crashes are grouped by process type:

  • When the "main" (chrome) process crashes, Firefox terminates.
  • When a content process crashes, Firefox itself stays open but some or all tabs must be reloaded.
  • When an NPAPI plugin or Gecko media plugin crashes, the web page will stay alive but some functionality may be broken and require the user to reload pages which used that plugin.

There are multiple different causes of crashes, and multiple ways they can be identified and addressed.

A word about coverage

For all of the following paths to a bug report (except static analysis and code inspection) coverage affects how effective the path is. There are several kinds of coverage.

  • Code coverage: e.g. which lines of code are we not testing?
  • Web feature coverage: e.g. which web features are we not testing?
  • Site coverage: e.g. which important websites are we not testing?
  • Hardware configuration coverage: e.g. which important gfx card/drivers combinations are we not testing?
  • Firefox configuration coverage: e.g. which add-ons or non-default options are we not testing?

Paths to a bug report

Almost without exception, for a crash to be fixed we must have a bug report for it. There are multiple paths from a crash to a bug report.

User crash report (automated)

Steps

  • A "normal" user hits a crash.
  • The crash reporter catches the crash.
  • The user consents to crash report submission.
  • The crash report is created.
  • The crash report is submitted, received and processed.
  • The crash report is clustered into a signature and ranked.
  • A Mozillian triages the signature and files a bug report.

Pros and cons

  • [+] This is the only path that gives us a clear idea of how frequent the crash is among our user population.
  • [+] Crash reports on Nightly provide fast feedback and small regression windows.
  • [-] This is a long, complex path with numerous points of potential failure.
  • [-] Submission rates vary, and are low for content, NPAPI and GMPlugin crashes.
  • [-] Clustering and ranking are imperfect. Sometimes crash reports with distinct causes get conflated into one signature. Sometimes crash reports with a single cause get separate signatures (people sometimes spot this and link multiple signatures to a single bug report). The ranking does not consider crash severity.
  • [-] Some crashes only manifest frequently in later release channels, where they affect more people, and are harder to address due to late feedback and large regression windows. Crashes that only manifest significantly on Release may require chemspills.
  • [-] Crash reports frequently lack enough info to diagnose/address. (E.g. no steps to reproduce, and it's obvious that a pointer was bad, but unclear how it got that way.) Or the info is present but easy to overlook.
  • [-] Crash report data can be unreliable. (E.g. incorrect crash addresses and stack traces.)
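
The clustering and ranking step can be sketched as follows. This is a simplified illustration, not the algorithm used by crash-stats: the skip list, the frame names and the `max_frames` parameter are all assumptions made for the example.

```python
from collections import Counter

# Illustrative skip list: frames that show where a crash surfaced rather
# than what caused it. The names are examples only.
SKIPPED_FRAMES = {"mozalloc_abort", "NS_DebugBreak", "abort"}


def signature(frames, max_frames=1):
    """Build a signature from the topmost frames that are not skipped."""
    kept = [frame for frame in frames if frame not in SKIPPED_FRAMES]
    if not kept:
        return "EMPTY: no usable frames"
    return " | ".join(kept[:max_frames])


def rank_signatures(reports):
    """Group reports by signature and rank signatures by report count."""
    counts = Counter(signature(frames) for frames in reports)
    return counts.most_common()


# Each report is a stack, topmost frame first (hypothetical data).
reports = [
    ["mozalloc_abort", "nsFoo::Bar", "main"],
    ["nsFoo::Bar", "main"],
    ["gfx::Draw", "main"],
]

for sig, count in rank_signatures(reports):
    print(count, sig)
```

With `max_frames=1`, two unrelated defects that crash in the same function share a signature, and one defect reached through two different functions gets two signatures. Ranking by count alone also ignores severity. Both are instances of the imperfections listed above.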

User crash report (manual)

Steps

  • A developer, QA person or advanced user hits a crash.
  • That person files a bug report.

Pros and cons

  • [+/-] We may or may not get clear steps to reproduce.
  • [+] We usually get a crash report as well.

Crash on automation

Steps

  • An automated non-TreeHerder job crashes, e.g. fuzzers, BugHunter.
  • A monitoring person (e.g. fuzzer, QA person) files a bug report.

Note: many tests are performed regularly on TreeHerder and are cause for immediate patch backout if they crash. Such crashes do not match this path because no bug report is filed.

Pros and cons

  • [+] We have a test case...
  • [-] ...but reproduction may be difficult or unreliable.

Analysis

Steps

  • A non-TreeHerder dynamic analysis job (e.g. Valgrind, ASan, TSan, UBSan) or static analysis job (e.g. Coverity) detects a possible crash defect. Or, a developer identifies a possible crash defect via code inspection.
  • A monitoring person files a bug report.

Note: some dynamic and static analysis jobs are performed regularly on TreeHerder and are cause for immediate patch backout if they fail. Crash defects found this way do not match this path because no bug report is filed.

Pros and cons

  • [+] The defect is usually well-identified and actionable.
  • [-] Not always clear how often the defect will cause crashes, though this depends on the analysis kind. (E.g. ASan is clearer than TSan.)

Paths from bug report to a diagnosis

As mentioned above, different kinds of bug report data have different likelihoods of leading to a diagnosis.

  • High: Reliable test case or steps to reproduce.
  • High: Well-identified defect, e.g. from analysis.
  • Lower: Unreliable test case or steps to reproduce.
  • Lower: Crash report.

Also, the following things increase the likelihood that a crash bug report will lead to a diagnosis.

  • A needinfo request of an appropriate developer.
  • Evidence that the crash has high severity or frequency.
  • Nagging.

Addressing a diagnosed crash

Once the cause of a crash in a bug report has been diagnosed, it can be addressed. The possible ways to address a crash -- and to prevent similar crashes in the future -- depend on what kind of defect it is.

A defect in our code

Many crashes are due to defects in our code. Ways to address them include: fix the defect; back out the defect; disable the defect; recover and continue (e.g. e10s content process restarting).

It's also possible to prevent entire classes of defects by using better code patterns and/or languages, or tools (e.g. dynamic analysis with sufficient coverage).

A defect in something we rely on

Some crashes are caused by defects in things we depend on: OS, gfx drivers, CPUs (e.g. the infamous AMD bug; overclocking can cause problems), RAM, disks. E.g. bug 1270232 and bug 1194856 are bugs where a bad installation might be the cause.

In general, we cannot fix or prevent these defects. Ways to address them include: avoid the defect; detect and tolerate the defect; detect the defect and inform the user.

A defective interaction

Some crashes are caused by defective interactions with other software such as malware and/or anti-virus products. Sometimes these can be blocked, and sometimes we must rely on evangelism.

OOMs

OOMs are different to other crashes in terms of cause, though not any different in terms of the effect on users. In general, anything that significantly reduces memory usage helps, as does making large allocations fallible, especially those whose size is controlled by user input or web content. Using 64-bit builds also prevents address space exhaustion.

OKRs

OKRs are "objectives and key results". Within the following sections, each sub-section represents an objective, and the bullet points within that sub-section represent key results. Some objective sub-sections have additional supporting information (marked with Extra:) such as a table of relevant bug reports.

These OKRs are for H2 2016.

In progress

These are OKRs that have people actively working on them now.

Establish crash rate metrics and targets

  • Verify that the crashes-per-1000-hour metric is reliable [digitarald]
  • Define and choose target crash rate(s) [digitarald, njn]
  • Implement dashboard(s) for easy monitoring [digitarald, ddurst]

Improve pre-release coverage

  • Get the number of Nightly users growing again [pascal, marcia]
  • Provide test exercises for QA people and keen Nightly users to run regularly [marcia] (One and Done has been decommissioned.)
  • ? Improve earlier channel populations to be more representative of the Release population [lonnen?]

Extra: bug 1280394 [tracking] Nightly adoption bugs

Improve analysis of crash reports

  • Clouseau: Automatically identify changesets that cause regressions [calixte]
  • [DONE] crash-correlations: Identify correlations for crash signatures [mcastelluccio]
  • ? Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs) [ekyle?]

Extra: Bugs blocking bug 1289676: [Uptime] Crash report handling

Full Query
ID Priority Summary Keywords Assigned to
974420 -- Addresses >128TB are displayed as 0xffffffffffffffff on crash-stats
977778 P3 Allow users to request get-minidump-instructions report on-demand
1268029 -- Use jit classifier to change signature for jit crashes [DEACTIVATED] Adrian Gaudebert
1273657 -- [tracker] Publish public crash stats to the data platform Peter Bengtsson [:peterbe]
1274345 -- Add support for skipping a dll in the signature [DEACTIVATED] Adrian Gaudebert
1274428 -- Mark crashes that happen at invalid instruction pointers?
1274628 P3 Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
1277337 P3 Use hg.mozilla.org to map crashes to bug components by way of source files when possible
1291173 -- Show important info from memory reports in crash-stats Nicholas Nethercote [inactive]
1297966 -- Show the new "StartupCrash" annotation in the crash report page Marco Castelluccio [:marco]
1305888 -- Add the new CPU microcode annotation to SuperSearch [DEACTIVATED] Adrian Gaudebert
1306891 -- Integrate correlation results from https://mozilla.github.io/stab-crashes/correlations.html on crash-stats Marco Castelluccio [:marco]
1308474 -- Add the new StartupCrash annotation to SuperSearch [DEACTIVATED] Adrian Gaudebert
1308476 -- Replace the current heuristic for startup crashes by using the new StartupCrash annotation [DEACTIVATED] Adrian Gaudebert

14 Total; 14 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Extra: A cross-variate analysis of FHR data, by Brendan Colloran, which may have useful techniques.

[DONE] Improve understanding of OOM causes

  • [DONE] Do a large-scale analysis of memory reports from OOM crashes [njn]
  • [DONE] Show important data from memory reports in crash-stats [njn]
  • ? Discuss common OOM cases with partners [digitarald?]

Extra: Bugs blocking bug 1291174: [Uptime] OOMs

Full Query
ID Priority Summary Keywords Assigned to
1291068 -- Large-scale analysis of OOM crash reports with ContainsMemoryReport=1
1291173 -- Show important info from memory reports in crash-stats Nicholas Nethercote [inactive]
1299747 -- Create a tool to track 64k unaligned virtual memory allocation Ting-Yu Chou [:ting] (away)

3 Total; 3 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Catch more crashes with static and dynamic analysis

  • Add additional, regular ASan, TSan, and Valgrind jobs to TreeHerder [jseward, sfink, tcole]
  • Experiment with TSan, UBSan and Application Verifier [jseward, tcole, cervantes]
  • Reduce the number of outstanding Coverity issues to zero and keep it there [bogdan]
  • ? Add more checks to the clang static analysis job [sledru?]

Extra: Bugs blocking bug 1289660: [Uptime] Static analysis

Full Query
ID Priority Summary Keywords Assigned to
1230156 P3 [meta] Coverity Static Analysis fixes coverity, meta
1272513 -- Enable -Wshadow warnings Chris Peterson [:cpeterson]

2 Total; 2 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Extra: Bugs blocking bug 1289646: [Uptime] Dynamic analysis

Full Query
ID Priority Summary Keywords Assigned to
929478 -- [meta] Make TSan (ThreadSanitizer) usable with Firefox meta, sec-want
1030826 -- Support AddressSanitizer builds on Windows with clang-cl sec-want
1280637 -- Add a TSan-enabled TaskCluster JS shell job Steve Fink [:sfink] [:s:]
1284975 P2 [meta] Make SpiderMonkey clean on UBSan meta
1288596 -- Make SM-tc(msan) a first class test job Steve Fink [:sfink] [:s:]
1288993 -- Run valgrind-mochitest twice a day as a Tier 2 job Joel Maher ( :jmaher ) (UTC -8)
1289994 P3 [meta] Use Application Verifier meta
1291954 P3 Make SM(tsan) a tier 1 build Steve Fink [:sfink] [:s:]

8 Total; 8 Open (100%); 0 Resolved (0%); 0 Verified (0%);

[DONE] Improve crash report comprehensibility

  • [DONE] Improve presentation of crash reports on crash-stats [njn, adngdb]
  • [DONE] Improve crash report documentation [njn]

Extra: Bugs blocking bug 1289675: [Uptime] Crash report comprehensibility

Full Query
ID Priority Summary Keywords Assigned to
1275799 -- Add descriptions to crash report fields [DEACTIVATED] Adrian Gaudebert
1288309 -- Improve documentation about individual crash reports Nicholas Nethercote [inactive]
1288310 -- Improve documentation about analyzing clusters of crash reports Nicholas Nethercote [inactive]

3 Total; 3 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Prevent common C++ defects with better types and annotations

  • Finish mozilla::Result<T,E> and use in SpiderMonkey [jandem]
  • Use MOZ_MUST_USE outside SpiderMonkey [njn, wcpan]

Extra: Bugs blocking bug 1289662: [Uptime] Low-level defect prevention and detection

Full Query
ID Priority Summary Keywords Assigned to
1268766 -- Use [[nodiscard]] everywhere
1272203 -- Add mozilla::NotNull to MFBT Nicholas Nethercote [inactive]
1276097 P5 Add a bytecode sanity check
1277368 P3 [meta] Use mozilla::Result<T, E> for fallible return values in the JS engine meta, triage-deferred Jan de Mooij [:jandem]

4 Total; 4 Open (100%); 0 Resolved (0%); 0 Verified (0%);


H2 2016 Backlog

These are unassigned OKRs that would be helpful and are higher priority.

Refine how crash clusters are ranked

  • Take into account crash severity and/or user distribution as well as frequency
  • Use Crystal Ball (or other means) to identify how pre-Release crashes will manifest on Release
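
One way to picture a ranking that uses severity and user distribution as well as frequency is the sketch below. Everything in it is hypothetical: the severity weights, the `Cluster` fields and the scoring formula are assumptions chosen for illustration, and no such formula has been agreed on.

```python
import math
from dataclasses import dataclass

# Hypothetical severity weights per crash kind.
SEVERITY_WEIGHT = {"startup": 3.0, "main": 1.0, "content": 0.5, "plugin": 0.25}


@dataclass(frozen=True)
class Cluster:
    signature: str
    crash_count: int      # frequency: number of crash reports
    installations: int    # user distribution: distinct installations affected
    kind: str             # used to look up the severity weight


def score(cluster: Cluster) -> float:
    """Combine severity, frequency and user distribution into one number.

    The geometric mean of crash count and installations keeps a single
    installation that crashes repeatedly from dominating the ranking.
    """
    spread = math.sqrt(cluster.crash_count * cluster.installations)
    return SEVERITY_WEIGHT.get(cluster.kind, 1.0) * spread


def rank(clusters):
    """Return clusters ordered from highest to lowest score."""
    return sorted(clusters, key=score, reverse=True)


clusters = [
    Cluster("A", crash_count=1000, installations=10, kind="content"),
    Cluster("B", crash_count=200, installations=200, kind="startup"),
    Cluster("C", crash_count=400, installations=100, kind="main"),
]

print([c.signature for c in rank(clusters)])
```

In this made-up data, ranking by crash count alone would put A first, even though it affects only ten installations and is the least severe kind. The combined score puts the start-up crash B first.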

Extra: Bugs blocking bug 1291174: [Uptime] Crash cluster ranking

Full Query
ID Priority Summary Keywords Assigned to
1303730 -- Consider changing the default precision for cardinality queries
1305016 -- In the Top Crashers list, add the number of installations that are generating crash reports for a given signature [DEACTIVATED] Adrian Gaudebert

2 Total; 2 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Extra: Relevant research papers.

Reduce shutdown hang rate

  • ?

Extra: Gecko:Shutdown_issues

Detect if Firefox is misinstalled

  • Implement an installation check mechanism
  • Provide guidance to the user when the check fails

Prevent crashes caused by malware and anti-virus on Windows

  • Utilize Windows 10 blocking protection for content processes
  • Reduce export space of xpcom symbols [bsmedberg]
  • ? Provide an API for A/V so they don't need to use dangerous tricks

Extra: bug 1299187 Tracking: stop exporting symbols such that 3rd-party software cannot use XPCOM effectively

Catch more crashes with fuzzing

  • Better isolate components so they can be fuzzed more easily (like the JS shell)
  • Increase fuzzing coverage of non-default options
  • Improve gtests, which are a good starting point for fuzzing
  • [DONE] Run BugHunter with common antivirus software [tomcat]

Extra: bug 1289194 Add LibFuzzer support for testing xul code

Future Backlog

These are unassigned OKRs that would be helpful and are lower priority.

Detect when crashes are caused by faulty RAM

  • Run some kind of live RAM test in the browser
  • Annotate crash reports when faulty RAM is detected

Extra: Bugs blocking bug 1289666: [Uptime] Defective hardware

Full Query
ID Priority Summary Keywords Assigned to
995652 P3 Run memtest from the crash reporter
1270554 -- Run memtest continuously on the live browser
1274428 -- Mark crashes that happen at invalid instruction pointers?
1274628 P3 Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
1281759 P3 Work around mysterious AMD JIT crashes crash Jan de Mooij [:jandem]
1293188 P3 Crash in EnterBaseline (can be defective hardware [Mem or VGA]) crash, triage-deferred
1293996 P1 Crash in adapt_probs crash, regression Jean-Yves Avenard [:jya]
1317253 P5 Best-effort detection of faulty memory at page-request time

8 Total; 8 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Improve crash report generation on the client

  • Record additional interesting pages of memory
  • Include more JS execution info (stacks; recently called JS file/functions; unresolved promises)
  • Include more URL data (URLs of all open tabs; recent URLs of the crashing tab; non-anonymized memory reports); might require policy/legal approval
  • Don't omit frame pointers?
  • Better identify threads on all platforms, including threadpools

Extra: Bugs blocking bug 1289663: [Uptime] Crash report creation

Full Query
ID Priority Summary Keywords Assigned to
1250687 P3 Consider MiniDumpWithPrivateWriteCopyMemory for Windows minidumps Carl Corcoran [:ccorcoran]
1251395 P3 Include non-JIT, executable, private pages in crash dumps
1277448 -- unify MozCrashReason and AbortMessage crash annotations (and use __func__ in them and for assertions)
1280469 P2 [meta] Client-side stack walking meta Gabriele Svelto [:gsvelto]
1286802 -- Add heap regions of the crash context to minidump (Windows) Cervantes Yu [:cyu] [:cervantes]
1295918 -- Include JS stacks in crash report
1295934 -- Add a crash report annotation when we hit DoneStartingUp Nicholas Nethercote [inactive]
1309573 -- Many places (including all of SpiderMonkey) cannot set the crash reason in crash reports Emanuel Hoogeveen [:ehoogeveen]
1334027 -- Add unloaded modules and process/thread data to minidumps Ting-Yu Chou [:ting] (away)
1337688 -- Remove NIGHTLY_BUILD wrapping if the increased size from adding unloaded modules and process/thread data to minidumps is acceptable
1351277 -- Add heap regions of the crash context to minidump (MacOS and Linux)
1488404 -- Consider including heap data in minidump for beta release channel

12 Total; 12 Open (100%); 0 Resolved (0%); 0 Verified (0%);

Increase crash report submission rate

  • Submit more crash reports from content process crashes
  • Always submit non-sensitive data, and make only the sensitive data part (e.g. minidumps) optional

Extra: Bugs blocking bug 1289671: [Uptime] Crash report submission

Full Query
ID Priority Summary Keywords Assigned to
1269998 P1 Prompt users with pending crash reports to submit them Brad Lassey [:blassey] (use needinfo?)
1270553 -- Allow users to opt-in to auto submitting crash data
1280469 P2 [meta] Client-side stack walking meta Gabriele Svelto [:gsvelto]
1287178 -- Refactor unsubmitted crash report handling and allow users to always send backlogged crash reports dev-doc-needed Mike Conley (:mconley) (:⚙️)
1333125 P4 Improve HTTP proxy support in the crashreporter client and pingsender

5 Total; 5 Open (100%); 0 Resolved (0%); 0 Verified (0%);


Data Sources and Useful Links

Participants

The grouping below is by time zone.

Taiwan

  • Kan-Ru Chen, platform engineering
  • Ting-Yu Chou, platform engineering
  • Cervantes Yu, platform engineering
  • Wei-Cheng Pan, platform engineering

Australia (Eastern)

  • Nicholas Nethercote, platform engineering

North America (Pacific)

  • Andrew McCreight, platform engineering
  • Lonnen, Socorro
  • David Baron, platform engineering

North America (Eastern)

  • Jim Chen, platform engineering
  • Kyle Lahnakoski, data visualization/analysis
  • Marcia Knous, Release Management
  • Milan Sreckovic, platform engineering
  • Benjamin Smedberg, Firefox engineering
  • David Durst, Firefox engineering

Europe (Western)

  • Marco Castelluccio, crash report analysis and automation

Europe (Central)

  • Calixte Denizet, crash report analysis and automation
  • Julian Seward, dynamic analysis
  • Sylvestre Ledru, static analysis, release management & stability
  • Gabriele Svelto, Firefox engineering

Europe (Eastern)

  • Andi Bogdan Postelnicu, static analysis

Meetings

We held meetings for several months, then switched to email updates because we deemed that to be a more effective use of everybody's time.

Here are the minutes of the meetings we have had.

Here is the meeting minutes template.

Communication channels

  • IRC: #uptime
  • Vidyo: Uptime
  • Email: There are two email lists, both of which are low-volume.
    • Public: The stability list. This list should be used for the vast majority of discussion. Anybody can join this list.
    • Private: projectuptime-team@mozilla.com. This is only for private discussion, e.g. discussions relating to commercially-sensitive information. Please email nnethercote if you want to join this list.

Nightly crash triage

This documentation has been moved to NightlyCrashTriage.