Platform/Uptime: Difference between revisions

m (name fixup)
 
(44 intermediate revisions by 4 users not shown)
Line 1: Line 1:
'''Project Uptime ran from April 2016 to June 2017.'''
''Preventing and fixing crashes remains an important task, and there is ongoing work there. But this work is no longer being coordinated under Project Uptime. This page is being kept in place as a historical record.''
Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.  
Project Uptime's goal is to reduce the crash rate of Firefox (desktop and mobile) and keep it down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.  


Line 182: Line 185:
=== Improve pre-release coverage ===
=== Improve pre-release coverage ===
* Get the number of Nightly users growing again [pascal, marcia]
* Get the number of Nightly users growing again [pascal, marcia]
* Provide test exercises for QA people and keen Nightly users to run regularly [marcia]
* <s>Provide test exercises for QA people and keen Nightly users to run regularly [marcia]</s> (One and Done has been [https://blog.mozilla.org/webqa/2016/09/28/one-and-done-is-being-decommissioned-october-2016/ decommissioned].)
* ? Improve earlier channel populations to be more representative of the Release population [lonnen?]
* ? Improve earlier channel populations to be more representative of the Release population [lonnen?]


Line 189: Line 192:
=== Improve analysis of crash reports ===
=== Improve analysis of crash reports ===
* [https://libraries.io/github/calixteman/clouseau Clouseau]: Automatically identify changesets that cause regressions [calixte]
* [https://libraries.io/github/calixteman/clouseau Clouseau]: Automatically identify changesets that cause regressions [calixte]
* [https://github.com/marco-c/crashcorrelations crash-correlations]: Identify correlations for crash signatures [mcastelluccio]
* {{mdone|}} [https://github.com/marco-c/crashcorrelations crash-correlations]: Identify correlations for crash signatures [mcastelluccio]
* ? Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs) [ekyle?]
* ? Provide ability to run custom analysis jobs on crash reports (similar to telemetry analysis jobs) [ekyle?]


Line 202: Line 205:
'''Extra:''' A [https://docs.google.com/presentation/d/1j-w1Mxgh7xQBPa57gdx_PSP2yf9yMJh8Oqu_7izReAY/edit#slide=id.p cross-variate analysis of FHR data], by Brendan Colloran, which may have useful techniques.
'''Extra:''' A [https://docs.google.com/presentation/d/1j-w1Mxgh7xQBPa57gdx_PSP2yf9yMJh8Oqu_7izReAY/edit#slide=id.p cross-variate analysis of FHR data], by Brendan Colloran, which may have useful techniques.


=== Improve understanding of OOM causes ===
=== {{mdone|}} Improve understanding of OOM causes ===
* Do a large-scale analysis of memory reports from OOM crashes [njn]
* {{mdone|}} Do a large-scale analysis of memory reports from OOM crashes [njn]
* Show important data from memory reports in crash-stats [njn]
* {{mdone|}} Show important data from memory reports in crash-stats [njn]
* ? Discuss common OOM cases with partners [digitarald?]
* ? Discuss common OOM cases with partners [digitarald?]


Line 237: Line 240:
</bugzilla>
</bugzilla>
=== Improve crash report comprehensibility ===
=== {{mdone|}} Improve crash report comprehensibility ===
* {{mdone|}} Improve presentation of crash reports on crash-stats [njn, adngdb]
* {{mdone|}} Improve presentation of crash reports on crash-stats [njn, adngdb]
* Improve crash report documentation [njn]
* {{mdone|}} Improve crash report documentation [njn]


'''Extra:''' Bugs blocking {{bug|1289675}}: [Uptime] Crash report comprehensibility
'''Extra:''' Bugs blocking {{bug|1289675}}: [Uptime] Crash report comprehensibility
Line 249: Line 252:
</bugzilla>
</bugzilla>
=== Prevent common C++ defects with better types and annotations ===
=== Prevent common C++ defects with better types and annotations ===
* Finish mozilla::Result<T,E> and use in SpiderMonkey [jorendorff]
* Finish mozilla::Result<T,E> and use in SpiderMonkey [jandem]
* Use MOZ_MUST_USE outside SpiderMonkey [njn]
* Use MOZ_MUST_USE outside SpiderMonkey [njn, wcpan]


'''Extra:''' Bugs blocking {{bug|1289662}}: [Uptime] Low-level defect prevention and detection
'''Extra:''' Bugs blocking {{bug|1289662}}: [Uptime] Low-level defect prevention and detection
Line 262: Line 264:
</bugzilla>
</bugzilla>
== H2 2016 Backlog ==
== H2 2016 Backlog ==


Line 270: Line 273:
* Use [https://github.com/vvuk/crystalball Crystal Ball] (or other means) to identify how pre-Release crashes will manifest on Release
* Use [https://github.com/vvuk/crystalball Crystal Ball] (or other means) to identify how pre-Release crashes will manifest on Release


'''Extra:''' Bugs blocking {{bug|1291174}}: [Uptime] Crash cluster ranking
<bugzilla>
{
  "blocked": "1309114",
  "include_fields": "id, priority, summary, keywords, assigned_to"
}
</bugzilla>
'''Extra:''' Relevant research papers.
'''Extra:''' Relevant research papers.
* http://ieeexplore.ieee.org/document/5711013/
* http://ieeexplore.ieee.org/document/5711013/
Line 353: Line 364:
* [https://docs.google.com/document/d/14nnlEiyl80qyirhIqzCHCDz9zo0R72wDorvmB8jctgs/ Fuzzing Explained - A Guide for Software Developers]
* [https://docs.google.com/document/d/14nnlEiyl80qyirhIqzCHCDz9zo0R72wDorvmB8jctgs/ Fuzzing Explained - A Guide for Software Developers]
* Crash rate dashboards
* Crash rate dashboards
** [https://crash-analysis.mozilla.com/release-mgmt/crash-report-tools/longtermgraph/ Crash rate history]: old ADI-based view
** [https://telemetry.mozilla.org/crashes/ Stability dashboard]: new view, intended to replace arewestableyet.com [chutten]
** [https://health.graphics/crashes ADI & Telemetry Crash Rate (Release)] [harald]
** [https://health.graphics/crashes ADI & Telemetry Crash Rate (Release)] [harald]
** [https://health.graphics/crashes/beta Telemetry Crash Rate (Beta)] [harald]
** [https://health.graphics/crashes/beta Telemetry Crash Rate (Beta)] [harald]
** [https://bsmedberg.github.io/telemetry-dashboard/crashes/office-dashboard.html Crashes per 1000 usage hours (Beta, DevEd, Nightly)] [bsmedberg]
** [https://bsmedberg.github.io/telemetry-dashboard/crashes/office-dashboard.html Crashes per 1000 usage hours (Beta, DevEd, Nightly)] [bsmedberg]
** {{bug|1324528}} - Bug about removing old dashboards.
** {{bug|1324526}} - Bug about the new dashboards we want.
* [https://dataviz.mozilla.org/views/PlatformVersionFirefoxADI/DesktopADIbyPlatform Firefox ADI dashboard] (requires Tableau/dataviz privileges to view)


= Participants =
= Participants =
Line 372: Line 388:
North America (Pacific)
North America (Pacific)
* Andrew McCreight, platform engineering
* Andrew McCreight, platform engineering
* Chris Lonnen, Socorro
* Lonnen, Socorro
* David Baron, platform engineering
* David Baron, platform engineering


Line 390: Line 406:
* Julian Seward, dynamic analysis
* Julian Seward, dynamic analysis
* Sylvestre Ledru, static analysis, release management & stability
* Sylvestre Ledru, static analysis, release management & stability
* Gabriele Svelto, Firefox engineering


Europe (Eastern)
Europe (Eastern)
Line 396: Line 413:
= Meetings =
= Meetings =


Meetings are every two weeks. Because Uptime participants span so many timezones, there are two meetings. People should attend the meeting that best suits their timezone.
We held meetings for several months, then switched to email updates because we deemed that to be a more effective use of everybody's time.


Meetings will take place in the [https://v.mozilla.com/flex.html?roomdirect.html&key=tGTDjguBXn29Ldaww7BCeVhp4M Uptime Vidyo room].
Here are the minutes of the meetings we have had.
 
== Meeting A ==
 
This meeting is at 9am US (Pacific) time. The times for Europe may shift by one hour during the transition to/from summer time.
 
{| class="wikitable"
! North America (Pacific)
! North America (Eastern)
! Europe (Western)
! Europe (Central)
! Europe (Eastern)
|-
| Monday 9am
| Monday 12pm
| Monday 5pm
| Monday 6pm
| Monday 7pm
|}
 
== Meeting B ==
 
This meeting is at 9am Taiwan time. Taiwan does not observe daylight saving, so the meeting time changes in other locations that do. The table below shows the resulting times for the two main segments of the year. The time when the meeting time changes for the non-Taiwanese locations depends on when they enter/leave summer time.
 
{| class="wikitable"
!
! Taiwan
! Australia (Eastern)
! North America (Pacific)
! North America (Eastern)
|-
| Northern summer<br>(e.g. July)
| Tuesday 9am
| Tuesday 11am
| Monday 6pm
| Monday 9pm
|-
| Northern winter<br>(e.g. January)
| Tuesday 9am
| Tuesday 12pm
| Monday 5pm
| Monday 8pm
|}
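
The Meeting B table can be reproduced with a short time zone conversion. This is only a sketch: the helper name <code>local_meeting_times</code> is hypothetical, the two sample dates are arbitrary Tuesdays chosen for illustration, and mapping "Australia (Eastern)" to <code>Australia/Sydney</code> is an assumption (it is an eastern Australian zone that observes daylight saving, which matches the shift shown in the table).

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# Assumed mapping from the table's column names to IANA time zone names.
ZONES = {
    "Taiwan": "Asia/Taipei",
    "Australia (Eastern)": "Australia/Sydney",
    "North America (Pacific)": "America/Los_Angeles",
    "North America (Eastern)": "America/New_York",
}


def local_meeting_times(year, month, day):
    """Convert a 9am Taiwan meeting on the given date to each location.

    Hypothetical helper for illustration; not part of any Uptime tooling.
    """
    taipei = datetime(year, month, day, 9, 0, tzinfo=ZoneInfo("Asia/Taipei"))
    return {
        place: taipei.astimezone(ZoneInfo(name)).strftime("%A %H:%M")
        for place, name in ZONES.items()
    }


# One Tuesday in the northern summer and one in the northern winter.
for label, date in [("Northern summer", (2016, 7, 12)),
                    ("Northern winter", (2017, 1, 10))]:
    print(label, local_meeting_times(*date))
```

Because the Taiwan time is fixed, any difference between the two printed rows comes entirely from the other locations entering or leaving summer time.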
 
== Minutes ==
 
Meeting minutes are taken so people who are unable to attend a meeting can know what happened, and also to record action items. Minutes are taken in etherpads to allow multiple people to edit them during the meeting. The following etherpad contains a minutes template that can be copied into a new etherpad for each meeting. After doing so, please (a) fill in the date, and (b) copy action items from the previous meeting's minutes into the relevant section of the new meeting's minutes.
 
* [https://public.etherpad-mozilla.org/p/uptime-template Minutes template]
 
Please write minutes so they are comprehensible to people who were not at the meeting. In particular, links to bug reports, project pages, etc., are very helpful. Prior to the meeting, the minutes document can serve as the agenda; please feel free to add items ahead of time.


* [https://public.etherpad-mozilla.org/p/uptime20170410 2017-04-10 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170327 2017-03-27 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170313 2017-03-13 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170227 2017-02-27 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20170213 2017-02-13 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161219 2016-12-19 minutes]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_general Hawaii minutes: general]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_windows Hawaii minutes: Windows third-party crashes]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_generation Hawaii minutes: client-side crash report generation]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_analysis Hawaii minutes: crash report analysis]
* [https://public.etherpad-mozilla.org/p/uptime_hawaii_metrics Hawaii minutes: crash metrics]
* [https://public.etherpad-mozilla.org/p/uptime20161121 2016-11-21 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161107 2016-11-07 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161024 2016-10-24 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20161010 2016-10-10 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160926 2016-09-26 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160926 2016-09-26 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160912 2016-09-12 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160912 2016-09-12 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160829 2016-08-29 minutes]
* [https://public.etherpad-mozilla.org/p/uptime20160829 2016-08-29 minutes]


== Follow-up email threads ==
Here is the meeting minutes template.


Follow-up threads are encouraged on the email list, for discussing things that were not clear from the meeting minutes. (Updating the minutes with additional information based on these meetings is also encouraged.)
* [https://public.etherpad-mozilla.org/p/uptime-template Minutes template]


= Communication channels =
= Communication channels =


* IRC: #uptime
* IRC: #uptime
* Email: projectuptime-team@mozilla.com
* Vidyo: [https://v.mozilla.com/flex.html?roomdirect.html&key=tGTDjguBXn29Ldaww7BCeVhp4M Uptime]
* Vidyo: [https://v.mozilla.com/flex.html?roomdirect.html&key=tGTDjguBXn29Ldaww7BCeVhp4M Uptime]
* Email: There are two email lists, both of which are low-volume.
** Public: The [https://mail.mozilla.org/listinfo/stability stability] list. This list should be used for the vast majority of discussion. Anybody can join this list.
** Private: <code>projectuptime-team@mozilla.com</code>. This is only for private discussion, e.g. discussions relating to commercially-sensitive information. Please email nnethercote if you want to join this list.


= Nightly crash triage =
= Nightly crash triage =


We aim to analyze the crashes for every Nightly build.
''This documentation has been moved to [[NightlyCrashTriage]].''
 
== Roster ==
 
Nightly builds are produced at 3am each day (California time).
 
* Monday (Australian time): njn analyzes Friday's build.
* Monday (US East time): marcia analyzes Saturday's build.
* Tuesday (Taiwan time): ting analyzes Sunday's build.
* Wednesday (Taiwan time): kanru analyzes Monday's build.
* Wednesday (US East time): jchen analyzes Tuesday's build.
* Thursday (US West time): mccr8 analyzes Wednesday's build.
* Friday (US East time): jchen analyzes Thursday's build.
 
A [https://calendar.google.com/calendar/b/1/embed?src=mozilla.com_37e791c3iohijr18mi02l0dvi0@group.calendar.google.com&ctz=America/Los_Angeles live calendar] is also available for Mozilla employees. Please use it to schedule deviations from the usual roster, e.g. for PTO.
 
== Notes ==
 
Triage notes are kept in the following pages.
 
* [[Platform/Uptime/NightlyCrashTriage/2016Q3|2016 Q3]]
* [[Platform/Uptime/NightlyCrashTriage/2016Q2|2016 Q2]]
 
Use the date you are doing the triage, rather than the date of the build, to decide which page to put your notes in. The reason for this is that the triage date has a heading, which makes it more prominent in the notes than the build date.
 
== Data sources, tools, documentation ==
 
Crucial links
* [http://dbaron.org/mozilla/crashes-by-build Nightly and Aurora crashes by build]
* [https://mozilla.github.io/stab-crashes/correlations.html The new correlation reports]
 
Other links
* [https://crash-stats.mozilla.com/search/?product=Firefox&_facets=moz_crash_reason&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=moz_crash_reason#facet-moz_crash_reason All MOZ_CRASH] crashes.
* [http://bsmedberg.github.io/socorro-toolbox/html/multiple-minidumps.html Display stacks from multi-dump hangs/crashes]
* [https://crash-analysis.mozilla.com/crash_analysis/ The old, busted correlation reports]
 
Documentation
* A [[Platform/Uptime/NightlyCrashAnalysis|rough guide to Nightly crash analysis]].
* [https://developer.mozilla.org/en-US/docs/Understanding_crash_reports$edit Understanding crash reports]