Platform/Uptime: Difference between revisions

Added Goals Brainstorm section from internal communications and added communications specifics (IRC, mailing list, vidyo)
(→‎Reactive: added 'Gather more runtime data' suggestion)
Uptime is a project that aims to improve Firefox's stability, i.e. reduce its crash rate.
Uptime's mission is to not only reduce the crash rate of Firefox (desktop and mobile) but also to keep that crash rate down. This project is a Platform Engineering initiative that aims to extend and complement existing work relating to stability within Mozilla.  


== Goals Brainstorm ==
We can break Uptime's mission statement down into four sub-goals.
 
1 Reduce the Release crash rate, and keep it down.
* This gives a better experience for ~100M/7M users.
* Which makes them less likely to switch to a different browser.
 
2 Reduce the Beta crash rate, and keep it down.
* This gives a better experience for ~2M/150k users.
* Which makes them less likely to switch to a different browser or a later channel.
* This goal partially subsumes goal #1.
 
3 Reduce the Aurora crash rate, and keep it down.
* This gives a better experience for ~100k/2k users.
* Which makes them less likely to switch to a different browser or a later channel.
* This goal partially subsumes goals #1 and #2.
 
4 Reduce the Nightly crash rate, and keep it down.
* This gives a better experience for ~30k/1k users.
* Which makes them less likely to switch to a different browser or a later channel.
* This goal partially subsumes goals #1, #2 and #3.
 
 
(Channel ADI numbers are approximate, taken from http://arewestableyet.com on Apr 26.)
 
 
These sub-goals are listed in order of priority. But note that the relative
priorities do not match the relative user counts. E.g. even though Release
has thousands of times as many users as Nightly, goal #1 is certainly not
thousands of times more important than goal #4. The higher-numbered goals,
as well as being (decreasingly) virtuous in their own right, partially
subsume the lower-numbered goals, because a crash fixed or avoided in an
earlier channel (e.g. Nightly) is one that doesn't show up in a later channel
(e.g. Release). (The higher-numbered goals don't fully subsume the
lower-numbered goals because some crashes don't show up in significant
numbers until we get to the larger populations of the later channels.)
 
There are multiple strategies (tools, processes, systems) to achieve these
goals, but those strategies should not be confused with the goals themselves.
 
=== Specifics to Consider ===
 
What is the definition of a "crash"?
* Definitely: Obvious "true" crashes, e.g. segmentation fault.
* Definitely: Deliberate aborts, e.g. MOZ_CRASH.
* Probably: hangs
* Not sure: plugins?
 
What is the definition of "crash rate"?
 
* http://arewestableyet.com has the following, all measured per 100 ADI:
** browser + content processes
** browser crashes
** startup crashes
** content crashes
** plugin crashes
** plugin hangs
* unclear which of these are subsets of or overlap others
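
As a point of reference for the "per 100 ADI" unit above, here is a minimal sketch of the normalisation, assuming ADI means active daily installs and that a crash count and an ADI figure are available for the same day and channel. The function name and the sample numbers are hypothetical and are not taken from arewestableyet.com.

```python
def crashes_per_100_adi(crash_count: int, adi: int) -> float:
    """Normalise a raw crash count to crashes per 100 ADI.

    Assumption: crash_count and adi cover the same day, channel and product.
    """
    if adi <= 0:
        raise ValueError("adi must be positive")
    return 100.0 * crash_count / adi


# Hypothetical figures, for illustration only.
example_rate = crashes_per_100_adi(crash_count=1_500, adi=100_000)
print(f"{example_rate:.2f} crashes per 100 ADI")
```

Normalising by ADI is what makes rates comparable across channels whose populations differ by orders of magnitude.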
 
* possible weightings:
** weight start-up crashes more heavily than other crashes?
*** if persistent, they almost guarantee a lost user
*** can we distinguish "occasional start-up" crashes from "persistent start-up" crashes?
** e10s: weight chrome process crashes more heavily than content process crashes?
*** content process crashes don't take down the whole browser
** weight plugin crashes less?
*** it's more obvious that it's not Firefox's fault
*** that's if we are counting them at all?
** weight shutdown hangs/crashes less?
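
One way to make the weighting ideas above concrete is a weighted crash rate. The sketch below is only an illustration: the category names and every weight value are hypothetical placeholders (the list above proposes directions, not numbers), and it assumes the categories are disjoint, which the arewestableyet.com categories may not be.

```python
from typing import Mapping

# Hypothetical weights: only their relative ordering reflects the
# suggestions above (start-up heavier; content, plugin, shutdown lighter).
EXAMPLE_WEIGHTS = {
    "startup": 2.0,
    "browser": 1.0,
    "content": 0.5,
    "plugin": 0.25,
    "shutdown": 0.25,
}


def weighted_crash_rate(
    counts: Mapping[str, int],
    adi: int,
    weights: Mapping[str, float] = EXAMPLE_WEIGHTS,
) -> float:
    """Return a weighted crash rate per 100 ADI.

    Assumption: each crash is counted in exactly one category.
    """
    if adi <= 0:
        raise ValueError("adi must be positive")
    unknown = set(counts) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for: {sorted(unknown)}")
    weighted_total = sum(weights[kind] * n for kind, n in counts.items())
    return 100.0 * weighted_total / adi


# Hypothetical daily counts, for illustration only.
sample_counts = {"startup": 10, "browser": 100, "content": 40, "plugin": 20}
print(weighted_crash_rate(sample_counts, adi=10_000))
```

Setting a weight to zero is equivalent to not counting that category at all, which covers the open question about plugin crashes.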
 
What are the target crash rates?
* Eight of them: {Release, Beta, Aurora, Nightly} x {desktop, mobile}
* No idea what actual values to choose here
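
The eight targets can be laid out as a channel-by-platform grid. The sketch below is a hypothetical placeholder structure: every value is left as None because no target values have been chosen yet, and the helper name is illustrative only.

```python
from itertools import product

CHANNELS = ("Release", "Beta", "Aurora", "Nightly")
PLATFORMS = ("desktop", "mobile")

# One target per (channel, platform) pair; None means "not yet decided".
targets = {pair: None for pair in product(CHANNELS, PLATFORMS)}


def unset_targets(table):
    """List the (channel, platform) pairs that still lack a target value."""
    return [pair for pair, value in table.items() if value is None]


print(len(targets), "targets,", len(unset_targets(targets)), "still undecided")
```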


== Strategies for improvement ==
Things to be decided.


* Regular meetings: time, frequency and content TBD
* Progress tracking: high-level stability measurements, bug lists, etc. Links to those.
* Communication:
** IRC: #uptime
** Email: projectuptime-team@mozilla.com
** Vidyo: ProjectUptime


=== Crash stats checking rosters ===