Performance/Fenix

This page describes some basics of Fenix performance. For an in-depth look at some specific topics, see:
* [[Performance/Fenix/Best Practices|Best Practices]] for tips on writing performant code
* [[Performance/Fenix/Profilers and Tools|Profilers and Tools]] for a comparison of profiling and benchmarking tools
* [[Performance/Fenix/Performance reviews|Performance reviews]] for how to know if your change impacts performance

== Critical flows ==
Any area of the app where users might retain significantly less if performance regressed is considered a '''critical flow'''. For example, if pages took twice as long to load, we might expect some users to stop using the app.

When analyzing a critical flow for performance issues, it's essential to know what code is relevant. The '''endpoints''' tell you when the flow starts and stops. When using the profiler, it's common to constrain the timeline to these endpoints. The '''bottleneck''' is the resource (e.g. CPU, disk) that is maxed out and preventing the device from reaching the flow's final endpoint sooner (e.g. removing expensive computation may not improve performance if the bottleneck is the disk).

Our critical flows, their endpoints, and their bottlenecks are listed below.
 
=== Start up ===
There are a few different types of start up: see the [[#Glossary|Glossary]] for definitions.
 
* COLD MAIN (to homescreen) start
**Endpoints: process start and visual completeness (i.e. the homescreen is fully drawn)
***Neither endpoint is available in the profiler. Surrogates are <code>Application.onCreate</code> or <code>*Activity.onCreate</code> to when the first frame is drawn (the latter two have profiler markers); a sketch of this surrogate measurement appears at the end of this section
**Bottleneck: we believe it's the main thread
**Misc: a possibly important event for perceived performance is when the first frame is drawn
 
* WARM MAIN (to homescreen) start
**Endpoints: <code>MigrationDecisionActivity.onCreate</code> (Beta & Release builds) or <code>HomeActivity.onCreate</code> (Nightly & debug builds) and visual completeness.
**Bottleneck: see COLD MAIN
**Misc: see COLD MAIN
 
* COLD VIEW start
**Endpoints: process start and <code>GeckoSession.load</code>
***The latter endpoint is available as a profiler marker
**Bottleneck: unknown
**Misc: a possibly important event for perceived performance is when the first frame is drawn
 
* WARM VIEW start
**Endpoints: <code>IntentReceiverActivity.onCreate</code> and <code>GeckoSession.load</code>
**Bottleneck: see COLD VIEW
**Misc: see COLD VIEW
 
In addition to these types of start up, there are many states/configurations a client can start up with: for example, they can have FxA set up or not, and they may have thousands of bookmarks or none. We haven't yet found any of these to have a significant impact on performance but, to be honest, we haven't investigated deeply yet.
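
To make the surrogate COLD MAIN endpoints above concrete, here is a minimal sketch of logging the time from <code>Application.onCreate</code> to the first frame drawn. This is an illustration only, not Fenix's actual instrumentation; the class names and log tag are made up.

<syntaxhighlight lang="kotlin">
// Illustrative sketch: log a surrogate for COLD MAIN start up duration, i.e. the time
// from Application.onCreate to the first frame drawn. Not Fenix's actual code.
import android.app.Activity
import android.app.Application
import android.os.Bundle
import android.os.SystemClock
import android.util.Log
import android.view.ViewTreeObserver

class SampleApplication : Application() {
    companion object {
        // Surrogate for "process start": the earliest app-controlled code that is easy to instrument.
        var appCreateMillis: Long = -1L
    }

    override fun onCreate() {
        super.onCreate()
        appCreateMillis = SystemClock.elapsedRealtime()
    }
}

class SampleHomeActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // setContentView(...) omitted in this sketch.
        val decorView = window.decorView
        decorView.viewTreeObserver.addOnPreDrawListener(object : ViewTreeObserver.OnPreDrawListener {
            override fun onPreDraw(): Boolean {
                decorView.viewTreeObserver.removeOnPreDrawListener(this)
                val elapsed = SystemClock.elapsedRealtime() - SampleApplication.appCreateMillis
                // Surrogate for "first frame drawn"; visual completeness usually happens later.
                Log.i("StartupSketch", "Application.onCreate -> first frame: $elapsed ms")
                return true
            }
        })
    }
}
</syntaxhighlight>

Keep in mind that this measures to the first frame, not visual completeness, so it will undercount what users perceive.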
 
=== Page load ===
TODO
 
=== Search experience ===
TODO


== Performance testing ==
Performance tests can have a goal of preventing regressions, measuring absolute performance as experienced by users, or measuring performance vs. a baseline (e.g. comparing fenix to fennec). It can be difficult to write tests that serve all of these goals at once. We tend to focus on preventing regressions.


=== Dashboards ===
We have a series of dashboards that we review during our biweekly performance sync meeting. The dashboards may be unstable and '''may be difficult to interpret, so be cautious when drawing conclusions from the results.''' A shortcoming is that we only run these tests on Nightly builds. Here are the current dashboards:
* [https://earthangel-b40313e5.influxcloud.net/d/DfK1IhzGz/fenix-startup-testing-per-device?orgId=1&refresh=1d Start up duration]: this represents COLD MAIN (app icon launch) to <code>reportFullyDrawn</code>/visual completeness and COLD VIEW (app link) to page load complete (i.e. it includes a variable-duration network call) across a range of devices. We're trying to replace this with more stable tests. A sketch of how <code>reportFullyDrawn</code> is emitted follows this list
* [https://earthangel-b40313e5.influxcloud.net/d/uYAfY3eGk/fenix-page-load-speedindex-geomean?orgId=1 Page load duration]: we're iterating on the presentation to make this more useful. More complex visualizations are available [https://earthangel-b40313e5.influxcloud.net/d/uYAfY3eGk/fenix-page-load-speedindex-geomean?orgId=1&search=open&folder=current in the grafana folder], such as [https://earthangel-b40313e5.influxcloud.net/d/ZmV33sqMk/fenix-page-load-pixel-2-vizrange-combined?orgId=1 the tests for Pixel 2]
* App size: via Google Play
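
For reference, the <code>reportFullyDrawn</code> signal the start up dashboard measures to is emitted by the app itself. The following is a generic Android sketch, not Fenix's implementation; the activity and helper names are hypothetical.

<syntaxhighlight lang="kotlin">
// Generic sketch of emitting reportFullyDrawn, the signal the start up dashboard measures to.
import android.app.Activity
import android.os.Bundle

class HomescreenActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // setContentView(...) and real async loading omitted in this sketch.
        loadHomescreenContent {
            // Only report once the content the user cares about is actually on screen;
            // calling this directly in onCreate would make the dashboard numbers meaninglessly small.
            reportFullyDrawn()
        }
    }

    // Hypothetical helper standing in for whatever async work populates the homescreen.
    private fun loadHomescreenContent(onVisuallyComplete: () -> Unit) {
        onVisuallyComplete()
    }
}
</syntaxhighlight>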
 
=== Unmonitored tests running in fenix ===
In addition to the tests we actively look at above, there are other tests that run in mozilla-central on fenix or the GeckoView example app. '''We're not sure who looks at these.''' The perftest team is working to dynamically generate the list of tests that run. Some progress can be seen in [https://sql.telemetry.mozilla.org/queries/77734/source this query] and [https://treeherder.mozilla.org/perfherder/tests this treeherder page]. Until then, we manually list the tests below.


As of Feb. 23, 2021, we run at least the following performance tests on fenix:
* Additional page load duration tests: see the query above for a list of sites (sometimes run in automation, sometimes run manually; todo: details)
* Media playback tests (TODO: details; in the query above, they are prefixed with ytp)
* Start up duration (see the [[#Glossary|Glossary]] for start up type definitions):
** COLD VIEW tests on mach perftest. These run per master merge to fenix on unreleased Nightly builds so we can identify the commit that caused a regression
** COLD MAIN & VIEW tests on FNPRMS. These run nightly on production Nightly builds and are being transitioned out in favor of mach perftest
* Speedometer: JS responsiveness tests (todo: details)
* Tier 3 Unity WebGL tests (todo: details)

There are other tests that run on desktop that will cover other parts of the platform.

Notable gaps in our test coverage include:
* Duration testing for front-end UI flows such as the search experience
* Testing on non-Nightly builds (does this apply outside of start up?)
 
== Offense vs. defense ==
TODO: merge this into the sections below and clean up those sections; not sure how useful it is.

Performance can be thought of in terms of "Offense" – the changes that you make to actively improve performance – and "Defense" – the systems you have in place to prevent performance regressions (this offense/defense idea comes from [https://medium.com/@ricomariani/dos-and-don-ts-for-performance-teams-7f52c41b5355?source=rss-e6e91dab0708------2 this blog post]).


== Preventing regressions automatically ==
We use the following measures:
* Crash on main thread IO in debug builds using <code>StrictMode</code> ([https://github.com/mozilla-mobile/fenix/blob/13f33049122e0f06c026632812dee405360c53b0/app/src/main/java/org/mozilla/fenix/StrictModeManager.kt#L63-L69 code])
* Use [https://searchfox.org/mozilla-mobile/rev/3af703be7790ff00f78d15465f3b8bb0fde0dccc/fenix/app/src/androidTest/java/org/mozilla/fenix/perf/StartupExcessiveResourceUseTest.kt#103 our StartupExcessiveResourceUseTest], for which we are Code Owners, to:
** Avoid StrictMode suppressions
** Avoid <code>runBlocking</code> calls
** Avoid additional component initialization
** Avoid increasing the view hierarchy depth
** Avoid having ConstraintLayout as a RecyclerView child
** Avoid increasing the number of inflations
* Use lint to avoid multiple ConstraintLayouts in the same file ([https://searchfox.org/mozilla-mobile/rev/3af703be7790ff00f78d15465f3b8bb0fde0dccc/fenix/mozilla-lint-rules/src/main/java/org/mozilla/fenix/lintrules/perf/ConstraintLayoutPerfDetector.kt code])

=== Discouraging use of expensive APIs ===
In some cases, we want to discourage folks from using expensive APIs such as <code>runBlocking</code>. As a first draft solution, we propose a multi-step check:
# '''Compile-time check throughout the codebase:''' write a code-ownered test asserting the number of references to the API.
## ''Question: given the lint rule, should we just count the number of <code>@Suppress</code> annotations for this?''
## ''Question: would it help if this was an annotation processor on our lint rule and we look for <code>@Suppress</code>?''
## '''Add a lint rule to discourage use of the API.''' This overlaps with the compile-time check, however:
### We can't rely on the compile-time check alone because, in the best case, it only runs before the git push – it won't appear in the IDE – and the feedback loop will be too long for devs
### We can't rely on the lint rule alone because it can be suppressed and we won't notice
# '''Run-time check on critical paths:''' wrap the API and increment a counter each time it is called. For each critical path (e.g. start up, page load), write a code-ownered test asserting the number of calls to the API (see the sketch after this list).
## ''Question: is this too "perfect is the enemy of the good"?''
## If you're doing this on a built-in API, you'll need to ban use of the old API, e.g. with a ktlint rule, since it's harder to suppress
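
The run-time check could look something like the sketch below. This illustrates the idea only and is not Fenix's actual implementation; the <code>runBlockingCounted</code> and <code>RunBlockingCounter</code> names are hypothetical.

<syntaxhighlight lang="kotlin">
// Hypothetical sketch of the "run-time check": wrap the expensive API so every call
// increments a counter that a code-ownered test can assert on for critical paths.
import java.util.concurrent.atomic.AtomicInteger
import kotlin.coroutines.CoroutineContext
import kotlin.coroutines.EmptyCoroutineContext
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.runBlocking

object RunBlockingCounter {
    val count = AtomicInteger(0)
}

// Call sites use this wrapper instead of runBlocking directly; a lint or ktlint rule
// would ban direct runBlocking use so calls can't bypass the counter.
fun <T> runBlockingCounted(
    context: CoroutineContext = EmptyCoroutineContext,
    block: suspend CoroutineScope.() -> T
): T {
    RunBlockingCounter.count.incrementAndGet()
    return runBlocking(context, block)
}
</syntaxhighlight>

A code-ownered test for a critical path would then assert that <code>RunBlockingCounter.count</code> has not grown beyond the known value after the flow completes, forcing a conversation when it changes.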


== App start up ==
=== Defense ===
The FE perf team has the following measures in place to prevent regressions:
* Show long-term start up trends with Nightly performance tests (note: these are not granular enough to practically identify & fix regressions)
* Prevent main thread IO by:
** Crashing on main thread IO in debug builds using <code>StrictMode</code> ([https://github.com/mozilla-mobile/fenix/blob/13f33049122e0f06c026632812dee405360c53b0/app/src/main/java/org/mozilla/fenix/StrictModeManager.kt#L63-L69 code])
** Preventing <code>StrictMode</code> suppressions by running tests that assert on the current known suppression count. We are Code Owners for the tests, so a discussion will happen if the count changes ([https://github.com/mozilla-mobile/fenix/blob/13f33049122e0f06c026632812dee405360c53b0/app/src/androidTest/java/org/mozilla/fenix/ui/StrictModeStartupSuppressionCountTest.kt#L48-L57 code])
* Have the performance team review code added to the start up path:
** We are Code Owners for a few files

==== WIP ====
We're working on adding:
* Regression testing per master merge ([https://github.com/mozilla-mobile/perf-frontend-issues/issues/162 issue])
* Prevention of main thread IO via:
** Static analysis to prevent <code>runBlocking</code>, which can circumvent <code>StrictMode</code> ([https://github.com/mozilla-mobile/fenix/issues/15278 issue])
* Performance-team review of more of the start up path:
** We're investigating other files that can be separated out so we can be code owners for the start up parts ([https://github.com/mozilla-mobile/fenix/issues/15274 issue])
* Minimized component initialization:
** Avoiding unnecessary initialization ([https://github.com/mozilla-mobile/fenix/issues/15279 issue])
* Prevention of unnecessarily expensive UI via:
** NestedConstraintLayout static analysis ([https://github.com/mozilla-mobile/fenix/issues/15280 issue])

=== Offense ===
TODO: improve this section, if useful (let us know if it is)

We're keeping a list of the biggest known performance improvements we can make. We also have a startup improvement plan.

== How to measure what users experience ==
When analyzing performance, it's critical to measure the app as users experience it. This section describes how to do that and avoid pitfalls. Note: our automated measurement tools, such as the [https://github.com/mozilla-mobile/perf-tools/blob/main/measure_start_up.py <code>measure_start_up.py</code> script], will always use our most up-to-date techniques, while this page may get outdated. Prefer the automated systems if practical and read their source if you have questions!

When measuring performance manually, you might follow a pattern like the following (see the footnotes for explanations):
* Configure your device and build. Use:
** a low-end device<sup>1</sup> (a Samsung Galaxy A51 is preferred)
** a <code>debuggable=false</code> build such as Nightly<sup>2</sup> (see the sketch after the footnotes for a way to double-check this)
** any compile-time options that are enabled in the production app (e.g. Sentry, Nimbus, Adjust, etc.)<sup>3</sup>
* Warm-up run:
** Start the app, especially after an installation<sup>4</sup>
** Wait at least 60 seconds<sup>5</sup>. To be extra safe, wait 2 minutes.
** Set the state of the app as you want to test it (e.g. clear onboarding)
** Force-stop the app (to make sure you're measuring at least the 2nd run after installation)
* Measure or test:
** Start the app and measure what you want to measure
** If you force-stop the app, wait a few seconds before starting it again to let the device settle
** If you're testing code that waits for Gecko initialization (e.g. page loads) and need to force-stop the app before measuring, make sure to 1) load a page and 2) wait 15 seconds before force-stopping the app<sup>6</sup>
Footnotes:
* 1: high-end devices may be fast enough to hide performance problems. For context, a Pixel 2 is still relatively high-end in our user base
* 2: <code>debuggable=true</code> builds (e.g. debug builds) have performance characteristics that don't represent what users experience. See https://www.youtube.com/watch?v=ZffMCJdA5Qc&feature=youtu.be&t=625 for details
* 3: if these SDKs are disabled, you may miss performance issues introduced by them, or their absence may change the timing of our operations, possibly hiding performance issues. Note: the performance team would prefer all SDKs to be enabled by default so developers can build an APK similar to production APKs without build errors
* 4: we've observed that the first run after installation is always slower than subsequent runs, for reasons we haven't identified
* 5: on first run, we populate certain caches, e.g. we'll fetch Pocket data and start a Gecko cache. 60 seconds will address most of these
* 6: the ScriptPreloader generates a new cache on each app start up. If you don't let the cache fill (i.e. by loading a page and waiting until it is written ([https://searchfox.org/mozilla-central/rev/fc4d4a8d01b0e50d20c238acbb1739ccab317ebc/js/xpconnect/loader/ScriptPreloader.cpp#769 source])), the cache will be empty and page load won't behave as most users experience it
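
Related to footnote 2: when measuring manually it's easy to grab the wrong APK. Here is a small sketch for double-checking that the build you're measuring is not debuggable; the helper name is hypothetical, not Fenix code.

<syntaxhighlight lang="kotlin">
import android.content.Context
import android.content.pm.ApplicationInfo

// Returns true if the installed build is debuggable; measurements from such builds
// don't represent what users experience (see footnote 2).
fun isDebuggableBuild(context: Context): Boolean =
    (context.applicationInfo.flags and ApplicationInfo.FLAG_DEBUGGABLE) != 0
</syntaxhighlight>

A measurement harness could, for example, log a warning or refuse to run when this returns true.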


== {{anchor|Glossary}} Glossary ==
=== Start up "type" ===
This is an aggregation of all of the variables that make up a start up, described more fully below. Currently, these variables are:
* state
* path
 
For example, a type of start up could be described as <code>cold_main</code>.
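
For illustration only (this is not Fenix's telemetry code; the names are made up), the combination can be modeled like this:

<syntaxhighlight lang="kotlin">
// Illustrative sketch of combining "state" and "path" into a start up "type" label
// such as "cold_main". Not Fenix's actual code.
enum class StartupState { COLD, WARM, HOT }
enum class StartupPath { MAIN, VIEW }

data class StartupType(val state: StartupState, val path: StartupPath) {
    // e.g. StartupType(StartupState.COLD, StartupPath.MAIN).label == "cold_main"
    val label: String
        get() = "${state.name.lowercase()}_${path.name.lowercase()}"
}
</syntaxhighlight>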
 
=== Start up "state": COLD/WARM/HOT ===
"State" refers to how cached the application is, which will impact how quickly it starts up.
"State" refers to how cached the application is, which will impact how quickly it starts up.


Google Play provides a set of definitions and our definitions are similar, but not identical, to them:
* COLD = starting up "from scratch": the process and HomeActivity need to be created
* WARM = the process is already created but HomeActivity needs to be created (or recreated)
* HOT = basically just foregrounding the app: the process and HomeActivity are already created


=== Start up "path": MAIN/VIEW ===
"Path" refers to the code path taken for this start up. We name these after the <code>action</code> carried by the <code>Intent</code>s the app receives ([https://developer.android.com/reference/android/content/Intent#ACTION_MAIN such as <code>ACTION_MAIN</code>]), which tells the app what to do:
* MAIN = a start up where the app icon was clicked. If there are no existing tabs, the homescreen will be shown. If there are existing tabs, the last selected one will be restored
* VIEW = a start up where a link was clicked. In the default case, a new tab will be opened and the URL will be loaded
Caveat: if an <code>Intent</code> is invalid, we may end up on a different screen (and thus take a different code path) than the one specified by the <code>Intent</code>. For example, an invalid VIEW <code>Intent</code> may instead be treated as a MAIN <code>Intent</code>.
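
As an illustration of how the path is determined (this is not Fenix's actual <code>IntentReceiverActivity</code> logic; the class name and routing are hypothetical), the classification comes down to the <code>Intent</code> action:

<syntaxhighlight lang="kotlin">
// Illustrative sketch of classifying the start up "path" from the launching Intent.
// Not Fenix's actual routing code.
import android.app.Activity
import android.content.Intent
import android.os.Bundle
import android.util.Log

class SampleIntentRouterActivity : Activity() {
    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        val path = when {
            // App icon launch: show the homescreen or restore the last selected tab.
            intent?.action == Intent.ACTION_MAIN -> "MAIN"
            // Link click with a URL to load in a new tab.
            intent?.action == Intent.ACTION_VIEW && intent.data != null -> "VIEW"
            // Invalid or unknown Intents fall back to the MAIN path (see the caveat above).
            else -> "MAIN"
        }
        Log.i("StartupPathSketch", "start up path: $path, data: ${intent?.dataString}")
        finish()
    }
}
</syntaxhighlight>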
