Performance/Fenix/Performance reviews

Do you want to know if your change impacts Fenix or Focus performance? If so, here are the methods you can use, in order of preference:

  1. Benchmark in CI: not yet available; once it is, it will be the preferred method because it's the most consistent
  2. Benchmark locally: use an automated test to measure the change in duration
  3. Timestamp benchmark: add temporary code and manually measure the change in duration
  4. Profile: use a profile to measure the change in duration

The trade-offs for each technique are mentioned in their respective section.

Benchmark locally

A benchmark is an automated test that measures performance, usually the duration from point A to point B. Automated benchmarks have trade-offs similar to automated functionality tests when compared to one-off manual testing: they can continuously catch regressions and they minimize human error (with manual benchmarks in particular, it can be tricky to aggregate each test run into the results consistently). However, automated benchmarks are time-consuming and difficult to write, so sometimes it's better to perform manual tests.

To benchmark, do the following:

  1. Select a benchmark that measures your change or write a new one yourself
  2. Run the benchmark on the commit before your change
  3. Run the benchmark on the commit after your change
  4. Compare the results: generally, this means comparing the median (see the sketch below)
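
If you want to sanity-check a comparison by hand, a quick way is to compute the median of each run's results yourself. The Kotlin sketch below is only an illustration, not part of perf-tools, and it assumes each results file holds one duration in milliseconds per line (the results files produced by the tools described below are normally analyzed with analyze_durations.py instead):

import java.io.File

// Hypothetical helper: reads one duration (in ms) per line and returns the median.
fun medianOfDurations(path: String): Double {
    val durations = File(path).readLines()
        .mapNotNull { it.trim().toLongOrNull() }
        .sorted()
    require(durations.isNotEmpty()) { "no durations found in $path" }
    val mid = durations.size / 2
    return if (durations.size % 2 == 0) {
        (durations[mid - 1] + durations[mid]) / 2.0
    } else {
        durations[mid].toDouble()
    }
}

fun main() {
    // File names are placeholders for the "before" and "after" runs.
    val before = medianOfDurations("results_before.txt")
    val after = medianOfDurations("results_after.txt")
    println("median before: $before ms, after: $after ms, delta: ${after - before} ms")
}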

We currently support the following benchmarks:

Measuring cold start up duration

To measure the cold start up duration, the approach is usually simple:

  1. From the mozilla-mobile/perf-tools repository, use measure_start_up.py.
    The arguments for start-up should include your target (Fenix or Focus).
  2. Determine the start-up path that your code affects; this could be:
    1. cold_main_first_frame: when clicking the app's homescreen icon, this is the duration from process start until the first frame drawn
    2. cold_view_nav_start: when opening the browser through an outside link (e.g. a link in gmail), this is the duration from process start until roughly Gecko's Navigation::Start event
  3. After determining which path your changes affect, follow the steps below:

Example:

  • Run measure_start_up.py located in perf-tools. Note:
    • The usual iteration count used is 25. Running fewer iterations might affect the results due to noise
    • Make sure the application you're testing is a fresh install. If testing the Main intent (which opens the browser on its homepage), make sure to clear the onboarding process before testing
 python3 measure_start_up.py -c=25 --product=fenix nightly cold_view_nav_start results.txt

where -c refers to the iteration count. The default of 25 should be good.

  • Once you have gathered your results, you can analyze them using analyze_durations.py in perf-tools.
  python3 analyze_durations.py results.txt


NOTE: To compare before and after your changes to Fenix, repeat these steps for the code before the changes. To do so, you can check out the parent commit (i.e. using git rev-parse ${SHA}^ where ${SHA} is the first commit on the branch containing the changes).

An example of using these steps to review a PR can be found (here).

Testing non start-up changes

Testing for non-start-up changes is a bit different from the steps above since the performance team doesn't currently have tools to test other parts of the browser.

  1. The first step here would be to instrument the code to take (manual timings). Comparing timings taken before and after your changes can indicate whether performance changed.
  2. Using profiles and markers.
    1. (Profiles) can be a good visual representation of performance changes. A simple way to find your code and its changes is through the call tree, the flame graph, or the stack chart. NOTE: some code may be missing from the stack because ProGuard may inline it, or because the code runs in less time than the profiler's sampling interval.
    2. Another useful tool for finding changes in performance is markers. Markers are good for showing the time elapsed between point A and point B or for pinpointing when a certain action happens (see the sketch after this list).
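
As an illustration of adding a marker, here is a minimal Kotlin sketch. It assumes the Profiler interface from android-components (mozilla.components.concept.base.profiler.Profiler) with its getProfilerTime() and addMarker() methods; how you obtain the Profiler instance depends on the app's component wiring, and the function name here is made up:

import mozilla.components.concept.base.profiler.Profiler

// Sketch only: wraps a code path in a single profiler marker.
fun measureWithMarker(profiler: Profiler?, thingWeWantToMeasure: () -> Unit) {
    val startTime = profiler?.getProfilerTime() // null when the profiler isn't running
    thingWeWantToMeasure()
    val endTime = profiler?.getProfilerTime()
    // The marker spans startTime..endTime on the Firefox Profiler timeline.
    profiler?.addMarker("measureWithMarker", startTime, endTime, "optional detail text")
}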

Timestamp benchmark

A timestamp benchmark is a manual test where a developer adds temporary code to log the duration they want to measure and then performs the use case on the device themselves to get the values printed. Here's an example of a simple use case:

val start = SystemClock.elapsedRealtime()
thingWeWantToMeasure()
val end = SystemClock.elapsedRealtime()
Log.e("benchmark", "${end - start}") // result is in milliseconds
// Note: elapsedRealtime() is preferred over currentTimeMillis() because it's monotonic (it isn't affected by wall clock changes)

Like automated benchmarks, these tests can accurately measure what users experience. They are fairly quick to write, but they are tedious and time-consuming to carry out and leave many places to introduce errors.

Here's an outline of a typical timestamp benchmark:

  1. Decide the duration you want to measure
  2. Do the following once for the commit before your changes and once for the commit after your changes...
    1. Add code to measure the duration (see the sketch after this list)
    2. Build & install a release build like Nightly or Beta (debug builds have unrepresentative perf)
    3. Do a "warm up" run first: the first run will always be slower because the JIT cache isn't primed, so run it once and ignore the result, i.e. run your test case, wait a few seconds, force-stop the app, clear logcat, and then begin testing & measuring
    4. Run the use case several times (maybe 10 times if it's quick, 5 if it's slow). You probably want to measure "cold" performance: we assume users will generally only perform a use case a few times per process lifetime. However, the more times a code path is run during the process lifetime, the more likely it'll execute faster because it's cached. Thus, if we want to measure a use case in a way that is similar to what users experience, we must measure the first time an interaction occurs during the process. In practice this means after you execute your use case once, force-stop the app before executing it again
    5. Capture the results from logcat
  3. Compare the results, generally by comparing the median of the two runs
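
For steps 2.1 and 2.5, a small helper can keep the added code and the logcat output consistent across runs. The sketch below is hypothetical (the helper is not part of Fenix) and only suits cases where the start and end live in the same function, unlike the home screen example that follows:

import android.os.SystemClock
import android.util.Log

// Hypothetical helper: measures one synchronous code path and logs one line per run.
inline fun <T> logDuration(label: String, block: () -> T): T {
    val start = SystemClock.elapsedRealtime()
    val result = block()
    val end = SystemClock.elapsedRealtime()
    Log.e("benchmark", "$label: ${end - start} ms") // result is in milliseconds
    return result
}

// Usage (decodeIcon() is a placeholder): val icon = logDuration("decode icon") { decodeIcon() }

The logged values can then be collected from logcat and aggregated the same way, e.g. by taking the median.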

Example: time to display the home screen

For step 1) in the outline, we want to measure the time it takes to fully display the home screen. We always need to be very specific: we'll want the duration from hitting the home button on an open tab until the homescreen is visually complete.

For step 2.1), we'll first add code to capture the timestamp when the home button is pressed. To get the duration closest to what the user experiences, we need to record the time when the touch event is initially received: HomeActivity.dispatchTouchEvent.

object TimestampBenchmark {
    // Temporary, benchmark-only state: holds the timestamp recorded in dispatchTouchEvent.
    var start = -1L
}

class HomeActivity(...) {
    ...
    override fun dispatchTouchEvent(ev: MotionEvent?): Boolean {
        // Record the time of every touch event: as long as we don't touch the screen
        // again, this holds the timestamp of the home button press.
        TimestampBenchmark.start = SystemClock.elapsedRealtime()
        return super.dispatchTouchEvent(ev)
    }
}

When running the test, we need to be careful not to touch the screen after pressing the home button because that would overwrite this value and give us the wrong measurement. We could avoid this problem by recording the timestamp in the home button's click listener, but that may leave out a non-trivial duration: for example, what if the touch event was handled asynchronously and got blocked before being dispatched to the home button? Furthermore, the dispatchTouchEvent approach may be simpler: the home button's click listener may live in android-components, which would require us to build that project as well.

TODO...

Profile

TODO