Firefox OS/Performance/Profiling: Difference between revisions

Completely replace the very outdated perf(1) section with a summary of the state of bug 831611
No edit summary
(Completely replace the very outdated perf(1) section with a summary of the state of bug 831611)
Line 3: Line 3:
See [https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Boot_to_Gecko_%28with_a_real_device%29 these instructions].  Patches are in-flight to get native stacks in profiles, but that's not in default configurations yet.
See [https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler#Profiling_Boot_to_Gecko_%28with_a_real_device%29 these instructions].  Patches are in-flight to get native stacks in profiles, but that's not in default configurations yet.


= Profiling with perf =
= Profiling with perf(1) =
The perf utility is a performance analysis tools for Linux.


== Setup ==
Work is in progress to make the Linux kernel profiler, called "perf", useful for debugging on B2G.  See [https://bugzilla.mozilla.org/show_bug.cgi?id=831611 bug 831611] for more information; the main issue for getting it landed is obtaining stack traces, which [https://bugzilla.mozilla.org/show_bug.cgi?id=856899 bug 856899] goes into more detail on. Also, currently it requires a Linux build host, but see [[#Experimental MacOS Host Support]] below.
The profiling data is collected at target device, and the report been generated at host side.<br>
You need to install perf tool at host side, and create a directory for kernel and libraries with symbols.


* Install perf at host side for Ubuntu
== Quick Start ==
$ sudo apt-get install linux-tools
$ perf --version
perf version 3.0.17


* Create direcotry for libaries with symbols<br>Here's a B2G makefile helper to create this directory.
# Add git://github.com/jld/B2G.git as a remote and check out the "profiling" branch from it.
$ make perf-create-symfs
# ./config.sh.  Don't set BRANCH here; the default is "profiling-v1", which the v1-train manifests with suitable changes.  You can check out gecko and/or gaia to different versions afterwards.
# "export B2G_PROFILING=1" in .userconfig
# Delete "out" and "objdir-gecko" (or whatever your gecko objdir is named), then ./build.sh.
# ./flash.sh
# Now profile something:
## ./run-perf.sh record-sps
## Do something of interest on the device.
## Hit Enter in the shell window, like the message said to.
# There should have been a line like "Writing profile to perf_20130423_122912.txt".  Go to https://people.mozilla.com/~bgirard/cleopatra/ (or a local clone, if you have one) and feed it that file.


== Real time report ==
== Experimental MacOS Host Support ==
On target device, use perf top to generate and display performance counters in real time.
# perf top -p `pidof b2g`
The output will be like this:
  PerfTop:    388 irqs/sec  kernel:13.1%  exact:  0.0% [1000Hz cycles],  (target_pid: 7852)
-------------------------------------------------------------------------------
              samples  pcnt function                          DSO
              _______ _____ __________________________________ _________________
              403.00 31.8% _downsample_2x2_rgba8888          libGLESv2_mali.so
              119.00  9.4% JaegerStubVeneer                  libxul.so       
                93.00  7.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]
                59.00  4.7% _m200_texture_deinterleave_16x16_b libMali.so     
                56.00  4.4% memcpy                            libc.so         
                40.00  3.2% finish_task_switch                [kernel.kallsyms]
                37.00  2.9% vfprintf                          libc.so         
                23.00  1.8% _gles_fb_tex_sub_image_2d          libGLESv2_mali.so
                16.00  1.3% __sfvwrite                        libc.so         
                16.00  1.3% __do_softirq                      [kernel.kallsyms]
                15.00  1.2% __memzero                          [kernel.kallsyms]
                13.00  1.0% getnstimeofday                    [kernel.kallsyms]
                12.00  0.9% _gles_generate_mipmaps_sw_16x16blo libGLESv2_mali.so
                12.00  0.9% snprintf                          libc.so         
                12.00  0.9% __divsi3                          libmozglue.so   
              10.00  0.8% v7_dma_clean_range                [kernel.kallsyms]


== Recording for a period and generating report ==
As above, but use the "miniperf" branch.  This uses Python code to parse the performance event records directly instead of running the Linux "perf" command and scraping its output, to convert them into Cleopatra/SPS format.  (It also replaces the perf command run on the device with a small C program that implements enough of "perf record" for our purposes and outputs it in a simplified format; the perf.data file format is not well documented, and the perf command is large and difficult to cross-compile.)
Record at target side: (Hit CTRL-C to stop recording)
# perf record -o /data/local/perf.data -p `pidof b2g`


Generate report at host side:
This may also be useful for Linux hosts that don't have the same libraries as the system I built perf on; it has a lot of dependencies, and some of them have compatibility issues between distributions.  The current plan is to make miniperf the default for this.
$ adb pull /data/local/perf.data .
  $ perf report --symfs=/tmp/b2g_symfs_galaxys2 --vmlinux=/vmlinux
The output will be like this:
# Events: 4K cycles
#
# Overhead  Command      Shared Object                                                                                               
# ........  .......  .................  ...............................................................................................
#
      8.00%      b2g  perf-7852.map      [.] 0x438413fc     
      4.46%      b2g  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
      4.36%      b2g  [unknown]          [.] 0x43843500     
      2.61%      b2g  [kernel.kallsyms]  [k] finish_task_switch
      1.69%      b2g  libxul.so          [.] JaegerStubVeneer
      1.20%      b2g  libxul.so          [.] TypedArrayTemplate<float>::obj_getElement(JSContext*, JSObject*, JSObject*, unsigned int, J
      1.06%      b2g  libxul.so          [.] void js::mjit::stubs::SetElem<0>(js::VMFrame&)
      1.05%      b2g  libxul.so          [.] js::mjit::stubs::GetElem(js::VMFrame&)
      1.01%      b2g  libc.so            [.] pthread_mutex_lock
      1.00%      b2g  libc.so            [.] memcpy
      0.90%      b2g  libxul.so          [.] JSObject::nativeLookup(JSContext*, int)
      0.88%      b2g  [kernel.kallsyms]  [k] sub_preempt_count
      0.86%      b2g  libGLESv2_mali.so  [.] 0xa3a0         
      0.82%      b2g  [kernel.kallsyms]  [k] add_preempt_count
      0.80%      b2g  [kernel.kallsyms]  [k] __do_softirq
      0.79%      b2g  libxul.so          [.] js_IsTypedArray(JSObject*)
      0.78%      b2g  libMali.so        [.] 0x13be8       
      0.67%      b2g  libxul.so          [.] js::GetPropertyHelper(JSContext*, JSObject*, int, unsigned int, JS::Value*)
      0.66%      b2g  libxul.so          [.] js::PropertyTable::search(int, bool)
      0.66%      b2g  libxul.so          [.] js_GetProperty(JSContext*, JSObject*, JSObject*, int, JS::Value*)
      0.65%      b2g  libc.so            [.] pthread_mutex_unlock
      0.59%      b2g  libxul.so          [.] castNativeFromWrapper(JSContext*, JSObject*, unsigned int, nsISupports**, JS::Value*, XPCLa
      0.57%      b2g  libmozglue.so      [.] __udivsi3
      0.53%      b2g  libxul.so          [.] mozilla::gl::GLContextEGL::MakeCurrentImpl(bool)
      0.52%      b2g  libxul.so          [.] XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
      0.49%      b2g  libxul.so          [.] js::TypedArray::getTypedArray(JSObject*)
      0.49%      b2g  libxul.so          [.] js::GetPropertyOperation(JSContext*, unsigned char*, JS::Value const&, JS::Value*)
      0.48%      b2g  [kernel.kallsyms]  [k] vector_swi
      0.47%      b2g  [kernel.kallsyms]  [k] get_parent_ip
      0.42%      b2g  libxul.so          [.] DisabledGetElem(js::VMFrame&, js::mjit::ic::GetElementIC*)


== Recording with callgraph ==
== Fine-Tuning ==


Use option '-g' to do callgraph recording:
By default, perf samples based on the CPU's cycle counter, adjusting the period to gather approximately 4000 samples/sec. However, it gathers nothing while the CPU is idle, and currently Cleopatra (the Gecko profiler front-end) ignores these times — it doesn't display them in the timeline, and its "real interval" is an average over both real inter-sample intervals and idle times.
  # perf record -g -o /data/local/perf.data -p `pidof b2g`


Note:
Other timers are available; use the -e flag to select one.  In particular, "-e cpu-clock" uses a real-time interval timer, which gather samples even when the CPU is idle. However, at least on unagi it seems to be restricted to 2500 samples/sec.
# To get correct call graph report, you need to compile libaries with "-fno-omit-frame-pointer".
# On SGS2 device, it's easy to crash when doing perf with callgraph, this is an issue to be fixed.


== System-wide and specific application profiling ==
Note that on the miniperf branch, cpu-clock is the default; use "-e cycles" to use the cycle counter.  Or try something else; run "perf list" on a Linux host, or look at the table at the top of gonk-misc/miniperf/miniperf-record.c on the miniperf branch.  The hardware might not support some of them.


Use option '-a' to do system-wide profiling:
To set the sample rate, use the -F flag to specify a target frequency (samples/sec) or -c to give an absolute number of cycles (note that the kernel will adjust the CPU speed in response to demand).  The "cycle time" for cpu-clock appears to be in nanoseconds regardless of the physical timer used. Note that, at very high rates (empirically, >10 kHz), the CPU may spend enough time gathering samples to noticeably slow down the application being profiled.
  # perf record -o /data/local/perf.data -a


Profiling on specified command:
== Other Commands ==
# perf -o /data/local/perf.data /system/b2g/b2g


Use option '-p' to profile an existing process: (On some devices there's no pidof, and you need to use ps to find out b2g PID)
The original run-perf.sh commands were "record" and "report", corresponding to those perf(1) commands, except that "record" is run on the device (and the perf.data file pulled afterwards, along with kallsyms) and "report" constructs a symlink farm to provide symbol information.  This may be useful to those who are already familiar with perf, but it's not the most obvious interface for new users.
# perf record  -o /data/local/perf.data -p `pidof b2g`


== Makefile helpers for perf ==
The next layer is "./run-perf.sh sps", which converts the perf.data file pulled by "./run-perf.sh record" to the format used by the Gecko profiler (with symbols), as long as it was made with -a (all CPUs); and "./run-perf.sh record-sps", which combines "record -a -g" and "sps".


Here are B2G makefile helpers to generate perf reports at host side.
On the miniperf branch, there are currently "minirecord" (runs the miniperf-record command instead of perf record), and the "sps" subcommand recognizes miniperf files, and record-sps uses minirecord instead.
 
* Create direcotry for libaries with symbols
$ make perf-create-symfs
* Remove directory for libaries with symbols
$ make perf-clean-symfs
* Real time perf report for system wide
$ make perf-top
* Real time report for B2G process
$ make perf-top-b2g
* Summary perf report for system wide
$ make perf-report
* Summary perf report for B2G process
$ make perf-report-b2g
* Change recording duration<br>For perf-report-*, it automatically records for 10 seconds then generate report.  You can change it by giving argument "RECORD_DURATION".<br>Below is an example to record for 30 seconds:
$ make perf-report RECORD_DURATION=30
39

edits