Sfink/Performance Thoughts: Difference between revisions

From MozillaWiki
Revision as of 22:53, 28 June 2010

I'm going to go a little crazy with taxonomies.

What Kind of Performance

  • Latency - how long do you have to wait between the time you initiate an action and the time some detectable response appears?
    • This is all the user really cares about, but it's not always the right thing to look at as a developer, since it is the end result of lots of other things that may be affected by multiple variables.
    • Latency of the complete response is one thing, but in reality some things are going to take some time, and so the latency to a visible progress indicator may be more important. Depends on the situation.
    • Variability of latency can be important too. It's critical when the output itself depends on timing, e.g. when watching an animation. But it also interferes with learning: "I click on this button to make it do this... oh wait, it didn't work, did I forget something? Let me see if I need to -- oh, there it is. Odd, it normally doesn't take that long."
  • Memory usage - why is using memory bad?
    • latency is going to get very bad once you hit a certain "problem size"
    • you can't have as many other things active at the same time
    • the rest of the system gets sluggish as a side effect
    • you'll eventually crash the browser or some other application
  • Storage space - normally a far lesser concern with browsers.
    • But especially with local storage, it could become more relevant.
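
A minimal sketch of measuring both latency and its variability, assuming the action can be wrapped in a callable. The helper names `measure_latencies` and `summarize` are made up for illustration and are not part of any Mozilla tool.

```python
import statistics
import time


def measure_latencies(action, runs=20):
    """Return wall-clock latencies (in seconds) for repeated calls to action()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Summarize typical latency and how much it varies."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile, using integer arithmetic for the ceiling.
    rank = -(-95 * len(ordered) // 100)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[max(0, rank - 1)],
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "worst": ordered[-1],
    }
```

The median says what the user normally sees; the 95th percentile and worst case capture the "odd, it normally doesn't take that long" moments.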

Where Does Time Go?

I/O

I'm using I/O very generally, inclusive of disk, network, and memory.

You can kind of walk up the cache hierarchy, though it isn't really strictly a hierarchy.

At each level, you'll usually have asynchronous and synchronous behavior.

  • synchronous: stuff you have to wait for. Reads are usually synchronous. (Exceptions: readahead, or when you have another thread you can switch to. If you get close to actual devices, DMA can be a form of asynchronous read.)
  • asynchronous: stuff you don't have to wait for. Writes are often asynchronous. (Many more exceptions than with read=synchronous.)
    • Asynchronous requests don't matter, until they do: when too much asynchronous data is outstanding, requests start blocking and become effectively synchronous. Even before that point, asynchronous requests can slow down or block synchronous ones, often because of dependencies between the requests. Those dependencies may or may not be fundamental; they might just be a driver limitation or a simplification in the logic of whatever is handling the resource. (A memory read might unnecessarily block on an earlier write because it's hard to be certain they don't alias the same address. Or a write may trigger a read to fill in the rest of a cache line.)
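
To make the "asynchronous until it isn't" behavior concrete, here is a toy model, not real driver or OS code. The `WriteBuffer` class and its capacity are invented for illustration: writes return immediately while there is room, turn synchronous once the buffer is full, and a read conservatively waits for all pending writes.

```python
class WriteBuffer:
    """Toy write-behind buffer: writes are queued until capacity is reached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = []   # writes accepted but not yet flushed
        self.flushed = []   # writes that reached the (pretend) device

    def flush(self):
        """Push every pending write to the device; this is the slow part."""
        self.flushed.extend(self.pending)
        self.pending.clear()

    def write(self, data):
        """Queue a write. Returns 'async' if the caller did not have to wait,
        or 'sync' if the buffer was full and the caller waited for a flush."""
        if len(self.pending) >= self.capacity:
            self.flush()
            self.pending.append(data)
            return "sync"
        self.pending.append(data)
        return "async"

    def read(self, key):
        """Reads wait for all pending writes first: a conservative dependency,
        since the buffer does not check whether any of them match `key`."""
        waited = bool(self.pending)
        self.flush()
        return waited, [d for d in self.flushed if d == key]


buf = WriteBuffer(capacity=3)
results = [buf.write(i) for i in range(5)]
```

With a capacity of 3, the fourth write is the first one that has to wait, and a later read pays for whatever is still pending.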

Most of these levels also have two types of access, slow and fast. (Often mapping to seeks vs sequential reads.)

All of them have weird exceptions to the simple taxonomy. (Networks may be faster or slower than disk. Disks have memory caches. Cache line aliasing has weird effects. Etc.)

Cache hierarchy, roughly ordered from most expensive to least:

  • Network I/O
    • Latency cost varies widely with the server and phase of moon.
  • Disk I/O
    • Writes can be asynchronous unless you need to explicitly synchronize for durability (in the ACID sense). But the OS will tend to flush them every so often even if it doesn't strictly need to, and those flushes can block reads.
    • Sequential I/O is fast; random I/O is slow. This is somewhat less true with an SSD, but I don't know much about those. The difference is significant enough that it's probably worthwhile to track this with two different metrics: initial block reads and total disk bandwidth.
  • Main RAM. In my myopic view, RAM is RAM. Mobile devices may break this with different types of memory (e.g. volatile and nonvolatile memory can run at different speeds), and so does NUMA (Non-Uniform Memory Access). But if you stick your thumbs in your ears and waggle your fingers, you can ignore that.
  • TLB. Waggle faster. (Not often worth worrying about for non-specialized workloads.)
  • L3 then L2 then L1 caches. Highest (aka fastest aka closest to the CPU) levels are usually split between separate data and instruction stores, not that you generally need to care. Writeback vs write-through is another critically important difference that you can usually ignore.

This is a cache hierarchy, so for the most part the later (faster) layers fall back to the earlier (slower) ones when their capacity is exceeded. And writes often go straight through to a slower layer.
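
As a sketch of the two disk metrics suggested above, the function below splits a trace of block reads into "initial block reads" (reads that don't continue from the previous block, i.e. likely seeks) and total bytes transferred. The trace format, the `disk_metrics` name, and the 4096-byte block size are assumptions for illustration only.

```python
BLOCK_SIZE = 4096  # bytes per block; an assumed value for illustration


def disk_metrics(block_trace):
    """Given block numbers in the order they were read, count the reads that
    start a new run (likely seeks) and the total bytes transferred."""
    initial_reads = 0
    previous = None
    for block in block_trace:
        # A read that does not follow the previous block starts a new run.
        if previous is None or block != previous + 1:
            initial_reads += 1
        previous = block
    return {
        "initial_block_reads": initial_reads,
        "bytes_read": len(block_trace) * BLOCK_SIZE,
    }
```

Two traces can move the same number of bytes and still differ hugely in cost: four consecutive blocks count as one initial read, while four scattered blocks count as four.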

CPU

CPUs are fast. Even on mobile devices, they're pretty fast. CPUs rarely consume large chunks of time just crunching through basic math operations. The time normally disappears into loading and saving data to and from memory, which I'm describing here as I/O.

Except that many measurement tools describe I/O in terms of CPU clock ticks. An L1 cache miss, for example, is normally measured in clock ticks. So you can think of it as CPU time if you like. (Clock ticks map fairly well to actual time, although you may need to adjust for occasional frequency scaling or whatever.)

More importantly, tools distinguish "I/O wait time" from "CPU time". This distinction is mostly real: with I/O wait, your process is scheduled out and not running. CPU time includes time when the CPU is twiddling its thumbs waiting for a cache miss to be resolved, but the CPU isn't going anywhere; it'll resume running your process as soon as the needed data gets loaded. (This holds even with SMT, e.g. hyperthreading, where the core is pulling from multiple hardware threads all the time. When your thread stalls on a cache miss, the core will just pull instructions from the other thread for a while until you can start feeding it again.)
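
The wait-versus-CPU distinction can be seen from inside a process with two standard clocks. This is a small sketch using `time.sleep` as a stand-in for blocking I/O; the helper names `timed`, `wait_like_io`, and `crunch` are made up for the example.

```python
import time


def timed(work):
    """Run work() and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def wait_like_io():
    # Stand-in for blocking I/O: the process is scheduled out while it waits.
    time.sleep(0.1)


def crunch():
    # Pure computation: wall time and CPU time should be close to each other.
    total = 0
    for i in range(200_000):
        total += i * i
    return total
```

Calling `timed(wait_like_io)` should show roughly 0.1 s of wall time and almost no CPU time, while `timed(crunch)` should show the two numbers close together. Time spent stalled on cache misses still counts as CPU time in both clocks.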

Tasks

  • App Startup - see Firefox/Projects/Startup Time Improvements
  • Page Load - see Performance/Pageloader
  • Applications
    • gmail

Mozilla Platform "Cost Centers"

  • JavaScript
  • DOM
  • CSS
  • Layout
  • Garbage Collector
  • Networking
  • Caching
  • Extensions (cross-cutting)

Metrics

Basic metrics for now. Should really be broken down further. Example: look at various metrics within the scope of a call or page or process or...

  • Latency
    • Wallclock time
    • CPU clock cycles for this process only
    • Cache misses
    • Initial disk block reads
    • Disk bandwidth
  • Memory
    • VM Size
    • RSS
    • Private memory
    • Garbage collector-specific
      • Number, duration of pauses
      • Garbage generated
      • Meat (non-garbage) generated
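
As a rough illustration of collecting several of these metrics within the scope of one block of code, here is a sketch using only the Python standard library. It measures the Python process it runs in, not Firefox; the `measure` helper is hypothetical, the `resource` module is Unix-only, and the collector being observed is CPython's, used purely as an analogy.

```python
import gc
import resource  # Unix-only standard library module
import time
from contextlib import contextmanager


@contextmanager
def measure(report):
    """Fill the dict `report` with coarse metrics for the enclosed block."""
    pauses = []   # duration of each collector run, in seconds
    started = []  # start timestamps of collections still in progress

    def on_gc(phase, info):
        if phase == "start":
            started.append(time.perf_counter())
        elif phase == "stop" and started:
            pauses.append(time.perf_counter() - started.pop())

    gc.callbacks.append(on_gc)
    faults_before = resource.getrusage(resource.RUSAGE_SELF).ru_majflt
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        yield report
    finally:
        gc.callbacks.remove(on_gc)
        usage = resource.getrusage(resource.RUSAGE_SELF)
        report["wall_seconds"] = time.perf_counter() - wall_start
        report["cpu_seconds"] = time.process_time() - cpu_start
        # Peak RSS of the whole process so far, not just this block.
        # Reported in kilobytes on Linux and in bytes on macOS.
        report["peak_rss"] = usage.ru_maxrss
        # Page faults that needed disk I/O: a rough proxy for disk reads.
        report["major_faults"] = usage.ru_majflt - faults_before
        report["gc_pauses"] = len(pauses)
        report["gc_pause_seconds"] = sum(pauses)


report = {}
with measure(report):
    junk = [[i] for i in range(10_000)]
    del junk
    gc.collect()  # force at least one collection so a pause is recorded
```

Scoping the measurement with a context manager is one way to get the "within the scope of a call or page or process" breakdown; the same idea applies to any region that can be bracketed with start and stop hooks.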

Tools

Mozilla has a ton of great tools already available for analyzing performance, including many I haven't uncovered yet. I'll try to enumerate the ones that I have uncovered so far, even though I'll just be duplicating Performance:Tools. But that page is oddly organized and out of date, so I'm just going to pile things here until I get a better handle on all of this, then I'll move it all over to that page instead.

Audiences

  • Mozilla platform developers
  • Mozilla application developers
  • Plugin authors
  • Add-on authors

The audience determines whether a tool can be enabled via a conditional compilation directive, or whether it needs to be available in a release build. It also gives guidance as to whether a platform-specific tool is acceptable (eg dtrace).

Tool Ideas

First, some random tool ideas, so they don't get lost. Some of these were proposed by other people. Some may already exist, and will be removed as I discover them.

  • Mozilla-independent external tool that breaks down the cost of performing some task into the levels of the caching hierarchy (with CPU time included.)
    • This would allow comparing a standard operation across different browsers and seeing where Firefox uses more of some resource, as a guide to optimization. Lots more cache misses => look at cache-oblivious data structures. Lots more seeks => figure out a way to reorder accesses.
  • Mark up a user's JavaScript to show what percentage of each line (statement? minification is a pain) in a particular run was "on trace" with TraceMonkey.
    • This would allow the user to figure out what code works well with the trace engine and what doesn't.
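
The cross-browser comparison idea could look something like the sketch below, assuming some external tool has already produced per-resource counts for the same operation in two browsers. The `compare_profiles` function, the resource names, and all the numbers are invented placeholders, not measurements of anything.

```python
def compare_profiles(baseline, candidate, threshold=1.5):
    """Return (resource, ratio) pairs where `candidate` uses at least
    `threshold` times as much as `baseline`, largest ratio first."""
    flagged = []
    for name, base_value in baseline.items():
        value = candidate.get(name, 0)
        if base_value > 0 and value / base_value >= threshold:
            flagged.append((name, value / base_value))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up counts purely for illustration; not measurements of any browser.
other_browser = {"cpu_ms": 100, "cache_misses": 1000, "disk_seeks": 10, "net_requests": 5}
firefox = {"cpu_ms": 110, "cache_misses": 3000, "disk_seeks": 40, "net_requests": 5}

hotspots = compare_profiles(other_browser, firefox)
```

With these placeholder numbers the result flags disk seeks first and cache misses second, which maps onto the "reorder accesses" and "cache-oblivious data structures" suggestions respectively.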

Current Tools at Mozilla

Latency

I'm breaking these down by platform, in hopes that someone looking for a tool will find this more useful and less overwhelming.

Platform-independent:

  • JS Runtime Profiler (http://www.mozilla.org/performance/jsprofiler.html). Profiles JavaScript usage and dumps the output to a file.
  • TraceVis
  • Perfomatic Mozilla graph server
  • Buildbot/Talos - really nice setup where various performance tests are run automatically and compared against previous builds. Results are posted to the mozilla.dev.tree-management newsgroup (news://news.mozilla.org/mozilla.dev.tree-management).
  • Dromaeo - in-browser JavaScript tests
  • mochitests - I need to look at this still

Platform-specific:

Windows

Mac

  • Shark
  • The Metrowerks Profiler (on subsets of the codebase), or Apple's Instrumentation SDK. For details, see the Mac performance page.

Linux

  • oprofile
  • sysprof
  • jprof. Sampling profiling tool by Jim Nance that runs on 32-bit Linux.
  • eazel profilers. Two profiling tools from Eazel/gnome.org.

Other

  • hiprof. A hierarchical instruction profiler for Digital Unix. There is an online tutorial, courtesy Jim Nance (jim_nance@yahoo.com):
    • "I like it because it does not require you to recompile the program, because it give accurate times for callers, and because it comes with DU so that everyone developing there should already have it." *

Multiple

  • callgrind (part of valgrind)
  • Rational PurifyPlus. (Commercial) Used to be called Rational Quantify.
    • qfy2html.pl. A Perl script for munging Quantify 6.0 output into hyperlinked HTML to share with your friends. Run Quantify, save your results as text, and then let this thing rip.
    • Quantifying Mozilla on Solaris. Instructions on how to make Quantify play nicely with Mozilla on Solaris.
  • gprof. Part of the GNU Binutils. Back in May 1999 there were rumors that gprof had trouble with dynamically linked and loaded objects (like Mozilla). As of March 2006 gprof is still being developed, so hopefully that's no longer true (assuming it ever was). The gprof manual on gnu.org is *years* out of date but, at the time of writing, it's the latest version available there. To get an up-to-date version of the manual, get the gprof source and run |./configure; make html| in the gprof source directory.

Memory Usage

See Performance:Leak Tools, which is already broken down better than this page is.