Sfink/Performance Thoughts: Difference between revisions

no edit summary
(general rambling. Intermediate checkin.)
 
No edit summary
Line 13: Line 13:
** you'll eventually crash the browser or some other application
* Storage space - normally a far lesser concern with browsers.
** But especially with local storage, it could become more relevant.


== Where Does Time Go? ==
Line 26: Line 27:
* synchronous: stuff you have to wait for. Reads are usually synchronous. (Exceptions: readahead, or when you have another thread you can switch to. If you get close to actual devices, DMA can be a form of asynchronous read.)
* asynchronous: stuff you don't have to wait for. Writes are often asynchronous. (Many more exceptions than with read=synchronous.)
** Asynchronous requests don't matter, until they do: when too much asynchronous data is outstanding, requests start blocking and become synchronous. Asynchronous requests can also slow down or block synchronous ones before that point. Often this is because of dependencies between the requests. Those dependencies may or may not be fundamental -- they might just be a driver limitation or a simplification in the logic of whatever is handling the resource. (A memory read might unnecessarily block on an earlier write because it's hard to be certain they don't refer to the same address. Or a write may trigger a read to fill in the rest of a cache line.)
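
The "asynchronous until it isn't" effect can be sketched with a toy model. Everything here is invented for illustration: the function name, the tick-based clock, and the assumptions that the writer issues at most one write per tick into a FIFO buffer and that a single device drains it one write at a time.

```python
def simulate_writes(num_writes, buffer_capacity, device_ticks_per_write):
    """Return how many ticks each write stalled before the buffer accepted it.

    Toy model (an assumption, not any real driver): writes complete in FIFO
    order, so write i can only be accepted once write i - buffer_capacity
    has left the buffer.
    """
    stalls = []
    completion_times = []  # tick at which each write leaves the buffer
    now = 0                # tick at which the writer wants to issue the next write
    device_free_at = 0     # tick at which the device can start another write
    for i in range(num_writes):
        accept_at = now
        if i >= buffer_capacity:
            # Buffer is full until the oldest outstanding write completes.
            accept_at = max(now, completion_times[i - buffer_capacity])
        stalls.append(accept_at - now)
        start = max(accept_at, device_free_at)
        device_free_at = start + device_ticks_per_write
        completion_times.append(device_free_at)
        now = accept_at + 1  # issuing a write costs the writer one tick
    return stalls
```

With a 3-entry buffer and a device that needs 4 ticks per write, `simulate_writes(6, 3, 4)` returns `[0, 0, 0, 1, 3, 3]`: the first three writes are free, and after that every write waits on the device, which is to say it has become synchronous.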


Most of these levels also have two types, slow and fast. (Often mapping to seeks vs sequential reads.)


All of them have weird exceptions to the simple taxonomy. (Networks may be faster or slower than disk. Disks have memory caches. Cache line aliasing has weird effects. Etc.)


Cache hierarchy, roughly ordered from most expensive to least:
Line 39: Line 40:
** Writes can be asynchronous unless you need to explicitly synchronize for durability (in the ACID sense). But the OS will tend to flush them every so often even if it doesn't strictly need to, and those flushes can block reads.
** Sequential I/O is fast, Random I/O is slow. Somewhat less true with an SSD, but I don't know much about those. The difference is significant enough that it's probably worthwhile to track this with two different metrics: initial block reads and total disk bandwidth.
* Main RAM. In my myopic view, RAM is RAM. Mobile devices may break this with different types of memory (e.g. volatile vs nonvolatile memory can run at different speeds), and so can NUMA (Non-Uniform Memory Access) machines. But if you stick your thumbs in your ears and waggle your fingers, you can ignore that.
* TLB. Waggle faster. (Not often worth worrying about for non-specialized workloads.)
* TLB. Waggle faster. (Not often worth worrying about for non-specialized workloads.)
* L3 then L2 then L1 caches. The levels closest to the CPU (the fastest ones, i.e. L1) are usually split into separate data and instruction caches, not that you generally need to care. Write-back vs write-through is another critically important difference that you can usually ignore.
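
The two disk metrics suggested in the list above (initial block reads and total bandwidth) could be computed from an access trace roughly like this. It is a minimal sketch: the trace format, a list of `(offset, length)` pairs in bytes, and the function name are assumptions made up for the example, and a real disk would complicate the picture with its own caches and readahead.

```python
def summarize_accesses(accesses):
    """Return (seeks, total_bytes) for a list of (offset, length) accesses.

    An access counts as a seek (an "initial block read") whenever it does
    not start exactly where the previous access ended.
    """
    seeks = 0
    total_bytes = 0
    next_offset = None  # where a purely sequential access would start
    for offset, length in accesses:
        if offset != next_offset:
            seeks += 1
        total_bytes += length
        next_offset = offset + length
    return seeks, total_bytes


sequential = [(0, 4096), (4096, 4096), (8192, 4096)]
shuffled = [(8192, 4096), (0, 4096), (4096, 4096)]
print(summarize_accesses(sequential))  # (1, 12288)
print(summarize_accesses(shuffled))    # (2, 12288)
```

Both traces move the same number of bytes, but the shuffled one pays for an extra seek, which is why a single bandwidth number hides the difference.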


This is a cache hierarchy, so for the most part the later layers will fall back to the earlier when their capacity is exceeded. And writes often go straight through to a slower layer.
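
To get a feel for how much the slow layers dominate, here is a back-of-the-envelope cost model. It is only a sketch: the hit rates and per-level costs are made-up numbers, the function name is hypothetical, and it assumes a read tries each level in turn (fastest first, the reverse of the ordering in the list above) and pays that level's cost before falling through on a miss.

```python
def expected_access_cost(levels):
    """Average cost of one read through a simple cache hierarchy.

    levels: list of (name, hit_rate, cost), ordered fastest first. The last
    level should have hit_rate 1.0 so that every read is eventually served.
    """
    reach_probability = 1.0  # chance that a read gets as far as this level
    total = 0.0
    for _name, hit_rate, cost in levels:
        total += reach_probability * cost
        reach_probability *= 1.0 - hit_rate
    return total


# Invented numbers, in arbitrary "ticks".
hierarchy = [("L1", 0.9, 1), ("L2", 0.5, 10), ("RAM", 1.0, 100)]
print(expected_access_cost(hierarchy))  # roughly 7
```

Even with only 5% of reads reaching RAM in this example, RAM accounts for about 5 of the roughly 7 ticks per read.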
Line 47: Line 48:
=== CPU ===


CPUs are fast. Even on mobile devices, they're pretty fast. CPUs rarely consume large chunks of time just crunching through basic math operations. The time normally disappears into loading and saving data to and from memory, which I'm describing here as I/O.


Except that many measurement tools describe I/O in terms of CPU clock ticks. An L1 cache miss, for example, is normally measured in clock ticks. So you can think of it as CPU time if you like. (Clock ticks map fairly well to actual time, although you may need to adjust for occasional frequency scaling or whatever.)


More importantly, tools have a distinction of "I/O wait time" vs "CPU time". This distinction is mostly real: with I/O wait, your process is scheduled out and not running. CPU time includes time when the CPU is twiddling its thumbs waiting for a cache miss to be resolved, but the CPU isn't going anywhere; it'll keep running your process immediately after the needed data gets loaded in. (This holds even with SMT, e.g. hyperthreading, where the core is pulling instructions from multiple hardware threads all the time. When your process blocks on a cache miss, the core will just pull instructions from the other thread for a while until you can start feeding it again.)
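
The wait-time vs CPU-time split is easy to see from inside a process. The sketch below uses Python's standard `time.perf_counter()` (wall clock) and `time.process_time()` (CPU time charged to the process); the `measure` helper is a name invented here, and `time.sleep` is standing in for a blocking I/O call, which is an assumption of the example rather than a real disk read.

```python
import time


def measure(fn):
    """Run fn() and return (wall_seconds, cpu_seconds) for the call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds


# Blocked and scheduled out: wall time passes, almost no CPU time is charged.
print(measure(lambda: time.sleep(0.2)))

# Busy computing (including any cache-miss stalls): CPU time tracks wall time.
print(measure(lambda: sum(i * i for i in range(200000))))
```

Note that the second measurement cannot tell you how much of the CPU time was spent stalled on cache misses; for that you need hardware counters or one of the profilers listed under Tools.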
 
== Tasks ==
 
== Mozilla Platform "Cost Centers" ==
 
* JavaScript
* DOM
* Layout
* Networking
* Caching
 
== Tools ==
 
Mozilla has a ton of great tools already available for analyzing performance, including many I haven't uncovered yet. I'll try to enumerate the ones that I have uncovered so far, even though I'll just be duplicating [[Performance:Tools]]. But that page is oddly organized and out of date, so I'm just going to pile things here until I get a better handle on all of this, then I'll move it all over to that page instead.
 
== Audiences ==
 
* Mozilla platform developers
* Mozilla application developers
* Plugin authors
* Add-on authors
 
The audience determines whether a tool can be enabled via a conditional compilation directive, or whether it needs to be available in a release build. It also gives guidance as to whether a platform-specific tool is acceptable (eg dtrace).
 
=== Tool Ideas ===
 
First, some random tool ideas, so they don't get lost. Some of these were proposed by other people. Some may already exist, and will be removed as I discover them.
 
* Mozilla-independent external tool that breaks down the cost of performing some task into the levels of the caching hierarchy (with CPU time included.)
** This would allow comparing a standard operation across different browsers and seeing where Firefox uses more of some resource, as a guide to optimization. Lots more cache misses => look at cache-oblivious data structures. Lots more seeks => figure out a way to reorder accesses.
* Mark up a user's JavaScript to show what percentage of each line (statement? minification is a pain) in a particular run was "on trace" with TraceMonkey.
** This would allow the user to figure out what code works well with the trace engine and what doesn't.
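
For the first idea, the comparison step might look something like the sketch below. No such tool exists yet as far as this page knows, so everything is hypothetical: the function name, the resource names, and the numbers are all invented, and the per-resource counts are assumed to have been collected already by some external profiler.

```python
def compare_profiles(ours, theirs):
    """Rank resources by how much more of each we use than the other browser.

    ours, theirs: dicts mapping a resource name to its measured cost for the
    same task. Returns (resource, ratio) pairs, worst relative excess first.
    """
    ratios = []
    for resource, our_cost in ours.items():
        their_cost = theirs.get(resource)
        if not their_cost:
            continue  # missing or zero baseline, so no meaningful ratio
        ratios.append((resource, our_cost / their_cost))
    ratios.sort(key=lambda pair: pair[1], reverse=True)
    return ratios


# Made-up measurements for one task in two browsers.
firefox = {"l2_misses": 300, "seeks": 40, "cpu_ticks": 1000}
other = {"l2_misses": 100, "seeks": 40, "cpu_ticks": 2000}
print(compare_profiles(firefox, other))
# [('l2_misses', 3.0), ('seeks', 1.0), ('cpu_ticks', 0.5)]
```

In this invented case the cache misses stand out, which would point toward data structure layout rather than I/O ordering.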
 
=== Current Tools at Mozilla ===
 
==== Latency ====
 
Platform-independent:
 
* [http://www.mozilla.org/performance/measureStartup.html MOZ_TIMELINE] Timing instrumentation. Extremely useful for app startup measurements.
* [http://www.mozilla.org/performance/jsprofiler.html JS Runtime Profiler]. Profile JavaScript usage, dumps output to a file.
* TraceVis
 
Platform-specific:
 
''Windows''
* [https://developer.mozilla.org/Profiling_with_AMD_CodeAnalyst AMD CodeAnalyst]
* Xperf
* vtune
* [http://www.sysinternals.com/ Sysinternals utilities]. Win9x/Me/NT/2K utilities for monitoring system usage. CPU/Mon, Diskmon, Filemon, Regmon, etc.
 
''Mac''
* [https://developer.mozilla.org/en/Profiling_JavaScript_with_Shark Shark]
* The Metrowerks Profiler (on subsets of the codebase), or Apple's Instrumentation SDK. For details, see the [http://www.mozilla.org/performance/mac-performance.html Mac performance page].
 
''Linux''
* oprofile
* sysprof
* [http://www.mozilla.org/performance/jprof.html jprof]. Sampling profiling tool by [mailto:jim_nance@yahoo.com Jim Nance] that runs on 32-bit Linux.
* [http://www.mozilla.org/performance/eazel.html eazel profilers]. Two profiling tools from Eazel/gnome.org.
 
''Other''
* [http://alisa.ucsd.edu/cgi/webman?SEARCH+man5+hiprof.5.gz hiprof]. A hierarchical instruction profiler for Digital Unix. There is an [http://www.mozilla.org/performance/atom/atom.html online tutorial], courtesy [mailto:jim_nance@yahoo.com Jim Nance] (jim_nance@yahoo.com):
** "I like it because it does not require you to recompile the program, because it give accurate times for callers, and because it comes with DU so that everyone developing there should already have it." [news://news.mozilla.org/slrn7jop6f.jlq.jlnance%40Bessie.acs.uncwil.edu *]
 
''Multiple''
* [http://valgrind.org/docs/manual/cl-manual.html callgrind] (part of valgrind)
* [http://www.ibm.com/software/awdtools/purifyplus/ Rational PurifyPlus]. (Commercial) Used to be called Rational Quantify.
** [http://www.mozilla.org/performance/qfy2html.pl qfy2html.pl]. A Perl script for munging Quantify 6.0 output into hyperlinked HTML to share with your friends. Run Quantify, save your results as text, and then let this thing rip.
** [http://www.mozilla.org/unix/quantify.html Quantifying Mozilla on Solaris]. Instructions on how to make Quantify play nicely with Mozilla on Solaris.
* [http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html gprof]. Part of the [http://www.gnu.org/software/binutils/ GNU Binutils]. Back in May 1999 there were rumors that gprof had trouble with dynamically linked and loaded objects (like Mozilla). Currently (March 2006) gprof is still being developed, so hopefully that's no longer true (assuming it ever was). The link to the gprof manual above is *years* out of date but, at the time of writing, it's the latest version of the [http://www.gnu.org/manual/ manual on gnu.org]. To get an up-to-date version of the manual, get the gprof source and run |./configure; make html| in the gprof source directory.
 
==== Memory Usage ====
 
See [[Performance:Leak Tools]], which is already broken down better than this page is.