Performance/MemShrink/DMD

From MozillaWiki

DMD (short for "dark matter detector") is a tool that tracks which heap blocks have been reported by memory reporters. It's designed to help us reduce the "heap-unclassified" value in Firefox's about:memory page. It also detects if any heap blocks are reported twice.

Build

Everything other than B2G-device builds

If you're not building for a B2G device, add this line to the mozconfig file of your choice:

 ac_add_options --enable-dmd

Build with that mozconfig. Optimized builds should work fine.

B2G device builds

First, update your B2G checkout with git pull or git fetch && git merge origin/master. ./repo sync is not sufficient! You must git pull to get the latest version of the relevant tools.

For B2G device builds, we don't usually modify the mozconfig (although you can; it's hiding under gonk-misc/default-gecko-config).

Instead, modify your .userconfig and add

  export MOZ_DMD=1

(don't forget the export).

You probably need to clobber your objdir (rm -rf objdir-gecko). Then build normally.

Run

Desktop

To run DMD on a desktop build (including Firefox and B2G desktop), you need to precede your usual invocation of Firefox with three environment variable definitions.

On Linux, do this:

 LD_PRELOAD=$OBJDIR/dist/lib/libdmd.so \
 LD_LIBRARY_PATH=$OBJDIR/dist/lib/ \
 DMD=1 \
 <command>

On Mac OS X, do this:

 DYLD_INSERT_LIBRARIES=$OBJDIR/dist/lib/libdmd.dylib \
 LD_LIBRARY_PATH=$OBJDIR/dist/lib/ \
 DMD=1 \
 <command>

On Windows, do this:

 [Doesn't currently work on Windows, sorry.  See bug 819839.]

On start-up, you'll see some commentary on stderr, such as:

 DMD[20523] $DMD = '1'
 DMD[20523] DMD is enabled

The number in brackets is the process ID.

The browser will run a little slower than usual.

Fennec

To run DMD on Fennec, run the following commands (be sure to replace "org.mozilla.fennec" with the app identifier as appropriate; this will usually be org.mozilla.fennec_$USERNAME for a local build).

 adb push $OBJDIR/dist/lib/libdmd.so /sdcard/
 adb shell am start -n org.mozilla.fennec/.App --es \
     env0 MOZ_REPLACE_MALLOC_LIB=/sdcard/libdmd.so \
     --es env1 DMD=1

The commentary on Fennec goes to logcat, and looks like this:

 I/DMD  (27314): $DMD = '1'
 I/DMD  (27314): DMD is enabled

The number in the parentheses is the process ID.

B2G for devices

If you built B2G with export MOZ_DMD=1, your build will automagically run with DMD enabled. (The b2g.sh script figures out whether to enable DMD by checking for the presence of libdmd.so in /system/b2g.)

If DMD is enabled, you'll see a message in logcat when a process starts up:

  I/DMD     (  305): $DMD = '1'
  I/DMD     (  305): DMD is enabled

The run-gdb.sh script also knows to start DMD builds with DMD enabled, so you don't need to do anything special.

Analyze

DMD doesn't do anything notable until you ask it to.

Desktop

To trigger DMD's analysis phase, enter this URI:

 javascript:DMDReportAndDump(<filename>)

where <filename> is the name of the file that output will be written to.

Note that Firefox doesn't let you do this directly in the address bar, so you must either (a) create a bookmark for it, or (b) run it from the error console. The latter is discouraged because opening the error console allocates a lot of memory that will skew your results.

This command tells DMD to do three things.

  • Run all the memory reporters. While this happens, DMD builds up some state recording which heap blocks have and have not been reported.
  • Analyze the state and produce output.
  • Clear the state.

Because of the third step, you can run DMDReportAndDump multiple times in a single session.

However, the memory reporters must run only once between invocations of DMDReportAndDump; otherwise the state will be inaccurate. For example, if you view about:memory (which runs the memory reporters) and then invoke DMDReportAndDump (which runs them again before analyzing), you'll get many reports about blocks being reported twice. Therefore, there is a second way to invoke DMD, which doesn't run the reporters:

 javascript:DMDDump(<filename>)

When either command is invoked, you'll see some more commentary:

 DMD[20600] Dump 1 {
 DMD[20600]   gathering live block groups...
 DMD[20600]   creating and sorting double-reported block group array...
 DMD[20600]   creating and sorting unreported block group array...
 DMD[20600]   printing unreported block group array...
 DMD[20600]   creating and sorting unreported frame group array...
 DMD[20600]   printing unreported frame group array...
 DMD[20600]   creating and sorting reported block group array...
 DMD[20600]   printing reported block group array...
 DMD[20600]   creating and sorting reported frame group array...
 DMD[20600]   printing reported frame group array...
 DMD[20600] }

On Linux this step is very fast. On Mac it can take 30+ seconds (and it was 5x slower before bug 819817 landed).

Once that's finished printing you can look in <filename> for the output.

Fennec

On Fennec you can use the existing memory-report dumping hook to get a DMD report as well, assuming you have a DMD-enabled build. Run the following command:

 adb shell am broadcast -a org.mozilla.gecko.MEMORY_DUMP

In logcat, you should see output similar to this:

 E/GeckoConsole (27314): nsIMemoryInfoDumper dumped reports to /data/data/org.mozilla.fennec_kats/app_tmp/memory-report-default-27314.json.gz

The directory in that path (it should always be /data/data/$APPID/app_tmp/) is where the memory reports and DMD reports get dumped. You can pull them like so:

 adb pull /data/data/org.mozilla.fennec_kats/app_tmp/memory-report-default-27314.json.gz
 adb pull /data/data/org.mozilla.fennec_kats/app_tmp/dmd-default-27314.txt.gz
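
The pulled files are gzip-compressed. As a minimal sketch, assuming the DMD dump is plain text once decompressed, it can be read from Python with the standard gzip module. The helper name read_dump is hypothetical and is not part of any Mozilla tool.

```python
import gzip


def read_dump(path):
    """Return the decompressed text of a gzip-compressed dump file.

    Hypothetical helper for illustration; undecodable bytes are replaced
    rather than raising, since stack traces may contain odd characters.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        return dump.read()


# Example usage (the file name follows the pattern shown above, and the
# process ID will differ on your device):
# text = read_dump("dmd-default-27314.txt.gz")
# print(text[:500])
```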

B2G

Run tools/get_about_memory.py. If DMD is enabled on your device, you should see output like the following:

   $ ./get_about_memory.py 
   Got 3/3 files.
   Pulled files into about-memory-18.
   Got 3 DMD dump(s).
   [...]
   Done processing DMD files.  Have a look in about-memory-18.

get_about_memory.py invokes fix_b2g_stack.py, so you shouldn't need to run it yourself, but it's there in case you need it. It works just like fix-linux-stack.pl on desktop.

See get_about_memory.py --help for more options, but you probably don't need anything other than the defaults.

The output

Pre-processing

Note: You can skip this step if you're on a B2G device build.

DMD's output file contains a lot of stack traces. As printed, many of the stack trace entries will look like this:

 ???[/home/njn/moz/mi2/dmdo64/dist/bin/libxul.so +0x1761BCD] 0x7f845186bbcd

To make them more useful, you need to run them through a "stack-fixing" script.

  • On Linux, use tools/rb/fix-linux-stack.pl.
  • On Mac, use tools/rb/fix_macosx_stack.py.

Both scripts read from stdin and print to stdout. After stack fixing, the entries should look more like this:

 nsStringBuffer::Alloc(unsigned long) (/home/njn/moz/mi2/xpcom/string/src/nsSubstring.cpp:177) 0x7f845186bbcd

This shows the function, filename, line number, and PC.

Note that fix-linux-stack.pl is very slow, and can take 2+ minutes to process a DMD output file. (It was almost 10x slower than that before bug 812070 landed.) fix_macosx_stack.py is faster, but can still take 30+ seconds.
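
Because the stack-fixing scripts are plain stdin-to-stdout filters, they are easy to drive from another script. The following is a minimal sketch of that idea; the helper name fix_stacks is hypothetical, and the command line in the usage comment (running the Linux script through perl from the top of a source checkout) is an assumption about your setup.

```python
import subprocess


def fix_stacks(fixer_cmd, dmd_output):
    """Pipe DMD output text through a stack-fixing command.

    fixer_cmd is an argument list for any filter that reads stdin and
    writes stdout. Raises CalledProcessError if the filter fails.
    """
    result = subprocess.run(
        fixer_cmd,
        input=dmd_output,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


# Example usage (paths are assumptions, adjust them to your checkout):
# with open("dmd-output.txt") as raw:
#     fixed = fix_stacks(["perl", "tools/rb/fix-linux-stack.pl"], raw.read())
# with open("dmd-output-fixed.txt", "w") as out:
#     out.write(fixed)
```

Reading the whole file into memory keeps the sketch short; for very large dumps, redirecting files through the script on the command line works just as well.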

Output sections

DMD's output is broken into multiple sections.

  1. "Invocation". This tells you how DMD was invoked, i.e. what options were used.
  2. "Twice-reported stack trace records". This tells you which heap blocks were reported twice or more. The presence of any such records indicates bugs in one or more memory reporters.
  3. "Unreported stack trace records". This tells you which heap blocks were not reported, which indicates where additional memory reporters would be most helpful.
  4. "Unreported stack frame records". This also presents information about unreported blocks, but in a different way. It can be used to better understand the "Unreported stack trace records" section.
  5. "Once-reported stack trace records". This is like the "Unreported stack trace records" section, but for blocks reported once.
  6. "Once-reported stack frame records". This is like the "Unreported stack frame records" section, but for blocks reported once.
  7. "Summary". This gives measurements of the total heap, and the unreported/once-reported/twice-reported portions of it.
  8. "Execution measurements". This gives some statistics about DMD's execution, which are mostly of interest to DMD's developers.

The "Twice-reported stack trace records" and "Unreported stack trace records" sections are the most important, because they indicate ways in which the memory reporters can be improved.
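
The split into unreported, once-reported and twice-reported sections can be pictured with a toy model. This is only a sketch and not DMD's actual data structure: the block names and report counts are made up, and classify is a hypothetical helper. It also shows why running the reporters twice (for example, viewing about:memory and then invoking DMDReportAndDump) floods the twice-reported section.

```python
def classify(report_counts):
    """Group block ids by how many times the memory reporters measured them."""
    sections = {"unreported": [], "once-reported": [], "twice-reported": []}
    for block, count in sorted(report_counts.items()):
        if count == 0:
            sections["unreported"].append(block)
        elif count == 1:
            sections["once-reported"].append(block)
        else:
            # Two or more reports all land in the twice-reported section.
            sections["twice-reported"].append(block)
    return sections


# Hypothetical data: block id -> number of times a reporter measured it.
counts = {"A": 0, "B": 1, "C": 1, "D": 2}
sections = classify(counts)

# Running every reporter a second time doubles each count, so all
# previously once-reported blocks move into the twice-reported section.
rerun = classify({block: 2 * count for block, count in counts.items()})

print(sections)
print(rerun)
```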

Stack trace records

The stack trace record sections are the most important ones. Here's an example stack trace record from the "Unreported stack trace records" section.

Unreported: 3 blocks in stack trace record 209 of 1,891
 36,864 bytes (26,184 requested / 10,680 slop)
 0.03% of the heap (64.55% cumulative);  0.04% of unreported (86.78% cumulative)
 Allocated at
   malloc (/home/njn/moz/mi2/memory/build/replace_malloc.c:151) 0x417170
   PR_Malloc (/home/njn/moz/mi2/nsprpub/pr/src/malloc/prmem.c:435) 0x7f68650f423c
   PL_ArenaAllocate (/home/njn/moz/mi2/nsprpub/lib/ds/plarena.c:200) 0x7f68652463e1
   nsFixedSizeAllocator::Alloc(unsigned long) (/home/njn/moz/mi2/xpcom/ds/nsFixedSizeAllocator.cpp:95) 0x7f6860f528dc
   nsNodeInfo::Create(nsIAtom*, nsIAtom*, int, unsigned short, nsIAtom*, nsNodeInfoManager*) (/home/njn/moz/mi2/content/base/src/nsNodeInfo.cpp:64) 0x7f685f640933
   nsNodeInfoManager::GetNodeInfo(nsIAtom*, nsIAtom*, int, unsigned short, nsIAtom*) (/home/njn/moz/mi2/content/base/src/nsNodeInfoManager.cpp:225) 0x7f685f642d05
   mozilla::dom::Element::SetAttrAndNotify(int, nsIAtom*, nsIAtom*, nsAttrValue const&, nsAttrValue&, unsigned char, bool, bool, bool) (/home/njn/moz/mi2/content/base/src/Element.cpp:1862) 0x7f685f60ad87
   mozilla::dom::Element::SetAttr(int, nsIAtom*, nsIAtom*, nsAString_internal const&, bool) (/home/njn/moz/mi2/content/base/src/Element.cpp:1778) 0x7f685f60a9b3
   nsXMLContentSink::AddAttributes(unsigned short const**, nsIContent*) (/home/njn/moz/mi2/content/xml/document/src/nsXMLContentSink.cpp:1464) 0x7f685fa76c5c
   nsXBLContentSink::AddAttributes(unsigned short const**, nsIContent*) (/home/njn/moz/mi2/content/xbl/src/nsXBLContentSink.cpp:882) 0x7f685fb3ad42
   nsXMLContentSink::HandleStartElement(unsigned short const*, unsigned short const**, unsigned int, int, unsigned int, bool) (/home/njn/moz/mi2/content/xml/document/src/nsXMLContentSink.cpp:1018) 0x7f685fa73db5
   nsXMLContentSink::HandleStartElement(unsigned short const*, unsigned short const**, unsigned int, int, unsigned int) (/home/njn/moz/mi2/content/xml/document/src/nsXMLContentSink.cpp:947) 0x7f685fa7370a
   nsXBLContentSink::HandleStartElement(unsigned short const*, unsigned short const**, unsigned int, int, unsigned int) (/home/njn/moz/mi2/content/xbl/src/nsXBLContentSink.cpp:258) 0x7f685fb37cc0

It tells you that there were 3 heap blocks that were allocated from the program point indicated by the "Allocated at" stack trace, that these blocks took up 36,864 bytes, and that 10,680 of those bytes were "slop" (wasted space caused by the heap allocator rounding up request sizes). It also indicates what percentage of the total heap size and the unreported portion of the heap these blocks represent.
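
As a quick check of how the numbers on the size line relate, the sketch below parses that line and confirms that requested bytes plus slop equal the total. The regular expression and the helper name parse_size_line are assumptions based only on the example record above, not on a documented output format.

```python
import re

# Matches e.g. " 36,864 bytes (26,184 requested / 10,680 slop)".
# A leading "~" on any number is tolerated and ignored.
SIZE_LINE = re.compile(
    r"~?([\d,]+) bytes \(~?([\d,]+) requested / ~?([\d,]+) slop\)"
)


def parse_size_line(line):
    """Return (total, requested, slop) in bytes from a record's size line."""
    match = SIZE_LINE.search(line)
    if match is None:
        raise ValueError("not a size line: %r" % line)
    return tuple(int(group.replace(",", "")) for group in match.groups())


total, requested, slop = parse_size_line(
    " 36,864 bytes (26,184 requested / 10,680 slop)"
)
# The bytes the program asked for plus the allocator's rounding waste
# account for the whole size of the blocks.
assert requested + slop == total
print("slop is %.1f%% of these blocks" % (100.0 * slop / total))
```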

Within each section, records are listed from largest to smallest.

Once-reported and twice-reported stack trace records also have stack traces for the report point(s). For example:

Reported at
  mozilla::dmd::Report(void const*) (/home/njn/moz/mi2/memory/replace/dmd/DMD.cpp:1740) 0x7f68652581ca
  CycleCollectorMallocSizeOf(void const*) (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3008) 0x7f6860fdfe02
  nsPurpleBuffer::SizeOfExcludingThis(unsigned long (*)(void const*)) const (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:933) 0x7f6860fdb7af
  nsCycleCollector::SizeOfIncludingThis(unsigned long (*)(void const*), unsigned long*, unsigned long*, unsigned long*, unsigned long*, unsigned long*) const (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3029) 0x7f6860fdb6b1
  CycleCollectorMultiReporter::CollectReports(nsIMemoryMultiReporterCallback*, nsISupports*) (/home/njn/moz/mi2/xpcom/base/nsCycleCollector.cpp:3075) 0x7f6860fde432
  nsMemoryInfoDumper::DumpMemoryReportsToFileImpl(nsAString_internal const&) (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:626) 0x7f6860fece79
  nsMemoryInfoDumper::DumpMemoryReportsToFile(nsAString_internal const&, bool, bool) (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:344) 0x7f6860febaf9
  mozilla::(anonymous namespace)::DumpMemoryReportsRunnable::Run() (/home/njn/moz/mi2/xpcom/base/nsMemoryInfoDumper.cpp:58) 0x7f6860fefe03

You can tell which memory reporter made the report by the name of the MallocSizeOf function near the top of the stack trace. In this case it was the cycle collector's reporter.

By default, DMD measures heap blocks above a certain size precisely, but uses sampling to measure blocks below that size. Any measurements that involve sampled blocks (even if combined with non-sampled measurements) are approximate, and this is indicated by a preceding '~'. For example:

Unreported: ~273 blocks in block group 17 of 14,611
 ~1,125,590 bytes (~1,117,936 requested / ~7,654 slop)
 0.07% of the heap (2.58% cumulative);  0.43% of unreported (16.36% cumulative)

The sampling threshold can be adjusted with an option (see below). This will affect the precision of the output and the speed at which Firefox+DMD runs.
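
If you post-process DMD output yourself, the '~' marker needs to be carried along when numbers are combined. A minimal sketch, using hypothetical helper names and the figures from the examples above:

```python
def parse_measurement(text):
    """Return (value, is_approximate) for a number such as '~1,125,590'."""
    approximate = text.startswith("~")
    value = int(text.lstrip("~").replace(",", ""))
    return value, approximate


def add_measurements(a, b):
    """Sum two (value, is_approximate) pairs.

    The result is approximate if either input is, matching the rule that
    any measurement involving sampled blocks is approximate.
    """
    return a[0] + b[0], a[1] or b[1]


requested = parse_measurement("~1,117,936")
slop = parse_measurement("~7,654")
total = add_measurements(requested, slop)

# Mixing a precise value with a sampled one still gives an approximate sum.
exact = parse_measurement("36,864")
mixed = add_measurements(exact, slop)

print(total, mixed)
```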

Stack frame records

One shortcoming of stack trace records is that sometimes multiple records have stacks that differ only slightly, and this can obscure the fact that a single function is ultimately responsible for all of the blocks covered by those records. Stack frame records are an attempt to identify such cases.

Here's an example stack frame record from the "Unreported stack frame records" section:

Unreported: ~4,047 blocks from ~979 stack trace records in stack frame record 39 of 11,482
 ~16,634,045 bytes (~16,624,445 requested / ~9,600 slop)
 0.99% of the heap;  0.00% of unreported
 PC is
   PL_DHashTableInit (/home/njn/moz/mi2/dmdo64/xpcom/build/pldhash.cpp:231) 0x7f8451816018

A stack frame record is a merging of every stack trace record that had an allocation stack trace containing its stack frame. In this example we can see that the initialization of pldhash structures (via pldhash.cpp:231) accounted for roughly 16.6 million bytes (about 15.9 MiB) of heap allocations, spread over approximately 4,047 heap blocks and approximately 979 stack trace records.
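
The merging can be sketched as follows. This is an illustration of the idea rather than DMD's implementation: the record tuples, frame names and byte counts are made up, and counting a frame only once per stack trace record (even if it appears several times in a recursive stack) is an assumption of the sketch.

```python
from collections import defaultdict


def build_frame_records(trace_records):
    """Merge stack trace records into per-frame totals.

    trace_records is a list of (frames, num_blocks, num_bytes) tuples,
    where frames is the allocation stack as a list of frame names.
    Returns {frame: [num_trace_records, num_blocks, num_bytes]}.
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for frames, num_blocks, num_bytes in trace_records:
        # set() so that a frame repeated within one stack counts once.
        for frame in set(frames):
            entry = totals[frame]
            entry[0] += 1
            entry[1] += num_blocks
            entry[2] += num_bytes
    return dict(totals)


# Hypothetical records: two different callers share PL_DHashTableInit.
records = [
    (["malloc", "PL_DHashTableInit", "CallerA"], 3, 3000),
    (["malloc", "PL_DHashTableInit", "CallerB"], 2, 1000),
    (["malloc", "Unrelated"], 1, 500),
]
frame_records = build_frame_records(records)
print(frame_records["PL_DHashTableInit"])
```

Because every frame in a stack is credited with the full size of the record, the totals of different frame records overlap and cannot meaningfully be added together.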

Options

Setting the DMD environment variable to 1 gives default options. But you can also specify non-default options by setting DMD to a whitespace-separated list of --option=val entries.

At the moment, you can provide two options to DMD: --sample-below=<1..n> and --mode=<normal|test|stress>.
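
If you launch Firefox from a script, the environment described in the Run section can be assembled programmatically. The sketch below covers the Linux case only; dmd_env is a hypothetical helper, and the object directory path in the usage comment is an assumption.

```python
import os


def dmd_env(objdir, options=None, base_env=None):
    """Return an environment dict for launching a DMD build on Linux.

    options is a list of (name, value) pairs; when it is empty or None,
    DMD is set to "1", which selects the default options.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = objdir + "/dist/lib/libdmd.so"
    env["LD_LIBRARY_PATH"] = objdir + "/dist/lib/"
    if options:
        # Whitespace-separated list of --option=val entries.
        env["DMD"] = " ".join(
            "--%s=%s" % (name, value) for name, value in options
        )
    else:
        env["DMD"] = "1"
    return env


# Example usage (objdir path and binary location are assumptions):
# import subprocess
# env = dmd_env("/path/to/objdir", [("sample-below", 15)])
# subprocess.call(["/path/to/objdir/dist/bin/firefox"], env=env)
```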

--sample-below=<1..n>

By default, DMD samples blocks with a sample-below size of 4093. You can change this by invoking Firefox with e.g. DMD="--sample-below=15".

When DMD samples allocations, it ignores some allocations in an attempt to run faster. The chance that an allocation is recorded depends on its size.

If the sample-below size is 4093, then when a malloc allocating n >= 4093 bytes occurs, DMD records that allocation precisely.

If on the other hand a malloc allocating n < 4093 bytes occurs, DMD may or may not record that allocation at all.

When the allocation occurs, DMD increments a global counter by n. If the counter's new value is less than 4093, we ignore the allocation. If the counter's new value is at least 4093, we pretend as though the current stack trace just allocated a block of size 4093 bytes, and we decrement the counter by 4093.

In this way, we record callsites roughly in proportion to how much memory they allocate. The idea is that if one callsite allocates many small blocks, it will cause the counter to roll over often enough that the callsite will show up in DMD and be blamed for roughly the right amount of memory usage.

(Note that "a malloc allocating n bytes" is not the same as malloc(n)! DMD attributes to a malloc call the full amount of memory reserved by the allocator for that malloc(), which may be greater than n. For example, if you malloc(1023), the allocator will probably give you a 1024-byte block.)
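
The scheme can be modelled in a few lines. This is a toy sketch and not DMD's code: the Sampler class and the callsite names are made up, sizes are taken to be the actual block sizes handed out by the allocator, and treating a counter exactly equal to the threshold as a rollover is an assumption of the sketch.

```python
SAMPLE_BELOW = 4093  # the default sample-below size


class Sampler:
    """Toy model of size-proportional sampling of small allocations."""

    def __init__(self, sample_below=SAMPLE_BELOW):
        self.sample_below = sample_below
        self.counter = 0
        self.recorded = []  # (callsite, bytes_blamed, was_sampled)

    def malloc(self, callsite, size):
        if size >= self.sample_below:
            # Large blocks are always recorded precisely.
            self.recorded.append((callsite, size, False))
            return
        self.counter += size
        if self.counter >= self.sample_below:
            # Rollover: blame this callsite for one threshold-sized block.
            self.counter -= self.sample_below
            self.recorded.append((callsite, self.sample_below, True))


sampler = Sampler()
for _ in range(1000):
    sampler.malloc("small_site", 64)  # 64,000 bytes in small blocks
sampler.malloc("big_site", 8192)      # recorded exactly

blamed = sum(b for site, b, _ in sampler.recorded if site == "small_site")
print("small_site allocated 64000 bytes and was blamed for", blamed)
```

The blamed total undercounts by at most whatever is left in the counter, which is always less than the sample-below size.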

We've found that sampling in this way works well in practice, particularly since most of the pieces of memory we're trying to track down with DMD happen to be large allocations. Sampling leads to a huge speedup in DMD's performance and a large reduction in its memory usage.

If you want DMD to record all allocations precisely, pass --sample-below=1. If instead you want DMD to run faster, pass --sample-below=K for K > 4093.

Choosing your sample-below value

Although you can pass --sample-below=K for any natural number K, we've found that primes seem to work particularly well. We chose the default as 4093 because it's the largest prime smaller than 4096.

There are number-theoretic reasons why prime numbers work well here. At a high level: We want the probability that we record a malloc allocating n bytes to be n/K, independent of which allocations came before it.

Since whether we record a malloc is a deterministic function of n and the global counter C, this means that we want C to have as much entropy as possible.

But consider the case when all our allocations are for multiples of 4 bytes (that is, n % 4 == 0 for all allocs). If K=4096, C is only ever a multiple of 4. But on the other hand if K is a prime, it can be shown that C can take on all values 0..K-1.

More possible values of C means more randomness, which means better sampling.
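
The effect is easy to observe empirically. The sketch below replays a repeating stream of allocation sizes that are all multiples of 4 and collects every value the counter takes. The function name, the chosen sizes and the number of steps are assumptions made for illustration.

```python
def reachable_counter_values(k, sizes, steps=20000):
    """Return the set of counter values seen for a repeating size stream.

    All sizes must be smaller than k, as they are for sampled blocks.
    """
    counter = 0
    seen = {counter}
    for i in range(steps):
        counter += sizes[i % len(sizes)]
        if counter >= k:
            counter -= k
        seen.add(counter)
    return seen


sizes = (16, 32, 64)  # every size is a multiple of 4
power_of_two = reachable_counter_values(4096, sizes)
prime = reachable_counter_values(4093, sizes)

# With K = 4096 the counter can never leave the multiples of 4.
assert all(value % 4 == 0 for value in power_of_two)
print(len(power_of_two), "distinct counter values with K=4096")
print(len(prime), "distinct counter values with K=4093")
```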

--mode=<normal|test|stress>

--mode=<normal|test|stress> can be used to invoke "test" or "stress" mode, which are useful if you're hacking on DMD. The default is normal mode.

"test" and "stress" modes set their own --sample-below values, so you should never have to specify both --sample-below and --mode.

Setting options on B2G device builds

If you want to run B2G on a device with args other than DMD=1, you'll need to modify the gonk-misc/b2g.sh script and then push it to the device.

To push the modified script, do something like

  adb shell stop b2g
  adb remount
  adb push b2g.sh /system/bin
  adb shell chmod 0755 /system/bin/b2g.sh
  adb shell start b2g

If you want to run B2G on the device under GDB with args other than DMD=1, modify the run-gdb.sh script. You don't need to push anything.

Which heap blocks are reported?

At this stage you might wonder how DMD knows which allocations have been reported and which haven't. DMD only knows about heap blocks that are measured via a function created with one of the following two macros:

 NS_MEMORY_REPORTER_MALLOC_SIZEOF_FUN
 NS_MEMORY_REPORTER_MALLOC_SIZEOF_ON_ALLOC_FUN

Fortunately, most of the existing memory reporters do this. See Platform/Memory_Reporting for more details about how memory reporters are written.

Troubleshooting DMD

Contact Nick Nethercote ("njn" on IRC) or Justin Lebar ("jlebar" on IRC).