Sfink/Useful Notes

== Problem A: System too unusable to diagnose ==


When a bad memory leak kicks in, the system can be too unusable to get useful data out.


=== Solution ===


Make it easier to get information out when the system is suffering


[[#A1]] Periodically log memory-related information (existing bug, I think? also
telemetry)


[[#A2]] Maintain a rotating database of detailed memory-related information (cf
atop)


[[#A3]] Make about:memory capable of outputting to a file, for use with a
command-line invocation 'firefox about:memory?verbose=1&outfile=...'


=== Solution ===


Prevent the system from getting into such a bad state


[[#A4]] Make a per-compartment (or per-?) cap on memory usage


[[#A5]] When sufferingMode==true, disable GC/CC on big tabs. Probably need to
deactivate them too.


[[#A6]] Early warning when memory usage is getting too high


[[#A7]] Crash reporter-like UI for reporting memory problems (do not require an
actual crash to trigger)


== Problem B: Regular users can't generate useful reports ==


Hard for regular users to generate a useful memory problem report


(all solutions from problem A are relevant here)


[[#B1]] Provide a way to dump and submit a reachability graph


[[#B2]] Documentation for how to best help with a memory problem, with various
steps to follow.


[[#B3]] Track memory to individual page/tab/compartment/principals.


[[#B4]] Tools for generating profiles with subsets of addons installed (or for
running with different subsets of addons within one profile)


[[#B5]] Tools for blaming memory usage on addons (eg detecting "safe" addons to
remove from consideration. Cross-referencing other users' addons and memory
usage similar to the crash correlation reports -- requires telemetry.)


== Problem C: Knowledgeable users can't generate useful reports ==


Hard for developers or knowledgeable and motivated users to generate
a useful memory problem report


Problem B above crosses over into this one, so everything there is relevant.


[[#C1]] Rationalize and document all of our various leak-detection tools.


[[#C2]] Automation and Windows equivalents of my /proc/<pid>/maps hacks


[[#C3]] Dumpers that give full heap, full graph, pruned graph. Visualizers, analyzers, etc. of the dumps.


[[#C4]] Collect age of various memory objects (how many CCs or GCs they have been alive for.)


== Problem D: Uncollected garbage ==


Garbage is not collected


=== Solution ===


Report cycles that CC misses


[[#D1]] Conservative scanner to find cycles involving things not marked as
CC-participants and report them as suspicious.


=== Solution ===

Report resources that leak over time but are still referenced (so
they are cleaned up before shutdown)


[[#D2]] Register "expected lifetime" at acquisition time. Report things that live
longer than expected, filtered by diagnostics. ("lifetime assertions"? Not
quite.)


[[#D3]] Detect subgraphs that grow (at a constant rate?) while a page is open.


[[#D4]] Detect subgraphs that are never accessed


== Problem E: Unleaked but excessive memory usage ==


High memory usage, not leaked


(aside from current work like generational GC)


[[#E1]] "Simulator" that runs over logs and estimates peak memory usage if CC/GC
ran at optimal times.


[[#E2]] Use reproducible test runs to evaluate what the performance/memory
tradeoff is for various things (eg jit code, structure sizes)


== Problem F: Hard to track down problems ==


Hard to navigate through a memory dump or the current state to track
down a specific problem


[[#F1]] Dump all roots of a compartment, and trace roots back to the XPCOM/DOM/whatever thing that is holding onto that root (when available)


[[#F2]] Go from JS object to things keeping it alive (dump out GC edges) -- see jimb's findReferences (currently JS shell only)


[[#F3]] Record addr,size,stack at every allocation (kgadd's heap visualizer; see the sketch after this list)


[[#F4]]
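
For F3, a rough sketch of the data that would need to be recorded. LoggedMalloc is an invented name for illustration; a real tool would presumably interpose on malloc itself and write a compact binary log rather than text to stderr.

<pre>
// Hypothetical sketch of recording (addr, size, stack) per allocation.
#include <execinfo.h>
#include <cstdio>
#include <cstdlib>

void* LoggedMalloc(size_t size) {
  void* p = std::malloc(size);
  void* stack[16];
  int frames = backtrace(stack, 16);           // capture the allocation stack
  char** names = backtrace_symbols(stack, frames);
  std::fprintf(stderr, "alloc %p size %zu\n", p, size);
  for (int i = 0; i < frames; i++)
    std::fprintf(stderr, "  %s\n", names[i]);  // one stack frame per line
  std::free(names);                            // backtrace_symbols mallocs
  return p;
}
</pre>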


----


Details:


<div id='A2'>A2. atop records a ton of statistics about memory, disk, network, CPU, and
other things at a 10-minute sampling interval. Stats are collected at both
global and per-process granularity. It monitors every process that starts and
stops, even one that appeared and disappeared entirely between two samples. It
dumps all this into a somewhat-compressed binary log.


The visual UI has a good set of heuristics for detecting "large" values and
coloring the output accordingly. If your disk is busy for >90% of the sampling
interval, it'll turn red. If your network traffic is a high percentage of the
expected maximum bandwidth, it'll turn red. And so on.


It can run in a 'top-like' mode, where it displays the current state of
things, as well as in a historical mode where it reads from a log file. (The
two are decidedly *not* seamless, but they should be.)


It also allows dumping historical data to text files. I've used that for
generating graphs of various values.


For the browser, many of the same metrics are applicable, but I'd also like an
equivalent of the per-process info. The idea is to be able to answer "what was
going on at XXX?" So it should record user and browser actions, which tab was
active, network requests, significant events firing, etc.
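
Something like this minimal sketch is what I have in mind for the rotating store. All names here (ReadRssBytes, the memlog.N ring) are invented for illustration, and a real version would sample far more than RSS:

<pre>
// Hypothetical sketch of an atop-style rotating sampler.
#include <chrono>
#include <fstream>
#include <string>
#include <thread>
#include <unistd.h>

static size_t ReadRssBytes() {
  std::ifstream statm("/proc/self/statm");
  size_t total = 0, resident = 0;
  statm >> total >> resident;          // fields: total and resident pages
  return resident * sysconf(_SC_PAGESIZE);
}

int main() {
  const int kSamplesPerFile = 1000;    // rotate after this many samples
  const int kFilesInRing = 10;         // keep a bounded amount of history
  int samples = 0, fileIndex = 0;
  std::ofstream log("memlog.0");
  for (;;) {
    log << ReadRssBytes() << std::endl;      // one sample per line
    if (++samples >= kSamplesPerFile) {      // move to the next ring slot
      samples = 0;
      fileIndex = (fileIndex + 1) % kFilesInRing;
      log = std::ofstream("memlog." + std::to_string(fileIndex));
    }
    std::this_thread::sleep_for(std::chrono::minutes(10));  // atop's cadence
  }
}
</pre>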


</div>
----


<div id='A3'>A3. The idea is that rather than waiting for the screen to redraw after every
action on the way to about:memory, you just run firefox 'about:memory...' and
go have a cup of tea while it thinks about it.


</div>
----


<div id='A5'>A5. This is based on pure speculation, but I don't understand why the browser
is so incredibly unusable when memory usage is going nuts. Why is all that
memory being touched? Why isn't it just swapped out and forgotten? Under the
assumption that it's the GC scanning it over and over again, it seems like it
would be nice to suppress GC in this situation. Generational GC could eliminate
this problem in a nicer and much more principled way.


</div>
----


<div id='B2'>B2. I have the impression that we have many, many memory-related problem
reports that end up being useless. I think that's really our fault; it's too
hard for users to file useful bug reports. Experienced Mozilla devs don't even
know what to do.


</div>
----


<div id='B5'>B5. eg: collect up all API calls that an addon makes (or record them, or
whatever.) Maintain a whitelist of APIs. (If you pass in a string, assume it
may be duplicated a thousand times and stored in a sqlite DB forever, but if
you're just setting existing booleans or reading state, you're blameless.)
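
A toy sketch of the classification step, assuming the calls have already been recorded somehow. The type, function, and whitelist entries are all invented names:

<pre>
// Hypothetical sketch: split recorded addon API calls into "blameless"
// and "suspicious" using a whitelist.
#include <set>
#include <string>
#include <vector>

struct ApiCall {
  std::string name;
  bool passesString;   // string arguments may be retained indefinitely
};

// APIs that only read or toggle existing state.
static const std::set<std::string> kBlameless = {
  "getBoolPref", "setBoolPref", "getSelectedTab",
};

std::vector<ApiCall> SuspiciousCalls(const std::vector<ApiCall>& calls) {
  std::vector<ApiCall> out;
  for (const auto& c : calls)
    if (c.passesString || !kBlameless.count(c.name))
      out.push_back(c);   // could be stashing data somewhere long-lived
  return out;
}
</pre>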


</div>
----


<div id='C2'>C2. When looking at a memory leak, I took several snapshots of
/proc/<pid>/maps, diffed them to find a memory region that appeared and did not
disappear, and then dumped out the raw memory to a file. Then I ran strings on
it.
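
A minimal sketch of the diffing step, assuming two saved copies of the maps file (the tool name is invented):

<pre>
// Hypothetical sketch: print regions present in a later /proc/<pid>/maps
// snapshot but absent from an earlier one.
#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main(int argc, char** argv) {
  if (argc != 3) {
    std::cerr << "usage: mapsdiff <before> <after>\n";
    return 1;
  }
  std::set<std::string> before;
  std::ifstream in1(argv[1]);
  for (std::string line; std::getline(in1, line); )
    before.insert(line.substr(0, line.find(' ')));   // keep "start-end" only

  std::ifstream in2(argv[2]);
  for (std::string line; std::getline(in2, line); )
    if (!before.count(line.substr(0, line.find(' '))))
      std::cout << line << '\n';   // new region that stuck around
  return 0;
}
</pre>

The dump-and-strings step can then target just the regions this prints.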


</div>
----


<div id='D2'>D2. I don't really know enough about the system to flesh this out properly, but
it seems like when a bunch of memory lingers on past the point where it really
ought to be dead, many of the objects comprising that memory should be able to
"know" that they *probably* shouldn't live past... the current page, or for
more than a few seconds, or whatever. Assuming this is possible, it should be
possible to walk up a dominator graph and give a fairly directed answer to
"why has this outlived what it thought its lifespan would be?"


Not every memory allocation needs to be marked for this to work. You just need
one object within the "leaked" memory to be marked.


It could also walk the graph "en masse" to ignore individual objects that are
reachable longer than expected and focus on the clusters of objects that are
kept alive by the same thing. (I'm thinking that the expected lifetime is a
guess, and may be inaccurate.)
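
A rough sketch of what lifetime registration could look like. LifetimeRegistry and its methods are invented names, not an existing API:

<pre>
// Hypothetical sketch of "expected lifetime" registration. Objects record
// a deadline at acquisition; a periodic sweep reports the overstayers.
#include <chrono>
#include <cstdio>
#include <map>
#include <mutex>

class LifetimeRegistry {
  using Clock = std::chrono::steady_clock;
  std::mutex mLock;
  std::map<const void*, Clock::time_point> mDeadlines;
public:
  void Expect(const void* obj, std::chrono::seconds lifetime) {
    std::lock_guard<std::mutex> g(mLock);
    mDeadlines[obj] = Clock::now() + lifetime;   // when it "should" be dead
  }
  void Forget(const void* obj) {                 // call from the destructor
    std::lock_guard<std::mutex> g(mLock);
    mDeadlines.erase(obj);
  }
  void ReportOverstayers() {                     // run after each GC/CC, say
    std::lock_guard<std::mutex> g(mLock);
    auto now = Clock::now();
    for (const auto& entry : mDeadlines)
      if (now > entry.second)
        std::fprintf(stderr, "%p outlived its expected lifetime\n",
                     entry.first);
  }
};
</pre>

The dominator-graph walk described above would then start from whatever ReportOverstayers flags.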


</div>
----


<div id='D4'>D4. eg use mprotect on a random subset of the heap to find pages (or smaller
regions, but that's harder) that are never accessed after some point. Remove
the GC/CC from consideration.
</div>
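
A minimal sketch of the page-watching mechanics, assuming the sampled page is page-aligned and that a SIGSEGV handler is acceptable (all names invented; filtering out the GC/CC's own accesses is not shown):

<pre>
// Hypothetical sketch: make one sampled page inaccessible, then use a
// SIGSEGV handler to observe whether it is ever touched afterward.
#include <csignal>
#include <sys/mman.h>
#include <unistd.h>

static char* gWatchedPage = nullptr;
static volatile sig_atomic_t gTouched = 0;

static void OnSegv(int, siginfo_t* info, void*) {
  char* addr = static_cast<char*>(info->si_addr);
  if (addr >= gWatchedPage && addr < gWatchedPage + getpagesize()) {
    gTouched = 1;   // the page was accessed after the sampling point
    mprotect(gWatchedPage, getpagesize(), PROT_READ | PROT_WRITE);
    return;         // the faulting access is retried, and now succeeds
  }
  _exit(1);         // a real version would chain to the previous handler
}

void WatchPage(void* page) {   // `page` must be page-aligned
  gWatchedPage = static_cast<char*>(page);
  struct sigaction sa = {};
  sa.sa_sigaction = OnSegv;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);
  mprotect(page, getpagesize(), PROT_NONE);   // any later access now faults
}
</pre>
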
<div id="A1">Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text Some longer explanatory text </div
