== Problem A ==
When a bad memory leak kicks in, the system can be too unusable to
get useful data out.
=== Solution ===
Make it easier to get information out when the system is suffering
[[#A1]] Periodically log memory-related information (existing bug, I think? also
telemetry)
[[#A2]] Maintain a rotating database of detailed memory-related information (cf.
atop)
[[#A3]] Make about:memory capable of outputting to a file, for use with a
command-line invocation 'firefox about:memory?verbose=1&outfile=...'
=== Solution ===
Prevent the system from getting into such a bad state
[[#A4]] Make a per-compartment (or per-?) cap on memory usage
[[#A5]] When sufferingMode==true, disable GC/CC on big tabs. Probably need to
deactivate them too.
[[#A6]] Early warning when memory usage is getting too high
[[#A7]] Crash reporter-like UI for reporting memory problems (do not require an
actual crash to trigger)
== Problem B ==
It is hard for regular users to generate a useful memory problem report
(All solutions from Problem A are relevant here.)
[[#B1]] Provide a way to dump and submit a reachability graph
[[#B2]] Documentation on how best to help with a memory problem, with various
steps to follow.
[[#B3]] Track memory to individual page/tab/compartment/principals.
[[#B4]] Tools for generating profiles with subsets of addons installed (or for
running with different subsets of addons within one profile)
[[#B5]] Tools for blaming memory usage on addons (e.g. detecting "safe" addons to
remove from consideration, or cross-referencing other users' addons and memory
usage, similar to the crash correlation reports. The latter requires telemetry.)
== Problem C ==
It is hard for developers or knowledgeable and motivated users to generate
a useful memory problem report
Problem B above overlaps with this one, so everything there is relevant.
[[#C1]] Rationalize and document all of our various leak-detection tools.
[[#C2]] Automation and Windows equivalents of my /proc/<pid>/maps hacks
[[#C3]] Dumpers that give full heap, full graph, pruned graph. Visualizers,
analyzers, etc. of the dumps.
[[#C4]] Collect the age of various memory objects (how many CCs or GCs each has
survived.)
== Problem D ==
Garbage is not collected
=== Solution ===
Report cycles that CC misses
[[#D1]] Conservative scanner to find cycles involving things not marked as
CC-participants and report them as suspicious.
=== Solution ===
Report resources that leak over time but are still referenced (so
they are cleaned up before shutdown)
[[#D2]] Register "expected lifetime" at acquisition time. Report things that live
longer than expected, filtered by diagnostics. ("lifetime assertions"? Not
quite.)
[[#D3]] Detect subgraphs that grow (at a constant rate?) while a page is open.
[[#D4]] Detect subgraphs that are never accessed
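For D3, one cheap way to flag "grows at a constant rate" is a least-squares line through periodic size samples of a subgraph. This is only a sketch under assumptions: the function names, the (time, bytes) sample format, and the thresholds are all made up for illustration, and nothing here says how the subgraph sizes would actually be measured.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (time in seconds, subgraph size in bytes)


def linear_fit(samples: List[Sample]) -> Tuple[float, float]:
    """Return (slope, r_squared) of a least-squares line through samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct sample times")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    syy = sum((y - mean_y) ** 2 for _, y in samples)
    slope = sxy / sxx
    # A perfectly flat series fits a line exactly, so report r^2 = 1.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


def grows_steadily(samples: List[Sample],
                   min_slope: float = 0.0,
                   min_r_squared: float = 0.9) -> bool:
    """Hypothetical heuristic: positive slope and a close-to-linear fit."""
    slope, r_squared = linear_fit(samples)
    return slope > min_slope and r_squared >= min_r_squared
```

A sawtooth (memory that grows and is then collected) gets a slope near zero or a poor fit, so it is not flagged; only sizes that keep climbing are.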
== Problem E ==
High memory usage, not leaked
(aside from current work like generational GC)
[[#E1]] "Simulator" that runs over logs and estimates peak memory usage if CC/GC
ran at optimal times.
[[#E2]] Use reproducible test runs to evaluate what the performance/memory
tradeoff is for various things (e.g. JIT code, structure sizes)
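To make E1 concrete, here is a rough sketch of such a simulator. It assumes a made-up log format, which is not something the browser emits today: `("alloc", id, size)` when an object is allocated, `("dead", id)` when it becomes unreachable, and `("gc",)` when a collection actually ran. "Optimal" here means garbage is reclaimed the instant it becomes unreachable.

```python
from typing import Dict, Iterable, Tuple


def peak_usage(log: Iterable[tuple]) -> Tuple[int, int]:
    """Return (actual_peak, optimal_peak) in bytes for a hypothetical log."""
    sizes: Dict[str, int] = {}
    live = 0      # bytes still reachable
    garbage = 0   # bytes unreachable but not yet collected
    actual_peak = 0
    optimal_peak = 0
    for event in log:
        kind = event[0]
        if kind == "alloc":
            _, obj_id, size = event
            sizes[obj_id] = size
            live += size
        elif kind == "dead":
            size = sizes.pop(event[1])
            live -= size
            garbage += size
        elif kind == "gc":
            # The real collector frees everything that is garbage right now.
            garbage = 0
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        # Actual usage includes uncollected garbage; optimal usage does not.
        actual_peak = max(actual_peak, live + garbage)
        optimal_peak = max(optimal_peak, live)
    return actual_peak, optimal_peak
```

The gap between the two peaks is an upper bound on what better GC/CC scheduling alone could save; if the optimal peak is already high, the problem lies elsewhere.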
== Problem F ==
Hard to navigate through a memory dump or the current state to track
down a specific problem
[[#F1]] Dump all roots of a compartment, and trace roots back to the XPCOM/DOM/whatever thing that is holding onto that root (when available)
[[#F2]] Record addr,size,stack at every allocation
[[#F3]]
</div>
----------------------------------------------------------------------
Details:
<div id='A2'>A2. atop records a ton of statistics about memory, disk, network, CPU, and
other things at a 10-minute sampling interval. Stats are collected at both
global and per-process granularity. It monitors every process that starts and
stops, even if the process appeared and disappeared entirely between two
samples. It dumps all this into a somewhat-compressed binary log.
The visual UI has a good set of heuristics for detecting "large" values, and
coloring the output accordingly. If your disk is busy for >90% of the sampling
interval, it'll turn red. If your network traffic is a high percentage of the
expected maximum bandwidth, it'll turn red, and so on.
It lets you use it in 'top-like' mode, where it displays the current state of
things, as well as in a historical mode where it reads from a log file. (It is
decidedly *not* seamless between the two, but it should be.)
It also allows dumping historical data to text files. I've used that for
generating graphs of various values.
For the browser, many of the same metrics are applicable, but I'd also like an
equivalent of the per-process info. The idea is to know "what was going on at
XXX?" So it should cover user and browser actions, which tab was active, network
requests, significant events firing, etc.
</div>
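The "rotating" part of A2 could be as simple as a fixed-capacity ring of timestamped samples. A minimal sketch follows, assuming an in-memory store; `RotatingLog`, `record`, and `window` are hypothetical names, not an existing browser or atop API, and a real version would presumably persist to disk.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Sample = Tuple[float, Dict[str, int]]  # (timestamp, metric name -> value)


class RotatingLog:
    """Keep only the most recent `capacity` samples (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._samples: Deque[Sample] = deque(maxlen=capacity)

    def record(self, timestamp: float, metrics: Dict[str, int]) -> None:
        # Appending to a full deque silently drops the oldest sample.
        self._samples.append((timestamp, dict(metrics)))

    def window(self, start: float, end: float) -> List[Sample]:
        # Answers "what was going on at XXX?" for a time range.
        return [s for s in self._samples if start <= s[0] <= end]

    def __len__(self) -> int:
        return len(self._samples)
```

Bounding the store by sample count keeps its own memory use predictable, which matters when the point is to diagnose a browser that is already short on memory.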
----
<div id='A3'>A3. The idea is that rather than waiting for the screen to redraw for every
action on the way to about:memory, you just run firefox 'about:memory...' and go
have a cup of tea while it thinks about it.
</div>
----
<div id='A5'>A5. This is based on pure speculation, but I don't understand why the browser
is so incredibly unusable when memory usage is going nuts. Why is all that
memory being touched? Why isn't it just swapped out and forgotten? Under the
assumption that it's the GC scanning it over and over again, it seems like it
would be nice to suppress GC in this situation. Generational GC could eliminate
this problem in a nicer and much more principled way.
</div>
----
<div id='B2'>B2. I have the impression that we have many, many memory-related problem
reports that end up being useless. I think that's really our fault; it's too
hard for users to file useful bug reports. Even experienced Mozilla devs don't
know what to do.
</div>
----
<div id='B5'>B5. e.g. collect all API calls that an addon makes (or record them, or
whatever.) Maintain a whitelist of APIs. (If you pass in a string, assume it
may be duplicated a thousand times and stored in a sqlite DB forever, but if
you're just setting existing booleans or reading state, you're blameless.)
</div>
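As a sketch of the whitelist check in B5: given the recorded API calls per addon, an addon counts as "blameless" only if every call it made is on the whitelist. The function name and the API names in the usage example are invented for illustration; no real addon API is implied.

```python
from typing import Dict, Iterable, List, Set


def blameless_addons(calls_by_addon: Dict[str, Iterable[str]],
                     whitelist: Set[str]) -> List[str]:
    """Return addons whose every recorded API call is on the whitelist.

    An addon with no recorded calls is treated as blameless.
    """
    return sorted(
        addon
        for addon, calls in calls_by_addon.items()
        if all(call in whitelist for call in calls)
    )


# Usage with made-up API names:
recorded = {
    "theme-switcher": ["prefs.getBool", "prefs.setBool"],
    "note-taker": ["prefs.getBool", "storage.putString"],
}
safe = blameless_addons(recorded, {"prefs.getBool", "prefs.setBool"})
```

Here `safe` contains only `theme-switcher`, since `note-taker` passes a string to an API that is not on the whitelist.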
----
<div id='C2'>C2. When looking at a memory leak, I took several snapshots of
/proc/<pid>/maps, diffed them to find a memory region that appeared and did not
disappear, and then dumped out the raw memory to a file. Then I ran strings on
it.
</div>
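The snapshot-diffing step of C2 could be automated roughly as follows. This is a sketch for Linux only, and the helper names are hypothetical. It keys regions by their exact address range, so a region that grows between snapshots shows up as a new one. Dumping the raw memory and running strings on it are left out.

```python
from typing import List, Set, Tuple

Region = Tuple[int, int, str]  # (start address, end address, pathname or "")


def parse_maps(text: str) -> Set[Region]:
    """Parse the contents of a /proc/<pid>/maps file into a set of regions."""
    regions: Set[Region] = set()
    for line in text.splitlines():
        fields = line.split(None, 5)
        if not fields:
            continue
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.add((int(start_s, 16), int(end_s, 16), path))
    return regions


def read_maps(pid: int) -> Set[Region]:
    """Take one snapshot of a live process (Linux only)."""
    with open(f"/proc/{pid}/maps") as f:
        return parse_maps(f.read())


def persistent_new_regions(snapshots: List[Set[Region]]) -> Set[Region]:
    """Regions absent from the first snapshot that, once they appeared,
    were present in every later snapshot."""
    if len(snapshots) < 2:
        return set()
    result: Set[Region] = set()
    for region in snapshots[-1] - snapshots[0]:
        first = next(i for i, s in enumerate(snapshots) if region in s)
        if all(region in s for s in snapshots[first:]):
            result.add(region)
    return result
```

Calling `read_maps(pid)` every few seconds and feeding the list to `persistent_new_regions` gives the candidate regions worth dumping.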
----
<div id='D2'>D2. I don't really know enough about the system to flesh this out properly, but
it seems like when you have a bunch of memory lingering around when it really
ought to be dead, many of the objects comprising that memory should be
able to "know" that they *probably* shouldn't live past... the current page, or
for more than a few seconds, or whatever. Assuming this is possible, it should
be possible to walk up a dominator graph and give a fairly directed answer to
"why has this outlived what it thought its lifespan would be?"
Not every memory allocation needs to be marked for this to work. You just need
one object within the "leaked" memory to be marked.
It could also walk the graph "en masse" to ignore individual objects that are
reachable longer than expected and focus on the clusters of objects that are
kept alive by the same thing. (I'm thinking that the expected lifetime is a
guess, and may be inaccurate.)
</div>
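The registration half of D2 might look something like the sketch below: record an expected lifetime when an object is acquired, and later report whatever is still alive past its deadline. `LifetimeRegistry` and its methods are hypothetical names, the clock is injectable only to make the example testable, and the dominator-graph walk that would explain *why* something is still alive is not attempted here.

```python
import time
import weakref
from typing import Callable, Dict, List, Tuple


class LifetimeRegistry:
    """Track objects with an expected lifetime (illustrative sketch only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # id(obj) -> (weak reference, label, deadline)
        self._entries: Dict[int, Tuple[weakref.ref, str, float]] = {}

    def register(self, obj: object, label: str, expected_seconds: float) -> None:
        key = id(obj)

        def _forget(_ref: weakref.ref, key: int = key) -> None:
            # Called when obj is collected: it died, so nothing to report.
            self._entries.pop(key, None)

        deadline = self._clock() + expected_seconds
        self._entries[key] = (weakref.ref(obj, _forget), label, deadline)

    def overdue(self) -> List[Tuple[str, float]]:
        """Return (label, seconds past deadline) for objects still alive."""
        now = self._clock()
        report = []
        for ref, label, deadline in list(self._entries.values()):
            if ref() is not None and now > deadline:
                report.append((label, now - deadline))
        return sorted(report)
```

Weak references are used so that the registry itself never keeps anything alive; only a small subset of objects would need to be registered, as noted above.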
----
<div id='D4'>D4. e.g. use mprotect on a random subset of the heap to find pages (or smaller
regions, but that's harder) that are never accessed after some point. Exclude
accesses made by the GC/CC themselves from consideration.
</div>
<div id= |