Sfink/Memory Ideas
Problem A: System too unusable to diagnose
When a bad memory leak kicks in, the system can be too unusable to get useful data out.
Solutions: Make it easier to get information out when the system is suffering
#A1 Periodically log memory-related information (existing bug, I think? also telemetry)
#A2 Maintain a rotating database of detailed memory-related information (cf atop)
#A3 Make about:memory capable of outputting to a file, for use with a command-line invocation 'firefox about:memory?verbose=1&outfile=...'
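A minimal sketch of how #A1 and #A2 could fit together, assuming a hypothetical sampler callback that returns a dict of memory counters. `RotatingMemoryLog`, `sample_fn`, and the counter names are illustrative only and are not existing Firefox APIs. The point is that a bounded buffer keeps the cost of logging fixed, so it can stay on all the time and still be dumped when the system is already suffering.

```python
import json
import time
from collections import deque


class RotatingMemoryLog:
    """Keep only the most recent `capacity` memory samples (hypothetical helper)."""

    def __init__(self, sample_fn, capacity=1000, clock=time.time):
        self._sample_fn = sample_fn              # returns a dict, e.g. {"rss": 123}
        self._clock = clock                      # injectable so tests can fix the time
        self._samples = deque(maxlen=capacity)   # oldest entries fall off automatically

    def record(self):
        """Take one sample and store it with a timestamp."""
        entry = {"time": self._clock(), **self._sample_fn()}
        self._samples.append(entry)
        return entry

    def dump(self, stream):
        """Write the retained samples as JSON lines, oldest first."""
        for entry in self._samples:
            stream.write(json.dumps(entry, sort_keys=True) + "\n")

    def __len__(self):
        return len(self._samples)
```

A caller would invoke `record()` from a timer and `dump()` on demand, for example from something like the file-output mode proposed in #A3.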
Solution: Prevent the system from getting into such a bad state
#A4 Make a per-compartment (or per-?) cap on memory usage
#A5 When sufferingMode==true, disable GC/CC on big tabs. Probably need to deactivate those tabs too.
#A6 Early warning when memory usage is getting too high
#A7 Crash reporter-like UI for reporting memory problems (do not require an actual crash to trigger)
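As a rough illustration of #A4 and #A6, the check below classifies per-compartment usage against a cap and an early-warning threshold. The `MemoryPolicy` type, the 80% warning fraction, and the input dict are all assumptions made for the sketch; how usage would actually be measured per compartment is left open.

```python
from dataclasses import dataclass


@dataclass
class MemoryPolicy:
    cap_bytes: int               # hard per-compartment limit (assumed unit: bytes)
    warn_fraction: float = 0.8   # warn at 80% of the cap; an arbitrary choice


def classify_usage(usage_by_compartment, policy):
    """Map each compartment name to 'ok', 'warn', or 'over_cap'."""
    result = {}
    for name, used in usage_by_compartment.items():
        if used >= policy.cap_bytes:
            result[name] = "over_cap"
        elif used >= policy.warn_fraction * policy.cap_bytes:
            result[name] = "warn"
        else:
            result[name] = "ok"
    return result
```

The "warn" state is where an early warning (#A6) or a reporting UI (#A7) could hook in, before the cap itself has to act.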
Problem B: Regular users can't generate useful reports
Hard for regular users to generate a useful memory problem report
(all solutions from problem A are relevant here)
#B1 Provide a way to dump and submit a reachability graph
#B2 Documentation for how to best help with a memory problem, with various steps to follow.
#B3 Track memory to individual page/tab/compartment/principals.
#B4 Tools for generating profiles with subsets of addons installed (or for running with different subsets of addons within one profile)
#B5 Tools for blaming memory usage on addons (eg detecting "safe" addons to remove from consideration. Cross-referencing other users' addons and memory usage similar to the crash correlation reports -- requires telemetry.)
Problem C: Knowledgeable users can't generate useful reports
Hard for developers or knowledgeable and motivated users to generate a useful memory problem report
Problem B above overlaps with this one, so everything there is relevant here too.
#C1 Rationalize and document all of our various leak-detection tools.
#C2 Automation and Windows equivalents of my /proc/<pid>/maps hacks
#C3 Dumpers that give full heap, full graph, pruned graph. Visualizers, analyzers, etc. of the dumps.
#C4 Collect the age of various memory objects (how many CCs or GCs each one has survived).
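For #C2, one plausible starting point for automating the /proc/<pid>/maps approach is a small parser that totals mapped bytes per backing path. This is a sketch based on the standard Linux maps line layout (address range, permissions, offset, device, inode, optional pathname); the function name and the "[anon]" label for unnamed mappings are choices made here, not part of the original hacks.

```python
from collections import defaultdict


def summarize_maps(maps_text):
    """Return total mapped bytes per backing path from /proc/<pid>/maps text."""
    totals = defaultdict(int)
    for line in maps_text.splitlines():
        fields = line.split(None, 5)     # the pathname field may contain spaces
        if len(fields) < 5:
            continue                     # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        # Mappings without a pathname are anonymous memory.
        path = fields[5].strip() if len(fields) > 5 else "[anon]"
        totals[path] += end - start
    return dict(totals)
```

On Linux the input would come from reading `/proc/<pid>/maps` for the browser process; a Windows equivalent would need a different data source entirely.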
Problem D: Uncollected garbage
Garbage is not collected
Solution: Report cycles that CC misses
#D1 Conservative scanner to find cycles involving things not marked as CC-participants and report them as suspicious.
Solution: Report resources that leak over time but are still referenced (so they are cleaned up before shutdown)
#D2 Register "expected lifetime" at acquisition time. Report things that live longer than expected, filtered by diagnostics. ("lifetime assertions"? Not quite.)
#D3 Detect subgraphs that grow (at a constant rate?) while a page is open.
#D4 Detect subgraphs that are never accessed
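One simple way to approach #D3 is to sample the size of each subgraph periodically and fit a line to the samples. The sketch below assumes the sizes have already been collected per subgraph (how to identify and measure a subgraph is the hard part and is not shown), and uses an ordinary least-squares slope as the growth rate. The function names and the `min_rate` threshold are hypothetical.

```python
def growth_rate(samples):
    """Least-squares slope of size against sample index (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def growing_subgraphs(sizes_by_subgraph, min_rate):
    """Return the names of subgraphs whose fitted growth rate is at least `min_rate`."""
    return sorted(name for name, samples in sizes_by_subgraph.items()
                  if growth_rate(samples) >= min_rate)
```

A fitted slope tolerates noisy samples better than requiring strictly increasing sizes, though it says nothing about whether the growth is legitimate.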
Problem E: Unleaked but excessive memory usage
High memory usage, not leaked
(aside from current work like generational gc)
#E1 "Simulator" that runs over logs and estimates peak memory usage if CC/GC ran at optimal times.
#E2 Use reproducible test runs to evaluate what the performance/memory tradeoff is for various things (eg jit code, structure sizes)
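The #E1 "simulator" could start out as something like the following, which replays a log and compares the heap peak actually observed with the peak an ideal collector would have produced. The event format is invented for this sketch, and "optimal" is taken to mean that every object is freed the instant it becomes unreachable, which gives a lower bound rather than an achievable schedule.

```python
def simulate_peaks(events):
    """Return (actual_peak, optimal_peak) in bytes for a log of events.

    Events (a hypothetical log format):
      ("alloc", obj_id, size)  object allocated
      ("dead", obj_id)         object became unreachable
      ("gc",)                  collector ran and freed all dead objects
    """
    sizes = {}
    dead = set()
    actual = optimal = actual_peak = optimal_peak = 0
    for event in events:
        kind = event[0]
        if kind == "alloc":
            _, obj_id, size = event
            sizes[obj_id] = size
            actual += size
            optimal += size
        elif kind == "dead":
            obj_id = event[1]
            dead.add(obj_id)
            optimal -= sizes[obj_id]    # an ideal collector frees it immediately
        elif kind == "gc":
            for obj_id in dead:
                actual -= sizes.pop(obj_id)   # the real heap shrinks only now
            dead.clear()
        actual_peak = max(actual_peak, actual)
        optimal_peak = max(optimal_peak, optimal)
    return actual_peak, optimal_peak
```

The gap between the two peaks is the memory attributable purely to collection timing, separate from what the page genuinely needs.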
Problem F: Hard to track down problems
Hard to navigate through a memory dump or the current state to track down a specific problem
#F1 Dump all roots of a compartment, and trace roots back to the XPCOM/DOM/whatever thing that is holding onto that root (when available)
#F2 Go from JS object to things keeping it alive (dump out GC edges) -- see jimb's findReferences (currently JS shell only)
#F3 Record addr,size,stack at every allocation (kgadd's heap visualizer)
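To make #F1 and #F2 concrete, here is a sketch of finding what keeps an object alive, given a dumped graph. It assumes the dump can be loaded as a dict mapping each node to the nodes it references, plus a list of roots; that representation and the function name are assumptions, and this is not how jimb's findReferences is implemented. A breadth-first search from the roots yields one shortest retaining path.

```python
from collections import deque


def retaining_path(edges, roots, target):
    """Return one shortest path root -> ... -> target, or None if unreachable."""
    parent = {root: None for root in roots}   # also serves as the visited set
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:           # walk parent links back to a root
                path.append(node)
                node = parent[node]
            return path[::-1]
        for child in edges.get(node, ()):
            if child not in parent:
                parent[child] = node
                queue.append(child)
    return None
```

Labeling each root with the XPCOM or DOM object that holds it, where that is known, would turn the first element of the path into the kind of answer #F1 asks for.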
Details:
atop's visual UI has a good set of heuristics for detecting "large" values and coloring the output accordingly. If your disk is busy for >90% of the sampling interval, it turns red. If your network traffic is a high percentage of the expected maximum bandwidth, it turns red, and so on.
It can run in a 'top-like' mode that displays the current state of things, or in a historical mode that reads from a log file. (Switching between the two is decidedly *not* seamless, but it should be.)
It also allows dumping historical data to text files. I've used that for generating graphs of various values.
For the browser, many of the same metrics are applicable, but I'd also like an equivalent of atop's per-process info. The idea is to know "what was going on at XXX?" So it should record user and browser actions, which tab was active, network requests, significant events firing, etc.
For #D2, not every memory allocation needs to be marked for this to work. You just need one object within the "leaked" memory to be marked.
It could also walk the graph "en masse" to ignore individual objects that are reachable longer than expected and focus on the clusters of objects that are kept alive by the same thing. (I'm thinking that the expected lifetime is a guess, and may be inaccurate.)
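A sketch of the lifetime-registration idea, under several assumptions: age is counted in collection cycles, objects are identified by an opaque id, and some other mechanism (not shown) can say what is keeping a given object alive. `LifetimeRegistry` and `cluster_by_retainer` are hypothetical names. Grouping overdue objects by retainer is one way to do the "en masse" walk, since a large cluster is more interesting than a single object whose expected lifetime was simply a bad guess.

```python
from collections import defaultdict


class LifetimeRegistry:
    """Track expected lifetimes, measured here in collection cycles (an assumption)."""

    def __init__(self):
        self._entries = {}   # obj_id -> (birth_cycle, expected_cycles)
        self.cycle = 0

    def register(self, obj_id, expected_cycles):
        """Record the expected lifetime at acquisition time."""
        self._entries[obj_id] = (self.cycle, expected_cycles)

    def release(self, obj_id):
        """Forget an object that was freed normally."""
        self._entries.pop(obj_id, None)

    def tick(self):
        """Call once per GC/CC run."""
        self.cycle += 1

    def overdue(self):
        """Return ids of objects that have outlived their expected lifetime."""
        return sorted(obj_id for obj_id, (birth, expected) in self._entries.items()
                      if self.cycle - birth > expected)


def cluster_by_retainer(overdue_ids, retainer_of):
    """Group overdue objects by whatever is keeping them alive."""
    clusters = defaultdict(list)
    for obj_id in overdue_ids:
        clusters[retainer_of(obj_id)].append(obj_id)
    return dict(clusters)
```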