Firefox/Projects/Startup Time Improvements/joelr notes
Contents
- 1 Intro
- 2 Current status
- 3 Previous statuses
- 3.1 September 22, 2009
- 3.2 September 21, 2009
- 3.3 September 16, 2009
- 3.4 September 14, 2009
- 3.5 September 11, 2009
- 3.6 September 9, 2009
- 3.7 September 3, 2009
- 3.8 September 2, 2009
- 3.9 August 30, 2009
- 3.10 August 28, 2009
- 3.11 August 27, 2009
- 3.12 August 26, 2009
- 3.13 August 25, 2009
- 3.14 August 24, 2009
- 3.15 August 21, 2009
- 4 Shark tips and tricks
- 5 DTrace tips and tricks
Intro
I'm trying to figure out where Firefox startup time goes, up to the return from BrowserStartup (Javascript function). I'm also manipulating DTrace into telling me where time is going, without making any assumptions.
Current status
September 23, 2009
x86-64
I now have x86-64 Firefox up and running on Snow Leopard. Josh Aas did all the work, I just put on the final touches.
A shiny new world
Damon thinks I should help Josh with the 64-bit port. 64-bit Trace Monkey gives us a 20% improvement in speed which beats the 5% DWARF improvement from -fomit-frame-pointer.
I didn't know it was my birthday. I'm leaving the 32-bit world behind. 64 bits here I come!
Previous statuses
September 22, 2009
I have two bottlenecks on my plate, Bug 517045 and Bug 517576. Both of them may not be possible to speed up significantly, not without a large research effort. The former requires consultation with the owners and the later requires me to become a layout expert.
Benjamin Smedberg suggested taking care of DWARF symbols in breakpad as that will let us use the -fomit-frame-pointer gcc option. Omitting the frame pointer would result in at least a 5% overall speed up because a register will be freed on x86-32.
DWARF is a tree of dependent bugs, though, and Jim Blandy is working on them. Jim is a Linux guy, though, and also has ECMAScript 5 strict mode to take care of whereas I'm strictly focused on performance optimization.
I'll ask management to see if I can work on DWARF with Jim.
September 21, 2009
Wins
Kudos to Masayuki for fixing Bug 517549. It's hard to believe but it looks like our Mac Ts [{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22169%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22173%22},{%22test%22:%2216%22,%22branch%22:%221%22,%22machine%22:%22178%22}&sel=1252833380,1253542320 went down 30%] and it was only 12.3% of my startup. I wonder what happens when Bug 517576 gets fixed, that's 21% of my startup!
One other thing I'm wondering about... What if all the bottlenecks I find will get fixed by someone else? Damon wants me to write code!!!
September 16, 2009
That mysterious forking
Apparently, Firefox forks the first time after a build to rebuild caches, etc. Need to make sure to profile from the 2nd run on. This forking behavior baffled me when I started with DTrace.
I'm an idiot
I spent a month becoming an expert on DTrace, only to realize that it's not very useful to get a global picture of the bottlenecks because the overhead of tracing everything is so high!
Shark is a much better tool to get a global picture, specially using "Time profile all states" which allocates time spent blocking to the calling function. I overlooked this initially since I did not zero in on the right thread, the one that has start and XRE_main or exit. Without this it looks like Firefox is spending most of its time waiting on threads and Mach messages.
Timing startup
A clear picture courtesy of Shark
0.0% 92.8% firefox-bin start 0.0% 92.8% firefox-bin main 0.0% 91.6% XUL XRE_main 0.0% 40.8% XUL nsAppStartup::CreateHiddenWindow() 0.0% 34.7% XUL ScopedXPCOMStartup::Initialize() 0.0% 6.4% XUL ScopedXPCOMStartup::~ScopedXPCOMStartup() 0.0% 3.4% XUL ScopedXPCOMStartup::SetWindowCreator(nsINativeAppSupport*) 0.0% 2.4% XUL nsXPTCStubBase::Stub3() 0.0% 0.8% XUL nsAppStartupNotifier::Observe(nsISupports*, char const*, unsigned short const*) 0.0% 0.8% HIToolbox GetCurrentEventKeyModifiers 0.0% 0.6% XUL nsXREDirProvider::DoStartup() 0.0% 0.5% XUL RemoveComponentRegistries(nsIFile*, nsIFile*, int) 0.0% 0.4% XUL NS_NewToolkitProfileService(nsIToolkitProfileService**) 0.0% 0.3% XUL nsToolkitProfile::Lock(nsIProfileUnlocker**, nsIProfileLock**) 0.0% 0.1% XUL WriteVersion(nsIFile*, nsCString const&, nsCString const&, nsIFile*, nsIFile*) 0.0% 0.1% XUL nsINIParser_internal::Init(nsILocalFile*) 0.0% 0.1% XUL LaunchChildMac 0.0% 0.4% XUL XRE_GetBinaryPath 0.3% 0.4% XUL XRE_CreateAppData 0.4% 0.4% XUL NS_LogInit_P 0.0% 6.2% dyld _dyld_start 1.0% 1.0% libSystem.B.dylib _exit
September 14, 2009
Breaking startup into smaller chunks
bsmedberg:joelr: I really think that "Ts" is still much too large of a chunk to make meaningful improvements on bsmedberg:joelr: in order to make meaninful discoveries, I think you have to break startup down into pieces and then maybe improve each piece joelr:bsmedberg: i'm listening bsmedberg:well, you've got various stuff that happens up until you start the event loop bsmedberg:breaking XRE_main into phases up until nsAppStartup::Run would be interesting bsmedberg:I still haven't seen a graph or anything of how we break down bsmedberg:joelr: I think stepping through in a debugger might give you a better sense of what's going on (up until Run()) than profiles would bsmedberg:even if you broke things up into "from main() to the beginning of event loop" and "after the event loop starts" that would be more interesting bsmedberg:joelr: and then within the first chunk, timing NS_InitXPCOM, the extension manager bits, chromereg startup bsmedberg:joelr: I think this intersects with some of the startup-timeline stuff taras was doing joelr:bsmedberg: what are the functions for "extension manager bits" and "chromereg startup"? bsmedberg:joelr: basically from http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#2607 to #3453 joelr:bsmedberg: thanks for the suggestion, i'll do that bsmedberg:joelr: #3382 to #3408 or so
XUL reflow
bz:joelr: so I'm still wondering about some of the reflow stuff bz:joelr: I know it's hard, but that's why no one's really dived into it yet bz:joelr: and I'd love to know whether we do in fact end up looping significantly in sprocket layout during startup, for example joelr:bz: i'm still wondering about reflow myself but i'm not quite ready to tackle it yet joelr:bz: don't know enough
XPT_ArenaMalloc and zero-filling memory
Closed Bug 516241. There's no evidence that zero-filling memset is a bottleneck. DTrace cannot latch onto the return from memset on Snow Leopard, so I can't time it. All I know is that it's called 133,245 times during startup and zero-fills 14,851,995 bytes of memory.
bsmedberg:joelr: do you have evidence that the calloc is hurting? It really doesn't sound like something that would take a lot of time bsmedberg:there's no IO, and arena pages are roughly page-sized, right? joelr:bsmedberg: i don't have evidence that it's hurting, not at all. i just see it in the startup path a lot. bsmedberg:"see it" in what way? joelr:bsmedberg: there's no IO but arena pages are not page-sized. they are 1k blocks whereas pages are 4k joelr:bsmedberg: i'm dtracing startup, all functions, that's how i see it bsmedberg:it's less than a page, then... you really shouldn't be having memory-thrashing issues with the calloc bsmedberg:joelr: you mean it takes time, or its just a callcount? bsmedberg:I'm a little skeptical that the arena actually helps at all, but that's a different matter. joelr:bsmedberg: call count, i suppose. i'm also skeptical that arena helps at all since mac osx has it's own memory allocation optimizations bsmedberg:And the xpt code has no tests, so unles you have a measured performance problem I suggest not touching it. joelr:bsmedberg: i see (tests)... well, i'll look into it a little bit more since i want to measure a difference vs malloc (no zero-fill). it does crash with malloc at the moment so i won't be looking into it too much. bsmedberg:you'd have to basically zero-fill when you allocate items from the arena joelr:bsmedberg: because those items expect to be zero-filled? bsmedberg:seems so, if you're crashing with malloc... bsmedberg:joelr: ok, well, I'm saying "you have no evidence this bug is worth fixing, please move on"
September 11, 2009
Reflows
taras: joelr: btw, you gave up on reflows? taras: that makes me sad taras: i was hoping we'd end up with some sort of tool to measure cost of laying out various xul components joelr: taras: i did not give up, i just don't know how to go about it yet taras: so people could profile their guis taras: instead of going at it blindly joelr: taras: i didn't hear your prayers, my apologies joelr: i'll make a note on the idea of a tool that measures guis, though joelr: taras: how do i measure? from where to where? joelr: layout? reflow? something else? taras: joelr: it's simple, time how long they take taras: just see long functions take to run, ones that trigger all those fun reflow stacks
Restarting from the ground up
I'm very comfortable with DTrace but still not comfortable with the Firefox code base. It will take me a while to grok reflow, for example, much less optimize it. I feel that there are startup speedups that can be achieved quicker, with a better return on investment (aka low-hanging fruit).
Yes, reflow is in the hot path during startup but it's not clear if it's a major source of slowness. I would like to know what happens during Firefox startup, in as much detail as possible and DTrace has granted me my wish in the form of a 400Mb+ log file. The file lists all the function calls up to the return from the JS function BrowserStartup, e.g.
... 6200 . . -> ImageLoaderMachO::segIsReadOnlyImport(unsigned int) const(0x8FE46820, 0x1, 0xBFFFC1C8) 6202 . . -> ImageLoaderMachOCompressed::segmentCommandOffsets() const(0x8FE46820, 0x0, 0x0) 6203 81 1 <- ImageLoaderMachOCompressed::segmentCommandOffsets() const = 10 6205 270 4 <- ImageLoaderMachO::segIsReadOnlyImport(unsigned int) const = 116 6206 . . -> ImageLoaderMachO::segFileOffset(unsigned int) const(0x8FE46820, 0x1, 0xBFFFC1C8) 6208 . . -> ImageLoaderMachOCompressed::segmentCommandOffsets() const(0x8FE46820, 0x0, 0x0) ...
There are obviously interesting bits here. How can we piggyback on the ImageLoaderMachOCompressed for example?
Minimizing file system access provides a good return on investment since it apparently takes a few milliseconds on mobile devices. I'm focusing on this first and foremost since we do have file system access in the midst of dynamic linker static initialization, e.g.
20974362 861470 708476 <- xptiSortFileList(void const*, void const*, void*) = 622 20974385 . . -> xptiSortFileList(void const*, void const*, void*)(0x3050F9C, 0x3050F7C, 0xBFFFEBC0) 20974394 . . -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE978, 0xBFFFE98C, 0x0) 20974397 113 2 <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24 20974400 . . -> nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int)(0xBFFFE924, 0xBFFFE938, 0x0) 20974402 112 2 <- nsACString_internal::nsACString_internal(char*, unsigned int, unsigned int) = 24 20974424 . . -> nsLocalFile::GetNativeLeafName(nsACString_internal&)(0x3083600, 0xBFFFE978, 0x40) 20974427 . . -> objc_exception_try_enter(0xBFFFE868, 0xE000002, 0x181A4)
It may be interesting to study the startup path of Chromium and Safari.
September 9, 2009
Reflow optimization
I've been thinking about speeding up reflows (Bug 514275) but did not make progress until I got Boris Zbarsky involved. There are tons and tons of classes that reflow and I'm completely unfamiliar with that code.
Boris suggested restricting invocations of PresShell::ProcessReflowCommands to those that actually do work
0 != mDirtyRoots.Length()
on XUL documents
FrameManager()->GetRootFrame()->GetFirstChild(nsnull) && FrameManager()->GetRootFrame()->GetFirstChild(nsnull)->GetType() == nsGkAtoms::rootFrame
Looking at the 25 reflows and timings for the "Getting involved with Mozilla..." page, it's clear that some reflows are more expensive than others. It's also interesting that there are just 11 reflows for a blank page.
bz: 1) try to reduce number of reflows bz: (e.g. make text controls not reflow sync in this setup if we can; reduce the number of text value sets if we can't) bz: Or see whether the reflow events we get are really needed (e.g. if the load is not done yet, why are they happening?) bz: But separately, it looks like reflows seem to be generally expensive
Performance measurement
Using Vlad's benchmark, I have 5.3s cold start and 0.9s warm start. Timing using the return from BrowserStartup (JS) gives me almost exactly the same results as Vlad's.
javascript*:::function-return /copyinstr(arg2) == "BrowserStartup"/ { exit(1); }
The difference is about 50ms which copyinstr probably accounts for.
September 3, 2009
On reflow...
bz:so there are a few things we could talk about: bz:1) make any given reflow pass faster (this is hard) bz:2) reduce number of reflow passes bz:s/this is hard/this may be hard/ bz:So one place to start would be seeing why we're doing reflow at all bz:that is, what things cause posting of reflow events and what things trigger reflow in their post-reflow callbacks bz:and seeing whether we can reduce or eliminate them. joelr:post-reflow callbacks? bz:see comment 1 in the bug bz:firefox has about 4000 elements in its ui bz:which means about 4000 boxes to reflow bz:modulo optimizations, etc bz:BoxReflow is the reflow of a single box bz:those are the two things to look into bz:reducing number of reflow passes bz:and speeding up individual passes bz:if you have info on what parts actually take up time during reflow, we should look into that bz:I would also be interested in the number of nsIFrame::Layout calls you see during startup bz:that's a more interesting metric to me bz:esp. if we can chunk it up per reflow pass
A reflow pass is PresShell::ProcessReflowCommands which can re-enter itself via post-reflow callbacks.
September 2, 2009
My previous stack traces were all bogus since symbols are stripped during packaging. This is all sorted out now using this config (for Snow Leopard):
export CC="gcc-4.2 -arch i386" export CXX="g++-4.2 -arch i386" ac_add_options --target=i386-apple-darwin8.0.0 ac_add_options --with-macos-sdk=/Developer/SDKs/MacOSX10.5.sdk ac_add_options --enable-macos-target=10.5 export HOST_CC="gcc-4.2" export HOST_CXX="g++-4.2" export RANLIB=ranlib export AR=ar export AS=$CC export LD=ld export STRIP="strip -x -S" export CROSS_COMPILE=1 export MOZ_DEBUG_SYMBOLS=1 . $topsrcdir/browser/config/mozconfig mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/objdir-release-profiling mk_add_options MOZ_MAKE_FLAGS=-j2 ac_add_options --disable-tests ac_add_options --disable-crashreporter ac_add_options --enable-dtrace ac_add_options --enable-libxul ac_add_options --enable-optimize ac_add_options --enable-debugger-info-modules ac_add_options --disable-strip ac_add_options --disable-install-strip
I'm getting 3-5s from cold start to return from BrowserStartup (JavaScript), e.g.
./d firefox-bin startup.d stop-js.d dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 55752 -s startup.d -s stop-js.d -s sigcont.d Total: 5000.294085ms ./d firefox-bin startup.d stop-js.d dtrace -Zqw -x dynvarsize=64m -x evaltime=exec -p 55855 -s startup.d -s stop-js.d -s sigcont.d Total: 3387.623720ms
The hot path for those 5s looks like this:
./d firefox-bin hotpath.d stop-js.d dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 55765 -s hotpath.d -s stop-js.d -s sigcont.d kill -CONT 55765 XUL`nsFrame::BoxReflow(nsBoxLayoutState&, nsPresContext*, nsHTMLReflowMetrics&, nsIRenderingContext*, int, int, int, int, int) 1382 firefox-bin`0x0 623 XUL`nsHTMLInputElement::GetReadOnly(int*) 53 XUL`nsOverflowContinuationTracker::Insert(nsIFrame*, unsigned int&) 48 XUL`MapAttributesIntoRule(nsMappedAttributes const*, nsRuleData*) 24 XUL`nsBlockFrame::Reflow(nsPresContext*, nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&) 20 XUL`nsColumnSetFrame::ReflowChildren(nsHTMLReflowMetrics&, nsHTMLReflowState const&, unsigned int&, nsColumnSetFrame::ReflowConfig const&, int, nsCollapsingMargin*, nsColumnSetFrame::ColumnBalanceData&) 18 XUL`DocumentViewerImpl::PageHide(int) 18 XUL`nsBlockFrame::DoReflowInlineFrames(nsBlockReflowState&, nsLineLayout&, nsLineList_iterator, nsFlowAreaRect&, int&, nsFloatManager::SavedState*, int*, LineReflowStatus*, int) 14 XUL`non-virtual thunk to nsHTMLInputElement::Release() 11 ...
August 30, 2009
Using my "Lego blocks" DTrace toolchain I thought I would check the "hot paths" through Firefox code during the first 20 seconds of startup.
./d firefox-bin hotpath.d stop-20s.d dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1478 -s hotpath.d -s stop-20s.d -s sigcont.d kill -CONT 1478 XUL`gfxContextAutoSaveRestore::Reset(gfxContext*) 2766 firefox-bin`0x0 2234 XUL`operator new(unsigned long, void*) 348 XUL`gfxImageSurface::Height() const 335 XUL`gfxPattern::Release() 60 XUL`DumpJSValue 44 XUL`gfxContextAutoSaveRestore::gfxContextAutoSaveRestore(gfxContext*) 29 XUL`gfxUserFontSet::Release() 17 XUL`gfxImageSurface::Data() const 16 XUL`gfxContextPathAutoSaveRestore::~gfxContextPathAutoSaveRestore() 8
I'm sampling at 997Hz so that's roughly 2.7s for the top result. A small script is all it took to see where gfxContextAutoSaveRestore::Reset is being called from.
BEGIN { i = 0; } pid$target::gfxContextAutoSaveRestore??Reset(gfxContext*):return / i < 10 / { printf("arg0: 0x%x, arg1: 0x%x\n", arg0, arg1); ustack(); i++; } pid$target::gfxContextAutoSaveRestore??Reset(gfxContext*):return / i >= 10 / { exit(0); }
The stack traces (first one of ten below) include gfxImageSurface::Height and DumpJSValue.
./d firefox-bin autosave.d dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1527 -s autosave.d -s sigcont.d kill -CONT 1527 arg0: 0x90, arg1: 0x1a71da8 XUL`gfxContextAutoSaveRestore::Reset(gfxContext*)+0x90 XUL`gfxImageSurface::Height() const+0x2d976 XUL`gfxImageSurface::Height() const+0x135b6 XUL`gfxImageSurface::Height() const+0x107b7 XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x53d16 XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x53f1b XUL`NS_GetComponentRegistrar_P+0x21cd XUL`NS_GetComponentRegistrar_P+0x2324 XUL`NS_GetComponentRegistrar_P+0x4964 XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4b3fd XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4b850 XUL`non-virtual thunk to nsPrintSession::QueryInterface(nsID const&, void**)+0x4a9e1 XUL`DumpJSValue+0x1dfbab XUL`DumpJSValue+0x1e0103 XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xc9e8e XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xb64b4 XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0xbb918 XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x151bdf XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x14e234 XUL`gfxCornerSizes::gfxCornerSizes(gfxSize const&, gfxSize const&, gfxSize const&, gfxSize const&)+0x14ed99
gfxImageSurface::Height is low-hanging fruit, defined as
PRInt32 Height() const { return mSize.height; }
I would suggest making it inline and checking startup time again.
DumpJSValue is trickier because it appears to be a JavaScript API function. Still, it's called during the static initialization portion of dynamic linking and all that printing is not helping in a release version of Firefox.
./d firefox-bin dump-js-value.d dtrace -Zqw -x dynvarsize=64m -x evaltime=preinit -p 1579 -s dump-js-value.d -s sigcont.d kill -CONT 1579 XUL`DumpJSValue+0x3917 XUL`DumpJSValue+0x3b92 XUL`0x111c146 XUL`0x111c161 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&)+0xeb dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0xf6 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int)+0x9a dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&)+0x3d dyld`dyld::initializeMainExecutable()+0x86 dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**)+0x105d dyld`dyldbootstrap::start(macho_header const*, int, char const**, long)+0x30b dyld`_dyld_start+0x27 firefox-bin`0x1000
I suppose the invocation should only be compiled in in a debug version of Firefox but I'm still searching for that invocation!
August 28, 2009
Eric Gouriou on my prior use of dtrace -c...:
I am afraid that the data collected using this methodology is highly suspect. - firefox-bin gets run as root This is not a normal environment for an application. There are plenty of high-level frameworks which are not quite happy running in that environment, in good part because the root user will not have a typical home directory setup and due to differences in environment variables. This is a known limitation of dtrace -c . Feel free to add your voice via bugreporter.apple.com - A typical launch is not just fork/exec or posix_spawn Even if dtrace -c could magically launch using a regular user, it still isn't a typical launch. dtrace uses fork/exec or posix_spawn (I didn't check which one), while a typical launch (Dock, Finder, open) uses LaunchServices. It isn't all that different but you might as well address both issues together. I'd recommend reworking your script to watch the exec-success of firefox-bin and use a static probe (no need for $target then) to mark the end time. Then you can use a more normal launch method, for example open .../Minefiled.app .
I did rework my cold start script which can be used like this now:
./d firefox-bin startup.d stop-main.d dtrace -Zqw -x dynvarsize=64m -x evaltime=exec -p 35331 -s sigcont.d -s startup.d -s stop-main.d kill -CONT 35331 Total: 2000.275985ms
The above assumes that I launch Minefield.app after d, from the dock, Finder, command line, etc.
August 27, 2009
The dyld shared cache lives in /var/db/dyld/. The two files of interest are dyld_shared_cache_i386.map (for x86-32) and shared_region_roots/Applications.paths. Both are regular text files. The former shows the contents of the shared cache for the i386 architecture and the latter is what update_dyld_shared_cache inspects.
There's no prebinding on newer versions of Mac OSX and the dyld shared cache is automatically updated as needed. Tracing Safari disk activity during startup reveals that basically all its dynamic libraries are pulled from the dyld shared cache.
It's possible to add Firefox (.../Firefox.app/Contents/MacOS/firefox-bin) to Applications.paths and the change will persist across reboots. Unfortunately, only a handful of libraries that Firefox uses are pulled into the cache by update_dyld_shared_cache. I'm speculating that this may have something to do with @executable_path/XUL and friends (otool -L .../firefox-bin).
Safari uses absolute paths to frameworks in /System/Library/Frameworks so I speculate that relative paths are what is preventing XUL and others from going into the cache. We may want to investigate /System/Library/Frameworks/Firefox.framework or similar.
It's possible to fix relative library paths in a given library, e.g. fix.sh XUL where fix.sh looks like this
#!/bin/bash function dylibs () { otool -L $1 |grep executable_path|awk '{print $1}'|cut -d"/" -f2 } for i in `dylibs $1` do install_name_tool -change @executable_path/$i `pwd`/$i $1 done install_name_tool -id `pwd`/$1 $1
Firefox has to be recompiled with LDFLAGS=-header-pad_max_install_names in MOZCONFIG to make this happen since new library paths are greater than the space allocated in the Mach-O binary. See the man page for install_name_tool for details.
It's possible to force dynamic libraries into the cache by putting dynamic library paths into shared_region_roots/Applications.paths instead of executables. I wasn't successful in caching XUL, though, regardless of what I did.
In the end it doesn't seem to matter since there's a baffling lack of difference between Firefox and Safari cold stats, despite Safari pulling everything from the cache and Firefox using a large number of non-cached dylibs.
Here's the cold startup stats for Safari
sync && purge && DYLD_PRINT_STATISTICS=1 /Applications/Safari.app/Contents/MacOS/Safari total time: 696.9 milliseconds (100.0%) total images loaded: 116 (114 from dyld shared cache, 114 needed no fixups) total segments mapped: 5, into 30 pages with 10 pages pre-fetched total images loading time: 204.9 milliseconds (29.4%) total rebase fixups: 1,298 total rebase fixups time: 0.1 milliseconds (0.0%) total binding fixups: 2,476 total binding symbol lookups: 234, average images searched per symbol: 1.6 total binding fixups time: 80.5 milliseconds (11.5%) total bindings lazily fixed up: 3 of 901 total init time time: 411.3 milliseconds (59.0%) total images with weak exports: 1
and Firefox
total time: 731.2 milliseconds (100.0%) total images loaded: 106 (93 from dyld shared cache, 56 needed no fixups) total segments mapped: 49, into 5903 pages with 684 pages pre-fetched total images loading time: 235.3 milliseconds (32.1%) total rebase fixups: 149,011 total rebase fixups time: 3.7 milliseconds (0.5%) total binding fixups: 24,932 total binding symbol lookups: 797, average images searched per symbol: 2.3 total binding fixups time: 149.9 milliseconds (20.5%) total bindings lazily fixed up: 45 of 19,109 total init time time: 342.2 milliseconds (46.8%) total images with weak exports: 3
Notice a large and significant difference? Me neither.
The other thing that I cannot explain at the moment is where the rest of the startup time goes, e.g.
./cold.sh startup.d Total: 10001.723521ms
So it took less than 1 second to dynamically link Firefox but where did the other 9 seconds of startup go?
cold.sh is rather simple
#!/bin/bash cmd="./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2" sync && purge && dtrace -x dynvarsize=64m -x evaltime=exec -c "$cmd" -wZs $1
and the startup.d script doesn't do much either
#pragma D option quiet BEGIN { start = timestamp; } /* stop tracing here */ mozilla$target:::main-entry { exit(0); } END { this->total = timestamp - start; printf("Total: %u.%06ums\n", this->total / 1000000, this->total % 1000000); }
I don't have an explanation yet but 9 seconds is a very large difference.
August 26, 2009
Suppose we sample at about 1Khz on a dual-core CPU and run an app for 10 seconds. An app that hogs the CPU should give us 20k samples, or 10k samples if it's pegged to a single core.
I coded this up in a cpu.d DTrace script and sampled Firefox. Apparently, Firefox only gets ~1700 samples on the CPU or 1.7s so its first 10 seconds of life are spent doing something else, e.g. disk IO.
We already know that Firefox is slow to start up but this type of sampling neatly points the finger in the right direction. Props to Brendan Gregg for teaching me how to fish!
What kind of IO does Firefox do upon startup, though? I find the output from iosnoop a bit confusing so I whipped up a version of my own that prints just disk access times, disk operation deltas, block size and file name accessed.
The script samples Firefox during the first 20 seconds of its life and the output from ./cold.sh io.d looks like this:
DELTA DTIME D SIZE PATHNAME 1399 1407 R 4096 ??/dtrace/libdtrace_dyld.dylib 1141 1145 R 8192 ??/dtrace/libdtrace_dyld.dylib 1301 1305 R 4096 ??/MacOS/XUL 1701 1705 R 4096 ??/MacOS/XUL 11081 685 R 4096 ??/MacOS/libmozjs.dylib 1989 2253 R 36864 ??/MacOS/libmozjs.dylib 1768 1894 R 24576 ??/MacOS/libmozjs.dylib 27111 2135 R 200704 ??/MacOS/XUL 1411 1134 R 8192 ??/MacOS/libxpcom.dylib 1784 1968 R 4096 ??/MacOS/libplds4.dylib 2383 464 R 4096 ??/MacOS/libxpcom.dylib 2886 456 R 4096 ??/MacOS/libxpcom.dylib 1431 732 R 4096 ??/MacOS/libplds4.dylib 2640 706 R 4096 ??/MacOS/libnspr4.dylib 3092 484 R 4096 ??/MacOS/libnspr4.dylib ...
I also have a version that rolls disk operations by file name and read/write op. It also samples during 20 seconds and outputs something like this:
DELTA DTIME SIZE D PATHNAME 307 311 4096 R ??/en_US.ISO8859-1/LC_TIME 437 441 4096 R ??/pref/firefox-branding.js 439 443 4096 R ??/su3x40q6.2/extensions.cache 441 445 4096 R ??/en_US.ISO8859-1/LC_NUMERIC 640 644 4096 R ??/en_US.ISO8859-1/LC_MONETARY 643 648 4096 R ??/chrome/browser.manifest 644 650 4096 R ??/la_LN.US-ASCII/LC_COLLATE 645 650 4096 R ??/Firefox/profiles.ini 646 650 4096 R ??/chrome/en-US.manifest ...
All timings are in microseconds and specific to my Apple 256Gb SSD.
There's something else that I discovered during my snooping around with io.d. Compare the first 20 seconds of the life of Safari and the life of Firefox.
Basically, all the shared libraries used by Safari are pulled in from the dynamic library shared cache. Safari uses mostly system frameworks, e.g.
DYLD_PRINT_LIBRARIES_POST_LAUNCH=1 /Applications/Safari.app/Contents/MacOS/Safari dyld: loaded: /System/Library/CoreServices/RawCamera.bundle/Contents/MacOS/RawCamera dyld: loaded: /System/Library/PrivateFrameworks/Shortcut.framework/Versions/A/Shortcut dyld: loaded: /System/Library/PrivateFrameworks/HelpData.framework/Versions/A/HelpData dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCSync.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCGATS.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libRIP.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCGXCoreImage.A.dylib
whereas Firefox uses a whole lot more custom shared libraries, e.g.
DYLD_PRINT_LIBRARIES_POST_LAUNCH=1 ./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2 dyld: loaded: /System/Library/CoreServices/RawCamera.bundle/Contents/MacOS/RawCamera dyld: loaded: /System/Library/PrivateFrameworks/Shortcut.framework/Versions/A/Shortcut dyld: loaded: /System/Library/PrivateFrameworks/HelpData.framework/Versions/A/HelpData dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCSync.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCGATS.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libRIP.A.dylib dyld: loaded: /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/ATS.framework/Versions/A/Resources/ATSHI.dylib dyld: loaded: /Library/Internet Plug-Ins/QuickTime Plugin.plugin/Contents/MacOS/QuickTime Plugin dyld: loaded: /System/Library/QuickTime/QuickTimeComponents.component/Contents/MacOS/QuickTimeComponents dyld: loaded: /System/Library/Frameworks/AGL.framework/Versions/A/AGL dyld: loaded: /System/Library/PrivateFrameworks/CoreAUC.framework/Versions/A/CoreAUC dyld: loaded: /Users/joelr/Work/mozilla/startup/Minefield.app/Contents/MacOS/components/libbrowserdirprovider.dylib dyld: loaded: /Users/joelr/Work/mozilla/startup/Minefield.app/Contents/MacOS/components/libbrowsercomps.dylib dyld: loaded: /Users/joelr/Work/mozilla/startup/./Minefield.app/Contents/MacOS/libnssdbm3.dylib dyld: loaded: /Users/joelr/Work/mozilla/startup/./Minefield.app/Contents/MacOS/libfreebl3.dylib dyld: loaded: /Users/joelr/Work/mozilla/startup/Minefield.app/Contents/MacOS/libnssckbi.dylib dyld: unloaded: /Users/joelr/Work/mozilla/startup/Minefield.app/Contents/MacOS/libnssckbi.dylib
I separated the list of libraries loaded by Firefox into three groups: same libraries as Safari, other system frameworks, custom libraries.
What needs to be done to make Firefox use the dynamic library shared cache as aggressively as Safari?
Another puzzling bit involves the difference between Safari libraries loaded by dyld (above) and the list of libraries that Safari is linked against, e.g.
otool -L /Applications/Safari.app/Contents/MacOS/Safari /System/Library/PrivateFrameworks/CrashReporterSupport.framework/Versions/A/CrashReporterSupport /System/Library/Frameworks/Cocoa.framework/Versions/A/Cocoa /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon /System/Library/Frameworks/WebKit.framework/Versions/A/WebKit /System/Library/Frameworks/AddressBook.framework/Versions/A/AddressBook /System/Library/Frameworks/Security.framework/Versions/A/Security /System/Library/Frameworks/SystemConfiguration.framework/Versions/A/SystemConfiguration /System/Library/Frameworks/SecurityFoundation.framework/Versions/A/SecurityFoundation /System/Library/Frameworks/SecurityInterface.framework/Versions/A/SecurityInterface /System/Library/PrivateFrameworks/SyndicationUI.framework/Versions/A/SyndicationUI /System/Library/Frameworks/JavaScriptCore.framework/Versions/A/JavaScriptCore /System/Library/Frameworks/IOKit.framework/Versions/A/IOKit /System/Library/Frameworks/Quartz.framework/Versions/A/Quartz /System/Library/Frameworks/OpenGL.framework/Versions/A/OpenGL /usr/lib/libicucore.A.dylib /usr/lib/libstdc++.6.dylib /usr/lib/libgcc_s.1.dylib /usr/lib/libSystem.B.dylib /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices /usr/lib/libobjc.A.dylib /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation /System/Library/Frameworks/AppKit.framework/Versions/C/AppKit /System/Library/Frameworks/ApplicationServices.framework/Versions/A/ApplicationServices /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation /System/Library/Frameworks/QuartzCore.framework/Versions/A/QuartzCore
Does Safari lazily loading some frameworks? Why do I see OpenGL initializers run when I set DYLD_PRINT_INITIALIZERS then? The OpenGL framework does not appear to be loaded when Safari starts (dyld output above).
For the sake of comparing, here's the list of libraries that Firefox is linked against
otool -L ./Minefield.app/Contents/MacOS/firefox-bin./Minefield.app/Contents/MacOS/firefox-bin: /System/Library/Frameworks/Cocoa.framework/Versions/A/Cocoa @executable_path/XUL @executable_path/libmozjs.dylib @executable_path/libxpcom.dylib @executable_path/libplds4.dylib @executable_path/libplc4.dylib @executable_path/libnspr4.dylib /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon /System/Library/Frameworks/CoreAudio.framework/Versions/A/CoreAudio /System/Library/Frameworks/AudioToolbox.framework/Versions/A/AudioToolbox /System/Library/Frameworks/AudioUnit.framework/Versions/A/AudioUnit /System/Library/Frameworks/AddressBook.framework/Versions/A/AddressBook /usr/lib/libstdc++.6.dylib /usr/lib/libgcc_s.1.dylib /usr/lib/libSystem.B.dylib
as well as the list of libraries for XUL
otool -L ./Minefield.app/Contents/MacOS/XUL @executable_path/XUL /System/Library/Frameworks/Cocoa.framework/Versions/A/Cocoa @executable_path/libsqlite3.dylib @executable_path/libmozjs.dylib @executable_path/libsmime3.dylib @executable_path/libssl3.dylib @executable_path/libnss3.dylib @executable_path/libnssutil3.dylib @executable_path/libsoftokn3.dylib /usr/lib/libcups.2.dylib @executable_path/libplds4.dylib @executable_path/libplc4.dylib @executable_path/libnspr4.dylib /System/Library/Frameworks/SystemConfiguration.framework/Versions/A/SystemConfiguration /System/Library/Frameworks/QuickTime.framework/Versions/A/QuickTime /System/Library/Frameworks/IOKit.framework/Versions/A/IOKit /usr/lib/libcrypto.0.9.7.dylib /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon /System/Library/Frameworks/CoreAudio.framework/Versions/A/CoreAudio /System/Library/Frameworks/AudioToolbox.framework/Versions/A/AudioToolbox /System/Library/Frameworks/AudioUnit.framework/Versions/A/AudioUnit /System/Library/Frameworks/AddressBook.framework/Versions/A/AddressBook /usr/lib/libstdc++.6.dylib /usr/lib/libgcc_s.1.dylib /usr/lib/libSystem.B.dylib /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices /usr/lib/libobjc.A.dylib /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation /System/Library/Frameworks/AppKit.framework/Versions/C/AppKit /System/Library/Frameworks/ApplicationServices.framework/Versions/A/ApplicationServices /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation
It turns out that DYLD_PRINT_LIBRARIES=1 is the way to show libraries loaded by dyld before a program's entry point (main) whereas DYLD_PRINT_LIBRARIES_POST_LAUNCH=1 will show libraries loaded afterwards, e.g. using dlopen or NSBundle.
August 25, 2009
Created a static probe that fires first thing in XRE_main and updated my DTrace scripts to use it.
pid$target::function:entry probes are very slow since DTrace may have to search thousands of functions. All that search time skews elapsed time reported by timestamp. USDT (static) probes are just a few NOP instructions in the code that get fixed up by DTrace as needed so they work much faster.
August 24, 2009
Startup time is measured from the entry to XRE_main to the return of the BrowserStartup JS function. It takes a good bit of time but nothing compared to the time elapsed from the start of Firefox to the call to XRE_main.
According to my static-init.d script, the static initialization time can be ignored. I'm recording the library name and then timing the following call to ImageLoader::runInitializers in dyld. The cumulative time is too small to be of essence, though.
0.000053548s for /System/Library/Frameworks/Carbon.framework/Frameworks/HIToolbox.framework/Versions/A/HIToolbox 0.000053732s for /Users/joelr/Work/mozilla/startup/./Minefield.app/Contents/MacOS/libsoftokn3.dylib 0.000069234s for /System/Library/PrivateFrameworks/Shortcut.framework/Versions/A/Shortcut 0.000070455s for /Users/joelr/Work/mozilla/startup/MinefieldRelease.app/Contents/MacOS/libnssckbi.dylib 0.000072754s for /Users/joelr/Work/mozilla/startup/MinefieldRelease.app/Contents/MacOS/components/libbrowsercomps.dylib 0.000073443s for /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/ATS.framework/Versions/A/Resources/ATSHI.dylib 0.000074363s for /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCSync.A.dylib 0.000075845s for /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libCGATS.A.dylib 0.000076892s for /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Frameworks/CoreGraphics.framework/Versions/A/Resources/libRIP.A.dylib 0.000089767s for /Users/joelr/Work/mozilla/startup/./Minefield.app/Contents/MacOS/libnssdbm3.dylib 0.000094390s for /Users/joelr/Work/mozilla/startup/./Minefield.app/Contents/MacOS/libfreebl3.dylib 0.000100462s for /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/CarbonCore.framework/Versions/A/CarbonCore 0.000115375s for /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox 0.001161267s for /Users/joelr/Work/mozilla/startup/MinefieldRelease.app/Contents/MacOS/components/libbrowserdirprovider.dylib --------------- = 0.002181527s Initialization: 42.986413480s Startup : 7.015292701s --------------- = 50.001706181s
Digging deeper...
August 21, 2009
My DTrace scripts live here. Use like this
sudo ./cold.sh static-init.d
Shark tips and tricks
Expanding the call tree
Alt-Click does it.
CHUD signposts on Snow Leopard
There are no CHUD header files on Snow Leopard. If you're only interest is in emitting signposts, the easiest thing for you to do is switch to the syscall method instead. That'll also mean you won't have to link against the CHUD framework. The full details are in the Shark User Guide, but in brief:
int err = syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD, <your code number>) | <type>, arg1, arg2, arg3, arg4);
Types are DBG_FUNC_START, DBG_FUNC_END and DBG_FUNC_NONE (for point signposts).
Don't forget to include <sys/kdebug.h> and <sys/syscall.h> for the necessary macros and function.
DTrace tips and tricks
DTrace overhead and proper measuring technique
Function entry/return vs static probes
Q: I added a couple of static probes to Firefox to measure actual work done. I could have used a pid$target probe with a function name but work is done within an if statement, which is where I placed the static probes. I'm wondering about my use, though. Is the following significantly more efficient than pid$target::FunName:entry and return?
A: No, it's not significantly more efficient. Both function entry/return probes and USDT probes involve replacing an instruction in the stream with a trap instruction. The expense of the trap into the kernel is going to overwhelm any subtleties in handling between the two different types of probes. --Chad Mynhier
Combining scripts
I find it convenient to have separate DTrace scripts for various exit probes. This way I can avoid editing the main script whenever I want to stop sampling once 10 seconds go by or XRE_main is entered or the BrowserReturn JavaScript function returns.
It can be done like this
#!/bin/bash cmd="./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2" scripts="" for i in $*; do scripts="$scripts -s $i"; done sync && purge && dtrace -x dynvarsize=64m -x evaltime=exec -c "$cmd" -wZ $scripts
and then
./cold.sh startup.d stop-main.d Total: 11001.481323ms
or
./cold.sh startup.d stop-js.d Total: 7004.799264ms
My scripts can be found on GitHub.
No sudo
sudo chmod u+s `which dtrace`
You can now run dtrace as yourself. Enjoy!
timestamp vs vtimestamp
vtimestamp measures CPU time of the current thread, excluding IO and DTrace overhead. timestamp can still be used for deltas but the goal is to use as few pid$target probes as possible as they affect timestamp when dtrace has to switch between kernel and userland. io and syscall providers are fast and run in the kernel.
Invalid address
Have you seen this kind of error before?
dtrace: error on enabled probe ID 27 (ID 22130: pid34547:libSystem.B.dylib:dlopen:entry): invalid address (0x2ac204) in action #1 at DIF offset 28 dtrace: error on enabled probe ID 2 (ID 22782: pid34547:dyld:dlopen:entry): invalid address (0x2ac204) in action #1 at DIF offset 28
More likely than not, you are using copyinstr on memory that hasn't been paged in yet. Try saving the pointer on entry and doing the copying on return or later. Italic text