Sfink/Performance Thoughts
I'm going to go a little crazy with taxonomies.
Facets of Performance
- Latency - how long do you have to wait between the time you initiate an action and the time some detectable response appears?
- This is all the user really cares about, but it's not always the right thing to look at as a developer, since it is the end result of lots of other things that may be affected by multiple variables.
- Latency of the complete response is one thing, but in reality some things are going to take some time, and so the latency to a visible progress indicator may be more important. Depends on the situation.
- Variability of latency can be important too. It's critically important if the output depends on it, eg watching an animation of some sort. But it also interferes with learning: "I click on this button to make it do this... oh wait, it didn't work, did I forget something? Let me see if I need to -- oh, there it is. Odd, it normally doesn't take that long."
- Memory usage - why is using memory bad?
- latency is going to get very bad once you hit a certain "problem size"
- you can't have as many other things active at the same time
- the rest of the system gets sluggish as a side effect
- you'll eventually crash the browser or some other application
- Storage space - normally a far lesser concern with browsers.
- But especially with local storage, it could become more relevant.
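To make the latency points above a bit more concrete, here is a minimal sketch of measuring both typical latency and its variability. The names `measure_latencies`, `summarize`, and `example_action` are made up for illustration, and `example_action` is just a stand-in for whatever user-visible action you actually care about.

```python
import statistics
import time


def measure_latencies(action, runs=30):
    """Return wall-clock latencies (in seconds) for repeated calls to action()."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report typical latency alongside measures of how much it varies."""
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
    }


def example_action():
    # Hypothetical stand-in for "initiate an action and wait for a response".
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(summarize(measure_latencies(example_action)))
```

A large gap between the median and the p95 or max is the "odd, it normally doesn't take that long" case: the typical number looks fine, but the user still gets surprised now and then.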
Where Does Time Go?
I/O
I'm using I/O very generally, inclusive of disk, network, and memory.
You can kind of walk up the cache hierarchy, though it isn't really strictly a hierarchy.
At each level, you'll usually have asynchronous and synchronous behavior.
- synchronous: stuff you have to wait for. Reads are usually synchronous. (Exceptions: readahead, or when you have another thread you can switch to. If you get close to actual devices, DMA can be a form of asynchronous read.)
- asynchronous: stuff you don't have to wait for. Writes are often asynchronous. (Many more exceptions than with read=synchronous.)
- Asynchronous requests don't matter, until they do: when too much asynchronous data is outstanding, it starts blocking and becoming synchronous. Even before that point, asynchronous requests can slow down or block synchronous ones. Often this is because of dependencies between the requests. Those dependencies may or may not be fundamental -- they might just be a driver limitation or a simplification in the logic of whatever is handling the resource. (A memory read might unnecessarily block on an earlier write because it's hard to be certain they don't alias the same address space. Or a write may trigger a read to fill in the rest of a cache line.)
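The "asynchronous until it isn't" behavior can be imitated with a bounded queue. This is only a toy model, not how any real driver or OS works: `run_writer` and `slow_device` are hypothetical names, the queue stands in for a buffer of outstanding writes, and the sleep stands in for a slow device.

```python
import queue
import threading
import time


def run_writer(num_writes, max_outstanding, drain_delay):
    """Submit writes to a slow consumer; return per-submit latencies in seconds."""
    pending = queue.Queue(maxsize=max_outstanding)

    def slow_device():
        # Hypothetical device: takes drain_delay seconds to retire each write.
        while True:
            item = pending.get()
            if item is None:
                return
            time.sleep(drain_delay)

    worker = threading.Thread(target=slow_device)
    worker.start()

    latencies = []
    for i in range(num_writes):
        start = time.perf_counter()
        pending.put(i)  # Returns at once while there is room, blocks when full.
        latencies.append(time.perf_counter() - start)

    pending.put(None)  # Sentinel: tell the device thread to stop.
    worker.join()
    return latencies


if __name__ == "__main__":
    for index, seconds in enumerate(run_writer(10, 3, 0.03)):
        print(index, round(seconds * 1000, 2), "ms")
```

The first few submissions return almost immediately. Once the buffer fills, each submission has to wait for the device to retire an earlier write, so the "asynchronous" write costs about as much as a synchronous one.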
Most of these levels also have two types, slow and fast. (Often mapping to seeks vs sequential reads.)
All of them have weird exceptions to the simple taxonomy. (Networks may be faster or slower than disk. Disks have memory caches. Cache line aliasing has weird effects. Etc.)
Cache hierarchy, roughly ordered from most expensive to least:
- Network I/O
- Latency cost varies widely with the server and phase of moon.
- Disk I/O
- Writes can be asynchronous unless you need to explicitly synchronize for durability (in the ACID sense). But the OS will tend to flush them every so often even if it doesn't strictly need to, and those flushes can block reads.
- Sequential I/O is fast; random I/O is slow. Somewhat less true with an SSD, but I don't know much about those. The difference is significant enough that it's probably worthwhile to track this with two different metrics: initial block reads and total disk bandwidth.
- Main RAM. In my myopic view, RAM is RAM. Mobile devices may break this with different types of memory (eg volatile vs nonvolatile memory can be different speeds), and so do NUMA (Non-Uniform Memory Access) machines. But if you stick your thumbs in your ears and waggle your fingers, you can ignore that.
- TLB. Waggle faster. (Not often worth worrying about for non-specialized workloads.)
- L3 then L2 then L1 caches. Highest (aka fastest aka closest to the CPU) levels are usually split between separate data and instruction stores, not that you generally need to care. Writeback vs write-through is another critically important difference that you can usually ignore.
This is a cache hierarchy, so for the most part the faster layers (later in the list) will fall back to the slower ones (earlier in the list) when their capacity is exceeded. And writes often go straight through to a slower layer.
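As a rough sketch of the sequential vs random distinction at the disk level, the following reads the same blocks of a temporary file in order and then in shuffled order. The names (`make_test_file`, `timed_read`, `compare`) and the 4096-byte block size are my own choices for illustration. Big caveat: a freshly written file will almost certainly be sitting in the OS page cache, so on most machines the two timings will come out close. Seeing the real effect needs a file larger than RAM or some way of dropping caches, which this sketch does not attempt.

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 4096  # Assumed block size, chosen only for illustration.


def make_test_file(num_blocks):
    """Create a temporary file of num_blocks blocks and return its path."""
    handle, path = tempfile.mkstemp()
    with os.fdopen(handle, "wb") as f:
        for _ in range(num_blocks):
            f.write(os.urandom(BLOCK_SIZE))
    return path


def timed_read(path, block_order):
    """Read blocks in the given order; return (seconds, bytes_read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        for block in block_order:
            f.seek(block * BLOCK_SIZE)
            total += len(f.read(BLOCK_SIZE))
    return time.perf_counter() - start, total


def compare(num_blocks=2048, seed=0):
    """Time a sequential pass and a shuffled pass over the same file."""
    path = make_test_file(num_blocks)
    try:
        sequential = list(range(num_blocks))
        shuffled = sequential[:]
        random.Random(seed).shuffle(shuffled)
        return {
            "sequential": timed_read(path, sequential),
            "random": timed_read(path, shuffled),
        }
    finally:
        os.remove(path)


if __name__ == "__main__":
    for name, (seconds, nbytes) in compare().items():
        print(name, round(seconds * 1000, 2), "ms for", nbytes, "bytes")
```

Both passes move the same number of bytes, which is the point of tracking two metrics: total bandwidth is identical, and only the access pattern differs.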
CPU
CPUs are fast. Even on mobile devices, they're pretty fast. CPUs rarely consume large chunks of time just crunching through basic math operations. The time normally disappears into loading and saving data to and from memory, which I'm describing here as I/O.
Except that many measurement tools describe I/O in terms of CPU clock ticks. An L1 cache miss, for example, is normally measured in clock ticks. So you can think of it as CPU time if you like. (Clock ticks map fairly well to actual time, although you may need to adjust for occasional frequency scaling or whatever.)
More importantly, tools have a distinction of "I/O wait time" vs "CPU time". This distinction is mostly real: with I/O wait, your process is scheduled out and not running. CPU time includes time when the CPU is twiddling its thumbs waiting for a cache miss to be resolved, but the CPU isn't going anywhere; it'll keep running your process immediately after the needed data gets loaded in. (This holds even with SMT, eg hyperthreading, where the core is pulling instructions from multiple hardware threads all the time. When your thread stalls on a cache miss, the core will just pull instructions from the other hardware thread for a while until you can start feeding it again. That is the hardware's doing, not the OS scheduler's.)
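The wait-time vs CPU-time distinction is easy to see from inside a process. Here is a small sketch using Python's standard clocks: `time.perf_counter()` for wall-clock time and `time.process_time()` for CPU time charged to this process. The helper names are hypothetical, and the sleep is only a stand-in for blocking I/O.

```python
import time


def measure(task):
    """Return (wall_seconds, cpu_seconds) spent running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)


def waiting_task():
    # Stand-in for blocking I/O: the process is scheduled out while it sleeps.
    time.sleep(0.2)


def computing_task():
    # Stand-in for CPU-bound work; any cache-miss stalls are counted as CPU time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    print("waiting:   wall=%.3fs cpu=%.3fs" % measure(waiting_task))
    print("computing: wall=%.3fs cpu=%.3fs" % measure(computing_task))
```

For the waiting task, wall-clock time is large and CPU time is close to zero. For the computing task the two are roughly equal, and nothing in either number tells you how much of that CPU time was spent stalled on memory.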
Locks
A process can also wait on locks.
Tasks
- App Startup - see Firefox/Projects/Startup Time Improvements
- Page Load - see Performance/Pageloader
- Applications
Mozilla Platform "Cost Centers"
- JavaScript
- DOM
- CSS
- Layout
- Graphics
- Distinct from layout. Layout is logic, graphics is rendering.
- Garbage Collector
- XPCOM
- XPConnect
- Quickstubs are used for most of what matters these days, which removes the overhead
- Security
- NSPR (eg locking)
- Caching
- Extensions (cross-cutting)
- Electrolysis (cross-cutting)
Applications
- gmail
- Google Docs
- Zimbra
- Yahoo mail
- Flickr
- facebook (this is really more of a site than an application...)
- Web Sites
- wikipedia
- msn.com
- baidu.com
- mozilla.com
Big question of what to do with these. They are complete applications, so "performance" of one is very dependent on what you're actually doing.
Perhaps a first pass would be to come up with one or two tasks for each one. Start out by manually timing against the real server, then try to improve on both counts: time automatically, and use stub servers.
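As a very rough sketch of what "automatically timing using stub servers" could look like, the following starts a local HTTP server that returns one canned page and times repeated fetches of it. Everything here is hypothetical (`StubHandler`, `time_task_against_stub`, the canned body), and it only times a bare fetch. Timing a real task would mean driving an actual browser against the stub, which is well beyond this example.

```python
import http.client
import http.server
import threading
import time


class StubHandler(http.server.BaseHTTPRequestHandler):
    """Serve one canned page so timings do not depend on a real server."""

    BODY = b"<html><body>stub inbox</body></html>"  # Hypothetical content.

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
        self.wfile.write(self.BODY)

    def log_message(self, format, *args):
        pass  # Keep timing output free of per-request log lines.


def time_task_against_stub(runs=5):
    """Return wall-clock seconds for each fetch of the stub page."""
    server = http.server.HTTPServer(("127.0.0.1", 0), StubHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    timings = []
    try:
        for _ in range(runs):
            start = time.perf_counter()
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()
            conn.close()
            timings.append(time.perf_counter() - start)
    finally:
        server.shutdown()
        server.server_close()
    return timings


if __name__ == "__main__":
    print([round(t * 1000, 2) for t in time_task_against_stub()])
```

The stub takes the server and the phase of the moon out of the measurement, so whatever variation remains is down to the client.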
Metrics
Basic metrics for now. Should really be broken down further. Example: look at various metrics within the scope of a call or page or process or...
- Latency
- Wallclock time
- CPU clock cycles for this process only
- Cache misses
- Initial disk block reads
- Disk bandwidth
- Memory
- VM Size
- RSS
- Private memory
- Garbage collector-specific
- Number, duration of pauses
- Garbage generated
- Meat (non-garbage) generated
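For the "number, duration of pauses" metric, here is a sketch of the general idea using Python's own cyclic garbage collector as a stand-in. It says nothing about how the Mozilla garbage collector is instrumented. `gc.callbacks` is a real hook in the Python standard library; `GcPauseRecorder` and `make_garbage` are names invented for this example.

```python
import gc
import time


class GcPauseRecorder:
    """Record the number and duration of Python cyclic-GC collections."""

    def __init__(self):
        self.pauses = []
        self._start = None

    def _callback(self, phase, info):
        # The collector calls this with phase "start" before a collection
        # and phase "stop" after it.
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.pauses.append(time.perf_counter() - self._start)
            self._start = None

    def __enter__(self):
        gc.callbacks.append(self._callback)
        return self

    def __exit__(self, *exc_info):
        gc.callbacks.remove(self._callback)
        return False


def make_garbage(count):
    """Allocate reference cycles that only the cyclic collector can free."""
    for _ in range(count):
        a, b = [], []
        a.append(b)
        b.append(a)


if __name__ == "__main__":
    with GcPauseRecorder() as recorder:
        make_garbage(100_000)
        gc.collect()
    print("pauses:", len(recorder.pauses))
    print("longest pause: %.3f ms" % (max(recorder.pauses) * 1000))
```

Reporting both the count and the longest pause matters for the latency-variability concern: many short pauses and one long pause can add up to the same total while feeling very different.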
Tools
Mozilla has a ton of great tools already available for analyzing performance, including some I haven't uncovered yet. I'm updating Performance:Tools with everything I encounter.
Audiences
The audience determines whether a tool can be enabled via a conditional compilation directive, or whether it needs to be available in a release build. It also gives guidance as to whether a platform-specific tool is acceptable (eg dtrace).
- Mozilla platform developers
- Can be platform-dependent and require different build options.
- Nice if they can be fully automated and added to buildbot
- Mozilla application developers (Firefox, Thunderbird, etc.)
- Similar constraints as for "Mozilla platform developers"
- Add-on authors
- Platform dependence dramatically limits scope
- Different build options are problematic, but in some cases will be ok.
- Web application authors
- Pretty much needs to be platform independent
- Different build options are problematic, but in some cases will be ok.
- General web developers
- Pretty much needs to be platform independent
- Must be available in release build, though enabling via config options or as an add-on is ok
Tool Ideas
First, some random tool ideas, so they don't get lost. Some of these were proposed by other people. Some may already exist, and will be removed as I discover them.
- Mozilla-independent external tool that breaks down the cost of performing some task into the levels of the caching hierarchy (with CPU time included.)
- This would allow comparing a standard operation across different browsers and seeing where Firefox uses more of some resource, as a guide to optimization. Lots more cache misses => look at cache-oblivious data structures. Lots more seeks => figure out a way to reorder accesses.
- Mark up a user's JavaScript to show what percentage of each line (statement? minification is a pain) in a particular run was "on trace" with TraceMonkey.
- This would allow the user to figure out what code works well with the trace engine and what doesn't.
Current Tools at Mozilla
Latency
Moved to Performance:Tools
I'm breaking these down by platform, in hopes that someone looking for a tool will find this more useful and less overwhelming.
Memory Usage
See Performance:Leak Tools, which is already broken down better than this page is.