Performance/Status Meetings/2007-June-13
Participants
robcee, martain, alice, justin, bobclary, joduinn, damon, vlad
Action Item Update
- AI:joduinn Choosing time/frequency of meeting: 10am-11am? Every week?
- AI:Alice/Robcee: File bugs/bullet list of areas others could help for perf infra
- hardware problem with pxp02 bug 383264
- AI:Justin see if the perf machines are swapping or if they need more memory.
- install more memory on some machines bug 384314
- AI:robcee bug 383167 tracking problem getting buildID-in-a-file from Tinderbox.
- run performance tests with profiling on (rhelmer set up a machine for this, but jprof has problems on trunk)
- Justin & rhelmer: hardware for JProf died. Now fixed.
- AI:rhelmer reopened bug 366615. Hardware replaced; rhelmer installed software. AI:rhelmer needs to change the tests to start/stop JProf as part of the test. The first implementation was lost with the hard disk failure; the new implementation will be better anyway, and will be checked in!
- vlad filed bug 364779. No longer Linux-specific, now platform independent.
- Getting higher resolution timers for tests
- AI:Damon will meet with Boris about this. Different issues on different platforms.
- Graph server status
- Graph server for easy build-to-build comparisons
- Alice's latest changes are now checked into graphs.mozilla.org.
- AI:alice, justin Discussions with IT about having them maintain the machine, not just Alice. Justin & Alice to meet, setup staging & production machines. Justin to support production machine, but not 24x7. Alice to work on stage machines, push to production, like we do for a.m.o and other sites.
- AI:rhelmer/robcee XP machines frequently hang, freeze, run out of memory, etc. Change XP machines to clean-boot-and-auto-start-everything: have them auto-login, start VNCserver, etc. rhelmer would like to do this for both build and perf machines. Run one perf machine rebooting every 24 hours, and compare its results to perf machines that are not rebooted frequently.
Agenda
- Generate reliable, relevant performance data (already underway as talos). Talos status update?
http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaTest
- Areas where help is needed
- Reducing test variance
- AI:schrep will try playing with existing TP2 logs from rhelmer, see if schrep can do math magic.
- expand the scope of performance testing beyond Ts/Tp/TXUL/TDHTML
- reduce noise in tests to ~1% (suggested by bz, not started)
- move perf tests to chrome, so we get more reliable results, and can test more than just content
- improve performance reporting and analyses:
- Better reports for sheriffs to easily spot perf regressions
- Tracking down specific performance issues
- stats change to track AUS usage by osversion.
- Priorities for infra:
- Generate historical baselines
- Generate profile data regularly on builds
- Getting the perf numbers more stable
- Developing the graph server to display time spent in each module
- New ideas
- Question: How are we tracking perf bugs, specifically, and are we doing this the same way we are triaging security bugs? Can we do it the same way if not? (damon)
AI:Damon: Timer Resolution
Information from Boris:
There are several timers that are involved in the performance tests. First of all, there is the JS Date.now() function. This is a priori accurate to no better than 1ms. We also use JS timeouts (same accuracy) and perl's timing stuff (worse, unless Time::HiRes is installed, which it should be on all the test boxen; with HiRes we should be getting microsecond precision).
In practice, the actual accuracy is sometimes worse than the 1ms accuracy listed above.
On Windows some of the commonly-used timer APIs (e.g. timeGetTime) only give 15ms accuracy; I'm not sure which, if any, of the above are affected by that. I seem to recall issues with JS timeouts due to that. Certainly anything on Windows that uses PR_IntervalNow() will be affected by the timeGetTime behavior.
Most of the other things above seem to use gettimeofday on Linux and GetSystemTimeAsFileTime on Windows; both seem to be accurate enough for our purposes. I think. The msdn docs on GetSystemTimeAsFileTime are pretty slim.
On Mac I'm really not sure what the situation is.
In general, I _think_ that anything that's using JS Date.now() directly is good for 1ms precision (which means that tasks of under 100ms are hard to time to 1% accuracy). Anything using timeouts (I seem to recall Tp does this?) will get noise in it due to PR_IntervalNow; on Windows this might be a lot of noise. Ts uses the Perl timing, which should be ok.
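The 1%-accuracy arithmetic above can be sketched as follows. This is only an illustration, assuming the error of a single measurement is about one timer tick; the function names are hypothetical and not part of any test harness.

```python
import math


def relative_error(resolution_ms: float, duration_ms: float) -> float:
    """Worst-case relative error if the reading is off by one timer tick."""
    return resolution_ms / duration_ms


def min_duration_ms(resolution_ms: float, target_rel_error: float) -> float:
    """Shortest task that can be timed within the target relative error."""
    return resolution_ms / target_rel_error


if __name__ == "__main__":
    # Resolutions taken from the notes above: 1 ms for Date.now(),
    # 15 ms for timeGetTime-style timers on Windows.
    for label, res_ms in [("1 ms timer", 1.0), ("15 ms timer", 15.0)]:
        needed = min_duration_ms(res_ms, 0.01)
        print(f"{label}: task must run >= {needed:.0f} ms for 1% accuracy")
```

Under the same assumption, a 15 ms timer would need tasks of roughly 1.5 seconds to reach 1%.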
- JS timing is actually worse than 1ms on Mac - granularity is in 16 ms ticks, easily visible if you graph timings of short loops. gettimeofday() on Mac has a 1-microsecond granularity though. -Stan
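One way to check a timer's granularity, in the spirit of the short-loop observation above, is to poll the clock in a tight loop and record the smallest step between distinct readings. The sketch below uses Python's standard clocks purely as stand-ins; it does not measure JS Date.now() or PR_IntervalNow(), and the helper name is hypothetical.

```python
import time


def observed_granularity(clock, samples: int = 200) -> float:
    """Return the smallest positive step seen between consecutive readings.

    `clock` is any zero-argument callable returning a number (seconds here).
    """
    steps = []
    last = clock()
    while len(steps) < samples:
        now = clock()
        if now > last:
            # The clock ticked forward: record the size of the tick.
            steps.append(now - last)
        if now != last:
            # Also resynchronise if the clock stepped backwards.
            last = now
    return min(steps)


if __name__ == "__main__":
    # Compare the wall clock with the high-resolution performance counter.
    for name, clock in [("time.time", time.time),
                        ("time.perf_counter", time.perf_counter)]:
        step = observed_granularity(clock)
        print(f"{name}: smallest observed step ~ {step * 1e6:.3f} us")
```

A coarse timer shows up as a large minimum step: a clock that advances in 16 ms ticks will never report a step smaller than about 0.016 s.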