Mobile/Testing/04 24 13
Previous Action Items
- Gbrown to follow up with the Tree Sheriffs to get robocop tests unhidden once more now that the strategic test disabling seems to have been done
- Follow-on - once the pandas are re-wired, we'll send a try job re-enabling those tests so that we can see whether those particular tests were causing reboots, i.e. whether their increased CPU activity was causing a power spike.
- -> Panda rc is no longer hidden
- -> Disabled rc tests still cause failures if enabled
- Jake and Kim will have all the pandas upgraded with new power infrastructure by Monday
- Dan will let us know at the next meeting where we stand w.r.t. the amount of work estimated to replace tegras with pandas running 2.3.x.
- Follow-on once we know that, Joduinn and I (ctalbert) will need to talk with Karen and Blassey about their projected timelines for EOL'ing 2.2 support.
Status reports
Dev team
- Found a cause of "2400 seconds without output" failures bug 663657
Rel Eng
- (kmoir) brought down masters to facilitate chassis maintenance. Mozpool/mozharness work for android pandas.
- Still working on a higher density chassis. Just waiting for the prototype chassis to be fabricated.
- bug 860028 Replacing 5v supply wire and adj power supply output in panda chassis in scl1 - COMPLETED
A Team
- I am seeing very little change in the frequency for bugs:
- bug 822321 - Intermittent Panda "Could not connect; sleeping for 5 seconds. reconnecting socket"...
- tegra M1, panda rc1, rc2 <- top failure listed above
- bug 663657 - Intermittent Android "command timed out: 2400 seconds without output, attempting to kill"
- panda m2, rc2 <- top failure listed above
- bug 807230 - Intermittent DMError: Automation Error: Timeout in command {ls,ps,isdir,mkdr}, ...
- doesn't happen in talos! but evenly distributed across reftest/mochitest/robocop
- the above bugs should have been reduced with the wiring change.
- investigating "rogue" pandas
- during the smoketests to validate the wiring change, we saw about 10% of the pandas being problematic. Average panda failure rates were 1-5%, but these "rogue" pandas were 7-15%.
- running just those pandas standalone yielded the same results as running with all the other pandas
- total smoketest failure rate 4.5%, without 10% of pandas 3.1%.
- How can we detect these?
- proposal:
- detected 20 jobs in the last 48 hours for a given panda
- detected >=2 failures for that given panda in the last 48 hours
- safeguard: if we detect >15% of the pool, just flag somebody in case there is an infra outage or a few bad builds
- remediate: pull panda, reflash panda, reseat SD card
- correction: if panda is "remediated" 3 times in 30 days, change SD card
- dead: if we have hit the correction stage 3 times for a given panda, throw away the board
- collecting network traffic using wireshark might help us to distinguish between connectivity issues due to reboots and other possible connectivity problems
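The rogue panda proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions, not an existing tool: the Job record, the function names and the panda names are hypothetical, "20 jobs" is read as "at least 20 jobs", and the correction/dead stages are one possible reading of the rules.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds taken from the proposal; the names are illustrative assumptions.
WINDOW = timedelta(hours=48)
MIN_JOBS = 20            # assumed reading: at least 20 jobs in the window
MIN_FAILURES = 2
POOL_SAFEGUARD = 0.15    # more than 15% of the pool flagged: ask a human
REMEDIATION_WINDOW = timedelta(days=30)
REMEDIATIONS_BEFORE_SD_SWAP = 3
SD_SWAPS_BEFORE_RETIRE = 3


@dataclass
class Job:
    """Hypothetical record of one finished test job."""
    panda: str
    finished: datetime
    failed: bool


def flag_rogue_pandas(jobs, pool_size, now):
    """Return (flagged pandas, whether the safeguard tripped)."""
    totals, failures = Counter(), Counter()
    for job in jobs:
        if now - job.finished <= WINDOW:
            totals[job.panda] += 1
            if job.failed:
                failures[job.panda] += 1
    flagged = sorted(
        panda for panda in totals
        if totals[panda] >= MIN_JOBS and failures[panda] >= MIN_FAILURES
    )
    # Too many flagged boards suggests an infra outage or bad builds.
    needs_human = len(flagged) > POOL_SAFEGUARD * pool_size
    return flagged, needs_human


def next_action(remediation_dates, sd_card_swaps, now):
    """Pick the step for a flagged panda from its repair history."""
    recent = [d for d in remediation_dates if now - d <= REMEDIATION_WINDOW]
    if sd_card_swaps >= SD_SWAPS_BEFORE_RETIRE:
        return "retire board"
    if len(recent) >= REMEDIATIONS_BEFORE_SD_SWAP:
        return "replace SD card"
    return "reflash and reseat SD card"
```

A caller would run flag_rogue_pandas over recent job history, skip automatic remediation whenever the safeguard trips, and otherwise call next_action for each flagged panda.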
Android 2.3.5
- Current status is at: bug 859766.
- Largest issue seems to be timeouts, possibly due to losing focus
- a patch for this aimed at b2g landed recently, I will retest and see if things have improved
- Need to discuss prioritization / timelines with respect to other tasks.
- estimate 3 months of work to stand up 2.3.5 on pandas, with another quarter or so of bug fixes / maintenance
x86 automation
- I am running through the mochitests to get a rough idea of how stable the emulator is
- I do see some timeouts and occasional process crashes. I'm planning to rerun some of this on the actual phone to hopefully determine if this is an emulator issue or a product stability issue
- [bc] Adding 1 additional Samsung GS II and 2 GS III phones.
- [bc] bug 862456 Security Review for Phonedash
- [bc] Testing throbber start performance with original fennec launch code vs. mozbase's launchFennec with and without -W parameter to am.
- [bc] Planning to investigate using standard deviation to gate retests in an attempt to reduce jitter.
- Added a test for loading webpage post-startup (bug 860790)
- Got some b2g eideticker stuff working which may also be interesting for android:
Round Table
- should we disable tests that are hard to fix and known to cause a lot of failures?
- specifically webgl!
- tbpl starring sometimes posts process crash and timeout/connectivity bug, even though all the tests have completed
- should we fix this?
- should we detect if a harness has completed and then only report shutdown failures?
- other ideas?
Action Items
- (jmaher) explain your round table items
- (ctalbert) get kim a known good build
- (kim) run tests over the weekend
- (ctalbert) to email bad news
- (ateam) split out webgl from mochitest-1
- (wlach) to add stock and Chrome to new test
- (blassey) follow up with karen to get 2.2 end of life plan