CIDuty/SVMeetings/Aug31-Sept4
Upcoming vacation/PTO:
- alin - Aug 31 - Sep 11
- vlad - Oct 16 - Oct 20
- Monday Sep 7 - Holiday in Canada
Meetings every Tuesday and Thursday
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/July27-July31
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug3-Aug7
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug10-Aug14
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug17-Aug21
- https://wiki.mozilla.org/ReleaseEngineering/Buildduty/SVMeetings/Aug24-Aug28
2015-08-31
Additional bugs to work on:
1. manage_masters.py retry_dead_queue should run periodically - https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 (a way to stop all these alerts :-))
2. Add a runner task to check resolution on Windows testers before starting buildbot - https://bugzilla.mozilla.org/show_bug.cgi?id=1190868, https://github.com/mozilla/build-runner (see the sketch after this list)
3. Add T testing to the trychooser UI - https://bugzilla.mozilla.org/show_bug.cgi?id=1141280. Code is in hg.mozilla.org/buildtools, trychooser dir.
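For item 2, a minimal sketch of what such a runner task could look like, assuming a small Python check that exits nonzero when the resolution is wrong. The expected 1600x1200 value and the exit-code handling are assumptions, not taken from bug 1190868 or from build-runner:

 # Hypothetical runner-style task: verify the screen resolution on a Windows
 # tester before buildbot starts. The expected resolution and the exit-code
 # convention below are assumptions, not taken from bug 1190868.
 import ctypes
 import sys
 
 EXPECTED = (1600, 1200)  # assumed value; adjust to whatever the testers require
 
 def current_resolution():
     user32 = ctypes.windll.user32
     # SM_CXSCREEN = 0, SM_CYSCREEN = 1: width/height of the primary display
     return user32.GetSystemMetrics(0), user32.GetSystemMetrics(1)
 
 def main():
     width, height = current_resolution()
     if (width, height) != EXPECTED:
         print("resolution check failed: got %dx%d, expected %dx%d"
               % (width, height, EXPECTED[0], EXPECTED[1]))
         return 1  # nonzero exit so runner does not go on to start buildbot
     print("resolution check passed: %dx%d" % (width, height))
     return 0
 
 if __name__ == "__main__":
     sys.exit(main())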
FYI: on Friday afternoon we hit this problem: https://bugzilla.mozilla.org/show_bug.cgi?id=1199524
[vlad]
1. Loaned b-2008-ix-0005 (an idle slave) to kmoir (details sent via email) - https://bugzilla.mozilla.org/show_bug.cgi?id=1199663
2. Loaned bld-lion-r5-002 to ulfr - https://bugzilla.mozilla.org/show_bug.cgi?id=1199874
3. Started re-imaging the following Windows slaves:
t-xp32-ix-030 - failed
t-w864-ix-020 - failed
Update: the Microsoft Deployment Toolkit reports the error "An error has occurred in the script on this page; ERROR: Name redefined". This seems to be a problem with MDT; a bug has been opened: https://bugzilla.mozilla.org/show_bug.cgi?id=1200180
4. This bug can be closed, right? https://bugzilla.mozilla.org/show_bug.cgi?id=1199586
5. Started looking over https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 (configured the script locally and tested that it works)
6. https://bugzilla.mozilla.org/show_bug.cgi?id=1199586 - closed
7. https://bugzilla.mozilla.org/show_bug.cgi?id=1199871 - re-imaged t-yosemite-r5-0056 and enabled it in slavealloc
8. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729 - attached the patch
2015-09-01
Discussion item: what to do with high pending counts
Because of timezones (more Mozillians online), we usually see high pending counts during ET/PT afternoons. You can see the pending counts here: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/
What is a high pending count? Nagios alerts on the aggregate pending count in #buildduty, for example:
nagios-relengMon 17:00:34 PDT [4860] cruncher.srv.releng.scl3.mozilla.com:Pending builds is CRITICAL: CRITICAL Pending Builds: 7370 (http://m.mozilla.org/Pending+builds)
Possible causes (see also https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Infrastructure_Performance):
1. Infrastructure problems are causing retries. For example, builds cannot fetch a package and retry; the retries spike load and pending counts rise.
2. Builds cannot upload their resulting binaries. This happened last week when IT made DNS redirect to a server where we didn't have ssh keys to upload the resulting binaries (https://bugzilla.mozilla.org/show_bug.cgi?id=1198296). In this case builds fail.
3. Coalescing fails. We have SETA configured to coalesce (run certain test jobs less often). If this breaks, we will see a spike in load and high pending counts. I have to fix some issues with SETA to address this (https://bugzilla.mozilla.org/show_bug.cgi?id=1199347).
4. We aren't getting new AWS instances. How can you tell? There will be an alert in #buildduty about aws_watch_pending.log not being updated. A common cause is a typo in configs/watch_pending.cfg. Look at the logs on the AWS manager instance (/var/log/messages); there should be an error message about the typo in the JSON file. We shouldn't really get to that point because there are tests to verify this, but sometimes it happens. I did it here: https://bugzilla.mozilla.org/show_bug.cgi?id=1195893#c8 (a quick local check is sketched after this list).
5. We are underbidding for spot instances. We use AWS spot instances for a large proportion of our continuous integration farm's capacity, and we have an algorithm that bids for the different instance types within a range of prices. Prices are here: https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L50 and the algorithm is here: https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/aws/spot.py. If we are underbidding relative to the current spot prices, we won't get any new AWS instances and pending counts will go up.
6. Buildbot ssh keys have a problem: https://bugzilla.mozilla.org/show_bug.cgi?id=1198332
7. Buildbot db problems - we cannot connect to the database (network issues, etc.). Pending counts will probably not increase; they will just stay the same, because jobs aren't deleted from the db as they complete.
8. We cannot connect to AWS due to network problems.
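For cause 4, a quick local syntax check of configs/watch_pending.cfg can catch this class of typo before the config reaches the AWS manager. A minimal sketch, assuming a local build-cloud-tools checkout; it only verifies that the file parses as JSON, not that its contents are sensible:

 # Minimal JSON syntax check for configs/watch_pending.cfg (build-cloud-tools).
 # Only verifies that the file parses as JSON; it does not validate the schema.
 import json
 import sys
 
 path = sys.argv[1] if len(sys.argv) > 1 else "configs/watch_pending.cfg"
 try:
     with open(path) as f:
         json.load(f)
 except ValueError as e:
     print("JSON error in %s: %s" % (path, e))
     sys.exit(1)
 print("%s parses as valid JSON" % path)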
[vlad]
1. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729
Received a lot of "Command Queue is CRITICAL" on #buildduty channel
Run the script to clean them
2. https://bugzilla.mozilla.org/show_bug.cgi?id=959635
Reopened https://bugzilla.mozilla.org/show_bug.cgi?id=1023065 so that diagnostics can be run again. :philor found the following problem on the slave: "because of an I/O device error"
3. The re-image problem for W7/W8 has been resolved; MDT is working correctly.
More details: https://bugzilla.mozilla.org/show_bug.cgi?id=1200180
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1200273
Loaned the t-xp32-ix-031 slave to jmaher
5. https://bugzilla.mozilla.org/show_bug.cgi?id=933901
Tried to connect to the slave, but was not able to log in
Started a re-image; after an hour the slave was unreachable. Logged on to the mgmt console to check the status, but the console was blank.
Opened a DCOps bug for this problem: https://bugzilla.mozilla.org/show_bug.cgi?id=1200531
6. https://bugzilla.mozilla.org/show_bug.cgi?id=1075693
Re-imaged the slave and enabled it in slavealloc
2015-09-02
[vlad]
1. https://bugzilla.mozilla.org/show_bug.cgi?id=1200824
Loaned the slave to eihrul: tst-linux64-spot-eihrul.build.mozilla.org, instance id i-9572b135
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729
Created a first draft of the bash script and attached it to the bug (a rough illustration of the idea appears after this day's list)
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1103082
Started a re-image of talos-linux32-ix-022
The re-image was still in progress
The re-image completed; enabled the slave in slavealloc and updated the bug
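As a rough illustration of the idea behind the script on bug 1158729 (the actual attachment is on the bug), a cron-driven wrapper could look roughly like the Python sketch below. The tools path, masters file, and command-line flags are assumptions, not taken from the attached script:

 # Rough illustration only: periodically run manage_masters.py retry_dead_queue
 # so dead command queues are retried without manual intervention (bug 1158729).
 # The tools path, masters file, and flags below are assumptions; the real
 # invocation is whatever the script attached to the bug uses.
 import logging
 import subprocess
 import sys
 
 TOOLS = "/path/to/build-tools/buildfarm/maintenance"   # assumed checkout location
 MASTERS_JSON = "production-masters.json"               # assumed masters list
 
 def retry_dead_queue():
     cmd = [
         sys.executable,
         "%s/manage_masters.py" % TOOLS,
         "-f", MASTERS_JSON,        # assumed flag for the masters list
         "retry_dead_queue",
     ]
     logging.info("running: %s", " ".join(cmd))
     return subprocess.call(cmd)
 
 if __name__ == "__main__":
     logging.basicConfig(level=logging.INFO)
     # intended to be invoked from cron, e.g. every 30 minutes
     sys.exit(retry_dead_queue())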
2015-09-03
[vlad]
1. https://bugzilla.mozilla.org/show_bug.cgi?id=1103082
The slave talos-linux32-ix-022 has been re-imaged several times and is still burning jobs. In this case, can the slave be decommissioned?
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1201396
Loaned to glandium
Revoked VPN access for glandium
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1201210
Uploaded the packages to the internal PyPI on the relengwebadm machine
Resolved the bug
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1158729
Updated the bash script and attached it to the bug
5. https://bugzilla.mozilla.org/show_bug.cgi?id=1184571
6. https://bugzilla.mozilla.org/show_bug.cgi?id=1201506
Created a loan request bug for myself