CIDuty/SVMeetings/Aug10-Aug14
2015-08-10
- [alin]
1. at the moment, we still do not have permissions to grant VPN access
https://bugzilla.mozilla.org/show_bug.cgi?id=1192253
Kim asked jabba for help with this bug as arr recommended. Jabba says that this access is now in place - can you verify?
2. https://bugzilla.mozilla.org/show_bug.cgi?id=989237
https://bugzilla.mozilla.org/show_bug.cgi?id=1188409 states that the machine was decommissioned
Q: can we mark the bug as RESOLVED?
A: We should remove the machine from slavealloc etc. if this hasn't already been done (see the sketch below). It still exists in slavealloc; will find you a doc on how to decom in our configs: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Slave_Management#How_to_decommission_a_slave
https://bugzilla.mozilla.org/show_bug.cgi?id=1193304
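A rough sketch of the config-cleanup check, assuming local clones of build/buildbot-configs and build/tools; the slave name below is a placeholder for the machine in the bug:

  # placeholder slave name - substitute the machine from the bug
  SLAVE=hostname-of-decommissioned-slave
  # look for any remaining references in our config repos before removing them
  for repo in buildbot-configs tools; do
    (cd "$repo" && hg pull -u && grep -rn "$SLAVE" .)
  done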
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1189003
we finally managed to re-image the machine, disabled the runner and then connected to it via SSH and VNC.
in the past, this machine had problems updating talos (https://bugzil.la/1141416).
Q: should we enable it in slavealloc? Did you disable runner after you reimaged it?
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1192525
slave loan request for a t-w732-ix machine
Armen states in his comment that he needs to find the path to git.exe; however, git is only installed on Windows build machines.
Q: should we ask him if he wants to loan a build machine?
5. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967
issue related to "panda-0345.p3.releng.scl3.mozilla.com" machine
this machine is disabled
Phil stated that it is "Close enough to nothing-but-RETRY"
Q: next steps in this case?
- open DCops bug to decomm
6. https://bugzilla.mozilla.org/show_bug.cgi?id=947202 (info)
re-imaged and enabled this machine in slavealloc: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-lion-r5-086
2015-08-11
[kmoir] introduction to diff and patch
Here is an introduction: https://en.wikipedia.org/wiki/Patch_%28Unix%29
diff - tool to compare files line by line
you can redirect the output of a diff command to a text file
this text file can be applied to another person's copy of the file using patch
I think you run Windows on your machines, so I'm not sure what command-line tools you have available.
I recall from before that you cloned buildbot-configs. To update your local copy, change directories to where you cloned hg.mozilla.org/build/buildbot-configs and run: hg pull -u
As an aside, here is an hg tutorial: http://swcarpentry.github.io/hg-novice/
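A minimal example of the diff/patch workflow described above; the file names are placeholders:

  # compare the original file against an edited copy and save the differences
  diff -u config.py config.py.new > my-change.diff
  # someone else can apply that diff to their own copy of config.py
  patch config.py < my-change.diff

  # keeping a local clone of buildbot-configs up to date
  cd buildbot-configs
  hg pull -u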
[alin] 1. Granting VPN access now works fine. Thanks for looking into this!
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1193193
as discussed yesterday, we submitted a bug to DCOps for decommissioning the machine (panda-0345)
also marked it as "decomm" in slavealloc
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1193188
loan request for a t-snow-r4 machine
checked the Slave Health dashboard and noticed that all of them are working and taking jobs, except one which is disabled as it needs further diagnostics (t-snow-r4-0094)
Q: what is the approach in these cases?
- Pick a slave in slavealloc and mark it disabled. Wait for the jobs on it to finish and then start the slave loan process.
batch downtime
[vlad] 1. we received a lot of nagios alerts on the #buildduty channel with the following message:
"[4839] buildbot-master111.bb.releng.scl3.mozilla.com:buildbot masters age is WARNING: ELAPSED WARNING: 1 warn out of 1 process with command name buildbot (http://m.mozilla.org/buildbot+masters+age)"
Q: Can you explain to us what the alerts mean and what the next steps are? https://bugzilla.mozilla.org/show_bug.cgi?id=1056348
A: This is just an alert that the buildbot process has been up for over a month. Probably should just downtime it, although all of the masters will soon have this alert. Amy changed the alert to every 720 minutes so it should be less noisy in #buildduty now.
[vlad] I thought there was a problem with buildbot and that somebody needed to look over it
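For reference, one way to eyeball the buildbot process age by hand, assuming shell access to the master (the nagios check may measure it differently):

  # etime shows how long each process with command name "buildbot" has been running
  ps -C buildbot -o pid,etime,args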
2. The re-image problem for bld-lion-r5 has been fixed and the re-imaging process has completed. Updated the bug and resolved it: https://bugzilla.mozilla.org/show_bug.cgi?id=1189005
2015-08-12
[vlad] 1. Loaned the win7 slave to bhearsum: https://bugzilla.mozilla.org/show_bug.cgi?id=1193310
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1026516
RyanVM re-opened the bug, specifying that a diagnostic needs to be run
Q: Do we need to open a ticket with DCOps to run the diagnostic again on the slave? Can we clone this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1162121 ?
Update: Filed bug 1193734 with DCOps to run the diagnostic again
[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1193412
slave has not taken any jobs since August 10
looked over the logs from the buildbot master and noticed that the slave was detached and the connection was never re-established (see the log-check sketch after this item)
disabled the slave in slavealloc, restarted it and re-enabled it
after that we checked the logs from both master and slave and noticed that the connection was up again (more info on the bug)
waiting to see if it starts taking jobs again...
UPDATE: it started taking jobs --> marked the bug as RESOLVED
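A sketch of the log check mentioned above; the master basedir below is an assumption and the slave name is a placeholder:

  # on the buildbot master: look for the slave attaching/detaching in twistd.log
  grep -i "<slave-name>" /builds/buildbot/*/master/twistd.log | tail -n 20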
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1193413
pretty much the same issue as above, so we followed the same steps
after we re-enabled the slave in slavealloc, we noticed that it connected to another master
at the moment we are waiting to see if it takes any jobs.
UPDATE: t-w732-ix-055 appears as "broken" at the moment, we will need to see why
UPDATE2: t-w732-ix-055 is taking jobs at the moment :)
Q: could it be that this "disable-reboot-enable" process does the trick? :) The machines had been rebooted several times before, but still didn't take any jobs
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967
the panda machine was decommissioned by DCOps
noticed that we don't have any entry for these types of machines in the "production_config" script
also looked through many examples, but didn't find any bug for decommissioning such a machine on the releng side
Q: are there any additional steps that must be done here? Kim will update doc to decomm pandas
mysql access - kim to sign slavealloc with vlad and alin's keys - still having problems with my signing access. Will have to get someone else to do this.
2015-08-13
[coop]
- reimage script in braindump repo will make your lives easier for Mac and Windows reimaging:
- https://hg.mozilla.org/build/braindump/file/62b3cf6b6727/buildduty/reimage
- both alin and vlad added to authorized_keys for buildduty user on cruncher, which will make that script work
- need to set up a config file ~/.reconfig/config, basic contents look like:
# replace placeholders with creds from oob-password.txt.gpg
export IPMI_USERNAME=XXXXXXXX
export IPMI_PASSWORD=XXXXXXXX
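A sketch of that one-time setup on cruncher; the decrypt step and file locations are assumptions based on the notes above:

  # decrypt the OOB credentials (your gpg key must be among the file's recipients)
  gpg --decrypt oob-password.txt.gpg

  # then put those values into ~/.reconfig/config
  mkdir -p ~/.reconfig
  printf 'export IPMI_USERNAME=XXXXXXXX\nexport IPMI_PASSWORD=XXXXXXXX\n' > ~/.reconfig/config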
[vlad] I tried to decrypt the oob-password but failed
[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=890317
used the script provided by Chris and re-imaged the slave
re-enabled in slavealloc and checked the logs - the machine is now connected to a master
waiting to see if it starts taking jobs
UPDATE: at the moment, it has already completed 4 jobs -> marking the bug as RESOLVED.
2. https://bugzilla.mozilla.org/show_bug.cgi?id=1104571
re-imaged the slave, enabled it in slavealloc and then restarted it
noticed that it connected to a buildbot master machine
waiting to see if it takes jobs
UPDATE: at the moment, it has already completed 4 jobs -> marking the bug as RESOLVED
3. we are not able to decrypt "slavealloc.txt.gpg", so we do not have mysql access. I guess the file has not been signed with our keys yet.
A: I had problems with my gpg setup, am trying to fix it and if I can't, will ask someone else to sign them
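For reference, a hedged sketch of re-encrypting such a file so additional keys can decrypt it; the key IDs are placeholders, and whoever holds the current plaintext would run this:

  # decrypt with an existing key, then re-encrypt to everyone who needs access
  gpg --decrypt slavealloc.txt.gpg | gpg --encrypt -r KEYID_KIM -r KEYID_ALIN -r KEYID_VLAD -o slavealloc.txt.gpg.new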
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1191967 - decommission panda-0345
cloned the /tools repo
modified the json file, ran hg diff and obtained the patch (see the sketch after this item)
logged in as root on foopy59 and removed /builds/panda-0345 folder
as stated above, we don't have DB access
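Roughly the shell steps for item 4 above; the JSON file path is an assumption, and the patch still needs review/landing plus the DB update once we have mysql access:

  # clone the tools repo and prepare a patch removing the panda entry
  hg clone https://hg.mozilla.org/build/tools
  cd tools
  vi buildfarm/mobile/devices.json   # file path assumed; drop the panda-0345 entry
  hg diff > remove-panda-0345.diff

  # clean up the device directory on its foopy
  ssh root@foopy59 'rm -rf /builds/panda-0345'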
5. Should we start appending 'buildduty' to our IRC names? :)
[vlad] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=1191901
removed the fqdn from inventory
terminated the aws instance
revoked the VPN access
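For the loan-return steps above, a hedged sketch of terminating the EC2 instance from the command line, assuming the AWS CLI is configured; the instance ID is a placeholder, and in practice this may instead be done through the AWS console or releng tooling:

  # terminate the loaner instance (instance ID is a placeholder)
  aws ec2 terminate-instances --instance-ids i-0123456789abcdef0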
2. Started the re-image process for t-w732-ix-001 from my computer. Ran the script successfully
Update: Re-image completed, the slave is taking jobs
2015-08-14
[alin] 1. https://bugzilla.mozilla.org/show_bug.cgi?id=795795 - bld-lion-r5-052
tried to re-image the machine
waited over one hour but ping did not work
attempted to reboot the machine from the console -->failed
Attempting SSH reboot...Failed. Attempting PDU reboot...Failed. Filed IT bug for reboot (bug 1194615)
https://bugzilla.mozilla.org/show_bug.cgi?id=867136 - talos-linux64-ix-017
re-imaged, could not connect to it after that
attempted to reboot the machine from the console -> failed
Attempting SSH reboot...Failed. Attempting IPMI reboot...Failed. Filed IT bug for reboot (bug 1194673)
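For reference, a sketch of what an out-of-band IPMI power cycle looks like; the management interface hostname is a placeholder, and the creds come from the ~/.reconfig/config mentioned on 2015-08-13:

  # power-cycle the slave via its IPMI management interface
  . ~/.reconfig/config
  ipmitool -H <mgmt-interface-fqdn> -U "$IPMI_USERNAME" -P "$IPMI_PASSWORD" chassis power cycle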
2. loaned machines:
https://bugzilla.mozilla.org/show_bug.cgi?id=1194498 - EC2 machine to kfung
https://bugzilla.mozilla.org/show_bug.cgi?id=1194623 - t-snow-r4 machine to cmanchester
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1141416 (failed to update talos) - re-imaged the following slaves + restarted httpd:
Talos-linux32-ix:
talos-linux32-ix-008
talos-linux32-ix-001 <- coop to investigate
talos-linux32-ix-026
talos-linux32-ix-022
Yesterday before leaving, I enabled the four slaves above. Ryan VanderMeulen noticed that they failed each job, so he disabled them. I also re-imaged talos-linux32-ix-001 once again, this time with an extra reboot step after the re-imaging process finished, but with no luck - it still fails every single job. Disabled.
From the logs: "command timed out: 3600 seconds without output running..."
Talos-linux64-ix:
talos-linux64-ix-001 - enabled,
talos-linux64-ix-002 - enabled, OK
talos-linux64-ix-008 - enabled,
talos-linux64-ix-027 - enabled, OK
talos-linux64-ix-004 - enabled,
talos-linux64-ix-099 - enabled, OK
talos-linux64-ix-055 - enabled, OK
talos-linux64-ix-092 - enabled,
talos-linux64-ix-017 - non-responsive
Not re-imaged yet:
talos-linux32-ix-003 -
talos-linux64-ix-027 - re-imaged, started taking jobs
4. https://bugzilla.mozilla.org/show_bug.cgi?id=1194211 - panda-0345 slave
will wait until a reconfig occurs
thanks Kim for the info :)
5. https://bugzilla.mozilla.org/show_bug.cgi?id=1098452 - bld-lion-r5-079
noticed that it no longer takes jobs
disabled it in slavealloc, rebooted and then re-enabled it --> no effect
checked the logs and noticed that it doesn't connect to a master
in slavealloc it still appears to be connected to buildbot-master86.bb.releng.scl3.mozilla.com, even though the logs show that it lost connection two days ago (2015-08-12 11:49:13-0700)
Q: should I try a re-image?
filed bug to DCOps to re-image and run diagnostics on this slave