CIDuty/Meetings:2013-10-01
Release Engineering Buildduty Meeting
- Date: 2013-10-01
- Time: 1:30pm EDT
- Room: ReleaseEngineering Vidyo room
- Meeting notes: https://wiki.mozilla.org/ReleaseEngineering/Buildduty/Meetings:2013-10-01
Status of buildduty period
https://releng.etherpad.mozilla.org/buildduty
Bugs filed
Previous action items
- DONE - armenzg to put queries on buildduty page
- coop:
- update nagios buildduty docs carryover
- follow-up with IT re: JSON formatting of nagios output
- add existing buildduty queries to wiki
- automate buildduty_report generation
- review buildduty report generation logic
- add in-page links
- bhearsum to add documentation at https://wiki.mozilla.org/ReleaseEngineering/Applications/SlaveAPI
Agenda
- (reminder, Callek not attending today, due to flight timing - notes on agenda inline)
- [Callek] Can I ask someone to take an action item to send me a summary of outcomes of discussion/points here? <- coop will do this
- (coop) Q4 goals?
- slave pre-flight tasks (Q4)
- report on disconnects as %age of overall builds
- reduce %age or number of disconnects (stretch goal)
- in-house masters for Windows (Q4)
- improve loaner process
- start with AWS loaners using slaveapi
- coordinate with Kim
- self-serve reboots for sheriffs (Q4)
- slaveapi reboots from slave_health
- improved slavealloc (postponed for now)
- we currently don't have proper control over how to migrate instances from one coast to another or in-house
- bug 907431 to handle Amazon issues better (Q4 2013 goal?)
- slaveapi to allow migrating hosts from one place to another
- allows for sheriffs to manage the pools
- (armenzg) - triaging queries
- can we remove the "build" group belonging? jlund could not see them
- (Callek) We should/could add people to the 'build' bmo group as well (day 1 page update) -- but agreed anyway
- releng query of bugs w/o dependencies should *only* be limited to problem tracking
- (Callek) I don't necessarily agree, however I do think that until we have better tooling elsewhere this is likely necessary
- see bug https://bugzilla.mozilla.org/show_bug.cgi?id=920453
- it had a dep bug, which prevented us from seeing it on the query
- also this bug https://bugzilla.mozilla.org/show_bug.cgi?id=919841
- even though I see the dep bugs as resolved
- (armenzg) buildduty report not showing the following bug
- https://secure.pub.build.mozilla.org/builddata/reports/slave_health/buildduty_report.html
- https://bugzilla.mozilla.org/show_bug.cgi?id=738489
- (callek) It is in "all deps resolved" right now
- these 2 bugs have "no dependencies" rather than "no open dependencies": https://bugzilla.mozilla.org/show_bug.cgi?id=891379 and https://bugzilla.mozilla.org/show_bug.cgi?id=738489
- (armenzg) - improving file removals
- http://stackoverflow.com/questions/186737/whats-the-fastest-way-to-delete-a-large-folder-in-windows/6208144#6208144
- I was thinking that we could move files to a common deletion place
- instead of removing on every run
- after 1am, check every 15 min whether a machine is idle
- if idle, gracefully shut down buildbot, remove the files and reboot the machine
- we could probably save lots of CPU time per machine
- (armenzg) - (rant warning) win64 post-reimaging steps are a complete pain
- I still remember when I had managed to get it down to just a hostname change
- I'm not going to do any hosts until we're fully switched to rev2 imaging
- (Callek) a brief chat in IRC suggested we could/should forgo win64 rev1 bringup in favor of reimaging as rev2 GPO'd machines. Has a short-term pain of extra buildbot-config mess, but probably best for our own sanity.
- (bhearsum) lowering nagios noise
- two examples yesterday of important alerts being missed (builds-4hrs file age; buildbot master command queue)
- possible simple improvement: send slave alerts to a different place than the rest
- even if we lower the noise we still have the issue that if I'm busy fixing something else I don't get to look at #buildduty
- (Callek) this is the exact reason I miss some tree-closing issues, get into doing something else and don't look at #buildduty in a timely manner.
- should we create an IRC channel called #treeclosing and report issues there that we know close the tree? or have a bot that calls for "buildduty"?
- (Callek) We could also add ourselves to an IT-esque pager-duty for specific "tree closing" alerts we know are worth getting paged on, e.g. builds-4hrs. I would expect "all" of us [buildduty] on weekends, the assigned buildduty person on weekdays (similar to IT on-call magic) and possibly always-joduinn and/or always-coop (as managers).
- (Callek) Having SMS alerts sent to me would allow me to at least respond that I'm nearby/not-nearby/etc and notice even faster if I am indeed on buildduty in a week and builds-4hr goes off.
- (bhearsum) is buildduty on the hook for random new machine set-up? eg https://bugzilla.mozilla.org/show_bug.cgi?id=919841
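Sketch for the "improving file removals" item above (armenzg's proposal): move files to a common deletion place during a job, then purge that place when the machine is idle after 1am. This is only an illustration of the idea, not an agreed design. The 1am start and the 15 min interval come from the notes; the 6am end of the window, the function names, and the is_idle / stop_buildbot / reboot callables are hypothetical placeholders for whatever the slave tooling would actually provide.

```python
import shutil
import uuid
from datetime import datetime, time
from pathlib import Path
from typing import Callable

PURGE_START = time(1, 0)          # from the notes: start checking after 1am
PURGE_END = time(6, 0)            # assumption: the notes give no end of window
CHECK_INTERVAL_SECONDS = 15 * 60  # from the notes: check every 15 min


def stage_for_deletion(path: Path, trash_dir: Path) -> Path:
    """Move `path` into `trash_dir` instead of deleting it during a job.

    A rename on the same volume is cheap compared with a recursive delete.
    A random suffix avoids name clashes between runs.
    """
    trash_dir.mkdir(parents=True, exist_ok=True)
    target = trash_dir / f"{path.name}-{uuid.uuid4().hex}"
    path.rename(target)
    return target


def in_purge_window(now: datetime) -> bool:
    """True when `now` falls inside the nightly purge window."""
    return PURGE_START <= now.time() < PURGE_END


def purge_if_idle(
    trash_dir: Path,
    now: datetime,
    is_idle: Callable[[], bool],        # hypothetical idle check
    stop_buildbot: Callable[[], None],  # hypothetical graceful shutdown
    reboot: Callable[[], None],         # hypothetical reboot hook
) -> bool:
    """Purge the deletion place if inside the window and the machine is idle.

    Returns True if a purge (shutdown, delete, reboot) was performed.
    """
    if not in_purge_window(now) or not is_idle():
        return False
    stop_buildbot()
    if trash_dir.is_dir():
        for entry in trash_dir.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry, ignore_errors=True)
            else:
                entry.unlink()
    reboot()
    return True
```

A scheduler (cron, a scheduled task, or a loop sleeping CHECK_INTERVAL_SECONDS) would call purge_if_idle with the current time. Whether this saves CPU time per machine in practice would need measuring on real slaves.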
List of current projects
- https://github.com/bhearsum/slaveapi/blob/master/TODO
- https://bugzilla.mozilla.org/show_bug.cgi?id=914764
Action items
- coop:
- update nagios buildduty docs carryover
- follow-up with IT re: JSON formatting of nagios output
- add existing buildduty queries to wiki
- notify Callek of meeting outcomes
- talk to arr about turning off nagios slave checks in IRC
- armenzg
- propose a query for non-problem tracking VS buildduty bugs