ReleaseEngineering/How To/Close Trees for TCW
Overview
Closing the trees for a Tree Closing Window (TCW) is much more invasive than normal operations. Instead of working with a single host or service, we operate on ALL hosts running a number of services.
The process varies depending on several factors:
- Is it a "hard" or "soft" tree close?
- Do the BuildBot services need to be shut down?
- Is there time to be extra nice?
With the answers to all of those questions, and freshly updated checkouts of the resources below, you are ready to proceed.
This procedure only covers shutdown on the day of the TCW. See TCW Procedures for the larger picture.
Script & Service Resources
There are various utility scripts that can be used for dealing with certain aspects of the process. The main ones are:
- Fabric-based scripts in tools/buildfarm/maintenance (DXR)
- Ansible-based scripts in build-ansible
- TreeStatus
Closing the Trees
[Optional] Start Graceful Master Shutdown
Used when all of the following are true:
- It will be a hard close.
- The BuildBot services need to be shut down.
This process is optional. It minimizes the number of running jobs that will fail to complete, but it is not required before shutting down the BuildBot masters.
Start this procedure a few hours before the TCW begins, because some build jobs take several hours to complete.
Steps:
- Use the fabric script "manage-masters.py" to initiate the graceful-stop for the roles "build", "tests", and "try".
- This leaves the schedulers running to receive the latest updates from the masters as they shut down.
- Let the script run until it completes, or it is time to do the hard shutdown of the masters.
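The role selection above can be sketched as a small command builder. This is only an illustration: the `-f` and `-R` flags, the `graceful_stop` action name, and the `production-masters.json` file name are assumptions not stated on this page, so check the script's own usage output in tools/buildfarm/maintenance before running anything.

```python
import shlex

# Roles named in the steps above. The "scheduler" role is deliberately
# left out so the schedulers keep running during the graceful stop.
GRACEFUL_ROLES = ["build", "tests", "try"]


def build_manage_masters_command(action, roles,
                                 masters_file="production-masters.json"):
    """Return an argv list for manage-masters.py.

    ASSUMPTION: the "-f <file>" and "-R <role>" flags, the positional
    action, and the default masters file name are illustrative only.
    """
    argv = ["python", "manage-masters.py", "-f", masters_file]
    for role in roles:
        argv += ["-R", role]
    argv.append(action)
    return argv


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    command = build_manage_masters_command("graceful_stop", GRACEFUL_ROLES)
    print(shlex.join(command))
```

Printing the command rather than executing it leaves room to review the role list before anything touches the masters.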
Closing the Trees
Use the TreeStatus app to perform these actions. Always ensure the "Remember this change to undo later" checkbox is selected. Use a link to the TCW master bug as part of the description.
Always close the "autoland" tree first.
When a soft close is needed:
- Select all trees in the "Approval Required" state.
- Update them, leaving the state as "Approval Required", so only the message is changed.
- Select all trees in the "Open" state.
- Update them, leaving the state as "Open", so only the message is changed.
When a hard close is needed:
- Select all trees.
- Update them, setting the state to "Closed" and adding the message.
At this point, notify #moc that the trees are closed.
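The soft and hard close rules can be summarized as a planning function. This is a sketch for reasoning about the procedure, not a TreeStatus client: the function name and the dict-of-states input are hypothetical, and the actual changes are still made through the TreeStatus app.

```python
def plan_tree_updates(trees, hard_close, message):
    """Return an ordered list of (tree, new_state, message) updates.

    `trees` maps a tree name to its current state, using the state
    names shown in TreeStatus: "Open", "Approval Required", "Closed".
    This helper is hypothetical and only models the rules in the text.
    """
    updates = []
    for name, state in trees.items():
        if hard_close:
            # Hard close: every tree is set to "Closed".
            new_state = "Closed"
        elif state in ("Open", "Approval Required"):
            # Soft close: keep the state, change only the message.
            new_state = state
        else:
            # Soft close leaves already-closed trees untouched.
            continue
        updates.append((name, new_state, message))
    # The "autoland" tree is always handled first; the sort is stable,
    # so the remaining trees keep their original order.
    updates.sort(key=lambda update: update[0] != "autoland")
    return updates


if __name__ == "__main__":
    example = {
        "mozilla-central": "Approval Required",
        "autoland": "Open",
        "try": "Closed",
    }
    for update in plan_tree_updates(example, hard_close=False,
                                    message="TCW in progress, see master bug"):
        print(update)
```

Keeping the plan as plain data makes it easy to review which trees will change before touching the app.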
Shutting down BuildBot
Several ancillary services must be shut down along with BuildBot.
When BuildBot must be shut down:
- Stop Ancillary Services:
  - Stop the auto allocation of more AWS spot instances by:
    - Setting the global limits in https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg#L230-L239 to 0 for all types.
    - Committing the change.
    - Successful deployment should be reported in #releng within 10 minutes.
    - See bug 1340716 for the latest information on dealing with Windows test machines.
  - Stop the BuildAPI service, as follows (see this page for host access, normal instructions, etc.):
    - Make a backup copy of the production.ini file.
    - Edit production.ini and comment out all lines with a database password in the URL.
    - Perform the normal steps to restart the server (here).
  - Stop the SelfServe agents, using desired_state=stopped and then follow the normal procedure (here).
  - Stop the BuildBotBridge agents, using desired_state=stopped and then follow the normal procedure (here).
- Stop BuildBot itself:
  - Use the fabric script "manage-masters.py" to initiate a stop for the roles "build", "tests", and "try".
  - Use the fabric script "manage-masters.py" to initiate a stop for the role "scheduler".
Inform #moc that all BuildBot servers are stopped. (Shutdown of BuildBot usually blocks the start of some IT work, or we wouldn't bother.)
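The production.ini edit in the BuildAPI step can be sketched as a text filter. This is a minimal sketch under assumptions: it treats any `scheme://user:password@host` URL as a database URL, assumes `#` and `;` start comments, and the `sqlalchemy.url` key in the example is hypothetical. Make the backup copy first, as the procedure requires, and review the result by hand.

```python
import re

# Matches URLs that embed credentials, e.g. scheme://user:password@host/...
URL_WITH_PASSWORD = re.compile(
    r"[A-Za-z][A-Za-z0-9+.\-]*://[^/\s:@]+:[^@\s]+@"
)


def comment_out_db_urls(text, comment_prefix="# "):
    """Comment out every line containing a URL with an embedded password.

    Lines that are already comments (ASSUMPTION: starting with "#" or
    ";") are left unchanged, so running this twice is harmless.
    """
    result = []
    for line in text.splitlines(keepends=True):
        already_commented = line.lstrip().startswith(("#", ";"))
        if not already_commented and URL_WITH_PASSWORD.search(line):
            result.append(comment_prefix + line)
        else:
            result.append(line)
    return "".join(result)


if __name__ == "__main__":
    # Hypothetical file content; the real production.ini keys may differ.
    sample = (
        "[app:main]\n"
        "sqlalchemy.url = mysql://user:secret@db.example.com/buildapi\n"
        "debug = false\n"
    )
    print(comment_out_db_urls(sample), end="")
```

Because the function only prefixes matching lines, restoring service after the TCW can be done by copying the backup file back into place.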