ReleaseEngineering/Jacuzzis
Background
- http://atlee.ca/blog/posts/initial-jacuzzi-results/
- https://bugzilla.mozilla.org/show_bug.cgi?id=970738
Implementation
Buildbot masters poll the Jacuzzi Allocator (http://jacuzzi-allocator.pub.build.mozilla.org/v1/) for builder/machine assignments. The Jacuzzi Allocator is backed by static files currently hosted in a GitHub repository (https://github.com/mozilla/releng-jacuzzis).
These allocations are updated periodically based on load. The allocate.py script is responsible for determining the proper allocations per jacuzzi. It first reads config.json, then looks at build history, and writes out an adjusted version of config.json. There are several parameters in the code that can be tweaked to adjust its behaviour. Currently it tries to make sure that no jacuzzi would have been more than 90% busy for more than 20 minutes per day over the past week.
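The 90% / 20 minutes rule can be illustrated with a minimal sketch. This is not the code in allocate.py: the per-minute sample format, the function names, and the linear search over sizes are all assumptions made for illustration, and the sketch ignores the queueing that would occur if a jacuzzi were smaller than its historical load.

```python
from collections import defaultdict

BUSY_THRESHOLD = 0.9    # "more than 90% busy" (from the text above)
MAX_BUSY_MINUTES = 20   # "more than 20 minutes per day" (from the text above)


def busy_minutes_per_day(samples, n_machines, threshold=BUSY_THRESHOLD):
    """Count, per day, the minutes in which the jacuzzi was over the threshold.

    samples is a hypothetical iterable of (day, minute, busy_count) tuples,
    one per minute of build history.
    """
    minutes = defaultdict(int)
    for day, _minute, busy in samples:
        if n_machines > 0 and busy / n_machines > threshold:
            minutes[day] += 1
    return dict(minutes)


def smallest_adequate_size(samples, max_size=100):
    """Return the smallest jacuzzi size that satisfies the rule on every day."""
    for n in range(1, max_size + 1):
        per_day = busy_minutes_per_day(samples, n)
        if all(m <= MAX_BUSY_MINUTES for m in per_day.values()):
            return n
    return max_size
```

With a week of samples, smallest_adequate_size would be evaluated once per jacuzzi and machine type; the real script may weigh history differently.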
The manage_jacuzzis.py script is responsible for managing the specific machines in each jacuzzi, and populating the files. It reads config.json to determine how many of each type of machine should be in each jacuzzi. It also consults the usable slaves report and slavealloc to pick appropriate slaves.
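As a rough sketch of the selection step described above (not the actual manage_jacuzzis.py logic), the function below fills a single jacuzzi from a set of usable machine names. The function name, its arguments, and the policy of keeping already-assigned machines first are assumptions; it also ignores the fact that a machine should not end up in two jacuzzis at once.

```python
def pick_machines(wanted, usable, current):
    """Choose machines for one jacuzzi.

    wanted:  {machine name prefix: count}, e.g. {"bld-linux64-spot-": 2}
    usable:  set of machine names considered usable (hypothetical input,
             standing in for the usable slaves report and slavealloc)
    current: list of machine names already in this jacuzzi
    """
    chosen = []
    for prefix, count in sorted(wanted.items()):
        # Prefer machines that are already in the jacuzzi and still usable,
        # so that allocations churn as little as possible.
        keep = [m for m in current if m.startswith(prefix) and m in usable]
        spare = sorted(
            m for m in usable if m.startswith(prefix) and m not in keep
        )
        chosen.extend((keep + spare)[:count])
    return chosen
```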
Runnable Code
The runnable scripts, allocate.py and manage_jacuzzis.py, live in a separate repository from the static files: https://github.com/mozilla/build-jacuzzi-allocator
So, to modify anything aside from configuration, you'll need to submit a pull request to this repository and then run git pull from /data/releng/src/jacuzzi-allocator on the relengwebadm host, where crontask.sh lives.
Updating jacuzzis
To make changes to the jacuzzis, do the following:
- Clone the static jacuzzis repo (https://github.com/mozilla/releng-jacuzzis)
- Modify the repo as necessary. Usually this requires you to add one or more files to the builders and machines directories, and modify allocated/all.
- Run the allocation report to look for inconsistencies: https://hg.mozilla.org/build/braindump/file/default/jacuzzi-related/allocation-report.py
- Push your changes
- The crontask on relengwebadm will automatically pull in changes on its next run
Disabling dynamic allocator
The dynamic allocator will normally override any changes to 'bld-linux64-spot-' and 'w64-ix-' amounts. To disable this behaviour (in the case of a bug), add a top-level key to config.json: "disabled": true
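If you prefer not to edit the file by hand, the flag can be set with a few lines of Python. This is only a convenience sketch: the helper name is hypothetical, and note that it rewrites config.json with sorted keys and two-space indentation, which may differ from the file's existing formatting.

```python
import json


def set_allocator_disabled(path, disabled=True):
    """Set the top-level "disabled" key in a config.json-style file."""
    with open(path) as f:
        config = json.load(f)  # fails loudly if the file is not valid JSON
    config["disabled"] = disabled
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
        f.write("\n")
    return config
```

Calling set_allocator_disabled("config.json", False) restores the normal behaviour.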
Adding a new builder to a jacuzzi
For linux64-based or Windows-based builds, this is simple: add the builder name to the dictionary of builders in config.json with zero slaves allocated:
"b2g_mozilla-central_emulator-debug_dep": { "bld-linux64-spot-": 0 },
Make sure to verify you've still got valid json. `python -m json.tool config.json` is a handy way to test this.
The next time the allocator runs, it will calculate the proper number of machines for this jacuzzi.
For other types of builds, allocate.py will need to be modified to support the new slave type.
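The zero-allocation entry for a linux64 or Windows builder can also be added programmatically. The sketch below assumes the dictionary of builders sits under a top-level "builders" key; that key name is a guess, so check the real config.json and pass the correct name via builders_key.

```python
import json


def add_builder(config, builder, machine_prefix="bld-linux64-spot-",
                builders_key="builders"):
    """Add a builder with zero machines allocated, if it is not already present.

    builders_key is an assumption about the layout of config.json.
    """
    builders = config.setdefault(builders_key, {})
    if builder not in builders:
        builders[builder] = {machine_prefix: 0}
    return config


def is_valid_json(text):
    """Same check as `python -m json.tool`: does the text parse as JSON?"""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True
```

Existing builders are left untouched, so re-running the helper does not reset allocations the allocator has already computed.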
Removing a builder from a jacuzzi
Simply delete it from config.json.
Troubleshooting
The allocation is run on relengwebadm.private.scl3.mozilla.com from /data/releng/src/jacuzzi-allocator/crontask.sh
Take a look at the current allocations at http://jacuzzi-allocator.pub.build.mozilla.org/v1/. Do they match what's in the repo?
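Comparing the served allocations with the repo can be done mechanically once both are loaded. The sketch below assumes each side has already been read into a mapping of builder name to a list of machine names; how to fetch or parse either side is deliberately left out, since the response format of the allocator is not described here.

```python
def diff_allocations(repo, live):
    """Report builders whose machine lists differ between repo and allocator.

    repo, live: {builder name: [machine names]} (hypothetical, pre-loaded).
    """
    report = {}
    for builder in sorted(set(repo) | set(live)):
        want = set(repo.get(builder, []))
        have = set(live.get(builder, []))
        if want != have:
            report[builder] = {
                "missing_from_live": sorted(want - have),
                "unexpected_in_live": sorted(have - want),
            }
    return report
```

An empty result means the two sides agree; anything else points at a builder worth checking before the next crontask run.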
If the jacuzzi looks like it has enough machines, but the machines aren't running, check what aws_watch_pending is doing, and that the slaves are actually usable.
If all else fails, contact catlee.
Current limitations / known issues
- manage_jacuzzis.py won't always remove unusable or disabled slaves from the allocations
- allocate.py doesn't notice pending load. This means that if a builder has low or no activity for a long time and then gets sudden high load, the allocations won't be adjusted in time