Mozmill CI Duties
The Mozmill CI duties are weekly roles that a member of the team takes on; each rotation runs from Monday to Sunday. The previous owner of each role can provide any important updates during the weekly Automation Development meeting, and the next owner can be assigned at that time.
The duty rotates weekly, but our tests keep running over the weekend. Whoever starts on Monday should first check the state of the CI and then work through the emails from Saturday onwards (possibly even from Friday night, since we leave the office then). The person on duty adds the "CIduty" suffix to their IRC nick so everyone knows who it is, and should update the etherpad whenever needed.
Infrastructure Duty
Monitor issues with the infrastructure that supports the Mozmill tests. This includes Mozmill, Mozmill Dashboard, Mozmill CI, Mozmill Environment, and Mozmill Automation. Any aborted or failing testruns reported on the mozmill-ci mailing list should be investigated and responded to.
Check that all machines are up and running
This task should be done first thing in the morning and again before leaving the office.
Usually, all our nodes should be up and running. The exceptions are when we keep a node offline because it has issues that have not been fixed yet, or take it offline briefly for testing or updates.
In all of these cases, an offline status message should be set when taking the node offline.
This has the format: [Username] Reason the node is offline
Example: [Andreea] Updating Flash / Testing bug 123456
When you go over the list of nodes and see one offline, click on it and check the message. If it is offline for a good reason, leave it as it is.
You can also check with the person who took it offline; they might have forgotten to put it back online.
In any case, taking nodes offline for testing should only happen during the day; by the end of the day all machines should be back online.
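This check can also be scripted against the Jenkins JSON API instead of clicking through the node list. The following is only a minimal sketch, assuming the production master URL, the Python requests library, and (if required) valid credentials; it prints every offline node together with its offline status message.

import requests

JENKINS_URL = "http://mm-ci-production.qa.scl3.mozilla.com:8080"

def list_offline_nodes(auth=None):
    """Print every offline node together with its offline status message."""
    # /computer/api/json lists all nodes with their offline state and message.
    response = requests.get(JENKINS_URL + "/computer/api/json", auth=auth)
    response.raise_for_status()
    for node in response.json()["computer"]:
        if node["offline"]:
            reason = node.get("offlineCauseReason") or "(no offline message set)"
            print("%s: %s" % (node["displayName"], reason))

if __name__ == "__main__":
    list_offline_nodes()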
How to put a node back online
Check the machine via ssh and see whether it has a screen session (attach with screen -x) and whether it shows 'Connected'. That means the Jenkins slave agent (Java) is still connected and the node is only marked offline in Jenkins.
If all is fine on the machine, just click 'Mark this node online' in Jenkins.
If the machine is not connected, then you need to follow these instructions from mana.
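For the first case (node still connected but only marked offline), the same 'Mark this node online' action is also available through the Jenkins remote API. The sketch below is illustrative only: the node name and credentials are placeholders, and depending on the Jenkins configuration a CSRF crumb may additionally be required.

import requests

JENKINS_URL = "http://mm-ci-production.qa.scl3.mozilla.com:8080"

def mark_node_online(node_name, auth):
    """Clear the temporarily-offline flag for a node that is connected but marked offline."""
    # toggleOffline flips the offline flag, so only call this while the node
    # is currently marked offline in Jenkins.
    url = "%s/computer/%s/toggleOffline" % (JENKINS_URL, node_name)
    response = requests.post(url, auth=auth)
    response.raise_for_status()

# Example with placeholder values:
# mark_node_online("mm-win-xp-32-4", auth=("user", "api_token"))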
Check Jenkins Health
Monitoring
- Address: mm-ci-production.qa.scl3.mozilla.com:8080/monitoring | mm-ci-staging.qa.scl3.mozilla.com:8080/monitoring
Go through the panels and make sure the values are in the normal range, usually marked with green. If something is red, let us know on the mailing list, file an issue on mozmill-ci or a bug under Infrastructure (depending on the issue), note it in the etherpad, and optionally contact people on IRC as well.
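If you want a quick reachability check before opening the pages, something like the sketch below (assuming the requests library and the two URLs above) only confirms that both monitoring pages respond; the actual values still have to be inspected in the browser.

import requests

MONITORING_URLS = [
    "http://mm-ci-production.qa.scl3.mozilla.com:8080/monitoring",
    "http://mm-ci-staging.qa.scl3.mozilla.com:8080/monitoring",
]

def check_monitoring(auth=None):
    """Print the HTTP status of each monitoring page, or the connection error."""
    for url in MONITORING_URLS:
        try:
            response = requests.get(url, auth=auth, timeout=30)
            print("%s -> HTTP %d" % (url, response.status_code))
        except requests.RequestException as exc:
            print("%s -> unreachable (%s)" % (url, exc))

if __name__ == "__main__":
    check_monitoring()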
Pulse
Connect to the master machine and check in screen (screen -x) that Pulse is connected on all panels (daily, l10n, release); switch between them with CTRL + A followed by the panel number (1, 2, 3, ...).
If it is not connected, follow the Emergencies procedure below, as otherwise the testruns will not get triggered.
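Attaching with screen -x is interactive, but you can at least confirm from your own machine that the screen session on the master still exists. The sketch below assumes ssh key access and uses the production master hostname as a placeholder; the actual 'Connected' status of Pulse still needs to be checked by attaching to the session.

import subprocess

MASTER_HOST = "mm-ci-production.qa.scl3.mozilla.com"  # assumed hostname

def list_screen_sessions(host=MASTER_HOST):
    """Run 'screen -ls' on the master over ssh and print the session list."""
    # screen -ls may exit non-zero even when sessions exist, so the return
    # code is deliberately not checked here.
    result = subprocess.run(["ssh", host, "screen -ls"],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)

if __name__ == "__main__":
    list_screen_sessions()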
Retrigger broken testruns
If the job does not have a Mozmill Dashboard link, the testrun did not complete successfully: either it ran all tests but failed to send the report, or it failed in one of the other steps we run at the end. To identify the problem, open the log via the 'View the build in Jenkins' link and check what went wrong.
- Known issues:
- Bug 915563 - Testrun stopped with "socket.error: [Errno 10022] WSAEINVAL"
- Cannot delete workspace problem on Windows: https://github.com/mozilla/mozmill-ci/issues/358
Aborted testrun emails have a subject like the following:
- [Aborted] ondemand_update: Firefox 31.0 en-US on mm-win-xp-32-4 (2014-11-14_08-51-59)
Usually, when someone aborts testruns on purpose, they send an email to the mailing list explaining why this was necessary. We do not rebuild those.
If testruns were aborted by mistake or for reasons outside our control (e.g. something on the nodes), rebuild the job (the Rebuild link is on the left side of the build log) and reply to the email with details about what the issue was and a link to the rebuilt job. The rebuilt job will be at the top of the queue list (if there is a queue) or will be the one that just started running.
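As an alternative to clicking Rebuild, an aborted build can also be retriggered with its original parameters through the Jenkins remote API. The sketch below is only illustrative: the job name, build number and credentials are placeholders, and a CSRF crumb may additionally be required depending on the Jenkins configuration.

import requests

JENKINS_URL = "http://mm-ci-production.qa.scl3.mozilla.com:8080"

def get_build_parameters(job, build_number, auth=None):
    """Read the parameters the original build was started with."""
    url = "%s/job/%s/%d/api/json" % (JENKINS_URL, job, build_number)
    data = requests.get(url, auth=auth).json()
    for action in data.get("actions", []):
        if "parameters" in action:
            return {p["name"]: p["value"] for p in action["parameters"]}
    return {}

def retrigger_build(job, build_number, auth=None):
    """Resubmit the job with the same parameters as the aborted run."""
    params = get_build_parameters(job, build_number, auth=auth)
    url = "%s/job/%s/buildWithParameters" % (JENKINS_URL, job)
    response = requests.post(url, params=params, auth=auth)
    response.raise_for_status()

# Example with placeholder values:
# retrigger_build("ondemand_update", 1234, auth=("user", "api_token"))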
If something major happens, leave a status update on IRC, in the etherpad, and by email so everyone is informed.
Mention tasks that are unfinished/need to be done in the etherpad: https://mozqa.etherpad.mozilla.org/automation-CI-duty
Create the testrun report for beta days at the end of that day
Clone repository and follow instructions listed there: https://github.com/andreieftimie/reported
Update the reported.py script for the latest versions and check the blacklisted locales.
Run the script for the beta day.
Check the results and, if anything is missing, investigate why: were those testruns retriggered the next day, or was Pulse down? Once you have the information, send an email to mozmill-ci and dev-automation with the report attached and a description of what happened.
Failure Duty
Monitor Mozmill test failures by watching the mozmill-ci mailing list. Raise a bug for each new failure, provide an initial investigation (at least the results of attempting to replicate it on the same platform), and respond on the mozmill-ci list to indicate that the notification has been taken care of.
Check failure emails and reply where needed, file bugs, investigate
Failure emails have the following format:
Mozmill mozilla-central_update testrun for Nightly 36.0a1 en-US on mm-osx-106-4 (20141126030207) completed with 3 failures. View the build in Jenkins: http://mm-ci-production.qa.scl3.mozilla.com:8080/job/mozilla-central_update/13312/
View the results in the Mozmill Dashboard: http://mozmill-daily.blargon7.com/#/update/report/4baf975f17cb2c8a74373dc7dd123535
If the job has a Mozmill Dashboard link, the testrun ran but has some failures. Check that report and file an appropriate bug, or update one that is already filed. When updating a bug, we recommend checking Top Failures in the dashboard and filtering for the dates you are interested in, to see how many failures there were, on which platforms/branches, and so on. Otherwise you will only be able to gather all the failures after going through every email.
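When going through a day's worth of failure emails, it can help to pull the failure count and the Jenkins/Dashboard links out of each notification automatically. The sketch below is based purely on the email format quoted above and makes no other assumptions.

import re

FAILURES_RE = re.compile(r"completed with (\d+) failures?")
JENKINS_RE = re.compile(r"View the build in Jenkins:\s*(\S+)")
DASHBOARD_RE = re.compile(r"View the results in the Mozmill Dashboard:\s*(\S+)")

def parse_failure_email(body):
    """Return (failure_count, jenkins_url, dashboard_url) from an email body."""
    failures = FAILURES_RE.search(body)
    jenkins = JENKINS_RE.search(body)
    dashboard = DASHBOARD_RE.search(body)
    return (int(failures.group(1)) if failures else 0,
            jenkins.group(1) if jenkins else None,
            dashboard.group(1) if dashboard else None)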
Emergencies
If something major happens (testruns not being triggered, or all of them failing or not able to run to the end), let us know on the mailing list, file an issue on mozmill-ci or a bug under Infrastructure (depending on the issue), note it in the etherpad, and optionally contact people on IRC as well.
If testruns fail due to a new regression, then in order to avoid all the email noise we should:
In Jenkins -> Manage Jenkins, click 'Prepare for Shutdown' (this lets the builds already in progress finish, but keeps the rest of the jobs in the queue; see the sketch after these steps for a scripted alternative).
File a bug for the regression, quickly prepare a skip patch, and ask someone to review and land it.
After it has landed, the testruns can start again, as that test will no longer fail. Click 'Cancel Shutdown' in the Manage Jenkins options and trigger a mozmill tests job from the +admin tab (this one runs automatically every 5 minutes, but we want it as quickly as possible so that it picks up the new skip too).
We need to let the QA team know to run the skipped test manually, as it is no longer covered by automation for that beta build.
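The 'Prepare for Shutdown' and 'Cancel Shutdown' steps above correspond to the quietDown endpoints of the Jenkins remote API, so they can also be scripted. This is a sketch only: the credentials are placeholders, and a CSRF crumb may be required depending on the Jenkins configuration.

import requests

JENKINS_URL = "http://mm-ci-production.qa.scl3.mozilla.com:8080"

def prepare_for_shutdown(auth):
    """Let running builds finish but keep queued jobs waiting."""
    requests.post(JENKINS_URL + "/quietDown", auth=auth).raise_for_status()

def cancel_shutdown(auth):
    """Resume normal scheduling once the skip patch has landed."""
    requests.post(JENKINS_URL + "/cancelQuietDown", auth=auth).raise_for_status()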
Other tips and tricks
If you are on a node and need to access the fs1 share folder but it says it cannot connect, that is most likely because fs1 already has too many connections; just ssh to the fs1 node and reboot it. If you simply cannot see the fs1 folder on an OS X machine, you can mount it with Command + K in Finder.
Purge build queue: for example, if ondemand runs were triggered incorrectly and are all failing, and there are NO OTHER types of jobs in the queue (no remote, endurance, and so on, just ondemand jobs), you can purge the queue in order to avoid running them all.
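Instead of purging the whole queue, the queued ondemand items can also be cancelled selectively through the Jenkins queue API. The sketch below assumes the job names start with 'ondemand' (as in the aborted-email example earlier), uses placeholder credentials, and may additionally need a CSRF crumb depending on the Jenkins configuration; only run it after confirming nothing else is waiting in the queue.

import requests

JENKINS_URL = "http://mm-ci-production.qa.scl3.mozilla.com:8080"

def cancel_queued_ondemand_jobs(auth):
    """Cancel every queued item whose job name starts with 'ondemand'."""
    queue = requests.get(JENKINS_URL + "/queue/api/json", auth=auth).json()
    for item in queue.get("items", []):
        task_name = item.get("task", {}).get("name", "")
        if task_name.startswith("ondemand"):
            requests.post(JENKINS_URL + "/queue/cancelItem",
                          params={"id": item["id"]}, auth=auth)
            print("Cancelled queued %s (queue id %s)" % (task_name, item["id"]))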