Once the signed builds are available, QA starts manual and automated testing. QA relies on a mix of community members, contractors and employees in different timezones to speed up this verification process. Meanwhile, our release automation generates updates for all languages and all platforms, for all supported releases. These update snippets are typically ready before QA has finished verifying the signed builds. QA then verifies that users can safely update from various previous releases to the newest release using these updates.
Mechanically, our automation pushes the binaries to our "internal mirrors" (Mozilla-hosted servers) in order for QA to verify updates. Only after QA has finished verification of the builds and the updates will we push them to our community mirrors. These community mirrors are essential to handle the global load of users, by allowing them to request their updates from local mirror nodes instead of from ftp.mozilla.org directly. It's worth noting that we do not make builds and updates available on the community mirrors until after QA signoff, because of complications that arise if QA finds a last-minute showstopper and the candidate build needs to be withdrawn.
The validation process after builds and updates are generated is:
- QA, along with community and contractors in other timezones, does manual testing.
- QA triggers the automation systems to do functional testing.
- QA independently verifies that fixed problems and new features for that release are indeed fixed and of good enough quality to ship to users.
- Meanwhile, release automation generates the updates.
- QA signs off the builds.
- QA signs off the updates.
Note that users don't get updates until QA has signed off and the Release Coordinator has sent the email asking to push the builds and updates live.
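The gating rule above can be sketched as a small check. This is a minimal illustration only, not Mozilla's actual tooling: the `ReleaseStatus` record, its field names, and `can_push_live` are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStatus:
    """Hypothetical record of the signoffs described in the text."""
    qa_builds_signed_off: bool = False
    qa_updates_signed_off: bool = False
    coordinator_go_email_sent: bool = False


def can_push_live(status: ReleaseStatus) -> bool:
    """Return True only when every gate has been passed.

    Users must not receive updates until QA has signed off on both
    the builds and the updates, and the Release Coordinator has sent
    the email asking to push them live.
    """
    return (
        status.qa_builds_signed_off
        and status.qa_updates_signed_off
        and status.coordinator_go_email_sent
    )
```

Modelling each signoff as a separate flag keeps it obvious which handoff is still outstanding when a release is not yet ready to go live.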
Pushing to Public Mirrors and AUS
Once the Release Coordinator gets signoff from QA and various other groups at Mozilla, they give Release Engineering the go-ahead to push the files to our community mirror network. We rely on our community mirrors to be able to handle a few hundred million users downloading updates over the next few days. All the installers, as well as the complete and partial updates for all platforms and locales, are already on our internal mirror network at this point. Publishing the files to our external mirrors involves making a change to an rsync exclude file for the public mirrors module. Once this change is made, the mirrors will start to synchronize the new release files. Each mirror has a score or weighting associated with it; we monitor which mirrors have synchronized the files and sum their individual scores to compute a total "uptake" score. Once a certain uptake threshold is reached, we notify the Release Coordinator that the mirrors have enough uptake to handle the release.
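The uptake calculation described above amounts to a weighted sum compared against a threshold. The sketch below shows the idea under stated assumptions: the mirror names, weights, threshold value, and function names are hypothetical, and the real monitoring system may work quite differently.

```python
from typing import Dict, Set


def uptake_score(mirror_weights: Dict[str, int], synced: Set[str]) -> int:
    """Sum the weights of the mirrors that have synchronized the release."""
    return sum(
        weight for name, weight in mirror_weights.items() if name in synced
    )


def has_enough_uptake(
    mirror_weights: Dict[str, int], synced: Set[str], threshold: int
) -> bool:
    """Return True once the total uptake reaches the chosen threshold."""
    return uptake_score(mirror_weights, synced) >= threshold


# Hypothetical mirrors and weights, for illustration only.
MIRROR_WEIGHTS = {"mirror-a": 50, "mirror-b": 30, "mirror-c": 20}
SYNCED_SO_FAR = {"mirror-a", "mirror-c"}

if has_enough_uptake(MIRROR_WEIGHTS, SYNCED_SO_FAR, threshold=60):
    print("Enough uptake: notify the Release Coordinator")
```

Weighting each mirror, rather than simply counting mirrors, lets a few high-capacity mirrors count for more than many small ones when deciding whether the network can handle the release.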
This is the point at which the release becomes "official". After the Release Coordinator sends the final "go live" email, Release Engineering will update the symlinks on the Web server so that visitors to our Web and ftp sites can find the latest version of Firefox. We also publish all the update snippets for users on past versions of Firefox to AUS.
Firefox installed on users' machines regularly checks our AUS servers to see if there's an updated version of Firefox available for them. Once we publish these update snippets, users are able to automatically update Firefox to the latest version.
As software engineers, our temptation is to jump to solve what we see as the immediate and obvious technical problem. However, Release Engineering spans many fields, both technical and non-technical, so being aware of both kinds of issues is important.
The Importance of Buy-in from Other Stakeholders
It was important to make sure that all stakeholders understood that our slow, fragile release engineering exposed the organization, and our users, to risks. This involved all levels of the organization acknowledging the lost business opportunities, and market risks, caused by slow, fragile automation. Further, Mozilla's ability to protect our users with super-fast turnaround on releases became more important as we grew to have more users, which in turn made us more attractive as a target.
Interestingly, some people had only ever experienced fragile release automation in their careers, so they came to Mozilla with low, "oh, it's always this bad" expectations. Explaining the business gains expected with a robust, scalable release automation process helped everyone understand the importance of the "invisible" Release Engineering improvement work we were about to undertake.
Involving Other Groups
Making the release process more efficient and more reliable required work by Release Engineering and by other groups across Mozilla. However, it was interesting to see how often "it takes a long time to ship a release" was mistranslated as "it takes Release Engineering a long time to ship a release". This misconception ignored the release work done by groups outside of Release Engineering, and was demotivating to the Release Engineers. Fixing this misconception required educating people across Mozilla on where time was actually spent by different groups during a release. We did this with low-tech "wall-clock" timestamps on emails of clear handoffs across groups, and a series of "wall-clock" blog posts detailing where time was spent.
- These helped raise awareness of which different groups were actually involved in a release.
- These helped people appreciate whenever RelEng got processes to run faster, which in turn helped motivate Release Engineers to make further improvements.
- These helped other groups think about how they too could help improve the overall release process, a big mindset shift for the entire organization.
- Finally, these also eliminated all the unclear handoff communications across groups, which had historically cost us many respins, false-starts, and other costly disruptions to the release process.
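The "wall-clock" accounting described above can be approximated by recording a timestamp at each handoff and totalling the time each group held the release. This is a minimal sketch under assumptions: the `(timestamp, group)` log format, the group names, and the `time_per_group` function are hypothetical and are not taken from Mozilla's tooling.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Each entry is (time of handoff, group taking over). A final entry
# with group None marks the moment the release shipped.
Handoff = Tuple[datetime, Optional[str]]


def time_per_group(handoffs: List[Handoff]) -> Dict[str, timedelta]:
    """Total the wall-clock time each group held the release.

    Assumes the handoffs are already sorted by timestamp.
    """
    totals: Dict[str, timedelta] = defaultdict(timedelta)
    # Pair each handoff with the next one to get the interval it covers.
    for (start, group), (end, _next_group) in zip(handoffs, handoffs[1:]):
        if group is not None:
            totals[group] += end - start
    return dict(totals)


# Hypothetical handoff log, for illustration only.
EXAMPLE_LOG: List[Handoff] = [
    (datetime(2011, 1, 1, 9, 0), "RelEng"),
    (datetime(2011, 1, 1, 12, 0), "QA"),
    (datetime(2011, 1, 1, 20, 0), "RelEng"),
    (datetime(2011, 1, 1, 21, 0), None),
]

for group, spent in time_per_group(EXAMPLE_LOG).items():
    print(f"{group}: {spent}")
```

Because every interval is attributed to exactly one group, the per-group totals always add up to the full elapsed time of the release, which is what makes this kind of breakdown useful for showing where the time actually went.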
Establishing Clear Handoffs
Many of our "release engineering" problems were actually people problems: miscommunication between teams; lack of clear leadership; and the resulting stress, fatigue and anxiety during chemspill releases. By having clear handoffs to eliminate these human miscommunications, our releases immediately started to go more smoothly, and cross-group human interactions quickly improved.
When we started this project, we were losing team members too often. In itself, this is bad. However, the lack of accurate up-to-date documentation meant that most of the technical understanding of the release process was documented by folklore and oral histories, which we lost whenever a person left. We needed to turn this situation around, urgently.
We felt the best way to improve morale and show that things were getting better was to make sure people could see that we had a plan to make things better, and that people had some control over their own destiny. We did this by making sure that we set aside time to fix at least one thing (anything!) after each release. We implemented this by negotiating for a day or two of "do not disturb" time immediately after we shipped a release. Solving immediate small problems, while they were still fresh in people's minds, helped clear distractions, so people could focus on longer-term problems in subsequent releases. More importantly, this gave people the feeling that we had regained some control over our own fate, and that things were truly getting better.
Because of market pressures, Mozilla's business and product needs from the release process changed while we were working on improving it. This is not unusual and should be expected.
We knew we had to continue shipping releases using the current release process, while we were building the new process. We decided against attempting to build a separate "greenfield project" while also supporting the existing systems; we felt the current systems were so fragile that we literally would not have the time to do anything new.
We also assumed from the outset that we didn't fully understand what was broken. Each incremental improvement allowed us to step back and check for new surprises, before starting work on the next improvement. Phrases like "draining the swamp," "peeling the onion," and "how did this ever work?" were heard frequently whenever we discovered new surprises throughout this project.
Given all this, we decided to make lots of small, continuous improvements to the existing process. Each iterative improvement made the next release a little bit better. More importantly, each improvement freed up just a little bit more time during the next release, which allowed a release engineer a little more time to make the next improvement. These improvements snowballed until we found ourselves past the tipping point, and able to make time to work on significant improvements. At that point, the gains from release optimizations really kicked in.
We're really proud of the work done so far, and the abilities that it has brought to Mozilla in a newly heated-up global browser market. Four years ago, doing two chemspill releases in a month would have been a talking point within Mozilla. By contrast, last week a published exploit in a third-party library caused Mozilla to ship eight chemspill releases in two low-fuss days.
As with everything, our release automation still has plenty of room for improvement, and our needs and demands continue to change. For a look at our ongoing work, please see the documentation on the design and flow of our Mercurial-based release process.
The authors were all members of the release engineering team at Mozilla. This article was excerpted with permission from Volume 2 of Architecture of Open Source Applications and lightly edited.