Recently, the Mozilla Release Engineering team has made numerous advances in release automation for our browser, Firefox. We have reduced the requirements for human involvement during signing and sending notices to stakeholders, and have automated many other small manual steps, because each manual step in the process is an opportunity for human error. While what we have now isn't perfect, we're always striving to streamline and automate our release process. Our final goal is to be able to push a button and walk away; minimal human intervention will eliminate many of the headaches and do-overs we experienced with our older part-manual, part-automated release processes. In this article, we will explore and explain the scripts and infrastructure decisions that make up the complete Firefox rapid release system, as of Firefox 10.
You'll follow the system from the perspective of a release-worthy Mercurial changeset as it is turned into a release candidate and then a public release available to more than 450 million daily users worldwide. We'll start with builds and code signing, then customized partner and localization repacks, the QA process, and how we generate updates for every supported version, platform and localization. Each of these steps must be completed before the release can be pushed out to Mozilla Community's network of mirrors, which provides the downloads to our users.
We'll look at some of the decisions that have been made to improve this process; for example, our sanity-checking script that helps eliminate much of what used to be vulnerable to human error; our automated signing script; our integration of mobile releases into the desktop release process; patcher, where updates are created; and AUS (Application Update Service), where updates are served to our users across multiple versions of the software.
This article describes the mechanics of how we generate release builds for Firefox. Most of this article details the significant steps that occur in a release process once the builds start, but there is also plenty of complex cross-group communication to deal with before Release Engineering even starts to generate release builds, so let's start there.
Look N Ways Before You Start a Release
When we started on the project to improve Mozilla's release process, we began with the premise that the more popular Firefox became, the more users we would have, and the more attractive a target Firefox would become to blackhat hackers looking for security vulnerabilities to exploit. Also, the more popular Firefox became, the more users we would have to protect from a newly discovered security vulnerability, so the more important it would be to be able to deliver a security fix as quickly as possible. We even have a term for this: a "chemspill" release (short for "chemical spill"). Instead of being surprised by the occasional need for a chemspill release in between our regularly scheduled releases, we decided to plan as if every release could be a chemspill release, and designed our release automation accordingly.
Figure 1: Getting from code to "go to build."
This mindset has three important consequences:
1. We do a postmortem after every release, and look to see where things could be made smoother, easier, and faster next time. If at all possible, we find and fix at least one thing, no matter how small, immediately before the next release. This constant polishing of our release automation means we're always looking for new ways to rely on less human involvement while also improving robustness and turnaround time. A lot of effort is spent making our tools and processes bulletproof so that "rare" events like network hiccups, disk space issues or typos made by real live humans are caught and handled as early as possible. Even though we're already fast enough for regular, non-chemspill releases, we want to reduce the risk of any human error in a future release. This is especially true in a chemspill release.
2. When we do have a chemspill release, the more robust the release automation, the less stressed the humans in Release Engineering are. We're used to the idea of going as fast as possible with calm precision, and we've built tools to do this as safely and robustly as we know how. Less stress means more calm and precise work within a well-rehearsed process, which in turn helps chemspill releases go smoothly.
3. We created a Mozilla-wide "go to build" process. When doing a regular (non-chemspill) release, it's possible to have everyone look through the same bug triage queries, see clearly when the last fix was landed and tested successfully, and reach consensus on when to start builds. However, in a chemspill release where minutes matter keeping track of all the details of the issue as well as following up bug confirmations and fixes gets very tricky very quickly. To reduce complexity and the risk of mistakes, Mozilla now has a full-time person to track the readiness of the code for a "go to build" decision. Changing processes during a chemspill is risky, so in order to make sure everyone is familiar with the process when minutes matter, we use this same process for chemspill and regular releases.
Figure 2: Complete release timeline, using a chemspill as example.
Who can send the "Go to Build"? Before the start of the release, one person is designated to assume responsibility for coordinating the entire release. This person needs to attend triage meetings, understand the background context on all the work being landed, referee bug severity disputes fairly, approve landing of late-breaking changes, and make tough back-out decisions. Additionally, on the actual release day this person is on point for all communications with the different groups (developers, QA, Release Engineering, website developers, PR, marketing, and so on.).
Different companies use different titles for this role. Some titles we've heard include Release Manager, Release Engineer, Program Manager, Project Manager, Product Manager, Product Czar, Release Driver. In this article, we use the term "Release Coordinator" as we feel it most clearly defines the role in our process as described above. The important point here is that the role, and the final authority of the role, is clearly understood by everyone before the release starts, regardless of their background or previous work experiences elsewhere. In the heat of a release day, it is important that everyone know to abide by, and trust, the coordination decisions the Release Coordinator makes.
The Release Coordinator is the only person outside of Release Engineering who is authorized to send "stop builds" emails if a show-stopper problem is discovered. Any reports of suspected show-stopper problems are redirected to the Release Coordinator, who will evaluate, make the final go/no-go decision and communicate that decision to everyone in a timely manner. In the heat of the moment of a release day, we all have to abide by, and trust, the coordination decisions this person makes.
How is the "Go to Build" sent? Early experiments with sending "go to build" in IRC channels or verbally over the phone led to misunderstandings, occasionally causing problems for the release in progress. Therefore, we now require that the "go to build" signal for every release is done by email, to a mailing list that includes everyone across all groups involved in release processes. The subject of the email includes "go to build" and the explicit product name and version number; for example:
go to build Firefox 6.0.1.
Similarly, if a problem is found in the release, the Release Coordinator will send a new "all stop" email to the same mailing list, with a new subject line. We found that it was not sufficient to just hit reply on the most recent email about the release; email threading in some email clients caused some people to not notice the "all stop" email if it was way down a long and unrelated thread.
What is in the "Go to Build" email? The exact code to be used in the build; ideally, the URL to the specific change in the source code repository that the release builds are to be created from.
- Instructions like "use the latest code" are never acceptable; in one release, after the "go to build" email was sent and before builds started, a well-intentioned developer landed a change, without approval, in the wrong branch. The release included that unwanted change in the builds. Thankfully the mistake was caught before we shipped, but we did have to delay the release while we did a full stop and rebuilt everything.
- In a time-based version control system , be fully explicit about the exact time to use; give the time down to seconds, and specify timezone. In one release, when Firefox was still based on CVS, the Release Coordinator specified the cutoff time to be used for the builds but did not give the timezone. By the time Release Engineering noticed the missing timezone info, the Release Coordinator was asleep. Release Engineering correctly guessed that the intent was local time (in California), but in a late-night mixup over PDT instead of PST we ended up missing the last critical bug fix. This was caught by QA before we shipped, but we had to stop builds and start the build over using the correct cutoff time.
A clear sense of the urgency for this particular release. While it sounds obvious, it is important when handling some important edge cases, so here is a quick summary:
- Some releases are "routine", and can be worked on in normal working hours. They are a pre-scheduled release, they are on schedule, and there is no emergency. Of course, all release builds need to be created in a timely manner, but there is no need for release engineers to pull all-nighters and burn out for a routine release. Instead, we schedule them properly in advance so everything stays on schedule with people working normal hours. This keeps people fresh and better able to handle unscheduled urgent work if the need arises.
- Some releases are "chemspills," and are urgent, where minutes matter. These are typically to fix a published security exploit, or to fix a newly introduced top-crash problem impacting a large percentage of our user base. Chemspills need to be created as quickly as possible and are typically not prescheduled releases.
- Some releases change from routine to chemspill or from chemspill to routine. For example, if a security fix in a routine release was accidentally leaked, it is now a chemspill release. If a business requirement like a "special sneak preview" release for an upcoming conference announcement was delayed for business reasons, the release now changes from chemspill to routine.
- Some releases have different people holding different opinions on whether the release is normal or urgent, depending on their perspective on the fixes being shipped in the release.
The Release Coordinator balances all the facts and opinions, reaches a decision, and then communicates that decision about urgency consistently across all groups. If new information arrives, the Release Coordinator reassesses, and then communicates the new urgency to all the same groups. Having some groups believe a release is a chemspill, while other groups believe the same release is routine can be destructive to cross-group cohesion.
Finally, these emails also became very useful to measure where time was spent during a release. While they are only accurate to wall-clock time resolution, this accuracy is really helpful when figuring out where next to focus our efforts on making things faster. As the old adage goes, before you can improve something, you have to be able to measure it.
Throughout the beta cycle for Firefox we also do weekly releases from our mozilla-beta repository. Each one of these beta releases goes through our usual full release automation, and is treated almost identically to our regular final releases. To minimize surprises during a release, our intent is to have no new untested changes to release automation or infrastructure by the time we start the final release builds.