LWR (job scheduling) part i: where we are currently, and what needs to change to scale up
In my entire career, I have never seen Release Engineering scale anywhere near Mozilla's current numbers 1. The number of machines is over an order of magnitude larger than the next largest system I've seen. Our compute time for a full set of builds and tests is an order of magnitude larger2; two orders of magnitude larger in terms of compute-hours-per-day3. No other company I've worked for has even attempted per-checkin builds and tests, due to the scale required; we just lived with developer finger-pointing and shouting matches after every broken build.
It's clear to us that our infrastructure is a force multiplier; it's also clear that we need to improve the current state of things to scale an additional order of magnitude.
Work is already well under way to move logic out of buildbotcustom, into mozharness. This is the "how do we run a job" to the scheduling's "when" and "where". As the build and test logic becomes more independent of the scheduling, we gain flexibility as to how we schedule jobs.
Our current implementation of buildbot cannot scale to the degree we need it to. An increase of an order of magnitude would mean tens of thousands of build+test slaves. One million jobs a day. That scale will help the project to develop faster, test faster and more thoroughly, and release better products that are simultaneously more stable and feature-filled. If our infrastructure is a force multiplier, applying a multiplier to the multiplier should result in massive change for good.
If we also make our configs cleaner, we can be smarter about what we schedule, and when. A 10x increase in our capacity would become an even larger amount of headroom than otherwise. Discussions about what we run, and how often, then become more about business value weighed against infrastructure- and human- time costs, rather than about infrastructure limits.
We've been talking about this for years, now, but product 1.0's and other external time pressures have kept it on the back burner. With no 1.0's on the horizon and the ability to measure the cost of things, hopefully we will finally be able to prioritize work on scheduling.
In part 2, I'm going to discuss a high-level overview of our plans and ideas for LWR, our next-gen scheduling system.
In part 3, I'm going to drill down into some hand-wavy LWR specifics, including what we can roll out in phase 1, which is what we were discussing at length last Tuesday. I didn't think I could dive into those specifics without giving some background context first.
1 :joduinn has seen scale like this, but I think Mozilla has surpassed those numbers.
- on-demand (only build when someone pushes a button),
- nightly- or periodic- only,
- the tinderbox model where each build restarts after finishing, or
- a combination of the above.
With a smaller number of builds and tests per set, and a much less frequent rate of running those sets of builds and tests, the total number of compute hours per-day is significantly lower.
4 By "full staging environment", here, I'm not just including a single standalone buildbot master and a single buildbot slave. Depending on what we need to test, this can include a staging instance of self-serve, buildapi, statusdb, clobberer, slavealloc, tbpl, ftp.m.o, graphserver, hg repos (sometimes read-only, but sometimes read-write, e.g. staging releases which tag the repos), sometimes git repos, downstream test master + test slaves, etc. etc., and whatever staging systems we set up in this environment need to communicate with each other and not pollute production systems.tags: