Music Links
26 November 2013

LWR (job scheduling) part i: where we are currently, and what needs to change to scale up


In my entire career, I have never seen Release Engineering scale anywhere near Mozilla's current numbers 1. The number of machines is over an order of magnitude larger than the next largest system I've seen. Our compute time for a full set of builds and tests is an order of magnitude larger2; two orders of magnitude larger in terms of compute-hours-per-day3. No other company I've worked for has even attempted per-checkin builds and tests, due to the scale required; we just lived with developer finger-pointing and shouting matches after every broken build.

It's clear to us that our infrastructure is a force multiplier; it's also clear that we need to improve the current state of things to scale an additional order of magnitude.

Our current implementation of buildbot runs our automation, with issues like:

  • scaling issues:

    • hg polling runs on the masters, causing hangs when the set of changes to be parsed is extremely large (e.g., new pushlog, or when we re-enable an old scheduler);
    • log uploads, job status updates, etc. also happen on the masters, adding more load where we can least afford it;
    • as :catlee points out, we're holding huge dictionaries in-memory, which results in massive duplication of data like this;
    • buildbot needs a persistent connection with slaves. This is great for streaming logs, and poor for load and network robustness;
  • we trigger dependent jobs via sendchange after a build finishes, which prevents us from querying/acting on a set of builds+tests as a single entity.

Our current configs describe our automation jobs, with issues like:

  • They are a RelEng timesink, which in turn makes us more of a bottleneck to the rest of the project:

    • It's more difficult than it should be to predict what the effect of changes will be without actually running them, which is time consuming;
    • it's near impossible to deal with oddball requests without a large amount of overhead.
  • The scheduling is inflexible, which increases costs in terms of human time, infrastructure time, and money:

    • our current scheduling doesn't allow for things like backfilling test or build jobs on previous commits, forcing us to run a larger set of jobs per-checkin than we would need to otherwise. This is inefficient in terms of compute time and money;
    • trying to find ways to jerry-rig alternate scheduling methods for jobs is time consuming. I see our current efforts as stop-gap solutions until we can roll out something more flexible by design.
  • Since it's difficult to get a full staging environment for RelEng4, we're faced with either landing risky patches with minimal testing (risking closing all trees for an indeterminate amount of time), or spending an inordinate amount of time setting up a rough staging environment that may or may not be giving accurate results, depending on whether you typoed something or missed a step.

Work is already well under way to move logic out of buildbotcustom, into mozharness. This is the "how do we run a job" to the scheduling's "when" and "where". As the build and test logic becomes more independent of the scheduling, we gain flexibility as to how we schedule jobs.

Our current implementation of buildbot cannot scale to the degree we need it to. An increase of an order of magnitude would mean tens of thousands of build+test slaves. One million jobs a day. That scale will help the project to develop faster, test faster and more thoroughly, and release better products that are simultaneously more stable and feature-filled. If our infrastructure is a force multiplier, applying a multiplier to the multiplier should result in massive change for good.

If we also make our configs cleaner, we can be smarter about what we schedule, and when. A 10x increase in our capacity would become an even larger amount of headroom than otherwise. Discussions about what we run, and how often, then become more about business value weighed against infrastructure- and human- time costs, rather than about infrastructure limits.

We've been talking about this for years, now, but product 1.0's and other external time pressures have kept it on the back burner. With no 1.0's on the horizon and the ability to measure the cost of things, hopefully we will finally be able to prioritize work on scheduling.

In part 2, I'm going to discuss a high-level overview of our plans and ideas for LWR, our next-gen scheduling system.
In part 3, I'm going to drill down into some hand-wavy LWR specifics, including what we can roll out in phase 1, which is what we were discussing at length last Tuesday. I didn't think I could dive into those specifics without giving some background context first.

1 :joduinn has seen scale like this, but I think Mozilla has surpassed those numbers.

2 We have a much larger matrix of (num_build_platforms · num_build_types) + (num_test_platforms · num_test_types) than any other project I've been a part of.

3 The workflows I've seen elsewhere have included

  1. on-demand (only build when someone pushes a button),
  2. nightly- or periodic- only,
  3. the tinderbox model where each build restarts after finishing, or
  4. a combination of the above.

With a smaller number of builds and tests per set, and a much less frequent rate of running those sets of builds and tests, the total number of compute hours per-day is significantly lower.

4 By "full staging environment", here, I'm not just including a single standalone buildbot master and a single buildbot slave. Depending on what we need to test, this can include a staging instance of self-serve, buildapi, statusdb, clobberer, slavealloc, tbpl, ftp.m.o, graphserver, hg repos (sometimes read-only, but sometimes read-write, e.g. staging releases which tag the repos), sometimes git repos, downstream test master + test slaves, etc. etc., and whatever staging systems we set up in this environment need to communicate with each other and not pollute production systems.