Continuous Deployments at BuzzFeed

Logan McDonald · Published in BuzzFeed Tech · 6 min read · Feb 24, 2020

At BuzzFeed, our Core Infrastructure team is always looking for new ways to improve developer productivity. Recently, we implemented continuous deployment of every merged change to our production environment. This has saved developers time, reduced operator stress, and improved the resilience of our services. In this post, we want to talk about why we invested time in implementing continuous deployments, what they look like for our ecosystem, and how we did it.

Background

Our homegrown platform, rig, powers a microservice ecosystem of over 600 apps including HTTP APIs & UIs, queue readers, one-off jobs, and more. We use a monorepo structure for development, and rather than using more formal, semantically versioned releases, our deploys are tied to a specific git revision (or commit). Every revision is deployable if it passes our automated tests. Before the advent of automated production deploys as described in this post, each deploy required an intentional step by a developer: choosing the app, revision, and target environment through a deploy UI (and deploys to pre-production environments still work this way).

Our deploy_ui, where developers deploy their services manually.
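Conceptually, every one of those manual deploys boils down to picking a (service, revision, environment) triple. Here is a minimal sketch of that shape; the names and fields are illustrative, not rig's actual API:

```python
from dataclasses import dataclass

# Illustrative only: on rig, a deploy targets a specific git revision of an
# app in an environment, rather than a semantically versioned release.
@dataclass(frozen=True)
class DeployRequest:
    app: str          # e.g. "example_service" (hypothetical name)
    revision: str     # git SHA of a commit that passed automated tests
    environment: str  # a pre-production environment name, or "prod"

request = DeployRequest(
    app="example_service",
    revision="4f2a9c1b0e7d",  # placeholder SHA
    environment="prod",
)
```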

Release Engineering

Last year, a few of us interested in learning more about the discipline of “Release Engineering” decided to form a working group to investigate what could be improved about releasing software in our org. This small group wanted to learn more about continuous delivery as a practice and how we as an organization could reduce cost, time, and risk in our release process. We adopted a motto from Rolf Andrew Russell: “Continuous Delivery means minimizing lead time from idea to production and then feeding back to the idea again.”

We took a page out of Google’s SRE handbook Chapter 8 on release engineering and conducted user interviews with a variety of BuzzFeed developers using Google’s categories: self-service release process, high velocity for changes, hermetic builds, and clear policies and procedures for releasing. Through these interviews we found a consistent problem that impacted every category: Even though we preached a “master is golden” approach, we sometimes failed to embody it.

“Master is Golden”

In BuzzFeed’s monorepo, every change merged into master must be ready to be deployed to production at any time. It’s the responsibility of every developer to ensure that their changes are validated before merge, because the nature of our development process means they do not necessarily have control over when it gets deployed after that point.

An important aspect of this approach is ensuring that every change merged into master is deployed ASAP, to minimize the number of changes to debug when something goes wrong.

Though we preached this “master is golden” approach, we suspected that developers were sometimes failing to deploy every app after a change was merged, because it is not uncommon for a single pull request in our monorepo to change many apps at once, leading to a lot of manual clicking to trigger individual deploys.

(Some enterprising developers attempted to mitigate this manual clicking by using our annual Hack Week to create a Chrome extension, affectionately called Deploy Rocket, that tricked our deploy tooling into doing multi-service deploys. While an effective hack, when used for large numbers of deploys it wound up causing issues with rate limits and disrupting other developers.)

Nope, master is (fake) gold

Step one, as always, was measuring to confirm our suspicion that our services were lagging behind master, and the results were a little scary.

Bar graph depicting master builds lagging behind in prod.

There were services that hadn’t been deployed in years. Even worse were those builds on production that were never merged into master. We found cases where feature branches had been running in production for over 500 days. It had become far too easy for changes to be merged and never deployed.

Bar graph depicting non-master builds lagging behind in prod.
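As a rough illustration of the kind of measurement behind charts like these (not our actual tooling): given the revision each service is running in production, git can report both how far that revision lags behind master and whether it contains commits that never made it onto master. The repo path and service-to-revision mapping below are hypothetical.

```python
import subprocess

# Hypothetical mapping of each service to the revision it runs in prod.
deployed_revisions = {
    "example_service_a": "a1b2c3d",
    "example_service_b": "9f8e7d6",
}

def lag_against_master(revision: str, repo: str = ".") -> tuple[int, int]:
    """Return (commits only on the deployed revision, commits it lags behind master)."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-list", "--left-right", "--count",
         f"{revision}...origin/master"],
        capture_output=True, text=True, check=True,
    ).stdout
    ahead, behind = (int(n) for n in out.split())
    return ahead, behind

for service, rev in deployed_revisions.items():
    ahead, behind = lag_against_master(rev)
    print(f"{service}: {behind} commits behind master, {ahead} unmerged commits in prod")
```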

We had the idea that we could address these issues while also improving the resilience of our microservice ecosystem. While every deploy is an opportunity to introduce problems, a more aggressive, automated approach would ensure that we were always deploying the smallest possible change.

So this became our mission: Find a way to safely and reliably ship more consistently and more frequently, to ensure production is always running the latest code. And that’s how continuous deployments at BuzzFeed were born.

Continuous Deployments the BuzzFeed Way

Last year, our team developed a new HTTP API to power the rig platform, which took our platform to the next level, allowing us to simplify our admin services, improve operational security, and build features faster. This API empowered us to programmatically deploy services in the rig ecosystem, and thus, to continuously deploy those services.

To keep it simple, we first set out with some requirements and non-requirements for what continuous deployments at BuzzFeed should look like. Our broad mandate was this: when a pull request is merged, all affected services will be deployed to production. The requirements were:

  • Continuous deployments would only handle our production environment
  • Each merge would trigger a deploy of all applications modified
  • We would present developers with an opt-in period to gain confidence in the new system and then switch to an opt-out approach once it was ready for widespread adoption
  • Quality of service controls, like rate limiting, backoff, and retries, would have to be built in (a minimal sketch of the backoff piece follows these lists)
  • To help developers adjust to this (potentially scary) new approach, they would receive clear indicators when a change would be deployed automatically on merge
The `deploy_on_merge: true` configuration a user sets to opt a service into continuous deployments.
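As a rough sketch of how tooling might honor that flag (rig's real config format and internals differ and are not shown here):

```python
import yaml  # PyYAML; the config format here is a hypothetical stand-in

EXAMPLE_CONFIG = """
name: example_service
deploy_on_merge: true
"""

def opted_in(raw_config: str) -> bool:
    # During the opt-in period, anything not explicitly set to true is
    # treated as opted out; the later opt-out phase flips that default.
    config = yaml.safe_load(raw_config) or {}
    return bool(config.get("deploy_on_merge", False))

print(opted_in(EXAMPLE_CONFIG))  # True
```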

Non-requirements were:

  • Control over the order of deploys
  • Automatic rollbacks of broken deploys
  • Delivery of any non-master branch to a pre-production environment
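The backoff-and-retry requirement, in particular, is easy to picture. The snippet below is a generic sketch of exponential backoff with jitter, not rig's actual implementation:

```python
import random
import time

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff and full jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Jittered backoff avoids thundering herds when a single merge
            # triggers deploys for many services at once.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# Usage (hypothetical helper): with_retries(lambda: trigger_deploy(app, revision))
```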

Given those requirements, we ended up with an architecture that looks like this:

Architecture diagram of continuous deployments flow by Dan Meruelo.

The flow works like this:

  1. A user merges a pull request on GitHub, which is sent to Builder, our CI system.
  2. For each app affected by the merge, Builder produces two deploy artifacts: a Docker image, pushed to ECR, and metadata needed to deploy that revision, pushed to S3.
  3. When a new Docker image is pushed, ECR sends a PutImage event to an SQS queue.
  4. A new deploy_worker app receives those events and, after some validation, makes an HTTP request to rig_controller_api to trigger a deployment.
  5. Finally, deploy_worker updates the originating pull request with the deploy result:
Notification on the pull request that the deploy_worker has deployed to prod with links to the deployment status and application logs.

Meanwhile, the API notifies developers in Slack:

Notification to our deployments Slack channel “deploy_worker is :sparkle: automagically”

And, tada! An “automagical” deployment is created. (And, yes, we even continuously deploy the apps that power continuous deployments).
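To make that flow concrete, here is a heavily simplified sketch of what a worker in that position might do. The queue URL, event fields, and rig_controller_api endpoint are hypothetical stand-ins for internal systems, and validation, opt-out checks, rate limiting, and the pull request and Slack notifications are omitted:

```python
import json

import boto3     # AWS SDK, used here to poll SQS
import requests  # used here to call the deploy API

# All names and URLs below are hypothetical stand-ins for internal systems.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ecr-image-events"
CONTROLLER_API = "https://rig-controller.example.internal/deploys"

sqs = boto3.client("sqs")

def handle_event(body: dict) -> None:
    # The event shape is an assumption: an ECR image push event naming the
    # repository (the app) and the image tag (treated here as the revision).
    detail = body.get("detail", {})
    if detail.get("action-type") != "PUSH":
        return
    app = detail["repository-name"]
    revision = detail["image-tag"]

    # Validation, opt-out checks, and quality-of-service controls go here.
    resp = requests.post(
        CONTROLLER_API,
        json={"app": app, "revision": revision, "environment": "prod"},
        timeout=10,
    )
    resp.raise_for_status()

def run_forever() -> None:
    while True:
        messages = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        ).get("Messages", [])
        for message in messages:
            handle_event(json.loads(message["Body"]))
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"]
            )

if __name__ == "__main__":
    run_forever()
```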

The Impact

At this point, we have moved to an opt-out phase for continuous deployments, with only a few services currently choosing not to participate. In the past week, 85% of our deploys to production have been automated, with the remainder being mostly validation deploys. That's 307 deploys that humans did not have to do.

Just for fun, here were our deploys last Friday:

Feedback from our developers has shown that continuous deployments have not only made their lives easier, but have also made them feel safer about deploying. They are always confident about what state a service is currently in on production, and because services are deployed more often, they feel less likely to break something with any individual change. Now they can get back to the important work… optimizing the ways we tell you how well you see blue.

Thank you to Dan Meruelo and Will McCutchen, who worked on this project (extra thanks to Will for the art and design of the UI!). Thanks also to all of Core Infrastructure, and especially Justin Hines, Jack Ussher-Smith, Dan Katz, Clem Huyghebaert, and Andrew Mulholland, for their work on the rig controller API.
