Deploy with haste: the story of rig
This is a tale of infrastructure evolution. In the span of 12 months, we went from large, release-oriented deploys to an architecture and approach that enabled 150+ deploys a day. Getting there wasn’t easy, but we learned a lot along the way…
Organizational Debt
BuzzFeed was launched over ten years ago with a small engineering team supporting a single monolithic application that powered buzzfeed.com. This worked great and was absolutely the right choice given the requirements at the time!
Fast forward to 2015, a year where the engineering team nearly doubled. The tools and interactions that previously worked weren’t scaling with our expanding team, infrastructure, and products.
Our release-driven approach to deploying the monolith was really beginning to show its age. Releases had grown in size and scope until each one required a designated “manager” responsible for shepherding it into production. We soon discovered that large, infrequent deploys, coupled with a general lack of observability, greatly increased the risk of every deploy by making production issues difficult to debug and resolve. At its worst, it took us days to deploy and validate a release.
Simultaneously, our company focus and suite of products were evolving. The growing popularity of our video content and a general shift toward a distributed-first content strategy meant that we were building systems (and engineering teams to support them) outside of the buzzfeed.com monolith with vastly different requirements.
We felt that we needed a service-oriented architecture that better reflected our maturing distributed organizational structure and desired workflow.
The Engineering Experience
At this point, we lived in a world where building a new system involved a series of high-friction coordination points in the form of JIRA ticket queues, manual resource provisioning, and a little dose of “doing it live”. The process was woefully inconsistent, lacked transparency, and was really hard to predict. It was a classic game of hot potato with a huge wall in the middle.
A development and deployment pipeline affects everything you do — when it’s cumbersome and inefficient it makes it hard to experiment and iterate, and can overwhelm you with technical debt. Teams are forced to just get it done, eliminating opportunities for consistency, conventions, and standards.
We also needed better abstractions. Most importantly, we wanted to define a “standard interface” for an application and, in return, provide a guarantee that if it met those requirements it would Just Work™. We didn’t want to require teams to touch byzantine config management code or say the magic words in order to deploy.
Finally, we wanted teams to be more engaged after deploying, when systems are in production. We wanted to foster an engineering culture where we collaborate on debugging and resolving production issues, and that responsibility doesn’t fall squarely on Site Reliability. In order to do that well, we needed all services to be instrumented and monitored — and needed tools to be widely accessible and easy to use. Observability should be a natural side effect of building an application in a standard way.
So we set out on a journey to improve the “engineering experience” — but what on earth is that?
The availability, ergonomics, reliability, and effectiveness of the tools and approaches used to develop, validate, deploy, and operate software systems.
Basically, we’re answering this question: How often, in the process of trying to build a product, do you want to throw your laptop? We want that number to be zero.
Proof of Concept
It all started during a hack week in January 2016. A few of us in the Infrastructure group decided to build a proof-of-concept PaaS.
(Yes, everyone on earth is currently building their own PaaS. See Why Didn’t You Use X? for why we built one, too!)
Having only five days to get something working, we needed to take some shortcuts. Obviously, we spent most of those five days agonizing over the name. In the end, we came up with rig:
v. Set up (equipment or a device or structure), typically hastily or in makeshift fashion

Seems appropriate, right? We felt pretty strongly about a few key properties and requirements:
- Autonomy: Provide standards and tooling to accelerate product engineering teams.
- Freedom: Creating and deploying a new service should require no coordination. Make it easy to do the right thing and trust users by default.
- Efficiency: Deploy applications to clusters of homogeneous compute resources.
- Pragmatism: Build on top of AWS’s battle-tested platform of EC2, ELB, and its container scheduling service, ECS.
- Completeness: Standardize and control the pipeline from development through CI to production, including management of secrets.
- Observability: Out-of-the-box support for distributed logging, instrumentation, and monitoring at both the system and application level.
Taking inspiration from the high-level application abstraction found in PaaSTA, we began by building out a CLI in Python to serve as the entrypoint for users.
We quickly adopted a basic set of conventions, e.g. a service is a top-level directory in the monorepo that contains:
- service.yml — the intrinsic properties of a service, e.g. CPU/MEM reservation, instance counts, network interfaces, and organizational metadata (a sketch follows this list).
- Dockerfile — container definition to be built by CI.
- config.yml — plain-text application configuration.
- secrets.<cluster>.gpg — per-environment, GPG-encrypted application secrets (inspired by blackbox).
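To make that concrete, here's a rough sketch of what a service.yml might contain. The field names are illustrative rather than rig's actual schema, but they capture the kind of intrinsic properties we declare per service:

```yaml
# Illustrative only, not rig's real schema.
name: image-resizer        # hypothetical service
team: media-infra          # organizational metadata, e.g. for routing alerts
cpu: 256                   # CPU units reserved per container
mem: 512                   # memory (MB) reserved per container
instances: 2               # desired number of running containers
port: 8080                 # container port exposed via the load balancer
healthcheck: /healthcheck  # path probed by monitoring
```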
Additionally, we established some runtime requirements (a minimal conforming service is sketched after this list):
- It must execute as a stateless process.
- It must read config from the environment.
- It must output logs to stdout/stderr.
- It cannot expect any local state to be persisted.
- It must be robust to failure, and it must start up and shut down quickly, expecting that to happen at arbitrary and unexpected times.
- HTTP services must implement a health endpoint.
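As a toy illustration (not anything we actually ship), here's roughly what a minimal conforming HTTP service looks like: stateless, configured from the environment, logging to stdout, with a health endpoint and a fast shutdown.

```python
import os
import signal
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

# Configuration is read from the environment, never from local state.
PORT = int(os.environ.get("PORT", "8080"))
GREETING = os.environ.get("GREETING", "hello")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health endpoint so the platform can verify the service is alive.
        body = b"OK" if self.path == "/healthcheck" else GREETING.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Logs go to stdout; the platform is responsible for shipping them.
        sys.stdout.write(fmt % args + "\n")

def shutdown(signum, frame):
    # Start up and shut down quickly; the scheduler may stop us at any time.
    sys.exit(0)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, shutdown)
    HTTPServer(("0.0.0.0", PORT), Handler).serve_forever()
```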
We created a VM-based development workflow to provide a consistent, repeatable, and ephemeral environment in which to run our tooling and infrastructure. Inside the VM, the CLI serves as the primary interface. Running a service, along with its dependencies like mysql, redis, and other services, is as simple as rig run foo. This is possible because of the conventions laid out above and the convenient packaging and tooling provided by Docker.

Additionally, rig supports live code reloading when running a service to preserve a fast feedback loop during development. This works because we mount the repo into the VM and into the container. Running tests is similarly trivial (rig test foo).
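Under the hood, this amounts to ordinary Docker operations. Here's a loose sketch of the idea using the Docker SDK for Python; it shows the shape of the workflow, not rig's actual implementation, and the paths and names are made up:

```python
import docker

def run_service(name, repo_path, port=8080):
    """Roughly what `rig run <service>` boils down to (simplified sketch)."""
    client = docker.from_env()

    # Build the service's image from the Dockerfile in its monorepo directory.
    image, _ = client.images.build(path=f"{repo_path}/{name}", tag=f"{name}:dev")

    # Run the container with the service directory mounted in, so code changes
    # on the host are visible inside the container for live reloading.
    return client.containers.run(
        image.id,
        detach=True,
        environment={"PORT": str(port)},   # config comes from the environment
        ports={f"{port}/tcp": port},       # expose the service locally
        volumes={f"{repo_path}/{name}": {"bind": "/app", "mode": "rw"}},
    )
```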
As you push commits up to your branch, builder (a bundled rig service based on Jenkins) integrates your code, performs global sanity checks, builds the container image, runs tests, and pushes the image to a container registry.
Finally, deploys are initiated from a web UI, where you select the service, the versioned container image, and a target cluster. The service is then “scheduled” by ECS onto our EC2 instances registered in that cluster, based on the service’s configuration (e.g. its CPU and MEM reservation and desired number of instances). For new HTTP services, this process also provisions the load balancer (with TLS) and DNS entry.
This process typically takes no more than a few minutes end-to-end, depending on the size of the container image and test suite, even for brand-new services! You can imagine the impact this might have on an organization struggling with weeks-long provisioning and deployment!
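For the curious, the scheduling step roughly maps onto a couple of ECS API calls. A simplified sketch with boto3 (it leaves out the load balancer and DNS provisioning, and isn't rig's actual code):

```python
import boto3

ecs = boto3.client("ecs")

def deploy(service, image, cluster, cpu=256, mem=512, count=2):
    """Register a new task definition revision and roll the service onto it."""
    task_def = ecs.register_task_definition(
        family=service,
        containerDefinitions=[{
            "name": service,
            "image": image,      # the versioned image pushed by CI
            "cpu": cpu,          # CPU units reserved, from service.yml
            "memory": mem,       # memory (MB) reserved, from service.yml
            "essential": True,
        }],
    )

    # ECS schedules `count` copies of the new revision onto the cluster's EC2
    # instances, replacing old containers as the new ones become healthy.
    ecs.update_service(
        cluster=cluster,
        service=service,
        taskDefinition=task_def["taskDefinition"]["taskDefinitionArn"],
        desiredCount=count,
    )
```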
Unfortunately, a proof of concept does not necessarily make for a production-worthy system.
Production Rollout and Migration
There were significant reasons to transition to rig, but it wasn’t without risk, primarily operational risk. For example, we hadn’t previously used Docker, a relatively immature technology, or ECS, a fairly new AWS managed service, in a production environment.
Despite its immaturity, one of the most attractive qualities of ECS is its support for “bringing your own infrastructure”. This meant that we could leverage existing rock-solid AWS services like EC2, ELBs, RDS, ElastiCache, etc. and retain tight control over the hosts, including well-understood topology and lifecycle primitives (VPCs, SGs, and ASGs) and the base OS and software they ran. This helped us gain confidence more quickly, as long as we trusted the “black box” that is the ECS managed scheduler.
Still, we had questions around stability, operability, observability, and security — things like:
- How does an ECS cluster handle host failure?
- What happens when the ECS agent or Docker daemon fails?
- At what granularity should we monitor containerized services?
- What is the network performance impact of Docker bridge networking?
We had previously experimented with Terraform to manage our cloud resources. We felt that its declarative nature would enable transparency and support peer review of infrastructure changes. We doubled down on Terraform for rig, using it to provision all of its underlying infrastructure. This gave us an automated and repeatable method of provisioning clusters, which we used to quickly stand up clusters and test hypotheses at the infrastructure level.
We addressed observability in three ways:
- Distributed logging: We shipped all logs to Papertrail, tagged and searchable by service, version, and cluster.
- Instrumentation: We integrated DataDog, which has deep support for Docker but also acts as a standard statsd receiver, aggregating metrics from instrumented services (see the sketch after this list).
- Monitoring: We built monitor (another bundled rig service, based on Nagios), which automatically discovers hosts, services, and other resources running in the cluster, configures alerts, and routes notifications based on service metadata to Slack, PagerDuty, etc.
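Instrumenting a service amounts to emitting statsd metrics to the local agent. A minimal sketch using the datadog Python package (the metric names, tags, and handler here are made up for illustration):

```python
import time

from datadog import initialize, statsd

# The DataDog agent on the host receives statsd packets over UDP.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def do_work(request):
    return "ok"  # placeholder for real application logic

def handle_request(request):
    start = time.time()
    try:
        response = do_work(request)
        statsd.increment("myservice.requests", tags=["status:ok"])
        return response
    except Exception:
        statsd.increment("myservice.requests", tags=["status:error"])
        raise
    finally:
        # Latency histogram for dashboards and alerts.
        statsd.histogram("myservice.request_latency", time.time() - start)
```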
At this point, we felt like we had good monitoring coverage and were collecting the insights we needed to debug failures. After running all of our operational exercises, we selected low-risk, minimal-workload systems to migrate first. This allowed us to gain experience in production, without compromising critical workloads.
Impact
Let’s start with some numbers.
Rig became “generally available” in April of 2016. Since then, we’ve observed double-digit percentage growth in the number of production services each month. By February 2017, we had 227 services in production! We’ve also executed 18,228 deploys since June 2016 (when we started tracking them). That averages out to ~150/day! The sheer number of services and deploys is evidence of the effect it’s had on our rate of delivery.
Culturally, it’s been nothing short of a game changer. As operators, we have enormous leverage because of the consistency with which services are configured, deployed, and run. Investments we make in rig have a huge multiplier effect: when we added service auto-scaling, for example, every service on rig could immediately benefit from it. Product engineering teams feel ownership and empowerment to do their jobs efficiently and effectively, with more time to focus on the “business domain” of the task at hand. If you lower the cost and overhead of building and deploying a new service, you encourage low-risk experimentation and iteration — the ingredients that have made BuzzFeed successful.
Lastly, the efficiency and utilization of our infrastructure have increased dramatically as we schedule services in Tetris-like fashion, vastly reducing costs and allowing us to invest strategically in homogeneous reserved capacity.
Why Didn’t You Use X?
In short, because ECS, Kubernetes, Docker Swarm, and the like are just building blocks, and we wanted to expose a set of opinionated, higher-level abstractions to our users.
The tools in this space rarely provide a robust development environment or have answers for CI, secret management, or monitoring, which means, no matter what, you’re left gluing the puzzle pieces together. Rig is that glue, and provides solutions for all of these problems!
We didn’t build a container scheduler (ECS), a CI system (Jenkins), or even observability tools (Nagios, DataDog, Papertrail) — we just developed a cohesive UX around a set of robust AWS services and established open source projects and gave it a damn good name.
The Future
Our work isn’t done!
There’s really no better way to learn about a system’s pain points than to methodically operate and use it in production over an extended period of time. To get feedback from our internal users, we’ve set up support channels in Slack, run regular office hours, given tech talks, and performed user research through interviews and surveys. Treating infrastructure like an internal product has helped us better understand everyone’s needs and establish a roadmap that blends our vision with changing requirements.
Operationally, we’ve learned that working with Terraform at scale is hard. We’ll talk about this in future posts, but suffice it to say that although it has numerous benefits, we have yet to resolve some of its painful workflow issues. For example, even though it’s automated and repeatable, we want to make it vastly simpler to provision a rig cluster from scratch (we’re very much inspired by kops — perhaps we should just use Kubernetes?).
From a user’s perspective, working with GPG-encrypted secrets isn’t a very friendly or intuitive process, and given that it’s one of the first onboarding steps, we want to simplify or eliminate that friction. Also, although our approach to secrets has been effective, auditing and rotating them is still far too difficult and time-consuming, and we want to make both much easier.
The shift to rig greatly accelerated our transition to a service-oriented architecture. In future posts we’ll talk about other tools and systems we’ve built that help us observe and control the complicated interactions of this growing distributed system, namely our API Gateway.

Finally, we intend to open source all of this in the near future! The hypocrisy isn’t lost on us (see Why Didn’t You Use X?). We’ve got lots of questions about the purpose and value of open sourcing rig — it seems like a huge ask to adopt all of our silly opinions. Still, you just might happen to agree with our approach, in which case we’d love for you to benefit from rig too! At a minimum, sharing our implementation might inspire you to build your own (thanks again, PaaSTA).
Obligatory “if this sounds like the kind of thing you want to work on”, we’re hiring!
P.S. No blog post would be complete without first thanking Wilbur McClutchen. Truth be told, rig has been a significant group effort. Big shoutouts to the Site Reliability and Platform Infra teams for delivering, scaling, and supporting all of this!