Scalable Request Handling: An Odyssey (Part 2)
One of the primary functions of any website is the ability to serve content based on a given user request. This is the second article in a three-part series. Last time we looked at BuzzFeed’s architecture and discussed how user requests were being handled, as well as what a new service (Site Router) would look like and why it needed to exist.
In this segment we aim to explain the design of our Site Router service layer, its configuration interface, and some examples of the key features.
So pour yourself a 🍹 and let’s get ready for some fun 🎉
So what do we have?
To understand what our architecture looks like now, let’s first look back at what it used to be:
In this flow diagram we have a user, who makes a request for some content. That content request first ends up at our CDN layer, is inspected (possibly manipulated), and finally proxied through to an appropriate origin server, which provides the content that is sent back to the user.
This diagram is a simple ‘cold-cache’ scenario: the CDN hasn’t already cached the origin’s response, so the request flows through to the origin server. Once that request happens, the CDN will be able to cache the content so that future requests don’t reach our origins. In reality, when additional requests come in, we have lots of caching strategy logic to prevent origin servers from becoming overloaded (in addition to load balancers and auto scaling groups to handle the distribution and scaling of our microservices).
Something to be aware of, though, is that due to the distributed nature of the web, a user in the UK will be routed to a different POP than a user located in the US, for example. This means a cached request in the UK won’t necessarily be available to a US user (as their POP will have its own set of caches). There are also multiple nodes within a POP, meaning multiple UK-based requests can fluctuate between seeing a cache HIT vs a cache MISS (depending on the node those requests are routed to).
Now, as far as our new routing layer is concerned, there is only a small change to this request flow (if this is all new to you, then go check out Part 1):
We have added an additional network hop into the mix: Site Router. Our routing logic has now been extracted from the CDN and moved into a separate “router” microservice. We do still have some basic routing logic within the CDN layer (generally anything that makes sense to be handled at the network edge — i.e. being nearer to our users), but the vast majority of logic is now sitting within the Site Router layer.
Design benefits?
The benefits of the new Site Router service outweigh the negligible overhead of an additional network hop, and end-to-end load testing showed that introducing the service did not significantly degrade performance.
Meanwhile, the developer experience has improved massively: teams engaging with the service immediately told us they were able to easily add new routes and A/B test specialised behaviour (as well as utilise the other features we’ll cover in more detail below), all via a simple YAML configuration file which Site Router provides as part of its consumer interface.
Site Router was also designed to run on our Rig platform, which facilitates easy deployment of the service to each of our ECS environments (test, stage and production), allowing developers to verify integration behaviour with other services before sending changes out to production.
Site Router’s Foundation
The Site Router service was originally built upon the popular and highly performant open-source web server NGINX. Over time it has evolved to use the commercial version, NGINX Plus, as we came to realise that our usage was complex enough to justify some of its specific features: DNS monitoring, service discovery, advanced metrics, live activity monitoring and tech support, to name a few.
Due to its asynchronous, event-driven architecture, NGINX is commonly used as a reverse proxy, which is exactly the pattern we were looking to implement for our Site Router service. NGINX’s ubiquity meant many of our engineers were already familiar with it, and it has a stable, proven track record of managing high-traffic workloads. This made it the right choice for BuzzFeed.
Configuration Design
Although NGINX is well known and easy to pick up quickly, we decided to implement a thin abstraction layer on top of it, so that our Site Router service would allow teams not only to add new routing requirements easily but to do so without having to worry about best-practice configuration.
This meant we could avoid engineers accidentally modifying some subtle behaviour that would impact our overall performance goals, which was one of the primary concerns that made us want to change our VCL setup in the first place.
In order to achieve these goals we designed a small configuration-based interface using a simple YAML file. By using this configuration abstraction we are able to dynamically generate the required nginx.conf file (used by NGINX) with the help of the Python programming language.
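To make that concrete, here’s a minimal sketch of what that generation step could look like, assuming a Jinja2 template engine (whose dictsort feature we mention later on). The file names and template are illustrative rather than Site Router’s actual code:

import yaml
from jinja2 import Environment, FileSystemLoader

# Load the microservice's YAML configuration.
with open('config.yml') as f:
    config = yaml.safe_load(f)

# Render an nginx.conf from a template, feeding it the parsed config.
env = Environment(loader=FileSystemLoader('templates'))
rendered = env.get_template('nginx.conf.j2').render(config=config)

with open('nginx.conf', 'w') as f:
    f.write(rendered)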
Each of our mono repo’s microservices already has a config.yml defined, provided as part of the Rig platform we utilise for testing and deploying our services, so we decided that the interfacing layer should use the same configuration file.
It’s probably worth saying now that describing our configuration file is a lot more complicated than actually using it 🙂 . That being said, before we see an example configuration file for our Site Router service, it’s best to start with a minimal config.yml file, as that will help to introduce the concept of configuration on a ‘per-environment’ basis:
default:
  debug: false

dev:
  debug: true
  foo: 1

test:
  foo: 2

stage:
  foo: 3

prod:
  foo: 4
Notice in the above configuration that we have various environments (dev, test, stage and prod), but we also have something referred to as ‘default’. Default serves as the fallback for any non-existent keys in a specific environment’s configuration. For example, we commonly utilise the default block to set the debug log level to “info” while overriding it to “debug” in the development environment.
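In Python terms, the fallback semantics can be sketched as a simple dictionary merge (this is an illustration of the behaviour, not Rig’s actual implementation):

import yaml

config = yaml.safe_load("""
default:
  debug: false
dev:
  debug: true
  foo: 1
test:
  foo: 2
""")

def config_for(env):
    # Environment values win; anything missing falls back to 'default'.
    return {**config.get('default', {}), **config.get(env, {})}

print(config_for('dev'))   # {'debug': True, 'foo': 1}
print(config_for('test'))  # {'debug': False, 'foo': 2}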
Now, here’s a trimmed-down example of Site Router’s own config.yml:
default:
  config:
    upstreams: &upstreams
      foo: 'foo.stage.buzzfeed.io:443'
      bar: 'bar.stage.buzzfeed.io:443'
      static: 'static.s3.amazonaws.com:443'
      monolith: 'monolith.buzzfeed.com:443'

    fallbacks: &fallbacks
      monolith: 'monolith'

    timeouts: &timeouts
      foo: '5s'
      bar: '15s'
      fallback: '30s'

    log_format: &log_format
      host: '$host'
      office_ip: '$http_x_office_ip'

    locations: ...
Our YAML configuration file holds quite a bit of information related to NGINX and the origin services against which we can define routing behaviour. Breaking apart this chunk of configuration, we can see we have the following sections:
- upstreams and timeouts: upstreams are the servers that return the relevant content for incoming client requests. Engineers can add new upstreams simply by specifying their internal hostname (internally this uses NGINX’s upstream directive). Timeouts allow engineers to define global timeouts for each upstream. Engineers can also override these upstream-specific timeouts on a per-route basis, which internally uses NGINX’s proxy_read_timeout directive.
- fallbacks: we allow engineers to define a failover upstream to handle a particular request if the initially intended upstream happens to error (internally this uses NGINX’s error_page directive). The value assigned must map to an existing upstream key.
- log_format: this lets engineers define additional variables to appear in NGINX’s request log output (internally this uses NGINX’s log_format directive).
- locations: this is the most important section for engineers as it’s where all our routing logic is placed, although it doesn’t actually appear in the main config.yml file exactly like I’ve shown above. Don’t worry, I’ll explain what I mean by that in just a moment.
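To give a flavour of the kind of expansion the abstraction performs, here’s a toy Jinja2 fragment that turns an ‘upstreams’ section into NGINX upstream blocks. The real template is more involved; this fragment is purely illustrative:

from jinja2 import Template

# Illustrative only: expand each configured upstream into an
# NGINX `upstream` block.
template = Template("""
{% for name, host in upstreams.items() %}
upstream {{ name }} {
    server {{ host }};
}
{% endfor %}
""")

print(template.render(upstreams={
    'foo': 'foo.stage.buzzfeed.io:443',
    'bar': 'bar.stage.buzzfeed.io:443',
}))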
We lean heavily on some YAML-specific features such as “Anchors and Aliases” and “Merge Key”, which allow us to reuse configuration across environment blocks and to easily override specific keys without repeating huge chunks of configuration. We use this approach primarily for our ‘upstreams’ and ‘timeouts’ sections, as we usually want to override those values to be environment specific.
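As a quick demonstration of those YAML features, PyYAML resolves anchors (&), aliases (*) and merge keys (<<:) at load time, which lets an environment override a single key while inheriting the rest:

import yaml

doc = yaml.safe_load("""
default:
  timeouts: &timeouts
    foo: '5s'
    bar: '15s'

prod:
  timeouts:
    <<: *timeouts
    bar: '30s'   # override just this key for production
""")

# bar is overridden to '30s'; foo falls back to the anchored '5s'.
print(doc['prod']['timeouts'])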
Elsewhere in our configuration we define a dns_resolver_ttl_override key that allows us to define the TTL for name server DNS resolution. This setting internally uses NGINX’s resolver directive, and we’ll cover why we do this in part three of the series, where we discuss the various technical challenges we’ve encountered whilst designing and building Site Router.
This section concludes the general configuration model that is provided as part of our Rig platform. In the following sections we’ll start to dig into the Site Router specific configuration model and cover the various features our abstraction provides.
Routing Configuration Design
locations:
  - url: '^/jobs$'
    redirect: '/about/jobs'
    description: '...'
    example: '...'

  - url: '^/amphtml(?:/.*)?$'
    upstream: '...'
    description: '...'
    example: '...'
In the above example routing configuration you can see how just a few simple keys are able to offer engineers a lot of control over how incoming requests are handled. Let’s take a look at the individual keys and understand what they represent:
- url: mandatory key; a regular expression used to find matches on an incoming request.
- redirect: returns a 301 Moved Permanently status code to the new path.
- upstream: defines the backend that will handle the requested path.
- description: provides more context on the regex and the purpose of the route.
- example: links to an online regex testing tool for quick verification purposes.
Features
So at this point we’ve seen two fundamental pieces of our configuration model: the foundational pieces (such as defining upstreams and timeouts for those upstreams) and the routing configuration. Let’s now review a small selection of the key abstractions we’ve built into the Site Router service.
Overriding Behaviours
One of the most common things we need to do (other than define a route) is to override that route’s behaviour based upon details within the incoming request. This is important as it allows engineers to do things such as test a new service backend or handle path rewriting for users based in specific localities, both of which are detailed further below.
In the following example route we have defined some behaviour that states the path /foo should be proxied onto the “foo” upstream, but if we find the request includes a query param of service=new_thing then we should change the upstream that was going to handle the proxied request to be the “bar” upstream instead (you can imagine how engineers use this feature to verify a new internal service is routed to and functioning correctly in production).
locations:
  - url: '^/foo$'
    upstream: 'foo'
    overrides:
      01_new_service:
        variable: '$query_string'
        match: '~*service=new_thing'
        upstream: 'bar'
The variable key accepts any valid NGINX variable (of which there are many!). The match key utilises regular expressions to help find a match. The symbols used at the start of the match pattern (~*) are NGINX-specific and indicate that the regex pattern specified will be case-insensitive.
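If you’re unfamiliar with NGINX’s match operators, the override above behaves roughly like this Python equivalent (a sketch of the semantics only):

import re

# NGINX's '~*' is a case-insensitive regex match; the override above
# is roughly equivalent to the following:
query_string = 'service=NEW_THING&other=1'

if re.search(r'service=new_thing', query_string, re.IGNORECASE):
    upstream = 'bar'   # the override wins
else:
    upstream = 'foo'   # the location's default upstream

print(upstream)  # bar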
Failovers
Failovers are where we try a different upstream to ensure we’re serving some form of content when our chosen upstream fails to return something useful.
In the following example route we have defined an upstream for the path /foo, and if that particular proxied request to the “foo” upstream returns a 404 Not Found status, then we retry the request using the monolith upstream instead.
locations:
  - url: '^/foo$'
    upstream: 'foo'
    fallback:
      upstream: 'monolith'
The reference to monolith is actually a mapping back to a key defined within the ‘fallbacks’ section of our main configuration file, which ultimately resolves to monolith.buzzfeed.com. Engineers can define their own status codes for which they expect a fallback to be applied by using the intercept_codes key, a space-separated set of codes to listen out for (e.g. intercept_codes: 404 500 503).
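The semantics are roughly as follows. This is a simplified sketch, not Site Router’s internals; the Response type and the toy upstreams are made up for illustration:

from collections import namedtuple

Response = namedtuple('Response', ['status', 'body'])

# Parsed from e.g. `intercept_codes: 404 500 503` in the YAML.
INTERCEPT_CODES = {int(code) for code in '404 500 503'.split()}

def handle(request, primary, fallback):
    # Try the primary upstream; on an intercepted status code,
    # retry the same request against the fallback upstream.
    response = primary(request)
    if response.status in INTERCEPT_CODES:
        return fallback(request)
    return response

# Toy upstreams standing in for 'foo' and 'monolith'.
foo = lambda req: Response(404, 'not here')
monolith = lambda req: Response(200, 'found it')

print(handle('/foo', foo, monolith).body)  # found it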
An example of this fallback in action: we had rolled out a new service responsible for rendering our primary article pages, but it didn’t yet support all of the data that drives those pages, so we needed to rely on our original monolith service to render the content for the specific pages the new service wasn’t ready to handle.
Path Rewriting
Path rewriting allows us to change the intended request path from what the user specified. It can be useful in a number of scenarios, such as when users make a common misspelling of a particular resource. Engineers use the path key, at the top level of a location block, to trigger a rewrite.
The standard use case is simple enough, but let’s look at a more detailed example where we needed to change the path within an override block instead:
locations:
  - url: '^/(?<edition>(?:uk|us|de)/)?(?<page>\w+)$'
    upstream: 'foo'
    overrides:
      01_use_qa_boxes:
        variable: '$host'
        match: '~*qahost(?<num>[12456]).buzzfeed.com'
        upstream: 'webapp_qa$num'
        path: '${page}'
What the above example demonstrates is that a user can make a request for something like /uk/foo, but if they were using a QA host (e.g. qahost1.buzzfeed.com), we’ll override the upstream whilst also rewriting the path. The reason we rewrite the path in this example is that the upstream being set as the override doesn’t support the original path: the foo upstream supports /<edition>/<path>, whereas the override only supports /<path>.
Below are some example paths that would match this location block, and, if using a QA host, what they would be rewritten to within the override block:
- /uk/foo → /foo
- /us/bar → /bar
- /de/baz → /baz
To help us achieve this behaviour we require the use of named capture groups to define explicit segments of the incoming path. So we can see we’ve split the path into two pieces:
- Edition: (?<edition>(?:uk|us|de)/)
- Page: (?<page>\w+)
When rewriting the path (within the override block), we can now reference the ‘page’ named capture group like so: ${page}, and thus omit the ‘edition’ portion of the path.
Similarly, in the override block, we’re also able to utilise named capture groups to help us identify a specific host (e.g. qahost1, qahost2, etc). This enables us to proxy the request onto a specific host without having to duplicate the override multiple times, each copy functionally equivalent but simply checking the $host variable for a different value. In the example above this is demonstrated by identifying and capturing the numerical value after the ‘qahost’ part of the hostname, e.g. (?<num>[12456]), so that when defining the upstream we can use the num capture group as part of the value assigned: webapp_qa$num.
Meaning, we can ensure requests with specific hosts reach the intended upstream:
- qahost1.buzzfeed.com → webapp_qa1
- qahost2.buzzfeed.com → webapp_qa2
- qahost4.buzzfeed.com → webapp_qa4
(Note that the [12456] character class doesn’t include 3, so qahost3 wouldn’t match this override.)
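Since Python’s re module spells named groups as (?P<name>…) where NGINX’s PCRE also accepts (?<name>…), here’s an equivalent sketch of both captures at work:

import re

LOCATION = re.compile(r'^/(?P<edition>(?:uk|us|de)/)?(?P<page>\w+)$')
QA_HOST = re.compile(r'qahost(?P<num>[12456])\.buzzfeed\.com', re.IGNORECASE)

m = LOCATION.match('/uk/foo')
print(m.group('page'))               # foo -> the rewritten path drops the edition

h = QA_HOST.search('qahost1.buzzfeed.com')
print('webapp_qa' + h.group('num'))  # webapp_qa1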
Location precedence and data structures
When looking at how to override specific routing behaviour using the overrides key, there were two things I glossed over at the time:
- Location and override match order.
- The naming format required for override keys.
The problem that we ultimately face is one of ‘precedence’. NGINX location blocks are “first match wins”, which means the order of the location blocks is important, because if you have two routes (in this order):
- /foo
- /foobar
you’ll find that requests for /foobar will be handled by the first route and not the second, as its pattern matching is too ‘loose’. This is why, when using regular expressions, we tend to err on the side of caution and use very explicit anchors (e.g. ^ and $), so that instead of /foo we would likely use something more controlled like ^/foo$.
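You can see the difference quickly with Python’s re module, which mirrors the NGINX matching behaviour here:

import re

print(bool(re.search(r'/foo', '/foobar')))       # True  -> the loose pattern claims /foobar
print(bool(re.search(r'^/foo$', '/foobar')))     # False -> anchored, no match
print(bool(re.search(r'^/foobar$', '/foobar')))  # True  -> the intended route matches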
The location precedence issue is also why we used a list data structure for our location blocks, as this allowed us to guarantee the order of the routes when we transformed the YAML into Python, whereas dictionaries did not have the same guarantee of insertion order (well, that was the case up until about a month ago, when it was announced that future Python versions would start to guarantee dictionary insertion order).
Unfortunately our individual overrides couldn’t also be defined as lists, as we needed the ability to use YAML’s merge syntax to reuse large chunks of configuration. By using a YAML object, we could define an override outside of a specific location block, and then use the YAML merge key syntax (<<:) to insert the override into multiple location blocks, thus reducing lots of duplication. In doing so, we lost the ability to guarantee the override order. To solve that problem we decided that override keys should be numerically based, so that we could use our template engine’s dictsort feature to order the keys for us.
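For example, with numerically prefixed keys, Jinja2’s dictsort filter yields the overrides in a stable order regardless of how the dictionary was built (the second key name here is made up for illustration):

from jinja2 import Template

# dictsort yields (key, value) pairs sorted by key, so numeric
# prefixes give a deterministic override order.
template = Template(
    '{% for name, override in overrides|dictsort %}{{ name }} {% endfor %}'
)

overrides = {
    '02_other_thing': {'upstream': 'baz'},
    '01_new_service': {'upstream': 'bar'},
}
print(template.render(overrides=overrides))  # 01_new_service 02_other_thing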
Conclusion
That’s it for part two of this three-part series. To recap, we looked at:
- The old and new request flow and how Site Router fits into that.
- Site Router’s configuration interface and how it builds upon our Rig platform.
- The routing design and the various options available.
- Real world examples for some of Site Router’s features.
In the final part of this series we’ll look in more detail at the various technical challenges we stumbled across (and fixed), as well as how we debug Site Router issues and how we use integration tests to verify Site Router behaves the way we expect it to.
Come join in!
If any of this sounds interesting to you (the challenges, the problem solving, and the tech), then please do get in touch with us. We’re always on the lookout for talented and diverse people, and we’re proactively expanding our teams across all locations around the globe.
We’d love to meet you! ❤️
To keep in touch with us here and find out what’s going on at BuzzFeed Tech, be sure to follow us on Twitter @BuzzFeedExp where a member of our Tech team takes over the handle for a week!