Scalable Request Handling: An Odyssey (Part 3)
One of the primary functions of any website is serving content in response to a given user request. This is the third and final article in the series.
In Part 1, we looked at BuzzFeed’s architecture and discussed how user requests were being handled, as well as figuring out what a new service (Site Router) would look like and why it needed to exist.
Next, Part 2 of the series explored the original request flow and how Site Router fitted into that. We also reviewed Site Router’s configuration design as well as some ‘real world’ examples of the supported features.
In this final segment, I’ll recount the various technical challenges we stumbled across and fixed. I will also review how we debug Site Router issues and use integration tests to verify Site Router behaves the way we expect it to.
So what’s on the menu?
Here is a list of the items we’ll be covering today:
- Caching old DNS.
- Breaking caching behaviours.
- Splitting up large configuration.
- Lots of redirects are hard to maintain.
- REGEX OMG.
- Unable to use if statements effectively.
- Follow requests through state changes.
- Toggling HTTP headers with maps.
- Active-Active static assets across cloud providers.
- Testing NGINX and mocking upstreams.
- Building test abstractions.
- Rollout Strategy.
- System observability and monitoring.
Prepare yourself, it’s going to get wild… 🐗
Caching old DNS
One of the primary reasons we switched from the open-source NGINX to the commercial version was due to a problem with how NGINX handles the caching of DNS queries.
When NGINX starts up, it resolves the DNS for any upstreams we define in our configuration file. This means it would translate the hostname into an IP (or a set of IPs as was the case when using AWS Elastic Load Balancers). NGINX then proceeds to cache these IPs and this is where our problems started.
If you’re unfamiliar with load balancers: they check multiple servers to see if they’re healthy and, if one isn’t, that server is shut down and a new server instance is brought up. The issue we discovered was that each server instance’s IP was changing. As mentioned earlier, NGINX caches IPs when it boots up, so new requests were coming in and being proxied to a cached IP that was no longer in service (having been rotated out by the ELB).
One solution would be to hot reload the NGINX configuration. That’s a manual process, and we would have no idea when that should be done because there is no way of knowing when a server instance would be rotated out of service.
There were a few options documented by NGINX that tried to work around this issue; however, all but one of them would have meant redesigning our entire architecture, because they didn’t support the use of NGINX’s upstream directive (which we lean on very heavily as part of our configuration abstraction).
The one solution that would let us keep the interface design we had turned out to only be available via NGINX Plus. The cost of going commercial was justified: we wouldn’t need to rewrite our entire service from scratch, and there were additional features we could utilise moving forward (some of which we had already been considering). Once we switched over, the fix was a one-line addition using the resolver directive, which allowed us to define a TTL for the cached DNS resolution.
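For illustration, here is a minimal sketch of what that can look like with NGINX Plus (the resolver IP and hostname are placeholders, not our real values; the resolve parameter on the upstream server is the NGINX Plus feature that pairs with the resolver TTL):

resolver 10.0.0.2 valid=30s;  # re-resolve cached DNS answers every 30 seconds

upstream my_service {
    zone my_service 64k;                        # shared memory zone, required for re-resolution
    server my-service-elb.example.com resolve;  # NGINX Plus: periodically re-resolve the ELB hostname
}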
Breaking caching behaviours
As part of the release process for Site Router, we wanted to roll it out incrementally. This gradual process meant we could privately test and monitor the system to ensure we weren’t introducing a large breaking change into every other service in production.
With this approach in mind, we added some logic to our CDN that would bucket specific users so that their requests would be routed via Site Router (but only if the request had the header X-SiteRouter-Enabled set). All other users would be directed to the services directly as per the normal routing behaviour.
At this point we were concerned about users getting a cache HIT on responses served by Site Router while we were still ‘kicking the tyres’ on the new service. We decided to utilise the HTTP Vary header to serve different cached content depending on the context of the request. But this didn’t work as expected.
The reason it didn’t work was that our CDN logic was incorrect. We had modified the code so that we only added X-SiteRouter-Enabled to the Vary header when that header was set on the incoming request. What we should have done was always add X-SiteRouter-Enabled to the Vary header.
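Our CDN is Fastly, so as a rough sketch (this isn’t our actual CDN code) the corrected behaviour in VCL looks something like this:

sub vcl_fetch {
  # Always vary the cache on the bucketing header, even if the incoming
  # request didn't set it.
  if (beresp.http.Vary) {
    set beresp.http.Vary = beresp.http.Vary ", X-SiteRouter-Enabled";
  } else {
    set beresp.http.Vary = "X-SiteRouter-Enabled";
  }
}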
Caching is hard: be very careful when changing logic related to caching, because you can cause all sorts of problems if you get it wrong.
Splitting up large configuration
Our config.yml contained everything. After about a year we realised we should probably split out the configuration into separate files because it was becoming quite hard to maintain. The situation became even more apparent once we started having to support multiple hosts (e.g. www.buzzfeed.com, www.tasty.co, etc).
The solution was to have a top-level /hosts directory containing the config responsible for host-specific routing logic:
├── hosts
│ ├── buzzfeed-dev.io.yml
│ ├── example.com.yml
│ ├── foo.com.yml
│ └── xyz.co.uk.yml
One benefit of doing this is that we’re now able to introduce new features to the host files while also reducing unnecessary duplication. For example, we’ve added a host_settings key which allows us to abstract away settings on a per-host basis, such as the need to define an upstream value within each location block.
host_settings:
  default_upstream: 'foo'

locations:
  - url: '^/…'
    description: '…'
    example: '…'

  - url: '^/…'
    upstream: 'override_default_upstream'
    description: '…'
    example: '…'
In the above example, our host file defines a single upstream called ‘foo’, which all our location blocks will implicitly use unless overridden on a per location basis.
Lots of redirects are hard to maintain
BuzzFeed has been around a long time, and we’ve seen a lot of services come and go. We’ve accrued a large number of old paths that need redirecting to new service endpoints.
In some cases, we can have hundreds of redirects that need to be implemented, and that becomes quite a lot of ‘noise’ inside of our routing configuration.
To solve this issue we implemented a solution whereby redirects can be placed inside separate configuration files that are dynamically compiled into our main configuration at build time. These are then imported into our main host files like so:
- redirect_file: 'name_of_redirect_file'
  host: '{}.example.com'
The host key allows us to control which host the new redirects should be sent to. The use of {} means we can replace the subdomain to be environment-specific (e.g. stage.example.com or www.example.com, depending on whether Site Router is running in the staging or production environment).
All redirect files live inside a top-level /redirects directory (so they’re easy to locate), and their content looks like this:
- original: /foo
  redirect: /bar
  host: 'www.example.com'  # optional override

- original: /beep
  redirect: /boop
REGEX OMG
We use the power of regular expressions to match incoming request paths to a specific location block’s routing logic. Every now and then a specific feature of a regex pattern will throw you a curveball. One of those was the use of quantifiers (specifically braces).
To demonstrate, the following example is broken:
- url: '/\w{4}'
The reason this is broken is because when we generate the associated location block for this in NGINX code we’ll find it produces the following output:
location ~ /\w{
…
}
Notice how the opening brace has actually been interpreted as a new starting block in the NGINX code, and so this completely breaks the nginx.conf file!
To avoid this issue we discovered we should wrap the regex pattern in quotes.
We already use single quotes in the YAML file but those quotes aren’t persisted once we transform the YAML into Python, so we needed additional quotes inside of them:
- url: '"/\w{4}"'
This will now result in the correct rendering of code within the nginx.conf file:
location ~ "/\w{4}" {
…
}
Follow requests through state changes
Due to the potentially complex set of behaviours a route can have (such as overrides, redirects, path rewriting, failovers, etc.), it becomes increasingly important to be able to follow the logic branches of the code to understand how a request was actually handled.
For example, if I make a request for the path /foo, what actually happens? Did I hit the expected upstream (end of story), or was there an internal path rewrite that meant I went to the upstream but with a new path of /foobar? Maybe I went to a completely different upstream than expected because I’m based in the UK and there is an A/B test currently running that sends me to a new upstream?
To help us identify these specific scenarios we add debug HTTP headers to our responses, which indicate the path taken internally by NGINX. These generally look like the following:
- X-SiteRouter-Upstream: indicates the upstream as defined within configuration.
- X-SiteRouter-Override: indicates the new upstream that we overrode to.
- X-SiteRouter-Fallback: indicates the upstream we failed over to.
- X-SiteRouter-Error: indicates the type of error and includes a unique request id.
- X-SiteRouter-RouteIndex: indicates the configuration route index that matched.
- X-SiteRouter-RouteDescription: indicates the description defined in the configuration.
- X-SiteRouter-Server: indicates the specific server (e.g. static assets redundancy).
- X-SiteRouter-Path: indicates the path taken by the server (e.g. static asset bucket info).
These headers aren’t available to the public. When we need them, we enable the headers dynamically via query parameters. But this then introduces an interesting problem, which is how do we configure that in NGINX code? This leads us nicely onto our next section…
Unable to use if statements effectively
We can’t use standard if statements in our NGINX code because those are considered evil by NGINX (and can even cause NGINX to segfault 🙀). Instead, we’re forced to utilise a different type of directive, known as “map” in NGINX.
Just to clarify: NGINX can use if statements, but the variables we inspect as part of the condition are the problem, so by using the map directive we can create variables that aren’t going to cause NGINX to freak out.
Here is the general structure for using the map directive:
map $some_variable_to_inspect $a_new_variable_is_created {
default 0;
"expected_value" 1;
}
In the above example, we can see that we have to specify a variable we want to look at; in practice I change $some_variable_to_inspect to something like $query_string. Next, we define the value we’re hoping to find assigned to that variable (e.g. “expected_value”) and also provide a default value in case we don’t find what we’re looking for. Finally, the result is assigned to a new variable (e.g. $a_new_variable_is_created).
The reason 0 and 1 are used in the above example is that these values are coerced into truthy/falsy values, so if we do decide to use an if statement later on in our NGINX code, we can use it to inspect the new variable and NGINX won’t have an issue, like so:
if ($a_new_variable_is_created) {
// do something useful
}
Toggling HTTP headers with maps
Now that we understand why if statements are problematic, we can revisit our earlier question of how to enable our debug headers by checking a query string. Consider the following example map:
map $query_string $debug {
"~*turn_on_debug_info=true" 1;
default 0;
}
We now want to use the $debug variable to decide whether to show the debug information in the HTTP response headers we send back to the client, but if we try the following example (which uses an if statement) we find it doesn’t work:
if ($debug) {
add_header X-SiteRouter-Upstream "foo" always;
}
Instead we use a little trick in NGINX whereby if you attempt to add a header with an empty value it won’t be shown. For example, if I was to stick the following code into an nginx.conf file, then the header wouldn’t be added to the response as the value provided is effectively empty:
add_header X-SiteRouter-Upstream "" always;
This means we need another map to determine whether the value should be empty or not:
map $debug $upstream {
1 "foo";
default "";
}
In the above example we inspect the new $debug variable (which we created via the earlier map directive) and say: if the value assigned is 1, then assign the value “foo” to a new variable called $upstream; otherwise assign an empty string to it. Our code can now utilise the new $upstream variable to display the debug information only if the user actually provided the relevant debug query parameter:
add_header X-SiteRouter-Upstream $upstream always;
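Putting the pieces together, here is a minimal sketch of the full chain (the upstream name “foo” and the port are placeholders):

# 1 if the debug query parameter is present, 0 otherwise
map $query_string $debug {
    "~*turn_on_debug_info=true" 1;
    default 0;
}

# expose the upstream name only when debugging is enabled;
# NGINX omits a header whose value is empty
map $debug $upstream {
    1 "foo";
    default "";
}

server {
    listen 8080;

    location / {
        add_header X-SiteRouter-Upstream $upstream always;
        return 200 "ok";
    }
}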
(Me, reading the map directive docs over and over…)

Active-Active static assets across cloud providers
We use Site Router for serving static assets such as JavaScript, CSS, and images (e.g. /static/js/foo.js or /static/css/foo.css). In order to provide the availability and resilience expected of us, we utilise multiple cloud providers for serving our static asset content.
Rather than a redundancy design (‘active-passive’), where if one provider fails we fail over to the other, we wanted an ‘active-active’ model in which both providers work side-by-side and we round-robin requests between them. The benefit of this approach is that we can be sure both providers are working and are consistent in the content they serve.
We implemented an abstraction within the upstream block which allows an engineer to specify nested cloud providers. That configuration ultimately generates multiple virtual servers, one for each cloud provider specified, resulting in NGINX configuration like the following:
upstream static_assets {
    zone static_assets 64k;

    server 127.0.0.1:49152 max_fails=0;
    server 127.0.0.1:49153 max_fails=0;
}

server {
    listen 127.0.0.1:49152;
    server_name some-bucket.s3.amazonaws.com;

    location ~* ^/static-assets/(?<asset_path>.+)$ {
        proxy_set_header Host some-bucket.s3.amazonaws.com;
        proxy_pass https://some-bucket.s3.amazonaws.com/${asset_path}$is_args$args;
    }
}

server {
    listen 127.0.0.1:49153;
    server_name storage.googleapis.com;

    location ~* ^/static/(?<asset_path>.+)$ {
        proxy_set_header Host storage.googleapis.com;
        proxy_pass https://storage.googleapis.com/some-bucket/${asset_path}$is_args$args;
    }
}
All an engineer needs to do now is define a new route and associate it with the static_assets upstream. Any requests passed to that upstream will be distributed between the defined virtual servers using a weighted round-robin balancing method. Meaning: the first request is proxied to AWS, the second to Google, the third to AWS, and so on, back and forth.
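We won’t reproduce our exact YAML schema here, but conceptually the upstream abstraction looks something like the sketch below (the key names are illustrative only, not our real configuration format):

upstreams:
  - name: 'static_assets'
    servers:
      - provider: 'aws'          # illustrative schema only
        host: 'some-bucket.s3.amazonaws.com'
        path_prefix: '/static-assets/'
      - provider: 'gcp'
        host: 'storage.googleapis.com/some-bucket'
        path_prefix: '/static/'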
Testing NGINX and mocking upstreams
When testing Site Router, we want to be sure the behaviours we’re verifying are isolated and not resulting in real requests reaching our origin servers. In order to achieve this we need to modify our generated NGINX configuration file to use mocked endpoints (for which we use a Python web server to handle all outgoing requests from NGINX).
Our tests are triggered by a simple bash script, and this is where most of the magic happens as far as dynamically modifying our NGINX configuration goes. In essence, we compile our nginx.conf file and then use sed to modify the code just before running our test suite.
One of the key modifications we make is to switch every server directive to use the same IP and port as our mock web server. This means we’re able to handle all outgoing proxy requests and review their responses to see whether we went to the correct location.
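As a rough illustration (the sed expression and port are placeholders rather than our real script), the kind of rewrite this performs looks like:

#!/usr/bin/env bash
set -euo pipefail

MOCK_ADDR="127.0.0.1:9000"  # the Python mock web server

# Point every upstream 'server host:port;' line at the mock server so no
# request ever leaves the test environment.
sed -i.bak -E "s/^([[:space:]]*server[[:space:]]+)[^;]+;/\1${MOCK_ADDR};/" nginx.conf

# Sanity check the modified configuration before running the test suite.
nginx -t -c "$(pwd)/nginx.conf"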
We also utilise the debug headers we spoke about earlier to give us extra contextual information about whether the request was routed through the correct pathways as defined in our routing configuration files.
Another change we had to make was in how we handled 301 redirects in NGINX when running our test suite. When NGINX triggers a 301 redirect (for example, to advertise.buzzfeed.com), instead of making a request to the real IP that an external DNS would typically resolve to, the request should be sent to our local IP 127.0.0.1 (we do this by adding the relevant hosts to the service’s /etc/hosts file).
When we started to implement redundancy for static assets, we realised we needed a way to verify the failover behaviour between cloud providers (e.g. if one provider failed to respond then the other provider should handle the request).
In order to achieve that behaviour, we again use sed to modify the nginx.conf so that any proxy_pass reference to AWS is changed to a simple return 404 "Not Found", and any proxy_pass reference to Google is changed to a simple return 200 "OK".
The way we have these upstreams defined in our configuration means that AWS will always be requested first, and so we can be sure the redundancy is working as expected by checking we got a 404 followed by a 200.
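Again as a sketch only (the substitutions mirror the hostnames from the generated configuration shown earlier):

# Simulate the AWS provider failing and the Google provider succeeding.
sed -i.bak -E \
  -e 's|proxy_pass https://some-bucket\.s3\.amazonaws\.com[^;]*;|return 404 "Not Found";|' \
  -e 's|proxy_pass https://storage\.googleapis\.com[^;]*;|return 200 "OK";|' \
  nginx.conf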
Building test abstractions
We write our integration tests in Python, meaning we can develop abstractions around the test suite code to make it easier to work with. We use py.test’s parametrize (table matrix) feature, which allows us to pass multiple test scenarios through a single set of assertions; each ‘test’ is then placed into a separate host-specific file. This means engineers don’t have to worry about anything other than the abstraction.
Below is an advanced example covering a few different features an engineer might utilise as part of constructing a request to be tested:
tests = [
{
"path": "/foo",
"override": "01_japanese_edition",
"headers": {
"X-Edition": "ja-jp",
},
"redirect_to": "/jp/foo",
"host": "www.buzzfeed.com",
"status_code": 301,
"response_headers": {
"X-BF-Vary": "X-Edition",
"Cache-Control": "no-cache, no-store, must-revalidate",
},
"upstream": "foobar",
},
]
The above example demonstrates a test that verifies a request to /foo for the host www.buzzfeed.com. When the request is sent with a specific custom header indicating the user’s “edition”, the test expects an override to be triggered (01_japanese_edition), a 301 redirect to /jp/foo, and a specific set of response headers.
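For context, here is a stripped-down sketch of the kind of harness that consumes those dictionaries (our real suite has more helpers and assertions; the import path, base URL, and port are placeholders):

import pytest
import requests

from hosts.buzzfeed_com import tests  # hypothetical host-specific test file


@pytest.mark.parametrize("case", tests)
def test_routing(case):
    # Send the request to the locally running Site Router instance.
    resp = requests.get(
        "http://127.0.0.1:8080" + case["path"],
        headers={"Host": case["host"], **case.get("headers", {})},
        allow_redirects=False,
    )

    assert resp.status_code == case.get("status_code", 200)

    if "redirect_to" in case:
        assert resp.headers["Location"].endswith(case["redirect_to"])

    if "upstream" in case:
        assert resp.headers.get("X-SiteRouter-Upstream") == case["upstream"]

    for name, value in case.get("response_headers", {}).items():
        assert resp.headers.get(name) == value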
Rollout Strategy
We mentioned earlier how we initially wanted to do private testing of Site Router in production, and we explained how one of the side effects we encountered came from the misuse of the Vary HTTP header. When rolling out Site Router to production for wider user testing, we decided the best approach would be to adopt a ‘regional’ rollout process.
What this meant was that we selected a country that had a smaller footprint than our primary audience, and if we had issues rolling out the Site Router service, we would be able to roll back and purge our caches without causing a stampeding herd effect on our origin servers.
If you have a large delta in traffic patterns between regions, you can also consider using Fastly’s geoip.city lookup to roll out to specific cities within a chosen region. This would allow you to reduce the cache purge blast radius.
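As a rough VCL sketch of that idea (the city name is a placeholder, and depending on your VCL version the variable may be client.geo.city rather than geoip.city):

sub vcl_recv {
  # Only bucket requests from a single, lower-traffic city into Site Router.
  if (geoip.city == "manchester") {
    set req.http.X-SiteRouter-Enabled = "1";
  }
}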
To help with cache purging, we modified the incoming request to include an additional header that our upstream applications would be able to identify. If this new header was found, the service would include an extra key within its Surrogate-Key HTTP response header.
The new key added to the Surrogate-Key header indicated that the response was generated for a request that had come via Site Router. This subsequently allowed us to purge just those affected responses rather than having to purge our cache by URL alone (which could otherwise have caused a lot of extra traffic flowing through to our origins).
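For reference, purging by surrogate key in Fastly is then a single API call (the service ID, key name, and token are placeholders):

curl -X POST \
  -H "Fastly-Key: $FASTLY_API_TOKEN" \
  "https://api.fastly.com/service/SERVICE_ID/purge/site-router"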
System observability and monitoring
In order to observe the Site Router service, we rely heavily on Datadog for monitoring and logging. Site Router is built upon NGINX which ‘out of the box’ doesn’t support instrumenting your own custom metrics. This meant with the open-source release we would have had to depend upon the basic monitoring provided.
It’s possible to extend NGINX using OpenResty and Lua, but we decided against this option initially in favour of switching to NGINX Plus, which provided us access to many more granular metrics and allowed us a greater insight into what was happening with our requests and the various upstreams.
We also have PagerDuty configured to be triggered by our Datadog 5xx monitors. To this end, we have many monitors set up for Site Router, but only the 5xx monitors (set up for each of our upstreams) are directed at PagerDuty, as these are the most critical ‘user-facing’ errors and demand our immediate attention when on-call.
Conclusion
That’s it for this final part of our three-part series on request handling. It’s been a long journey, but let’s look back over a few key things we’ve learned along the way.
We started out understanding a bit about what an HTTP request is, how it’s constructed, and why inspecting the response headers can be useful for understanding how clients are meant to handle future requests for the same resource (such as caching behaviours).
We introduced the CDN as a core part of our resilience strategy and saw how we had become too reliant on it to perform routing-specific logic. We then discussed the various decisions that go into proposing a new service and how, in our case, we agreed to migrate our routing logic from the CDN into a new NGINX-based service called “Site Router”.
From there we proceeded to build various abstractions around NGINX in order to let engineers get what they needed done quickly and efficiently, with as much flexibility as possible, without them needing to be experts in writing NGINX code. We also saw how an understanding of basic data structures helped us set consistent expectations for our users around NGINX’s ‘first match wins’ routing behaviour.
Finally, in this segment, we covered many of the hurdles encountered along the way, such as:
- How NGINX tries to be helpful by internally caching DNS queries.
- How the Vary HTTP header can result in users not getting a consistent experience if implemented incorrectly.
- How you should be constantly evaluating your services and refactoring them to ensure they stay efficient solutions for your users (for us, this meant splitting our large configuration into separate files, which also helped to expose additional functionality).
- How new requirements, like needing to handle hundreds of redirects in a sane way, can help you to tease out code bloat.
- How ‘understanding’ a piece of software and knowing it are two different things, e.g. having to utilise map directives in smart ways to avoid issues with if statements, and how the compilation of our abstractions caused unexpected broken results.
- How important it is to be able to debug and trace routing requests, especially given the number of state changes that can be applied throughout a single HTTP request’s lifecycle.
- How using an ‘active-active’ pattern can provide a more robust and trusted system than an ‘active-passive’ design. Active-active means we can be sure both A and B are functioning correctly and consistently, rather than discovering at the last moment that failing over to B doesn’t actually work.
- How testing a proxy service like NGINX can introduce different types of problems, such as managing DNS resolution internally so that things like 301 redirects stay within the test environment.
- How you need to plan a solid rollout strategy and how you might need to roll back. Making sure there is appropriate monitoring and instrumentation can help you identify issues and alert on them if necessary.
That’s it. We’ve covered it all now. So let’s see who we have to thank for all this…
Who’s responsible?
Although the majority of the code has (up until this point) been written by myself, it would be very foolish to think I alone would be capable of such a feat of success without a lot of help from a lot of very smart individuals.
There is absolutely no way the Site Router project would have turned out nearly as successful as it has without the following honourable mentions:
- Clément Huyghebaert (Director of Engineering)
- John Cleveley (Director of Engineering)
- Andrew Vaughan (Site Reliability Engineer)
- Raymond Wong (Staff Site Reliability Engineer)
- Dang Vang (Senior Software Engineer)
- Yeny Santoso (Senior QA Analyst)
Of course there has been much discussion, great ideas and useful input given by many more people than those mentioned above, and to you I give a very warm hug and say a big “thank you”.
Finally, thank you to Ian Feather and Caroline Amaba for their help converting this entire series of jumbled notes and broken thoughts into something altogether much more coherent.
Come join in!
If any of this sounds interesting to you (the challenges, the problem solving, and the tech), then do please get in touch with us. We’re always on the lookout for talented and diverse people, and we’re pro-actively expanding our teams across all locations around the globe.
We’d love to meet you! ❤️
To keep in touch with us here and find out what’s going on at BuzzFeed Tech, be sure to follow us on Twitter @BuzzFeedExp where a member of our Tech team takes over the handle for a week!