CLS at BuzzFeed — Part 1: Raising The Floor
This article is part of a three-part series documenting how BuzzFeed tackled their issues with Cumulative Layout Shift as part of a broader Web Vitals initiative. Take a look at parts two and three.
Google launched thousands of performance teams across the globe when they announced they’d be factoring Core Web Vitals into search ranking. BuzzFeed was no exception. We’ve always cared about web performance, but we’d taken our eye off the ball and we had a lot of work to do to catch up with these new requirements.
Our Largest Contentful Paint (LCP) scores were fine, as was First Input Delay (FID), which was a relief given we load React on the page and often hear how that can prevent you from reaching these targets. Where we fell catastrophically short of the mark was Cumulative Layout Shift (CLS), where only 20% of visits were receiving a “good” experience — a long way from the 75% target.
This blog post, and the subsequent parts, detail the progress we’ve made and the lessons we’ve learned, which can help your team accomplish the same improvements.
A few characteristics of BuzzFeed made CLS a particularly hard problem for us:
- Reliance on digital advertising — delayed load and unpredictable sizing
- A large amount of third party code
- A high degree of page variability (based on page layout, content, content types, embeds)
- Lots of unique pages (in the millions)
To address the issue, it was clear we had to improve our observability of the current state as well as drastically simplify the problem space.
Improving our Observability over CLS
Real User Monitoring
Our first step was to start collecting Web Vitals from real users and feeding the results into our own analytics pipeline. We could now monitor the impact of our changes over time and at a much faster cadence than CrUX. Owning our own data would also prove extremely useful later in the process (which we’ll get to) by allowing us to drill down into results by date, page type, and even the specific embeds that are loaded into each page.
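As a rough illustration of what field collection can look like, recent versions of the web-vitals library make it straightforward to beacon CLS values from real sessions. The endpoint and payload below are placeholders rather than our actual setup:

```typescript
// Minimal sketch of field (RUM) collection for CLS with the web-vitals library.
// The '/rum' endpoint and the metadata attached here are placeholders, not
// BuzzFeed's actual pipeline.
import {onCLS} from 'web-vitals';

onCLS((metric) => {
  const body = JSON.stringify({
    name: metric.name,   // 'CLS'
    value: metric.value, // score of the worst session window so far
    id: metric.id,       // unique per page load, handy for deduplication
    page: window.location.pathname,
  });

  // sendBeacon survives the page being unloaded; fall back to a keepalive fetch.
  if (!navigator.sendBeacon('/rum', body)) {
    fetch('/rum', {method: 'POST', body, keepalive: true});
  }
});
```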
Synthetic Monitoring
Synthetic monitoring provides a consistent testing environment for us to repeatedly run the same tests and track results over time. We use Calibre to provide the testing environment and tests are run a few times a day.
Whilst this gave us a consistent environment to test from, we didn’t have a consistent and predictable test subject due to the variability of our pages (different ads could be loaded, AB tests running, new features released, etc.). If we suddenly ran a new ad campaign, one that used a different ad height or heavier assets, it could very easily balloon our scores and set off alerts.
To create a predictable test subject, we came up with the concept of breaking our pages into independently testable layers (sketched in code after the list below):
- Content Layer — we load everything required to show just the page content. This includes interactive embeds and quizzes but doesn’t include complementary units.
- Feature Layer — everything above plus complementary units such as recirculation feeds, comments, and polls.
- Full Render Layer — everything above plus ads (configured to be consistent each time).
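In code, a layer can be as simple as a query parameter that the page reads before deciding what to boot. The parameter name and module boundaries below are hypothetical and only illustrate the idea:

```typescript
// Hypothetical sketch: gate optional page features behind a "render layer"
// flag so synthetic tests can load a predictable subset of the page.
type RenderLayer = 'content' | 'feature' | 'full';

function currentLayer(): RenderLayer {
  const value = new URLSearchParams(window.location.search).get('render_layer');
  return value === 'content' || value === 'feature' ? value : 'full';
}

const layer = currentLayer();

// Content layer: article body, images, embeds and quizzes only (always loaded).

// Feature layer: also boot complementary units (recirculation, comments, polls).
if (layer !== 'content') {
  import('./complementary-units').then((m) => m.init());
}

// Full render layer: additionally load ads, pinned to a consistent configuration.
if (layer === 'full') {
  import('./ads').then((m) => m.init({forceConsistentCreatives: true}));
}
```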
An illustrated example of applying the layers concept to our article pages:
Next, we loaded a representative set of test pages into Calibre. Our RUM data is imported into BigQuery, so we were able to find 10 pages from the “good” range, 10 pages that “needed improvement”, and 20 “poor” pages.
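Because the RUM data sits in BigQuery, selecting that sample can be a single query. The table and column names below are hypothetical stand-ins for our real schema; the 0.1 and 0.25 cut-offs are the standard CLS thresholds:

```typescript
// Hypothetical sketch: bucket pages by p75 CLS so we can pick a representative
// sample for synthetic testing. Table and column names are placeholders.
import {BigQuery} from '@google-cloud/bigquery';

async function samplePagesForCalibre() {
  const bigquery = new BigQuery();
  const [rows] = await bigquery.query({
    query: `
      SELECT page_path, APPROX_QUANTILES(cls, 100)[OFFSET(75)] AS p75_cls
      FROM \`rum.web_vitals\`
      WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
      GROUP BY page_path
    `,
  });

  return {
    good: rows.filter((r) => r.p75_cls <= 0.1).slice(0, 10),
    needsImprovement: rows.filter((r) => r.p75_cls > 0.1 && r.p75_cls <= 0.25).slice(0, 10),
    poor: rows.filter((r) => r.p75_cls > 0.25).slice(0, 20),
  };
}
```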
Finding and Fixing Issues
Synthetic monitoring was great for narrowing in on units that were causing issues. We took a very deliberate, data-driven approach to prioritizing them — biasing for units and page types that received high volumes of page views. This is one of the lessons we’ve learned over the years of managing performance on a site as big as BuzzFeed — we’ll never complete the full list of possible performance optimizations so we need to ruthlessly focus on the ROI of each change using real data.
Once we’d identified the unit to fix, we switched to Chrome DevTools and Lighthouse to work more closely and make the required changes. Many articles explain how to do this already (here’s one) so this post won’t try to cover them.
These were our most successful early improvements to CLS:
Correct Image Sizing
Adding width/height attributes to all images was a fairly easy and high-impact win. Originally we were just prioritizing the above-the-fold images but eventually pivoted to include all images on the page.
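A minimal sketch of the pattern (illustrative, not our actual image component): always render intrinsic dimensions so the browser can reserve the correct aspect ratio before the asset arrives.

```tsx
// Illustrative React sketch: emitting width/height lets the browser compute the
// aspect ratio up front, so the layout doesn't shift when the image loads.
type SizedImageProps = {
  src: string;
  alt: string;
  width: number;   // intrinsic pixel width from asset metadata
  height: number;  // intrinsic pixel height from asset metadata
  eager?: boolean; // true for above-the-fold images
};

export function SizedImage({src, alt, width, height, eager = false}: SizedImageProps) {
  return (
    <img
      src={src}
      alt={alt}
      width={width}
      height={height}
      loading={eager ? 'eager' : 'lazy'}
      style={{maxWidth: '100%', height: 'auto'}}
    />
  );
}
```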
Static Placeholders for Ads
Most of our ad units can change dimensions depending on the ad inventory that’s actually served. We used our ad impression data to choose the most common ad size for each unit and created static placeholders for them. We also reversed a previous decision to collapse ads when there was no inventory to serve.
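A sketch of the idea, with a hypothetical unit-to-size mapping standing in for the one we derived from impression data:

```tsx
// Illustrative sketch: reserve each ad slot's most common creative size so the
// surrounding content doesn't move when (or whether) an ad renders. The slot
// keeps its minimum height even when no inventory is served.
const MOST_COMMON_SIZE: Record<string, {width: number; height: number}> = {
  'story-inline': {width: 300, height: 250}, // hypothetical mapping
  'sidebar': {width: 300, height: 600},
};

export function AdSlot({unit}: {unit: string}) {
  const size = MOST_COMMON_SIZE[unit] ?? {width: 300, height: 250};
  return <div data-ad-unit={unit} style={{minWidth: size.width, minHeight: size.height}} />;
}
```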
Static Placeholders for Embedded Content (e.g. TikTok, Reddit, Twitter)
Unlike ads, where we at least had a range of sizes to choose from, embeds have no fixed dimensions and were impossible for us to accurately size. For example, a Twitter embed can vary dramatically in height depending on the content of the tweet and whether it contains images or videos.
We took the approach of best-guess-sizing them and trying to minimize the CLS impact rather than eliminate it altogether, as we assumed it was impossible to do better. Part 3 of this series will detail how we challenged and got past this assumption.
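A sketch of that best-guess approach; the per-provider heights are placeholders rather than measured values:

```tsx
// Illustrative sketch: wrap third-party embeds in a container sized to a
// best-guess height per provider. The aim is to shrink the shift when the
// embed resizes itself, not to eliminate it entirely.
import type {ReactNode} from 'react';

const BEST_GUESS_HEIGHT: Record<string, number> = {
  twitter: 550, // placeholder estimates, not measured values
  tiktok: 740,
  reddit: 470,
};

export function EmbedPlaceholder({provider, children}: {provider: string; children: ReactNode}) {
  const minHeight = BEST_GUESS_HEIGHT[provider] ?? 400;
  return <div style={{minHeight}}>{children}</div>;
}
```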
Despite our best-guess sizing being imperfect, it still had a considerable impact on posts with a large number of embeds. The graphs below show daily CLS scores based on the 75th percentile of page views. The red line denotes when these updates were released.
A note on “Below the Fold”
Layout shifts don’t stop at the initial render: CLS keeps accumulating as users scroll, and a standard synthetic page load won’t capture shifts that happen below the fold. Until recently, we had not found a good solution for measuring this consistently — either on an ad hoc basis or via synthetic tests. Our method for measuring and diagnosing issues on scroll was to use Chrome DevTools and record via the Performance tab as we manually scrolled. In the future, we may use Lighthouse User Flows to automate this. We’d be interested to hear what solutions others have found!
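For reference, here’s a rough sketch of what automating this with Lighthouse user flows might look like, assuming a recent Lighthouse release (which exports startFlow) driven by Puppeteer:

```typescript
// Rough sketch: use Lighthouse's timespan mode to record layout shifts that
// happen while scrolling, which a standard page-load audit won't capture.
import puppeteer from 'puppeteer';
import {startFlow} from 'lighthouse';

async function measureScrollCLS(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const flow = await startFlow(page);

  // Audit the initial navigation separately so it isn't attributed to the scroll.
  await flow.navigate(url);

  // Timespan mode records metrics (including CLS) over an arbitrary window.
  await flow.startTimespan();
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += window.innerHeight) {
      window.scrollTo(0, y);
      await new Promise((resolve) => setTimeout(resolve, 250));
    }
  });
  await flow.endTimespan();

  const report = await flow.generateReport(); // HTML report with per-step CLS
  await browser.close();
  return report;
}
```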
When RUM disagrees with Synthetic
Over a few months, we applied the above fixes and many more; one by one we found and fixed issues until our synthetic tests in Calibre were hitting the mark. It was extremely satisfying to scroll down the list of 240 pages with a sea of green text reflecting the progress we’d made. The problem was… it wasn’t reflected in our RUM data or in the eyes of Google: we were still far below the 75% mark.
Green results all the way according to Calibre:
CrUX Data still shows only ~50% of users are receiving a “good” experience:
At this point, we were running out of ideas. Our list of obvious issues to fix was now complete and, whilst we were closer, our goal still felt fairly far out of reach. We’d markedly improved the experience and raised the floor across all of our pages, but we were going to need a breakthrough to move that green bar closer to 75%.
This is when we turned to our users for help. We’ll cover that in Part 2.
This article was co-authored with Ian Feather. Many thanks to Maria Enderton for her contributions.
BuzzFeed Tech is hiring! 👋
If you are interested in browsing openings, check out buzzfeed.com/jobs. We have roles based in Los Angeles, Minneapolis, London, and New York with options for remote work!