Adventures with Content Delivery (mostly CloudFront) Optimizations

James Lim
Affirm Tech Blog
Jan 13, 2020


Not everyone lives near AWS us-east-1. Photo by Anastasia Dulgier on Unsplash

At the beginning of 2019, Engineering@Affirm set aggressive performance goals for our React apps and affirm.js¹ to improve user experience. To drive this effort, we started by improving instrumentation and measuring, in granular detail, the performance of each of our apps and of affirm.js. Shortly after, we coordinated a concerted effort across the organization and prioritized optimization projects across engineering teams, including code-splitting and CDN improvements.

The common web page optimizations are well covered by other articles. Here, I will discuss some of the more unexpected lessons we learned while working on content delivery with AWS CloudFront and the Application Load Balancer.

Fig. 1: Latency graphs, such as this, are now routinely featured in Affirm’s release notes. We saw TTLB improvements from a combination of CDN tuning, brotli compression, and code trimming.

Complementing real user monitoring with synthetic monitoring

Like all grand optimization journeys, we started with instrumentation. On pages that use resources from multiple domains (more on this in a later post), we added Timing-Allow-Origin where necessary to bypass cross-origin restrictions on the ResourceTiming API. This was implemented using Lambda@Edge, because Timing-Allow-Origin is not one of the HTTP headers that S3 supports. Then, using the PerformanceObserver interface, we tracked the PerformanceResourceTiming entry for each resource and computed detailed metrics such as Time to First Byte (TTFB) and Time to Last Byte (TTLB). Empirically, we have found p50 and p95² to be useful for debugging latency contributions and comparing performance across releases. Percentiles above p95 were too noisy to control, and too ambiguous to direct a team’s priorities.
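
For illustration, here is a minimal sketch of the browser-side collection, assuming the cross-origin resources already carry Timing-Allow-Origin; the console.log call is a stand-in for our actual reporting pipeline:

    // Observe resource timings and derive TTFB / TTLB for each resource.
    // Without Timing-Allow-Origin on cross-origin responses, the browser
    // zeroes out most of these fields.
    const observer = new PerformanceObserver((list) => {
      for (const entry of list.getEntries()) {
        const ttfb = entry.responseStart - entry.startTime; // Time to First Byte
        const ttlb = entry.responseEnd - entry.startTime;   // Time to Last Byte
        // In production this would feed a metrics pipeline; console.log is a stand-in.
        console.log(entry.name, { ttfb, ttlb });
      }
    });
    observer.observe({ entryTypes: ['resource'] });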

We also found that a combination of real user monitoring and synthetic monitoring was the most useful way to inform and prioritize optimization efforts, and to see the effects of our changes in production. Alerts from synthetic monitoring tools such as SpeedCurve were particularly useful for catching performance regressions. Tools such as Lighthouse were useful in development, but they do not always paint the full picture, since they exclude certain effects of caching and connection reuse between pages.

Fig. 2: An overly simplified diagram of our content delivery architecture.

Increasing CloudFront’s origin keep-alive idle timeout

This was one of the first improvements: deceptively easy, but it worked really well with our traffic patterns and yielded significant improvements to TTFB. The keep-alive idle timeout defaults to 5 seconds³, which means that a CloudFront edge server will close an idle connection to our Application Load Balancer (ALB) if no request has been sent on that connection for more than 5 seconds. A subsequent request would have to reopen the connection, incurring the TCP and TLS handshake penalty. Given the large number of CloudFront edge servers and the high max-age used on our static assets, we inferred that this was affecting our TTFB latencies at the 90th percentile and above, especially during quieter hours.

We tested this hypothesis and applied the following changes:

  • increasing the origin keep-alive idle timeout to 120 seconds,
  • reducing the number of custom origins by consolidating our legacy ELBs into a single ALB, and
  • reducing the number of CNAMEs for the ALB to just one.

These changes keep the connections between CloudFront and the ALB concentrated and hot. On CloudWatch, the effects of these changes can be observed in the Number of New Connections metric (decreased) and the Number of Active Connections metric (increased).
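
For reference, the timeout lives on each custom origin’s CustomOriginConfig. Below is a minimal sketch of the change using the AWS SDK for JavaScript; the distribution ID is a placeholder, and values above CloudFront’s default limit may require a quota increase from AWS.

    // Sketch: raise OriginKeepaliveTimeout on every custom origin of a distribution.
    const AWS = require('aws-sdk');
    const cloudfront = new AWS.CloudFront();

    async function setOriginKeepalive(distributionId, seconds) {
      // Fetch the current config together with its ETag (required for updates).
      const { DistributionConfig, ETag } = await cloudfront
        .getDistributionConfig({ Id: distributionId })
        .promise();

      for (const origin of DistributionConfig.Origins.Items) {
        if (origin.CustomOriginConfig) {
          origin.CustomOriginConfig.OriginKeepaliveTimeout = seconds;
        }
      }

      await cloudfront
        .updateDistribution({ Id: distributionId, IfMatch: ETag, DistributionConfig })
        .promise();
    }

    setOriginKeepalive('EXXXXXXXXXXXXX', 120).catch(console.error);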

Fig. 3: TIL CloudFront uses HTTP/1.1 between the edge server and its origin.

CloudFront DNS — EDNS0 does not (always) do what you think it does

While trying to find a correlation between TTFB and geographical distance, we noticed that clients were not being routed to their nearest edge locations consistently. A small number of clients were being routed to edge locations far away, sometimes outside of the United States, which was hurting our TTFB latencies at the 90th percentile and above. Christian Elsen explains this phenomenon in detail in his post on the EDNS0-Client-Subnet extension. The gist of it is that, under certain configurations, DNS resolvers do not forward accurate information about clients’ locations to the authoritative nameservers.

After discussing the issue with AWS Support, we decided to reduce the price class of our CloudFront distributions to limit our edge locations to the United States, Canada, and Europe. Anycast-based routing, which is used by Fastly, Cloudflare, and AWS Global Accelerator, might work better in this case, but it is currently not supported in CloudFront.
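
The price class itself is a single field on the distribution config. Assuming the same fetch-and-update flow as the keep-alive sketch above, the change amounts to:

    // PriceClass_100 limits the distribution to the least-expensive edge
    // locations (roughly North America and Europe); PriceClass_All is the default.
    DistributionConfig.PriceClass = 'PriceClass_100';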

CloudFront s-maxage does not (always) do what you think it does

For affirm.js, we wanted a way to increase the duration for which the asset stays cached on CloudFront without increasing the duration for which it stays fresh in the browser (i.e. without increasing max-age). This would improve TTFB by increasing the cache hit rate on CloudFront, while still allowing us to roll back in a hurry by creating an invalidation request. At first, using the s-maxage directive appeared to be a good idea. Here’s a snippet from the CloudFront docs, taken out of context:

CloudFront caches objects for the value of the Cache-Control s-maxage directive.

After some testing, we turned it on and rolled out the change. To our surprise, we saw a spike in the number of requests to the CloudFront distribution, and a spike in the number of 304 responses. This increased the TTFB of affirm.js.

After some humbling investigation, we learned that the Date response header (from the origin, e.g. S3) is cached on CloudFront. This is problematic when s-maxage is much larger than max-age: once more than max-age has elapsed since the cached Date, every response from CloudFront is already stale the moment it reaches the browser. As a result, this caused an increase in the number of conditional GET requests and corresponding 304 responses. For the detailed math, refer to this thread on the AWS forums from 2014. Unfortunately, this meant that s-maxage was not useful for our needs.
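
Here is a simplified sketch of that math with hypothetical numbers: max-age=300 at the browser, the object cached at the edge an hour ago, and CloudFront still forwarding the origin’s original Date header.

    // The browser estimates the response's age from the (cached) Date header,
    // then compares it against max-age to decide freshness (simplified from RFC 7234).
    const maxAgeSeconds = 300;                           // Cache-Control: max-age=300
    const dateHeader = Date.now() - 3600 * 1000;         // origin Date, cached 1 hour ago
    const apparentAge = Math.max(0, (Date.now() - dateHeader) / 1000); // ~3600 s
    const freshOnArrival = apparentAge < maxAgeSeconds;  // false: stale on arrival
    // => every page load triggers a conditional GET, which CloudFront answers with a 304.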

Using NGINX proxy_cache_use_stale is a very good idea

Until recently, Affirm had some dynamic-ish HTML documents that were generated in the Python Flask app, which was not well tuned to stream large HTML files. Furthermore, this made the TTFB latencies susceptible to spikes caused by occasional garbage collection and gevent contention. It was straightforward (once we figured out what actually needed to be dynamic) to move these into the NGINX server instead, and to tune NGINX’s caches to optimize response times.

NGINX also has a feature similar to the modern stale-while-revalidate directive: with proxy_cache_use_stale enabled, NGINX will serve a stale cached response while it talks to the proxied server. This cuts any origin latency out of the client-perceived TTFB. On top of this, we also tuned the keep-alive settings to persist open connections between the ALB and the NGINX server, following lessons learned from the CloudFront optimizations above. The combination of these changes reduced p95 request_time to approximately 40 ms on the main checkout app.
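
As a rough sketch (the paths, ports, and durations below are illustrative, not our production values), the relevant pieces of the NGINX configuration look something like this:

    # Cache responses from the upstream app and serve stale entries while
    # a fresh copy is fetched in the background.
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m inactive=10m;

    upstream app_backend {
        server 127.0.0.1:8000;
        keepalive 32;                      # keep upstream connections open
    }

    server {
        listen 8080;

        location / {
            proxy_cache app_cache;
            proxy_cache_valid 200 1m;
            proxy_cache_use_stale updating error timeout http_500 http_502 http_503 http_504;
            proxy_cache_background_update on;

            proxy_http_version 1.1;        # required for upstream keep-alive
            proxy_set_header Connection "";
            proxy_pass http://app_backend;
        }
    }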

There is more! For example,

  • Unlike Fastly and Cloudflare, CloudFront does not support TLS v1.3 yet (so we are missing out on 0-RTT goodness), but enabling HTTP/2 was as easy as checking a box, and
  • Lambda@Edge can be surprisingly slow if you import any code at all or open any TLS connections to S3 (say, to check if an asset exists before returning a response).

Until next time.

Fig. 4: One last graph before we part. This one is cool because it has stairs.

If this work inspires you, we are actively hiring performance engineers to help us obsess over performance and speed.

[1] affirm.js is a JavaScript client used by merchants to integrate with Affirm.

[2] After excluding 2G devices, bots (e.g. Ruxit, SpeedCurve), and requests from outside continental US.

[3] Our TAM at AWS explained that increasing the keep-alive idle timeout would increase the number of open connections on the origin, which might not be suitable for custom origins that do not scale as well as AWS Application Load Balancers. Handle with care.
