Highly Resilient Affirm Checkout Architecture

Affirm Tech Blog · Mar 12, 2024

Author: Rahul Bansal, Distinguished Engineer at Affirm

Introduction

Affirm’s mission is to deliver honest financial products that improve lives. Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay over time without hidden fees or compounding interest.

As the popularity of Affirm’s products has grown dramatically since the company was founded over a decade ago, the technology platform must be re-architected to meet the associated scale, performance, and reliability requirements. The Architecture Group’s mission is to guide the Affirm team in shipping the next generation of Affirm’s technology platform.

Objective

We are very customer focused and want Affirm to be available whenever customers look to transact. Therefore, an important goal for the Architecture Group is providing 4 9s of availability. 4 9s refers to a highly available system that is up 99.99% of the time, which translates to roughly 52 minutes of downtime per year.
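The downtime budget behind a given number of nines is simple arithmetic; a minimal sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(nines: int) -> float:
    """Allowed downtime per year, in minutes, for a given number of nines."""
    availability = 1 - 10 ** -nines  # e.g. 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

print(f"4 9s budget: {downtime_budget(4):.1f} min/year")  # ~52.6
print(f"3 9s budget: {downtime_budget(3):.1f} min/year")  # ~525.6
```

Note the order of magnitude between the two: a single 3 9s dependency on the critical path can consume the entire 4 9s budget ten times over.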

Current Architecture

During Checkout, Affirm’s users are underwritten for every individual transaction before a real-time credit decision is made; that is, underwriting happens synchronously while the user waits for a response. The Checkout service runs on Amazon EKS and consists of many microservices. It also uses multiple AWS services such as Application Load Balancer (ALB), Amazon Aurora, and STS.

Architectural Decisions & Analysis

To achieve 4 9s, the general recommendation from AWS is to use a single-region architecture. For Affirm, this recommendation raises two crucial questions:

  1. How should Affirm think about large-scale events that affect an entire region? Such region-wide events can stem from physical disasters, such as hurricanes, or from software bugs in the cloud provider’s own services.
  2. How should Affirm think about building 4 9s in a single region when AWS regional services carry a mix of 4 9s and 3 9s availability SLAs?

Affirm is a data-informed company, so to answer the first question, we did a detailed analysis of all publicly documented AWS large-scale outages. We concluded that since 2011, there have been no region-wide incidents on AWS due to physical disasters. Further, we concluded that if we use the “right” AWS services in the “right” way (more on this below), we can reduce the likelihood of being affected by region-wide disasters caused by software bugs to less than once every 2–4 years.

To address the second question and to achieve our 4 9s availability goal, we looked at the Affirm Checkout and its dependencies on AWS services carefully and reached the following decisions:

  1. The Checkout service relies on a single region with multi-AZ redundancy. This allows us to tolerate disasters in a single AZ. It’s rare for disasters in one AZ to spread to multiple AZs, and AZs are close enough together that cross-AZ traffic does not add significant latency.
  2. The Checkout data plane relies on a minimal number of AWS services, and only on services with a 4 9s+ availability SLA. The control plane can rely on 3 9s services as long as we take appropriate steps to mitigate downtime risk. For example, during the recent AWS Lambda outage, STS, a 3 9s service that we use on the control plane, was affected. Affirm avoided downtime because our token refresh interval was longer than the duration of the incident. Had we not taken care to set the interval to a large value, the STS outage would have resulted in a data plane outage and impacted our availability.
  3. We ensure that all Affirm microservices degrade gracefully. To achieve this, all the Affirm microservices involved in the Checkout Data Plane are classified as either optional or required. By creating sane default behavior, we can ensure that the Checkout service succeeds even when any optional Affirm microservices are down. This further reduces the dependency of Checkout availability on AWS services.
  4. We maintain a Disaster Recovery (DR) plan to fail over to another region in case there is a large-scale event in a single region. While we expect to trigger failover no more than once every 2–4 years, we test the DR plan more frequently to ensure correctness.
  5. We ensure that all Affirm microservices robustly isolate the workloads they serve. For example, if a microservice serves both Checkout and non-Checkout traffic, we ensure that non-Checkout traffic cannot impact Checkout traffic.
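The token-refresh mitigation in point 2 can be sketched as a credential cache that refreshes well before expiry and keeps serving the cached token when the issuer is unreachable. This is a minimal illustration, not Affirm’s actual implementation; `fetch_token` is a hypothetical callable standing in for an STS call.

```python
import time

class CachedCredentials:
    """Serve a cached token, refreshing early so that issuer outages
    shorter than the remaining token lifetime cause no data plane impact.

    fetch_token is a hypothetical callable standing in for an STS call;
    it returns (token, lifetime_seconds).
    """

    def __init__(self, fetch_token, refresh_margin=3600):
        self._fetch_token = fetch_token
        self._refresh_margin = refresh_margin  # refresh this long before expiry
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = time.time()
        if self._token is None or now >= self._expires_at - self._refresh_margin:
            try:
                token, lifetime = self._fetch_token()
                self._token = token
                self._expires_at = now + lifetime
            except Exception:
                # Issuer (e.g. STS) is down: keep serving the cached token
                # as long as it has not actually expired.
                if self._token is None or now >= self._expires_at:
                    raise
        return self._token
```

With a long-lived token and an early refresh, a control-plane outage only becomes a data-plane outage if it outlasts the token’s remaining lifetime.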
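The optional/required classification in point 3 can be sketched as a call wrapper that substitutes a sane default when an optional dependency fails, while letting failures of required dependencies propagate. The service name and default below are hypothetical, not part of Affirm’s system.

```python
def call_dependency(fn, *, required, default=None):
    """Invoke a downstream microservice call.

    Required dependencies propagate their failures; optional ones fall
    back to a sane default so Checkout can still succeed.
    """
    try:
        return fn()
    except Exception:
        if required:
            raise
        return default

# Hypothetical example: a personalization service is optional, so Checkout
# proceeds with a default experience when it is down.
def personalization():
    raise TimeoutError("personalization service unavailable")

result = call_dependency(personalization, required=False,
                         default={"experience": "default"})
print(result)  # {'experience': 'default'}
```

The interesting work is in choosing the defaults: each optional dependency needs a fallback that keeps the checkout decision safe, not just non-erroring.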

In terms of alternatives, we evaluated an active-active multi-region architecture; however, such an architecture is a much larger lift at the application layer. Using multi-region strongly consistent storage requires significant application changes, such as reducing chattiness between the application and the database because of the higher latency on writes. On the other hand, eventually consistent multi-region storage requires applications to correctly handle the implications of replication delay, e.g. lost writes.
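To make the lost-write hazard concrete, here is a minimal sketch (illustrative only, not Affirm code) of last-write-wins replication between two regions: under eventual consistency with skewed clocks, a genuinely later write can carry a smaller timestamp and be silently discarded when the replicas converge.

```python
class Replica:
    """Toy last-write-wins (LWW) replica: higher timestamp wins on merge."""

    def __init__(self):
        self.store = {}  # key -> (value, timestamp)

    def write(self, key, value, ts):
        self.store[key] = (value, ts)

    def merge(self, other):
        # Eventual consistency via LWW conflict resolution.
        for key, (value, ts) in other.store.items():
            if key not in self.store or ts > self.store[key][1]:
                self.store[key] = (value, ts)

us_east, us_west = Replica(), Replica()
us_east.write("order-1", "authorized", ts=10)
# us_west's clock is skewed behind: a genuinely later update gets a
# smaller timestamp, so LWW silently drops it on convergence.
us_west.write("order-1", "captured", ts=7)
us_east.merge(us_west)
us_west.merge(us_east)
print(us_east.store["order-1"])  # ('authorized', 10): the capture was lost
```

Avoiding this class of bug requires application-level changes (idempotency, conflict-aware data types, or routing writes for a key to one region), which is exactly the larger lift described above.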

Conclusion

Our strategy of building 4 9s in a single region, with emphasis on resiliency practices like graceful degradation and isolation, allows us to focus right now on architecting against application-level failures, which are the biggest source of unavailability for us. Availability requirements always keep increasing, so we consider this a journey: we will invest in an active-active multi-region architecture when we architect for 5 9s. Multi-region is an important tool, but we want to be careful not to reach for it too soon and make our system more complex than needed, as we fundamentally believe that simpler architectures result in more reliable systems.
