Event-Driven vs Request-Response for Fast and Reliable Checkout

Published in Affirm Tech Blog, Apr 22, 2024

Authors: Rahul Bansal, Honglin Zhang and Yelena Wu

Introduction

Affirm’s mission is to deliver honest financial products that improve lives. Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay over time without any hidden fees or compounding interest.

As the popularity of Affirm’s products has grown dramatically since it was founded over a decade ago, there is a need to re-architect the technology platform to meet the associated scale, performance, and reliability requirements. The Architecture Group’s mission is to guide the Affirm team in shipping the next generation of Affirm’s technology platform.

Objective

We are customer-focused and want Affirm Checkout to be reliable and fast whenever customers look to transact.

As we re-architect Checkout from first principles, we need to evaluate whether an event-driven or a request-response architecture will best help us meet our goals.

Terminology

Event-driven architecture refers to an asynchronous architectural pattern in which a service fires an event indicating a change of state. Other services listen to these events and then act on them. The service firing the event is known as the producer, the services consuming these events are known as the consumers, and typically there is a broker in the middle. The role of the broker is to enable loose coupling of services where the producers and consumers do not need to know about each other.
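To make the producer, broker, and consumer roles concrete, here is a minimal in-memory sketch. The InMemoryBroker class, the "checkout.started" topic, and the handler lists are hypothetical names chosen for illustration, not Affirm code. Unlike a real broker, this toy delivers events synchronously inside publish; a production broker would queue them and deliver asynchronously.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a real broker (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer only knows the topic name, never the consumers.
        for handler in self._subscribers[topic]:
            handler(event)


broker = InMemoryBroker()
fraud_seen: List[str] = []
underwriting_seen: List[str] = []

# Two independent consumers react to the same change of state.
broker.subscribe("checkout.started", lambda e: fraud_seen.append(e["checkout_id"]))
broker.subscribe("checkout.started", lambda e: underwriting_seen.append(e["checkout_id"]))

# The producer fires the event and gets no result back from the consumers.
broker.publish("checkout.started", {"checkout_id": "c-1"})
```

Adding a third consumer requires no change to the producer, which is the loose coupling described above.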

Request-response architecture is the traditional pattern used to build services. This is a synchronous pattern where a client makes a request, a server processes it, sends the response, and then the client handles that response. This has been the most common way of building applications, often with protocols like HTTP and RPC.

Current Architecture

Today, Affirm uses a combination of request-response and event-driven architecture, with the majority of it written in Python. Checkout Orchestration uses RPCs to communicate with microservices responsible for identity verification, underwriting, and fraud. The microservices themselves use Celery Tasks with RabbitMQ as the broker. Celery was introduced to handle slow and long-running requests because these microservices are computationally intensive. It also helps work around Python’s Global Interpreter Lock (GIL), under which threads run concurrently but not in parallel.

The solution initially worked well to reduce overall checkout latency. However, as Checkout has scaled to more traffic over time, we have encountered latency and availability issues. To work around the Python GIL limitation, we use gevent, and over the years we have run into numerous problems with it. Because greenlets are scheduled cooperatively, a CPU-intensive greenlet can hold up the main thread for a very long time, thus starving other tasks. There have been several greenlet contention studies conducted at Affirm in the past. The difficulty has been instrumenting the system to gain clearer insights into workload distribution across multiple requests. To this day, computationally intensive components continue to be the major bottlenecks for overall checkout latency and availability. That is one of the main motivations to re-evaluate the Affirm Checkout Architecture more holistically.
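The starvation effect can be reproduced with any cooperative scheduler. The sketch below uses the standard library's asyncio as a stand-in for gevent, so it is an analogy rather than a reproduction of Affirm's setup; the heartbeat and cpu_hog names and all timings are made up for the demo. A task that asks to wake every 10 ms is delayed for as long as the CPU-bound task spins without yielding.

```python
import asyncio
import time


async def heartbeat(gaps: list, beats: int, interval: float) -> None:
    """Record how long each wake-up really took versus the requested interval."""
    for _ in range(beats):
        start = time.monotonic()
        await asyncio.sleep(interval)
        gaps.append(time.monotonic() - start)


async def cpu_hog(busy_seconds: float) -> None:
    """Spin without yielding, like a CPU-intensive greenlet."""
    await asyncio.sleep(0)  # yield once so the heartbeat is already sleeping
    deadline = time.monotonic() + busy_seconds
    while time.monotonic() < deadline:
        pass  # no await in this loop, so no other task can run


async def main() -> list:
    gaps: list = []
    await asyncio.gather(heartbeat(gaps, beats=3, interval=0.01), cpu_hog(0.2))
    return gaps


gaps = asyncio.run(main())
worst_gap = max(gaps)  # roughly the hog's busy time, far above the 10 ms asked for
```

The first heartbeat gap absorbs the full busy period, which is the same shape of latency spike a user request sees when it shares a worker with a CPU-heavy task.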

Technical Requirements

During Checkout, Affirm’s users are underwritten for every individual transaction to make a real-time credit decision. The Checkout experience involves multiple user touchpoints and therefore multiple HTTP requests. The underwriting happens synchronously while the user is waiting for a response. Therefore, Checkout needs to be able to time out a user request in order to avoid long delays. In addition, to ensure that Checkout is highly available, it needs to be resilient to partial unavailability, retry storms, service restarts, and deploys.

To summarize, Affirm Checkout that is reliable and fast must meet the following requirements:

Low latency
Ability to time out user requests
Ability to determine the status of a request
Resilient to partial outages and retry storms
Minimal impact of service restarts and deploys

Beyond these must-have requirements, here are some additional nice-to-have requirements:

Unified offline and online Checkout transaction data
Data dependency graph visualization and tracing capabilities
Ease of capacity planning

Decision & Analysis

Decision

Use request-response for Affirm Checkout.

Methodology

The Architecture Group developed prototypes for this evaluation, one with the request-response model and the other with an event-driven approach. In addition to prototype development, the Architecture Group also conducted interview sessions with companies who offer managed event-driven platform services, companies who have adopted event-driven architectures, and colleagues at peer companies experienced in both request-response and event-driven systems.

Analysis

To simplify, we assume the event-driven architecture uses a data streaming platform like Kafka and the request-response architecture uses the industry-standard RPC protocol gRPC.

We found that the request-response ecosystem is more mature for our must-have requirements. gRPC can time out a request natively. When gRPC is paired with a service mesh built on a proxy like Envoy, features such as circuit breakers and outlier detection can be configured to make Checkout resilient to retry storms, service restarts, and deploys.
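As a rough illustration of why timeouts and circuit breaking help here, the following standard-library sketch wraps a slow dependency call with a per-call timeout and a consecutive-failure breaker. It is an application-level toy, not gRPC or Envoy configuration: the CircuitBreaker class, its thresholds, and slow_underwriting are all hypothetical, and unlike a gRPC deadline this timeout only stops the caller from waiting and does not cancel the work on the other side.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after: float) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.consecutive_failures = 0
        self.opened_at = None  # monotonic time at which the breaker opened

    def call(self, fn, timeout: float):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            self.opened_at = None  # cool-down elapsed: allow one trial call
        future = _pool.submit(fn)
        try:
            result = future.result(timeout=timeout)  # caller stops waiting here
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result


def slow_underwriting() -> str:
    time.sleep(0.3)  # simulated overloaded dependency
    return "approved"


breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0)
outcomes = []
for _ in range(3):
    try:
        outcomes.append(breaker.call(slow_underwriting, timeout=0.05))
    except concurrent.futures.TimeoutError:
        outcomes.append("timed out")
    except CircuitOpenError:
        outcomes.append("rejected fast")
```

After two timeouts the third call is rejected without touching the dependency, so retries stop piling load onto a service that is already struggling.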

On the other hand, while event-driven architectures have the nice property that consumers are protected from the publisher retry storms, they don’t support other must-have requirements well.

Consider a decentralized event-driven Checkout system using the choreography style, where services communicate via events.

In such a system, there is no native support for timing out requests. The application developer needs to implement custom timeout logic to remove stale requests from the broker. It is also surprisingly hard to know the current status of a request. One suggestion is to implement an “observer” that tracks the various events for a checkout_id to report the current state of the request. However, that is also tricky to get right: the “observer” may not have processed its queue in its entirety, so the status it reports may be stale.
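The staleness problem with an “observer” can be shown in a few lines. This is a toy sketch, assuming a single in-memory queue in place of a broker topic; the status strings and the observer_poll function are invented for the example, and only checkout_id comes from the text above.

```python
from collections import deque

events = deque()    # stand-in for a broker topic
observer_view = {}  # checkout_id -> last status the observer has processed


def publish(checkout_id: str, status: str) -> None:
    events.append((checkout_id, status))


def observer_poll(max_events: int) -> None:
    """Consume at most max_events, as a lagging consumer would."""
    for _ in range(min(max_events, len(events))):
        checkout_id, status = events.popleft()
        observer_view[checkout_id] = status


# Three services have already emitted their events for the same checkout.
publish("c-1", "identity_verified")
publish("c-1", "fraud_cleared")
publish("c-1", "underwriting_approved")

observer_poll(max_events=1)            # the observer is two events behind
stale_status = observer_view["c-1"]

observer_poll(max_events=10)           # the observer catches up
current_status = observer_view["c-1"]
```

Between the two polls, anyone asking the observer about c-1 is told "identity_verified" even though underwriting has already approved the request.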

We could alternatively design an event-driven Checkout system using the orchestration style, but it would face similar technical challenges.

When it comes to nice-to-have requirements, there are no substantial differences between the two approaches. While an event-driven architecture naturally unifies offline and online data, ensuring the schema and data consistency in both worlds, a request-response architecture with an outbox pattern can achieve the same result. There are multiple vendors who support visualizing data dependency and tracing in event-driven architecture and request-response architecture.
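For readers unfamiliar with the outbox pattern, here is a minimal sketch using an in-memory SQLite database. The table names, columns, and the record_decision and relay_outbox functions are assumptions made for illustration; a real system would relay rows to a streaming platform rather than a Python list.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkouts (checkout_id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)"
)


def record_decision(checkout_id: str, status: str) -> None:
    # One transaction: the state change and its event commit or roll back together.
    with conn:
        conn.execute("INSERT INTO checkouts VALUES (?, ?)", (checkout_id, status))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"checkout_id": checkout_id, "status": status}),),
        )


def relay_outbox(publish) -> int:
    """Publish unsent rows, then mark them sent. Delivery is at-least-once."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
record_decision("c-1", "approved")
first_pass = relay_outbox(published.append)   # relays the one pending row
second_pass = relay_outbox(published.append)  # nothing left to relay
```

Because the event row is written in the same transaction as the online record, the offline stream cannot contain a decision the online store never committed, and a crash between publish and the UPDATE only causes a duplicate, not a loss.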

Overall, we found that event-driven architecture does not work for Checkout because there is a fundamental mismatch. The strength of event-driven architecture is that it naturally decouples services. Affirm Checkout is the opposite case: the various microservices are inherently coupled because together they must produce a response for the user in a very short timeframe. The usual benefit of decoupling, that other components can continue processing while one component is down, does not help in our use case. We are better off failing the request after a timeout than queueing it for the component that is down.

The Architecture Group also engaged with companies who have implemented event-driven architectures and companies who offer event-driven platform services. Despite our efforts, we did not find many successful event-driven systems running in production for similar use cases at scale. The recommendation was consistently to use request-response for the Affirm Checkout re-architecture.

Conclusion

We concluded that the request-response architecture makes the most sense for Affirm Checkout. We’ll be utilizing an event-driven approach for most use cases where a user is not synchronously waiting for a response. In parallel, we evaluated language choices and decided to rewrite most Checkout microservices in Kotlin, eliminating the need for Celery going forward.

After carefully evaluating the options to make this decision, we are confident that the proposed architecture will lead to a fast and reliable Checkout. We look forward to sharing more about our journey in future blog posts!