Improving Checkout Performance

Affirm Tech Blog · Mar 14, 2019

Written by Garrett Schlesinger and Chad Lagore, Affirm Risk Engineering

Affirm’s business is growing rapidly as we launch new enterprise merchants. Improving the latency of our loan application flow is critical as traffic scales, and it also presents interesting design challenges for our engineering team. This post will detail these challenges and our strategy for scaling the application process for enterprise volume.

Introduction

In a November post, we discussed our high-level strategy for improving site quality. As we approached 2018 Black Friday, we aggressively tackled as much low-hanging fruit in our Python code base as we could to improve the performance of our loan application process, which completes online in seconds. Optimizations in this phase generally took on five forms:

  1. Eliminating redundant or unnecessary database reads, including N+1 query issues with SQLAlchemy
  2. Caching the results of expensive queries with dogpile+redis
  3. Optimizing CPU-bound code making heavy use of Python’s cProfile and our in-house risk verification framework to guard against regressions
  4. Removing CPU-bound work from the critical request path
  5. Doing more IO (DB reads, RPC calls, and external network requests) concurrently with gevent (a sketch follows this list)
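
As a concrete flavor of item 5 (the other items follow similar shapes), here is a minimal sketch of concurrent IO with gevent. The endpoints and fetch functions are hypothetical stand-ins, not our production code:

```python
# Minimal sketch of optimization 5: issuing independent reads concurrently
# with gevent. The endpoints and fetch functions are hypothetical.
import gevent
from gevent import monkey

monkey.patch_all()  # patch blocking stdlib IO so greenlets can interleave

import requests

VENDOR_URL = "https://vendor.example.com"  # hypothetical endpoint

def fetch_credit_report(user_id):
    return requests.get(f"{VENDOR_URL}/credit/{user_id}", timeout=5).json()

def fetch_identity_signals(user_id):
    return requests.get(f"{VENDOR_URL}/identity/{user_id}", timeout=5).json()

def gather_signals(user_id):
    # Each read runs in its own greenlet; total wall time approaches the
    # slowest call rather than the sum of all calls.
    jobs = [
        gevent.spawn(fetch_credit_report, user_id),
        gevent.spawn(fetch_identity_signals, user_id),
    ]
    gevent.joinall(jobs, raise_error=True)
    return {"credit": jobs[0].value, "identity": jobs[1].value}
```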

These optimizations led to dramatic improvements in our application performance (~40% drop in median timing) without requiring a substantial re-architecture of the system. At peak traffic levels on Cyber Monday 2018, the existing system served 150 underwriting decisions per minute, a 3x increase from the year before.

To handle this year’s scale, we need to plan for a much higher number of decisions per minute. However, we’ve now exhausted most of the low-hanging latency optimizations. To keep making the latency improvements we need to survive the enterprise-scale traffic we’re onboarding this year and beyond, we have to redesign our application flow.

Current Architecture and Challenges

Presently, we have a monolithic application. Requests for loan applications are data hungry, pulling in several megabytes of data from internal modules and third-party vendors in order to extract signals and make decisions based on a combination of hard-coded rules and machine learning models. The vast majority of this data is read and transformed in the same process as the request, and our primary mode of concurrency is gevent coroutines.

Gevent has helped us scale to this point and is incredibly useful for keeping external connections warm, but Python with gevent offers no parallelism for CPU-bound workloads within a single process. We’ve observed a lot of contention with gevent, and we recognize that at a certain point there’s no way to further optimize the performance of our application process without truly parallelizing CPU-bound code. This observation is prompting the Risk Engineering team at Affirm to decompose our monolithic loan application process and distribute the work across many processes, both to achieve greater end-to-end performance and to unlock the ability to add new data sources to our models without impacting user-perceived performance.
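
To make the limitation concrete, here is a small illustration (not Affirm code) of a CPU-bound greenlet starving an IO-bound sibling, since gevent only switches greenlets at IO boundaries:

```python
# Illustration (not Affirm code): greenlets share one OS thread, so a
# CPU-bound greenlet starves its siblings until it finishes.
import time
import gevent

def cpu_bound():
    total = 0
    for i in range(10_000_000):  # never touches IO, so never yields
        total += i
    return total

def io_bound():
    gevent.sleep(0.01)  # cooperative yield; "IO" completes in 10ms
    return "done"

start = time.time()
gevent.joinall([gevent.spawn(cpu_bound), gevent.spawn(io_bound)])
# io_bound only finishes after cpu_bound releases the event loop, so the
# 10ms "IO" is delayed by the full CPU-bound runtime.
print(f"elapsed: {time.time() - start:.2f}s")
```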

Future Architecture

Our decomposition begins with the extraction of Affirm’s decisioning systems. Affirm’s current decisioning domains fall broadly into two categories: Identity and Terms (aka Credit). As such, these categories will form the first services we port. This work will coincide with the construction of a Decision Orchestrator service, a state machine that interacts with the frontend and offloads the decisioning tasks for a particular state to the relevant service. Each decisioning service will be encapsulated in its own deployable unit — a Dockerized application that runs in its own process and can be scaled horizontally as demand increases.
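
As a rough sketch of what such an orchestrator could look like (the states, client interfaces, and transitions below are illustrative assumptions, not our actual implementation):

```python
# Hypothetical sketch of a Decision Orchestrator: a state machine that
# offloads each state's decisioning to the relevant service over RPC.
from enum import Enum

class State(Enum):
    IDENTITY = "identity"
    TERMS = "terms"
    COMPLETE = "complete"
    DECLINED = "declined"

class DecisionOrchestrator:
    def __init__(self, identity_client, terms_client):
        # Each client is an RPC stub for a decisioning service.
        self.identity_client = identity_client
        self.terms_client = terms_client

    def step(self, state, application):
        """Run the decision for the current state and return the next state."""
        if state is State.IDENTITY:
            decision = self.identity_client.decide(application)
            return State.TERMS if decision.approved else State.DECLINED
        if state is State.TERMS:
            decision = self.terms_client.decide(application)
            return State.COMPLETE if decision.approved else State.DECLINED
        return state  # terminal states

    def run(self, application):
        state = State.IDENTITY
        while state not in (State.COMPLETE, State.DECLINED):
            state = self.step(state, application)
        return state
```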

This modular approach will enable the parallelization of CPU- and IO-bound workloads (like data fetching and featurization, a process that accounts for 75% of our decisioning latency at the 98th percentile). Other beneficial side effects we expect include clearer code ownership, simpler decisioning logic, a narrower set of responsibilities for each service (giving site-reliability engineers more flexibility in how they allocate resources), shorter deploy times, and faster iteration on products.

Building Decisioning Libraries

Decomposing a system like decisioning begins with identifying the key abstractions in use and carving them out into standalone, well-tested libraries capable of solving the same problem in a variety of contexts. Our decisioning libraries must give engineers the tools to stand up a service quickly, resolve relevant signals, define a human-readable decision tree, and log decision data for analysis and training-dataset preparation. A large portion of this work was already accomplished when we developed our Risk Verification System, so we are taking this opportunity to finish porting our decisioning abstractions into standalone libraries for use in new deployable targets.
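
To give a flavor of the abstraction, a human-readable decision tree over resolved signals might read like the following; the signal names, thresholds, and API are hypothetical, not production policy:

```python
# Hypothetical flavor of a decisioning-library decision tree; the signal
# names and thresholds are illustrative only.
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    DECLINE = "decline"

def credit_decision(signals):
    """A human-readable decision tree over resolved signals."""
    if signals["fico_score"] < 550:
        return Decision.DECLINE
    if signals["debt_to_income"] > 0.45:
        return Decision.DECLINE
    return Decision.APPROVE

# Decisions and their inputs are logged for analysis and training datasets.
assert credit_decision({"fico_score": 680, "debt_to_income": 0.31}) is Decision.APPROVE
```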

Redefining the Service Dependency Graph

Our new architecture required us to construct a new service dependency graph. Internal reads must still route through the monolithic application for the time being, meaning that decisioning services have data dependencies on the monolith. To allow decisioning services to read raw data feeds over RPC, we are implementing RPC APIs for each service. Messages are constructed using an in-house IDL framework that uses protobuf as the serialization layer and plain Python objects in memory. The monolithic application can be horizontally scaled for reads, which moves the downstream bottleneck to AWS Aurora database connections; Aurora is itself scalable. External reads must still be routed to their respective third-party providers. CPU-bound featurization takes place on the decisioning nodes themselves, whose instances will be scaled according to CPU utilization. We will continue to evaluate machine learning predictions over Celery, with the intention of implementing a dedicated predictions service later this year.
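
For illustration, evaluating a prediction over Celery could look roughly like this sketch, where the task name, broker/backend URLs, and stubbed scoring logic are all assumptions:

```python
# Illustrative sketch of evaluating an ML prediction over Celery; the task
# name, broker/backend URLs, and scoring stub are assumptions.
from celery import Celery

app = Celery(
    "predictions",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(name="predictions.score_credit_model")
def score_credit_model(features):
    # In production this would load a model and return a score; stubbed
    # here to keep the sketch self-contained.
    return 0.97

def evaluate_prediction(features):
    # The decisioning node enqueues the task and blocks on the result.
    return score_credit_model.delay(features).get(timeout=2.0)
```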

Finally, we plan to take advantage of optimizations that have more obvious implementation strategies when client and server follow a declarative state machine, single-stepping through product flows. Certain data resources are immutable within a given product flow. For example, once instantiated, a credit report cannot change during the identity and credit decisions made on a loan application. Immutability of this kind allows for aggressive caching of the resource and improves the user-perceived performance of the underwriting process. A cached credit report takes roughly 30ms to fetch over RPC at the 98th percentile, compared to a database read that may incur an order of magnitude more latency. Caching this resource avoids unnecessary database reads and can now be orchestrated at a higher level of abstraction.
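
A minimal sketch of this kind of caching, assuming a Redis-backed dogpile region (the region configuration, TTL, and function names are illustrative):

```python
# Sketch of caching an immutable resource with dogpile + Redis; the region
# configuration, TTL, and function names are illustrative.
from dogpile.cache import make_region

credit_report_region = make_region().configure(
    "dogpile.cache.redis",
    expiration_time=3600,  # safe: the report is immutable within a flow
    arguments={"host": "localhost", "port": 6379, "db": 0},
)

@credit_report_region.cache_on_arguments()
def get_credit_report(application_id):
    # A miss falls through to the expensive read; every later decision in
    # the flow gets the cached copy instead.
    return _read_credit_report_from_db(application_id)

def _read_credit_report_from_db(application_id):
    # Stand-in for the order-of-magnitude-slower database read.
    return {"application_id": application_id, "fico_score": 705}
```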

Current Progress

In February we designed and reviewed our new architecture choices, using Affirm’s engineering review processes to reach consensus on decisions that affect teams beyond Risk Engineering. In parallel, we stood up the Decision Orchestrator in our staging environment and migrated the last of our decisioning tooling into standalone libraries.

Recently, we have begun standing up externally facing APIs to fetch data from the monolithic application. This work will continue through March, after which we’ll stand up the Credit decisioning service and begin making shadow decisions in parallel with the legacy decisioning services in the monolithic application. Our Risk Verification system gives us strong guarantees that both services are making the same decisions in production. After observing parity at the decision level, we will begin the cutover, using our internal Affirm Experiments Platform (AXP) to fork portions of live traffic into the new services, with the ultimate goal of deprecating the existing decisioning services well before Black Friday 2019.
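
A sketch of the shadow-decision pattern under these assumptions (the service interfaces are hypothetical; in practice the comparison runs through our Risk Verification system):

```python
# Hypothetical sketch of the shadow-decision pattern: serve the legacy
# decision while comparing the new service's answer off the critical path.
import logging
import gevent

log = logging.getLogger("decision_parity")

def decide_with_shadow(legacy_service, new_service, application_id):
    legacy_decision = legacy_service.decide(application_id)

    def shadow():
        new_decision = new_service.decide(application_id)
        if new_decision != legacy_decision:
            log.warning(
                "parity mismatch for %s: legacy=%s new=%s",
                application_id, legacy_decision, new_decision,
            )

    gevent.spawn(shadow)  # users never wait on the shadow call
    return legacy_decision
```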

Conclusion

We are in the middle of this and other exciting scaling challenges at Affirm, and we need more exceptional talent to help us deliver on our next order of magnitude of business growth. If this challenge sounds right for you, apply for one of our openings on our careers page.
