Handling Black Friday Scale

Elaine Arbaugh
Affirm Tech Blog
Jan 31, 2019

This year, over 165 million Americans shopped between Thanksgiving and Cyber Monday, and more than 130 million of them did at least some of that shopping online. Given Affirm’s partnerships with thousands of online retailers, Cyber Weekend is the most important period of the year for us — so it was critical to make sure our infrastructure was ready to handle up to five times a normal day’s traffic without any downtime or performance degradation.

Preparations for Cyber Weekend don’t start weeks, or even months, in advance — our planning for Cyber Weekend 2018 actually started in late 2017, when we applied that year’s Cyber Weekend learnings to prioritize infrastructure projects we foresaw would be most important for this year’s peak shopping period.

The effort was truly cross-functional, with teams across the organization (including Customer Operations, Risk Operations, and Engineering) working throughout the year and through Thanksgiving week to support customers and ensure platform stability. Thanks to everyone’s hard work, Cyber Weekend was once again a huge success for Affirm! We handled nearly three times 2017 Cyber Weekend’s loan volume without any outages or performance degradations. At peak traffic levels, we were making almost 150 underwriting decisions per minute and serving over 150,000 promotional messages per minute.

Cyber Weekend loan volume compared to the rest of November

Here’s how we approached preparing for the most important weekend of the year for our merchants and our company, starting one year out.

Early 2018: System improvements

ProxySQL

One potential bottleneck on our system was database connections. For the most part, we use Amazon Aurora MySQL databases in our platform, and with Aurora, it’s very easy to create multiple read replicas to handle read-only database access. However, with our database access patterns, we access the master databases much more frequently than the read replicas, either because we are inserting or updating data or because we need up-to-date data and cannot tolerate replica lag. As a result, we create a lot of connections to the master database. MySQL can experience performance issues when scaling to handle large numbers of connections since it creates a new thread for every connection, causing high resource usage and significant overhead for thread management and scheduling. AWS also limits database connections on Aurora MySQL to 16,000, and we anticipated crossing this limit this year. In order to handle Cyber Weekend 2018 without downtime, we knew we had to change the way we handled database connections.

We decided to use ProxySQL to handle MySQL connection pooling. ProxySQL also has a lot of other cool features like query caching, slow query logging, query routing, and failover support which we plan to take advantage of later, but we started with just connection pooling since it was the most urgent issue. After debating alternatives, like Vitess; speccing out our solution; writing all the platform and configuration code to set up ProxySQL servers; and testing on our development and stage servers, we slowly and carefully rolled ProxySQL out in production.
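To give a sense of what this looks like from the application side, here’s a minimal sketch of a client connecting through ProxySQL’s MySQL listener rather than directly to the Aurora writer endpoint, assuming PyMySQL; the host name, credentials, and database name are placeholders, not our actual configuration.

```python
import pymysql

# Connect to ProxySQL's MySQL interface (port 6033 by default) instead of the
# Aurora writer endpoint. Host, credentials, and database are placeholders.
conn = pymysql.connect(
    host="proxysql.internal.example.com",
    port=6033,
    user="app_user",
    password="app_password",
    database="affirm",
)

try:
    with conn.cursor() as cursor:
        # ProxySQL multiplexes this query onto one of its pooled backend
        # connections, so thousands of client connections map onto far fewer
        # connections against the Aurora master.
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
finally:
    conn.close()
```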

With ProxySQL, the ratio of server connections (the actual number of connections ProxySQL makes to the database) to client connections (the number of connections our servers try to create) is between 5% and 15%. This change gave us a lot more breathing room before hitting any MySQL connection limits, allowing us to support Cyber Weekend traffic.

Client vs server connections with ProxySQL

Autoscaling

Another project that was key to our Cyber Weekend success was moving to autoscaling groups for provisioning servers and automating server setup. Previously, if we wanted to add new machines in production to handle additional load, we had to manually run provisioning scripts as well as 10+ setup commands, a slow and potentially error-prone process. Before this year, manually provisioning machines was good enough because we generally only had to add small numbers of machines, and infrequently. With the huge traffic increases we were expecting this year, it was no longer feasible to manually add the numbers of different types of machines we required for Cyber Weekend, so we decided to move to autoscaling groups. This project involved automating server provisioning using AWS autoscaling groups; setting up the servers, including applying configs, deploying our code, and starting any required processes; and, if required, adding the servers to load balancers to route traffic to them.

We also took this opportunity to switch from using Amazon’s Classic Load Balancers to Application Load Balancers and move our machines to the newest generation EC2 instances (specifically, we moved from c3s to c5s). Application load balancers allow routing traffic based on URL paths or host headers, meaning we could switch from routing requests to multiple load balancers using CloudFront rules to only having one load balancer and doing all the routing at the load balancer level, simplifying our architecture. Newer generation EC2 instances are cheaper and more performant, and we saw huge (20+%) latency improvements after we switched over.

Latency for a task on c3 (green) vs c5 (yellow) instances
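To illustrate the path-based routing that made this consolidation possible, here’s a hedged boto3 sketch of adding a rule to a single ALB listener; the ARNs and path pattern are placeholders rather than our real setup.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Forward promo-related paths on the single ALB to a dedicated target group.
# The ARNs and the path pattern are illustrative placeholders.
elbv2.create_rule(
    ListenerArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:listener/app/main/abc123/def456",
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/api/promos/*"]}],
    Actions=[
        {
            "Type": "forward",
            "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/promos/abc123",
        }
    ],
)
```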

After the autoscaling project, all we have to do to set up and run new production servers is change a configuration in the AWS console, and the rest will be handled automatically. This project made it much easier to scale up for Cyber Weekend, and also made us confident that we could react quickly to any capacity emergencies we might have during the weekend.
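The same adjustment can also be made through the API rather than the console; here’s a minimal boto3 sketch, with a hypothetical group name and capacity.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Bump the desired number of web servers for the peak; the autoscaling group
# handles provisioning, setup, and load balancer registration from there.
# The group name and capacity are illustrative placeholders.
autoscaling.set_desired_capacity(
    AutoScalingGroupName="prod-web",
    DesiredCapacity=40,
    HonorCooldown=False,
)
```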

Additional future work for the autoscaling project includes automatically scaling up to respond to increased traffic instead of having to manually update the number of machines.

Tech Debt Burndown

In addition to these infrastructure projects, in the months leading up to Black Friday engineers across Affirm worked to make our code more performant, especially on endpoints we knew to be slow or suboptimally implemented. As Garrett discussed in the last post, Affirm’s Culture of Improving Product Quality, we improved our median Application Response time by around 40% in the months leading up to Black Friday. With these improvements, we require fewer resources to do the same amount of work and can make more efficient use of our machines. The latency improvements also improve the user experience — some people may abandon the checkout flow if we take too long to make a decision.

We also made some improvements to guard against failure conditions. For example, we added timeouts on all our third-party requests to guard against the possibility of massive latency increases from those requests putting extra load on our servers, which could impact other requests and cause user-facing errors.
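As a rough illustration of the pattern (not our actual client code), here’s what a hard timeout on an outbound request looks like with the requests library; the URL and limits are placeholders.

```python
import requests

try:
    # Fail fast instead of letting a slow partner tie up a worker:
    # 3 seconds to establish the connection, 10 seconds to receive a response.
    resp = requests.get(
        "https://partner.example.com/v1/credit-report",  # placeholder URL
        timeout=(3, 10),
    )
    resp.raise_for_status()
except requests.exceptions.Timeout:
    # Surface a clean, fast failure instead of a hung request that backs up
    # other traffic on the same worker.
    resp = None
```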

The work that everyone in engineering did to speed up our code and more gracefully handle possible error cases was critical to our success on Cyber Weekend.

Early November: Scaling to handle 5x load

Starting a few weeks before Black Friday, we began to prepare for additional load. We had two requirements: scale out our existing infrastructure to handle 5x scale, and scale up our backup infrastructure in a second AWS region that we could fail over to in case of an AWS outage.

Handling 5x scale

For many infrastructure components, handling 5x scale is as easy as either adding more of a component or expanding its capacity. For our EC2 instances, we added additional servers by scaling up our autoscaling groups. We distribute our servers between two availability zones to be more resilient — for example, if lightning strikes a data center and takes it offline, we won’t go down. We also have duplicate instances for every machine type, so we have no single point of failure in our systems. For databases, we increased the database instance size (for MySQL) or provisioned throughput (for DynamoDB).
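As one example of the “expand its capacity” case, raising provisioned throughput on a DynamoDB table is a single API call; here’s a hedged boto3 sketch with a placeholder table name and capacity values.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Raise read/write capacity ahead of the traffic peak.
# Table name and capacity units are illustrative placeholders.
dynamodb.update_table(
    TableName="loan-events",
    ProvisionedThroughput={
        "ReadCapacityUnits": 5000,
        "WriteCapacityUnits": 2000,
    },
)
```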

We also worked with our AWS enterprise support team to create an Infrastructure Event Management (IEM) plan. We gathered information about the AWS components that would be experiencing 5x load during Cyber Weekend and discussed what we needed to handle that load. The AWS team helped us reserve EC2 capacity for certain instance types to make sure that we could add machines if we needed them, and also set up prewarming for all our load balancers. Load balancers can start failing under large, sudden traffic spikes, but prewarming prepares them ahead of time for the maximum traffic we anticipate so they can absorb those spikes. The IEM was very helpful in ensuring that all our AWS components would scale.

Finally, we contacted all of our third-party partners (for gathering data used in underwriting, creating virtual credit cards, making payments, etc.) to make sure that our traffic increase wouldn’t cause any issues on their side.

Load testing

To validate that we had added sufficient capacity for our promos service, which outputs values for promotional text on merchants’ sites based on merchant and user data, we load tested it using Locust.

Sample promo messaging

Load testing our checkout flow and many of our other endpoints is difficult because so much of what we do requires real user data — for example, hitting external services to gather credit reports or data about a user; user input, like inputting a PIN as part of authentication; or writing data to the database. Since our promos service is much simpler and accesses data in a read-only pattern, it was straightforward to set up load testing. During our maintenance window, we ramped up promos traffic to 5x and monitored endpoint latency, error rates, and system stats for our promos servers. When we saw latencies increase, we added additional web servers to determine if server capacity was the bottleneck, and we used the results of this test to decide how many instances to use for Cyber Weekend.
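A minimal locustfile for this kind of read-only test might look like the sketch below; it’s written against the current Locust API rather than the version we ran at the time, and the endpoint path and parameters are placeholders, not our real promos API.

```python
from locust import HttpUser, task, between


class PromoUser(HttpUser):
    # Simulate a shopper's browser requesting promotional messaging.
    wait_time = between(1, 3)

    @task
    def get_promo(self):
        # Placeholder endpoint and query parameters.
        self.client.get(
            "/api/promos/v2/message",
            params={"merchant_id": "123", "amount": "25000"},
            name="/api/promos/v2/message",
        )
```

From the Locust UI we could then ramp the number of simulated users up toward the 5x target while watching endpoint latency, error rates, and system stats.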

Redundant infrastructure

AWS region locations

Although AWS is generally very reliable (knock on wood), we always prepare for all possible failures, including AWS outages. For all our production components, we have redundant infrastructure in multiple availability zones, which would keep us from going down in case of an issue affecting one data center (for instance, if lightning struck it or there was a power outage).

Although we host all our live infrastructure in one east coast AWS region, we also have redundant infrastructure in a west coast region that we are prepared to fail over to quickly in case of a region-wide outage. For some components, like EC2, setting up redundant infrastructure was easy, but for others it required more work and planning.

Servers

To scale up our EC2 instances in the west coast region, we added capacity to our autoscaling groups. Since we don’t store any data on our machines, the only complication with adding west coast machines was updating configs to make sure that the west coast instances would talk to other servers and databases in the west coast region (to avoid slow and costly cross-region requests).

Databases

Multi-region database infrastructure is more complicated than application server infrastructure since it requires replicating data across regions. We use four database types in production: Aurora MySQL, DynamoDB, AWS ElastiCache for Redis (for caching high-volume MySQL query results), and AWS ElastiCache for Memcached (used as a celery result store).

Aurora MySQL databases are straightforward to set up since Aurora supports cross-region read replicas. So, we just had to add cross-region read replicas for all our RDS instances and double-check that the parameter groups, which determine database configs, were the same in both regions.

DynamoDB multi-region infrastructure was more complicated to set up. With Global Tables, AWS supports multi-region multi-master infrastructure for DynamoDB — but to set this up as a backup, we first had to copy all our east coast Dynamo data into Global Tables. To do this, we used AWS EMR and DataPipeline to import east coast table data into S3 and then export the data in S3 to the Global Tables following these instructions. With this process, we could do a one-time, bulk upload of Dynamo data up to some point in time. For the data added to the table after the bulk export started, we used a DynamoDB Cross-Region Replication library to apply real-time updates to the Global Table by sending the data to a Kinesis stream which applied updates to the table after we finished the bulk upload. For our larger tables, we had to alter this process slightly since a Kinesis stream can only hold up to 24 hours of data, and the data import/export process took over 24 hours. So, we instead streamed real-time updates into a blank table, and after the bulk upload finished, copied that table into the real table and changed the stream output to send data to the real table.

DynamoDB data replication process
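For reference, joining same-named tables (with DynamoDB Streams enabled) in both regions into a Global Table is a single API call; the original 2017 version of the feature requires the replica tables to start out empty, which is why the bulk copy described above was loaded into the new Global Table. Here’s a hedged boto3 sketch with a placeholder table name.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Join the east coast and west coast replicas of a table into a Global Table.
# The table name and regions are illustrative placeholders.
dynamodb.create_global_table(
    GlobalTableName="loan-events",
    ReplicationGroup=[
        {"RegionName": "us-east-1"},
        {"RegionName": "us-west-2"},
    ],
)
```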

For Redis and Memcached, we added duplicate infrastructure in both the east coast and west coast regions, but did not replicate data. For the caching use case, we would simply hit the database directly for all requests; although this would put more load on the database, we determined that the databases could still handle it. For the celery result store use case, the celery results are only accessed very soon after they are written, so a small amount of data loss would not cause much service disruption.
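The caching fallback amounts to a cache-aside read that degrades to the database whenever Redis is unreachable; here’s a simplified Python sketch of that pattern, with a hypothetical host, key, and TTL (this is not our actual caching layer).

```python
import json

import redis

cache = redis.Redis(host="redis.internal.example.com", port=6379)  # placeholder host


def get_cached_query_result(key, run_query):
    """Return a cached query result, falling back to the database on any cache failure."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        # Cache is unreachable (e.g. right after a regional failover): skip it
        # and accept the extra read load on the database.
        return run_query()

    result = run_query()
    try:
        cache.setex(key, 300, json.dumps(result))  # 5-minute TTL, illustrative
    except redis.RedisError:
        pass
    return result
```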

S3

S3 supports cross-region replication, so we set up replication for all the S3 buckets that are critical to our operations. We keep a branch with the changes required to switch over to the west coast buckets, ready to deploy in case we need to fail over S3.
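Enabling cross-region replication is a one-time bucket configuration; here’s a hedged boto3 sketch, with placeholder bucket names and IAM role.

```python
import boto3

s3 = boto3.client("s3")

# Replicate every new object in the east coast bucket to its west coast twin.
# Both buckets must have versioning enabled. Bucket names and the IAM role ARN
# are illustrative placeholders.
s3.put_bucket_replication(
    Bucket="critical-data-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": "arn:aws:s3:::critical-data-us-west-2"},
            }
        ],
    },
)
```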

Monitoring/Alerting

We also duplicated our monitoring and alerting infrastructure, which I discussed in a previous blog post, into the west coast region. We added duplicate Riemann (metrics aggregation) infrastructure in the region, and updated the Riemann configuration in both regions to send metrics to Elasticsearch instances in both regions. If we failed over servers, we would send metrics to the west coast Riemann servers automatically, and we would have minimal disruption in our metrics pipeline. If Elasticsearch went down in either region, we could point Grafana (which we use for creating graphs based on time series data) and Cabot (our alerting software) to the other region’s Elasticsearch cluster.

We also added duplicate servers, load balancers, and a read replica database for Cabot in the west coast region. We actually have active-active Cabot infrastructure — our west coast Cabot server runs celery tasks, including ones that run status checks and send out alerts. We also added duplicate Cabot servers in another availability zone, which run both celery and a gunicorn webserver.

Failover process

Most of our components in production are accessed via a Route 53 DNS entry. This makes failing over components very simple: we set up a weighted routing policy for the corresponding east coast and west coast components, initially with the east coast component at 100% and the west coast component at 0%. To fail over, we would just set the east coast component to 0% and the west coast component to 100%.
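Flipping the weights programmatically looks roughly like the sketch below with boto3; the hosted zone, record name, set identifiers, and endpoints are all placeholders.

```python
import boto3

route53 = boto3.client("route53")


def set_weight(zone_id, name, set_identifier, dns_name, weight):
    """Update one weighted CNAME record; all values here are illustrative."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "CNAME",
                        "SetIdentifier": set_identifier,
                        "Weight": weight,
                        "TTL": 60,
                        "ResourceRecords": [{"Value": dns_name}],
                    },
                }
            ]
        },
    )


# Fail over: send all traffic to the west coast component.
set_weight("Z123EXAMPLE", "service.internal.example.com", "east", "east-lb.example.com", 0)
set_weight("Z123EXAMPLE", "service.internal.example.com", "west", "west-lb.example.com", 100)
```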

For databases, to fail over we would promote another machine to be the master in the AWS console as well as update the DNS entry.

Documentation and planning

Before Black Friday, we also updated our documentation for handling production issues. For all the multi-region components, this meant documenting the process for failing over to the west coast, as well as any other components that would be affected if we failed over. For example, if we failed over our servers, we would also have to fail over databases, since cross-region database calls are very slow. In general, we made our instructions as detailed and clear as possible (for example, we linked pull requests for anything requiring code changes) so that the oncall engineer would be able to execute them quickly with minimal room for error.

We also discussed the criticality of components so that we would know which components to fail over first. Our first priority is making sure users can check out, so we would fix that first, then move on to other components. Our SRE team met every day in the week leading up to Black Friday to discuss anything outstanding we had to finish before the date and to walk through our responses to failure scenarios.

In addition, we established and broadly communicated a chain of command in case of incidents. We wanted to establish an escalation path for anyone who noticed an issue, as well as avoid a “too many cooks” scenario when trying to resolve a problem. We also planned to follow our existing incident communication plan to update merchants, the teams who work with them, and our operations teams about any ongoing issues and resolutions.

The main event: Cyber Weekend

For Black Friday and Cyber Monday, we chunked our platform and infrastructure oncall schedules into 8-hour shifts so that our oncall engineers wouldn’t get too fatigued. The oncall engineers, as well as many others, actively monitored our critical Grafana dashboards and Rollbar so that we would catch critical errors as soon as possible — although we have alerts that would call us for many critical issues, they may not go off until a couple of minutes after an incident begins, which is a big deal on Cyber Weekend. Some of the dashboards we monitored included:

  • HTTP status codes, latencies, and throughputs for critical endpoints
  • Cloudfront and ELB 500 rate
  • Celery task successes, errors, and latency
  • System stats for servers and databases
  • Checkout funnel metrics
  • Third party request successes, errors, and latencies

Sample system metrics

We caught one minor issue early on the morning of Black Friday. Some of our autoscaling servers had a config set up incorrectly, which caused us to call AWS STS AssumeRole every time we hit our credentials store, and we were calling it frequently enough to get rate limited by AWS. Since we were actively monitoring, we noticed the issue after only one error and were able to update the configs before we saw any major impact (we had fewer than 15 total errors, and the errors did not impact checkouts).
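The fix boiled down to reusing assumed-role credentials until they are close to expiring instead of calling AssumeRole on every access to the credentials store; here’s a simplified sketch of that pattern (the role ARN and session name are placeholders, and this is not our actual credentials code).

```python
import datetime

import boto3

sts = boto3.client("sts")
_cached_credentials = None


def get_credentials():
    """Return temporary credentials, calling AssumeRole only when near expiry."""
    global _cached_credentials
    now = datetime.datetime.now(datetime.timezone.utc)
    if (
        _cached_credentials is None
        or _cached_credentials["Expiration"] - now < datetime.timedelta(minutes=5)
    ):
        # Only hit STS when the cached credentials are missing or about to
        # expire, instead of on every access to the credentials store.
        response = sts.assume_role(
            RoleArn="arn:aws:iam::111122223333:role/credentials-store-reader",  # placeholder
            RoleSessionName="web-server",
        )
        _cached_credentials = response["Credentials"]
    return _cached_credentials
```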

Preparing for 2019

After a successful Cyber Weekend this year, our team is already preparing for what we need to do for next Cyber Weekend. We’re also planning ahead for significant traffic increases even earlier in the year as a result of exciting upcoming merchant launches!

Some of our plans for 2019 include setting up multi-region active-active infrastructure so that we can handle region-specific outages more smoothly without manual intervention. We’re also planning on adding policies so that our autoscaling groups actually scale with traffic instead of requiring a manual config change, and resharding our databases and reallocating our databases on RDS instances so that we can scale out our master databases even more.

Preparing for Black Friday scale in 2018 was a long-term, cross-functional effort. Thanks to everyone’s hard work and preparation, we were able to handle Black Friday scale gracefully this year and are ready for even bigger challenges in 2019.

If handling these types of challenges sounds interesting to you (or if you love Segways), apply to join our team on our careers page!
