Affirm’s Culture of Improving Product Quality

Garrett Schlesinger
Published in Affirm Tech Blog · 10 min read · Dec 1, 2018

Affirm’s mission is to build honest financial products that improve lives. From an engineering perspective, an important facet of this mission is to ensure that our products continually meet a high quality bar.

This post discusses Affirm Engineering's three-fold strategy for continually improving the quality of our user experience: providing visibility into objective quality measurements, establishing organization-wide accountability for maintaining and improving quality, and prioritizing concrete team initiatives that improve quality.

Visibility: how Affirm measures product quality

The first step toward improving product quality is to agree on what objective metrics can serve as a proxy for quality. At Affirm, we measure quality through service-level indicators (SLIs) and production issue resolution. We’ll discuss what these mean from our customers’ perspectives, and walk through our regular process for reviewing metrics and incidents to ensure that we have adequate monitoring coverage.

Service-level Indicators (SLIs)

In a previous blog post, Elaine Arbaugh gave an excellent overview of our Metrics, Monitoring, and Alerting systems at Affirm. Using these monitoring systems, we've converged on three primary health indicators for our customer-facing product flows: API throughput (e.g. # of requests per minute), customer-facing error rates (usually phrased as a reliability percentage, e.g. % of successful requests aggregated daily), and customer-facing latencies (usually expressed in percentiles, e.g. confirming an approved loan takes 800ms at the 50th percentile). We look at these three SLIs in conjunction with one another in order to best plan for scale and account for trends. For example, we noticed in the past that latency rises with throughput, meaning that as we expand the number of merchants integrating Affirm as a payment option, our service could degrade. This insight is crucial in prompting us to plan for increased capacity and to make changes to our data models that allow us to scale our site horizontally.
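
To make these indicators concrete, here is a minimal sketch of how the three SLIs could be computed from a window of request logs. The tuple format, field choices, and aggregation window are illustrative assumptions, not Affirm's actual monitoring pipeline.

```python
# Minimal sketch: computing the three SLIs described above from a window of
# API request logs. Each request is assumed to be a (status_code, duration_ms)
# tuple; this is illustrative, not Affirm's actual monitoring pipeline.
def compute_slis(requests, window_minutes):
    total = len(requests)
    successes = sum(1 for status, _ in requests if status < 500)
    durations = sorted(duration for _, duration in requests)

    throughput_per_min = total / window_minutes      # API throughput
    reliability_pct = 100.0 * successes / total      # customer-facing success rate
    p50_ms = durations[int(0.50 * (total - 1))]      # median latency
    p98_ms = durations[int(0.98 * (total - 1))]      # tail latency
    return throughput_per_min, reliability_pct, p50_ms, p98_ms

# Example: 10,000 requests over a day, 3 of which failed.
requests = [(200, 800)] * 9_997 + [(500, 800)] * 3
print(compute_slis(requests, window_minutes=24 * 60))
```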

We put a particular emphasis on tracking customer-facing success rates and latencies, since their correlation to impacted lives is much more obvious. For example, over the past month our checkout product's APIs had 99.97% reliability, which means that 3 out of every 10,000 customers faced some kind of critical error preventing them from using our product. This brings the numeric concept of SLIs to life and gives our engineering team a clear mandate: fixing bugs and building failovers for when one of the vendors we rely on has an outage tangibly translates into improving more lives with our products. In regard to latency, over the past 30 days (10/20/18 to 11/18/18) the 50th percentile of customer-perceived latency for processing loan application requests was 8.6s, which means that half of our customers waited longer than 8.6s to receive an application decision. While this is certainly faster than traditional credit applications, Affirm is dedicated to lowering this response time to under 3s in order to provide our customers with the best experience possible and impact more lives.

Production issue resolution

Production issues are an inevitable part of running any non-trivial online service. At Affirm, we value being able to detect and resolve issues as soon as possible in order to keep our customers happy with our financing options. During incidents, we keep status.affirm.com up-to-date with details pertaining to production issues and their expected resolution time. After incidents, we practice a blame-free post-mortem culture in order to best learn from our outages and understand the root cause of each issue.

Each root-cause analysis (RCA) document contains the following components: a description of the root cause, how long the issue took to detect, how long the issue took to resolve, the business impact (e.g. 25 customers faced errors in loan applications totaling $32,000), any required external follow-up, and action items that, once complete, will prevent the issue from recurring. We share each RCA internally with the engineering team and any other internal stakeholders for review. RCAs are also shared with our merchant partners for outages causing more than 10 minutes of service disruption.
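
As a rough illustration, the components above could be captured in a structured record along these lines; the field names and the merchant-reporting helper are assumptions drawn from the prose, not Affirm's internal post-mortem template.

```python
from dataclasses import dataclass, field
from datetime import timedelta

# Illustrative structure mirroring the RCA components listed above; the field
# names and helper are assumptions, not Affirm's actual post-mortem template.
@dataclass
class RootCauseAnalysis:
    root_cause: str
    time_to_detect: timedelta
    time_to_resolve: timedelta
    business_impact: str                                    # e.g. "25 customers, $32,000 in applications"
    external_followup: list = field(default_factory=list)   # required external follow-up, if any
    action_items: list = field(default_factory=list)        # once complete, prevent recurrence

    def requires_merchant_report(self) -> bool:
        # RCAs are shared with merchant partners for outages causing
        # more than 10 minutes of service disruption.
        return self.time_to_resolve > timedelta(minutes=10)
```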

The lessons gleaned from post-mortems are crucial for keeping product quality high and addressing systemic issues head-on. Frequently, the outcome of an RCA will include expanding our monitoring and alerting coverage to catch issues faster (we always want to know about issues before a customer or merchant partner escalates them to us), improving test coverage, or adding tooling or coding conventions to lessen the likelihood of breakage.

State of the Site meeting

In order to continually review production incidents and site health as measured through SLIs, we launched a weekly 30-minute cross-engineering meeting with technical leads from each of our six engineering departments (Consumer Product Engineering, Retail Product Engineering, Partner Engineering, Risk Engineering, Bank Engineering, and Platform Engineering) and the on-call engineer(s) for the week. Each engineering team hosts a Grafana dashboard to be covered in the weekly review, and all regular meeting attendees review any post-mortems sent out in the past week.

As pre-meeting preparation, attendees review each team's dashboard and note any anomalies by taking a screenshot of the anomalous graph, uploading it into the meeting agenda, and tagging the dashboard owner for an explanation. If there is no explanation, the anomaly becomes a topic for discussion and may lead to an action item to provide a root cause at the next week's meeting. Significant open questions from post-mortems are also discussed and can likewise lead to action items. The meeting is kept to 30 minutes in order to make the review as efficient as possible, which means that pre-meeting preparation and awareness of the state of the site are crucial for each attendee. Visibility through this meeting ensures that nothing falls through the cracks regarding site quality and engages the engineering team in making continual improvements for stability, all while fostering cross-team collaboration.

SLI Reports

Our Grafana + Elasticsearch setup is great for real-time visibility into our systems. However, this setup processes millions of metrics per day, making it operationally expensive and burdensome to maintain the same stack for historic reporting on our quality indicators. Historic reports are crucial for noting long-term trends and for reporting the quality of our service to merchant partners as part of our service-level agreements (SLAs). To ensure we could continue to provide these reports, we needed a different solution for historic reporting than the one we use for real-time monitoring and alerting.

As Elaine mentioned in the previous post, in addition to Elasticsearch we also make all of our metrics available in AWS Redshift through Spectrum. Since Spectrum's backing storage is AWS S3, we're able to retain all historic metrics at a low cost, making it well suited to managing historic reports. The downside is that, due to the number of S3 files that must be read in order to perform historic aggregations, Redshift queries for reports can take minutes to complete. That is unacceptable for rendering dashboards: we want anybody in the company to view historic trends quickly and without friction, and we want to avoid long-running ad-hoc queries against Redshift.

To make reporting queries more performant, we use dbt as a SQL-based ETL tool. We run dbt incrementally every day to materialize only the base events we care about for reporting on indicators. Additional dbt transformations then output daily aggregations for SLIs. This enables us to quickly render charts in Apache Superset, such as the chart below showing our median user-perceived application response time.

Charts like this in Superset make the long-term trends in our quality apparent and provide excellent internal accountability for improving it. The same underlying data is used for reporting availability and latency to our merchant partners.
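
As a rough Python analogue of what the daily dbt aggregation produces (the actual transformations are SQL models), the sketch below rolls raw API events up into one row per day and product with the reliability and latency SLIs used for reporting. The event fields and row shape are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Rough analogue of the daily SLI aggregation described above: roll raw API
# events up into one row per (day, product). Field names are illustrative
# assumptions, not the actual dbt model output.
def daily_sli_rollup(events):
    by_key = defaultdict(list)
    for event in events:
        by_key[(event["day"], event["product"])].append(event)

    rows = []
    for (day, product), group in by_key.items():
        durations = sorted(event["duration_ms"] for event in group)
        successes = sum(1 for event in group if event["success"])
        rows.append({
            "day": day,
            "product": product,
            "request_count": len(group),
            "reliability_pct": 100.0 * successes / len(group),
            "p50_ms": durations[len(durations) // 2],
        })
    return rows

events = [
    {"day": date(2018, 11, 1), "product": "checkout", "success": True, "duration_ms": 7900},
    {"day": date(2018, 11, 1), "product": "checkout", "success": True, "duration_ms": 8600},
]
print(daily_sli_rollup(events))
```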

Accountability: setting the quality bar

Launching the State of the Site meeting and building out our historic reporting pipeline with Spectrum and dbt constituted a big push for each engineering team to have sufficient coverage of SLIs and other correlated system metrics. With these indicators in place, the next step was to set standards for quality and make organization-wide commitments to meeting them. Affirm establishes accountability by setting internal and external service-level objectives (SLOs).

Internal SLOs

Inspired by the Google SRE Book chapter on Embracing Risk, Affirm sets internal service-level objectives (SLOs) on a per-product basis, and the engineering teams responsible for each product commit to meeting them. We set internal SLOs based on each product's maturity level: stability is critical for our most mature financial products, while newer products might require more rapid iteration at the expense of reliability. If we breach an internal SLO, the responsible team shifts its focus back to meeting quality objectives, prioritizing that work over shipping new features. In addition to keeping quality high, this system incentivizes engineers to test thoroughly and monitor features carefully in production so that they can continue focusing on new features without interruption.

External SLOs

Affirm has a standard externally-facing SLO that we use by default when communicating with new merchant partners. We have set objectives of 99.7% API reliability and a 98th-percentile application response time under 10s. Over the next year, as our platform team focuses on introducing full AWS regional failover, we intend to raise our availability objective to 99.95%.

We set internal SLOs more aggressively than external SLOs to ensure that we uphold external commitments. We might breach internal SLOs and then re-calibrate our team’s efforts to meet them again, but we must continually meet and exceed external expectations of our quality.
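
The relationship between the two tiers can be sketched as a simple check over a period's measured SLIs. The external targets below come from the numbers above; the stricter internal targets are hypothetical placeholders, since this post doesn't publish our internal thresholds.

```python
# External targets come from the SLO above (99.7% reliability, p98 under 10s);
# the stricter internal targets here are hypothetical placeholders.
EXTERNAL_SLO = {"reliability_pct": 99.7, "p98_seconds": 10.0}
INTERNAL_SLO = {"reliability_pct": 99.9, "p98_seconds": 8.0}   # assumed values

def slo_breaches(measured):
    """Return which objectives a period's measured SLIs breach, if any."""
    breaches = []
    for name, slo in (("internal", INTERNAL_SLO), ("external", EXTERNAL_SLO)):
        if (measured["reliability_pct"] < slo["reliability_pct"]
                or measured["p98_seconds"] > slo["p98_seconds"]):
            breaches.append(name)
    return breaches

# Breaching only the internal SLO shifts a team's focus back to quality work;
# breaching the external SLO would mean missing a commitment to merchant partners.
print(slo_breaches({"reliability_pct": 99.8, "p98_seconds": 9.0}))   # ['internal']
```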

Improvements: pushing the envelope

With accountability in the form of SLOs and SLAs in place, we'll cover some of the tangible ways Affirm has improved quality over the past year, including new feature moratoriums, planned re-architecture, capacity planning, and individual initiative. We'll use the above chart tracking our median application response timing since July as an illustration of our quality story.

New feature moratoriums

We completed v1 of our SLI reporting feature in July, which provided excellent visibility into how we were tracking against our SLOs. This visibility, along with a few outages resulting from increased traffic with new enterprise partners, prompted us to prioritize an engineering-wide burndown period to bolster the reliability of our systems in preparation for peak traffic during Q4 sales. Each engineering team devoted two weeks in August or September to bug fixes, latency improvements, and other initiatives to reduce system load. This burndown period is visible in the latency drops throughout August on the chart. The performance improvements came from short-term work to isolate certain APIs to different processes, reduce the number of database reads we do in the critical path of API requests, and optimize CPU time spent parsing credit reports and instantiating Python objects.

This burndown not only improved the bottom line for our product quality (a 12% improvement to median timings and a 28% improvement to 98th-percentile timings) but was also a lot of fun; each team came together to contribute in something of a performance hackathon, which did a lot for building a culture of taking pride in shipping performant software. With proper instrumentation and visibility, such new feature moratoriums can be great for team bonding and for getting quick quality improvements.
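
As a hedged illustration of one class of burndown fix, the sketch below caches a hot lookup in process memory so repeated database reads drop out of a request's critical path. The merchant-config lookup and its backing data are hypothetical, not Affirm's code.

```python
from functools import lru_cache

# Hypothetical stand-in for a read against a database replica.
_FAKE_DB = {"merchant-123": {"name": "Example Store", "loan_terms_months": [3, 6, 12]}}

def _read_from_replica(merchant_id):
    return _FAKE_DB[merchant_id]

@lru_cache(maxsize=1024)
def load_merchant_config(merchant_id):
    # Repeated calls for the same merchant within this process are served from
    # memory, removing database reads from the request's critical path.
    return _read_from_replica(merchant_id)

load_merchant_config("merchant-123")        # first call hits the "database"
load_merchant_config("merchant-123")        # second call is a cache hit
print(load_merchant_config.cache_info())    # hits=1, misses=1
```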

Planned Re-Architecture

Throughout September, you can see latencies steadily dropping as a result of continued focus on improving our underwriting performance, with a substantial drop at the beginning of October. With the burndown period behind us, these improvements came from our underwriting and frameworks teams within Risk Engineering prioritizing longer-term cleanup of technical debt. Notably, for the risk validation project (covered in a previous blog post by Artem Shnayder), our production decisioning systems for underwriting and fraud log all of the raw and derived data used to make each decision, for purposes of offline historic validation and machine-learning model training.

Prior to October, we did this logging in the critical path of user application requests. By observing the latency we spent logging this data (logs for a single underwriting request can be several megabytes), we determined that we could significantly improve user-perceived latency by performing the logging asynchronously to the user experience. This was a non-trivial change that required speccing and careful implementation to ensure that no valuable historic decisioning data was lost. The work was well worth it: we saw an improved user experience, progress toward SLOs, and continued historic validation of our risk decisions. Quality metrics are a great motivator for re-architecting systems. At Affirm, taking a user-centric lens to improving both latency and credit models gives us valuable parameters for decoupling systems in order to deliver the best financial value and user experience.
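
A minimal sketch of that shape of change is below: the request handler hands the decisioning payload to a background worker and returns immediately, instead of blocking on a multi-megabyte log write. The queue-and-thread setup and function names are illustrative assumptions; the post doesn't describe which task system Affirm actually uses.

```python
import queue
import threading

# Illustrative only: decisioning logs are handed to a background worker instead
# of being written inside the user-facing request. The queue/thread setup and
# names are assumptions, not Affirm's actual task infrastructure.
_log_queue = queue.Queue()

def persist_decision_log(payload):
    # Stand-in for durably writing the raw and derived decisioning data.
    print(f"persisted decision log with {len(payload)} fields")

def _log_worker():
    while True:
        payload = _log_queue.get()
        persist_decision_log(payload)    # the durable write happens off the request path
        _log_queue.task_done()

threading.Thread(target=_log_worker, daemon=True).start()

def handle_application_request(application):
    decision = "approved"                                  # placeholder decision
    decision_data = {"raw": application, "derived": {}}    # can be several megabytes in practice
    _log_queue.put(decision_data)                          # enqueue instead of blocking the user
    return decision                                        # the user gets a decision without waiting on the log write

print(handle_application_request({"amount_usd": 500}))
_log_queue.join()    # demo only: wait for the background write before the script exits
```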

Capacity Planning

Coming from a more infrastructure-centric perspective, capacity planning for key enterprise launches and holiday sales ensures that we have the appropriate number of web servers, task workers, and database replicas to handle additional volume without degrading quality. In some cases, capacity planning also greatly improves the performance of our product as it stands. The last two drops in November on the median application response timing chart came from prioritizing autoscaling for our web servers and workers, respectively. This project laid the foundation for easily spinning capacity up and down based on volume needs, and it also included an upgrade from C3 to C5 EC2 instances, leading to a 20% improvement in median performance just in time for Black Friday.
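
As a toy illustration of the kind of rule autoscaling encodes, the function below picks a worker count from observed request volume and a per-worker capacity target. The numbers and function are hypothetical; Affirm's actual scaling policies aren't described in this post.

```python
import math

# Hypothetical scaling rule: size the worker fleet from observed request volume
# and an assumed per-worker capacity target.
def desired_worker_count(requests_per_minute, target_rpm_per_worker=300.0, minimum=4):
    return max(minimum, math.ceil(requests_per_minute / target_rpm_per_worker))

print(desired_worker_count(5_000))   # scale up ahead of a Black Friday-sized spike -> 17
print(desired_worker_count(600))     # scale back down in quiet hours -> 4 (the floor)
```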

Individual Initiative

In January, shortly after we started the State of the Site meeting, several engineers were motivated by the increased visibility into Affirm's product quality. In a matter of four weeks, our error rates dropped by more than 80% as the team took action on several long-standing bugs that had fallen through the cracks over time. These bug fixes came from engineers feeling a sense of individual responsibility for improving quality, rather than from managers' instructions or from reactions to customer complaints. This sense of ownership is essential to our culture and to delivering reliable products to our customers. Affirm encourages and recognizes engineers who think creatively and go above and beyond their immediate job responsibilities.

Conclusion

In today’s post, we covered how Affirm uses metrics, post-mortems, and service-level objectives as cornerstones in creating a culture of accountability and responsibility for improving the user-facing quality of our financial products. These strategies have had a meaningful impact on improving user experience as we take on more consumers than ever before, on setting standards for acceptable engineering quality, and on growing the careers of our engineers.

If this sounds like the right challenge and environment for you, please check out our careers site to learn more about working at Affirm!
