Affirm at @Scale 2019

Taylor Law
Published in Affirm Tech Blog
1 min read · Jul 1, 2019


As infrastructure grows, it’s critical to have observability into the performance and reliability of the system in order to identify current issues and potential future bottlenecks. In this talk, Elaine Arbaugh, Senior Software Engineer at Affirm, discusses how Affirm’s custom metrics, monitoring, and alerting systems work, how we’ve scaled them as our traffic and engineering team have grown rapidly, and which scaling-related issues they have helped us identify. She also discusses the instrumentation we’ve added around SQL queries, which has helped identify several issues that were causing excessive load on our MySQL databases, as well as the tooling we’ve added to help devs optimize their queries.

Elaine goes into detail about the specific database-level and machine-level issues Affirm has faced, and how detection, diagnosis, escalation, and resolution were handled. Finally, she briefly discusses the processes Affirm has around site reliability, including a weekly “State of the Site” meeting to review issues and anomalies from the past week, a strong postmortem culture, and on-call practices.

Systems @Scale is an invite-only technical conference. You can view Elaine’s talk alongside other presentations here.
