Exploration for Machine Learning at Affirm

Ben Taborsky
Published in Affirm Tech Blog
Jul 7, 2020 · 9 min read


TL;DR: At Affirm, we can’t know if someone would have repaid a loan we didn’t make. This makes it tricky, but not impossible, to evaluate new credit models.

Imagine that, as an Affirm Applied Machine Learning Scientist, you’ve been asked to train the next generation of one of our credit models. These are the machine learning models that power the real-time approve/decline decisions Affirm makes tens of thousands of times a day. If you’re anything like me, your first reaction will be to get out an envelope, turn it over, and do some math to figure out how many applications your model will have to handle, and just how much of a financial catastrophe you might single-handedly cause. Then you will book a conference room where you can privately panic for a while.

Calm and collected again, you set out on the standard machine learning workflow:

  • Step 1 — Naming

Training a credit model is just like parenting; first you name it, then you train it. You settle on Credit Model v2. Snazzy and memorable.

  • Step 2 — Get Your Data

At Affirm, we make installment loans, and our credit model’s goal is to estimate how likely any applicant is to pay their loans back. Your data set is the history of the Affirm portfolio. Your features are the borrower’s credit-related data that Affirm would have had access to at the time of the decision. This includes information from the credit report, information about the transaction, and how the borrower managed any previous Affirm loans. Your labels are a series of binary values: for each month that a loan was outstanding, did the borrower satisfy their obligation?
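
To make the shape of that data concrete, here is a tiny sketch. The column names and values are purely illustrative, not Affirm’s actual schema:

    import pandas as pd

    # Decision-time features: one row per application (columns are made up).
    features = pd.DataFrame({
        "application_id": [1, 2, 3],
        "fico": [720, 650, 580],                  # from the credit report
        "order_amount": [250.0, 900.0, 400.0],    # about the transaction
        "prior_affirm_loans_repaid": [2, 0, 1],   # history with Affirm
    })

    # Labels: one row per (loan, month outstanding), with a binary outcome.
    labels = pd.DataFrame({
        "application_id": [1, 1, 1, 2, 2],
        "month_on_book": [1, 2, 3, 1, 2],
        "met_obligation": [1, 1, 1, 1, 0],  # satisfied that month's obligation?
    })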

  • Step 3 — Train Your Model

Luckily, we solved (to our satisfaction, at least) the problem of training on our sequential monthly labels for Credit Model v1 [Note: this is a pretty interesting problem that we will not cover here. Maybe in a future post!]. This time you just need to split your data into a training set and a test set, and press Go. Then take a break. Play with the cat. Grab your mask and have a socially distanced walk in the sun while the computer works hard.

  • Step 4 — Evaluation

Now the moment of truth has arrived. For Affirm, the thing that really matters is the performance of the loan portfolio, which feeds straight through to the bottom line. If a loan your model approves is repaid — great, we earned a little bit of interest. If it defaults, then we lose the principal. So while the standard model evaluation metrics like precision, recall, F1 score, and AUC are useful, the gold standard metric is just the portfolio default rate at a given approval rate. In other words, your new model is better than the old model if it approves the same number of loans but fewer of them default.
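
In code, that comparison might look something like the sketch below, which assumes (for now, optimistically) that every application has both a score and an observed outcome:

    import numpy as np

    def default_rate_at_approval_rate(scores, defaulted, approval_rate):
        """Approve the top `approval_rate` fraction of applications by score and
        return the default rate of the resulting portfolio. Illustrative sketch;
        it assumes every approved loan has an observed outcome."""
        scores = np.asarray(scores)
        defaulted = np.asarray(defaulted)
        n_approved = int(round(approval_rate * len(scores)))
        approved = np.argsort(scores)[::-1][:n_approved]  # highest scores first
        return defaulted[approved].mean()

    # The new model wins if, at the same approval rate, its portfolio defaults less:
    #   default_rate_at_approval_rate(v2_scores, defaulted, 0.80)
    #     < default_rate_at_approval_rate(v1_scores, defaulted, 0.80)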

Aside: we try to keep approval rate changes and modeling changes separate, to make it easier to tease out the impacts of these different kinds of changes. You can reduce default rates by approving fewer (and therefore more creditworthy) applicants, or by sorting creditworthiness better while approving the same number.

So you load in all of the applications Affirm saw during your test set month and score them with your new model. You sort them and set a cutoff to match the approval rate of the old model: now you have your portfolio! Inspired by your earlier naming success, you call it New Portfolio. You run your get_backtest_default_rate function — but it throws an error!
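
The failure looks something like this. To be clear, get_backtest_default_rate here is a toy stand-in of our own, not Affirm’s real implementation:

    import numpy as np

    def get_backtest_default_rate(defaulted):
        """Toy stand-in: default rate of a portfolio, where `defaulted` is 1/0 for
        loans with observed outcomes and NaN for applications we never funded."""
        defaulted = np.asarray(defaulted, dtype=float)
        if np.isnan(defaulted).any():
            raise ValueError(
                f"{int(np.isnan(defaulted).sum())} loans in this portfolio have no "
                "label: Credit Model v1 declined them, so we never saw an outcome."
            )
        return defaulted.mean()

    # New Portfolio = top of the v2 ranking, cut off at v1's approval rate.
    # Plenty of its members were v1 declines, so their labels are NaN -> ValueError.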

  • Step 5 — Back to the Panic Room

Just kidding — but why is evaluation broken? You quickly realize that lots of the applications in your new portfolio don’t have labels: Credit Model v1 declined them, so we can’t know whether they would have repaid! You can’t calculate a default rate without labels. It would be convenient if we could just throw the applications with missing labels out — but we can’t. On the contrary, these are arguably the most important loans in the new portfolio. So, we need a new strategy. Let’s zoom out.

Figure 1 shows what Affirm’s application population looks like, if you count up the repayments and defaults and sort them by our internal, transaction-based credit score.

Figure 1: Most people repay, but not everyone!

Most people are far to the right, and they repay really well. This is good. People pay what they promise. Affirm couldn’t exist otherwise. Of course, sometimes some unexpected catastrophe causes a default or two. As you go further to the left, the applications get riskier and also sparser.

Our job is to do this sorting better. Ideally we’d be able to put all the defaults all the way to the left and all the repayments all the way to the right (of course, that’s impossible, because we can’t predict all unexpected catastrophes). Then, to create a healthy loan portfolio, we set a threshold above which we approve an application and below which we decline it. Finally we deploy the model, and it starts making decisions. What do we observe? Something like Figure 2.

Figure 2: We are blind below our cutoff.

This is a big problem. First of all, we have no idea what is going on below the cutoff. Maybe all these folks would have repaid! Maybe none of them would have! That’s not ideal.

Second, our training set for Credit Model v2 contains only loans that Credit Model v1 scored above the threshold, and this is hardly an unbiased sample. This effect will get worse over time, if we’re doing our jobs right, because better models mean fewer defaults. Eventually we could have a training set with so few defaults that we won’t be able to teach the model what an uncreditworthy loan looks like! At the very least we’ll have to do some serious weighting to combat this training-serving skew.

But the issue at hand is that we’d like to evaluate the quality of our New Portfolio in the test set. The situation is something like what we see in Figure 3. The old model has some set of approvals, whose labels we can observe, and some set of declines, which are mysteries. If we are trying to keep our overall approval rate the same, then we will decline some loans we used to approve (swap-outs), and approve some we used to decline (swap-ins). If the new model is an improvement, the swap-outs will perform worse than the swap-ins, and both groups will default more than the always-in group. (Incidentally, this is why we can’t simply ignore the missing labels and move on with our evaluation: they are not missing at random, and we know they will be worse than average.)
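
With the approval rate held fixed, the four groups in Figure 3 are easy to tabulate. A quick sketch, with illustrative names:

    import numpy as np

    def figure_3_groups(v1_approved, v2_approved):
        """Classify each test-set application into the Figure 3 groups, given
        boolean approve/decline decisions from the old and new models at the
        same overall approval rate."""
        v1 = np.asarray(v1_approved)
        v2 = np.asarray(v2_approved)
        return np.select(
            [v1 & v2, ~v1 & v2, v1 & ~v2],
            ["always-in", "swap-in", "swap-out"],
            default="always-out",
        )

    # Labels exist only where v1 approved, so every "swap-in" (and every
    # "always-out") is an unknown.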

Figure 3: How will Affirm’s portfolio change if we deploy a new model?

The resulting portfolio has a bunch of unknowns in it, and that’s the root of our problem. It’s also a problem for a lot of common metrics like AUC and F1 that so many of our v2 declines are unknowns. But that problem is less severe, because our profitability depends much more on the loans we make than on the ones we don’t.

Believe it or not, we aren’t the first lenders in history to try to evaluate a new model, nor is the problem limited to lending. In fact, it is well-known enough to have names, such as reject inference or off-policy evaluation. Since the issue’s been named, it shouldn’t be a surprise that there are classic approaches to solving it. For example, you can make a note of who you declined and see how many of those folks later exhibit financial distress or default on other loans. Unfortunately, building a bridge between Affirm loans and other loans is hard, though we have estimated that it’s probably easier than building an actual bridge.

Another possible solution is to build a calibrator to transform your new model’s score into some kind of interpretable default rate statistic, using score -> label information from the always-ins. Then you can use this calibrator to estimate the likelihood that a swap-in will default, using its new score (there’s a quick sketch of this idea just after the list below). This solution has a couple of problems in our view:

  1. While it works for estimating the default rate of our new portfolio, calculating the classic ML classification metrics will, as before, be tricky, because the calibrator gives us non-binary labels instead of observed outcomes.
  2. It’s not correct yet, because the swap-ins will be worse than the calibrator predicts. This is because the always-ins have two models that agree they belong in the portfolio, while the swap-ins have two models that disagree. We believe the new model is better than the old one, but we don’t believe that the old model has no value. Sorting this out is a tricky problem that gives us a headache.
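
For concreteness, here is roughly what that calibrator idea looks like, sketched with isotonic regression fit only on the always-ins. The function and variable names are illustrative, and the sketch assumes a higher score means a more creditworthy applicant:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def estimate_swap_in_default_rates(always_in_scores, always_in_defaulted,
                                       swap_in_scores):
        """Learn a monotone map from new-model score to observed default rate on
        the always-ins, then apply it to the swap-ins' scores."""
        calibrator = IsotonicRegression(increasing=False, out_of_bounds="clip")
        calibrator.fit(np.asarray(always_in_scores), np.asarray(always_in_defaulted))
        return calibrator.predict(np.asarray(swap_in_scores))

    # Problem 2 in action: this treats a swap-in exactly like an always-in with the
    # same v2 score, ignoring the fact that v1 disliked it, so it is optimistic.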

Because we are not big fans of these more classic approaches, we must be creative, and brave, and strap on our Indiana Jones exploration hats — because we are boldly going where few have gone before: under the threshold!

Figure 4: Deep-sea diving is our favorite exploration metaphor, because it’s risky and you explore a tiny space in order to make sweeping generalizations about ¾ of the planet. Image Credit: Annie Jensen

In other words, we change our approval logic so that, for some small portion of otherwise-declined users, there is some probability that we flip the decision and approve them anyway. For these ‘exploratory’ loans, we’ll actually observe the labels. And then we can let them stand in as representatives of the swap-ins (and the always-declines, too)! This will be great because it will save us from having to make arbitrary assumptions or do really hard math.

Now we just have to decide on the probability of decision-flipping. It’s important to do this efficiently, because the loans we are proposing to make are very risky and we expect to take large losses on them. Since the thing we really care about is the evaluation of our test-set portfolio, efficiency means that we should concentrate our exploration budget on loans that have the highest chance of getting swapped in by a future model. But we don’t want to completely ignore applications with a low probability of being swapped in, either. As long as the probability isn’t zero, and it never is, some loans from that region will be swapped in.

So we make our ‘exploration curve’ look sort of like Figure 5.

Figure 5: Probability of approval, given model score. Folks above the threshold are always approved; folks below it are approved with a probability that shrinks as their score falls further below the threshold.
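
One possible functional form with that shape is sketched below. Both the shape and the numbers (the 1% floor, the decay rate) are placeholders for illustration, not Affirm’s actual curve:

    import numpy as np

    def approval_probability(score, threshold, floor=0.01, decay=20.0):
        """Figure 5 in code: always approve above the threshold; below it, approve
        with a probability that starts at `floor` just under the threshold and
        decays as the score drops further below it."""
        score = np.asarray(score, dtype=float)
        below = floor * np.exp(-decay * (threshold - score))
        return np.where(score >= threshold, 1.0, below)

    def decide(scores, threshold, rng=None):
        """Flip a coin with the exploration probability to get the final decision."""
        rng = rng or np.random.default_rng()
        p = approval_probability(scores, threshold)
        return rng.random(p.shape) < p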

Having settled on a shape, we now have to argue for the level of that bottom curve. Should it come up to 0.5%, 1%, 5%? This depends on how much variance the company would like to accept in its projections, and how accurately it would like the Applied ML team to select and deploy the best models. As always, everything is a tradeoff.

Figure 6: More exploration (left) means we have a more precise idea of how much default content there is for a given approval rate than if we don’t explore very much (right). But that precision is expensive.

We have one more step — but I promise, it’s the last step.

We said we’d let these relatively few exploratory loans stand in for all of the unlabeled swap-ins. But if we only include them as individual loans in our evaluation metrics, they’ll stand in only for themselves and we’ll undercount swap-ins relative to always-ins. This is because each exploratory loan had only a small probability of being approved, so there must be many more applications just like it out there that might get swapped in too! The solution is to weight the exploratory loans according to how many swap-ins they stand in for. There are lots of ways you can calculate that weight, but we’ve already got a function for the probability that a given loan got approved. All we need to do is weight each loan by the inverse of its approval probability when we’re counting up the new portfolio’s repayments and defaults.
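
Putting the last two pieces together, that weighted count might look like the sketch below. The inputs cover only the loans in the new portfolio whose outcomes we actually observed (always-ins plus exploratory approvals), and the names are illustrative:

    import numpy as np

    def weighted_default_rate(defaulted, approval_prob):
        """Default-rate estimate for the new portfolio, weighting each observed
        loan by 1 / P(we approved it). Always-ins get weight 1; an exploratory
        loan approved with probability 0.01 stands in for ~100 similar ones."""
        defaulted = np.asarray(defaulted, dtype=float)
        weights = 1.0 / np.asarray(approval_prob, dtype=float)
        return np.sum(weights * defaulted) / np.sum(weights)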

And we’re done! Our trip to the sub-threshold wilds means we can evaluate our model with confidence! At least, we can for a while. Because sometimes explorers encounter monsters. The monster that ambushed us goes by the name of variance. Next time: how we defeated it.
