Shparkley: Scaling Shapley values with Spark for Interpreting Machine Learning Models

Niloy Gupta
Published in the Affirm Tech Blog
May 28, 2020


As a part of Affirm’s mission to deliver honest financial products that improve lives, we invest in research and engineering efforts to develop machine learning models that are interpretable.

While model performance metrics such as area under the ROC curve (AUC) and mean cross entropy (MXE) provide insight into how well a model is performing in terms of prediction accuracy, they fail to provide any intuition into what drives the prediction for a given feature vector. That insight is crucial, because it helps us interpret and debug our models. Why was this particular loan application denied? Why was this user classified as a fraudulent actor? Answering these questions helps us build high-fidelity machine learning models that are robust, comprehensible, and accurate.

One popular methodology for explaining model predictions is Shapley Values. The idea has its roots in cooperative game theory. Each predictor in the feature vector is considered a “player” in a game where the model prediction is the payout. Shapley values inform us how to fairly distribute the payout among the predictors.

An attribution method is considered to have a “fair payout” only if it satisfies the axioms of efficiency, symmetry, dummy, and additivity. The Shapley value is the only attribution method that satisfies all four.

Consider the efficiency axiom: Shapley values ensure that the difference between a prediction and the average prediction is fairly distributed among the feature values of the instance. In other words, the feature contributions must add up to the difference between the prediction for a data point x and the average prediction over the entire dataset X.
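For a linear model the Shapley values have a closed form, which makes the efficiency axiom easy to check numerically. The sketch below is our own toy example (the data, weights, and instance are made up for illustration, and this is not Shparkley code):

```python
import numpy as np

# Toy check of the efficiency axiom. For a linear model f(x) = w.x + b
# the exact Shapley value of feature j is w_j * (x_j - mean(X_j)).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))              # stand-in "training" dataset
w, b = np.array([0.5, -1.2, 2.0, 0.3]), 0.1
f = lambda v: v @ w + b                     # the "model"
x = np.array([1.0, 0.5, -0.5, 2.0])         # instance to explain

shapley = w * (x - X.mean(axis=0))          # exact values for a linear model

# Efficiency: contributions sum to prediction minus average prediction.
assert np.isclose(shapley.sum(), f(x) - f(X).mean())
```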

Shapley values also allow for contrastive explanations. We can compare Shapley values for a single data point against those for another data point.

While Shapley values are a powerful tool for generating model prediction explanations, they are computationally expensive to calculate, as the complexity scales with the number of data points and features: an exact computation for a single feature requires evaluating 2^k possible coalitions of feature values, where k is the number of features. Optimized algorithms such as TreeShap offer better performance, but they make assumptions about the model architecture. Popular open-source packages such as shap also have limitations: the package uses a brute-force algorithm that is expensive to scale and does not support training sample weights. To scale Shapley value computation to large datasets, many features, and arbitrary model architectures, we implemented the Shapley algorithm with a Monte Carlo approximation using Spark.

Fig 1. Consider a toy example: the aim is to find the importance of FICO in Alex’s loan application. To do so, we sample a random data point (Sam’s loan application), then permute and intersect the two feature vectors to create two new vectors: one with Alex’s FICO score and one without.

Given a data point x and a feature j for which we want to compute the Shapley value:

For all m = 1,…,M:

  • Choose a random instance z from the training matrix, along with its sample weight w.
  • Choose a random permutation of the feature values.
  • Order instances x and z according to the permutation:

x_o = (x_(1), …, x_(j), …, x_(k)) and z_o = (z_(1), …, z_(j), …, z_(k))

  • Construct two new feature vectors from x_o and z_o, one with feature j and one without:

x_(+j) = (x_(1), …, x_(j-1), x_(j), z_(j+1), …, z_(k))

x_(-j) = (x_(1), …, x_(j-1), z_(j), z_(j+1), …, z_(k))

  • Compute the marginal contribution: w · (f(x_(+j)) - f(x_(-j))), where f is the machine learning model.

The Shapley value is the weighted average of the marginal contributions across the M iterations. The number of iterations M controls the variance of the estimate: a larger M brings the estimate closer to the true Shapley value, but also increases computation time.
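The loop above can be sketched in plain NumPy for a single feature. This is a minimal serial sketch of the sampling scheme, not Shparkley's API; the function name and defaults are ours:

```python
import numpy as np

def mc_shapley_single_feature(f, x, X, j, M=2000, weights=None, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for
    instance x, following the sampling scheme described above."""
    rng = np.random.default_rng(seed)
    k = len(x)
    weights = np.ones(len(X)) if weights is None else weights
    total, weight_sum = 0.0, 0.0
    for _ in range(M):
        idx = rng.integers(len(X))        # random instance z and its weight w
        z, w_m = X[idx], weights[idx]
        perm = rng.permutation(k)         # random feature order
        pos = int(np.where(perm == j)[0][0])
        before = perm[:pos]               # features preceding j in the order
        x_minus = z.copy()
        x_minus[before] = x[before]       # x's values before j, z's from j on
        x_plus = x_minus.copy()
        x_plus[j] = x[j]                  # identical except feature j itself
        total += w_m * (f(x_plus) - f(x_minus))
        weight_sum += w_m
    return total / weight_sum             # weighted average contribution
```

With a linear model this estimate converges to w_j · (x_j − mean(X_j)), which gives a quick sanity check on the implementation.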

Fig 2. We parallelize the feature vector creation using Spark. The feature vectors are ordered based on a random permutation. Two vectors are created, one which has the feature in question, while the other doesn’t.

Shparkley optimizes the above algorithm by computing the marginal contributions for all features within the same iteration, reusing the permuted feature vector. We parallelize the iterations using Spark, so that the marginal contribution for each permuted random sample is computed in parallel. The machine learning model is broadcast to each executor. The Spark implementation returns the Shapley values for all features of a given data point, and it can be scaled further to handle batches of data points by increasing the number of machines/cores.
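The reuse trick can be sketched by walking the permutation once, switching z's values over to x's one feature at a time; each switch yields one feature's marginal contribution, so all k features share a single permuted sample. Below is our own serial sketch of that idea (not the Shparkley implementation; in Shparkley the outer loop is what Spark distributes across executors):

```python
import numpy as np

def mc_shapley_all_features(f, x, X, M=1000, seed=0):
    """Estimate Shapley values for ALL features at once by reusing
    one permuted sample per iteration, as described above."""
    rng = np.random.default_rng(seed)
    k = len(x)
    phi = np.zeros(k)
    for _ in range(M):
        z = X[rng.integers(len(X))]   # random background instance
        perm = rng.permutation(k)     # one permutation shared by all features
        current = z.copy()
        prev = f(current)
        for j in perm:
            current[j] = x[j]         # switch feature j from z's value to x's
            pred = f(current)
            phi[j] += pred - prev     # marginal contribution of feature j
            prev = pred
    return phi / M
```

Walking the permutation this way needs k + 1 model calls per sample instead of 2k, and the per-sample contributions telescope to f(x) − f(z), so the efficiency axiom holds for every iteration.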

Fig 3. The marginal contribution is computed by running the machine learning model on the feature vectors.

Fig 4. The Shapley value is computed by taking the mean of the individual marginal contributions.

This has enabled us to investigate model explanations for our custom in-house models and has improved our research productivity. We have open-sourced our implementation for the benefit of the machine learning community.

Fig 5. Compared to shap, Shparkley provides a runtime improvement of 50–60x with minimal difference in Shapley values.

If you are interested in learning more about Affirm engineering, check out our other blog posts.

Acknowledgments:

This project had a number of key contributors we would like to acknowledge: Adam Johnston, Cristine Dewar, Isaac Joseph, Niloy Gupta, and Xiang Huang.
