Purpose

  • The goal of this vignette is to demonstrate how, for the same boosted tree prediction model, the stochastic Shapley values from ShapML compare with the non-stochastic, tree-based Shapley values from the Python shap package, using the port of that algorithm implemented in catboost.

  • While shap provides the preferred Shapley value algorithm when modeling with boosted trees, this vignette demonstrates that the sampling-based ShapML implementation (the estimator is sketched below) returns nearly identical results while working with any class of ML model.
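
  • For context, ShapML's sampling-based algorithm is, roughly, the Monte Carlo estimator of Štrumbelj and Kononenko (2014). With M Monte Carlo samples per feature (the sample_size argument used later in this vignette), the estimate for feature j is

\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \left[ f\left(x_{+j}^{m}\right) - f\left(x_{-j}^{m}\right) \right]

where the hybrid instances x_{+j}^{m} and x_{-j}^{m} are built from the instance being explained and a randomly drawn reference instance under a random feature permutation, differing only in whether feature j comes from the explained instance or the reference.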

Setup

  • Because the tree-based Shapley value algorithm is not currently available in Julia, we’ll use catboost’s R package, which ports shap’s tree-based algorithm through catboost.get_feature_importance().

  • Outline of the comparison:

    1. R: Train the ML model.
    2. R: Calculate the tree-based Shapley values.
    3. R: Write the predict() wrapper that works with the trained model.
    4. Julia: Using RCall, convert the trained model and predict() function into Julia objects.
    5. Julia: Calculate the stochastic Shapley values, passing in the objects from step 4.
    6. R: Compare the results.

Comparison

Load Packages

R

  • Allow rmarkdown to pass Julia and R objects between code blocks.
library(JuliaCall)

# Setting the Julia path first in the R environment points JuliaCall to the right Julia .dll files.
Sys.setenv(PATH = paste("C:/Users/nredell/AppData/Local/Julia-1.3.1/bin", Sys.getenv("PATH"), sep = ";"))

JuliaCall::julia_setup()

library(dplyr)
library(tidyr)
library(ggplot2)
library(shapFlex)
library(devtools)

if (!"catboost" %in% installed.packages()[, "Package"]) {
  # Install catboost which is not available on CRAN (Windows link below).
  devtools::install_url('https://github.com/catboost/catboost/releases/download/v0.20/catboost-R-Windows-0.20.tgz',
                        INSTALL_opts = c("--no-multiarch"))
}

library(catboost)  # version 0.20

Julia

using ShapML
using Random
using DataFrames
using RCall

Load Data in R

data("data_adult", package = "shapFlex")
data <- data_adult

outcome_name <- "income"  # A binary outcome.
outcome_col <- which(names(data) == outcome_name)

Train ML Model in R

  • The accuracy of the model isn’t critical here because we’re interested in comparing Shapley values across algorithms: stochastic in Julia vs. tree-based in R.
cat_features <- which(unlist(Map(is.factor, data[, -outcome_col]))) - 1  # catboost indices are 0-based.

data_pool <- catboost.load_pool(data = data[, -outcome_col],
                                label = as.vector(as.numeric(data[, outcome_col])) - 1,  # Factor outcome to 0/1.
                                cat_features = cat_features)

set.seed(224)
model_catboost <- catboost.train(data_pool, NULL,
                                 params = list(loss_function = 'CrossEntropy',
                                               iterations = 30, logging_level = "Silent"))

Shapley Algorithms

  • We’ll explain the same 300 instances with each algorithm.

Tree-based Shapley values in R

data_pool <- catboost.load_pool(data = data[1:300, -outcome_col],
                                label = as.vector(as.numeric(data[1:300, outcome_col])) - 1,
                                cat_features = cat_features)

data_shap_tree <- catboost.get_feature_importance(model_catboost, pool = data_pool,
                                                  type = "ShapValues")

data_shap_tree <- data.frame(data_shap_tree[, -ncol(data_shap_tree)])  # Drop the last column, which holds the baseline/expected value.

data_shap_tree$index <- 1:nrow(data_shap_tree)

data_shap_tree <- tidyr::gather(data_shap_tree, key = "feature_name",
                                value = "shap_effect_catboost", -index)
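
  • As an optional sanity check, Shapley values are additive: each row of the raw ShapValues matrix, including its final expected-value column, should sum to the model’s raw log-odds prediction. A minimal sketch, relying on catboost.predict’s default raw output:

shap_matrix <- catboost.get_feature_importance(model_catboost, pool = data_pool,
                                               type = "ShapValues")
raw_preds <- catboost.predict(model_catboost, data_pool)  # prediction_type defaults to 'RawFormulaVal'.
all.equal(unname(rowSums(shap_matrix)), unname(raw_preds))  # TRUE, up to floating-point tolerance.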

Stochastic Shapley values in Julia

Predict function in R

  • For ShapML, the required user-defined prediction function takes two positional arguments, the trained model and a DataFrame of features, and returns a 1-column DataFrame of model predictions.
predict_function <- function(model, data) {

  data_pool <- catboost.load_pool(data = data, cat_features = cat_features)  # cat_features is captured from the global environment.

  # Predictions and Shapley explanations will be in log-odds space.
  data_pred <- data.frame("y_pred" = catboost.predict(model, data_pool))

  return(data_pred)
}
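  • A quick check of the wrapper before handing it to Julia (a sketch; the exact values depend on the trained model):

predict_function(model_catboost, data[1:5, -outcome_col])  # A 5-row, 1-column data.frame of log-odds named "y_pred".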
  • In Julia, convert the input data, the trained model, and the predict() function into Julia objects.
data = RCall.reval("data")
data = convert(DataFrame, data)

outcome_name = RCall.reval("outcome_name")
outcome_name = convert(String, outcome_name)

model_catboost = RCall.reval("model_catboost")

predict_function = RCall.reval("predict_function")
predict_function = convert(Function, predict_function)

ShapML.shap

explain = copy(data[1:300, :])  # Explain the first 300 instances.
explain = select(explain, Not(Symbol(outcome_name)))  # Remove the outcome column.

reference = copy(data)  # An optional dataset for computing the intercept/baseline prediction.
reference = select(reference, Not(Symbol(outcome_name)))  # Remove the outcome column.

Random.seed!(224)
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model_catboost,
                        predict_function = predict_function,
                        sample_size = 100  # Number of Monte Carlo samples.
                        )
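
  • data_shap is a long-format DataFrame with one row per instance-feature pair; the columns used below are index, feature_name, feature_value, and shap_effect (ShapML also reports a Monte Carlo standard error, shap_effect_sd, and the baseline intercept).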

Results

  • For 10 of the 13 model features, the correlation between the stochastic and tree-based Shapley values was at or above .99; the remaining three features were all above .92.
data_shap <- JuliaCall::julia_eval("data_shap")  # Pass from Julia to R.
data_shap$feature_value <- NULL  # Drop the feature values; not needed for the comparison.
data_all <- dplyr::inner_join(data_shap, data_shap_tree, by = c("index", "feature_name"))
data_cor <- data_all %>%
  dplyr::group_by(feature_name) %>%
  dplyr::summarise("cor_coef" = round(cor(shap_effect, shap_effect_catboost), 3))

data_cor
## # A tibble: 13 x 2
##    feature_name   cor_coef
##    <chr>             <dbl>
##  1 age               0.994
##  2 capital_gain      0.997
##  3 capital_loss      0.983
##  4 education         0.991
##  5 education_num     0.998
##  6 hours_per_week    0.993
##  7 marital_status    0.99
##  8 native_country    0.99
##  9 occupation        0.996
## 10 race              0.998
## 11 relationship      0.991
## 12 sex               0.924
## 13 workclass         0.975
p <- ggplot(data_all, aes(shap_effect_catboost, shap_effect))
p <- p + geom_point(alpha = .25)
p <- p + geom_abline(color = "red")
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab("catboost tree-based Shapley values") + ylab("ShapML stochastic Shapley values") +
  theme(axis.title = element_text(face = "bold"))
p
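
  • Because geom_abline() defaults to the identity line y = x, points that hug the red line in each facet indicate close agreement between the two algorithms, consistent with the correlations above.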