ShapML.shap (Function)
shap(explain::DataFrame,
     reference::Union{DataFrame, Nothing} = nothing,
     model,
     predict_function::Function,
     target_features::Union{Vector, Nothing} = nothing,
     sample_size::Integer = 60,
     parallel::Symbol = [:none, :samples, :features, :both],
     seed::Integer = 1,
     precision::Union{Integer, Nothing} = nothing,
     chunk::Bool = true,
     reconcile_instance::Bool = false
     )

Compute stochastic feature-level Shapley values for any ML model.

Arguments

  • explain::DataFrame: A DataFrame of model features with 1 or more instances to be explained using Shapley values.
  • reference: Optional. A DataFrame with the same format as explain that serves as the reference group. Shapley values for explain are reported as deviations from the average model prediction on this group (i.e., the model intercept).
  • model: A trained ML model that is passed into predict_function.
  • predict_function: A wrapper function that takes 2 required positional arguments: (1) the trained model from model and (2) a DataFrame of instances with the same format as explain. The function should return a 1-column DataFrame of model predictions; the column name does not matter.
  • target_features: Optional. An Array{String, 1} of model features that is a subset of feature names in explain for which Shapley values will be computed. For high-dimensional models, selecting a subset of features may dramatically speed up computation time. The default behavior is to return Shapley values for all instances and features in explain.
  • sample_size::Integer: The number of Monte Carlo samples used to compute the stochastic Shapley values for each feature.
  • parallel::Union{Symbol, Nothing}: One of [:none, :samples, :features, :both]. Whether to run the calculation serially (:none), in parallel over Monte Carlo samples with pmap() (:samples), multi-threaded over target features with @threads (:features), or both at once (:both).
  • seed::Integer: A number passed to Random.seed!() to get reproducible results.
  • precision::Union{Integer, Nothing}: The number of digits passed to round() for results in the output (to reduce the size of the returned DataFrame).
  • chunk::Bool: Default true. Increases speed on data with many instances and/or features. Calls the predict() function once per sample in sample_size instead of once per call to ShapML.shap().
  • reconcile_instance: EXPERIMENTAL. For each instance in explain, the stochastic feature-level Shapley values are adjusted so that their sum equals the model prediction. The adjustments are based on feature-level sampling variances and are typically small compared to the model prediction.
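
To illustrate how model and predict_function fit together, here is a minimal sketch. It assumes the arguments are passed by keyword (e.g., explain = explain). The "model" is a hypothetical NamedTuple of linear coefficients rather than a real trained model, and the field names b_x1 and b_x2, the column names, and the toy data are all made up for the example.

```julia
using DataFrames
using ShapML

# Hypothetical "trained model": the coefficients of a linear function.
model = (intercept = 1.0, b_x1 = 2.0, b_x2 = -3.0)

# Wrapper with the two required positional arguments: the model and a
# DataFrame of instances. Returns a 1-column DataFrame of predictions.
function predict_function(model, data)
    y_pred = model.intercept .+ model.b_x1 .* data.x1 .+ model.b_x2 .* data.x2
    return DataFrame(y_pred = y_pred)
end

# Toy data: instances to explain and a reference group in the same format.
explain = DataFrame(x1 = [1.0, 2.0, 3.0], x2 = [0.5, 0.0, -0.5])
reference = DataFrame(x1 = [0.0, 1.0, 2.0, 3.0], x2 = [1.0, 0.0, -1.0, 0.0])

# Assumption: arguments are supplied by keyword.
data_shap = ShapML.shap(explain = explain,
                        reference = reference,
                        model = model,
                        predict_function = predict_function,
                        sample_size = 60,
                        seed = 1)
```

Because the wrapper looks up columns by name, it does not depend on the order of the columns in the DataFrame it receives.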

Return

  • A DataFrame with size(explain, 1) * length(target_features) rows and 6 columns.
    • index: An instance in explain.
    • feature_name: Model feature.
    • feature_value: Feature value.
    • shap_effect: The average Shapley value across Monte Carlo samples.
    • shap_effect_sd: The standard deviation of Shapley values across Monte Carlo samples.
    • intercept: The average model prediction from explain or reference.
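
The relationship between shap_effect, shap_effect_sd, and intercept can be sketched with a generic permutation-sampling estimator. This is not ShapML's implementation; it is a simplified illustration of one common way to approximate Shapley values by Monte Carlo sampling, written against plain matrices. The function mc_shapley and the linear predict stand-in are hypothetical names introduced only for this example.

```julia
using Random
using Statistics

# Hypothetical stand-in for a trained model: a linear function of 3 features.
# Takes a matrix with one instance per row and returns a vector of predictions.
predict(X::AbstractMatrix) = 1.0 .+ X * [2.0, -3.0, 0.5]

# Estimate the Shapley value of feature `j` for one instance `x`.
# Each Monte Carlo sample draws a random reference row and a random feature
# ordering, then measures how the prediction changes when feature `j`
# switches from the reference value to the value in `x`.
function mc_shapley(predict, x::Vector{Float64}, reference::Matrix{Float64}, j::Int;
                    sample_size::Int = 60, seed::Int = 1)
    rng = MersenneTwister(seed)
    p = length(x)
    contributions = Vector{Float64}(undef, sample_size)
    for s in 1:sample_size
        z = reference[rand(rng, 1:size(reference, 1)), :]  # random reference row
        order = randperm(rng, p)                           # random feature ordering
        pos = findfirst(==(j), order)
        before = order[1:pos-1]        # features that precede j in the ordering
        without_j = copy(z)
        without_j[before] = x[before]  # preceding features come from x
        with_j = copy(without_j)
        with_j[j] = x[j]               # additionally take feature j from x
        contributions[s] = predict(reshape(with_j, 1, :))[1] -
                           predict(reshape(without_j, 1, :))[1]
    end
    return (shap_effect = mean(contributions), shap_effect_sd = std(contributions))
end

x = [1.0, 2.0, 4.0]                   # instance to explain
reference = zeros(5, 3)               # toy reference group
intercept = mean(predict(reference))  # average prediction on the reference
effects = [mc_shapley(predict, x, reference, j) for j in 1:3]
```

In this toy setup the model is linear and every reference row is identical, so each sample yields the same contribution and the intercept plus the three effects reproduces the prediction for x. With a real model and a varied reference group, the effects are noisy estimates and shap_effect_sd reflects the spread across samples.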