Hello, I work in an HPC support team. Below I summarize a challenge we received from a user who asked for our help through the ticket system.
Context:
Life scientists are interested in survival data science models. Unlike standard ML models that perform regression or classification, survival ML models return a survival function. The technology used in this ticket is scikit-survival 0.21.0.
After training and testing survival ML models, if a model is accurate enough, the life scientist wants to explain it. They use SurvivalShap(t).
Challenge:
The SurvivalShap(t) algorithm is killed due to lack of memory (out-of-memory).
Diagnostic:
A critical line of code in the SurvivalShap Python framework is this one:
simplified_inputs = [list(z) for z in itertools.product(range(2), repeat=p)]
The memory and time complexity of itertools.product(range(2), repeat=n) is O(2^n), where n is the number of variables. The user has 387 variables, so the result is impossible to store or compute, and the process crashes.
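A small sketch of why this line blows up: for small p the enumeration of binary coalition vectors is cheap, but the count doubles with every variable, and at p = 387 the number of coalitions has over a hundred decimal digits.

```python
import itertools

# Small p: full enumeration of binary coalition vectors is cheap.
p = 3
simplified_inputs = [list(z) for z in itertools.product(range(2), repeat=p)]
assert len(simplified_inputs) == 2 ** p  # 8 coalitions for 3 variables

# The user's case: p = 387. The list would need 2**387 entries.
print(len(str(2 ** 387)))  # 117 -- decimal digits of 2**387, far beyond any RAM
```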
Proposed solutions:
Reducing the number of variables
Github Pull Request:
I recently proposed a new feature to the developers of SurvivalShap. This feature samples the computations to achieve results comparable to the exact brute-force approach.
The life scientist was able to reduce the number of variables, and she proceeded that way.
I proposed an update for approximating the SurvivalShap algorithm, and reported to scikit-survival that SRForest predictions do not scale as n_jobs increases.
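The sampling idea can be sketched as follows. This is an illustrative sketch only, not the actual pull-request code: the function name sample_coalitions and the uniform sampling scheme are my own for illustration. The point is that drawing a fixed number of random coalition vectors replaces the exhaustive 2**p enumeration, so memory drops from O(2**p * p) to O(n_samples * p).

```python
import random

def sample_coalitions(p, n_samples, seed=0):
    # Hypothetical sketch (not the actual PR code): draw n_samples random
    # binary coalition vectors instead of enumerating all 2**p of them.
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(p)] for _ in range(n_samples)]

# With p = 387 variables, 1000 sampled coalitions fit easily in memory.
coalitions = sample_coalitions(p=387, n_samples=1000)
assert len(coalitions) == 1000 and all(len(c) == 387 for c in coalitions)
```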