yramon / ShapCounterfactual

Hybrid algorithm based on SEDC and SHAP for computing Evidence Counterfactuals (SHAP-Counterfactual): explaining the model predictions of any classifier using a minimal set of features, such that removing these features results in a predicted class change.

Clarification for parameter types #1

Open joelrorseth opened 2 years ago

joelrorseth commented 2 years ago

Hi Yanou, first of all, many thanks for these excellent counterfactual adaptations of existing explanation methods. I am using SEDC_Explainer, ShapCounterfactual, and LimeCounterfactual to explain results from an IR ranking model.

I have SEDC_Explainer working; however, I'm having difficulty with ShapCounterfactual and LimeCounterfactual. Do you have any code demonstrating how to use these two explainers? I found the tutorial notebooks in your EDC repo, which were a great help for SEDC, so I'm hoping to find similar sample code for these two as well.

I was hoping that all 3 explainers would have the same interface (i.e., parameters), but there are a few differences, which are the source of my issues. In particular, I believe the parameter documentation is out-of-date in a few spots, which has made it difficult to determine the true expected types. Here's what I've found so far:

When I try to pass a regular string (i.e., a sentence of words, since I wish to determine salient words), I get the following error:

Traceback (most recent call last):
  File "main.py", line 20, in <module>
    explanation = explainer.explanation(doc_text)
  File "/home/shap_c/shap_counterfactual.py", line 100, in explanation
    reference = np.reshape(np.zeros(np.shape(instance)[1]), (1,len(np.zeros(np.shape(instance)[1]))))
IndexError: tuple index out of range

When I try to pass a (1,vocab_size) csr_matrix, representing a single sentence, I get the following error:

Traceback (most recent call last):
  File "main.py", line 19, in <module>
    explanation = explainer.explanation(sparse_doc)
  File "/home/shap_c/shap_counterfactual.py", line 104, in explanation
    shap_values = explainer.shap_values(instance, nsamples = 5000, l1_reg="aic")
  File "/home/venv/lib/python3.6/site-packages/shap/explainers/_kernel.py", line 186, in shap_values
    explanations.append(self.explain(data, **kwargs))
  File "/home/venv/lib/python3.6/site-packages/shap/explainers/_kernel.py", line 376, in explain
    self.run()
  File "/home/venv/lib/python3.6/site-packages/shap/explainers/_kernel.py", line 516, in run
    self.y[self.nsamplesRun * self.N:self.nsamplesAdded * self.N, :] = np.reshape(modelOut, (num_to_run, self.D))
  File "<__array_function__ internals>", line 6, in reshape
  File "/home/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 299, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/home/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 58, in _wrapfunc
    return bound(*args, **kwds)
ValueError: cannot reshape array of size 1 into shape (5000,1)
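If it helps, my current guess is that this reshape fails because SHAP's KernelExplainer calls the model once on a matrix of perturbed rows and expects one prediction back per row, rather than a single scalar. A simplified sketch of the vectorized shape I believe is expected (`score_document` here is a hypothetical stand-in for my ranking model's scoring function):

```python
import numpy as np
from scipy.sparse import csr_matrix

def score_document(row):
    # Hypothetical stand-in for my IR ranking model's scoring function:
    # here, simply the number of active vocabulary terms in the row.
    return row.getnnz()

def classifier_fn(instances):
    # KernelExplainer perturbs the input into many rows and calls this
    # function once, so it must return one score per row, not a scalar.
    instances = csr_matrix(instances)  # tolerate dense input as well
    return np.asarray([score_document(instances.getrow(i))
                       for i in range(instances.shape[0])])

# One score per perturbed row, so np.reshape(modelOut, (n, 1)) succeeds:
perturbed = csr_matrix(np.eye(5))  # 5 fake perturbed instances
print(classifier_fn(perturbed).shape)  # (5,)
```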

Can you clarify what should be passed to explanation() for these two explainers?

Lastly, the model I'm explaining is not an ML model, so it doesn't use Scikit or any other ML framework, but I am able to calculate a score to return in my classifier_fn. LimeCounterfactual has two parameters that seem to be Scikit-specific, namely c_fn and vectorizer. Can you explain more about the expected type of these two parameters, so that I can figure out how to provide a non-Scikit equivalent (if this is even possible)?

Thanks for your help! If anything needs changing, let me know, I'd be happy to contribute fixes!

yramon commented 2 years ago

Hi Joel,

Apologies for the delayed answer. Thanks a lot for your comment and suggestions, as well as the kind words :)

Regarding the interface, there are indeed some inconsistencies in the parameters and how I named them. To your questions, here are my answers:

In LimeCounterfactual, c_fn is a pipeline that you can create with make_pipeline(vectorizer, classification_model). classification_model here is a trained model, for example:

```python
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=1, penalty='l2',
                   random_state=None, solver='sag', tol=0.0001,
                   verbose=0, warm_start=False)
```

Indeed, this does not seem to be agnostic, in the sense that make_pipeline() expects scikit-learn estimators. The vectorizer is also a scikit-learn estimator. I fit the vectorizer using a CountVectorizer() on an artificial text data set (artificial_text below), which is an nx1 matrix with n instances, where each entry is the artificial text of one instance. For example, 'a1 a3' means the instance has features a1 and a3 "active", in line with how a text document could be vectorized as having a word or not. So the CountVectorizer() just transforms the artificial text data back to the original numerical feature representation of the behavioral data (e.g., "liked" a Facebook page or not, 1 vs. 0). Here's what it looks like:

```python
text_data = artificial_text
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
vectorizer.fit_transform(text_data)
```
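Putting the pieces together, here's a minimal runnable sketch (the artificial_text data and labels below are invented just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each entry lists the "active" features of one instance, e.g. 'a1 a3'
# means features a1 and a3 are active (1) and the rest are inactive (0).
artificial_text = ['a1 a3', 'a2', 'a1 a2 a3']
labels = [1, 0, 1]  # invented target classes

# Fit the vectorizer on the artificial text representation.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(artificial_text)

# Train the classification model on the numerical representation.
classification_model = LogisticRegression(C=1, solver='sag', tol=0.0001)
classification_model.fit(X, labels)

# c_fn maps raw artificial text straight to class probabilities.
c_fn = make_pipeline(vectorizer, classification_model)
print(c_fn.predict_proba(['a1 a3']).shape)  # (1, 2)
```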

It would be great if you could look into making the package more agnostic, so that non-Scikit estimators can be used. classifier_fn() indeed already makes it possible to use non-Scikit estimators, but c_fn doesn't yet.
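One possible direction (just a sketch, not something the package supports today): accept any object that mimics the pipeline's predict_proba() interface, so c_fn would no longer need to be a scikit-learn pipeline. All names and the toy vectorize/score functions below are hypothetical:

```python
import numpy as np

class AgnosticPipeline:
    """Duck-typed stand-in for make_pipeline(vectorizer, model): any
    vectorize/score pair exposing the same predict_proba(texts) surface
    could then be passed as c_fn, without scikit-learn estimators."""

    def __init__(self, vectorize, score):
        self.vectorize = vectorize  # raw texts -> feature matrix
        self.score = score          # feature matrix -> class-1 scores

    def predict_proba(self, texts):
        p = np.asarray(self.score(self.vectorize(texts)), dtype=float)
        return np.column_stack([1.0 - p, p])  # scikit-like (n, 2) shape

# Toy example: "vectorizer" counts tokens, "model" squashes the count.
pipe = AgnosticPipeline(
    vectorize=lambda texts: np.array([[len(t.split())] for t in texts]),
    score=lambda X: 1.0 / (1.0 + np.exp(-X[:, 0])),
)
print(pipe.predict_proba(['a1 a3']).shape)  # (1, 2)
```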

Thanks so much for your comments and contributions!! If anything is unclear, please let me know. I'm having my PhD defense in a few weeks and starting a new job in October, but we could have more discussions about this during summer if you want.

Best regards, Yanou

joelrorseth commented 2 years ago

Hi Yanou, thanks for your detailed explanations! I'd be happy to contribute some of these revisions and improvements to SEDC / LIME-C / SHAP-C. Once I find some free time (hopefully within the next month or two), I will open a PR to improve the interface / data types / documentation.

I was able to utilize SEDC and SHAP-C for general (Scikit-agnostic) classification, using classifier_fn. This generalization required few (if any) changes to your code, but if warranted, I will open another PR to integrate any improvements.

I am still working on generalizing LIME-C. I think c_fn might be difficult to generalize, as you explained. I am still experimenting to find some Scikit-agnostic workaround. I suppose this will depend on whether the lime library itself can support non-Scikit classification.

Let me know if you have any thoughts or suggestions, I'd be happy to discuss further. Best of luck with your PhD defense and new job!

yramon commented 2 years ago

Hi Joel,

Sounds good. It would be great if you could integrate improvements to SEDC/SHAP-C. For LIME-C, it will indeed depend on the lime library on which mine is based; I'm not sure you'll find a workaround. Let me know if you want to discuss anything.

Warm regards, Yanou