Open joelrorseth opened 2 years ago
Hi Joel,
Apologies for the delayed answer. Thanks a lot for your comment and suggestions, as well as the kind words :)
Regarding the interface, you're right that there are some inconsistencies in the parameters and how I named them. As for your questions, here are my answers:
In LimeCounterfactual, c_fn is a pipeline that you can create with make_pipeline(vectorizer, classification_model). Here, classification_model is a trained model, for example:

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='sag', tol=0.0001, verbose=0, warm_start=False)

Indeed, this does not seem to be agnostic, in the sense that make_pipeline() expects scikit-learn estimators. The vectorizer is also a scikit-learn estimator. I fit the vectorizer, a CountVectorizer(), on an artificial text data set (artificial_text below): an n x 1 matrix with n instances, where each entry is the artificial text. For example, 'a1 a3' means the instance has features a1 and a3 "active", in line with how a text document can be vectorized as containing a word or not. The CountVectorizer() thus just transforms the artificial text data back to the original numerical feature representation of the behavioral data (e.g., "liked" a Facebook page or not, 1 vs. 0). Here's what it looks like:

text_data = artificial_text
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
vectorizer.fit_transform(text_data)
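Putting the pieces above together, a runnable sketch of this setup might look like the following. The artificial text rows, labels, and trimmed-down hyperparameters here are illustrative assumptions, not the actual data set:

```python
# Hedged sketch of the c_fn construction described above.
# The artificial_text rows and labels are made-up stand-ins.
import sklearn.feature_extraction.text
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each row is one instance; 'a1 a3' means features a1 and a3 are "active",
# like a document containing those words (behavioral data, 1 vs. 0).
artificial_text = ['a1 a3', 'a2', 'a1 a2 a3', 'a3', 'a1', 'a2 a3']
labels = [1, 0, 1, 0, 1, 0]

text_data = artificial_text
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X = vectorizer.fit_transform(text_data)  # n x vocab_size sparse matrix

classification_model = LogisticRegression(C=1, solver='sag', max_iter=1000)
classification_model.fit(X, labels)

# c_fn: a pipeline mapping raw artificial text straight to probabilities
c_fn = make_pipeline(vectorizer, classification_model)
print(c_fn.predict_proba(['a1 a3']))  # shape (1, 2)
```

Since both steps are already fitted, the pipeline can be used for prediction directly without calling fit on it again.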
It would be great if you could look into making the package more agnostic, so that non-Scikit estimators can be used. classifier_fn() already makes this possible, but c_fn doesn't yet.
Thanks so much for your comments and contributions! If anything is unclear, please let me know. I have my PhD defense in a few weeks and am starting a new job in October, but we could discuss this further over the summer if you'd like.
Best regards, Yanou
Hi Yanou, thanks for your detailed explanations! I'd be happy to contribute some of these revisions and improvements to SEDC / LIME-C / SHAP-C. Once I find some free time (hopefully within the next month or two), I will open a PR to improve the interface / data types / documentation.
I was able to use SEDC and SHAP-C for general (Scikit-agnostic) classification via classifier_fn. This generalization required few (if any) changes to your code, but if warranted, I will open another PR to integrate any improvements.
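To illustrate the kind of Scikit-agnostic usage I mean, here is a minimal sketch of a classifier_fn wrapper. The helper name and the toy scoring rule are my own illustrative assumptions, not code from the repository; it simply follows the convention noted below that the explainers expect an ndarray of scores rather than a plain float:

```python
# Hedged sketch of a framework-agnostic classifier_fn wrapper.
# make_classifier_fn and the scoring rule are illustrative assumptions.
import numpy as np

def make_classifier_fn(score_instance):
    """Wrap any per-instance scoring callable (no Scikit required) into
    a function that takes a 2D feature matrix and returns an ndarray of
    scores, one per row (not a plain float)."""
    def classifier_fn(X):
        # Accept sparse or dense input
        X = X.toarray() if hasattr(X, 'toarray') else np.asarray(X)
        return np.array([score_instance(row) for row in X])
    return classifier_fn

# Toy non-Scikit "model": score is the fraction of active features
classifier_fn = make_classifier_fn(lambda row: row.mean())
print(classifier_fn(np.array([[1, 0, 1], [0, 0, 0]])))
```

Any scoring function, e.g. one backed by an IR ranking model, could be dropped in place of the lambda.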
I am still working on generalizing LIME-C. As you explained, c_fn might be difficult to generalize, so I am still experimenting to find a Scikit-agnostic workaround. I suppose this will depend on whether the lime library itself can support non-Scikit classification.
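One avenue worth experimenting with: make_pipeline only duck-types its steps, so a plain Python class exposing fit and predict_proba may be enough to wrap a non-Scikit model. A minimal sketch, where the wrapper class and its scoring rule are purely illustrative assumptions:

```python
# Hedged sketch of a possible workaround: sklearn pipelines duck-type
# their steps, so a plain class with fit/predict_proba can stand in
# for a scikit-learn classifier. The scoring rule below is a toy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

class NonScikitClassifier:
    def fit(self, X, y=None):
        return self  # nothing to train in this toy example

    def predict_proba(self, X):
        X = X.toarray() if hasattr(X, 'toarray') else np.asarray(X)
        pos = (X.sum(axis=1) > 1).astype(float)  # toy scoring rule
        return np.column_stack([1.0 - pos, pos])  # [P(neg), P(pos)]

vectorizer = CountVectorizer().fit(['a1 a3', 'a2'])
c_fn = make_pipeline(vectorizer, NonScikitClassifier())
print(c_fn.predict_proba(['a1 a3']))
```

Whether this satisfies everything lime itself requires of the pipeline is exactly what I still need to verify.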
Let me know if you have any thoughts or suggestions; I'd be happy to discuss further. Best of luck with your PhD defense and new job!
Hi Joel,
Sounds good. It would be great if you could integrate improvements to SEDC/SHAP-C. For LIME-C, it will indeed depend on the lime library that mine is based on. Not sure if you'll find a workaround. Let me know if you want to discuss anything.
Warm regards, Yanou
Hi Yanou, first of all, many thanks for these excellent counterfactual adaptations of existing explanation methods. I am using SEDC_Explainer, ShapCounterfactual, and LimeCounterfactual to explain results from an IR ranking model. I have SEDC_Explainer working; however, I'm having difficulty with ShapCounterfactual and LimeCounterfactual. Do you have any code demonstrating how to use these two explainers? I found the tutorial notebooks in your EDC repo, which were a great help for SEDC, so I'm hoping to find some similar sample code for these two as well.

I was hoping that all three explainers would have the same interface (i.e., parameters), but there are a few differences, which are the source of my issues. In particular, I believe the parameter documentation is out of date in a few spots, which has made it difficult to determine the true expected types. Here's what I've found so far:
- an ndarray with 1 element is expected, not a float. I think the same applies to ShapCounterfactual and LimeCounterfactual.
- the instance parameter in explanation() should, per the documentation, be a string. I don't know what type is truly expected, but it does not seem to be a string, nor a csr_matrix with shape (1, vocab_size).

When I try to pass a regular string (i.e., a sentence of words, since I wish to determine salient words), I get the following error:
When I try to pass a (1, vocab_size) csr_matrix representing a single sentence, I get the following error:
Can you clarify what should be passed to explanation() for these two explainers?
Lastly, the model I'm explaining is not an ML model, so it doesn't use Scikit or any other ML framework, but I am able to calculate a score to return in my classifier_fn.
LimeCounterfactual has two parameters that seem to be Scikit-specific, namely c_fn and vectorizer. Can you explain more about the expected types of these two parameters, so that I can figure out how to provide a non-Scikit equivalent (if this is even possible)?

Thanks for your help! If anything needs changing, let me know; I'd be happy to contribute fixes!