Open joelrorseth opened 2 years ago
Hi Joel,
Apologies for the delayed answer. Thanks a lot for your comment and suggestions, as well as the kind words :)
Regarding the interface, you're right that there are some inconsistencies in the parameters and how I named them. As for your questions, here are my answers:
In LimeCounterfactual, c_fn is a pipeline that you can create with make_pipeline(vectorizer, classification_model). Here, classification_model is a trained model, for example:

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='sag', tol=0.0001, verbose=0, warm_start=False)

Indeed, this does not seem to be agnostic, in the sense that make_pipeline() expects scikit-learn estimators. The vectorizer is also a scikit-learn estimator. I fit the vectorizer, a CountVectorizer(), on an artificial text data set (artificial_text below): an n x 1 matrix with n instances, where each entry is the artificial text. For example, 'a1 a3' means the instance has features a1 and a3 "active", in line with how a text document can be vectorized as containing a word or not. The CountVectorizer() thus just transforms the artificial text data back to the original numerical feature representation of the behavioral data (e.g., "liked" a Facebook page or not, 1 vs. 0). Here's what it looks like:

text_data = artificial_text
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
vectorizer.fit_transform(text_data)
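Putting the pieces above together, a runnable sketch of this setup might look like the following. The artificial text rows, labels, and trimmed-down hyperparameters here are illustrative assumptions, not the actual data set:

```python
# Hedged sketch of the c_fn construction described above.
# The artificial_text rows and labels are made-up stand-ins.
import sklearn.feature_extraction.text
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each row is one instance; 'a1 a3' means features a1 and a3 are "active",
# like a document containing those words (behavioral data, 1 vs. 0).
artificial_text = ['a1 a3', 'a2', 'a1 a2 a3', 'a3', 'a1', 'a2 a3']
labels = [1, 0, 1, 0, 1, 0]

text_data = artificial_text
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
X = vectorizer.fit_transform(text_data)  # n x vocab_size sparse matrix

classification_model = LogisticRegression(C=1, solver='sag', max_iter=1000)
classification_model.fit(X, labels)

# c_fn: a pipeline mapping raw artificial text straight to probabilities
c_fn = make_pipeline(vectorizer, classification_model)
print(c_fn.predict_proba(['a1 a3']))  # shape (1, 2)
```

Since both steps are already fitted, the pipeline can be used for prediction directly without calling fit on it again.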
It would be great if you could look into making the package more agnostic, so that non-Scikit estimators can be used. classifier_fn() already makes this possible, but c_fn doesn't yet.
Thanks so much for your comments and contributions! If anything is unclear, please let me know. I have my PhD defense in a few weeks and am starting a new job in October, but we could discuss this further over the summer if you'd like.
Best regards, Yanou
Hi Yanou, thanks for your detailed explanations! I'd be happy to contribute some of these revisions and improvements to SEDC / LIME-C / SHAP-C. Once I find some free time (hopefully within the next month or two), I will open a PR to improve the interface / data types / documentation.
I was able to use SEDC and SHAP-C for general (Scikit-agnostic) classification via classifier_fn. This generalization required few (if any) changes to your code, but if warranted, I will open another PR to integrate any improvements.
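To illustrate the kind of Scikit-agnostic usage I mean, here is a minimal sketch of a classifier_fn wrapper. The helper name and the toy scoring rule are my own illustrative assumptions, not code from the repository; it simply follows the convention noted below that the explainers expect an ndarray of scores rather than a plain float:

```python
# Hedged sketch of a framework-agnostic classifier_fn wrapper.
# make_classifier_fn and the scoring rule are illustrative assumptions.
import numpy as np

def make_classifier_fn(score_instance):
    """Wrap any per-instance scoring callable (no Scikit required) into
    a function that takes a 2D feature matrix and returns an ndarray of
    scores, one per row (not a plain float)."""
    def classifier_fn(X):
        # Accept sparse or dense input
        X = X.toarray() if hasattr(X, 'toarray') else np.asarray(X)
        return np.array([score_instance(row) for row in X])
    return classifier_fn

# Toy non-Scikit "model": score is the fraction of active features
classifier_fn = make_classifier_fn(lambda row: row.mean())
print(classifier_fn(np.array([[1, 0, 1], [0, 0, 0]])))
```

Any scoring function, e.g. one backed by an IR ranking model, could be dropped in place of the lambda.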
I am still working on generalizing LIME-C. As you explained, c_fn might be difficult to generalize, so I am still experimenting to find a Scikit-agnostic workaround. I suppose this will depend on whether the lime library itself can support non-Scikit classification.
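One avenue worth experimenting with: make_pipeline only duck-types its steps, so a plain Python class exposing fit and predict_proba may be enough to wrap a non-Scikit model. A minimal sketch, where the wrapper class and its scoring rule are purely illustrative assumptions:

```python
# Hedged sketch of a possible workaround: sklearn pipelines duck-type
# their steps, so a plain class with fit/predict_proba can stand in
# for a scikit-learn classifier. The scoring rule below is a toy.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

class NonScikitClassifier:
    def fit(self, X, y=None):
        return self  # nothing to train in this toy example

    def predict_proba(self, X):
        X = X.toarray() if hasattr(X, 'toarray') else np.asarray(X)
        pos = (X.sum(axis=1) > 1).astype(float)  # toy scoring rule
        return np.column_stack([1.0 - pos, pos])  # [P(neg), P(pos)]

vectorizer = CountVectorizer().fit(['a1 a3', 'a2'])
c_fn = make_pipeline(vectorizer, NonScikitClassifier())
print(c_fn.predict_proba(['a1 a3']))
```

Whether this satisfies everything lime itself requires of the pipeline is exactly what I still need to verify.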
Let me know if you have any thoughts or suggestions; I'd be happy to discuss further. Best of luck with your PhD defense and new job!
Hi Joel,
Sounds good. It would be great if you could integrate improvements to SEDC/SHAP-C. For LIME-C, it will indeed depend on the lime library that mine is based on. Not sure if you'll find a workaround. Let me know if you want to discuss anything.
Warm regards, Yanou
Hi Yanou, first of all, many thanks for these excellent counterfactual adaptations of existing explanation methods. I am using SEDC_Explainer, ShapCounterfactual, and LimeCounterfactual to explain results from an IR ranking model. I have SEDC_Explainer working; however, I'm having difficulty with ShapCounterfactual and LimeCounterfactual. Do you have any code demonstrating how to use these two explainers? I found the tutorial notebooks in your EDC repo, which were a great help for SEDC, so I'm hoping to find some similar sample code for these two as well.

I was hoping that all three explainers would have the same interface (i.e., parameters), but there are a few differences, which are the source of my issues. In particular, I believe the parameter documentation is out of date in a few spots, which has made it difficult to determine the true expected types. Here's what I've found so far:
- an ndarray with 1 element is expected, not a float. I think the same applies to ShapCounterfactual and LimeCounterfactual.
- the instance parameter in explanation() should, per the documentation, be a string. I don't know what type is truly expected, but it does not seem to be a string, nor a csr_matrix with shape (1, vocab_size).

When I try to pass a regular string (i.e., a sentence of words, since I wish to determine salient words), I get the following error:
When I try to pass a (1, vocab_size) csr_matrix representing a single sentence, I get the following error:
Can you clarify what should be passed to explanation() for these two explainers?
Lastly, the model I'm explaining is not an ML model, so it doesn't use Scikit or any other ML framework, but I am able to calculate a score to return in my classifier_fn.
LimeCounterfactual has two parameters that seem to be Scikit-specific, namely c_fn and vectorizer. Can you explain more about the expected types of these two parameters, so that I can figure out how to provide a non-Scikit equivalent (if this is even possible)?

Thanks for your help! If anything needs changing, let me know; I'd be happy to contribute fixes!