zelros / cinnamon

CinnaMon is a Python library which offers a number of tools to detect, explain, and correct data drift in a machine learning system
MIT License
76 stars 9 forks source link
concept-drift covariate-shift domain-adaptation drift drift-correction drift-detection explainable-ai machine-learning mlops monitoring streaming-data

.. -- mode: rst --

|ReadTheDocs| |License| |PyPi|_

.. |ReadTheDocs| image:: https://readthedocs.org/projects/cinnamon/badge .. _ReadTheDocs: https://cinnamon.readthedocs.io/en/add-documentation

.. |License| image:: https://img.shields.io/badge/License-MIT-yellow .. _License: https://github.com/zelros/cinnamon/blob/master/LICENSE.txt

.. |PyPi| image:: https://img.shields.io/pypi/v/cinnamon .. _PyPi: https://pypi.org/project/cinnamon/

=============================== Introduction to CinnaMon

CinnaMon is a Python library which allows to monitor data drift on a machine learning system. It provides tools to study data drift between two datasets, especially to detect, explain, and correct data drift.

⚡️ Quickstart

As a quick example, let's illustrate the use of CinnaMon on the breast cancer data where we voluntarily introduce some data drift.

Setup the data and build a model

.. code:: python

>>> import pandas as pd
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from xgboost import XGBClassifier

# load breast cancer data
>>> dataset = datasets.load_breast_cancer()
>>> X = pd.DataFrame(dataset.data, columns = dataset.feature_names)
>>> y = dataset.target

# split data in train and valid dataset
>>> X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=2021)

# introduce some data drift in valid by filtering with 'worst symmetry' feature
>>> y_valid = y_valid[X_valid['worst symmetry'].values > 0.3]
>>> X_valid = X_valid.loc[X_valid['worst symmetry'].values > 0.3, :].copy()

# fit a XGBClassifier on the training data
>>> clf = XGBClassifier(use_label_encoder=False)
>>> clf.fit(X=X_train, y=y_train, verbose=10)

Initialize ModelDriftExplainer and fit on train and validation data

.. code:: python

>>> import cinnamon
>>> from cinnamon.drift import ModelDriftExplainer

# initialize a drift explainer with the built XGBClassifier and fit it on train
# and valid data
>>> drift_explainer = ModelDriftExplainer(model=clf)
>>> drift_explainer.fit(X1=X_train, X2=X_valid, y1=y_train, y2=y_valid)

Detect data drift by looking at main graphs and metrics

.. code:: python

# Distribution of logit predictions
>>> cinnamon.plot_prediction_drift(drift_explainer, bins=15)

.. image:: https://github.com/zelros/cinnamon/raw/master/docs/img/plot_prediction_drift.png :width: 400 :align: center

We can see on this graph that because of the data drift we introduced in validation data the distribution of predictions are different (they do not overlap well). We can also compute the corresponding drift metrics:

.. code:: python

# Corresponding metrics
>>> drift_explainer.get_prediction_drift()
[{'mean_difference': -3.643428434667366,
'wasserstein': 3.643428434667366,
'kolmogorov_smirnov': KstestResult(statistic=0.2913775225333014, pvalue=0.00013914094110123454)}]

Comparing the distributions of predictions for two datasets is one of the main indicator we use in order to detect data drift. The two other indicators are:

Explain data drift by computing the drift importances

Drift importances can be thought as equivalent of feature importances but in terms of data drift.

.. code:: python

# plot drift importances
>>> cinnamon.plot_tree_based_drift_importances(drift_explainer, n=7)

.. image:: https://github.com/zelros/cinnamon/raw/master/docs/img/plot_drift_values.png :width: 400 :align: center

Here the feature worst symmetry is rightly identified as the one which contributes the most to the data drift.

More

See "notes" below to explore all the functionalities of CinnaMon.

🛠 Installation

CinnaMon is intended to work with Python 3.7 or above. Installation can be done with pip:

.. code:: sh

$ pip install cinnamon

🔗 Notes

👍 Contributing

Check out the contribution <https://github.com/zelros/cinnamon/blob/master/CONTRIBUTING.md>_ section.

📝 License

CinnaMon is free and open-source software licensed under the MIT <https://github.com/zelros/cinnamon/blob/master/LICENSE.txt>_.