worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

Logistic detection metric #42

Closed zechchair closed 9 months ago

zechchair commented 1 year ago

Greetings,

I am a student and I have a strong interest in your work. Currently, I am engrossed in a research endeavor focused on the Anonymization of extensive relational databases through the utilization of synthetic data generation. What particularly captivated me is how seamlessly your work aligns with the objectives of my research.

My current endeavor involves attempting to replicate the outcomes outlined in your published paper. During this process, I observed that you employed Logistic detection as a metric, as depicted in the following image: [image]

However, I encountered difficulty in locating an implementation of this metric, even within SDV (Synthetic Data Vault). Consequently, I find myself uncertain about the efficacy of my manual attempts in reproducing the same results.


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

def compute_logistic_detection_score(real_data, synthetic_data, n_folds=3):
    # Combine the real and synthetic data; assign() returns copies,
    # so the caller's DataFrames are not modified
    data = pd.concat([real_data.assign(orig=0), synthetic_data.assign(orig=1)])
    data = data.reset_index(drop=True).fillna(0)

    # Split the data into features and target
    X = data.drop("orig", axis=1)
    y = data["orig"]

    # Detect categorical variables and encode each one as integer labels
    categorical_columns = X.select_dtypes(include="object").columns.tolist()
    for col in categorical_columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))

    # Note: despite the function name, the detector here is a random forest
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

    # Compute one ROC-AUC score per cross-validation fold
    roc_auc_scores = cross_val_score(
        rf_classifier, X, y, cv=n_folds, scoring="roc_auc"
    )

    # Transform the ROC-AUC scores: map [0.5, 1] onto [0, 1]
    transformed_scores = np.maximum(0.5, roc_auc_scores) * 2 - 1

    # Calculate the average transformed ROC-AUC score
    avg_transformed_score = np.mean(transformed_scores)

    # Calculate the logistic detection (LD) score
    ld_score = 100 * (1 - avg_transformed_score)

    return ld_score

I also have another question. I'm eager to understand the necessary specifications to replicate the results presented in the table above, both for the AirBnB and Rossmann datasets. In your publication, I noted the hardware configuration: 2x AMD EPYC 7H12 64-Core Processor, 2x RTX 3090 GPU, and 1TB RAM, all running on Ubuntu 20.04 LTS.

However, I am inclined to believe that this configuration might be somewhat excessive, and I wonder if it's possible to achieve the same outcomes with a more modest setup, specifically tailored to reproducing only the results displayed in the aforementioned table. If this is indeed the case, I am genuinely interested in discovering the minimal configuration necessary for this task.

Thank you immensely for your assistance.

avsolatorio commented 1 year ago

Hello @zechchair, great to hear you are also working on synthetic data generation! Indeed, I had to implement these metrics on top of the SDMetrics library. Below is a preview of the implementation of the metrics.

Please check the notebooks under this repo as well: https://github.com/avsolatorio/REaLTabFormer-Experiments/tree/main/exp-relational. I have not extensively documented the entirety of that repo, but the notebooks are relatively clean. Please feel free to let me know if you have any questions. More importantly, please let me know if you see any bugs! 😅

Regarding the configuration, I believe these experiments can run in a Google Colab environment.

P.S. The non-tabular dataset experiments are not on this repo.


"""scikit-learn based DetectionMetrics for single table datasets."""
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# from sdmetrics.single_table.detection.base import DetectionMetric
# Do not use the import above. Use the fixed version of DetectionMetric above,
# whose HyperTransformer uses a OneHotEncoder with handle_unknown="ignore".

class ScikitLearnClassifierDetectionMetric(DetectionMetric):
    """Base class for Detection metrics build using Scikit Learn Classifiers.

    The base class for these metrics makes a prediction using a scikit-learn
    pipeline which contains a SimpleImputer, a RobustScaler and finally
    the classifier, which is defined in the subclasses.
    """

    name = 'Scikit-Learn Detection'

    @staticmethod
    def _get_classifier():
        """Build and return an instance of a scikit-learn Classifier."""
        raise NotImplementedError()

    @classmethod
    def _fit_predict(cls, X_train, y_train, X_test):
        """Fit a pipeline to the training data and then use it to make prediction on test data."""
        model = Pipeline([
            ('imputer', SimpleImputer()),
            ('scalar', RobustScaler()),
            ('classifier', cls._get_classifier()),
        ])
        model.fit(X_train, y_train)

        return model.predict_proba(X_test)[:, 1]

"""Detectors"""
import sdv.metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

class LogisticDetection(ScikitLearnClassifierDetectionMetric):
    """ScikitLearnClassifierDetectionMetric based on a LogisticRegression.

    This metric builds a LogisticRegression Classifier that learns to tell the synthetic
    data apart from the real data, which later on is evaluated using Cross Validation.

    The output of the metric is one minus the average ROC AUC score obtained.
    """

    name = 'LogisticRegression Detection'

    @staticmethod
    def _get_classifier():
        return LogisticRegression(solver='lbfgs', max_iter=5000)

class RandomForestDetection(ScikitLearnClassifierDetectionMetric):
    """ScikitLearnClassifierDetectionMetric based on a RandomForest.

    This metric builds a RandomForest Classifier that learns to tell the synthetic
    data apart from the real data, which later on is evaluated using Cross Validation.

    The output of the metric is one minus the average ROC AUC score obtained.
    """

    name = 'RandomForest Detection'

    @staticmethod
    def _get_classifier():
        return RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=25, oob_score=False)

# m = sdv.metrics.relational.LogisticDetection(lm)

import random

import numpy as np
import torch

def report_logistic(real_in, real_out, rtf_in, rtf_out, sdv_in, sdv_out, join_on, seed=None, verbose=True, classifier="randomforest"):
    # Seed every random number generator in use so that runs are repeatable
    if seed is not None:
        np.random.seed(seed)
        random.seed(seed)
        torch.manual_seed(seed)

    if classifier == "logistic":
        lm = LogisticDetection()
    elif classifier == "randomforest":
        lm = RandomForestDetection()
    else:
        raise ValueError(f'Invalid classifier: "{classifier}"')

    real_flat = real_out.merge(real_in, on=join_on)
    rtf_flat = rtf_out.merge(rtf_in, on=join_on)
    sdv_flat = sdv_out.merge(sdv_in, on=join_on)

    data = dict(
        parent=dict(
            sdv=lm.compute(real_in.drop(join_on, axis=1), sdv_in.drop(join_on, axis=1)),
            rtf=lm.compute(real_in.drop(join_on, axis=1), rtf_in.drop(join_on, axis=1))),
        child=dict(
            sdv=lm.compute(real_out.drop(join_on, axis=1), sdv_out.drop(join_on, axis=1)),
            rtf=lm.compute(real_out.drop(join_on, axis=1), rtf_out.drop(join_on, axis=1))),
        merged=dict(
            sdv=lm.compute(real_flat.drop(join_on, axis=1), sdv_flat.drop(join_on, axis=1)),
            rtf=lm.compute(real_flat.drop(join_on, axis=1), rtf_flat.drop(join_on, axis=1))),
    )

    if verbose:
        print("LogisticDetection for parents")
        print("SDV:", data["parent"]["sdv"])
        print("REaLTabFormer:", data["parent"]["rtf"])
        print()
        print("LogisticDetection for children")
        print("SDV:", data["child"]["sdv"])
        print("REaLTabFormer:", data["child"]["rtf"])
        print()
        print("LogisticDetection for merged")
        print("SDV:", data["merged"]["sdv"])
        print("REaLTabFormer:", data["merged"]["rtf"])

    return data

def get_comp_samp(tables, join_on, seed, n=1000):
    parent_comp_samp = tables["parent"][join_on].sample(n=n, random_state=seed)
    parent_comp_samp = tables["parent"][tables["parent"][join_on].isin(parent_comp_samp.tolist())]

    child_comp_samp = tables["child"][tables["child"][join_on].isin(parent_comp_samp[join_on].tolist())]

    return parent_comp_samp, child_comp_samp

# m.compute(tables, new_data, metadata)
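
For readers who want a standalone reference, the following is a minimal sketch of a logistic detection score that does not depend on SDMetrics. It is not the implementation used in the paper. It assumes the columns are either numeric or string-valued, replaces the HyperTransformer with a scikit-learn `ColumnTransformer` (an assumption made here for self-containment), and fits the preprocessing inside each cross-validation fold. The function name `logistic_detection_score` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def logistic_detection_score(real, synthetic, n_splits=3, seed=0):
    """Return 1 - mean(normalized ROC AUC); values near 1 mean hard to detect."""
    data = pd.concat([real, synthetic], ignore_index=True)
    # Real rows are labeled 1 and synthetic rows 0.
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])

    # Assumption: every non-numeric column is treated as categorical.
    cat_cols = [c for c in data.columns
                if not pd.api.types.is_numeric_dtype(data[c])]
    num_cols = [c for c in data.columns if c not in cat_cols]

    preprocess = ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer()),
            ("scaler", RobustScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            # Categories unseen during fitting are encoded as all zeros.
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(solver="lbfgs", max_iter=5000)),
    ])

    scores = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(data, y):
        model.fit(data.iloc[train_idx], y[train_idx])
        proba = model.predict_proba(data.iloc[test_idx])[:, 1]
        auc = roc_auc_score(y[test_idx], proba)
        # Map an AUC in [0.5, 1] onto [0, 1]; AUCs below 0.5 count as 0.
        scores.append(max(0.5, auc) * 2 - 1)

    return 1 - float(np.mean(scores))
```

The key point is that the ROC AUC is computed once per fold from held-out predicted probabilities, and only then normalized and averaged. Multiplying the returned value by 100 gives a percentage-style score.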

zechchair commented 1 year ago

I greatly appreciate your assistance. I'm currently continuing my efforts, and I wanted to inform you that the child sampling process does not support multiple GPUs. [image]

zechchair commented 1 year ago

Hello @avsolatorio,

I've reviewed your paper and noticed the following regarding your data splits:

We created train and test splits. For the Rossmann dataset, we used 80% of the stores data and their associated sales records for our training data. We used the remaining stores as the test data. We also limit the data used in the experiments from 2015-06 onwards spanning 2 months of sales data per store. In the Airbnb dataset, we considered a random sample of 10,000 users for the experiment. We take 8,000 as part of our training data, and we assessed the metrics and plots using the 2,000 users in the test data. We also limit the users considered to those having at most 50 sessions in the data.

I'm curious about the rationale behind the choice of train and test splits in your study. Typically, when generating synthetic data to replicate real-world scenarios, one might consider training the model on the entire dataset and then using metrics to compare the distribution and machine learning utility of the new data. Could you kindly clarify the reasoning behind your approach, or correct me if I have misunderstood your methodology?

Thank you for your valuable assistance.

avsolatorio commented 1 year ago

Hi @zechchair, the "test split" in this case is used to detect overfitting of the model. For the non-relational model, we can use the full data since an algorithm auto-detects overfitting. This is not the case for the relational model, so a separate split is used.

Also, to assess a synthetic data model, one has to create separate test data against which synthetic data produced by the model will be compared. This will allow you to benchmark a machine learning model trained on the test and synthetic data.

In the paper, we report the machine learning efficacy metric on hold-out test data to ensure that the generated synthetic data generalizes. Nevertheless, you can train the synthetic generation model with the full data afterward.
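
As an illustration of the split described above, here is a minimal sketch that holds out a fraction of the parent rows (for example, stores) together with all of their child records (for example, sales). It is not taken from the experiments repo. The function name `split_by_parent`, its parameters, and the toy column names in the usage note are hypothetical.

```python
import pandas as pd


def split_by_parent(parent, child, join_on, train_frac=0.8, seed=0):
    """Split a parent/child table pair so each parent's children stay with it."""
    # Sample the parent keys that go into the training split.
    train_ids = (
        parent[join_on]
        .drop_duplicates()
        .sample(frac=train_frac, random_state=seed)
    )

    parent_in_train = parent[join_on].isin(train_ids)
    child_in_train = child[join_on].isin(train_ids)

    train = (parent[parent_in_train], child[child_in_train])
    test = (parent[~parent_in_train], child[~child_in_train])
    return train, test
```

For example, `split_by_parent(stores, sales, "store_id")` would return `(train_stores, train_sales), (test_stores, test_sales)`, with no store appearing on both sides. Splitting on the parent key, not on individual child rows, avoids leaking records of a held-out parent into the training data.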

avsolatorio commented 1 year ago

> I greatly appreciate your assistance. I'm currently continuing my efforts, and I wanted to inform you that the child sampling process does not support multiple GPUs.

@zechchair, this is expected since REaLTabFormer uses an autoregressive decoder model (GPT2). This means that the sampling has to be iterative. The workaround is to run sampling on each of your GPUs independently, one process per GPU. You can use the CUDA_VISIBLE_DEVICES environment variable to specify which GPU each process will use.
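
The workaround can be sketched as a small launcher that starts one Python process per GPU, each with its own `CUDA_VISIBLE_DEVICES` value. This is only an illustration. The helper name `launch_per_gpu` and the worker script mentioned in the usage note are hypothetical, and how the parent rows are divided among the workers is left to the worker script.

```python
import os
import subprocess
import sys


def launch_per_gpu(commands, **popen_kwargs):
    """Start one Python process per GPU.

    `commands` maps a GPU id to the argument list passed to the Python
    interpreter for that worker. Each worker sees only its own GPU.
    """
    procs = []
    for gpu_id, args in commands.items():
        # Copy the current environment and restrict the visible devices.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        procs.append(
            subprocess.Popen([sys.executable, *args], env=env, **popen_kwargs)
        )
    return procs
```

A possible usage, assuming a hypothetical `sample_children.py` script that samples children for one shard of the parent table: `procs = launch_per_gpu({0: ["sample_children.py", "--shard", "0"], 1: ["sample_children.py", "--shard", "1"]})`, followed by `for p in procs: p.wait()`. Inside each worker, the selected GPU appears as device 0.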