worldbank / REaLTabFormer

A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
https://worldbank.github.io/REaLTabFormer/
MIT License

OSError: rtf_checkpoints/not-best-disc-model does not appear to have a file named config.json. #66

AhmadKajjan-QU opened this issue 5 months ago

AhmadKajjan-QU commented 5 months ago
# pip install realtabformer
import pandas as pd
from realtabformer import REaLTabFormer

df = pd.read_csv("./data/02-14-2018 - test.csv")

# NOTE: Remove any unique identifiers in the
# data that you don't want to be modeled.

# Non-relational or parent table.
rtf_model = REaLTabFormer(
    model_type="tabular",
    epochs=1,
    gradient_accumulation_steps=4,
    logging_steps=100)

print("fitting the model")
# Fit the model on the dataset.
# Additional parameters can be
# passed to the `.fit` method.
rtf_model.fit(df, num_bootstrap=1)

print("generating synthetic data")
# Generate synthetic data with the same
# number of observations as the real dataset.
samples = rtf_model.sample(n_samples=len(df))

print("saving synthetic data")
# Save the generated synthetic data to a CSV file
samples.to_csv("synthetic_data.csv", index=False)

print("saving the model")
# Save the model to the current directory.
# A new directory `rtf_model/` will be created.
# In it, a directory with the model's
# experiment id `idXXXX` will also be created
# where the artefacts of the model will be stored.
rtf_model.save("rtf_model/")

print("loading the model")
# Load the saved model. The directory to the
# experiment must be provided.
rtf_model2 = REaLTabFormer.load_from_dir(
    path="rtf_model/IDX")

The error I got:

PS C:\Users\Qatar University\Desktop\Akef> python .\start.py
fitting the model
UserWarning: [...] rate (0.757) in the data. This will not give a reliable early stopping condition. Consider using qt_max="compute" argument.
Computing the sensitivity threshold...
C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\realtabformer\realtabformer.py:597: UserWarning: qt_interval adjusted from 100 to 16...
Using parallel computation!!!
Bootstrap round: 100%| 1/1 [00:00<?, ?it/s]
Sensitivity threshold summary:
count    1.000000
mean     0.433998
std           NaN
min      0.433998
25%      0.433998
50%      0.433998
75%      0.433998
max      0.433998
dtype: float64
Sensitivity threshold: 0.43399772209567195
qt_max: 0.05
Map: 100%| 2000/2000 [00:10<00:00, 194.27 examples/s]
C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\accelerate\accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']).
Please pass an accelerate.DataLoaderConfiguration instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
{'train_runtime': 14.4486, 'train_samples_per_second': 138.422, 'train_steps_per_second': 4.291, 'train_loss': 1.2673711469096522, 'epoch': 0.99}
100%| 62/62 [00:14<00:00, 4.29it/s]
1024it [08:01, 2.13it/s]
Generated 0 invalid samples out of total 1024 samples generated. Sampling efficiency is: 100.0000%
Critic round: 5, sensitivity_threshold: 0.43399772209567195, val_sensitivity: -0.020717878427690663, val_sensitivities: [-0.01820568252007412, -0.022510889856876166, -0.022523219814241484, -0.022510889856876166, -0.02251552795031056, -0.020073891625615764, -0.022527812113720645, -0.022539975399753998, -0.01519607843137255, -0.021314496314496313, -0.016960420531849103, -0.022512437810945272, -0.016347342398022248, -0.022513983840894966, -0.02251552795031056]
C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\realtabformer\realtabformer.py:834: UserWarning: No best model was saved. Loading the closest model to the sensitivity_threshold.
Traceback (most recent call last):
  File "C:\Users\Qatar University\Desktop\Akef\start.py", line 21, in <module>
    rtf_model.fit(df, num_bootstrap=1)
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\realtabformer\realtabformer.py", line 458, in fit
    trainer = self._train_with_sensitivity(
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\realtabformer\realtabformer.py", line 839, in _train_with_sensitivity
    self.model = self.model.from_pretrained(loaded_model_path.as_posix())
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\modeling_utils.py", line 3006, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\configuration_utils.py", line 602, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\configuration_utils.py", line 631, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\configuration_utils.py", line 686, in _get_config_dict
    resolved_config_file = cached_file(
  File "C:\Users\Qatar University\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\hub.py", line 369, in cached_file
    raise EnvironmentError(
OSError: rtf_checkpoints/not-best-disc-model does not appear to have a file named config.json. Checkout 'https://huggingface.co/rtf_checkpoints/not-best-disc-model/tree/main' for available files.
PS C:\Users\Qatar University\Desktop\Akef>

Some rows of my data:

Pkt Len Std,Flow Byts/s,Flow Pkts/s,Flow IAT Mean,Flow IAT Std,Flow IAT Max,Flow IAT Min,Fwd IAT Tot,Fwd IAT Mean,Fwd IAT Std,Fwd IAT Max,Fwd IAT Min,Bwd IAT Tot,Bwd IAT Mean,Bwd IAT Std,Bwd IAT Max,Bwd IAT Min,Fwd PSH Flags,Bwd PSH Flags,Fwd URG Flags,Bwd URG Flags,Fwd Header Len,Bwd Header Len,Fwd Pkts/s,Bwd Pkts/s,Pkt Len Min,Pkt Len Max,Pkt Len Mean,Pkt Len Std,Pkt Len Var,FIN Flag Cnt,SYN Flag Cnt,RST Flag Cnt,PSH Flag Cnt,ACK Flag Cnt,URG Flag Cnt,CWE Flag Count,ECE Flag Cnt,Down/Up Ratio,Pkt Size Avg,Fwd Seg Size Avg,Bwd Seg Size Avg,Fwd Byts/b Avg,Fwd Pkts/b Avg,Fwd Blk Rate Avg,Bwd Byts/b Avg,Bwd Pkts/b Avg,Bwd Blk Rate Avg,Subflow Fwd Pkts,Subflow Fwd Byts,Subflow Bwd Pkts,Subflow Bwd Byts,Init Fwd Win Byts,Init Bwd Win Byts,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,0,14/02/2018 08:31:01,112641719,3,0,0,0,0,0,0,0,0,0,0,0,0,0.026633116,56320859.5,139.3000359,56320958,56320761,112641719,56320859.5,139.3000359,56320958,56320761,0,0,0,0,0,0,0,0,0,0,0,0.026633116,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,-1,-1,0,0,0,0,0,0,56320859.5,139.3000359,56320958,56320761,Benign
0,0,14/02/2018 08:33:50,112641466,3,0,0,0,0,0,0,0,0,0,0,0,0,0.026633176,56320733,114.5512986,56320814,56320652,112641466,56320733,114.5512986,56320814,56320652,0,0,0,0,0,0,0,0,0,0,0,0.026633176,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,-1,-1,0,0,0,0,0,0,56320733,114.5512986,56320814,56320652,Benign
0,0,14/02/2018 08:36:39,112638623,3,0,0,0,0,0,0,0,0,0,0,0,0,0.026633848,56319311.5,301.9345956,56319525,56319098,112638623,56319311.5,301.9345956,56319525,56319098,0,0,0,0,0,0,0,0,0,0,0,0.026633848,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,-1,-1,0,0,0,0,0,0,56319311.5,301.9345956,56319525,56319098,Benign

limhasic commented 5 months ago

I'm hitting the same error.

limhasic commented 5 months ago

Dear friend,

I found something: it works with an earlier transformers release.

!pip install transformers==4.24.0

A matching CUDA/PyTorch environment is also essential to run this.

With a mismatched environment, we hit errors like:

  1. RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
  2. Outdated GPU driver
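As the NCCL message itself suggests, turning on NCCL's debug logging before the distributed run starts reveals the underlying system error. A small sketch of generic PyTorch/NCCL debugging practice (not specific to REaLTabFormer):

```python
import os

# Enable NCCL debug output before torch/accelerate initialise the
# process group; setdefault keeps any value already exported in the shell.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Optionally narrow the logging to the initialisation subsystem:
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT")
```

These must be set before the process group is created (or exported in the shell before launching the script) for the logging to take effect.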

But that wasn't the end of it:

we then hit an out-of-memory error.

I assumed I was genuinely short on memory, so I changed environments three times and ended up on near-supercomputer hardware, yet it still reported out of memory. The machine was more than capable, so why was this happening?

The answer was in

" output_max_length=None, "

Leaving it at None effectively allows unlimited tokens, so even 80 GB was blown. On top of that, multi-GPU settings had to be handled separately, which drove me crazy. A reasonable even value like 1024 or 2048 seems to be the rule for tokens; in any case, it works with output_max_length=1024 on an A100 80GB.
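In other words, the fix is to bound the generation length rather than let None mean unlimited. A tiny illustrative helper (hypothetical, not part of REaLTabFormer's API) that captures this rule of thumb:

```python
def bounded_output_length(requested=None, cap=1024):
    """Pick a bounded max token length for generation.

    None (the library default, per the comment above) means
    "unlimited", which is what exhausted even an 80GB A100;
    capping at an even value such as 1024 or 2048 keeps the
    memory footprint predictable.
    """
    if requested is None:
        return cap
    return min(requested, cap)
```

The result would then be passed as the `output_max_length` argument when constructing the model, e.g. `REaLTabFormer(model_type="tabular", output_max_length=bounded_output_length())`.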

Now I'm going to test this with the HMA model and generate the Airbnb data.

Have a nice day, friend.