zjuwss / gnnwr

A PyTorch implementation of the Geographically Neural Network Weighted Regression (GNNWR)
GNU General Public License v3.0

`init_predict_dataset` doesn't work if loading training dataset from file #4

Open YikChingTsui opened 4 months ago

YikChingTsui commented 4 months ago

Background

After training the model, I want to save the training dataset to the filesystem, and then run a separate script that loads the training dataset and the model to predict values. If everything can be saved after training, it should also be possible to load it all back in another script for prediction.

(The example in this repo, and the Estimating PM2.5 Concentrations example, put everything in one file. They pass the original training dataset to datasets.init_predict_dataset, i.e., the one still in memory rather than one loaded from the filesystem.)

Problem 1

When I load the training dataset from file and pass it to init_predict_dataset, it fails with an error saying that the reference attribute is missing on train_dataset_load. The code looks roughly like this:

# training.py:
from gnnwr import datasets

train_dataset, val_dataset, test_dataset = datasets.init_dataset(
    data=data
    # ...
)
# ...
train_dataset.save("./train_dataset")

# possibly in another file, such as predict.py:
import pandas as pd
from gnnwr import datasets, models

train_dataset_load = datasets.load_dataset("./train_dataset/")
# ...

gnnwr_load = models.GNNWR(
    train_dataset=train_dataset_load,
    # ...
)

gnnwr_load.load_model("./gnnwr_models/GNNWR_PM25.pkl")

pred_data = pd.DataFrame(
    # ...
)

pred_dataset = datasets.init_predict_dataset(  # Fails here
    data=pred_data,
    train_dataset=train_dataset_load,  # the training dataset we just loaded from file;
    # passing the in-memory `train_dataset` defined above works, but only if all code is in one file
    # ...
)

# Never gets to this place
pred_res = gnnwr_load.predict(pred_dataset)

Fix

Adding this code after saving the training dataset would save the reference to a file:

# ...
# remove the ./train_dataset folder if it already exists
# after `train_dataset.save("./train_dataset")`:

train_dataset.reference.to_csv('./train_dataset/reference.csv', index=False)

Then, in predict.py, after loading the training dataset, add the reference back to it:

# after `train_dataset_load = datasets.load_dataset("./train_dataset/")`
reference = pd.read_csv('./train_dataset/reference.csv')
train_dataset_load.reference = reference

Problem 2 and fix

After that fix, init_predict_dataset still fails at x = (x - min) / (max - min) (link): the subtraction operator (-) is not supported between two lists.

The cause is that train_dataset.distances_scale_param['min'] and train_dataset.distances_scale_param['max'] were originally np.arrays, but they were converted into Python lists when saved. When the training dataset is loaded, they remain plain lists.
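
For reference, here is a minimal standalone reproduction of that failure mode (the numbers are made up; only the types matter):

import numpy as np

# what distances_scale_param['min'] / ['max'] look like after the save/load round trip
min_list = [0.0, 1.0]
max_list = [10.0, 21.0]

try:
    max_list - min_list  # list - list is not defined in Python
except TypeError as err:
    print(err)  # unsupported operand type(s) for -: 'list' and 'list'

print(np.array(max_list) - np.array(min_list))  # [10. 20.] -- works once both are ndarrays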

The solution is to convert the lists to np.array after loading the training dataset in predict.py:

# after `train_dataset_load = datasets.load_dataset("./train_dataset/")`
import numpy as np

train_dataset_load.distances_scale_param['min'] = np.array(
    train_dataset_load.distances_scale_param['min']
)
train_dataset_load.distances_scale_param['max'] = np.array(
    train_dataset_load.distances_scale_param['max']
)

# Works now
pred_dataset = datasets.init_predict_dataset(
    data=pred_data,
    train_dataset=train_dataset_load,
    # ...
)

Library fix

To move this fix into the library, the save() method should be modified to save the reference as well, and the read() method should read that file back and set the attribute. But I'm not sure whether this would break something else. The fixes above are user-side workarounds that patch the training dataset only.
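
Until something like that lands in the library, both workarounds can be bundled into a pair of small user-side helpers. This is only a sketch; save_with_reference and load_with_reference are hypothetical names, not part of the gnnwr API:

import numpy as np
import pandas as pd
from gnnwr import datasets


def save_with_reference(train_dataset, path):
    """Save a gnnwr training dataset plus the extra state needed for prediction."""
    train_dataset.save(path)  # the target folder must not exist beforehand
    # persist the reference DataFrame, which save() currently drops
    train_dataset.reference.to_csv(f"{path}/reference.csv", index=False)


def load_with_reference(path):
    """Load a gnnwr training dataset and restore what init_predict_dataset needs."""
    dataset = datasets.load_dataset(path)
    # restore the reference attribute
    dataset.reference = pd.read_csv(f"{path}/reference.csv")
    # restore the distance scale parameters as ndarrays (they come back as plain lists)
    dataset.distances_scale_param['min'] = np.array(dataset.distances_scale_param['min'])
    dataset.distances_scale_param['max'] = np.array(dataset.distances_scale_param['max'])
    return dataset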

This part should also be changed to convert the lists back to np.array when the dataset is loaded; that shouldn't affect anything else:

https://github.com/zjuwss/gnnwr/blob/2a6ad0f034ae799367b3594e0adb601fae98ddbd/src/gnnwr/datasets.py#L268
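
I haven't written a patch, but conceptually the change in the loading code would be something like this (a sketch; the actual variable names in datasets.py will differ):

import numpy as np

# after distances_scale_param has been read back from disk, coerce the
# min/max entries from JSON lists back to ndarrays
for key in ('min', 'max'):
    value = dataset.distances_scale_param.get(key)
    if isinstance(value, list):
        dataset.distances_scale_param[key] = np.asarray(value)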

Y-nuclear commented 4 months ago

Thank you for your issue, we will fix the error soon.

YikChingTsui commented 4 months ago

Sorry, fix 1 should actually come after saving the train dataset, because the save method requires that the train_dataset folder does not exist beforehand. train_dataset.reference should also be saved directly so that its shape matches:

# ...
# remove the ./train_dataset folder if it already exists (see the sketch below)
train_dataset.save("./train_dataset")
train_dataset.reference.to_csv('./train_dataset/reference.csv', index=False)
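
A minimal sketch of that "remove the folder if it exists, then save" step, using only the standard library (paths as in the example above):

import shutil
from pathlib import Path

out_dir = Path("./train_dataset")
if out_dir.exists():
    # save() requires that the target folder does not exist yet
    shutil.rmtree(out_dir)

train_dataset.save(str(out_dir))
train_dataset.reference.to_csv(out_dir / "reference.csv", index=False)
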
Mitchell-rmb commented 3 months ago

Have you solved this problem? I also found that reloading a saved model does not work; it reports that the training set is not available.

yorktownting commented 3 months ago

> Have you solved this problem? I also found that reloading a saved model does not work; it reports that the training set is not available.

Hello, under what circumstances does reloading the model report an error?

Here's a demo (in replication1_load.py and replication2_load.py) that reloads the model without error; maybe it will help.

Mitchell-rmb commented 3 months ago

> Have you solved this problem? I also found that reloading a saved model does not work; it reports that the training set is not available.

> Hello, under what circumstances does reloading the model report an error?
>
> Here's a demo (in replication1_load.py and replication2_load.py) that reloads the model without error; maybe it will help.

gnnwr.reg_result('./ceshi/textresult/GNNWR_PM25_Result.csv')

train_dataset.save('./ceshi/textresult/gnnwr_datasets/train_dataset')
val_dataset.save('./ceshi/textresult/gnnwr_datasets/val_dataset')
test_dataset.save('./ceshi/textresult/gnnwr_datasets/test_dataset')

train_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/train_dataset/')
val_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/val_dataset/')
test_dataset_load = datasets.load_dataset('./demo_result/gnnwr_datasets/test_dataset/')

pred_data = pd.read_csv(u'C:/Users/lenovo/Desktop/gnnwr-0.1.5/data/pm25_predict_data.csv')

gnnwr_load = models.GNNWR(
    train_dataset = train_dataset_load,
    valid_dataset = val_dataset_load,
    test_dataset = test_dataset_load,
    dense_layers = [512, 256, 64, 128],
    start_lr = 0.2,
    optimizer = "Adadelta",
    activate_func = nn.PReLU(init=0.1),
    model_name = " ceshi_GNNWR_PM25",
    model_save_path = "./ceshi/textresult",
    log_path = "./ceshi/textresult/gnnwr_logs",
    write_path = "./ceshi/textresult/gnnwr_runs"
)

gnnwr_load.load_model('./ceshi/textresult/ ceshi_GNNWR_PM25.pkl')

Then init_predict_dataset doesn't work when the training dataset is loaded from file:

pred_dataset = datasets.init_predict_dataset(
    data=pred_data,
    train_dataset=train_dataset_load,
    x_column=['dem', 'w10', 'd10', 't2m', 'aod_sat', 'tp'],
    spatial_column=['经度', '纬度'],  # longitude, latitude
)

Mitchell-rmb commented 3 months ago

> Have you solved this problem? I also found that reloading a saved model does not work; it reports that the training set is not available.

> Hello, under what circumstances does reloading the model report an error?
>
> Here's a demo (in replication1_load.py and replication2_load.py) that reloads the model without error; maybe it will help.

Thank you, I will try your demo in my spare time.