Open night-chen opened 11 hours ago
Hi Yuchen,
Thanks for your interest in our project. With EPACT's default settings, training runs five-fold cross-validation, so you probably need to prepare, for each fold, a .fasta file containing the training epitopes and a .csv file containing the validation (positive and negative) TCR-pMHC pairs, both placed in the kfold_data directory. Please refer to the original data on Zenodo for the expected layout.
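The per-fold layout described above could be produced along these lines. This is only a minimal sketch, not EPACT's own preprocessing: the function name `write_kfold_files`, the output file names, and the column names (`epitope`, `cdr3_alpha`, `cdr3_beta`, `label`) are all assumptions made for illustration, so check them against the Zenodo data before using anything like this.

```python
import csv
import random
from pathlib import Path

# Hypothetical column names; the real ones should be taken from the Zenodo files.
FIELDS = ["epitope", "cdr3_alpha", "cdr3_beta", "label"]


def write_kfold_files(pairs, out_dir, k=5, seed=0):
    """Split TCR-pMHC pairs into k folds and write one FASTA and one CSV per fold.

    `pairs` is a list of dicts keyed by FIELDS, with label 1 for binders
    and 0 for non-binders.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Shuffle once with a fixed seed so the split is reproducible.
    order = list(range(len(pairs)))
    random.Random(seed).shuffle(order)

    for fold in range(k):
        # Every k-th shuffled pair is held out as validation data for this fold.
        val_idx = set(order[fold::k])
        val_rows = [pairs[i] for i in sorted(val_idx)]
        train_rows = [pairs[i] for i in order if i not in val_idx]

        # Training epitopes: unique epitopes of positive pairs outside the fold.
        train_epitopes = sorted({r["epitope"] for r in train_rows if r["label"] == 1})

        # Hypothetical file names; rename them to whatever the training script reads.
        with open(out_dir / f"train_epitopes_fold{fold}.fasta", "w") as fh:
            for n, epitope in enumerate(train_epitopes):
                fh.write(f">epitope_{n}\n{epitope}\n")

        with open(out_dir / f"val_fold{fold}.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(val_rows)
```

Calling `write_kfold_files(pairs, "kfold_data")` would then leave five FASTA/CSV pairs in that directory. The split here is a plain random split over pairs; if validation epitopes must be unseen during training, the split would need to be made over epitopes instead.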
If you want to train the model on your own data without cross-validation, a small modification to the training code along these lines may help:
# Configure training positive samples and validation data
train_pos_dataset = PairedTCRpMHCDataset(data_path = config.data.train_pos_data_path, ...)
val_dataset = PairedTCRpMHCDataset(data_path = <your validation data path>, ...)
# Configure training and validation data loaders
batch_converter = PairedCDR123pMHCBatchConverter(max_mhc_len = config.model.mhc_seq_len, sample_cdr3 = False)
train_loader = DataLoader(
    dataset = train_pos_dataset, batch_size = config.training.train_batch_size,
    num_workers = config.training.num_workers, shuffle = True,
    collate_fn = batch_converter
)
val_loader = DataLoader(
    dataset = val_dataset, batch_size = config.training.test_batch_size,
    num_workers = config.training.num_workers, shuffle = False,
    collate_fn = batch_converter
)
Trainer = PairedCDR123pMHCCoembeddingTrainer(config, log_dir=config.training.log_dir)
Trainer.fit(train_loader, val_loader)
Thank you so much for the timely response! Yes, I have figured this out, but I then ran into another problem. I added my data and updated 'train_pmhc_path', 'train_pos_data_path', and 'test_data_path' in the config file. It seems that 'train_tcr_feat_path' also needs to be updated, since I am using my own data. However, the original path 'data/binding/Paired-TCR/train_paired_cdr3_seq.pt' seems to point to a PyTorch tensor file, and I am not sure how to generate this feature file for my own CDR3 data. Thank you very much!
Sorry, I forgot that this tensor file has to be prepared before training the model from scratch. You can try the following code to generate the file that train_tcr_feat_path points to.
import pandas as pd
import torch

# Tab-separated table: column 0 = CDR3 alpha, column 1 = CDR3 beta,
# column 2 = ';'-separated pMHC indices
cdr3_data = pd.read_table('Paired-TCR/TCR-train-data.tsv')
res = []
for i in range(len(cdr3_data)):
    cdr_alpha_seq, cdr_beta_seq, pmhc_idx = cdr3_data.iloc[i, 0], cdr3_data.iloc[i, 1], cdr3_data.iloc[i, 2]
    alpha_seq_len = len(cdr_alpha_seq)
    beta_seq_len = len(cdr_beta_seq)
    # str() guards against pandas parsing a column of single indices as integers
    pmhc_idx = torch.tensor([int(idx) for idx in str(pmhc_idx).split(";")])
    res.append({
        'cdr3.alpha': cdr_alpha_seq,
        'cdr3.beta': cdr_beta_seq,
        'pmhc': pmhc_idx,
        'len': [alpha_seq_len, beta_seq_len]
    })
# Save the per-TCR records; point train_tcr_feat_path at this file
torch.save(res, 'Paired-TCR/train_paired_cdr3_seq.pt')
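Before converting a custom table with the loop above, it may be worth checking that every row has the shape the loop expects. The following is a small standard-library sketch under stated assumptions: the helper names (`parse_pmhc_indices`, `check_tcr_rows`, `check_tcr_tsv`) are hypothetical and not part of EPACT, and the restriction to the 20 standard uppercase amino-acid letters is an assumption about the data rather than something the thread specifies.

```python
import csv

# Assumption: CDR3 sequences use only the 20 standard amino-acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_pmhc_indices(field):
    """Turn a field such as '3;7' into [3, 7], mirroring the split(';') in the loop."""
    return [int(idx) for idx in str(field).split(";")]


def check_tcr_rows(rows):
    """Return (row_number, reason) pairs for rows that look malformed.

    Each row is a sequence (alpha CDR3, beta CDR3, pMHC indices), matching
    the column order read via iloc[i, 0], iloc[i, 1], iloc[i, 2].
    """
    problems = []
    for n, row in enumerate(rows):
        if len(row) < 3:
            problems.append((n, "fewer than three columns"))
            continue
        alpha, beta, pmhc = row[0], row[1], row[2]
        for name, seq in (("alpha", alpha), ("beta", beta)):
            if not seq or set(seq) - AMINO_ACIDS:
                problems.append((n, f"unexpected characters in {name} CDR3"))
        try:
            parse_pmhc_indices(pmhc)
        except ValueError:
            problems.append((n, "pMHC indices are not ';'-separated integers"))
    return problems


def check_tcr_tsv(path):
    """Read a tab-separated file with a header line and check every data row."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader, None)  # skip the header line, as pd.read_table does by default
        return check_tcr_rows(list(reader))
```

An empty list from `check_tcr_tsv('Paired-TCR/TCR-train-data.tsv')` would suggest the table is at least syntactically ready for the conversion script; row numbers in the result count data rows from zero, excluding the header.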
Hi authors, I find this work useful and meaningful. I am trying to fine-tune the model on my own curated pMHC-TCR pairs, but I ran into an issue with kfold_data in the config. I replaced train_pos_data_path and test_data_path in the config with my own data paths, but I do not know how to handle kfold_data_path. If I ignore it and run the code, it fails with the following error message:
I would very much appreciate it if you could take the time to respond to this issue. Thank you so much!