yupenghe / REPTILE

Predicting regulatory DNA elements based on epigenomic signatures
MIT License

Variables in the training data missing in newdata #1

Open karamveerverma37 opened 1 month ago

karamveerverma37 commented 1 month ago

Hi, I am trying to run REPTILE with the pre-trained model mm_model_coreMarks.reptile using methylation data. Could there be an issue with how I generated the bigWig files? I have methylation base-call BED files containing chromosome, start, end, and methylation rate. I converted them into a bigWig file using the following commands:

awk '{printf "%s\t%d\t%d\t%2.3f\n" , $1,$2,$3,$4}' myBed.bed > myFile.bedgraph
sort -k1,1 -k2,2n myFile.bedgraph > myFile_sorted.bedgraph
bedGraphToBigWig myFile_sorted.bedgraph myChrom.sizes myBigWig.bw
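For readers who want to check the formatting and sorting step without awk and sort, here is a minimal Python sketch of the same idea. The function name `bed_to_bedgraph_lines` is made up for illustration and is not part of REPTILE or the UCSC tools. It assumes four-column input (chromosome, start, end, value) and sorts chromosome names by plain code-point order, which for ASCII names should match what `sort -k1,1` does under the C locale.

```python
def bed_to_bedgraph_lines(records):
    """Format (chrom, start, end, value) records as sorted bedGraph lines.

    Hypothetical helper: mirrors the awk printf plus `sort -k1,1 -k2,2n`
    steps from the post, nothing more.
    """
    rows = sorted(
        ((chrom, int(start), int(end), float(value))
         for chrom, start, end, value in records),
        # Chromosome name compared as text, start position as a number.
        key=lambda r: (r[0], r[1]),
    )
    # Tab-separated, value printed with three decimals like "%2.3f" in awk.
    return ["%s\t%d\t%d\t%.3f" % r for r in rows]
```

The returned lines can be written to a file and passed to bedGraphToBigWig exactly as in the commands above.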

I tried the Meth epimark alone as well as all four marks (Meth, H3K4me1, H3K4me3, H3K27ac) given for the mm_model_coreMarks.reptile model. The output of REPTILE_preprocess.py is the preprocessed.region_with_epimark.tsv file, which looks like this:

chr   start    end      id         Meth_E4  H3K4me1_E4  H3K4me3_E4  H3K27ac_E4
chr1  0        2000     bin_0      0.0      0.0         0.0         0.0
chr1  100      2100     bin_1      0.0      0.0         0.0         0.0
chr1  200      2200     bin_2      0.0      0.0         0.0         0.0
chr1  300      2300     bin_3      0.0      0.0         0.0         0.0
chr1  400      2400     bin_4      0.0      0.0         0.0         0.0
chr1  500      2500     bin_5      0.0      0.0         0.0         0.0
chr1  600      2600     bin_6      0.0      0.0         0.0         0.0
chr1  700      2700     bin_7      0.0      0.0         0.0         0.0
chr1  800      2800     bin_8      0.0      0.0         0.0         0.0
chr1  900      2900     bin_9      0.0      0.0         0.0         0.0
chr1  1000     3000     bin_10     0.0      0.0         0.0         0.0
...
chr1  3211200  3213200  bin_32112  5.0      5.0         5.0         5.0
chr1  3211300  3213300  bin_32113  5.0      5.0         5.0         5.0
chr1  3211400  3213400  bin_32114  5.0      5.0         5.0         5.0
chr1  3211500  3213500  bin_32115  4.0      4.0         4.0         4.0
chr1  3211600  3213600  bin_32116  3.3      3.3         3.3         3.3
chr1  3211700  3213700  bin_32117  2.54545  2.54545     2.54545     2.54545
chr1  3211800  3213800  bin_32118  2.69231  2.69231     2.69231     2.69231
chr1  3211900  3213900  bin_32119  3.0      3.0         3.0         3.0
chr1  3212000  3214000  bin_32120  2.85714  2.85714     2.85714     2.85714
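In the rows pasted above, all four mark columns carry the same value in every row. A quick way to see whether that holds for the whole file is sketched below. It assumes the file is tab-separated and that the first four columns are chr, start, end and id, as in the paste; the function name `identical_mark_columns` is hypothetical.

```python
import csv
import io


def identical_mark_columns(tsv_text, first_mark_col=4):
    """Return pairs of mark columns whose values match in every row.

    Assumption: columns before `first_mark_col` are chr/start/end/id.
    Values are compared as text, exactly as they appear in the file.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    marks = header[first_mark_col:]
    # Start by assuming every pair is identical, then disprove row by row.
    same = {(a, b): True for i, a in enumerate(marks) for b in marks[i + 1:]}
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = dict(zip(marks, row[first_mark_col:]))
        for a, b in same:
            if values[a] != values[b]:
                same[(a, b)] = False
    return [pair for pair, flag in same.items() if flag]
```

If every pair comes back as identical, one thing worth checking is whether each mark in the data info file points at its own bigWig file; that is only a guess, since the data info file is not shown in this thread.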

Now when I run the compute score command:

REPTILE_compute_score.R -i data_info_file2 -m mm_model_coreMarks.reptile -a tmp/mm39_w2kb_s100bp_preprocessed.region_with_epimark.tsv -s E4 -o tmp/E4__compute_pred

I get the following error:

Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
  variables in the training data missing in newdata
Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
Execution halted

Is there a trained model available that uses only DNA methylation data to predict enhancers? Note: I tried both genome-wide and region-specific prediction.
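As far as the message itself goes, randomForest raises "variables in the training data missing in newdata" when a predictor the model was trained on cannot be found among the columns passed to predict. Which feature names mm_model_coreMarks.reptile expects is not stated in this thread, so the sketch below takes them as an input. The helper `missing_features` and the feature names in the usage note are placeholders, not REPTILE's real names.

```python
def missing_features(expected, header_line, sep="\t"):
    """List expected feature names that are absent from a TSV header line.

    `expected` has to come from the model itself; this helper only
    compares names and knows nothing about REPTILE's internals.
    """
    present = set(header_line.rstrip("\n").split(sep))
    return [name for name in expected if name not in present]
```

For example, `missing_features(["feature_a", "feature_b"], "chr\tstart\tend\tid\tfeature_a\n")` reports `feature_b` as missing.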

yupenghe commented 1 month ago

Do you mind sharing some dummy input file for me to reproduce the error?

Are there any specific trained model available for only DNA methylation data to predict enhancers.

I found that methylation alone did not generate good enough predictions, so I did not pursue this further.

karamveerverma37 commented 1 month ago

Please find the attached files used as input. I generated the bigWig file from the BED file (Methylation_Calls.Pseudobulk.E4.5-5.5.bed) as described above, and I am using the mm39 genome for the query regions. Preprocessing was done using:

REPTILE_preprocess.py data_info_file mm39_w2kb_s200bp.bed mm39_w2kb_s200bp_preprocessed -g

input: data_info_file, mm39_w2kb_s200bp.bed (query region file)
output: mm39_w2kb_s200bp_preprocessed_regions_with_epimark.tsv
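The query region file name suggests 2 kb windows with a 200 bp step, and the bins pasted earlier are 2000 bp wide with ids of the form bin_N. How the file was actually built is not described here, so the following is only a sketch of one way to generate such windows. The function name `sliding_windows` is hypothetical, and dropping the final partial window is an assumption.

```python
def sliding_windows(chrom, chrom_size, width=2000, step=200):
    """Generate (chrom, start, end, id) windows along one chromosome.

    Assumptions: 0-based half-open coordinates as in BED, ids numbered
    from bin_0, and windows that would run past the chromosome end are
    dropped rather than truncated.
    """
    windows = []
    start = 0
    index = 0
    while start + width <= chrom_size:
        windows.append((chrom, start, start + width, "bin_%d" % index))
        start += step
        index += 1
    return windows
```

Chromosome sizes would come from the same chrom.sizes file that was given to bedGraphToBigWig, so that the windows and the bigWig signal refer to the same assembly.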

compute_score gives an error:

REPTILE_compute_score.R -i data_info_file -m tmp/REPTILE_model.reptile -a mm39_w2kb_s200bp_preprocessed_regions_with_epimark -s E4 -o E4_pred

Error in predict.randomForest(reptile_classifier, epimark, type = "prob") :
  variables in the training data missing in newdata
Calls: reptile_predict_genome_wide ... reptile_predict_one_mode -> predict -> predict.randomForest
Execution halted

Files: issue.zip

karamveerverma37 commented 1 month ago

Dear Dr. He, I have shared the dummy input files in the GitHub issue. Please find the files attached here as well. The initial file contains methylation rates for each base pair. Is this issue due to the bigWig files being generated from .bed files that have one base pair per line, or due to the fact that I am using the mm39 mouse genome?

Sincerely, Karamveer


-- Sincerely, Karamveer Post-doctoral Scholar Department of Pediatrics Penn State College of Medicine Penn State University

yupenghe commented 1 month ago

Thanks. I will take a look. Using mm39 will be an issue, but it probably isn't the cause of the error you saw. Unfortunately, all models were trained and tested on data processed against mm10. If I am able to fix the error and you still want to run REPTILE on your data, I would suggest reprocessing your data on mm10.

karamveerverma37 commented 1 month ago

Thanks for the suggestion. I will try it with mm10 data as well, but my data is for mm39 and I am doing some other complementary analysis on mm39, so I cannot convert it to mm10.

yupenghe commented 1 month ago

I see. REPTILE probably won't generate what you want. I would recommend using peak calls from H3K27ac, or the overlapping peaks of H3K27ac and H3K4me1, as predicted enhancers.
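The overlap suggestion can be sketched in a few lines of Python. This is an illustration only, not something the REPTILE author provided: the function name `overlapping_peaks` is made up, peaks are assumed to be (chrom, start, end) tuples in half-open BED coordinates, and the pairwise loop is fine for small inputs but slow for genome-wide peak sets, where a dedicated tool such as bedtools intersect would be the usual choice.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return the intersections between two peak lists.

    Each peak is (chrom, start, end), half-open. Peaks that merely touch
    (end of one equals start of the other) do not count as overlapping.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    result = []
    for chrom, start, end in peaks_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            lo, hi = max(start, b_start), min(end, b_end)
            if lo < hi:  # non-empty intersection
                result.append((chrom, lo, hi))
    return sorted(result)
```

Here `peaks_a` would hold the H3K27ac peaks and `peaks_b` the H3K4me1 peaks, both called on the same assembly.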

karamveerverma37 commented 1 month ago

Hi, can you share the full dataset used for training? Only a subset of the dataset (chr19) is available in the example data.