weir12 / DENA

Deep learning model used to detect RNA m6a with read level based on the Nanopore direct RNA data.
MIT License
22 stars 5 forks source link

Fail with step3 LSTM_extract.py #14

Closed JBerthelier closed 2 years ago

JBerthelier commented 2 years ago

Dear DENA maker(s), I have trouble with the third part of the pipeline (LSTM_extract.py step). After running I got for all process, empty temporary files (0_tmp to 109_tmp) but no final output. My command is:

python3 LSTM_extract.py predict --fast5 /fast5_workspace/ --corr_grp RawGenomeCorrected_001 --bam /basecalls.bam --sites candidate_predict_pos.txt --label test --windows 2 2 --processes 110

Do you have an idea of my issue?

Thanks for the help

The log says
[17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). [17:49:29] Parsing Tombo index file(s). ... (lot of parsing)

processes_99: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60160/60160 [00:01<00:00, 35667.92it/s]

... (lot of bar)

None None None None None None None None

... (lot of None)

JBerthelier commented 2 years ago

Dear DENA maker(s), I noticed a mistake and now I got a diffrent output, but still the tmp file are empty.

Could it be because I do not use enought reads ? I just used a single run of DRS to test.

Best regards


I got first:

[15:44:48] Parsing Tombo index file(s). [15:44:48] Parsing Tombo index file(s). [15:44:48] Parsing Tombo index file(s). [15:44:48] Parsing Tombo index file(s). [15:44:48] Parsing Tombo index file(s). [15:44:48] Parsing Tombo index file(s).

etc ... and

STRG.31926.2 346-351found 0 reads in bamfile '253a0924-23ed-4b92-872c-051d5505c27b' STRG.30415.34 809-814found 0 reads in bamfile '85f587a6-c333-4b5d-953a-ef9388758dc2' 'a0dd2b81-1927-40cb-ac26-7679c6117283' '91bd478f-173a-4141-86fa-85aae81e0a91' 'a2a7c14b-f567-4261-88f7-03753046d352' STRG.5003.1 1902-1907found 0 reads in bamfile '29d306e6-6f00-4afb-9c61-44cd506af5e3' STRG.5003.1 1923-1928found 8 reads in fast5 '48cbb8b1-5301-4226-9989-83a2be4187ce' 'ef8dfb85-b023-484d-a205-d96a1c2d3cd0' '4de55687-281b-43b1-a317-21e1c737e405' '552ba8b5-5842-492f-a47c-35837069f491' 'b5d70bbf-244c-45f7-995d-cccef92dc159' STRG.18079.6 705-710found 5 reads in fast5 'b187867c-bea6-4917-86f4-075e20d8026d'

etc... and

None None None None ... (lot of None)

weir12 commented 2 years ago

Thank you for your interest in our project, and I suggest you try the following ways to troubleshoot errors: 1.Reduce the number of threads to an appropriate number, such as 32. 2.Check whether the parameters in the tombo resquiggle step contain --include-event-stdev

Best regards

JBerthelier commented 2 years ago

Dear Liang Ou,

Thank you for your message and time, I ran LSTM_extract.py with 32 threads but still no proper output. I double checked and the tombo resquiggle step well contain --include-event-stdev

We are repeating the pipeline with an other dataset to see if it fix this time.

For annotation we used denovo stringtie transcriptome, with .gtf or .fa format, should be fine?

Best regards

weir12 commented 2 years ago

Dear JBerthelier,

Would you mind checking whether TOMBO and minimap2 are point to the same fasta file? In other words, we need to use the same naming format for chromosomes.

Best regards

JBerthelier commented 2 years ago

Dear Liang Ou,

Thank you for your message, I tried to re-do all the process, but I am still having issue at the LSTM_extract.py I am trying now with vir-1 fast5 data you used for the paper.

Regarding your question, I only used a transcriptome data (.fasta) as reference. I havent used any genome file (with chromosomes). I what step should I use a genome file ?

Best regards,

weir12 commented 2 years ago

Dear JBerthelier

You should not use the genome file at any step in the pipeline, as tombo's electrical signal mapping does not handle well the interval jumping generated by genomic introns

Best regards

JBerthelier commented 2 years ago

Dear Liang Ou,

Thanks for your reply,

unfortunatly for me, I am still stuck at the same step.
The LSTM_extract.py do not give me any output.

I tried with the test data, but I still got the same issue.

Bellow, I listed what I did using the test data you provide:

"W_003002_20180301_FAH54216_MN23410_mux_scan_20180301_FAH54216_VIRcomp_2928_DRS_96459_read_101_ch_125_strand.fast5"

Do you notice something wrong? I really need to use your tool for my current research project.

1. coordinates matching motif extraction

I used the atRTD3 transcriptome for all step (.fasta), first 10 lines are

AT1G01010.1 GTCTTCCTCCCTCCAAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGA TTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTG AAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTA ACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTCAACATCTGTAGCTACGA TCCTTGGAACTTGCGCTGTAAGTTCCGAATTTTCTGAATTTCATTTGCAAGTAATCGATTTAGGTTTTTG ATTTTAGGGTTTTTTTTTGTTTTGAACAGTCCAGTCAAAGTACAAATCGAGAGATGCTATGTGGTACTTC TTCTCTCGTAGAGAAAACAACAAAGGGAATCGACAGAGCAGGACAACGGTTTCTGGTAAATGGAAGCTTA CCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTTGTAGTGAGGGCTTTCGTGGTAAGATTGGTCA TAAAAGGGTTTTGGTGTTCCTCGATGGAAGATACCCTGACAAAACCAAATCTGATTGGGTTATCCACGAG

I ran the command:

python LSTM_extract.py get_pos --fasta "test_data/step1/atRTD3_07082020.transcript.fa" --motif 'RRACH' --output test-candidate_predict_pos.txt

I got

AT1G01010.1 31 36 + AAACC AT1G01010.1 41 46 + AAACA AT1G01010.1 113 118 + AAACC AT1G01010.1 223 228 + AAACA AT1G01010.1 286 291 + GAACT AT1G01010.1 373 378 + GAACA AT1G01010.1 434 439 + AAACA AT1G01010.1 460 465 + GGACA AT1G01010.1 512 517 + GGACC AT1G01010.1 601 606 + AAACC etc...

2. basecalling

guppy_basecaller -i "test_data/step1/Fast5/" -s test_virC --flowcell FLO-MIN106 --kit SQK-RNA001 --cpu_threads_per_caller 19 --fast5_out --records_per_fastq 0 --recursive

I got a new test_virC directory, containing pass (fastq directory) and workspace (fast5 directory).

2.2 resquiggle

tombo resquiggle --rna --processes 100 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_001 --include-event-stdev --overwrite --ignore-read-locks "test_data/step2_/test_virC/workspace/" "test_data/step1/atRTD3_07082020.transcript.fa"

I see in "testdata/step2/test_virC/workspace/" that the fast5 file has been modified

2.3 mapping

I mapped the basecalled fastq contained in test_data/step2/test_virC/pass/

minimap2 -t 10 -ax map-ont -L --secondary=no "test_data/step1/atRTD3_07082020.transcript.fa" "/step2_/test_virC/pass/" | samtools view -bh -F 2324 | samtools sort -O bam > basecalls.bam && samtools index basecalls.bam

3.extract features

python LSTM_extract.py predict --fast5 test_data/step2_/test_virC/workspace/ --corr_grp RawGenomeCorrected_001 --bam "test_data/step2_3/basecalls.bam" --sites "test_data/step1/test-candidate_predict_pos.txt" --label test --windows 2 2 --processes 8

I unfortunatlly got None at the end and only emply temporaries files

[11:41:34] Parsing Tombo index file(s). [11:41:34] Parsing Tombo index file(s). [11:41:34] Parsing Tombo index file(s). [11:41:34] Parsing Tombo index file(s). [11:41:34] Parsing Tombo index file(s). [11:41:34] Parsing Tombo index file(s). [11:41:35] Parsing Tombo index file(s). [11:41:35] Parsing Tombo index file(s). processes_0: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58764.98it/s] processes_1: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58851.49it/s] processes_2: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58923.67it/s] processes_3: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58574.04it/s] processes_4: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 59228.51it/s] processes_5: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58864.95it/s] processes_7: 100%|███████████████████████████████████████████████| 672236/672236 [00:11<00:00, 58974.14it/s] processes_6: 100%|███████████████████████████████████████████████| 672237/672237 [00:11<00:00, 58280.36it/s] Noneesses_4: 99%|██████████████████████████████████████████████▌| 666380/672237 [00:11<00:00, 59745.54it/s] Noneesses_5: 98%|█████████████████████████████████████████████▊ | 656104/672237 [00:11<00:00, 59551.82it/s] Noneesses_6: 95%|████████████████████████████████████████████▊ | 641434/672237 [00:11<00:00, 59204.66it/s] Noneesses_5: 99%|██████████████████████████████████████████████▋| 668147/672237 [00:11<00:00, 59877.71it/s] Noneesses_6: 97%|█████████████████████████████████████████████▋ | 653419/672237 [00:11<00:00, 59595.11it/s] Noneesses_6: 99%|██████████████████████████████████████████████▌| 665832/672237 [00:11<00:00, 60911.59it/s] Noneesses_7: 100%|██████████████████████████████████████████████▊| 668995/672236 [00:11<00:00, 61438.64it/s] None all tasks has done!

weir12 commented 2 years ago

Dear @JBerthelier , I noticed that the program was processing candidate sites abnormally fast (such as 60911.59it/s, which is unlikely to occur on a mechanical or solid-state hard disk), most likely because it encountered an implicit error and was not captured as expected. Would you mind append "--debug" to the end of command line? and re-run.

For example, a command just like: python LSTM_extract.py predict --fast5 test_data/step2_/test_virC/workspace/ --corr_grp RawGenomeCorrected_001 --bam "test_data/step2_3/basecalls.bam" --sites "test_data/step1/test-candidate_predict_pos.txt" --label test --windows 2 2 --processes 8 -debug

This option will output more details in the standard error output stream. You can paste the detailed stderr into this issue and I will try to help you solve it.

Best regards Liang Ou

JBerthelier commented 2 years ago

Dear Liang Ou,

Thank you for your concern and help.

I ran the command with --debug and I got this output (sorry the begining is a bit badly displayed)

Best regards,


[23:10:38] Parsing Tombo index file(s). [23:10:38] Parsing Tombo index file(s). [23:10:38] Parsing Tombo index file(s). [23:10:39] Parsing Tombo index file(s). [23:10:39] Parsing Tombo index file(s). [23:10:39] Parsing Tombo index file(s). [23:10:39] Parsing Tombo index file(s). [23:10:39] Parsing Tombo index file(s). A T5G05580.80: 332079-2084found 1 reads in fast5█████ | 219803/672237 [00:03<00:07, 57542.33it/s] '79bc713d-d298-467c-9443-c13d69386a8c'████████████▉ | 213092/672237 [00:03<00:08, 57251.50it/s] AT5G05580.8: 312079-2084found 0 reads in bamfile█ | 207086/672237 [00:03<00:08, 57375.37it/s] processes_3: 29%|██████████████████████████████▍ | 196974/672237 [00:03<00:08, 58215.41it/sA T5G05580.84: 282106-2111found 1 reads in fast5▌ | 191087/672237 [00:03<00:08, 57507.11it/s] '79bc713d-d298-467c-9443-c13d69386a8c'████████▌ | 184845/672237 [00:03<00:08, 57624.39it/s] AT5G05580.8: 262106-2111found 0 reads in bamfile | 174470/672237 [00:03<00:08, 57302.33it/s] AT5G05580.8: 302114-2119found 1 reads in fast5██▎ | 202796/672237 [00:03<00:08, 58024.17it/s] '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2114-2119found 0 reads in bamfile AT5G05580.8 2121-2126found 1 reads in fast5 ' 79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2121-2126found 0 reads in bamfile AT5G05580.8 2166-2171found 1 reads in fast5 ' 79bc713d-d298-467c-9443-c13d69386a8c'███████████████▉ | 225558/672237 [00:03<00:07, 56910.09it/s] AT5G05580.8 2166-2171found 0 reads in bamfile AT5G05580.8 2171-2176found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c'██████▉ | 173833/672236 [00:03<00:08, 57345.75it/s] AT5G05580.8 2171-2176found 0 reads in bamfile AT5G05580.8 2223-2228found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2223-2228found 0 reads in bamfile AT5G05580.8 2232-2237found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2232-2237found 0 reads in bamfile AT5G05580.8 2239-2244found 1 reads in fast5 ' 79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8: 332239-2244found 0 reads in bamfile██▊ | 218818/672237 [00:03<00:08, 56098.83it/s] AT5G05580.8 2247-2252found 1 reads in fast5 ' 79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2247-2252found 0 reads in bamfile AT5G05580.8 2261-2266found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8: 292261-2266found 0 reads in bamfile | 196838/672237 [00:03<00:08, 56250.64it/s] AT5G05580.8 2270-2275found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2270-2275found 0 reads in bamfile AT5G05580.8 2364-2369found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2364-2369found 0 reads in bamfile AT5G05580.8 2376-2381found 1 reads in fast5 '79bc713d-d298-467c-9443-c13d69386a8c' AT5G05580.8 2376-2381found 0 reads in bamfile processes_0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 57002.97it/s] processes_1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 56874.08it/s] processes_2: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 56895.33it/s] processes_3: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 57456.43it/s] processes_4: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 56956.23it/s] processes_5: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 57177.32it/s] processes_6: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672237/672237 [00:11<00:00, 56848.00it/s] processes_7: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 672236/672236 [00:11<00:00, 57432.14it/s] Noneesses_4: 98%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 657358/672237 [00:11<00:00, 56231.33it/s] Noneesses_4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌| 669044/672237 [00:11<00:00, 57392.92it/s] Noneesses_5: 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████ | 665918/672237 [00:11<00:00, 58081.21it/s] Noneesses_5: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████▉| 671843/672237 [00:11<00:00, 58430.94it/s] Noneesses_6: 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 660835/672237 [00:11<00:00, 58432.43it/s] Noneesses_6: 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████▏| 666787/672237 [00:11<00:00, 58755.87it/s] Noneesses_7: 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍| 668784/672236 [00:11<00:00, 59351.48it/s] None all tasks has done!

JBerthelier commented 2 years ago

Also I am running the scripts on a cluster using slurm scheduler , if this information can help, please let me know if you need any extra informations

weir12 commented 2 years ago

Hi, Have you checked the alignment rate of bam file, especially the number of reads covered by candidate sites

e.g alignment depth of given ref site samtools depth -r AT5G05580.8:2166-2171 -f "test_data/step2_3/basecalls.bam" > depth.txt

e.g number of alignments for each FLAG type samtools flagstat "test_data/step2_3/basecalls.bam" > flagstat.txt

Best regards Liang Ou

JBerthelier commented 2 years ago

Dear Liang Ou,

Many thanks for your message and time, Indeed the problem was what you suspected, the read from your test file was not mapped on the transcriptome assembly. I tried with Araport11 transcripome instead of AtRTD3 and now it worked properly! I am not sure why I got no mapping on AtRTD3 transcriptome.

I also was able to make works the predict step and got the output


AT5G05580.1 1499 AAACA 0 1 0.0

I will now try with my data

Best

Jeremy

weir12 commented 2 years ago

You're welcome. If you have other questions about DENA,you can always open new issue .