wdecoster / NanoPlot

Plotting scripts for long read sequencing data
http://nanoplot.bioinf.be
MIT License

Error with bam file #157

Closed valery-shap closed 5 years ago

valery-shap commented 5 years ago

I have RNA-seq reads from Nanopore (kit RNA002, U-based) and aligned them with minimap2 in two different ways:

  1. -ax splice -uf -k14
  2. -ax map-ont -N 100

The reference was the human transcriptome from https://www.gencodegenes.org/human/. This reference is DNA-based, but the minimap2 developers told me that this is not a problem. With the big merged fastq file I get no errors from NanoPlot. With this bam file from minimap2 I get only a log file, and Python crashes. Log: bam-1NanoPlot_20191119_2207.log. Python error: Process: Python [78648].pdf

I checked the header of my bam file with -H; the output was:

@HD VN:1.6 SO:coordinate
@SQ SN:ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|lncRNA| LN:1657
@SQ SN:ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| LN:632
@SQ SN:ENST00000488147.1|ENSG00000227232.5|OTTHUMG00000000958.1|OTTHUMT00000002839.1|WASH7P-201|WASH7P|1351|unprocessed_pseudogene| LN:1351
@SQ SN:ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA| LN:68
@SQ SN:ENST00000473358.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-2HG-202|MIR1302-2HG|712|lncRNA| LN:712
@SQ SN:ENST00000469289.1|ENSG00000243485.5|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-2HG-201|MIR1302-2HG|535|lncRNA| LN:535
@SQ SN:ENST00000607096.1|ENSG00000284332.1|-|-|MIR1302-2-201|MIR1302-2|138|miRNA| LN:138
@SQ SN:ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-201|FAM138A|1187|lncRNA| LN:1187
@SQ SN:ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-202|FAM138A|590|lncRNA| LN:590
@SQ SN:ENST00000606857.1|ENSG00000268020.3|OTTHUMG00000185779.1|OTTHUMT00000471235.1|OR4G4P-201|OR4G4P|840|unprocessed_pseudogene| LN:840
@SQ SN:ENST00000642116.1|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000492680.1|OR4G11P-202|OR4G11P|1414|lncRNA| LN:1414
@SQ SN:ENST00000492842.2|ENSG00000240361.2|OTTHUMG00000001095.3|OTTHUMT00000003224.3|OR4G11P-201|OR4G11P|939|transcribed_unprocessed_pseudogene| LN:939
@SQ SN:ENST00000641515.2|ENSG00000186092.6|OTTHUMG00000001094.4|OTTHUMT00000003223.4|OR4F5-202|OR4F5|2618|protein_coding| LN:2618
@SQ SN:ENST00000335137.4|ENSG00000186092.6|OTTHUMG00000001094.4|-|OR4F5-201|OR4F5|1054|protein_coding| LN:1054
@SQ SN:ENST00000466430.5|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003225.1|AL627309.1-201|AL627309.1|2748|lncRNA| LN:2748
@SQ SN:ENST00000477740.5|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003688.1|AL627309.1-202|AL627309.1|491|lncRNA| LN:491
@SQ SN:ENST00000471248.1|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003687.1|AL627309.1-203|AL627309.1|629|lncRNA| LN:629
@SQ SN:ENST00000610542.1|ENSG00000238009.6|OTTHUMG00000001096.2|-|AL627309.1-205|AL627309.1|723|lncRNA| LN:723
@SQ SN:ENST00000453576.2|ENSG00000238009.6|OTTHUMG00000001096.2|OTTHUMT00000003689.1|AL627309.1-204|AL627309.1|336|lncRNA| LN:336
@SQ SN:ENST00000495576.1|ENSG00000239945.1|OTTHUMG00000001097.2|OTTHUMT00000003226.2|AL627309.3-201|AL627309.3|1319|lncRNA| LN:1319
@SQ SN:ENST00000442987.3|ENSG00000233750.3|OTTHUMG00000001257.3|OTTHUMT00000003691.3|CICP27-201|CICP27|3812|processed_pseudogene| LN:3812
@SQ SN:ENST00000494149.2|ENSG00000268903.1|OTTHUMG00000182518.2|OTTHUMT00000461982.2|AL627309.6-201|AL627309.6|755|processed_pseudogene| LN:755
@SQ SN:ENST00000595919.1|ENSG00000269981.1|OTTHUMG00000182738.2|OTTHUMT00000463398.2|AL627309.7-201|AL627309.7|284|processed_pseudogene| LN:284

[......]

and at the end:

@SQ SN:ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|protein_coding| LN:1141
@SQ SN:ENST00000387460.2|ENSG00000210195.2|-|-|MT-TT-201|MT-TT|66|Mt_tRNA| LN:66
@SQ SN:ENST00000387461.2|ENSG00000210196.2|-|-|MT-TP-201|MT-TP|68|Mt_tRNA| LN:68
@PG ID:minimap2 PN:minimap2 VN:2.14-r892-dirty CL:minimap2 -ax splice -uf -k14 /Users/valery/Desktop/rna/gencode.v32.transcripts.fa /Users/valery/Desktop/rna/bigfile.fastq

I suppose that the problem is with the bam file.

Could you help me?

P.S. I had problems during installation because I had Python 2.7. Then I installed Python 3.8, which had problems installing pysam, so now I am on version 3.7.5.
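
Editor's note: a transcriptome BAM header like the one quoted above carries one @SQ record per transcript, so the number of reference sequences can be very large. The following is a minimal sketch for counting header records by type. It assumes the header text has already been obtained (for example with samtools view -H) and uses two @SQ lines copied from the output above; the function name count_header_records is made up for this illustration.

```python
from collections import Counter


def count_header_records(header_text: str) -> Counter:
    """Count SAM header records by type (@HD, @SQ, @PG, ...)."""
    counts = Counter()
    for line in header_text.splitlines():
        line = line.strip()
        if line.startswith("@"):
            # The record type is the first whitespace-separated field.
            counts[line.split()[0]] += 1
    return counts


# A tiny excerpt of the header shown in this issue (real headers are tab-separated).
sample_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ENST00000619216.1|ENSG00000278267.1|-|-|MIR6859-1-201|MIR6859-1|68|miRNA|\tLN:68",
    "@SQ\tSN:ENST00000387461.2|ENSG00000210196.2|-|-|MT-TP-201|MT-TP|68|Mt_tRNA|\tLN:68",
])

counts = count_header_records(sample_header)
print(counts["@SQ"], "reference sequences in this excerpt")
```

Run on the full header, the @SQ count gives the number of contigs a per-contig tool would have to visit.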

wdecoster commented 5 years ago

Hi,

I think the problem is because of the many contigs/"chromosomes" in a transcriptome bam. That may have caused too many separate threads (as extraction of information is done per contig). That's my hypothesis. To fix that I have pushed some changes to nanoget (v1.10.0), the submodule taking care of the data extraction.
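
Editor's note: to illustrate the hypothesis above, here is a minimal sketch of one way to keep the number of parallel tasks small when there are very many contigs, by handing each worker a batch of contigs instead of a single one. This is not nanoget's code, and the thread does not say what v1.10.0 actually changed; extract_contig is a hypothetical stand-in for the real per-contig extraction, and the batch size is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def extract_contig(name):
    # Hypothetical placeholder: real code would read alignments for `name`.
    return (name, 0)


def extract_batch(names):
    """Process one batch of contigs sequentially inside a single task."""
    return [extract_contig(name) for name in names]


def extract_all(contigs, threads=4, batch_size=1000):
    """Submit one task per batch rather than one task per contig."""
    results = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map preserves the order of the input batches.
        for batch_result in pool.map(extract_batch, chunked(contigs, batch_size)):
            results.extend(batch_result)
    return results


if __name__ == "__main__":
    contigs = [f"transcript_{i}" for i in range(2500)]
    print(len(extract_all(contigs)), "contigs processed in",
          len(list(chunked(contigs, 1000))), "tasks")
```

With batching, the scheduling overhead scales with the number of batches rather than with the number of transcripts.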

I have made the changes available on PyPI and conda will follow shortly. Please let me know if that helps.

Thanks, Wouter

valery-shap commented 5 years ago

Thank you for your reply. I tried pip3 install nanoget and the answer was:

Requirement already satisfied: nanoget in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (1.9.1)
Requirement already satisfied: pandas>=0.22.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from nanoget) (0.25.3)
Requirement already satisfied: numpy in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from nanoget) (1.17.4)
Requirement already satisfied: biopython in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from nanoget) (1.75)
Requirement already satisfied: pysam>0.10.0.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from nanoget) (0.15.3)
Requirement already satisfied: nanomath in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from nanoget) (0.23.1)
Requirement already satisfied: pytz>=2017.2 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas>=0.22.0->nanoget) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas>=0.22.0->nanoget) (2.8.1)
Requirement already satisfied: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas>=0.22.0->nanoget) (1.13.0)

So I suppose it is installed, no?

wdecoster commented 5 years ago

You have nanoget v1.9.1. You'll have to do pip install --upgrade nanoget. If that doesn't pull in v1.10.0 you may have to add --no-cache-dir.
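
Editor's note: the version mix-up above is easy to make because 1.9.1 looks larger than 1.10.0 when compared as text. Below is a minimal sketch of a numeric comparison, assuming purely numeric dotted versions; the helper names are made up for this illustration. Note that importlib.metadata needs Python 3.8 or later, so on the 3.7 installation mentioned in this thread pip show nanoget would be the way to read the installed version.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0); assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed: str, required: str) -> bool:
    """True if `installed` is older than `required`, compared numerically."""
    return version_tuple(installed) < version_tuple(required)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version("nanoget")
    if current is None:
        print("nanoget is not installed")
    elif needs_upgrade(current, "1.10.0"):
        print(f"nanoget {current} is older than 1.10.0; upgrade needed")
    else:
        print(f"nanoget {current} is recent enough")
```

Comparing the strings directly would give the wrong answer here, since "1.9.1" sorts after "1.10.0" character by character.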

valery-shap commented 5 years ago

Thank you, I installed it:

=2.6.1->pandas>=0.22.0->nanoget) (1.13.0)
Installing collected packages: nanoget
Found existing installation: nanoget 1.9.1
Uninstalling nanoget-1.9.1:
Successfully uninstalled nanoget-1.9.1
Running setup.py install for nanoget ... done
Successfully installed nanoget-1.10.0

I tried again:

NanoPlot -t 4 -o ~/Desktop/rna/nanoplot-3 -p bam-3 --bam ~/Desktop/rna/aln-3-sorted.bam

After half an hour there is still no result; it hangs at this point (log: bam-3NanoPlot_20191120_1301.log). I then decided to check this bam file with seqkit, and seqkit bam -s gave this result:

PrimAlnPerc  PrimAln  SecAln   SupAln  Unmapped  MultimapPerc  TotalRe  File
96.34        623141   1434239  13883   23682     39.99         2081062  /Users/valery/Desktop/rna/aln-sorted.bam

So I hope my bam file isn't corrupted and nothing is wrong with it.
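
Editor's note: the seqkit numbers quoted above can be sanity-checked with a little arithmetic. The relations below are inferred from the reported values only, not taken from seqkit's documentation: the PrimAlnPerc value matches primary alignments divided by primary plus unmapped, and the reported total matches primary plus secondary plus unmapped.

```python
# Values copied from the seqkit bam -s output in this issue.
prim_aln = 623141
sec_aln = 1434239
sup_aln = 13883
unmapped = 23682


def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption inferred from the numbers: primary / (primary + unmapped).
prim_pct = percent(prim_aln, prim_aln + unmapped)

# Assumption inferred from the numbers: the total excludes supplementary records.
total = prim_aln + sec_aln + unmapped

print(prim_pct)  # 96.34, matching PrimAlnPerc
print(total)     # 2081062, matching the reported total
```

Since the counts add up consistently, the summary gives no hint of a truncated or corrupted file.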

wdecoster commented 5 years ago

And can you check with e.g. htop to see if it's still doing work?

valery-shap commented 5 years ago

[Screenshot 2019-11-20 at 14 20 14]

It looks like it isn't doing any work :(

wdecoster commented 5 years ago

I see in that screenshot that your 4 CPUs are working, but the highest percentage is 0.2%? Or do the remaining processes belong to other users?

Do you think you could share your bam file for me to take a look at it?

valery-shap commented 5 years ago

I was hoping you would ask :) I can do it in 4-5 hours. No, I'm the only user.

valery-shap commented 5 years ago

I've sent it to you via e-mail.

wdecoster commented 5 years ago

Using the bam file you provided (aln-sorted.bam) on a larger server (running with 24 threads) I have determined that it does work, it just takes an embarrassingly long time. Let me see how I can fix that.

valery-shap commented 5 years ago

Thank you for analyzing it. I do have the option of running it on a server, but I'm a bit surprised. I'll try. If you could change it so that it finishes in a reasonable time with 4 threads, that would be very useful. Thank you.

wdecoster commented 5 years ago

I have just updated nanoget to v1.11.0, to avoid being so slow when running on a bam with lots of contigs. Please try pip install --upgrade nanoget and run NanoPlot again.

valery-shap commented 5 years ago

Thank you, it works! I got only one warning: JointGrid annotation is deprecated and will be removed in a future release.