wdecoster / methplotlib

Plotting tools for nanopore methylation data
MIT License

Found 0 gene(s) in the region. #15

Closed by sarah872 4 years ago

sarah872 commented 4 years ago

I have a gff3 file that I downloaded from the Microscope Annotation platform. This is how the file looks:

##gff-version 3
##sequence-region TQC00001.1 1 2035761
TQC00010.1  annotation  remark  1   2035761 .   .   .   accession=TQC00001.1;comment=Annotations were generated from the MicroScope annotation platform. Additional results are available at http://www.genoscope.cns.fr/agc/microscope . This file is not suitable for direct databank submission. To contact us: mage%40genoscope.cns.fr .%0AMicroscope genomic region coordinates: 1..2035761;data_file_division=BCT;date=25-NOV-2019;organism=Genus Species;source=Genus Species;topology=linear
TQC00010.1  feature region  1   2035761 .   +   .   Is_circular=false;Note=whole genome shotgun linear WGS contig 1;db_xref=taxon:1907535,MaGe/Organism_id:12250,MaGe/Species_code:TQUA2019,MaGe/Sequence_id:16851,MaGe/Scaffold_id:1,MaGe/Contig_id:1,MaGe/Contig_label:scaffold1;mol_type=genomic DNA;organism=Candidatus Thiosymbion quadrati;strain=ONT2019
TQC00010.1  feature gene    168 2300    .   +   .   locus_tag=TQUA2019_v1_10001
TQC00010.1  feature CDS 168 2300    .   +   0   ID=71761160;db_xref=MaGe:71761160;inference=ab initio prediction:AMIGene:2.0;locus_tag=TQUA2019_v1_10001;product=N-6 DNA methylase;transl_table=11;translation=M
TQC00010.1  feature gene    2300    3358    .   +   .   locus_tag=TQUA2019_v1_10002
TQC00010.1  feature CDS 2300    3358    .   +   0   ID=71761161;db_xref=MaGe:71761161;inference=ab initio prediction:AMIGene:2.0;locus_tag=TQUA2019_v1_10002;product=2-hydroxyacid dehydrogenase;transl_table=11;translation=M
TQC00010.1  feature gene    3529    4179    .   +   .   locus_tag=TQUA2019_v1_10003
TQC00010.1  feature CDS 3529    4179    .   +   0   ID=71761162;db_xref=MaGe:71761162;inference=ab initio prediction:AMIGene:2.0;locus_tag=TQUA2019_v1_10003;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=M

This is the log file:

2019-12-13 12:38:14,529 methplotlib 0.8.0 started.
Python version is: 3.7.0 (default, Sep  6 2018, 14:24:05)  [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
Arguments are: Namespace(bed=None, example=False, gtf='file.gff', methylation=['methylation_calls_dam.mod.tsv.gz', 'methylation_calls_dam.mod.tsv.gz.freq.mod'], names=['calls', 'frequencies'], simplify=True, smooth=5, split=False, window='TQC00010.1:1-10000')
2019-12-13 12:38:14,530 Processing TQC00010.1_1_10000
2019-12-13 12:38:21,836 Read the file in a dataframe.
2019-12-13 12:38:21,900 File contains raw data.
2019-12-13 12:38:22,040 Read the file in a dataframe.
2019-12-13 12:38:22,057 File contains frequency data.
2019-12-13 12:38:22,065 Collected methylation data for 2 datasets
2019-12-13 12:38:24,886 Created QC plots
2019-12-13 12:38:28,085 Prepared methylation traces.
2019-12-13 12:38:31,383 Prepared modification plots.
2019-12-13 12:38:31,403 Parsing GTF file...
2019-12-13 12:38:31,451 Loaded GTF file, processing...
2019-12-13 12:38:31,456 Found 0 gene(s) in the region.

2019-12-13 12:38:31,465 Prepared annotation plots.
2019-12-13 12:38:34,397 Finished!

The plotting works, but as said, there are no annotations plotted. What is wrong with my gff-file?

endrebak commented 4 years ago

What command did you use?

Uneducated guesses:

If you could post the data used and the command you tried that would be of great help :)

sarah872 commented 4 years ago

This is the command:

methplotlib -m calls freq -n calls freq -w TQUA2019C00010.1:0-1794 -g annotation.gff --simplify

annotation.gff.gz calls.gz freq.gz

endrebak commented 4 years ago

Looking into it now :) Thanks!

endrebak commented 4 years ago

I could not find the reason. Guess @wdecoster will look at it next week :)

wdecoster commented 4 years ago

Thanks for reporting this. I believe the cause is that methplotlib makes certain assumptions about the gtf file, which may not be valid for yours. It is tailored to Ensembl gtf files, maybe even specific to human files. But I should change that to support other types as well. It's an annoying format.

These assumptions are:

Based on the file you linked above (which is only 4 gtf lines?) your gtf doesn't match either of these assumptions.

Could you perhaps share the full gtf? You can find my email in my GitHub profile.

Thanks, Wouter

endrebak commented 4 years ago

Perhaps you could report that it found no entries in the expected format, and say what the expected format is? The readme and the --help screen could also inform the user about the expected format.
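A minimal sketch of what such a report could look like, assuming a hypothetical helper `report_gene_count` and a hint text worded for illustration only; neither is part of methplotlib:

```python
import logging

# Hypothetical hint text; the wording is an assumption, not methplotlib output.
EXPECTED_FORMAT_HINT = (
    "The annotation parser is tailored to Ensembl-style GTF files; "
    "check that your annotation file uses that format."
)


def report_gene_count(genes, window, hint=EXPECTED_FORMAT_HINT,
                      logger=logging.getLogger("annotation")):
    """Log how many genes were found and explain an empty result."""
    if genes:
        message = f"Found {len(genes)} gene(s) in the region."
        logger.info(message)
    else:
        # An empty result is more useful when it names the expected format.
        message = f"Found 0 gene(s) in {window}. {hint}"
        logger.warning(message)
    return message


print(report_gene_count([], "TQC00010.1:1-10000"))
```

Logging the empty case at warning level rather than info level makes it stand out from the routine progress messages in the log.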

wdecoster commented 4 years ago

Yes, I agree, that would be the minimal information that the script should report. I hope to look into this later this week.

wdecoster commented 4 years ago

I'm sorry, but I won't be able to provide a fix before the 2nd of January. Enjoy your holidays!

wdecoster commented 4 years ago

I apologize for taking so long to fix this. I have just made version 0.13 available, which supports your annotation file. I didn't realize in December that you were using a GFF file, while everything I wrote is intended for GTF files. They kinda look the same, but there are some crucial differences in how they're parsed.
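To illustrate one of those differences: the ninth column holds attributes as `key "value";` pairs in GTF but as `key=value;` pairs in GFF3. The sketch below is not methplotlib's parser; the function names and the GTF example values are hypothetical, and the GFF3 example is taken from the file posted above.

```python
import re


def parse_gtf_attributes(field):
    """Parse a GTF attribute column, e.g. gene_id "GENE1"; gene_name "EXAMPLE";"""
    return dict(re.findall(r'(\S+)\s+"([^"]*)"', field))


def parse_gff3_attributes(field):
    """Parse a GFF3 attribute column, e.g. ID=71761160;locus_tag=TQUA2019_v1_10001"""
    attributes = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


# Hypothetical GTF-style attributes (illustrative values only).
gtf_field = 'gene_id "GENE1"; gene_name "EXAMPLE";'
# GFF3-style attributes from the annotation file in this issue.
gff3_field = "ID=71761160;locus_tag=TQUA2019_v1_10001;product=N-6 DNA methylase"

print(parse_gtf_attributes(gtf_field))
print(parse_gff3_attributes(gff3_field))
# A GTF-style parser finds no attributes at all in a GFF3 column.
print(parse_gtf_attributes(gff3_field))
```

A parser that only understands the GTF syntax recovers no attributes from a GFF3 line, so it cannot pick up a gene identifier or name from it.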

My code currently uses the locus_tag attribute as the "gene name". I am not entirely happy with how it parses GFF files at the moment or with how the attributes of interest are selected, but I will wait for more user feedback to see if changes need to be made. I can't foresee all the problems that users might encounter. It is perfectly functional for Ensembl GTF files, and I'll make more changes when users require those.
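As a rough sketch of the idea (not the actual methplotlib code), the following selects `gene` features that overlap a window and labels each one with its `locus_tag`. The function `genes_in_window` and its parameters are hypothetical; the records are the gene and CDS lines from the file posted above, rebuilt here with tab separators.

```python
def gff3_line(*fields):
    """Join the nine GFF3 columns with tabs."""
    return "\t".join(fields)


GFF3_TEXT = "\n".join([
    "##gff-version 3",
    gff3_line("TQC00010.1", "feature", "gene", "168", "2300", ".", "+", ".",
              "locus_tag=TQUA2019_v1_10001"),
    gff3_line("TQC00010.1", "feature", "CDS", "168", "2300", ".", "+", "0",
              "ID=71761160;locus_tag=TQUA2019_v1_10001;product=N-6 DNA methylase"),
    gff3_line("TQC00010.1", "feature", "gene", "2300", "3358", ".", "+", ".",
              "locus_tag=TQUA2019_v1_10002"),
    gff3_line("TQC00010.1", "feature", "gene", "3529", "4179", ".", "+", ".",
              "locus_tag=TQUA2019_v1_10003"),
])


def genes_in_window(gff3_text, seqname, begin, end, name_key="locus_tag"):
    """Return (name, start, stop, strand) for gene features overlapping the window."""
    genes = []
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        seqid, _source, feature_type, start, stop, _score, strand, _phase, attrs = fields
        if seqid != seqname or feature_type != "gene":
            continue
        start, stop = int(start), int(stop)
        if stop < begin or start > end:
            continue  # feature lies entirely outside the window
        attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if "=" in pair)
        genes.append((attributes.get(name_key, "unnamed"), start, stop, strand))
    return genes


genes = genes_in_window(GFF3_TEXT, "TQC00010.1", 1, 10000)
print(f"Found {len(genes)} gene(s) in the region.")
for gene in genes:
    print(gene)
```

Keeping the attribute used for the name as a parameter (`name_key`) would make it easy to switch to another attribute for annotation files that lack `locus_tag`.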

Please let me know if you have more feedback.