zhpn1024 / ribotish

Ribo-seq TIS Hunter, predicting translation initiation sites and ORFs using riboseq data
http://dx.doi.org/10.1038/s41467-017-01981-8
GNU General Public License v3.0
24 stars 7 forks source link

TypeError: '<' not supported between instances of 'NoneType' and 'int' when running predict module #26

Closed mt1022 closed 1 year ago

mt1022 commented 1 year ago

Dear developer,

I came across this error when running ribotish with the following command:

ribotish predict -p4 --alt --framebest -b sample.bam -g ribotish_test2.gtf -f genome.fa -o ribotish_test2 -v -v

Error message:

Thu Aug  4 19:14:55 2022 Loading genome...
No input TIS data!
Thu Aug  4 19:14:55 2022 Predicting...
10
ENSMMUG00000061626  ENSMMUT00000107115  0   68
Thu Aug  4 19:14:55 2022 ENSMMUG00000061626 ENSMMUT00000107115  SHISAL1 protein_coding  10:7187171-7234333:+    TTG 804 1335    Internal    0   0   None    5.82694120373455e-12    T   5.826941203734552e-12
Traceback (most recent call last):
  File "/home/admin/mambaforge/bin/ribotish", line 56, in <module>
    main()
  File "/home/admin/mambaforge/bin/ribotish", line 34, in main
    commands[cmd].run(args)
  File "/home/admin/mambaforge/lib/python3.9/site-packages/ribotish/run/predict.py", line 250, in run
    cr = interval.cds_region_trans(t)
  File "/home/admin/mambaforge/lib/python3.9/site-packages/ribotish/zbio/interval.py", line 299, in cds_region_trans
    thick.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'

This error seems to be caused by incorrect parsing of stop_codon position in src/zbio/gtf.py when the stop codon overlaps with a splice junction and is reprensented by two stop_codon lines in the gtf file: https://github.com/zhpn1024/ribotish/blob/607ab539267b4234ef8f7ec4f274f46637ce63f9/src/zbio/gtf.py#L395 And the stop codon position is stored as an interval rather than a list of two intervals: https://github.com/zhpn1024/ribotish/blob/607ab539267b4234ef8f7ec4f274f46637ce63f9/src/zbio/gtf.py#L278

Here is the full gtf file used above:

10  ensembl transcript  7172377 7234333 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding";
10  ensembl exon    7172377 7173079 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "1"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000475544"; exon_version "1";
10  ensembl CDS 7172377 7173079 .   +   0   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "1"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSMMUP00000071984"; protein_version "1";
10  ensembl exon    7183086 7183184 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "2"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000523952"; exon_version "1";
10  ensembl CDS 7183086 7183184 .   +   2   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "2"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSMMUP00000071984"; protein_version "1";
10  ensembl exon    7187170 7187383 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "3"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000036524"; exon_version "2";
10  ensembl CDS 7187170 7187383 .   +   2   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "3"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSMMUP00000071984"; protein_version "1";
10  ensembl exon    7198342 7198659 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "4"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000520747"; exon_version "1";
10  ensembl CDS 7198342 7198657 .   +   1   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "4"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSMMUP00000071984"; protein_version "1";
10  ensembl exon    7234333 7234333 .   +   .   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "5"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000508835"; exon_version "1";
10  ensembl stop_codon  7198658 7198659 .   +   0   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "4"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding";
10  ensembl stop_codon  7234333 7234333 .   +   1   gene_id "ENSMMUG00000061626"; gene_version "1"; transcript_id "ENSMMUT00000107115"; transcript_version "1"; exon_number "5"; gene_name "SHISAL1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SHISAL1-205"; transcript_source "ensembl"; transcript_biotype "protein_coding";

Here is a demo of this error:

from ribotish.zbio import io

for g in io.geneIter('ribotish_test2.gtf', 'gtf'):
     print(g)

for t in g.trans:
     print(t)

t.cds_stop(cdna=True)  # = 1337
t.cds_start(cdna=True)  # = 2
t.cdna_length()  # = 1335 (< 1337)

I am not sure how to deal with this issue. Can I get expected results by just deleting all the lines with "stop_codon" in third column of the gtf file?

zhpn1024 commented 1 year ago

Thank you for reporting the bug. Let me see how to deal with it.

zhpn1024 commented 1 year ago

The problem is that the 'start_codon' line is missing, and the code calculate start codon position with stop codon, resulting in frame shift error. The t.cds_start(cdna=True) should be 0 instead of 2, and t.cds_stop(cdna=True) should be 1335. Deleting the second stop_codon line should be fine. I have find a way to fix the bug.

mt1022 commented 1 year ago

Thanks for the timely response.

By the way, I am not sure whether self.cds_start have similar problems when "start_codon" lines are present in gtf. It is possible that a start codon might be distributed over two exons in rare cases, which might also lead to incorrect results in cds_start() class mehtod.

zhpn1024 commented 1 year ago

You are right. The split of start codon should also be considered.

zhpn1024 commented 1 year ago

https://github.com/zhpn1024/ribotish/commit/ed3d80d5aed2e504c2eb120f8b6e51b05db14acf