zolotarovgl / GeneExt

GeneExt - Gene extension for improved scRNA-seq data counting
GNU General Public License v3.0

Error of transcript number after geneext #5

Closed CaiCheng1996 closed 8 months ago

CaiCheng1996 commented 8 months ago

Input, for example:

NC_048303.1 StringTie gene 90101184 90185752 . - . gene_id "DMRT1";
NC_048303.1 StringTie transcript 90101184 90185752 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90101184 90102440 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90154786 90155069 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90179468 90179654 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90185100 90185752 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie transcript 90101225 90161625 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.4";
NC_048303.1 StringTie exon 90101225 90102440 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.4";
NC_048303.1 StringTie exon 90140839 90141001 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.4";
NC_048303.1 StringTie exon 90154786 90155069 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.4";
NC_048303.1 StringTie exon 90161570 90161625 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.4";

The output:

NC_048303.1 StringTie gene 90101184 90185752 . - . gene_id "DMRT1";
NC_048303.1 StringTie transcript 90101184 90185752 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90101184 90102440 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90154786 90155069 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90179468 90179654 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";
NC_048303.1 StringTie exon 90185100 90185752 . - . gene_id "DMRT1"; transcript_id "MSTRG.13159.3";

GeneExt leaves every gene with only one transcript: it doesn't do any merging, it just drops the other transcripts. Is that a bug?

zolotarovgl commented 8 months ago

Dear user,

Thank you for your question and especially for providing the input file!

This is not a bug but intended functionality that makes the tool more robust. If a gene has multiple transcripts, it is not clear which one should be extended downstream (peaks can rarely be assigned unambiguously to one of the mRNA isoforms, as there is no information about splicing upstream of the peak). So far, GeneExt selects the longest transcript per locus and adds the extensions to its last exon. If you are running gene-level quantification downstream (the vast majority of scRNA-seq workflows), picking the longest transcript or keeping all of them should not affect gene quantification: the only important variable is the total range of the gene, not the exon-intron structure (i.e. constituent transcripts).
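To make the selection step concrete, here is a minimal Python sketch of picking one transcript per gene from GTF records. It is not GeneExt's actual code: the function names `parse_transcript` and `longest_per_gene` are hypothetical, and it assumes "longest" means the largest genomic span (end - start + 1), which may differ from the criterion GeneExt really uses.

```python
import re

# The two transcript records from the example in this issue.
GTF_LINES = [
    'NC_048303.1 StringTie transcript 90101184 90185752 . - . '
    'gene_id "DMRT1"; transcript_id "MSTRG.13159.3";',
    'NC_048303.1 StringTie transcript 90101225 90161625 . - . '
    'gene_id "DMRT1"; transcript_id "MSTRG.13159.4";',
]


def parse_transcript(line):
    """Return a dict for a 'transcript' record, or None for any other line."""
    fields = line.split(None, 8)  # 8 fixed columns + the attribute column
    if len(fields) < 9 or fields[2] != "transcript":
        return None
    # Attributes look like: key "value"; key "value";
    attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
    return {
        "gene_id": attrs["gene_id"],
        "transcript_id": attrs["transcript_id"],
        "start": int(fields[3]),
        "end": int(fields[4]),
    }


def longest_per_gene(lines):
    """Keep, for each gene, the transcript with the largest genomic span.

    Assumption: span = end - start + 1. Ties keep the first transcript seen.
    """
    best = {}
    for line in lines:
        rec = parse_transcript(line)
        if rec is None:
            continue
        span = rec["end"] - rec["start"] + 1
        current = best.get(rec["gene_id"])
        if current is None or span > current["end"] - current["start"] + 1:
            best[rec["gene_id"]] = rec
    return best


if __name__ == "__main__":
    for gene_id, rec in longest_per_gene(GTF_LINES).items():
        print(gene_id, rec["transcript_id"], rec["start"], rec["end"])
```

Under that assumption, MSTRG.13159.3 is kept for DMRT1, which matches the output reported above. In this example the kept transcript also spans the full gene range (90101184 to 90185752), so the gene-level coordinates are unchanged.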

P.S. If you want to keep multiple transcripts, you can disable picking the longest transcript per gene on line 445 of geneext.py: `do_longest = False  # whether to select the longest transcript per gene`

(But be aware that this may lead to some unexpected behavior!)

Please, let me know if that answers your question / if you have any additional ones - I'm happy to help!

Cheers, Grisha

CaiCheng1996 commented 8 months ago

Thank you!