wu-lab-egio / EGIO

Exon Group Ideogram based detection of Orthologous exons and Orthologous isoforms

GTF file format error #7

Closed Tang-pro closed 4 months ago

Tang-pro commented 4 months ago

Hi, @wu-lab-egio

This GTF format is as follows:

Gbar_A01 StringTie gene 27440 34713 1000 + . gene_id "MSTRG.1";
Gbar_A01 StringTie transcript 27440 34713 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
Gbar_A01 StringTie exon 27440 27578 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
Gbar_A01 StringTie exon 27894 28455 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2";
Gbar_A01 StringTie exon 29197 29256 1000 + . gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "3";

This error occurred here:

Traceback (most recent call last):
  File "/public/home/zwliu/software/EGIO/prepare_egio_extra.py", line 442, in <module>
    prepare_geio_extra(args.gtf, args.cdna, args.cds, args.species)
  File "/public/home/zwliu/software/EGIO/prepare_egio_extra.py", line 251, in prepare_geio_extra
    gty = gtytmp[0].split(" ")[1].replace("\"","").replace("\n","").replace(";","")
IndexError: list index out of range
BLAST options error: File extrainfo/Ghir.exonfasta does not exist
Traceback (most recent call last):
  File "/public/home/zwliu/software/EGIO/prepare_egio_extra.py", line 442, in <module>
    prepare_geio_extra(args.gtf, args.cdna, args.cds, args.species)
  File "/public/home/zwliu/software/EGIO/prepare_egio_extra.py", line 251, in prepare_geio_extra
    gty = gtytmp[0].split(" ")[1].replace("\"","").replace("\n","").replace(";","")
IndexError: list index out of range
BLAST options error: File extrainfo/Gbar.exonfasta does not exist
Command line argument error: Argument "query". File is not accessible: extrainfo/Ghir.exonfasta'
Command line argument error: Argument "query". File is not accessible: extrainfo/Gbar.exonfasta'
Traceback (most recent call last):
  File "/public/home/zwliu/software/EGIO/prepare_egio_blastn.py", line 205, in <module>
    summaryblastn(args.species1, args.blast1, args.exonanno1, args.species2, args.blast2, args.exonanno2, args.orthogen, args.coverage)
  File "/public/home/zwliu/software/EGIO/prepare_egio_blastn.py", line 20, in summaryblastn
    tf = open(str(exonanno1))
FileNotFoundError: [Errno 2] No such file or directory: 'extrainfo/Ghir.exon'
Traceback (most recent call last):
  File "/public/home/zwliu/software/EGIO/EGIO.py", line 1923, in <module>
    blastexon = pd.read_table(str(args.blastn),header=0,sep='\t')
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1405, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'extrainfo/blastn_mapping_Ghir-Gbar.tab'

Is this error related to the GTF file? How should I solve it?

Best!

wu-lab-egio commented 4 months ago

It seems that some information is missing from your StringTie output GTF file. I have fixed this bug and updated the script. Hope that helps.
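The updated script itself is not shown in this thread. The IndexError reported earlier is consistent with a lookup that assumes every GTF record carries a particular attribute, so here is a minimal sketch, assuming the goal is only to read an optional attribute without crashing. The names `parse_gtf_attributes` and `get_attr`, the default value "NA", and the use of `gene_biotype` as the absent key are all hypothetical and are not taken from EGIO.

```python
import re

# Matches one `key "value";` pair in the ninth (attribute) column of a GTF line.
_ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(attr_field):
    """Return the GTF attribute column as a dict of key -> value."""
    return dict(_ATTR_RE.findall(attr_field))


def get_attr(gtf_line, key, default="NA"):
    """Look up one attribute on a tab-separated GTF line, tolerating its absence."""
    columns = gtf_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return default
    return parse_gtf_attributes(columns[8]).get(key, default)


line = 'Gbar_A01\tStringTie\tgene\t27440\t34713\t1000\t+\t.\tgene_id "MSTRG.1";'
print(get_attr(line, "gene_id"))       # MSTRG.1
print(get_attr(line, "gene_biotype"))  # NA (hypothetical key, absent from this record)
```

Returning a default instead of indexing into a possibly empty list means that a StringTie record with only `gene_id` and `transcript_id` no longer aborts the run.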

Please feel free to leave a message if there is any problem with the script!

Best!

Tang-pro commented 4 months ago

Hi, @wu-lab-egio

Thank you, that's so nice!

But I encountered a new problem. The exonfasta files for both species are empty. Is this related to BLAST?

BLAST options error: File extrainfo/Ghir.exonfasta is empty
BLAST options error: File extrainfo/Gbar.exonfasta is empty
BLAST Database error: No alias or index file found for nucleotide database [extrainfo/Gbar/Gbar] in search path [/public/home/zwliu/EGIO::]
BLAST Database error: No alias or index file found for nucleotide database [extrainfo/Ghir/Ghir] in search path [/public/home/zwliu/EGIO::]
Traceback (most recent call last):
  File "/public/home/zwliu/EGIO/__prepare_egio_blastn.py", line 205, in <module>
    summaryblastn(args.species1, args.blast1, args.exonanno1, args.species2, args.blast2, args.exonanno2, args.orthogen, args.coverage)
  File "/public/home/zwliu/EGIO/__prepare_egio_blastn.py", line 147, in summaryblastn
    store[0] = maplist[countlsr]
IndexError: list index out of range
Traceback (most recent call last):
  File "/public/home/zwliu/EGIO/__EGIO.py", line 1922, in <module>
    blastexon = pd.read_table(str(args.blastn),header=0,sep='\t')
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1405, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1880, in _make_engine
    self.handles = get_handle(
  File "/public/home/zwliu/miniconda3/lib/python3.10/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'extrainfo/blastn_mapping_Ghir-Gbar.tab'

My command is as follows:

module load /BLAST+/2.15.0
./_RUN_egio.sh -s Ghir -S Gbar -r example/Ghincfib_gene.gtf -R example/Gb_Inclufib_gene.gtf -e example/Ghirsutum_transcript.fa -E example/Gbarbadense_transcript.fa -o example/Ghirsutum_transcript.fa.transdecoder.cds -O example/Gbarbadense_transcript.fa.transdecoder.cds -h example/homo_genepairs.txt -p 12 -i 0.8 -c 0.8 -m 2 -n -2 -g -1

Tang-pro commented 4 months ago

Hi, @wu-lab-egio

Meanwhile, my homo_genepairs.txt is as follows:

head homo_genepairs.txt
Ghir             Gbar
MSTRG.36527      Gbar.D11G028220
MSTRG.21471      MSTRG.24672
MSTRG.21376      MSTRG.8294
Ghir.D02G002070  Gbar.A02G001670
MSTRG.16082      MSTRG.26602
MSTRG.4517       MSTRG.5214

Because the annotations for the two species (Ghir and Gbar) were previously merged with StringTie, some MSTRG* IDs are duplicated between them. Will this affect the calculation?
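Whether such shared IDs affect EGIO is not answered in this thread. As a minimal sketch for checking how many IDs are affected, assuming a whitespace-separated two-column file with one header row, the snippet below lists IDs that occur in both species columns. The function name `shared_ids` is hypothetical, and the last row of the example was added purely to illustrate a collision; it is not from the real file.

```python
def shared_ids(pairs_text):
    """Return IDs that occur in both columns of a two-column gene-pair table."""
    rows = [line.split() for line in pairs_text.strip().splitlines()[1:]]  # skip header
    rows = [r for r in rows if len(r) >= 2]
    left = {r[0] for r in rows}
    right = {r[1] for r in rows}
    return sorted(left & right)


example = """Ghir Gbar
MSTRG.36527 Gbar.D11G028220
MSTRG.21471 MSTRG.24672
MSTRG.4517 MSTRG.5214
MSTRG.5214 MSTRG.4517
"""  # last row is hypothetical, added only to show a collision

print(shared_ids(example))  # ['MSTRG.4517', 'MSTRG.5214']
```

An empty result would mean that no ID is used for genes in both species within the pair file, although the GTF and FASTA files could still share IDs.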

Looking forward to your reply!

Best!

Tang-pro commented 4 months ago

Hi, @wu-lab-egio

image

I guess it's still a problem with the GTF format. This is part of my GTF file; I can provide the full GTF file if needed.

I would be very grateful for your help.

wu-lab-egio commented 4 months ago

Could you please also show the CDS and mRNA FASTA files? That may help to find the bug.

Tang-pro commented 4 months ago

Sure, this is the CDS FASTA file:

image

and this is the mRNA FASTA file:

image

Tang-pro commented 4 months ago

Note that some IDs start with MSTRG:

image

wu-lab-egio commented 4 months ago

Hi,

I tested the files you sent and found that the problem is caused by the gene and transcript name format.

To fit the Ensembl file format, the pipeline removes the transcript version information (for example, the ".1" in ENSMMUT00000085913.1) by dropping the information after the "." symbol. In your custom files, the "." symbol is used as a separator inside the gene ID and transcript ID (for example: gene_id "MSTRG.23024"; transcript_id "MSTRG.23024.2";), so these IDs are truncated as well. This causes the pipeline to fail and to generate empty xx.exonfasta and xx.exon files.

Here is my solution: replace the "." symbol with "_" in all your custom files, so that gene_id "MSTRG.23024" and transcript_id "MSTRG.23024.2" become "MSTRG_23024" and "MSTRG_23024_2". Then rerun the pipeline.
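To illustrate the effect described above, here is a minimal sketch. It assumes the version is stripped by keeping only the text before the first "."; the thread does not show the pipeline's actual code, and `strip_version` is a hypothetical name.

```python
def strip_version(identifier):
    # Assumed behaviour: keep only the text before the first "."
    return identifier.split(".")[0]


print(strip_version("ENSMMUT00000085913.1"))  # ENSMMUT00000085913
print(strip_version("MSTRG.23024"))           # MSTRG
print(strip_version("MSTRG.23024.2"))         # MSTRG
```

Under this assumption every StringTie ID collapses to the same string, "MSTRG", so records can no longer be matched between the GTF and FASTA files.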
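One way to apply this renaming is sketched below; it is not a script from the EGIO repository. A blanket search-and-replace would also alter the "." placeholders in other GTF columns, so the sketch touches only the `gene_id` and `transcript_id` values in GTF lines and the sequence ID (the first token) of FASTA header lines. All function names are hypothetical, and the gene-pair file and any other place where the IDs appear would need the same renaming.

```python
import re

# Captures the value of gene_id / transcript_id; \b avoids matching e.g. ref_gene_id.
_ID_ATTR_RE = re.compile(r'(\b(?:gene_id|transcript_id) ")([^"]*)(")')


def fix_gtf_line(line):
    """Replace "." with "_" inside gene_id and transcript_id values only."""
    return _ID_ATTR_RE.sub(
        lambda m: m.group(1) + m.group(2).replace(".", "_") + m.group(3), line
    )


def fix_fasta_line(line):
    """Replace "." with "_" in the sequence ID of a FASTA header line."""
    if not line.startswith(">"):
        return line  # sequence lines are left untouched
    head, sep, rest = line[1:].partition(" ")
    return ">" + head.replace(".", "_") + sep + rest


def rewrite(src_path, dst_path, fix_line):
    """Write a copy of src_path to dst_path with fix_line applied to every line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(fix_line(line))


gtf = 'Gbar_A01\tStringTie\texon\t27440\t27578\t1000\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";'
print(fix_gtf_line(gtf))
print(fix_fasta_line(">MSTRG.1.1 gene=MSTRG.1"))
```

A call such as `rewrite("in.gtf", "out.gtf", fix_gtf_line)` (placeholder file names) would produce a renamed copy and leave the original file intact. Text after the first space in a FASTA header is not changed by this sketch.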

Best!

wu-lab-egio commented 4 months ago

Here are screenshots of the xx.exonfasta and xx.exon files generated using your files as input.

Screenshot 2024-05-21 10 11 32: xx.exon file

Screenshot 2024-05-21 10 12 16: xx.exonfasta