wu-lab-egio / EGIO

Exon Group Ideogram based detection of Orthologous exons and Orthologous isoforms
8 stars 0 forks source link

Error parsing blast results: `IndexError: list index out of range` #8

Open clemgoub opened 5 days ago

clemgoub commented 5 days ago

Hello! I am really looking forward to using your tool which seems very useful. I am encountering an error for which I cannot seem to find a solution. I am trying to get orthologous exons between human and mice. All my data comes from ensembl.

Command: (system is MacOSX Catalina, arm64)

./_RUN_egio.sh -s hg38 -S mm39 -r ../Homo_sapiens.GRCh38.112.gtf -R ../Mus_musculus.GRCm39.112.gtf -e ../Homo_sapiens.GRCh38.cdna.all.fa -E ../Mus_musculus.GRCm39.cdna.all.fa -o ../Homo_sapiens.GRCh38.cds.all.fa -O ../Mus_musculus.GRCm39.cds.all.fa -h ../hg38-mm39.ORTHOLOGS.tsv -p 6 -i 0.8 -c 0.8 -m 2 -n -2 -g -1

Output:

species1 is hg38
species2 is mm39
identity is 0.8
coverage is 0.8
match score is 2
mismatch penalty is -2
gap penalty is -1
use 6 core
prepare required files of egio
extract orf position based on cDNA and CDS sequence
206723 of mRNA sequence is read
123495 CDS sequence is read
extract transcript information
extract coding exon sequence from 63140 genes
..........10%
..........20%
..........30%
..........40%
..........50%
..........60%
..........70%
..........80%
..........90%
..........100%

transform CDS-Exon into fasta

Building a new DB, current time: 07/03/2024 13:30:25
New DB name:   /Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/extrainfo/hg38/hg38
New DB title:  extrainfo/hg38.exonfasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/extrainfo/hg38/hg38
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 282465 sequences in 2.38058 seconds.

extract orf position based on cDNA and CDS sequence
115988 of mRNA sequence is read
66464 CDS sequence is read
extract transcript information
extract coding exon sequence from 57186 genes
..........10%
..........20%
..........30%
..........40%
..........50%
..........60%
..........70%
..........80%
..........90%
..........100%

transform CDS-Exon into fasta

Building a new DB, current time: 07/03/2024 13:31:14
New DB name:   /Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/extrainfo/mm39/mm39
New DB title:  extrainfo/mm39.exonfasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/extrainfo/mm39/mm39
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 243324 sequences in 1.9633 seconds.

reciprocal blastn of hg38 and mm39: hg38 as query
reciprocal blastn of hg38 and mm39: mm39 as query
summarize reciprocal blastN

organize orthologous gene pairs
organize cds region information
filter BLASTN results within orthologous gene pairs
analysis species1
100000 records are analyzed
135049 records are analyzed
analysis species2
100000 records are analyzed
134971 records are analyzed
filter BLASTN results with reciprocal coverage over threshold
filter BLASTN results with reciprocal best hit

Traceback (most recent call last):
  File "/Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/__prepare_egio_blastn.py", line 205, in <module>
    summaryblastn(args.species1, args.blast1, args.exonanno1, args.species2, args.blast2, args.exonanno2, args.orthogen, args.coverage)
  File "/Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/__prepare_egio_blastn.py", line 147, in summaryblastn
    store[0] = maplist[countlsr]
               ~~~~~~~^^^^^^^^^^
IndexError: list index out of range
detect orthologous exons and isoforms between species hg38 and mm39
Traceback (most recent call last):
  File "/Users/clementgoubert/Documents/tmp_work/EGIO/EGIO/__EGIO.py", line 1922, in <module>
    blastexon = pd.read_table(str(args.blastn),header=0,sep='\t')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1282, in read_table
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/common.py", line 863, in get_handle
    handle = open(
             ^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'extrainfo/blastn_mapping_hg38-mm39.tab'

The blastn outputs are both present in ./extrainfo (and not empty)

.
├── README.md
├── _RUN_egio.sh
├── __EGIO.py
├── ___pairwisealign.c
├── ___pairwisealign.so
├── __plotIsoCom.py
├── __prepare_egio_blastn.py
├── __prepare_egio_extra.py
└── extrainfo
    ├── hg38
    │   ├── hg38.ndb
    │   ├── hg38.nhr
    │   ├── hg38.nin
    │   ├── hg38.njs
    │   ├── hg38.not
    │   ├── hg38.nsq
    │   ├── hg38.ntf
    │   └── hg38.nto
    ├── hg38-mm39
    ├── hg38.exon
    ├── hg38.exonfasta
    ├── hg38.tran
    ├── mm3
    ├── mm39
    │   ├── mm39.ndb
    │   ├── mm39.nhr
    │   ├── mm39.nin
    │   ├── mm39.njs
    │   ├── mm39.not
    │   ├── mm39.nsq
    │   ├── mm39.ntf
    │   └── mm39.nto
    ├── mm39-hg38
    ├── mm39.exon
    ├── mm39.exonfasta
    └── mm39.tran

Thanks a lot for your help!

Clément

wu-lab-egio commented 4 days ago

Hi,

I think this might be caused by genes that are annotated in your orthologous gene pairs but are not annotated in the ensembl. Could you please send me the hg38-mm39.ORTHOLOGS.tsv file, which will help to find the bug.

clemgoub commented 3 days ago

Hi,

Thanks for your quick reply! I see, here is the hg38-mm39.ORTHOLOGS.tsv file (I had to add .txt for Github to upload). hg38-mm39.ORTHOLOGS.tsv.txt

Thank you very much!

wu-lab-egio commented 3 days ago

Hi,

In your hg38-mm39.ORTHOLOGS.tsv file, the orthogolous gene pairs are both human ensembl IDs. The mm39 column should be mouse ensmeb gene IDs. To get orthologous gene pairs, please refer to (https://github.com/wu-lab-egio/EGIO) Requirement (4) section.

Best!