zolotarovgl / GeneExt

GeneExt - Gene extension for improved scRNA-seq data counting
GNU General Public License v3.0

GeneExt removes GTF entries? #9

Closed seveein closed 4 weeks ago

seveein commented 4 weeks ago

Hello, first of all, thanks for this beautiful tool! I am having an issue with running GeneExt on some plant annotations.

Either the progress stops during 'Preflight checks' without notice, or it gets canceled with this call:

  File "GeneExt/geneext.py", line 611, in <module>
    helper.check_file_size(genefile,verbose = verbose)
  File "GeneExt/geneext/helper.py", line 1847, in check_file_size
    raise FileSizeError(f"File '{filename}' is empty.")
geneext.helper.FileSizeError: File 'GeneExt/data/ANNOTATION.gtf' is empty.

Afterwards, the GTF is empty. Do you have any ideas about the source of this issue? I curated the GTF with AGAT before. Computational resources shouldn't be a problem either. Thanks in advance! Best s

zolotarovgl commented 4 weeks ago

Dear Seveein, For sure, it's some bug. Can you please list the contents of the temporary folder with ls -hlat, so we can see which files are empty?

Also, can you share the exact command you've used to run the tool?

Kind regards, Grygoriy

seveein commented 4 weeks ago

Hey, there is only one file in the tmp-dir:

-rw-r--r-- 1 user suaph 605 Jun 25 07:29 chr_sizes.tsv

The .log file contains only 'Preflight checks'.

To be specific, the INPUT GTF is empty after running GeneExt. thx, s

zolotarovgl commented 4 weeks ago

OK, it looks like a bedtools error; bedtools usually dies silently. I guess there is a difference between the chromosome names in the FASTA file and those in the GFF or .bam. Would you be so kind as to compare them? Also, can you run GeneExt with -v 3 so that everything is printed to stdout?

Thx, Grisha
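The name comparison suggested above could be sketched roughly as follows. This is a minimal sketch, not part of GeneExt: it assumes the tmp/chr_sizes.tsv file seen in this thread is a tab-separated file with the sequence name in the first column, and the function names (read_genome_names, read_gtf_seqnames, missing_seqnames) are hypothetical helpers introduced only for illustration.

```python
def read_genome_names(chr_sizes_path):
    """Collect sequence names from the first column of a tab-separated genome file."""
    names = set()
    with open(chr_sizes_path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                names.add(line.split("\t")[0])
    return names


def read_gtf_seqnames(gtf_path):
    """Collect sequence names from column 1 of a GTF, skipping comments and blank lines."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            names.add(line.split("\t", 1)[0].strip())
    return names


def missing_seqnames(gtf_path, chr_sizes_path):
    """Return the sorted GTF sequence names that are absent from the genome file."""
    return sorted(read_gtf_seqnames(gtf_path) - read_genome_names(chr_sizes_path))
```

Calling missing_seqnames("ANNOTATION.gtf", "tmp/chr_sizes.tsv") on a copy of the annotation would list any sequence names, such as leftover scaffolds, that appear in the GTF but not in the genome file.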

seveein commented 4 weeks ago

-v 3 output:

Temporary directory isn't set, setting to tmp/
Temporary directory tmp/ found!
Alignment file ... OK
Genome annotation file .... OK
Input: /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf, guessed format: gtf
Output: /gxfs_home/cau/suaph281/programs/GeneExt/data/habr_genext.gtf, guessed format: gtf
Checking gene exons...

Then it aborts. The .bam I'm testing now contains reads mapped against a concatenated reference (two different organisms). I removed the reads of the non-focus organism. However, this brought no improvement. Best, s-

Edit: Repeating with adjusted bam-header (no non-target-chromosomes left) brought the old error back:

Temporary directory isn't set, setting to tmp/
Temporary directory tmp/ found!
Alignment file ... OK
Genome annotation file .... OK
Input: /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf, guessed format: gtf
Output: /gxfs_home/cau/suaph281/programs/GeneExt/data/habr_genext.gtf, guessed format: gtf
Checking gene exons...
Running: bedtools sort -i /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf -g tmp/chr_sizes.tsv > /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf.reord; mv /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf.reord /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf
Done reordering genefile.
Done reordering by bam: /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf
Traceback (most recent call last):
  File "/gxfs_home/cau/suaph281/programs/GeneExt/geneext.py", line 611, in <module>
    helper.check_file_size(genefile,verbose = verbose)
  File "/gxfs_home/cau/suaph281/programs/GeneExt/geneext/helper.py", line 1847, in check_file_size
    raise FileSizeError(f"File '{filename}' is empty.")
geneext.helper.FileSizeError: File '/gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf' is empty.

Edit2: Chromosome-names match between the files.

bedtools is certainly involved. After this, the input-gtf is empty:

Running: bedtools sort -i /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf -g /gxfs_home/cau/suaph281/programs/GeneExt/tmp/chr_sizes.tsv > /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf.reord; mv /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf.reord /gxfs_home/cau/suaph281/programs/GeneExt/data/ANNOTATION.gtf
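The logged command joins the sort and the mv with ";", so the mv runs whatever the sort's exit status, and an empty .reord file can overwrite the input. One way to guard against that, shown as a rough sketch only and not as GeneExt's actual code, is to write to a temporary file and replace the target only when the command succeeded and produced output. The function name replace_if_command_succeeds is a hypothetical helper introduced here for illustration.

```python
import os
import subprocess
import tempfile


def replace_if_command_succeeds(cmd, target_path):
    """Run cmd, capture its stdout in a temp file, and replace target_path
    only if the command exited with status 0 and wrote non-empty output."""
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".reord")
    try:
        with os.fdopen(fd, "wb") as out:
            result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
        if result.returncode != 0:
            message = result.stderr.decode(errors="replace").strip()
            raise RuntimeError(f"command failed with status {result.returncode}: {message}")
        if os.path.getsize(tmp_path) == 0:
            raise RuntimeError("command produced no output; target left untouched")
        os.replace(tmp_path, target_path)
    finally:
        # Remove the temp file if it was not moved into place.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

With the paths from the log, the call would look like replace_if_command_succeeds(["bedtools", "sort", "-i", gtf, "-g", "tmp/chr_sizes.tsv"], gtf). The thread does not confirm whether bedtools exits non-zero in this situation, which is why the sketch also checks for empty output.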

seveein commented 4 weeks ago

The hint towards bedtools/chromosome name issues helped resolve the problem! The GTF still had some entries from residual scaffolds. It worked after removing those. It's unfortunate that the input GTF is cleared during this, though. Maybe this can be handled differently.

Thanks! best s.
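Removing the residual scaffold entries, as described above, might look like the following. It is a minimal sketch under assumptions: the genome file is taken to be tab-separated with sequence names in the first column, the filtered annotation is written to a new file instead of overwriting the input, and filter_gtf_by_genome is a hypothetical helper name, not a GeneExt function.

```python
def filter_gtf_by_genome(gtf_in, chr_sizes_path, gtf_out):
    """Copy GTF records whose sequence name is listed in the genome file.

    Comment and blank lines are copied unchanged.
    Returns a (kept, dropped) tuple of record counts.
    """
    with open(chr_sizes_path) as handle:
        keep = {line.split("\t")[0].strip() for line in handle if line.strip()}

    kept = 0
    dropped = 0
    with open(gtf_in) as src, open(gtf_out, "w") as dst:
        for line in src:
            if line.startswith("#") or not line.strip():
                dst.write(line)
                continue
            seqname = line.split("\t", 1)[0].strip()
            if seqname in keep:
                dst.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

Writing to a separate output path keeps the original annotation intact, so a failed or unexpected run can be repeated without restoring the file from a backup.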

zolotarovgl commented 4 weeks ago

Yeah, it's very annoying. I will try to add better reporting. Thanks for taking the time to flag this issue! Let me know if I can help you with anything else.

Grygoriy Zolotarov, MD PhD Student Centre for Genomic Regulation C/ del Dr. Aiguader 88 08003 Barcelona
