zktuong / dandelion

dandelion - A single cell BCR/TCR V(D)J-seq analysis package for 10X Chromium 5' data
https://sc-dandelion.readthedocs.io/
GNU Affero General Public License v3.0

reannotate_genes encountered Unknown argument: "c_region_db" Error and so on #428

Closed Cscharlotte closed 2 weeks ago

Cscharlotte commented 2 weeks ago

Description of the bug

Hi,

Thank you for developing the dandelion package! I am trying to reannotate my 10x TCR data (a single dataset, for testing), but an error was raised whose cause I cannot identify. The code is shown below. My sample folder is '~/dandelion/dandelion_data/10kPBMC_CFCOV', which contains 'all_contig_annotations.csv' and 'all_contig.fasta'.

I am using the current GitHub developer version of dandelion, to avoid the conflict with anndata in the 0.3.8 release.

Minimal reproducible example

import sys
import os
import dandelion as ddl

os.environ["IGDATA"] = "~/dandelion/Github/dandelion/container/database/igblast/"
os.environ["GERMLINE"] = "~/dandelion/Github/dandelion/container/database/germlines/"
os.environ["BLASTDB"] = "~/dandelion/Github/dandelion/container/database/blast/"
os.chdir(os.path.expanduser("~/dandelion/dandelion_data"))

samples = "10kPBMC_CFCOV"
ddl.pp.format_fastas(samples, prefix = samples, filename_prefix="all")
ddl.pp.reannotate_genes(samples, loci = "tr", filename_prefix="all")

The error message produced by the code above

Assigning genes :   0%|          | 0/1 [00:00<?, ?it/s]
USAGE
  igblastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-germline_db_V germline_database_name]
    [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
    [-germline_db_D germline_database_name] [-num_alignments_D int_value]
    [-germline_db_D_seqidlist filename]
    [-germline_db_J germline_database_name] [-num_alignments_J int_value]
    [-germline_db_J_seqidlist filename] [-auxiliary_data filename]
    [-min_D_match min_D_match] [-D_penalty D_penalty] [-J_penalty J_penalty]
    [-num_clonotype num_clonotype] [-clonotype_out clonotype_out]
    [-allow_vdj_overlap] [-organism germline_origin]
    [-domain_system domain_system] [-ig_seqtype sequence_type]
    [-focus_on_V_segment] [-extend_align5end] [-min_V_length Min_V_Length]
    [-min_J_length Min_J_Length] [-show_translation] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-window_size int_value]
    [-off_diagonal_range int_value] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis]
    [-num_descriptions int_value] [-num_alignments int_value]
    [-line_length line_length] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-version]

DESCRIPTION
   Nucleotide-Nucleotide BLAST for immunoglobulin sequences 2.6.1+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Unknown argument: "c_region_db"
Error:  (CArgException::eInvalidArg) Unknown argument: "c_region_db"
USAGE
  igblastn [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-germline_db_V germline_database_name]
    [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
    [-germline_db_D germline_database_name] [-num_alignments_D int_value]
    [-germline_db_D_seqidlist filename]
    [-germline_db_J germline_database_name] [-num_alignments_J int_value]
    [-germline_db_J_seqidlist filename] [-auxiliary_data filename]
    [-min_D_match min_D_match] [-D_penalty D_penalty] [-J_penalty J_penalty]
    [-num_clonotype num_clonotype] [-clonotype_out clonotype_out]
    [-allow_vdj_overlap] [-organism germline_origin]
    [-domain_system domain_system] [-ig_seqtype sequence_type]
    [-focus_on_V_segment] [-extend_align5end] [-min_V_length Min_V_Length]
    [-min_J_length Min_J_Length] [-show_translation] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
    [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
    [-best_hit_score_edge float_value] [-window_size int_value]
    [-off_diagonal_range int_value] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis]
    [-num_descriptions int_value] [-num_alignments int_value]
    [-line_length line_length] [-max_target_seqs num_sequences]
    [-num_threads int_value] [-remote] [-version]

DESCRIPTION
   Nucleotide-Nucleotide BLAST for immunoglobulin sequences 2.6.1+

Use '-help' to print detailed descriptions of command line arguments
========================================================================

Error: Unknown argument: "c_region_db"
Error:  (CArgException::eInvalidArg) Unknown argument: "c_region_db"
ERROR> Input 10kPBMC_CFCOV/dandelion/tmp/all_contig_igblast.fmt7 does not exist.

ERROR> Input 10kPBMC_CFCOV/dandelion/tmp/all_contig_igblast.fmt7 does not exist.

**BLAST Database error: Error: Not a valid version 4 database.**
**BLAST Database error: Error: Not a valid version 4 database.**
Assigning genes : 100%|██████████| 1/1 [00:17<00:00, 17.14s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/dandelion/lib/python3.10/site-packages/sc_dandelion-0.3.9.dev4+g3faaadd-py3.10.egg/dandelion/preprocessing/_preprocessing.py", line 1150, in reannotate_genes
    rename_dandelion(
  File "~/miniconda3/envs/dandelion/lib/python3.10/site-packages/sc_dandelion-0.3.9.dev4+g3faaadd-py3.10.egg/dandelion/utilities/_io.py", line 1016, in rename_dandelion
    fp = filePath.parent / filePath.name.rsplit(ends_with)[0]
AttributeError: 'NoneType' object has no attribute 'parent'

OS information

Linux

Version information

dandelion==0.3.9.dev4 pandas==2.2.2 numpy==1.26.4 matplotlib==3.9.2 networkx==2.7 scipy==1.14.1

Additional context

I retried in a new window, and now, instead of the errors marked with asterisks above, I get the error below. It looks like a mix-up between blast and igblast?

BLAST Database error: No alias or index file found for nucleotide database [home/dandelion/Github/dandelion/container/database/igblast/database/imgt_human_tr_j] in search path [home/dandelion:home/dandelion/Github/dandelion/container/database/blast:]
BLAST Database error: No alias or index file found for nucleotide database [home/dandelion/Github/dandelion/container/database/igblast/database/imgt_human_tr_d] in search path [home/dandelion:home/dandelion/Github/dandelion/container/database/blast:]

zktuong commented 2 weeks ago

hi @Cscharlotte, thanks for your interest in the package!

The issue you are encountering is due to an old version of igblast (and likely blast too) on your system. c_region_db was added in igblast v1.18.0, ~3 years ago: https://ncbi.github.io/igblast/rel/Release-notes.html

the latest versions of igblast and blast are 1.22.0 and 2.16.0+ respectively. Just update your igblast and blast and they should hopefully work?
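
To confirm which igblast is actually picked up from the PATH, one option is to parse the output of `igblastn -version`. The following is only a rough sketch: the helper names `parse_version` and `igblastn_supports_c_region_db` are hypothetical (they are not part of dandelion), and it assumes the output contains a line such as `Package: igblast 1.22.0`, which may differ between releases.

```python
import re
import shutil
import subprocess

# Per the igblast release notes linked above, c_region_db arrived in v1.18.0.
MIN_IGBLAST = (1, 18, 0)


def parse_version(text):
    """Return an (x, y, z) version tuple found in `text`, or None.

    Prefers a "igblast x.y.z" package line (assumed format) and falls
    back to the first dotted x.y.z number anywhere in the text.
    """
    match = re.search(r"igblast\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def igblastn_supports_c_region_db(executable="igblastn"):
    """Run `igblastn -version` and compare the reported version with MIN_IGBLAST."""
    path = shutil.which(executable)
    if path is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    version = parse_version(result.stdout)
    if version is None:
        raise RuntimeError("could not parse a version from the igblastn output")
    return version >= MIN_IGBLAST
```

If `igblastn_supports_c_region_db()` returns `False` inside the environment where dandelion runs, an older igblast is probably shadowing the updated one.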

Cscharlotte commented 2 weeks ago

I have upgraded both, and the error above is resolved, but I am now getting the errors below:

Assigning genes : 0%| | 0/1 [00:00<?, ?it/s]
BLAST query/options error: Germline annotation database human/human_TR_V could not be found in [internal_data] directory
Please refer to the BLAST+ user manual.
BLAST query/options error: Germline annotation database human/human_TR_V could not be found in [internal_data] directory
Please refer to the BLAST+ user manual.
ERROR> Input 10kPBMC_CFCOV/dandelion/tmp/all_contig_igblast.fmt7 does not exist.

ERROR> Input 10kPBMC_CFCOV/dandelion/tmp/all_contig_igblast.fmt7 does not exist.

BLAST Database error: Database memory map file error
BLAST Database error: Database memory map file error
Assigning genes : 100%|██████████| 1/1 [00:09<00:00, 9.30s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "home/miniconda3/envs/dandelion/lib/python3.10/site-packages/sc_dandelion-0.3.9.dev4+g3faaadd-py3.10.egg/dandelion/preprocessing/_preprocessing.py", line 1150, in reannotate_genes
    rename_dandelion(
  File "home/miniconda3/envs/dandelion/lib/python3.10/site-packages/sc_dandelion-0.3.9.dev4+g3faaadd-py3.10.egg/dandelion/utilities/_io.py", line 1016, in rename_dandelion
    fp = filePath.parent / filePath.name.rsplit(ends_with)[0]
AttributeError: 'NoneType' object has no attribute 'parent'

I have noticed similar issues mentioned in the tracer repo and, more recently, in #382. I tried setting the igblast_db parameter to either the internal_data folder in the container folder or home/miniconda3/pkgs/igblast-1.22.0-h6a68c12_1/bin, but neither worked. Could you provide detailed guidance on how to solve this problem? Thanks.

zktuong commented 2 weeks ago

can you try something like:

ddl.pp.reannotate_genes(
    folder,
    igblast_db="path/to/database/igblast",
    germline="path/to/database/germlines/imgt/human/vdj",
)

zktuong commented 2 weeks ago

one more thing to add: if it still complains that the indexes are outdated, you might want to use the container script to prepare the database: https://github.com/zktuong/dandelion/blob/master/container/scripts/prepare_imgt_database.py

or ideally, if you can run the singularity container, all of this will be trivialised - it doesn't need anndata for the preprocessing
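
Before rebuilding anything, it may help to check whether index files exist at all for a given database prefix, since the "No alias or index file found" error points to a prefix with nothing behind it. The sketch below is an assumption-laden illustration: `has_nucleotide_index` is a hypothetical helper, and it only looks for a `.nin` index file or a `.nal` alias file next to the prefix, which does not prove the database is in a format the installed blast can read.

```python
from pathlib import Path

# Files that usually accompany a nucleotide BLAST database prefix;
# ".nal" is the alias file used for multi-volume databases.
INDEX_SUFFIXES = (".nin", ".nal")


def has_nucleotide_index(db_prefix):
    """Return True if an index or alias file exists for the given database prefix."""
    prefix = Path(db_prefix).expanduser()
    return any(
        prefix.with_name(prefix.name + suffix).exists() for suffix in INDEX_SUFFIXES
    )
```

For example, `has_nucleotide_index("path/to/database/igblast/database/imgt_human_tr_j")` returning `False` would suggest the database still needs to be prepared at that location.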

Cscharlotte commented 2 weeks ago

Defining the germline path unfortunately did not work for me, so I re-read the replies in https://github.com/Teichlab/tracer/issues/48. The solution that works for me now is to download the internal_data folder from NCBI and put it in the /bin folder of the igblast executable. After doing this, the following command works: ddl.pp.reannotate_genes(samples, loci="tr", reassign_dj=True, filename_prefix="all", igblast_db='home/dandelion/Github/dandelion/container/database/igblast').

I suspect the issue was caused by the missing internal_data folder in the igblast executable directory. This folder seems to normally be included with the igblast installation from version 1.13.0 onwards. However, my conda installation of igblast (although version 1.22+) did not include it for some reason, which likely caused the problem. Anyway, thank you for taking the time to address all the issues!
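
For anyone hitting the same problem, a small check along these lines may show whether internal_data is present where igblast might look for it. This is a sketch under assumptions: `candidate_dirs` and `find_internal_data` are hypothetical helpers, and the two locations checked (the directory named by the IGDATA environment variable and the directory holding the igblastn executable) are based only on the setup and fix described in this thread.

```python
import os
import shutil
from pathlib import Path


def candidate_dirs(executable="igblastn"):
    """List directories where an internal_data folder might plausibly live."""
    dirs = []
    igdata = os.environ.get("IGDATA")
    if igdata:
        dirs.append(Path(igdata).expanduser())
    exe = shutil.which(executable)
    if exe:
        # Directory that holds the executable, e.g. the conda env's bin folder.
        dirs.append(Path(exe).resolve().parent)
    return dirs


def find_internal_data(dirs):
    """Return the first directory in `dirs` that contains internal_data, else None."""
    for directory in dirs:
        if (Path(directory) / "internal_data").is_dir():
            return Path(directory)
    return None
```

Calling `find_internal_data(candidate_dirs())` and getting `None` would be consistent with the missing-folder situation described above.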

zktuong commented 2 weeks ago

glad that it worked out eventually!