zhanxw / seqminer

Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
http://zhanxw.github.io/seqminer/
Other
30 stars 12 forks source link

`readVCFToListByRange`: Report 'VCF/BCF Files does not have header!' #26

Open bschilder opened 2 years ago

bschilder commented 2 years ago

seqminer::readVCFToListByRange currently crashes Rstudio. The VCF I'm querying is definitely not missing a valid header (contrary to the error message).

Moreover, even if the file was missing a header, the function should be able to handle this situation more gracefully than causing Rstudio to crash (and lose all the user's variables). Instead, a standard stop message should be triggered.

Reprex 1

 target_path <- paste(
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
        "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
        sep="/")
query_str <- "4:14884541-16649679"

main_cols <- c("CHROM","POS","ID","REF","ALT","QUAL","FILTER","INFO","FORMAT")
samples <- c('HG00097','HG00099','HG00100','HG00101','HG00102')
vcfColumn <- unique(c(main_cols,toupper(samples)))

 out <- seqminer::readVCFToListByRange(fileName = target_path,
                                              range = query_str, 
                                              vcfColumn = vcfColumn)

Error

Screenshot 2022-04-10 at 13 02 58

Reprex 2

This imports the table successfully.

dat <- seqminer::tabix.read.table(tabixFile = target_path, 
                                          tabixRange = query_str)
dim(dat)
## [1] 24376  1101

For further discussion on querying VCFs using various R packages, see here.

Session info

``` R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] echotabix_0.99.5 dplyr_1.0.8 VariantAnnotation_1.40.0 Rsamtools_2.10.0 [5] Biostrings_2.62.0 XVector_0.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0 [9] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 IRanges_2.28.0 S4Vectors_0.32.4 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 BiocGenerics_0.40.0 loaded via a namespace (and not attached): [1] fs_1.5.2 bitops_1.0-7 lubridate_1.8.0 bit64_4.0.5 [5] filelock_1.0.2 progress_1.2.2 httr_1.4.2 rprojroot_2.0.3 [9] gh_1.3.0 tools_4.1.3 utf8_1.2.2 R6_2.5.1 [13] DT_0.22 DBI_1.1.2 withr_2.5.0 tidyselect_1.1.2 [17] prettyunits_1.1.1 bit_4.0.4 curl_4.3.2 compiler_4.1.3 [21] cli_3.2.0 xml2_1.3.3 desc_1.4.1 DelayedArray_0.20.0 [25] rtracklayer_1.54.0 readr_2.1.2 rappdirs_0.3.3 stringr_1.4.0 [29] digest_0.6.29 piggyback_0.1.1 R.utils_2.11.0 htmltools_0.5.2 [33] pkgconfig_2.0.3 echodata_0.99.7 dbplyr_2.1.1 fastmap_1.1.0 [37] BSgenome_1.62.0 htmlwidgets_1.5.4 rlang_1.0.2 rstudioapi_0.13 [41] RSQLite_2.2.12 BiocIO_1.4.0 generics_0.1.2 jsonlite_1.8.0 [45] echoconda_0.99.5 BiocParallel_1.28.3 zip_2.2.0 R.oo_1.24.0 [49] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.7 Matrix_1.4-1 [53] Rcpp_1.0.8.3 fansi_1.0.3 reticulate_1.24-9000 lifecycle_1.0.1 [57] R.methodsS3_1.8.1 stringi_1.7.6 yaml_2.3.5 zlibbioc_1.40.0 [61] brio_1.1.3 BiocFileCache_2.2.1 grid_4.1.3 blob_1.2.2 [65] parallel_4.1.3 crayon_1.5.1 lattice_0.20-45 GenomicFeatures_1.46.5 [69] hms_1.1.1 KEGGREST_1.34.0 seqminer_8.4 pillar_1.7.0 [73] rjson_0.2.21 clisymbols_1.2.0 biomaRt_2.50.3 pkgload_1.2.4 [77] XML_3.99-0.9 glue_1.6.2 data.table_1.14.2 tzdb_0.3.0 [81] png_0.1-7 vctrs_0.4.0 testthat_3.1.3 tidyr_1.2.0 [85] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.6 openxlsx_4.2.5 [89] restfulr_0.0.13 tibble_3.1.6 GenomicAlignments_1.30.0 AnnotationDbi_1.56.2 [93] memoise_2.0.1 ellipsis_0.3.2 ```
bschilder commented 2 years ago

FYI, same error occurs with readvCFToMatrixByRange

 target_path <- paste(
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
        "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
        sep="/")
query_str <- "4:14884541-16649679"

out <- seqminer: :readvCFToMatrixByRange(fileName = target_path,
range = query_str)

Screenshot 2022-04-10 at 14 07 48