zhanxw / seqminer

Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
http://zhanxw.github.io/seqminer/

`tabix.createIndex`: [ti_index_core] the file out of order at line #### #25

Open bschilder opened 2 years ago

bschilder commented 2 years ago

Hello,

When trying to use tabix.createIndex to index a munged GWAS summary stats file (tab-separated and compressed by Rsamtools::bgzip), I keep getting the following error:

[ti_index_core] the file out of order at line 776464
Create tabix index failed for [ /Users/schilder/Downloads/pgc-bip2021-all_munged.tsv.bgz ]!

And yet, I've made sure to sort the file by CHR (chromosome) and BP (position). I even visually confirmed that the positions are in order at the line it is referencing (after it's been bgzip compressed):

[Image: Screenshot 2022-03-16 at 22.17.25]
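To complement the visual check, here is a minimal sketch of a programmatic one. `find_order_violations` is a hypothetical helper (not part of seqminer, MungeSumstats or Rsamtools). It uses base R only and assumes the chromosome and position columns are passed in as they appear in the file, i.e. read as text. It flags positions that are not plain digit strings (for example scientific notation), positions that decrease within a chromosome, and chromosomes that are split across non-contiguous blocks. Whether any of these is what actually triggers the error above is an assumption, not something that has been confirmed.

```r
# Hypothetical helper: report rows that could trip a tabix-style ordering check.
# `chrom` and `pos` are the chromosome and position columns, ideally read as text.
find_order_violations <- function(chrom, pos) {
  chrom <- as.character(chrom)
  pos_chr <- as.character(pos)
  n <- length(chrom)

  # Positions that are not plain digit strings (e.g. "1e+05")
  non_integer <- which(!grepl("^[0-9]+$", pos_chr))
  pos_num <- suppressWarnings(as.numeric(pos_chr))

  if (n < 2) {
    return(list(non_integer = non_integer,
                decreasing = integer(0),
                split_chrom = integer(0)))
  }

  # TRUE where a row is on the same chromosome as the row before it
  same_chrom <- chrom[-1] == chrom[-n]

  # Rows whose position is smaller than the previous row on the same chromosome
  decreasing <- which(same_chrom & pos_num[-1] < pos_num[-n]) + 1L

  # Rows that start a new block for a chromosome that already had a block
  block_start <- c(1L, which(!same_chrom) + 1L)
  split_chrom <- block_start[duplicated(chrom[block_start])]

  list(non_integer = non_integer,
       decreasing = decreasing,
       split_chrom = split_chrom)
}

# Toy data (made up for illustration, not taken from the real file)
chrom <- c("1", "1", "1", "2", "2", "1")
pos   <- c("100", "200", "150", "50", "1e+05", "300")
res <- find_order_violations(chrom, pos)
print(res)
```

On the real file, the columns could be read as text with `read.delim(gzfile(path), colClasses = "character")` and passed in as `dat$CHR` and `dat$BP`. The returned indices are data rows, so they would be offset from the line number in the error message by however many header lines the file has.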

Reprex

The data can be downloaded here.

#### Set up paths ####
fullSS_path_vcf <- "~/Downloads/pgc-bip2021-all.vcf.tsv.gz" 
fullSS_path_tsv <- gsub("\\.vcf","",fullSS_path_vcf) 
fullSS_path_munged <- gsub("-all","-all_munged",fullSS_path_tsv)
#### Edit ####
dat <- data.table::fread(fullSS_path_vcf, 
                         skip = "#CHROM")
colnames(dat) <- gsub("#","",colnames(dat))
#### Sort ####
data.table::setkey(dat, CHROM, POS)
#### Save ####
data.table::fwrite(x = dat, 
                   file = fullSS_path_tsv, 
                   sep="\t")
#### Munge ####
fullSS_path <- MungeSumstats::format_sumstats(path = fullSS_path_tsv, 
                                              save_path = fullSS_path_munged, 
                                              sort_coordinates = TRUE,
                                              log_folder = "~/Downloads/logs", 
                                              log_mungesumstats_msgs = TRUE, 
                                              log_folder_ind = TRUE)

#### Compress ####
bgz_file <- Rsamtools::bgzip(file = fullSS_path, 
                             overwrite = TRUE)

#### Index ####
seqminer::tabix.createIndex(
    bgzipFile = bgz_file,
    sequenceColumn = 2,
    startColumn = 3,
    endColumn = 3,
    ## Just use the first column's name (since none have the `#` symbol)
    metaChar = "SNP"
)

Any help would be appreciated.

Best, Brian

Session info

```
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] arrow_7.0.0      ggimage_0.3.0    ggplot2_3.3.5    dplyr_1.0.8      hexSticker_0.4.9
[6] echotabix_0.99.3

loaded via a namespace (and not attached):
  [1] AnnotationHub_3.2.2      BiocFileCache_2.2.1           systemfonts_1.0.4
  [4] igraph_1.2.11            BiocParallel_1.28.3           GenomeInfoDb_1.30.1
  [7] digest_0.6.29            yulab.utils_0.0.4             htmltools_0.5.2
 [10] magick_2.7.3             fansi_1.0.2                   magrittr_2.0.2
 [13] memoise_2.0.1            BSgenome_1.62.0               echoverseTemplate_0.99.0
 [16] ontologyPlot_1.6         openxlsx_4.2.5                Biostrings_2.62.0
 [19] matrixStats_0.61.0       R.utils_2.11.0                sysfonts_0.8.5
 [22] prettyunits_1.1.1        colorspace_2.0-3              blob_1.2.2
 [25] rappdirs_0.3.3           textshaping_0.3.6             xfun_0.30
 [28] crayon_1.5.0             RCurl_1.98-1.6                echodata_0.99.6
 [31] jsonlite_1.8.0           hexbin_1.28.2                 graph_1.72.0
 [34] VariantAnnotation_1.40.0 glue_1.6.2                    gtable_0.3.0
 [37] zlibbioc_1.40.0          XVector_0.34.0                DelayedArray_0.20.0
 [40] Rgraphviz_2.38.0         BiocGenerics_0.40.0           scales_1.1.1
 [43] DBI_1.1.2                Rcpp_1.0.8.2                  showtextdb_3.0
 [46] xtable_1.8-4             progress_1.2.2                gridGraphics_0.5-1
 [49] bit_4.0.4                clisymbols_1.2.0              stats4_4.1.0
 [52] DT_0.21                  htmlwidgets_1.5.4             httr_1.4.2
 [55] ontologyIndex_2.7        ellipsis_0.3.2                pkgconfig_2.0.3
 [58] XML_3.99-0.9             R.methodsS3_1.8.1             farver_2.1.0
 [61] seqminer_8.4             dbplyr_2.1.1                  utf8_1.2.2
 [64] here_1.0.1               ggplotify_0.1.0               tidyselect_1.1.2
 [67] labeling_0.4.2           rlang_1.0.2                   later_1.3.0
 [70] AnnotationDbi_1.56.2     BiocVersion_3.14.0            munsell_0.5.0
 [73] tools_4.1.0              cachem_1.0.6                  cli_3.2.0
 [76] generics_0.1.2           RSQLite_2.2.10                evaluate_0.15
 [79] stringr_1.4.0            fastmap_1.1.0                 yaml_2.3.5
 [82] ragg_1.2.2               knitr_1.37                    bit64_4.0.5
 [85] fs_1.5.2                 zip_2.2.0                     purrr_0.3.4
 [88] KEGGREST_1.34.0          gh_1.3.0                      showtext_0.9-5
 [91] mime_0.12                R.oo_1.24.0                   xml2_1.3.3
 [94] biomaRt_2.50.3           brio_1.1.3                    compiler_4.1.0
 [97] rstudioapi_0.13          interactiveDisplayBase_1.32.0 filelock_1.0.2
[100] curl_4.3.2               png_0.1-7                     testthat_3.1.2
[103] paintmap_1.0             tibble_3.1.6                  stringi_1.7.6
[106] GenomicFeatures_1.46.5   desc_1.4.1                    lattice_0.20-45
[109] Matrix_1.4-0             vctrs_0.3.8                   pillar_1.7.0
[112] lifecycle_1.0.1          BiocManager_1.30.16           data.table_1.14.2
[115] bitops_1.0-7             httpuv_1.6.5                  rtracklayer_1.54.0
[118] GenomicRanges_1.46.1     R6_2.5.1                      BiocIO_1.4.0
[121] promises_1.2.0.1         IRanges_2.28.0                ontoProc_1.16.0
[124] assertthat_0.2.1         pkgload_1.2.4                 SummarizedExperiment_1.24.0
[127] rprojroot_2.0.2          rjson_0.2.21                  withr_2.5.0
[130] GenomicAlignments_1.30.0 Rsamtools_2.10.0              S4Vectors_0.32.3
[133] GenomeInfoDbData_1.2.7   parallel_4.1.0                hms_1.1.1
[136] grid_4.1.0               ggfun_0.0.5                   tidyr_1.2.0
[139] rmarkdown_2.13           MatrixGenerics_1.6.0          piggyback_0.1.1
[142] Biobase_2.54.0           shiny_1.7.1                   lubridate_1.8.0
[145] restfulr_0.0.13
```