zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
44 stars 12 forks source link

seqAlleleCount returns NA if no samples in filter #79

Closed smgogarten closed 2 years ago

smgogarten commented 2 years ago

Previously seqAlleleCount returned 0 if no samples were selected in the filter; the current version returns NA, which breaks some code in SeqVarTools.

> gds <- seqOpen(seqExampleFileName())
> seqSetFilter(gds, sample.sel=integer(), variant.sel=1:10)
# of selected samples: 0
# of selected variants: 10
> seqAlleleCount(gds)
 [1] NA NA NA NA NA NA NA NA NA NA
> sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SeqArray_1.37.1 gdsfmt_1.33.0 

Running this on an earlier version of SeqArray:

> gds <- seqOpen(seqExampleFileName())
> seqSetFilter(gds, sample.sel=integer(), variant.sel=1:10)
# of selected samples: 0
# of selected variants: 10
> seqAlleleCount(gds)
 [1] 0 0 0 0 0 0 0 0 0 0

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SeqArray_1.36.0 gdsfmt_1.32.0  
zhengxwen commented 2 years ago

CHANGES IN VERSION 1.37.1

o `seqAlleleCount()` returns NA instead of zero when all genotypes are missing at a site

Here, no sample selected is treated as missing genotypes. Returning NA is a way to distinguish the case of all missing genotypes from no polymorphism.

I am afraid that you might replace NA by zero in your codes, and check is.na() if call seqAlleleCount() and seqAlleleFreq().

In addition, the previous and current seqAlleleFreq() could return NaN.

seqAlleleCount() and seqAlleleFreq() never return Inf, -Inf.

smgogarten commented 2 years ago

Ok, I updated the SeqVarTools code to accommodate the change.