zhanxw / seqminer

Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
http://zhanxw.github.io/seqminer/

New features for seqminer (8.0) #12

Open WenjianBI opened 4 years ago

WenjianBI commented 4 years ago

Hi Xiaowei,

I am using seqminer (v8.0), and it works well on multiple operating systems. Would you consider adding some features to the current functions?

  1. We usually do not need all subjects in an analysis. For readBGENToMatrixByRange() and readVCFToMatrixByRange(), could you add an argument such as 'subjIDs' or 'subjIndex' to specify which subjects to load? Loading only those subjects can save a lot of memory.

  2. Could you add a function that splits all markers into multiple ranges, each containing a similar number of markers? In a genome-wide analysis we cannot hold the genotypes of all markers in memory, so such a function would help a great deal. If possible, I suggest something like splitRange(fileName, memoryChunk = 4GB, subjIDs, ...). The output could be a data.frame with one row per range.

  3. Sometimes the PLINK bed/bim/fam files or the BGEN bgen/bgi files have different prefix names. Could you let users specify a separate name for each file? That would also be helpful.

Thanks, Wenjian
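
To make request 2 concrete, here is a minimal sketch of the chunking logic, written in Python for illustration only. It is not seqminer code: the name `split_ranges`, its parameters, and the assumption of 8 bytes per stored genotype are all hypothetical, and it takes marker positions that have already been read from an index rather than a file name. It converts a memory budget into a maximum number of markers per chunk, balances the chunk sizes within each chromosome, and returns one row per range with a `chrom:start-end` string.

```python
import math
from typing import Dict, List


def split_ranges(
    positions_by_chrom: Dict[str, List[int]],
    n_subjects: int,
    memory_bytes: int,
    bytes_per_genotype: int = 8,  # assumption: genotypes stored as doubles
) -> List[dict]:
    """Split markers into ranges that each fit a memory budget (hypothetical helper)."""
    # Largest number of markers whose genotypes fit in the budget at once.
    per_chunk = max(1, memory_bytes // (n_subjects * bytes_per_genotype))
    rows = []
    for chrom, positions in positions_by_chrom.items():
        ordered = sorted(positions)
        n = len(ordered)
        if n == 0:
            continue
        # Balance chunk sizes so the last chunk is not much smaller than the rest.
        n_chunks = math.ceil(n / per_chunk)
        size = math.ceil(n / n_chunks)
        i = 0
        while i < n:
            j = min(i + size, n)
            # Keep markers that share a position in the same chunk,
            # so that consecutive ranges never overlap.
            while j < n and ordered[j] == ordered[j - 1]:
                j += 1
            block = ordered[i:j]
            rows.append({
                "chrom": chrom,
                "start": block[0],
                "end": block[-1],
                "n_markers": len(block),
                "range": f"{chrom}:{block[0]}-{block[-1]}",
            })
            i = j
    return rows
```

Each returned row could then be passed, one at a time, to a range-based reader, so that only one chunk of genotypes is held in memory. Restricting the analysis to a subset of subjects (request 1) would lower `n_subjects` and allow proportionally larger chunks.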

zhanxw commented 4 years ago

Those are all very helpful suggestions. Thanks, I will implement them.

On Jul 28, 2020, at 3:56 PM, Wenjian Bi wrote: (original request, quoted in full above)

WenjianBI commented 4 years ago

Thank you for the swift reply. BGEN files are becoming more and more popular, and I think your package can be a very important tool for R users.

garyzhubc commented 3 years ago

I think it'd be great if there were an option to load a matrix from readBGENToMatrixByRange indexed by rsid instead of position.

zhanxw commented 3 years ago

Thanks for the suggestion, but managing rsids is quite challenging because they can change over time (rs IDs can be merged or become invalid across releases).

On Feb 10, 2021, at 5:44 PM, Peiyuan Zhu wrote: (rsid request, quoted in full above)

garyzhubc commented 3 years ago

Missing data imputation would be an important feature to have. I wonder how missing genotypes are handled in the current version.

zhanxw commented 3 years ago

If a genotype is missing in BGEN, you will get NA as the genotype.

Best, Xiaowei
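
Since missing genotypes come back as NA, any imputation has to happen after the matrix is loaded. The following is a minimal sketch of one simple option, per-marker mean imputation, written in Python for illustration only. It is not a seqminer feature: the function name `impute_marker_means` is hypothetical, `None` stands in for NA, and the layout (one row per marker, one column per subject) is an assumption.

```python
from typing import List, Optional


def impute_marker_means(
    genotypes: List[List[Optional[float]]],
) -> List[List[Optional[float]]]:
    """Replace missing genotypes with the marker's observed mean (hypothetical helper).

    Assumes one row per marker and one column per subject; None marks a
    missing genotype. A marker with no observed genotypes is left unchanged.
    """
    imputed = []
    for row in genotypes:
        observed = [g for g in row if g is not None]
        if not observed:
            # Nothing to estimate the mean from, so keep the row as it is.
            imputed.append(list(row))
            continue
        mean = sum(observed) / len(observed)
        imputed.append([mean if g is None else g for g in row])
    return imputed
```

Mean imputation keeps the marker's average dosage unchanged but shrinks its variance, so a model-based method may be preferable when many genotypes are missing.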

On Thu, Feb 11, 2021 at 8:07 PM Peiyuan Zhu wrote: (imputation question, quoted in full above)

View it on GitHub: https://github.com/zhanxw/seqminer/issues/12#issuecomment-777921113