zhanxw / rvtests

Rare variant test software for next generation sequencing data

VCF Line does not have correct column number and exiting! #28

Closed zishang30 closed 7 years ago

zishang30 commented 7 years ago

Dear Xiaowei

I have checked the single-chromosome vcf.gz files and the bgzipped all-chromosome vcf.gz file used for generating the kinship matrix. It seems that in the all-chromosome vcf.gz file some VCF lines do not have the correct number of columns, as follows:

04-30 14:54 Line [ 8869314 ] does not have correct column number, exiting!
04-30 14:54 Current line has 2251 columns.
05-02 23:51 Line [ 16287134 ] does not have correct column number, exiting!
05-02 23:51 Current line has 861 columns.

Because we already checked the single-chromosome vcf.gz files, could this error be due to the bgzip command used to combine all chromosomes?

The command for bgzipping all chromosomes is as follows:

(zcat chr1_notmonomorph.vcf.gz;
 zgrep -v '^#' chr2_notmonomorph.vcf.gz;
 zgrep -v '^#' chr3_notmonomorph.vcf.gz;
 zgrep -v '^#' chr4_notmonomorph.vcf.gz;
 zgrep -v '^#' chr5_notmonomorph.vcf.gz;
 zgrep -v '^#' chr6_notmonomorph.vcf.gz;
 zgrep -v '^#' chr7_notmonomorph.vcf.gz;
 zgrep -v '^#' chr8_notmonomorph.vcf.gz;
 zgrep -v '^#' chr9_notmonomorph.vcf.gz;
 zgrep -v '^#' chr10_notmonomorph.vcf.gz;
 zgrep -v '^#' chr11_notmonomorph.vcf.gz;
 zgrep -v '^#' chr12_notmonomorph.vcf.gz;
 zgrep -v '^#' chr13_notmonomorph.vcf.gz;
 zgrep -v '^#' chr14_notmonomorph.vcf.gz;
 zgrep -v '^#' chr15_notmonomorph.vcf.gz;
 zgrep -v '^#' chr16_notmonomorph.vcf.gz;
 zgrep -v '^#' chr17_notmonomorph.vcf.gz;
 zgrep -v '^#' chr18_notmonomorph.vcf.gz;
 zgrep -v '^#' chr19_notmonomorph.vcf.gz;
 zgrep -v '^#' chr20_notmonomorph.vcf.gz;
 zgrep -v '^#' chr21_notmonomorph.vcf.gz;
 zgrep -v '^#' chr22_notmonomorph.vcf.gz;
 zgrep -v '^#' chrX_notmonomorph.vcf.gz) | bgzip -c > HRC_chrall.vcf.gz

Is this correct?
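A concatenation like this keeps only the header of the first file, so it relies on every per-chromosome file listing the same samples in the same order. The following is a minimal sketch of one way to check that, not part of rvtests: `header_variants` is a hypothetical helper name, and it only compares the `#CHROM` header lines. A mismatch would be one possible cause of a wrong column count, but this is not confirmed as the cause in this thread.

```shell
# Count the distinct #CHROM header lines across the given VCF files.
# A result greater than 1 means the sample columns differ between files.
# (header_variants is a hypothetical helper, not an rvtests command.)
header_variants() {  # usage: header_variants chr*_notmonomorph.vcf.gz
    for f in "$@"; do
        # Print only the #CHROM line of each file, then stop reading it.
        gzip -dc "$f" | awk '/^#CHROM/ { print; exit }'
    done | sort -u | wc -l
}
```

Running `header_variants chr*_notmonomorph.vcf.gz` should print 1 if all files share the same sample columns.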

In addition, another thing came to mind: our statistical analysis plan asks us to use the following commands to bgzip and index the VCF file:

(grep ^"#" $your_old_vcf; grep -v ^"#" $your_old_vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > $your_vcf_file
tabix -f -p vcf $your_vcf_file

These commands are used for group-based rare variant tests.

If we don't conduct group-based rare variant tests, should we just use the following commands?

bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz

Thank you very much!

Best regards

Zishan

zhanxw commented 7 years ago

Hi Zishan,

The error message reports line numbers (e.g. 8869314). Can you please check line 8869314 and compare it with the previous and next lines? I would try generating a small VCF file from lines 8869310 to 8869320, re-running rvtests, and seeing if the problem disappears. If not, the problem is probably the input file. You can attach it here, and I will check it.
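The inspection suggested above could be sketched roughly as follows. Both function names are hypothetical helpers, not rvtests commands, and the sketch assumes that the reported line number counts from the top of the file with header lines included, which this thread does not confirm.

```shell
# Write the VCF header plus the data lines numbered first..last to stdout.
# (extract_lines and check_columns are hypothetical helper names.)
extract_lines() {  # usage: extract_lines in.vcf.gz first last
    gzip -dc "$1" | awk -v a="$2" -v b="$3" '/^#/ || (NR >= a && NR <= b)'
}

# Report every data line whose tab-separated column count differs from
# that of the #CHROM header line. Accepts plain or gzip/bgzip input.
check_columns() {  # usage: check_columns in.vcf[.gz]
    gzip -dcf "$1" | awk -F'\t' '
        /^#CHROM/ { n = NF; next }   # remember the expected column count
        /^#/      { next }           # skip the other header lines
        NF != n   { printf "line %d has %d columns, expected %d\n", NR, NF, n }'
}
```

For example, `extract_lines HRC_chrall.vcf.gz 8869310 8869320 > small.vcf` would produce a small test file, and `check_columns HRC_chrall.vcf.gz` would list any malformed lines in the merged file.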

If you don't perform group-based tests, your command will be fine.

zishang30 commented 7 years ago

Dear Xiaowei

I re-bgzipped all the VCF files and re-checked all the vcf.gz files using check vcf, and it seems this problem may be resolved. However, when I now use the re-bgzipped all-chromosome vcf.gz to generate the kinship matrix and try to run rvtests, it reports the following:

[ERROR] Cannot find sample [ 890005850 ] from the kinship file!
[ERROR] Failed to load kinship file [ chrall_kinship_matrix.kinship ]

Why does this happen, and how can we check the patient IDs between the kinship matrix and the VCF?

Thank you for your reply.

Best regards

Zishan

PS: We updated rvtests, and it seems the newest version resolves the "segmentation fault".

dajiangliu commented 7 years ago

Zhishan:

The IDs in the kinship file are in the first two columns. You can compare them with the header of the VCF file.
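One way to do that comparison is sketched below. `missing_from_kinship` is a hypothetical helper name. The sketch assumes the kinship file is whitespace-delimited, starts with one header row, and carries the sample ID in its second column; these layout details are assumptions and should be checked against the actual file. In a VCF, sample IDs start at column 10 of the `#CHROM` line.

```shell
# List sample IDs found in the VCF header but absent from the kinship file.
# Assumptions: the kinship file has one header row and the sample ID is in
# column 2. (missing_from_kinship is a hypothetical helper name.)
missing_from_kinship() {  # usage: missing_from_kinship in.vcf.gz in.kinship
    vcf_ids=$(mktemp)
    kin_ids=$(mktemp)
    # Sample IDs are columns 10 and onward of the #CHROM header line.
    gzip -dcf "$1" |
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' |
        sort > "$vcf_ids"
    # Second column of every kinship row after the header row.
    awk 'NR > 1 { print $2 }' "$2" | sort > "$kin_ids"
    # comm -23 keeps the lines that occur only in the first (VCF) list.
    comm -23 "$vcf_ids" "$kin_ids"
    rm -f "$vcf_ids" "$kin_ids"
}
```

Any ID printed by `missing_from_kinship HRC_chrall.vcf.gz chrall_kinship_matrix.kinship` is a sample that the VCF has but the kinship file lacks under these assumptions.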

All the best, Dajiang

Assistant Professor Dept. of Public Health Sciences Institute of Personalized Medicine Penn State College of Medicine, HCAR 2020, Mail Stop R125 Email: dajiang.liu@psu.edu URL: https://dajiangliu.wordpress.com Tel: +1-717-531-4178

