zhanxw / rvtests

Rare variant test software for next generation sequencing data
VCF header have LESS people than VCF content for kinshipmatrix!!--Why this happen?? #22

Closed zishang30 closed 7 years ago

zishang30 commented 7 years ago

Dear Xiaowei

I am using the vcf2kinship command to generate a kinship matrix. The command is as follows:

vcf2kinship --inVcf chrall_notmonomorph.vcf.gz --ped phenotypes_all.ped --bn --minMAF 0.050000 --thread 12 --out all_kinship_matrix

rvtests starts generating the kinship matrix and prints the following:

[INFO] Empiricial kinship will be calculated.
[WARN] Warning: Specified parameter --ped has no effect.
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
strtol: Invalid argument
[INFO] Start creating empirical kinship from VCF file.
[INFO] Using default maximum missing rate = 0.05
[INFO] Exclude [ 6 ] samples from VCF files because they do not exist in pedigree file or do not have sex: 922000077 922003426 922003311 922001212 922000100 922000199
[INFO] Total [ 3782 ] individuals from VCF are used.
Total [ 16210000 ] VCF records ( Expected 3788 individual but only have 852 individualve processed.
Report 'VCF header have LESS people than VCF content!' to zhanxw@umich.edu
Error line [ 14 ... ]

It says 'VCF header have LESS people than VCF content!'. Why does this error happen, and how can I fix it? In addition, I used the following commands to bgzip and tabix the VCF file before generating the kinship matrix:

(grep ^"#" chr1_notmonomorph.vcf; grep -v ^"#" chr1_notmonomorph.vcf | sed 's:^chr::ig' | sort -k1,1n -k2,2n) | bgzip -c > chr1_notmonomorph.vcf.gz

tabix -f -p vcf chr1_notmonomorph.vcf.gz

Thank you very much for your help!

Zishan

zhanxw commented 7 years ago

This error means the VCF header and the VCF content are inconsistent. For example, the VCF header line lists 10 samples, but a VCF content line contains genotypes for 11 samples. Can you validate your VCF file?

zishang30 commented 7 years ago

Hi Xiaowei,

Thank you very much for your reply! How can we validate our VCF file?

zhanxw commented 7 years ago

You can write a script to check the number of columns in each line of your VCF file. If you see an inconsistency, you will need to fix it manually.

You may also want to use a script to check the VCF file: https://github.com/zhanxw/checkVCF
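The column check described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of rvtests or checkVCF: the function name `find_column_mismatches` is hypothetical, and it assumes tab-separated fields and a single `#CHROM` header line.

```python
def find_column_mismatches(lines):
    """Yield (line_number, n_fields, expected) for every record whose
    tab-separated field count differs from the #CHROM header line."""
    expected = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Only the #CHROM line defines the expected column count;
            # "##" meta lines are skipped.
            if line.startswith("#CHROM"):
                expected = len(line.split("\t"))
            continue
        if expected is None:
            raise ValueError("record found before the #CHROM header line")
        n_fields = len(line.split("\t"))
        if n_fields != expected:
            yield line_number, n_fields, expected


# Tiny in-memory example: the last record is missing one sample column.
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\n",
]
for line_number, n_fields, expected in find_column_mismatches(demo):
    print(f"line {line_number}: {n_fields} fields, expected {expected}")
```

For a real file, the same function can be fed an open handle, for example `gzip.open("chr1_notmonomorph.vcf.gz", "rt")`; Python's `gzip` module reads bgzip-compressed files because they are valid multi-member gzip streams.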

zishang30 commented 7 years ago

Dear Xiaowei

Thank you very much for your help! I have already downloaded checkVCF and started running it. If there is an inconsistency, how can I fix it manually? Thanks!

zhanxw commented 7 years ago

In my view, manually fixing a VCF file is not good practice.
You probably need to check the previous analysis steps and find out which step introduced the inconsistency. Then you can fix that step.

zishang30 commented 7 years ago

Dear Xiaowei

I already checked chr22 using the checkVCF tool. Here is the report:

--------------- REPORT ---------------
Total [ 252148 ] lines processed
Examine [ 10 ] VCF header lines, [ 252138 ] variant sites, [ 3788 ] samples
[ 0 ] duplicated sites
[ 0 ] NonSNP site are outputted to [ test.check.nonSnp ]
[ 0 ] Inconsistent reference sites are outputted to [ test.check.ref ]
[ 0 ] Variant sites with invalid genotypes are outputted to [ test.check.geno ]
[ 19110 ] Alternative allele frequency > 0.5 sites are outputted to [ test.check.af ]
[ 0 ] Monomorphic sites are outputted to [ test.check.mono ]
--------------- ACTION ITEM ---------------

It seems there is no inconsistency in this VCF file. But when I use this VCF to run rvtests with the kinship matrix, we still get the same segmentation fault. Therefore I assume the problem is due to the kinship matrix. When I re-checked, I remembered an important point: we have two datasets, A and B, to run rvtests on. When I generated the kinship matrix for dataset A, it reported "Expected 4397 individual but only have 2242 individual Report 'VCF header have LESS people than VCF content!'". And when the kinship matrix generation finished, the log file reported a warning: "Warning: Specified parameter --ped has no effect." The earlier command for the kinship matrix was:

vcf2kinship --inVcf N3788_chr1.vcf.gz --ped phenotypes.ped --bn --minMAF 0.050000 --thread 8 --out kinship_matrix

We did not use --xHemi because we have no chromosome X imputation data. I searched on Google, and it seems this may cause the warning. Do you think this is the reason for the inconsistency when generating the kinship matrix? I have already started to regenerate the kinship matrix using the full command:

vcf2kinship --inVcf N3788_chr1.vcf.gz --ped phenotypes.ped --bn --xHemi --minMAF 0.050000 --thread 8 --out kinship_matrix

Thank you very much for your help!

Best regards

Zishan

zishang30 commented 7 years ago

Dear Xiaowei

This is the new log file from generating the kinship matrix.

###################################
[INFO] Program version: 20170210
[INFO] Git Version
[INFO] 584dea45a315644886d470b56b9eb7a4818580cc
[INFO] Parameters BEGIN
ParameterList created by zishan.gao on 128055 at Mon Apr 24 17:14:09 2017
--inVcf "N3788_chrALL.vcf.gz" --out "hemi_kinship_matrix.kinship" --xHemi --ped "phenotype.ped" --bn --minMAF 0.050000 --thread 8
[INFO] Parameters END
[INFO] Analysis started at: Mon Apr 24 17:14:09 2017
[INFO] Multiple ( 8 ) threads will be used.
[INFO] Empiricial kinship will be calculated.
[INFO] Start creating empirical kinship from VCF file.
[INFO] Using default maximum missing rate = 0.05
[INFO] Exclude [ 6 ] samples from VCF files because they do not exist in pedigree file or do not have sex:
[INFO] Total [ 3782 ] individuals from VCF are used.
[INFO] Total [ 19921449 ] VCF records have been processed.
[INFO] Kinship [ hemi_kinship_matrix.kinship.kinship ] has been generated.
[ERROR] There are not enough variants to create kinship matrix.
[ERROR] Failed to create hemizygous-region kinship file [ hemi_kinship_matrix.kinship.xHemi.kinship ].
[INFO] Skipped [ 14591821 ] sites due to MAF or high misssingness
[INFO] Total [ 5329628 ] variants are used to calculate autosomal kinship matrix.
[INFO] Total [ 0 ] variants are used to calculate chromosome X kinship matrix.
[INFO] Analysis ends at: Wed Apr 26 04:53:26 2017
[INFO] Analysis took 128357 seconds.
###############################################

According to this log, it seems "There are not enough variants to create kinship matrix". We also still get the inconsistency "VCF header have LESS people than VCF content!" when we generate this new kinship matrix, although we already used the --xHemi option in this case.

Thanks!

Best regards

Zishan

zhanxw commented 7 years ago

Maybe you can skip the "--xHemi" option in the vcf2kinship program? Then you will not create ".xHemi" kinship files, and that may help rvtests run smoothly. Can you please try again? Thanks.
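Before deciding whether to pass "--xHemi", it can help to confirm whether the VCF contains any chromosome X records at all, since the log above reports 0 variants used for the chromosome X kinship matrix. The sketch below is a hypothetical helper, not part of rvtests. The X labels it checks ("X", "chrX", "23") are an assumption about common naming, not a statement of what vcf2kinship recognizes.

```python
from collections import Counter


def count_records_by_chrom(lines):
    """Count VCF records per value of the first (CHROM) column."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


def has_chrom_x(counts, x_labels=("X", "chrX", "23")):
    """Return True if any assumed chromosome X label has at least one record."""
    return any(counts.get(label, 0) > 0 for label in x_labels)
```

A possible use is `counts = count_records_by_chrom(gzip.open("chrall.vcf.gz", "rt"))` after `import gzip`, followed by `has_chrom_x(counts)`. If it returns False, omitting "--xHemi" avoids asking for a chromosome X kinship matrix that has no input variants.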

zishang30 commented 7 years ago

Dear Xiaowei

We ran the vcf2kinship command with and without "--xHemi" and got the following results:

###########################
Run kinship without xHemi:
[INFO] Program version: 20170210
[INFO] Git Version
[INFO] 584dea45a315644886d470b56b9eb7a4818580cc
[INFO] Parameters BEGIN
ParameterList created by zishan on 55 at Wed Apr 26 17:33:05 2017
--inVcf "chrall.vcf.gz" --out "0426_kinship_matrix" --ped "phenotype_all.ped" --bn --minMAF 0.050000 --thread 12
[INFO] Parameters END
[INFO] Analysis started at: Wed Apr 26 17:33:05 2017
[INFO] Multiple ( 12 ) threads will be used.
[INFO] Empiricial kinship will be calculated.
[WARN] Warning: Specified parameter --ped has no effect.
[INFO] Start creating empirical kinship from VCF file.
[INFO] Using default maximum missing rate = 0.05
[INFO] Exclude [ 6 ] samples from VCF files because they do not exist in pedigree file or do not have sex:
[INFO] Total [ 3782 ] individuals from VCF are used.
[INFO] Total [ 20023742 ] VCF records have been processed.
[INFO] Kinship [ 0426_kinship_matrix.kinship ] has been generated.
[INFO] Skipped [ 14665592 ] sites due to MAF or high misssingness
[INFO] Total [ 5358150 ] variants are used to calculate autosomal kinship matrix.
[INFO] Analysis ends at: Fri Apr 28 06:46:49 2017
[INFO] Analysis took 134024 seconds.
###################################
Run kinship with xHemi
[INFO] Program version: 20170210
[INFO] Git Version
[INFO] 584dea45a315644886d470b56b9eb7a4818580cc
[INFO] Parameters BEGIN
ParameterList created by zishan on 55 at Wed Apr 26 23:23:02 2017
--inVcf "chrall.vcf.gz" --out "0427_kinship_matrix" --xHemi --ped "phenotypes_all.ped" --bn --minMAF 0.050000 --thread 12
[INFO] Parameters END
[INFO] Analysis started at: Wed Apr 26 23:23:02 2017
[INFO] Multiple ( 12 ) threads will be used.
[INFO] Empiricial kinship will be calculated.
[INFO] Start creating empirical kinship from VCF file.
[INFO] Using default maximum missing rate = 0.05
[INFO] Exclude [ 6 ] samples from VCF files because they do not exist in pedigree file or do not have sex:
[INFO] Total [ 3782 ] individuals from VCF are used.
[INFO] Total [ 20023742 ] VCF records have been processed.
[INFO] Kinship [ 0427_kinship_matrix.kinship ] has been generated.
[ERROR] There are not enough variants to create kinship matrix.
[ERROR] Failed to create hemizygous-region kinship file [ 0427_kinship_matrix.xHemi.kinship ].
[INFO] Skipped [ 14665592 ] sites due to MAF or high misssingness
[INFO] Total [ 5358150 ] variants are used to calculate autosomal kinship matrix.
[INFO] Total [ 0 ] variants are used to calculate chromosome X kinship matrix.
[INFO] Analysis ends at: Fri Apr 28 11:53:45 2017
[INFO] Analysis took 131443 seconds.
##################################

Comparing these two log files, it seems we can generate the kinship matrix without --xHemi, but we get the warning "Specified parameter --ped has no effect." When we generate the kinship matrix with --xHemi, we get the errors "There are not enough variants to create kinship matrix." and "Failed to create hemizygous-region kinship file [ 0427_kinship_matrix.xHemi.kinship ]." The two generated kinship files are the same size, 158M.

I assume that both the warning and the error are due to the lack of chromosome X data. In this situation, generating the kinship matrix without --xHemi may be better, because it completes with only a warning instead of the error that occurs with --xHemi.

Also, this time we did not receive the error 'VCF header have LESS people than VCF content', because I used the unsorted and untabixed vcf.gz to generate the kinship matrix. So maybe we cannot use a sorted and tabixed vcf.gz to make the kinship matrix, because this kind of sorting and tabix indexing might change the VCF file content or the header?

In addition, we received the X chromosome imputation data yesterday. I will try to generate the kinship matrix with the X chromosome this weekend.

zhanxw commented 7 years ago

Thanks for reporting back here. It seems your problem was solved, right? It's usually unlikely that sort and tabix will change the VCF file content...
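One way to test whether the sort and bgzip step altered anything is to compare the original and the processed file directly. The following is a rough sketch under stated assumptions, not an rvtests feature: `summarize` and `compare` are hypothetical names, fields are assumed to be tab-separated, and a leading "chr" is stripped from both files to mirror the `sed 's:^chr::ig'` step in the pipeline above. The digest sums one hash per record, so it does not depend on record order.

```python
import hashlib
import re


def summarize(lines):
    """Return (sample_names, record_count, order_independent_digest)."""
    samples = None
    n_records = 0
    digest = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#CHROM"):
                samples = line.split("\t")[9:]  # columns after FORMAT
            continue
        # Strip a leading "chr" so prefixed and unprefixed records compare equal.
        record = re.sub(r"^chr", "", line, flags=re.IGNORECASE)
        n_records += 1
        # Adding per-record hashes makes the digest independent of record order.
        h = hashlib.sha256(record.encode("utf-8")).digest()
        digest = (digest + int.from_bytes(h, "big")) % (1 << 256)
    return samples, n_records, digest


def compare(before_lines, after_lines):
    """Return a list of differences between two VCFs; empty if none found."""
    s1, n1, d1 = summarize(before_lines)
    s2, n2, d2 = summarize(after_lines)
    diffs = []
    if s1 != s2:
        diffs.append("sample columns differ")
    if n1 != n2:
        diffs.append(f"record count differs: {n1} vs {n2}")
    elif d1 != d2:
        diffs.append("record content differs")
    return diffs
```

Both arguments can be open text handles, for example `open("chr1_notmonomorph.vcf")` and `gzip.open("chr1_notmonomorph.vcf.gz", "rt")`. An empty result suggests the processed file has the same samples and records as the original, which would point to a cause other than sorting.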