zhanxw / rvtests

Rare variant test software for next generation sequencing data
131 stars 41 forks source link

std::logic_error when generating kinship matrix #33

Closed MostPJ closed 7 years ago

MostPJ commented 7 years ago

Hello all,

I ran into an odd problem when trying to generate a kinship matrix using Rvtest. The dataset I am working on actually consists of two subcohorts that were genotyped and imputed together, but need to be analysed separately. As such, and because I only have limited disk space available, I decided to generate separate kinship matrices for the subcohorts. I had no problem generating a kinship matrix for the larger of the two, but when I did so for the smaller one, rvtest gave the following output:

Effective Options
    --inVcf chrALL.imputed.poly.vcf.gz
    --out kinship_matrix_CC
    --xHemi
    --xLabel X
    --ped dataF_P90c_TRAILS_CC_anthro.txt
    --bn
    --minMAF 0.05
    --thread 16

[INFO]   Program version: 20170210
[INFO]   Analysis started at: Mon May 29 16:47:51 2017
[INFO]   Multiple ( 16 ) threads will be used.
[INFO]   Empiricial kinship will be calculated.
strtol: Invalid argument
[INFO]   Start creating empirical kinship from VCF file.
[INFO]   Using default maximum missing rate = 0.05
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
/var/spool/torque/mom_priv/jobs/547534.batch1.lisa.surfsara.nl.SC: line 19: 32228 Aborted                 ~/Software/rvtests/executable/vcf2kinship --inVcf chrALL.imputed.poly.vcf.gz --ped dataF_P90c_TRAILS_CC_anthro.txt --bn --out kinship_matrix_CC --xLabel X --xHemi --minMAF .05 --thread 16

(The “strol: Invalid argument” line is actually repeated exactly 1900 times - I just left the one entry to show where it appears.)

As ped file I use the phenotype file, containing only samples of the smaller subcohort. The command line used is the same as for the larger subcohort, except where it specifies a different ped file and output filename. The phenotypes files of both subcohorts were created in the same way, so it seems unlikely the cause lies there. What does this error mean?

zhanxw commented 7 years ago

I have not seen this type of error before. Just want to know more about the system.

Can you please paste the outputs of uname -a and lsb_release -a? Can you check the memory limit on your torque system?

MostPJ commented 7 years ago

It's the Dutch LISA system on the SurfSara cluster: https://userinfo.surfsara.nl/systems/lisa/description

uname -a Linux login2.lisa.surfsara.nl 3.2.0-4-amd64 #1 SMP Debian 3.2.88-1 x86_64 GNU/Linux

lsb_release -a LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch Distributor ID: Debian Description: Debian GNU/Linux 7.11 (wheezy) Release: 7.11 Codename: wheezy

I'm afraid I don't know what a torque system is. For the kinship-command I requested a node with 64 GB RAM (QPI 8.00 GT/s), which would also have a 20 MB cache.

Also, given that it occured with the smaller subcohort but not the larger one, I doubt it is a memory issue.

zhanxw commented 7 years ago

Can you please try to use just one thread? Maybe try it on a small chromosomal chunk and see if thread is the source of the problem. Thanks.

Xiaowei

On Jun 7, 2017, at 8:44 AM, MostPJ notifications@github.com wrote:

Reopened #33.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

zhanxw commented 7 years ago

Do you mind attached the first 5 lines of the input VCF and PED file? How many lines does the PED file have? Just want to make sure the input file look all right. Thanks.

MostPJ commented 7 years ago

Thanks for the suggestion. I am trying to run kinship with a single thread right now, but it may be a couple of hours before the job is processed by the cluster.

The ped file and VCF are attached to this post. I removed the sample IDs from the VCF, and edited those in the PED file for the sake of anonimization, but that shouldn't be a problem. The PED file contains 32 lines in total (a header line and 341 samples).

EDITED: data files removed. As they are no longer relevant.

MostPJ commented 7 years ago

The single-thread analysis (using the chromosome 22 file rather than the combined genome file) gave the same output. The only difference that I could see was that the "strol: invalid argument" message repeated 76 times.

Effective Options
    --inVcf chr22.imputed.poly.vcf.gz
    --out kinship_matrix_CC
    --xHemi
    --xLabel X
    --ped dataF_P90c_TRAILS_CC_anthro.txt
    --bn
    --minMAF 0.05
    --thread 1

[INFO]  Program version: 20170210
[INFO]  Analysis started at: Fri Jun  9 16:55:30 2017
[INFO]  Empiricial kinship will be calculated.
strtol: Invalid argument
[INFO]  Start creating empirical kinship from VCF file.
[INFO]  Using default maximum missing rate = 0.05
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_M_construct null not valid
zhanxw commented 7 years ago

Thanks for reporting back. This shows the problem is not related to multithreading.

On Fri, Jun 9, 2017 at 10:33 AM, MostPJ notifications@github.com wrote:

The single-thread analysis (using the chromosome 22 file rather than the combined genome file) gave the same output. The only difference that I could see was that the "strol: invalid argument" message repeated 76 times.

`Effective Options --inVcf chr22.imputed.poly.vcf.gz --out kinship_matrix_CC --xHemi --xLabel X --ped dataF_P90c_TRAILS_CC_anthro.txt --bn --minMAF 0.05 --thread 1

[INFO] Program version: 20170210 [INFO] Analysis started at: Fri Jun 9 16:55:30 2017 [INFO] Empiricial kinship will be calculated. strtol: Invalid argument [INFO] Start creating empirical kinship from VCF file. [INFO] Using default maximum missing rate = 0.05 terminate called after throwing an instance of 'std::logic_error' what(): basic_string::_M_construct null not valid `

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/zhanxw/rvtests/issues/33#issuecomment-307422039, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJoiAxy-dlUGK1LE6JeAVWwFrz_L0Ctks5sCWXVgaJpZM4NykE3 .

zhanxw commented 7 years ago

As I don't have clue as this moment, do you think the VCF can be invalid? Maybe you can run to check validity using this: https://github.com/zhanxw/checkVCF

MostPJ commented 7 years ago

I think I found the problem. I discovered that I could run a kinship analysis if I replaced the phenotypes (but not the sample IDs) of the smaller subset with values taken from the larger subset.

Then I realized that there where missing gender values in the smaller subset. These unsexed samples are missing from my phenotype file. However, when I arbitrarily made them all females, the kinship analysis ran without error.

Does the gender value in the ped file affect the kinship analysis? Because if it doesn't, using arbitrary genders will have solved this problem.

zhanxw commented 7 years ago

Thanks for the information. Yes for the X-chromosome analysis, as vcf2kinship needs to account for gender and zygosity in the X non-PAR region. Just to be clear, since the input is chromosome 22, the results should be the same with or without the gender information.

MostPJ commented 7 years ago

OK. I am going to rerun the kinship analysis with the old, unmodified phenotype file while using the peopleExclude argument to ignore these samples. That should hopefully avoid the problem.

zhanxw commented 7 years ago

Does this file chrALL.imputed.poly.vcf.gz include both autosomes and sex chromosome?

MostPJ commented 7 years ago

Yes, it includes chromosome X.

I tried generating a kinship matrix while excluding the unsexed samples. It worked, but it gets a bit strange, since I did get 36 "strtol: Invalid argument" warnings in the console output. When I did not exclude the unsexed samples, but set them all to female, I did not get any warnings. (For this test, I also used the chr22 vcf file as input, rather than the chrALL file, in order to save space, so this wasn't a comprehensive test.)

I still don't get what the warning means. If I were to give an arbitrary gender to these 76 samples, and then exclude them via the peopleExclude argument, would that affect the kinship matrix? Since these 76 samples have no phenotypes, they are not going to be used in the analysis anyway.

MostPJ commented 7 years ago

Adding this for clarity: when I generated a kinship matrix where the unsexed samples had been set to female, I used the chrALL file. However, I am running another analysis that is taking up a lot of disk space, so when I tried generating a kinship matrix while excluding the unsexed samples, I only used chr22. I don't think that explains the error messages, but I wanted to add it for completeness.

zhanxw commented 7 years ago

You can safely ignore the warning "strtol: Invalid argument". It is just verbose warning messages when RVTESTS tries to convert a non-numeric value. For example, when RVTESTS process "NA", it will give this warning. I realize that this warning is no longer necessary. The latest version thus stops displaying warnings.

MostPJ commented 7 years ago

Excellent. The issue has been resolved, then. Thank you for your help!