zhanxw / rvtests

Rare variant test software for next generation sequencing data
126 stars 41 forks source link

WARN: Failed to load spectral decomposition results of the kinship matrix (.xHemi.eigen file is incorrectly reused across independent runs of rvtests) #105

Open dpanyard opened 4 years ago

dpanyard commented 4 years ago

Hi, I've encountered a warning and error in rvtests log when I try to conduct a GWAS. My code looks like this:

./software/rvtests/v2.1.0/executable/rvtest \
    --inVcf ./derived_data/genetic_data/30_chr12.imputed.poly.vcf.gz \
    --dosage DS \
    --kinship ./derived_data/genetic_data/32_kinship_matrix.kinship \
    --xHemiKinship ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship \
    --pheno ./derived_data/genetic_data/32_chrAll_phenotypes.txt \
    --pheno-name pheno48428 \
    --meta score \
    --out ./derived_data/x_temp_gwas_test_run

In my log, I see the following:

[INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.kinship ] successfully in [ 0.5 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.0 ] seconds. [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u997'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.1 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started [INFO] Analyzed [ 830117 ] variants [INFO] Analysis ends at: Wed May 27 16:26:39 2020 [INFO] Analysis took 1005 seconds

I've looked at the .kinship and .xHemi.kinship files and do not see the string "u997" listed anywhere. What's strange is that I'm running this same code on loop for many different phenotypes, and sometimes rvtests runs (using the same kinship and xHemi.kinship files) without this warning and error appearing.

Any idea what might be going on?

zhanxw commented 4 years ago

Can you check if this sample is in the vcf file?

Sent from my iPhone

On May 27, 2020, at 4:49 PM, Daniel Panyard notifications@github.com wrote:

 Hi, I've encountered a warning and error in rvtests log when I try to conduct a GWAS. My code looks like this:

./software/rvtests/v2.1.0/executable/rvtest \ --inVcf ./derived_data/genetic_data/30_chr12.imputed.poly.vcf.gz \ --dosage DS \ --kinship ./derived_data/genetic_data/32_kinship_matrix.kinship \ --xHemiKinship ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship \ --pheno ./derived_data/genetic_data/32_chrAll_phenotypes.txt \ --pheno-name pheno48428 \ --meta score \ --out ./derived_data/x_temp_gwas_test_run In my log, I see the following:

[INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.kinship ] successfully in [ 0.5 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.0 ] seconds. [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u997'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.1 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started [INFO] Analyzed [ 830117 ] variants [INFO] Analysis ends at: Wed May 27 16:26:39 2020 [INFO] Analysis took 1005 seconds

I've looked at the .kinship and .xHemi.kinship files and do not see the string "u997" listed anywhere. What's strange is that I'm running this same code on loop for many different phenotypes, and sometimes rvtests runs (using the same kinship and xHemi.kinship files) without this warning and error appearing.

Any idea what might be going on?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

dpanyard commented 4 years ago

Hi, I'm not sure you're referring to by "this sample", but if you mean the string "u997", then that string is not an identifier for a sample in this data set and neither is the string "997". I used... grep u997 and confirmed that the string "u997" is not found in the vcf file, .kinship, or .xHemi.kinship files. I'm really not sure where that string could be coming from.

dpanyard commented 4 years ago

I think I've noticed a pattern with these errors. The "unexpected column" will have a number of one more than the number of samples with valid phenotypes in the analysis. So, for example, if I have a data set with 1000 samples, but the given phenotype is only present for 850 samples, then the error will read "unexpected column 'u851'".

The broader pipeline I'm using is that I have a total genetic data set with about 1,000 samples, and I'm running rvtests multiple times on this same sample, each time with a different phenotype. However, some of the phenotypes I use have missing values for some of the samples. Thus, there are many cases where I want to run rvtests using a kinship matrix calculated from 1,000 samples but for a phenotype where only, say, 800 samples actually have a valid phenotype. It seems like it's in these cases that I'm getting the error and warning.

Since rvtests continues to run the analysis anyway, I guess the question is whether rvtests is able to handle the mismatch between the samples present in the kinship matrix and the samples with valid phenotypes. Do you think this might be the source of the error?

dpanyard commented 4 years ago

I discovered today that if I run rvtests but without including the xHemi.kinship file, the error doesn't occur (but of course then it skips the X chromosome for the analysis). So, potentially there's something wrong with the xHemi.kinship file.

dpanyard commented 4 years ago

A couple more things I've tried: I tried using a tab-delimited phenotype file instead of space-delimited, but that did not prevent the error.

I tried generating and using a kinship file that included only the samples that had nonmissing values for the outcome where the error occurs, but that also did not prevent the error.

dpanyard commented 4 years ago

The error seems to occur during KinshipHolder.cpp, line 358

Looks like the "u997" column must be coming from the eigen file (this->eigenFileName). I just (re)discovered that there's a hidden file called .xHemi.eigen, generated on 3/19/2020 sitting in the top-level project directory from where I run my code, including rvtest, but that would mean this file is really out of date from my testing this month in June. This directory is the place from which I run all code. There's no eigen file there for the autosomes from what I see.

This xHemi eigen file looks to contain an 1,118 x 1,118 matrix of values. The rows are stored with the actual sample ID, but the columns contains sequentially numbered strings like U1, U2, U3..., with the U there for an unknown reason. This at least explains the letter "u" in the error log, I assume.

And now I'm seeing the line in the log file, before the error, that says this hidden eigen-decomposition file [.xHemi.eigen] was detected and used: [INFO] Kinship eigen-decomposition file [ .xHemi.eigen ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/32_kinship_matrix.xHemi.kinship ] successfully in [ 0.4 ] seconds. [ERROR] Unexpected column 'u803'! [WARN] Failed to load spectral decomposition results of the kinship matrix! [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 0.7 ] seconds.

There's no corresponding .eigen file for the autosomes, apparently.

If I temporarily rename that old .xHemi.eigen file, that should make it so rvtests can't detect it....

Now the code seems to be running without an error. Here's the log: [INFO] Analysis begins with [ 996 ] samples... [INFO] Family-based model specified. Loading kinship file... [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest _error/met48428_matrix.kinship ] successfully in [ 0.6 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.8 ] seconds. [INFO] Kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest_error/met484 28_matrix.xHemi.kinship ] detected and will be used for for X chromosome analysis [INFO] DONE: Loaded kinship file [ ./derived_data/wrap_genetic_data/archive/troubleshooting_rvtest _error/met48428_matrix.xHemi.kinship ] successfully in [ 0.7 ] seconds. [INFO] DONE: Spectral decomposition of the kinship matrix succeeded in [ 1.6 ] seconds. [INFO] Impute missing genotype to mean (by default) [INFO] Use dosage genotype from VCF flag DS. [INFO] Analysis started

We can see in DataConsolidator.cpp, line 587, where this file name of .xHemi.eigen is set and loaded.

But a new .xHemi.eigen hidden file has been created in the working directory. Importantly, it appears that this new .eigen file only has 996 "U" columns, presumably one for each of the samples with non-missing phenotypes in this current analysis. That means the .xHemi.eigen file is different for each phenotype when the number of samples is different. When this file is present in the working directory, it gets loaded. This leads to the error/warning when the current .eigen file has stuck around from the last rvtests run but it doesn't correspond to the current data set being analyzed.

If this .xHemi.eigen file is somehow created at each run of rvtests, would that cause issues with running multiple iterations of rvtest at the same time in parallel in the same working directory? Or, is there the possibility that this eigen file does not get erased after the code is run and it is mistakenly used again the next time rvtests is used from that same directory? I'm testing the latter situation now.

dpanyard commented 4 years ago

It looks like the .xHemi.eigen file is not cleaned up after rvtests runs.

dpanyard commented 4 years ago

Now, if I'm understanding what's happening correctly, I think this issue might be benign if, when rvtests detects any sort of unexpected problem with the loading of .xHemi.eigen, it just forgets about what's currently in the file and recalculates the kinship eigen-decomposition manually. However, I could imagine a niche case where someone runs rvtests with a kinship matrix for, say, 457 samples, and then they realize they need to change how they calculate that kinship matrix in some way, so they recreate it and then run rvtests again. In this situation, the old .xHemi.eigen file would be loaded and it might have the wrong values. Unless rvtests is able to identify this situation, rvtests would incorrectly use the old eigen-decomposition values instead of recalculating them.

More generally, if there is any gap in rvtests ability to detect situations when the existing .xHemi.eigen file does not exactly correspond to the current X chromosome kinship file, then rvtests could be using incorrect values.

I could also see the potential for issues when someone runs rvtests multiple times in parallel, because all of those processes would be trying to use the .xHemi.eigen file at the same time, maybe even trying to write to the same file at once if the file didn't exist at the beginning.

dpanyard commented 4 years ago

So, some potential changes to help mitigate this issue:

dpanyard commented 4 years ago

For myself (and others who might run into this), it seems like a potential workaround could be to include a line of code in a shell script that would manually remove the .xHemi.eigen file after each run of rvtests. If running in parallel, some care might be needed to ensure the file isn't removed while another process is trying to read it, but for my purposes, I think I'll be able to run all of the chromosome-specific tests for one outcome in parallel, then remove .xHemi.eigen, and then continue on with the next outcome's analyses.

agilly commented 3 years ago

This is great investigative work @dpanyard !

dpanyard commented 3 years ago

This is great investigative work @dpanyard !

Thanks! Glad someone else read my detective-novel-of-a-post here. Out of curiosity, have you encountered this warning message as well? I haven't heard of any official solution to this issue yet.

agilly commented 3 years ago

Yes, one of our analysts is trying out RVtests and is having a really hard time understanding what it's doing and working her way through these warnings... For now I recommended to use a different directory for each analysis which should get rid of at least this problem :)