Closed samreenzafer closed 3 months ago
I would like to add that my WGS VCFs did not have the AVGDP annotation, so I added that manually by copying the DP values for every variant as follows for every chromosome.
`
gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2) genofile <- seqOpen(gds.path, readonly = FALSE)
avgdp<- seqGetData(genofile , "annotation/info/DP") print( length(avgdp)) # for chr22 [1] 730895 seqAddValue(genofile, "annotation/info/AVGDP", avgdp ,"dummy sequencing depth added by samreen") seqClose(genofile)
`
Hi @samreenzafer,
Thanks for reaching out. It's OK that your aGDS files do not have a specific annotation named AVGDP
(average DP), and what you have done makes sense. Generally, we would want to use a set of high-quality genetic variants to compute ancestry PCs and sparse GRM for null model fitting, and AVGDP
is one of the filtering criteria.
For your question, I wonder whether the sample.id
from your phenotype data can be matched with the sample.id
stored in your aGDS file. It is also OK if your phenotype sample ids are a subset of the sample ids in the aGDS files. Could you please double check?
Best, Xihao
Thank you for responding.
for the sampleID question , yes they are overlapping.
phenotype.id <- obj_nullmodel$id_include genotype.id<- seqGetData(genofile, "sample.id") print(length(phenotype.id)) [1] 1004 print(length(seqGetData(genofile, "sample.id"))) [1] 1026 length(intersect( phenotype.id,genotype.id)) [1] 1004
It seems like your outcome variable is binary, and if so, you should use family=binomial(link="logit")
instead of family=gaussian(link="identity")
and make sure that your outcome is coded as numeric 0/1 instead of a factor variable. Also, while it is true that "groups"
need only be used if your outcome is a continuous variable, you could still consider adding your "group"
variable as a fixed effect term in your logistic mixed model.
Hope this helps.
For group, do you mean I should use my "group_race"
variable from the phenotype file for "groups"
I will try thefamily=binomial(link="logit")
and let you know.
And yes my outcome variable was coded as 1/2 instead of 0/1. I will also change that.
I hadn't seen any instructions on the tutorial page to format the phenotype file, and had just assumed the plink-like notation. That was my bad.
I got this error while using the suggested logit
Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope, : Error: heteroscedastic linear mixed models are only applicable when "family" is gaussian. Calls: fit_nullmodel -> glmmkin Execution halted
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", groups="group_race",family=binomial(link="logit"),verbose=TRUE)
Hi @samreenzafer, for binary traits, please leave groups
parameter as NULL
.
I'm not sure what I'm doing wrong but I get errors.
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", family=binomial(link="logit"),verbose=TRUE, groups=NULL)
it ran for 3 iterations and then gave the error
Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': not a positive definite matrix Calls: fit_nullmodel ... chol2inv -> chol -> chol -> .local -> as -> asMethod Execution halted
Then I edited the command to remove theas.factor(group_race)
obj_nullmodel <- fit_nullmodel(pheno~as.factor(SNPSEX)+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
family=binomial(link="logit"),verbose=TRUE, groups=NULL)
And I still get an error, but after 500 iterations.
[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept) SNPSEX PC1 PC2 PC3 PC4
16.0408287 -18.3676970 -27.0664801 9.4036154 -5.5903104 3.7215324
PC5 PC6 PC7 PC8 PC9 PC10
6.9741816 20.4901888 -6.8478820 -2.8771294 0.6520779 1.4433908
Iteration 1 :
Variance component estimates:
[1] 1.000000 7.234874
Fixed-effect coefficients:
[1] 17.1054427 -19.5066252 -30.2674194 9.8406604 -6.9357156 4.8385081
[7] 8.1287788 24.0488017 -12.4647225 -3.3129108 -0.1664124 2.1897960
...
...
...
...
Iteration 499 :
Variance component estimates:
[1] 1.000000 1.040721
Fixed-effect coefficients:
[1] 12.9362534 -15.2646269 -27.1436580 9.3835255 -5.6920925 3.8344001
[7] 7.0144809 20.7832368 -7.6375962 -2.8396810 0.5219339 1.5484195
Iteration 500 :
Variance component estimates:
[1] 1.0000000 0.5132395
Fixed-effect coefficients:
[1] 13.9159314 -16.2561242 -27.1981325 9.4333121 -5.6752385 3.8187186
[7] 7.0241264 20.8224936 -7.4864314 -2.8504011 0.5512372 1.5232672
Fixed-effect coefficients:
(Intercept) SNPSEX PC1 PC2 PC3 PC4
16.0408287 -18.3676970 -27.0664801 9.4036154 -5.5903104 3.7215324
PC5 PC6 PC7 PC8 PC9 PC10
6.9741816 20.4901888 -6.8478820 -2.8771294 0.6520779 1.4433908
Iteration 1 :
Error: Not a matrix.
In addition: Warning message:
Average Information REML not converged, refitting model using Brent method...
Execution halted
I don't understand how the program runs, so not sure what is it that is " not a matrix " here.
Is it even okay for me to have all 3 of my ethnic group of samples analysed together (the creating of the SGRM, fitting null model and all the variant analyses)? Or should this pipeline be run separately for the 3 ethnic groups?
And here is my GRM object
> grm<- get(load("chrall.degree2.sparseGRM.sGRM.RData"))
> str(grm)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
..@ i : int [1:1027] 0 0 1 2 3 4 5 6 7 8 ...
..@ p : int [1:1027] 0 1 3 4 5 6 7 8 9 10 ...
..@ Dim : int [1:2] 1026 1026
..@ Dimnames:List of 2
.. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
.. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
..@ x : num [1:1027] 0.5 0.23 0.494 0.499 0.497 ...
..@ uplo : chr "U"
..@ factors : list()
> min(grm@x)
[1] 0.2297566
> max(grm@x)
[1] 0.5668696
Which I had created using the following command.
Rscript ../extdata/runPipeline_wrapper.R --prefix.in chrall_pruned --prefix.in.unfiltered chrall --prefix.out chrall.degree2.sparseGRM --degree 2 --num_threads 8 --no_pcs 20 --KINGformat.out TRUE
outputting
chrall.degree2.sparseGRM.kins chrall.degree2.sparseGRM.score chrall.degree2.sparseGRM.sGRM.RData
Hi @xihaoli
I tried a few things, and realized that the "SNPSEX" in the phenotype is causing the problems for me.
I refitted the null object without using SNPSEX as a covariate (also did NOT use group_race)
obj_nullmodel <- fit_nullmodel(pheno~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
family=binomial(link="logit"),verbose=TRUE, groups=NULL)
The code ran to completion in only 16 iterations (as opposed to 500 iterations when I used SNPSEX which had given me the "not a matrix" error).
Then I tried one randome Individual_Analysis on arrayid=130
Rscript STAARpipeline_Individual_Analysis_newSampleIDs.noSexnoRace.R 130
which gave the output log
[1] 1004
[1] 1026
[1] "IDs of phenotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] "IDs of genotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] 1004
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 2,290
# of selected samples: 1,026
# of selected variants: 2,941,749
Time difference of 14.57882 secs
and output of
> df<- get(load("analysis/Individual_Analysis.noSexnoRace_130.Rdata"))
> head(df)
CHR POS REF ALT ALT_AF MAF N pvalue pvalue_log10
1 7 20022978 G A 0.08850000 0.08850000 1004 0.1671100 0.77699757
2 7 20023277 AAAAAC A 0.18661972 0.18661972 1004 0.7529397 0.12323983
3 7 20023350 G A 0.08133733 0.08133733 1004 0.7110876 0.14807688
4 7 20026620 T C 0.05700000 0.05700000 1004 0.7886436 0.10311921
5 7 20026886 CT C 0.19687815 0.19687815 1004 0.7687004 0.11424288
6 7 20027095 T C 0.08941059 0.08941059 1004 0.9202903 0.03607516
Score Score_se Est Est_se
1 3.2924486 2.383156 0.57971434 0.4196116
2 -1.4022395 4.454869 -0.07065652 0.2244735
3 -1.1257320 3.039268 -0.12187006 0.3290266
4 0.8054780 3.004703 0.08921759 0.3328116
5 -1.3251899 4.506296 -0.06525875 0.2219118
6 0.3447801 3.445455 0.02904349 0.2902374
which now has numeric Pvalues in the PValue column. I think I can now run for all the entire job array.
Question: Why do you think SNPSEX could be causing a problem? This is what my phenotype file is coded as
FID SNPSEX pheno group_race PC1 PC2 ...PC10
0_NA20814 1 0 EUR -0.0233194936208413 0.0177841478492174
0_NA20815 1 0 EUR -0.0232876963243618 0.0178482237390411
0_NA20818 2 0 EUR -0.023596640369937 0.0174856248889635
0_NA20819 2 0 EUR -0.0235735234473422 0.016707921936772
0_NA20821 2 0 EUR -0.0233292853163099 0.0167566107200051
0_NA20822 2 0 EUR -0.0233028712187108 0.0174146357365599
0_NA20826 2 0 EUR -0.0235409212779821 0.0179969624535894
0_NA20827 1 0 EUR -0.0234523001514743 0.0176404968800983
0_NA20828 2 0 EUR -0.0233579084370098 0.0173618077100926
0_NA20832 2 0 EUR -0.0232000420926164 0.0169872294492342
where SNPSEX is 1 (for male) or 2 (for female), pheno is 0 (control) or 1 (case) - I made this change after you corrected my error.
Hi @samreenzafer,
Thanks for following up. First of all, your output log of arrayid=130
looks good to me, so you may proceed with running other arrayid
's and summarize/visualize all results.
For your question, yes it is possible that the non-convergence issue is on SNPSEX
due to the collinearity between SNPSEX
and intercept or some PCs. A similar issue is here. Could you please check?
Lastly, even if you end up not including SNPSEX
, you may also consider adding as.factor(group_race)
to the fixed effect formula when fitting the null model.
That makes sense. The cases in our cohort are all male, whereas controls have males and females (only for the European cohort). You can see the tables below.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race)
to the fixed effect formula when fitting the null model , and Excluding SNPSEX
makes most sense.
pheno == 0. (i.e. controls)
AFR EUR HISP
Num males 321 213 169
Num females 0 212 0
pheno == 1 (i.e. cases)
AFR EUR HISP
Num males 12 61 16
Num males 0 0 0
Hi @samreenzafer,
Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX
in the null model fitting, because knowing a sample with SNPSEX=2
guarantees that pheno=0
.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.
This sounds good.
Thank you again for all your help, I’ve begun running the association analyses and so far so good, I must say that the pipeline steps have been implemented so well that tweaking for my directory structure and filenames etc is very easy.
We are interested in Noncoding gene region and hence using your pipeline.
Regards,
Samreen
On Wed, Mar 13, 2024 at 12:53 PM Xihao Li @.***> wrote:
Hi @samreenzafer https://github.com/samreenzafer,
Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX in the null model fitting, because knowing a sample with SNPSEX=2 guarantees that pheno=0.
Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort. So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.
This sounds good.
— Reply to this email directly, view it on GitHub https://github.com/xihaoli/STAARpipeline-Tutorial/issues/49#issuecomment-1995573028, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADLD6ZQG7SJZWO7CGAHJPTYYCVCBAVCNFSM6AAAAABERETKGOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJVGU3TGMBSHA . You are receiving this because you were mentioned.Message ID: @.***>
Hi Samreen,
This is great to hear. You may close this case for now, and if you have any questions, please feel free to reopen it.
Best, Xihao
Hi, I'm not sure if I'm getting a reasonable output of the null model.
My genotyping (WGS) data has 1026 samples. ` genofile Object of class "SeqVarGDSClass" File: /analysis/21-WGS/analysis/STAARpipeline/gds_with_AVGDP/chr11.gds (283.1M)
The sgrm matrix is 1026x1026 . I made this using --degree 2 (everything else the same, I know non of my cases are related, and only 1 pair of controls are 1st degree relatives, which the SGRM pipeline identified)
Phenotype has 1004 samples, with a binary variable, which I've created as follows: