samreenzafer commented 4 months ago

Hi, I'm not sure if I'm getting a reasonable output of the null model.

My genotyping (WGS) data has 1026 samples. ` genofile Object of class "SeqVarGDSClass" File: /analysis/21-WGS/analysis/STAARpipeline/gds_with_AVGDP/chr11.gds (283.1M)

[ ] |--+ description [ ] |--+ sample.id { Str8 1026 LZMA_ra(13.0%), 1.1K } |--+ variant.id { Int32 2496036 LZMA_ra(4.50%), 439.1K } |--+ position { Int32 2496036 LZMA_ra(30.8%), 2.9M } |--+ chromosome { Str8 2496036 LZMA_ra(0.02%), 1.2K } |--+ allele { Str8 2496036 LZMA_ra(15.8%), 1.7M } |--+ genotype [ ] | |--+ data { Bit2 2x1026x2496036 LZMA_ra(4.87%), 59.5M } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Int16 0 LZMA_ra, 18B } |--+ phase [ ] | |--+ data { Bit1 1026x2496036 LZMA_ra(0.01%), 45.6K } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Bit1 0 LZMA_ra, 18B } |--+ annotation [ ] | |--+ id { Str8 2496036 LZMA_ra(9.76%), 2.8M } | |--+ qual { Float32 2496036 LZMA_ra(62.4%), 5.9M } | |--+ filter { Int32,factor 2496036 LZMA_ra(2.17%), 211.1K } | |--+ info [ ] | | |--+ AC { Int32 2496036 LZMA_ra(18.8%), 1.8M } | | |--+ AF { Float32 2496036 LZMA_ra(36.3%), 3.5M } | | |--+ ALLELE_END { Bit1 2496036 LZMA_ra(0.06%), 201B } | | |--+ AN { Int32 2496036 LZMA_ra(12.2%), 1.2M } | | |--+ AS_BaseQRankSum { Float32 0 LZMA_ra, 18B } | | |--+ AS_FS { Float32 1531413 LZMA_ra(26.4%), 1.5M } | | |--+ AS_InbreedingCoeff { Float32 3060955 LZMA_ra(23.0%), 2.7M } | | |--+ AS_MQ { Float32 0 LZMA_ra, 18B } | | |--+ AS_MQRankSum { Float32 0 LZMA_ra, 18B } | | |--+ AS_QD { Float32 3060955 LZMA_ra(35.3%), 4.1M } | | |--+ AS_ReadPosRankSum { Float32 0 LZMA_ra, 18B } | | |--+ AS_SB_TABLE { Str8 2496036 LZMA_ra(0.02%), 517B } | | |--+ AS_SOR { Float32 1531413 LZMA_ra(28.4%), 1.7M } | | |--+ BaseQRankSum { Float32 2496036 LZMA_ra(39.8%), 3.8M } | | |--+ ClippingRankSum { Float32 2496036 LZMA_ra(34.4%), 3.3M } | | |--+ DB { Bit1 2496036 LZMA_ra(0.06%), 201B } | | |--+ DP { Int32 2496036 LZMA_ra(40.8%), 3.9M } | | |--+ DS { Bit1 2496036 LZMA_ra(0.06%), 201B } | | |--+ END { Int32 2496036 LZMA_ra(0.02%), 1.6K } | | |--+ ExcessHet { Float32 2496036 LZMA_ra(24.4%), 2.3M } | | |--+ FS { Float32 2496036 LZMA_ra(35.9%), 3.4M } | | |--+ InbreedingCoeff { Float32 2496036 LZMA_ra(24.2%), 2.3M } | | |--+ MLEAC { Int32 2642114 LZMA_ra(17.2%), 1.7M } | | |--+ MLEAF { Float32 2642114 LZMA_ra(18.6%), 1.9M } | | |--+ MQ { Float32 2496036 LZMA_ra(12.3%), 1.2M } | | |--+ MQ0 { Int32 2496036 LZMA_ra(0.02%), 1.6K } | | |--+ MQRankSum { Float32 2496036 LZMA_ra(35.7%), 3.4M } | | |--+ NEGATIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(45.5%), 138.7K } | | |--+ POSITIVE_TRAIN_SITE { Bit1 2496036 LZMA_ra(96.5%), 294.1K } | | |--+ QD { Float32 2496036 LZMA_ra(37.5%), 3.6M } | | |--+ RAW_MQ { Float32 2496036 LZMA_ra(0.02%), 1.6K } | | |--+ ReadPosRankSum { Float32 2496036 LZMA_ra(39.5%), 3.8M } | | |--+ SOR { Float32 2496036 LZMA_ra(36.3%), 3.5M } | | |--+ VQSLOD { Float32 2496036 LZMA_ra(41.1%), 3.9M } | | |--+ culprit { Str8 2496036 LZMA_ra(7.96%), 1.0M } | | |--+ F_MISSING { Float32 2496036 LZMA_ra(13.2%), 1.3M } | | |--+ AN_case { Int32 2496036 LZMA_ra(6.09%), 593.7K } | | |--+ AN_control { Int32 2496036 LZMA_ra(11.0%), 1.0M } | | |--+ AC_case { Int32 2496036 LZMA_ra(9.24%), 900.6K } | | |--+ AC_control { Int32 2496036 LZMA_ra(19.1%), 1.8M } | | |--+ AC_Het_case { Int32 2496036 LZMA_ra(8.40%), 818.7K } | | |--+ AC_Het_control { Int32 2496036 LZMA_ra(17.7%), 1.7M } | | |--+ AC_Het { Int32 2496036 LZMA_ra(17.7%), 1.7M } | | |--+ AF_case { Float32 2496036 LZMA_ra(13.4%), 1.3M } | | |--+ AF_control { Float32 2496036 LZMA_ra(34.4%), 3.3M } | | |--+ HWE_case { Float32 2496036 LZMA_ra(7.81%), 761.3K } | | |--+ HWE_control { Float32 2496036 LZMA_ra(22.8%), 2.2M } | | |--+ HWE { Float32 2496036 LZMA_ra(23.6%), 2.2M } | | |--+ FunctionalAnnotation [ spec_tbl_df,tbl_df,tbl,data.frame,list ] | | | |--+ VarInfo { Str8 2496036 LZMA_ra(17.0%), 6.7M } | | | |--+ apc_conservation { Float64 2496036 LZMA_ra(79.1%), 15.1M } | | | |--+ apc_epigenetics { Float64 2496036 LZMA_ra(81.4%), 15.5M } | | | |--+ apc_epigenetics_active { Float64 2496036 LZMA_ra(75.9%), 14.5M } | | | |--+ apc_epigenetics_repressed { Float64 2496036 LZMA_ra(50.5%), 9.6M } | | | |--+ apc_epigenetics_transcription { Float64 2496036 LZMA_ra(43.6%), 8.3M } | | | |--+ apc_local_nucleotide_diversity { Float64 2496036 LZMA_ra(79.9%), 15.2M } | | | |--+ apc_mappability { Float64 2496036 LZMA_ra(29.7%), 5.7M } | | | |--+ apc_protein_function { Float64 2496036 LZMA_ra(2.15%), 419.9K } | | | |--+ apc_transcription_factor { Float64 2496036 LZMA_ra(7.10%), 1.4M } | | | |--+ cage_tc { Str8 2496036 LZMA_ra(5.19%), 183.1K } | | | |--+ metasvm_pred { Str8 2496036 LZMA_ra(0.82%), 20.1K } | | | |--+ rsid { Str8 2496036 LZMA_ra(36.4%), 9.4M } | | | |--+ fathmm_xf { Float64 2496036 LZMA_ra(50.6%), 9.6M } | | | |--+ genecode_comprehensive_category { Str8 2496036 LZMA_ra(0.57%), 147.1K } | | | |--+ genecode_comprehensive_info { Str8 2496036 LZMA_ra(6.10%), 3.2M } | | | |--+ genecode_comprehensive_exonic_category { Str8 2496036 LZMA_ra(1.19%), 34.6K } | | | |--+ genecode_comprehensive_exonic_info { Str8 2496036 LZMA_ra(7.53%), 489.1K } | | | |--+ genehancer { Str8 2496036 LZMA_ra(0.39%), 365.7K } | | | |--+ linsight { Float64 2496036 LZMA_ra(21.7%), 4.1M } | | | |--+ cadd_phred { Float64 2496036 LZMA_ra(23.7%), 4.5M } | | | --+ rdhs { Str8 2496036 LZMA_ra(3.81%), 317.0K } | | --+ AVGDP { Int32 2496036 LZMA_ra(40.8%), 3.9M } * | --+ format [ ] --+ sample.annotation [ ] `

The sgrm matrix is 1026x1026 . I made this using --degree 2 (everything else the same, I know non of my cases are related, and only 1 pair of controls are 1st degree relatives, which the SGRM pipeline identified)

> sgrm[100:105,100:105]  ### Example of few lines of the matrix
> 6 x 6 sparse Matrix of class "dsCMatrix"
>           0_HG00151 0_HG00154 0_HG00155 0_HG00157 0_HG00158 0_HG00159
> 0_HG00151 0.5043822 .         .         .         .         .        
> 0_HG00154 .         0.5009275 .         .         .         .        
> 0_HG00155 .         .         0.4998904 .         .         .        
> 0_HG00157 .         .         .         0.4966408 .         .        
> 0_HG00158 .         .         .         .         0.4979511 .        
> 0_HG00159 .         .         .         .         .         0.5044529

Phenotype has 1004 samples, with a binary variable, which I've created as follows:


tail(phenotype)   #### The PCs here are those that are generated by the SGRM output .score file
                   FID SNPSEX pheno group_race         PC1        PC2           PC3
999  0_NA20821      2     1        EUR -0.02332929 0.01675661 -0.0010476483
1000 0_NA20822      2     1        EUR -0.02330287 0.01741464  0.0001754674
1001 0_NA20826      2     1        EUR -0.02354092 0.01799696 -0.0002679973
1002 0_NA20827      1     1        EUR -0.02345230 0.01764050 -0.0008549907
1003 0_NA20828      2     1        EUR -0.02335791 0.01736181 -0.0007909704
1004 0_NA20832      2     1        EUR -0.02320004 0.01698723  0.0004000635
              PC4          PC5         PC6           PC7           PC8
999  -0.003812300 -0.007291464 -0.04352340  0.0010770846 -0.0009705049
1000 -0.003096957 -0.006086234 -0.04654090  0.0005451567 -0.0014752338
1001 -0.004336742 -0.004473674 -0.04533632  0.0002382286  0.0016030985
1002 -0.003471963 -0.004603992 -0.03881987 -0.0001682949 -0.0005360977
1003 -0.003189036 -0.004899655 -0.04413306  0.0002276412 -0.0001268270
1004 -0.003445883 -0.008064412 -0.04676934  0.0007819886 -0.0020845665
....
... PC9 to PC20 ....
```> 

My genotyping data is a mix of Europeans, hispanics, and African races, which I've coded as the "group_race" in the phenotype file. 
From another answered question I understood that "groups" need only be used if my phenotype is a continuous variable.
So I skipped it.
`## fit null model
obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10, data=phenotype,kins=sgrm,use_sparse=TRUE,
kins_cutoff=0.022,id="FID",family=gaussian(link="identity"),verbose=TRUE)
`

This is the output. I am wondering if this is reasonable. 
obj_nullmodel$coefficients
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 1.35457458 -0.22051452 -2.45884174  1.31956283 -0.12508581  0.21253838 
        PC5         PC6         PC7         PC8         PC9        PC10 
 0.40876625  1.82775475 -0.27736805 -0.14491347  0.03741783 -0.01117903 

table( obj_nullmodel$residuals )

   0 
1004 
table( obj_nullmodel$scaled.residuals)
< table of extent 0 >
head(obj_nullmodel$scaled.residuals)
  1   2   3   4   5   6 
NaN NaN NaN NaN NaN NaN 
> obj_nullmodel$converged
[1] TRUE
>  obj_nullmodel$sparse_kins
[1] TRUE
>  obj_nullmodel$relatedness
[1] TRUE
>  obj_nullmodel$use_SPA
[1] FALSE

I'm concerned because when I ran the Individual analysis, I got NaN Pvalues for all variants.
head analysis/results_individual_analysis_sig.csv 
"","CHR","POS","REF","ALT","ALT_AF","MAF","N","pvalue","pvalue_log10","Score","Score_se","Est","Est_se"
"NA",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.1",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.2",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.3",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.4",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.5",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
"NA.6",NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

Here is a one part of the individual analysis just to see the NaNs
ind<- get(load("analysis/DILIN_Individual_Analysis_78.Rdata"))
head(ind)
    CHR      POS REF ALT     ALT_AF        MAF    N pvalue pvalue_log10 Score Score_se Est. Est_se
989   4 70019914   G   A 0.02442672 0.02442672 1004    NaN          NaN   NaN. 26.30146 NaN 0.03802071
1     4 70019928   G   A 0.55276382 0.44723618 1004    NaN          NaN   NaN.    77.90924 NaN 0.01283545
990   4 70020508  GA   G 0.01846307 0.01846307 1004    NaN          NaN   NaN. 22.09448 NaN 0.04526017
2     4 70020586   G   A 0.06000000 0.06000000 1004    NaN          NaN   NaN.   38.48158 NaN 0.02598646
3     4 70020606   T   C 0.09450000 0.09450000 1004    NaN          NaN   NaN 46.11840 NaN 0.02168332
4     4 70020771   G   A 0.09530938 0.09530938 1004    NaN          NaN   NaN.    46.27219 NaN 0.02161125

samreenzafer commented 4 months ago

I would like to add that my WGS VCFs did not have the AVGDP annotation, so I added that manually by copying the DP values for every variant as follows for every chromosome.

`

open GDS

gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2) genofile <- seqOpen(gds.path, readonly = FALSE)

avgdp<- seqGetData(genofile , "annotation/info/DP") print( length(avgdp)) # for chr22 [1] 730895 seqAddValue(genofile, "annotation/info/AVGDP", avgdp ,"dummy sequencing depth added by samreen") seqClose(genofile)

`

xihaoli commented 3 months ago

Hi @samreenzafer,

Thanks for reaching out. It's OK that your aGDS files do not have a specific annotation named AVGDP (average DP), and what you have done makes sense. Generally, we would want to use a set of high-quality genetic variants to compute ancestry PCs and sparse GRM for null model fitting, and AVGDP is one of the filtering criteria.

For your question, I wonder whether the sample.id from your phenotype data can be matched with the sample.id stored in your aGDS file. It is also OK if your phenotype sample ids are a subset of the sample ids in the aGDS files. Could you please double check?

Best, Xihao

samreenzafer commented 3 months ago

Thank you for responding.

for the sampleID question , yes they are overlapping.

phenotype.id <- obj_nullmodel$id_include genotype.id<- seqGetData(genofile, "sample.id") print(length(phenotype.id)) [1] 1004 print(length(seqGetData(genofile, "sample.id"))) [1] 1026 length(intersect( phenotype.id,genotype.id)) [1] 1004

xihaoli commented 3 months ago

It seems like your outcome variable is binary, and if so, you should use family=binomial(link="logit") instead of family=gaussian(link="identity") and make sure that your outcome is coded as numeric 0/1 instead of a factor variable. Also, while it is true that "groups" need only be used if your outcome is a continuous variable, you could still consider adding your "group" variable as a fixed effect term in your logistic mixed model.

Hope this helps.

samreenzafer commented 3 months ago

For group, do you mean I should use my "group_race" variable from the phenotype file for "groups"

I will try thefamily=binomial(link="logit") and let you know. And yes my outcome variable was coded as 1/2 instead of 0/1. I will also change that.

I hadn't seen any instructions on the tutorial page to format the phenotype file, and had just assumed the plink-like notation. That was my bad.

samreenzafer commented 3 months ago

I got this error while using the suggested logit

Error in glmmkin(fixed = fixed, data = data, kins = kins, id = id, random.slope = random.slope, : Error: heteroscedastic linear mixed models are only applicable when "family" is gaussian. Calls: fit_nullmodel -> glmmkin Execution halted

obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", groups="group_race",family=binomial(link="logit"),verbose=TRUE)

xihaoli commented 3 months ago

Hi @samreenzafer, for binary traits, please leave groups parameter as NULL.

samreenzafer commented 3 months ago

I'm not sure what I'm doing wrong but I get errors.

obj_nullmodel <- fit_nullmodel(pheno~SNPSEX+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+as.factor(group_race), data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID", family=binomial(link="logit"),verbose=TRUE, groups=NULL) it ran for 3 iterations and then gave the error Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'chol2inv': not a positive definite matrix Calls: fit_nullmodel ... chol2inv -> chol -> chol -> .local -> as -> asMethod Execution halted

Then I edited the command to remove theas.factor(group_race)

obj_nullmodel <- fit_nullmodel(pheno~as.factor(SNPSEX)+PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
                                family=binomial(link="logit"),verbose=TRUE, groups=NULL)

And I still get an error, but after 500 iterations.


[1] "kins is a sparse matrix."
Fixed-effect coefficients:
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 16.0408287 -18.3676970 -27.0664801   9.4036154  -5.5903104   3.7215324 
        PC5         PC6         PC7         PC8         PC9        PC10 
  6.9741816  20.4901888  -6.8478820  -2.8771294   0.6520779   1.4433908 

Iteration  1 :
Variance component estimates:
[1] 1.000000 7.234874
Fixed-effect coefficients:
 [1]  17.1054427 -19.5066252 -30.2674194   9.8406604  -6.9357156   4.8385081
 [7]   8.1287788  24.0488017 -12.4647225  -3.3129108  -0.1664124   2.1897960
...
...
...
...

Iteration  499 :
Variance component estimates:
[1] 1.000000 1.040721
Fixed-effect coefficients:
 [1]  12.9362534 -15.2646269 -27.1436580   9.3835255  -5.6920925   3.8344001
 [7]   7.0144809  20.7832368  -7.6375962  -2.8396810   0.5219339   1.5484195

Iteration  500 :
Variance component estimates:
[1] 1.0000000 0.5132395
Fixed-effect coefficients:
 [1]  13.9159314 -16.2561242 -27.1981325   9.4333121  -5.6752385   3.8187186
 [7]   7.0241264  20.8224936  -7.4864314  -2.8504011   0.5512372   1.5232672
Fixed-effect coefficients:
(Intercept)      SNPSEX         PC1         PC2         PC3         PC4 
 16.0408287 -18.3676970 -27.0664801   9.4036154  -5.5903104   3.7215324 
        PC5         PC6         PC7         PC8         PC9        PC10 
  6.9741816  20.4901888  -6.8478820  -2.8771294   0.6520779   1.4433908 

Iteration  1 :
Error: Not a matrix.
In addition: Warning message:
Average Information REML not converged, refitting model using Brent method... 
Execution halted

I don't understand how the program runs, so not sure what is it that is " not a matrix " here.

samreenzafer commented 3 months ago

Is it even okay for me to have all 3 of my ethnic group of samples analysed together (the creating of the SGRM, fitting null model and all the variant analyses)? Or should this pipeline be run separately for the 3 ethnic groups?

samreenzafer commented 3 months ago

And here is my GRM object


> grm<- get(load("chrall.degree2.sparseGRM.sGRM.RData"))
> str(grm)
Formal class 'dsCMatrix' [package "Matrix"] with 7 slots
  ..@ i       : int [1:1027] 0 0 1 2 3 4 5 6 7 8 ...
  ..@ p       : int [1:1027] 0 1 3 4 5 6 7 8 9 10 ...
  ..@ Dim     : int [1:2] 1026 1026
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
  .. ..$ : chr [1:1026] "0_NA19331" "0_NA19334" "0_BKP000674" "0_BKP000684" ...
  ..@ x       : num [1:1027] 0.5 0.23 0.494 0.499 0.497 ...
  ..@ uplo    : chr "U"
  ..@ factors : list()

> min(grm@x)
[1] 0.2297566
> max(grm@x)
[1] 0.5668696

Which I had created using the following command.

Rscript ../extdata/runPipeline_wrapper.R --prefix.in chrall_pruned --prefix.in.unfiltered chrall --prefix.out chrall.degree2.sparseGRM --degree 2 --num_threads 8 --no_pcs 20 --KINGformat.out TRUE

outputting chrall.degree2.sparseGRM.kins chrall.degree2.sparseGRM.score chrall.degree2.sparseGRM.sGRM.RData

samreenzafer commented 3 months ago

Hi @xihaoli

I tried a few things, and realized that the "SNPSEX" in the phenotype is causing the problems for me.

I refitted the null object without using SNPSEX as a covariate (also did NOT use group_race)

obj_nullmodel <- fit_nullmodel(pheno~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10 ,
                               data=phenotype,kins=sgrm,use_sparse=TRUE,kins_cutoff=0.022,id="FID",
                                family=binomial(link="logit"),verbose=TRUE, groups=NULL)

The code ran to completion in only 16 iterations (as opposed to 500 iterations when I used SNPSEX which had given me the "not a matrix" error).

Then I tried one randome Individual_Analysis on arrayid=130 Rscript STAARpipeline_Individual_Analysis_newSampleIDs.noSexnoRace.R 130

which gave the output log

[1] 1004
[1] 1026
[1] "IDs of phenotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] "IDs of genotype: "
[1] "0_NA20821" "0_NA20822" "0_NA20826" "0_NA20827" "0_NA20828" "0_NA20832"
[1] 1004
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 5,000
# of selected samples: 1,026
# of selected variants: 2,941,749
# of selected samples: 1,004
# of selected variants: 2,290
# of selected samples: 1,026
# of selected variants: 2,941,749
Time difference of 14.57882 secs

and output of


> df<- get(load("analysis/Individual_Analysis.noSexnoRace_130.Rdata"))
> head(df)
  CHR      POS    REF ALT     ALT_AF        MAF    N    pvalue pvalue_log10
1   7 20022978      G   A 0.08850000 0.08850000 1004 0.1671100   0.77699757
2   7 20023277 AAAAAC   A 0.18661972 0.18661972 1004 0.7529397   0.12323983
3   7 20023350      G   A 0.08133733 0.08133733 1004 0.7110876   0.14807688
4   7 20026620      T   C 0.05700000 0.05700000 1004 0.7886436   0.10311921
5   7 20026886     CT   C 0.19687815 0.19687815 1004 0.7687004   0.11424288
6   7 20027095      T   C 0.08941059 0.08941059 1004 0.9202903   0.03607516
       Score Score_se         Est    Est_se
1  3.2924486 2.383156  0.57971434 0.4196116
2 -1.4022395 4.454869 -0.07065652 0.2244735
3 -1.1257320 3.039268 -0.12187006 0.3290266
4  0.8054780 3.004703  0.08921759 0.3328116
5 -1.3251899 4.506296 -0.06525875 0.2219118
6  0.3447801 3.445455  0.02904349 0.2902374

which now has numeric Pvalues in the PValue column. I think I can now run for all the entire job array.

Question: Why do you think SNPSEX could be causing a problem? This is what my phenotype file is coded as

FID SNPSEX  pheno   group_race  PC1 PC2 ...PC10
0_NA20814   1   0   EUR -0.0233194936208413 0.0177841478492174
0_NA20815   1   0   EUR -0.0232876963243618 0.0178482237390411
0_NA20818   2   0   EUR -0.023596640369937  0.0174856248889635
0_NA20819   2   0   EUR -0.0235735234473422 0.016707921936772
0_NA20821   2   0   EUR -0.0233292853163099 0.0167566107200051
0_NA20822   2   0   EUR -0.0233028712187108 0.0174146357365599
0_NA20826   2   0   EUR -0.0235409212779821 0.0179969624535894
0_NA20827   1   0   EUR -0.0234523001514743 0.0176404968800983
0_NA20828   2   0   EUR -0.0233579084370098 0.0173618077100926
0_NA20832   2   0   EUR -0.0232000420926164 0.0169872294492342

where SNPSEX is 1 (for male) or 2 (for female), pheno is 0 (control) or 1 (case) - I made this change after you corrected my error.

xihaoli commented 3 months ago

Hi @samreenzafer,

Thanks for following up. First of all, your output log of arrayid=130 looks good to me, so you may proceed with running other arrayid's and summarize/visualize all results.

For your question, yes it is possible that the non-convergence issue is on SNPSEX due to the collinearity between SNPSEX and intercept or some PCs. A similar issue is here. Could you please check?

Lastly, even if you end up not including SNPSEX, you may also consider adding as.factor(group_race) to the fixed effect formula when fitting the null model.

samreenzafer commented 3 months ago

That makes sense. The cases in our cohort are all male, whereas controls have males and females (only for the European cohort). You can see the tables below. Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort. So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.


pheno == 0. (i.e. controls)   
                        AFR EUR HISP
  Num males    321 213  169
  Num females    0 212    0

pheno == 1 (i.e. cases)   
                      AFR EUR HISP
  Num males  12  61   16
  Num males   0   0    0

xihaoli commented 3 months ago

Hi @samreenzafer,

Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX in the null model fitting, because knowing a sample with SNPSEX=2 guarantees that pheno=0.

Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort.
So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.

This sounds good.

samreenzafer commented 3 months ago

Thank you again for all your help, I’ve begun running the association analyses and so far so good, I must say that the pipeline steps have been implemented so well that tweaking for my directory structure and filenames etc is very easy.

We are interested in Noncoding gene region and hence using your pipeline.

Regards,

Samreen

On Wed, Mar 13, 2024 at 12:53 PM Xihao Li @.***> wrote:

Hi @samreenzafer https://github.com/samreenzafer,

Thanks for checking. It all makes sense now. Here you may not be able to include SNPSEX in the null model fitting, because knowing a sample with SNPSEX=2 guarantees that pheno=0.

Also we have small number of cases in the Afr and Hispanic cohorts as opposed to the European cohort. So I think adding as.factor(group_race) to the fixed effect formula when fitting the null model , and Excluding SNPSEX makes most sense.

This sounds good.

— Reply to this email directly, view it on GitHub https://github.com/xihaoli/STAARpipeline-Tutorial/issues/49#issuecomment-1995573028, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADLD6ZQG7SJZWO7CGAHJPTYYCVCBAVCNFSM6AAAAABERETKGOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJVGU3TGMBSHA . You are receiving this because you were mentioned.Message ID: @.***>

xihaoli commented 3 months ago

Hi Samreen,

This is great to hear. You may close this case for now, and if you have any questions, please feel free to reopen it.

Best, Xihao

xihaoli / STAARpipeline-Tutorial

fit_nullmodel Output is mostly Null and 0 #49

open GDS