zgompert / bgc-hm

R package for Bayesian genomic analyses of hybrid zones, including Bayesian genomic clines
GNU General Public License v3.0
2 stars 0 forks source link

Error when estimating hybrid index using default settings #4

Open selasphoruskershaw opened 1 month ago

selasphoruskershaw commented 1 month ago

Hello,

I successfully ran est_p, and am now trying to estimate the hybrid index using default settings. The data for the putative hybrids is formatted the same way as the parentals, but I am getting an error:

1) I successfully ran this:

estimate parental allele frequencies, uses default HMC settings

p_out<-est_p(G0=gt.alhu,G1=gt.ruhu,model="genotype",ploidy="diploid")

2) I received the error here:

estimate hybrid indexes, uses default HMC settings

and uses point estimates (posterior medians) of allele frequencies

h_out<-est_hi(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid")

The error: "Error : Exception: variable does not exist; processing stage=data initialization; variable name=L; base type=int (in 'hi', line 14, column 1 to column 7) failed to create the sampler; sampling not done Stan model 'hi' does not contain samples. Error in apply(rstan::extract(fit, "H")[[1]], 2, quantile, probs = c(0.5, : dim(X) must have a positive length"

I'd greatly appreciate any help you can provide - the package seems very promising.

zgompert commented 1 month ago

Hi, I think the code is failing to get the dimensions for Gx. What do you get from:dim(gt.hybrids)ZachSent from my iPhoneOn Jul 17, 2024, at 9:24 PM, selasphoruskershaw @.***> wrote: Hello, I successfully ran est_p, and am now trying to estimate the hybrid index using default settings. The data for the putative hybrids is formatted the same way as the parentals, but I am getting an error:

I successfully ran this:

estimate parental allele frequencies, uses default HMC settings p_out<-est_p(G0=gt.alhu,G1=gt.ruhu,model="genotype",ploidy="diploid")

I received the error here:

estimate hybrid indexes, uses default HMC settings and uses point estimates (posterior medians) of allele frequencies h_out<-est_hi(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid") The error: "Error : Exception: variable does not exist; processing stage=data initialization; variable name=L; base type=int (in 'hi', line 14, column 1 to column 7) failed to create the sampler; sampling not done Stan model 'hi' does not contain samples. Error in apply(rstan::extract(fit, "H")[[1]], 2, quantile, probs = c(0.5, : dim(X) must have a positive length" I'd greatly appreciate any help you can provide - the package seems very promising.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>

selasphoruskershaw commented 1 month ago

Hello,

I think I figured out what is going on, but I'm unsure exactly how to solve it. When I enter dim(gt.hybrids), I get "NULL".

There is an issue after I convert my vcfs into genind objects. For example, with "gt.ruhu", which is a genind object converted from my vcf file of one of the parental species, I have a character matrix of genotypes: "chr [1:13268, 1:27] "0/1" "0/1"" etc

I convert these data to an integer form to make the dataset compatible with bgchm using the code below: gt.ruhu <- extract.gt(vcf.ruhu) gt.ruhu[gt.ruhu == "0/0"]<-0 gt.ruhu[gt.ruhu == "0/1"]<-1 gt.ruhu[gt.ruhu == "1/1"]<-2 gt.ruhu <- as.integer(gt.ruhu)

I then have a matrix that reads as follows for gt.ruhu: "int [1:358236] 0 0 0 1 2" etc

This happens for both parentals, and for my hybrids, and clearly appears to be the source of the issue.

So while est_p initially works, it is essentially estimating nonsense, and this is why I believe that est_hi does not work downstream.

I've tried many things over the last few days to correct this - do you think you can point me in the right direction?

Thanks again!

zgompert commented 1 month ago

Hi.  You are not constructing a matrix of SNPs and individuals. You have collapsed things down to a single dimension after as.integer (maybe sooner). One solution would be to create a standard matrix in R and then assign 0, 1 and 2 to that based on the logic you are currently using with the character matrix. Does that make sense?ZachSent from my iPhoneOn Jul 22, 2024, at 4:33 PM, selasphoruskershaw @.***> wr Hello, I think I figured out what is going on, but I'm unsure exactly how to solve it. When I enter dim(gt.hybrids), I get "NULL". There is an issue after I convert my vcfs into genind objects. For example, with "gt.ruhu", which is a genind object converted from my vcf file of one of the parental species, I have a character matrix of genotypes: "chr [1:13268, 1:27] "0/1" "0/1"" etc I convert these data to an integer form to make the dataset compatible with bgchm using the code below: gt.ruhu <- extract.gt(vcf.ruhu) gt.ruhu[gt.ruhu == "0/0"]<-0 gt.ruhu[gt.ruhu == "0/1"]<-1 gt.ruhu[gt.ruhu == "1/1"]<-2 gt.ruhu <- as.integer(gt.ruhu) I then have a matrix that reads as follows for gt.ruhu: "int [1:358236] 0 0 0 1 2" etc This happens for both parentals, and for my hybrids, and clearly appears to be the source of the issue. So while est_p initially works, it is essentially estimating nonsense, and this is why I believe that est_hi does not work downstream. I've tried many things over the last few days to correct this - do you think you can point me in the right direction? Thanks again!

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

selasphoruskershaw commented 1 month ago

Yes, this makes sense, thank you. I tried this before, and everything appears correct, but the data still appear to be in character format, in comparison to the tutorial dataset (GenHybrids, GenP0, GenP1), which are in integer format (see screenshot below). In character format, when I try to run est_p, I get the error below:

Error : Exception: variable does not exist; processing stage=data initialization; variable name=G0; base type=int (in 'p', line 5, column 1 to column 33) failed to create the sampler; sampling not done Stan model 'p' does not contain samples. Error in apply(rstan::extract(fit, "P0")[[1]], 2, quantile, probs = c(0.5, : dim(X) must have a positive length

After applying as.integer to each character matrix, est.p works, but the matrix is obviously jumbled after applying as.integer, as mentioned before. Is there a workaround to get my data in the correct format for est.p? Thank you.

Screen Shot 2024-07-23 at 10 08 55 AM
zgompert commented 1 month ago

Hi, I am away on field work for 4-5 days, but if you send me your data and the code you are using (maybe via a link or something) I can help with the formatting and send you the code to do it when I get back.ZachSent from my iPhoneOn Jul 23, 2024, at 11:14 AM, selasphoruskershaw @.***> wrote: Yes, this makes sense, thank you. I tried this before, and everything appears correct, but the data still appear to be in character format, in comparison to the tutorial dataset (GenHybrids, GenP0, GenP1), which are in integer format (see screenshot below). In character format, when I try to run est_p, I get the error below: Error : Exception: variable does not exist; processing stage=data initialization; variable name=G0; base type=int (in 'p', line 5, column 1 to column 33) failed to create the sampler; sampling not done Stan model 'p' does not contain samples. Error in apply(rstan::extract(fit, "P0")[[1]], 2, quantile, probs = c(0.5, : dim(X) must have a positive length After applying as.integer to each character matrix, est.p works, but the matrix is obviously jumbled after applying as.integer, as mentioned before. Is there a workaround to get my data in the correct format for est.p? Thank you. Screen.Shot.2024-07-23.at.10.08.55.AM.png (view on web)

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

selasphoruskershaw commented 1 month ago

Thank you so much. A subset of my data is attached below. Below is the code I've been using - I've tried multiple variations, but don't include them here for simplicity.

run bgchm

generate a cline from a filtered, biallelic vcf

library("vcfR") library("adegenet") library("SNPfiltR") library("data.table") library("bgchm")

import vcfs

vcf.ruhu <-read.vcfR("ruhu.phase2.Z.vcf.gz") vcf.alhu <-read.vcfR("alhu.phase2.Z.vcf.gz") vcf.hybrids <-read.vcfR("hybrids.phase2.Z.vcf.gz")

convert vcfs to genind objects

gt.ruhu <- extract.gt(vcf.ruhu) gt.ruhu[gt.ruhu == "0/0"]<-0 gt.ruhu[gt.ruhu == "0/1"]<-1 gt.ruhu[gt.ruhu == "1/1"]<-2

gt.alhu <- extract.gt(vcf.alhu) gt.alhu[gt.alhu == "0/0"]<-0 gt.alhu[gt.alhu == "0/1"]<-1 gt.alhu[gt.alhu == "1/1"]<-2

gt.hybrids <- extract.gt(vcf.hybrids) gt.hybrids[gt.hybrids == "0/0"]<-0 gt.hybrids[gt.hybrids == "0/1"]<-1 gt.hybrids[gt.hybrids == "1/1"]<-2

dim(gt.hybrids)

estimate parental allele frequencies, uses default HMC settings

p_out<-est_p(G0=gt.alhu,G1=gt.ruhu,model="genotype",ploidy="diploid") p_out

estimate hybrid indexes, uses default HMC settings

and uses point estimates (posterior medians) of allele frequencies

h_out<-est_hi(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid")

plot hybrid index estimates with 90% equal-tail probability intervals

sorted by hybrid index, just a nice way to visualize that in this example we have

few hybrids with intermediate hybrid indexes

plot(sort(h_out$hi[,1]),ylim=c(0,1),pch=19,xlab="Individual (sorted by HI)",ylab="Hybrid index (HI)") segments(1:100,h_out$hi[order(h_out$hi[,1]),3],1:100,h_out$hi[order(h_out$hi[,1]),4])

estimate interspecific ancestry, this can

be especially informative about the types of hybrids present

q_out<-est_Q(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid")

hybrids.phase2.Z.vcf.gz alhu.phase2.Z.vcf.gz ruhu.phase2.Z.vcf.gz

zgompert commented 1 month ago

Hi, I am attaching a modified version of your script that properly creates the matrixes.

I also made a few other tweaks. First, with known genotypes you can quickly get posteriors on the allele frequencies analytically. I have implemented this but haven't yet pushed it to the public version. I did it both ways in the code I included. The analytical approach is much, much quicker. I suggest just doing this (this will be updated online in the next few days).

Second, after computing allele frequencies, I identify ancestry informative loci (those with some minimal difference between the parents) and use those for hybrid index estimation. This appears to run fine for me.

Cheers, Zach

On Wed, Jul 24, 2024 at 12:51 PM selasphoruskershaw < @.***> wrote:

Thank you so much. A subset of my data is attached here. Below is the code I've been using - I've tried multiple variations, but don't include them here for simplicity.

run bgchm

generate a cline from a filtered, biallelic vcf

library("vcfR") library("adegenet") library("SNPfiltR") library("data.table") library("bgchm")

import vcfs

vcf.ruhu <-read.vcfR("ruhu.phase2.Z.vcf.gz") vcf.alhu <-read.vcfR("alhu.phase2.Z.vcf.gz") vcf.hybrids <-read.vcfR("hybrids.phase2.Z.vcf.gz")

convert vcfs to genind objects

gt.ruhu <- extract.gt(vcf.ruhu) gt.ruhu[gt.ruhu == "0/0"]<-0 gt.ruhu[gt.ruhu == "0/1"]<-1 gt.ruhu[gt.ruhu == "1/1"]<-2

gt.alhu <- extract.gt(vcf.alhu) gt.alhu[gt.alhu == "0/0"]<-0 gt.alhu[gt.alhu == "0/1"]<-1 gt.alhu[gt.alhu == "1/1"]<-2

gt.hybrids <- extract.gt(vcf.hybrids) gt.hybrids[gt.hybrids == "0/0"]<-0 gt.hybrids[gt.hybrids == "0/1"]<-1 gt.hybrids[gt.hybrids == "1/1"]<-2

dim(gt.hybrids) estimate parental allele frequencies, uses default HMC settings

p_out<-est_p(G0=gt.alhu,G1=gt.ruhu,model="genotype",ploidy="diploid") p_out estimate hybrid indexes, uses default HMC settings and uses point estimates (posterior medians) of allele frequencies

h_out<-est_hi(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid") plot hybrid index estimates with 90% equal-tail probability intervals sorted by hybrid index, just a nice way to visualize that in this example we have few hybrids with intermediate hybrid indexes

plot(sort(h_out$hi[,1]),ylim=c(0,1),pch=19,xlab="Individual (sorted by HI)",ylab="Hybrid index (HI)")

segments(1:100,h_out$hi[order(h_out$hi[,1]),3],1:100,h_out$hi[order(h_out$hi[,1]),4]) estimate interspecific ancestry, this can be especially informative about the types of hybrids present

q_out<-est_Q(Gx=gt.hybrids,p0=p_out$p0[,1],p1=p_out$p1[,1],model="genotype",ploidy="diploid")

hybrids.phase2.Z.vcf.gz https://github.com/user-attachments/files/16366401/hybrids.phase2.Z.vcf.gz alhu.phase2.Z.vcf.gz https://github.com/user-attachments/files/16366402/alhu.phase2.Z.vcf.gz ruhu.phase2.Z.vcf.gz https://github.com/user-attachments/files/16366404/ruhu.phase2.Z.vcf.gz

— Reply to this email directly, view it on GitHub https://github.com/zgompert/bgc-hm/issues/4#issuecomment-2248692200, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHTKRHRYQIORN6KZIP35IMDZN7ZR5AVCNFSM6AAAAABLBZMZFOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBYGY4TEMRQGA . You are receiving this because you commented.Message ID: @.***>

selasphoruskershaw commented 1 month ago

Good evening! Thank you for your message - unfortunately I don't see a script attached. Do you think you might be able to resend it?

zgompert commented 1 month ago

Hmm, it looks like it doesn’t work as an attachment over GitHub. I am going to paste the code below, but can also send it in a text file if you have an email address I can send it to.#run bgchm#generate a cline from a filtered, biallelic vcflibrary("vcfR")library("bgchm")#import vcfsvcf.ruhu <-read.vcfR("ruhu.phase2.Z.vcf.gz")vcf.alhu <-read.vcfR("alhu.phase2.Z.vcf.gz")vcf.hybrids <-read.vcfR("hybrids.phase2.Z.vcf.gz")#convert vcfs to genind objectsgt.ruhu.tmp <- extract.gt(vcf.ruhu)L<-dim(gt.ruhu.tmp)[1]N<-dim(gt.ruhu.tmp)[2]gt.ruhu<-matrix(NA,nrow=L,ncol=N)gt.ruhu[gt.ruhu.tmp == "0/0"]<-0gt.ruhu[gt.ruhu.tmp == "0/1"]<-1gt.ruhu[gt.ruhu.tmp == "1/1"]<-2gt.alhu.tmp <- extract.gt(vcf.alhu)L<-dim(gt.alhu.tmp)[1]N<-dim(gt.alhu.tmp)[2]gt.alhu<-matrix(NA,nrow=L,ncol=N)gt.alhu[gt.alhu.tmp == "0/0"]<-0gt.alhu[gt.alhu.tmp == "0/1"]<-1gt.alhu[gt.alhu.tmp == "1/1"]<-2gt.hybrids.tmp <- extract.gt(vcf.hybrids)L<-dim(gt.hybrids.tmp)[1]N<-dim(gt.hybrids.tmp)[2]gt.hybrids<-matrix(NA,nrow=L,ncol=N)gt.hybrids[gt.hybrids.tmp == "0/0"]<-0gt.hybrids[gt.hybrids.tmp == "0/1"]<-1gt.hybrids[gt.hybrids.tmp == "1/1"]<-2dim(gt.hybrids)#estimate parental allele frequencies, uses default HMC settings# note transposing genotype matrixes because program expecting row=ind col=locip_out<-est_p(G0=t(gt.alhu),G1=t(gt.ruhu),model="genotype",ploidy="diploid")# quik version to get point estimates for known genotypes so I can test this,# going to add this as a functionp_0<-(apply(t(gt.alhu),2,sum)+0.5)/(2dim(gt.alhu)[2]+1)p_1<-(apply(t(gt.ruhu),2,sum)+0.5)/(2dim(gt.ruhu)[2]+1)# identify ancestry informative SNPs, let's go with an allele frequency difference# of 0.2 (2341 SNPs), 0.3 would give 1014 SNPs, and 0.1 5419... # could use 0.1 for clines or stick with 0.3dp<-abs(p_0-p_1)anc<-which(dp > 0.2)#estimate hybrid indexes, uses default HMC settings#and uses point estimates (posterior medians) of allele frequenciesh_out<-est_hi(Gx=t(gt.hybrids)[,anc],p0=p_out$p0[anc,1],p1=p_out$p1[anc,1],model="genotype",ploidy="diploid")# version using point estimates obtained quickly/analytically as genotypes knownh_out<-est_hi(Gx=t(gt.hybrids)[,anc],p0=p_0[anc],p1=p_1[anc],model="genotype",ploidy="diploid")Sent from my iPhoneOn Aug 1, 2024, at 10:45 PM, selasphoruskershaw @.***> wrote: Good evening! Thank you for your message - unfortunately I don't see a script attached. Do you think you might be able to resend it?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

selasphoruskershaw commented 1 month ago

Update: I've gotten the script you sent over to work!

One follow-up question - if I try a VCF with a different chromosome, I get the following error:

code: p_out<-est_p(G0=t(gt.alhu),G1=t(gt.ruhu),model="genotype",ploidy="diploid")

Error in FUN(X[[i]], ...) : Stan does not support NA (in G0) in data failed to preprocess the data; sampling not done Stan model 'p' does not contain samples. Error in apply(rstan::extract(fit, "P0")[[1]], 2, quantile, probs = c(0.5, : dim(X) must have a positive length

I checked my matrices and there are NA peppered throughout. This is not the case with the initial VCF files I got to successfully run (with the code you provided). Can you advise for dealing these NA?

Thanks again!

selasphoruskershaw commented 1 month ago

Update: I ran the following code to generate a cline: gc_out<-est_genocl(Gx=t(gt.hybrids),p0=p_out$p0[,1],p1=p_out$p1[,1],H=h_out$hi[,1],model="genotype",ploidy="diploid",hier=TRUE,n_iters=4000)

It took about 4 days to run, then appeared to complete, but then gave this set of errors. Any help is appreciated - thanks again:

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal Calls: ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize Execution halted Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal Calls: ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize Execution halted Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal Calls: ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize Execution halted Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal Calls: ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize Execution halted