Closed Thatguy027 closed 3 years ago
“ Any clue as to what I might try to get around the knn error? I have obviously tried reducing it to 1.”
Hi, have you input your labeled data (training groups, labels yi= (1, …, c)), are there any groups that have only one sample?
Apparently, if you read the paper, knn< c. If there are some groups only having 1 sample/individual, this group is not well represented. It is suggested to remove these samples (or groups having only 1 sample); If you do not want to remove them, you can re-group them into another group (c>2), note that KLFDAPC still preserves the within group multimodality, so the re-grouping would not affect the results.
Cheers,
Xinghu
Hi, thanks for getting back to me.
I actually filtered out all groups that have low representation (n < 5).
Could you please provide more scripts (or data if possible) on how you proceed the analysis?
You can also find the tutorials from https://xinghuq.github.io/KLFDAPC/articles/index.html
Cheers,
On Fri, 11 Jun 2021 at 06:51, Stefan @.***> wrote:
Hi, thanks for getting back to me.
I actually filtered out all groups that have low representation (n < 5).
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xinghuq/KLFDAPC/issues/2#issuecomment-859135340, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHHTUDDBECWUDJH32STEBWTTSE6XNANCNFSM46MUVBIQ .
Sure thing
Genotyping filtering:
release_name <- "20210121"
# Make the GDS file from VCF
snpgdsVCF2GDS(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.vcf.gz")),
as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")),
method="biallelic.only", compress.annotation ="ZIP_RA.max", verbose=TRUE)
# Open the GDS file
genofile <- snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")))
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
# fliter LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.6,maf = 0.01,missing.rate = 0.05, autosome.only = F)
# Get all selected snp id
snpset.id <- unlist(unname(snpset))
Output from pruning
SNP pruning based on LD:
Excluding 1,894,931 SNPs (monomorphic: TRUE, MAF: 0.01, missing rate: 0.05)
# of samples: 540
# of SNPs: 1,024,363
using 1 thread
sliding window: 500,000 basepairs, Inf SNPs
|LD| threshold: 0.6
method: composite
Chromosome I: 1.04%, 3,737/358,837
Chromosome II: 0.87%, 4,024/463,722
Chromosome III: 1.02%, 3,503/341,971
Chromosome IV: 0.92%, 4,002/435,133
Chromosome MtDNA: 4.63%, 60/1,297
Chromosome V: 0.50%, 4,544/909,818
Chromosome X: 0.70%, 2,876/408,516
22,746 markers are selected in total.
Run KLFDAPC
clean_strain_locs <- data.table::fread(glue::glue("{release_name}/Clean_Strain_Locs.csv")) %>%
dplyr::arrange(isotype) %>%
dplyr::group_by(Clean_Location) %>%
dplyr::filter(n()>5, Clean_Location!="NA") # remove groups with low representation
table(clean_strain_locs$Clean_Location)
showfile.gds(closeall=TRUE)
klfdapcfile <- KLFDAPC(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")),
y = clean_strain_locs$Clean_Location, n.pc = 20,sample.id = clean_strain_locs$isotype,
autosome.only = F,
snp.id=snpset.id)
Output from KLFDAPC
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# of samples: 459
# of SNPs: 22,746
using 1 thread
# of principal components: 32
PCA: the sum of all selected genotypes (0,1,2) = 18802850
CPU capabilities: Double-Precision SSE2
Thu Jun 10 16:08:46 2021 (internal increment: 2488)
[==================================================] 100%, completed, 1s
Thu Jun 10 16:08:47 2021 Begin (eigenvalues and eigenvectors)
Thu Jun 10 16:08:48 2021 Done.
[1] "Doing KLFDAPC"
Error in lfda::getAffinityMatrix(distance2, knn, nc) :
knn is too large, please try to reduce it.
Please make sure y is a factor
Sure thing
Genotyping filtering:
release_name <- "20210121" # Make the GDS file from VCF snpgdsVCF2GDS(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.vcf.gz")), as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")), method="biallelic.only", compress.annotation ="ZIP_RA.max", verbose=TRUE) # Open the GDS file genofile <- snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds"))) # Get sample id sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) # fliter LD thresholds for sensitivity analysis snpset <- snpgdsLDpruning(genofile, ld.threshold=0.6,maf = 0.01,missing.rate = 0.05, autosome.only = F) # Get all selected snp id snpset.id <- unlist(unname(snpset))
Output from pruning
SNP pruning based on LD: Excluding 1,894,931 SNPs (monomorphic: TRUE, MAF: 0.01, missing rate: 0.05) # of samples: 540 # of SNPs: 1,024,363 using 1 thread sliding window: 500,000 basepairs, Inf SNPs |LD| threshold: 0.6 method: composite Chromosome I: 1.04%, 3,737/358,837 Chromosome II: 0.87%, 4,024/463,722 Chromosome III: 1.02%, 3,503/341,971 Chromosome IV: 0.92%, 4,002/435,133 Chromosome MtDNA: 4.63%, 60/1,297 Chromosome V: 0.50%, 4,544/909,818 Chromosome X: 0.70%, 2,876/408,516 22,746 markers are selected in total.
Run
KLFDAPC
clean_strain_locs <- data.table::fread(glue::glue("{release_name}/Clean_Strain_Locs.csv")) %>% dplyr::arrange(isotype) %>% dplyr::group_by(Clean_Location) %>% dplyr::filter(n()>5, Clean_Location!="NA") # remove groups with low representation table(clean_strain_locs$Clean_Location) showfile.gds(closeall=TRUE) klfdapcfile <- KLFDAPC(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")), y = clean_strain_locs$Clean_Location, n.pc = 20,sample.id = clean_strain_locs$isotype, autosome.only = F, snp.id=snpset.id)
Output from
KLFDAPC
Principal Component Analysis (PCA) on genotypes: Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN) # of samples: 459 # of SNPs: 22,746 using 1 thread # of principal components: 32 PCA: the sum of all selected genotypes (0,1,2) = 18802850 CPU capabilities: Double-Precision SSE2 Thu Jun 10 16:08:46 2021 (internal increment: 2488) [==================================================] 100%, completed, 1s Thu Jun 10 16:08:47 2021 Begin (eigenvalues and eigenvectors) Thu Jun 10 16:08:48 2021 Done. [1] "Doing KLFDAPC" Error in lfda::getAffinityMatrix(distance2, knn, nc) : knn is too large, please try to reduce it.
Thanks for posting more details.
Make sure the "clean_strain_locs$Clean_Location' is a factor.
It is better to adjust the kernel function, in order to explore broad range of kernels, users can specific other kernels by using kernlab kernel option, the default is Polynom Kernel, kernel=kernlab::polydot(degree = 1, scale = 1, offset = 1), but my experience is that kernlab provides poor performance for estimating kernel matrices (you need to adjust multiple times to fit the data, and the kernlab::rbfdot(sigma = 5) sigma is a inverse one, sigma=5 equal to the 1/5 based in Sugiyama, M (2006)).
The best practice is to use the below code instead of using the rbfdot kernel.
library(KLFDAPC)
ff=SNPRelate::snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds"))) pcadata <- SNPRelate::snpgdsPCA(ff, sample.id = clean_strain_locs$isotype,autosome.only = F,snp.id=snpset.id) snpgdsClose(ff)
normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) } pcanorm=apply(pcadata$eigenvect[,1:20], 2, normalize)
kmat <- kmatrixGauss(pcanorm,sigma=5)
klfdapc=KLFDA(kmat, y=clean_strain_locs$Clean_Location, r=3, knn = 2)
Let me know if this works for you.
Cheers,
If you do not have any problem, I will close the issue.
Hello,
I am trying to get the
KLFDAPC
up and running and keep running into an error:Any clue as to what I might try to get around the knn error? I have obviously tried reducing it to 1.