xinghuq / KLFDAPC

KLFDAPC: Kernel local Fisher discriminant analysis of principal components (KLFDAPC) for large genomic data
https://xinghuq.github.io/KLFDAPC/
Other
4 stars 1 forks source link

knn is too large, please try to reduce it #2

Closed Thatguy027 closed 3 years ago

Thatguy027 commented 3 years ago

Hello,

I am trying to get the KLFDAPC up and running and keep running into an error:

Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 459
    # of SNPs: 22,632
    using 1 thread
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 18712231
CPU capabilities: Double-Precision SSE2
Wed Jun  9 12:03:49 2021    (internal increment: 2488)
[==================================================] 100%, completed, 1s  
Wed Jun  9 12:03:50 2021    Begin (eigenvalues and eigenvectors)
Wed Jun  9 12:03:50 2021    Done.
[1] "Doing KLFDAPC"
Error in lfda::getAffinityMatrix(distance2, knn, nc) : 
  knn is too large, please try to reduce it.

Any clue as to what I might try to get around the knn error? I have obviously tried reducing it to 1.

xinghuq commented 3 years ago

“ Any clue as to what I might try to get around the knn error? I have obviously tried reducing it to 1.”

Hi, have you input your labeled data (training groups, labels yi= (1, …, c)), are there any groups that have only one sample?

Apparently, if you read the paper, knn< c. If there are some groups only having 1 sample/individual, this group is not well represented. It is suggested to remove these samples (or groups having only 1 sample); If you do not want to remove them, you can re-group them into another group (c>2), note that KLFDAPC still preserves the within group multimodality, so the re-grouping would not affect the results.

Cheers,

Xinghu

Thatguy027 commented 3 years ago

Hi, thanks for getting back to me.

I actually filtered out all groups that have low representation (n < 5).

xinghuq commented 3 years ago

Could you please provide more scripts (or data if possible) on how you proceed the analysis?

You can also find the tutorials from https://xinghuq.github.io/KLFDAPC/articles/index.html

Cheers,

On Fri, 11 Jun 2021 at 06:51, Stefan @.***> wrote:

Hi, thanks for getting back to me.

I actually filtered out all groups that have low representation (n < 5).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xinghuq/KLFDAPC/issues/2#issuecomment-859135340, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHHTUDDBECWUDJH32STEBWTTSE6XNANCNFSM46MUVBIQ .

Thatguy027 commented 3 years ago

Sure thing

Genotyping filtering:

release_name <- "20210121"

# Make the GDS file from VCF
snpgdsVCF2GDS(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.vcf.gz")), 
              as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")),
              method="biallelic.only", compress.annotation ="ZIP_RA.max",  verbose=TRUE)

# Open the GDS file
genofile <- snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")))
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

# fliter LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.6,maf = 0.01,missing.rate = 0.05, autosome.only = F)
# Get all selected snp id
snpset.id <- unlist(unname(snpset))

Output from pruning

SNP pruning based on LD:
Excluding 1,894,931 SNPs (monomorphic: TRUE, MAF: 0.01, missing rate: 0.05)
    # of samples: 540
    # of SNPs: 1,024,363
    using 1 thread
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.6
    method: composite
Chromosome I: 1.04%, 3,737/358,837
Chromosome II: 0.87%, 4,024/463,722
Chromosome III: 1.02%, 3,503/341,971
Chromosome IV: 0.92%, 4,002/435,133
Chromosome MtDNA: 4.63%, 60/1,297
Chromosome V: 0.50%, 4,544/909,818
Chromosome X: 0.70%, 2,876/408,516
22,746 markers are selected in total.

Run KLFDAPC

clean_strain_locs <- data.table::fread(glue::glue("{release_name}/Clean_Strain_Locs.csv")) %>%
  dplyr::arrange(isotype) %>%
  dplyr::group_by(Clean_Location) %>%
  dplyr::filter(n()>5, Clean_Location!="NA") # remove groups with low representation

table(clean_strain_locs$Clean_Location)

showfile.gds(closeall=TRUE)

klfdapcfile <- KLFDAPC(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")), 
                       y = clean_strain_locs$Clean_Location, n.pc = 20,sample.id = clean_strain_locs$isotype,
                       autosome.only = F,
                       snp.id=snpset.id)

Output from KLFDAPC

Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 459
    # of SNPs: 22,746
    using 1 thread
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 18802850
CPU capabilities: Double-Precision SSE2
Thu Jun 10 16:08:46 2021    (internal increment: 2488)
[==================================================] 100%, completed, 1s  
Thu Jun 10 16:08:47 2021    Begin (eigenvalues and eigenvectors)
Thu Jun 10 16:08:48 2021    Done.
[1] "Doing KLFDAPC"
Error in lfda::getAffinityMatrix(distance2, knn, nc) : 
  knn is too large, please try to reduce it.
xinghuq commented 3 years ago

Please make sure y is a factor

Sure thing

Genotyping filtering:

release_name <- "20210121"

# Make the GDS file from VCF
snpgdsVCF2GDS(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.vcf.gz")), 
              as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")),
              method="biallelic.only", compress.annotation ="ZIP_RA.max",  verbose=TRUE)

# Open the GDS file
genofile <- snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")))
# Get sample id
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))

# fliter LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.6,maf = 0.01,missing.rate = 0.05, autosome.only = F)
# Get all selected snp id
snpset.id <- unlist(unname(snpset))

Output from pruning

SNP pruning based on LD:
Excluding 1,894,931 SNPs (monomorphic: TRUE, MAF: 0.01, missing rate: 0.05)
    # of samples: 540
    # of SNPs: 1,024,363
    using 1 thread
    sliding window: 500,000 basepairs, Inf SNPs
    |LD| threshold: 0.6
    method: composite
Chromosome I: 1.04%, 3,737/358,837
Chromosome II: 0.87%, 4,024/463,722
Chromosome III: 1.02%, 3,503/341,971
Chromosome IV: 0.92%, 4,002/435,133
Chromosome MtDNA: 4.63%, 60/1,297
Chromosome V: 0.50%, 4,544/909,818
Chromosome X: 0.70%, 2,876/408,516
22,746 markers are selected in total.

Run KLFDAPC

clean_strain_locs <- data.table::fread(glue::glue("{release_name}/Clean_Strain_Locs.csv")) %>%
  dplyr::arrange(isotype) %>%
  dplyr::group_by(Clean_Location) %>%
  dplyr::filter(n()>5, Clean_Location!="NA") # remove groups with low representation

table(clean_strain_locs$Clean_Location)

showfile.gds(closeall=TRUE)

klfdapcfile <- KLFDAPC(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds")), 
                       y = clean_strain_locs$Clean_Location, n.pc = 20,sample.id = clean_strain_locs$isotype,
                       autosome.only = F,
                       snp.id=snpset.id)

Output from KLFDAPC

Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
    # of samples: 459
    # of SNPs: 22,746
    using 1 thread
    # of principal components: 32
PCA:    the sum of all selected genotypes (0,1,2) = 18802850
CPU capabilities: Double-Precision SSE2
Thu Jun 10 16:08:46 2021    (internal increment: 2488)
[==================================================] 100%, completed, 1s  
Thu Jun 10 16:08:47 2021    Begin (eigenvalues and eigenvectors)
Thu Jun 10 16:08:48 2021    Done.
[1] "Doing KLFDAPC"
Error in lfda::getAffinityMatrix(distance2, knn, nc) : 
  knn is too large, please try to reduce it.

Thanks for posting more details.

  1. Make sure the "clean_strain_locs$Clean_Location' is a factor.

  2. It is better to adjust the kernel function, in order to explore broad range of kernels, users can specific other kernels by using kernlab kernel option, the default is Polynom Kernel, kernel=kernlab::polydot(degree = 1, scale = 1, offset = 1), but my experience is that kernlab provides poor performance for estimating kernel matrices (you need to adjust multiple times to fit the data, and the kernlab::rbfdot(sigma = 5) sigma is a inverse one, sigma=5 equal to the 1/5 based in Sugiyama, M (2006)).

The best practice is to use the below code instead of using the rbfdot kernel.

library(KLFDAPC)

doing PCA first

ff=SNPRelate::snpgdsOpen(as.character(glue::glue("{release_name}/WI.{release_name}.hard-filter.isotype.gds"))) pcadata <- SNPRelate::snpgdsPCA(ff, sample.id = clean_strain_locs$isotype,autosome.only = F,snp.id=snpset.id) snpgdsClose(ff)

normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) } pcanorm=apply(pcadata$eigenvect[,1:20], 2, normalize)

the Gaussian kernel

kmat <- kmatrixGauss(pcanorm,sigma=5)

klfdapc=KLFDA(kmat, y=clean_strain_locs$Clean_Location, r=3, knn = 2)

klfdapc$Z is your reduced features.

Let me know if this works for you.

Cheers,

xinghuq commented 3 years ago

If you do not have any problem, I will close the issue.