zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate

Population Assignments for PCA plots #80

Open carlahurt opened 3 years ago

carlahurt commented 3 years ago

Hello, I am working on a PCA analysis of some populations for a conservation genetics project on a crayfish species. My DAPC analysis did not show significant structure between sites, so I thought I would use a PCA approach, as I understand this looks at individual differences (not group differences). I am able to follow the SNPRelate tutorial up to a point, but my VCF file does not contain population assignment information, so I cannot see the population affiliation of the data points on the plots. I see that you are importing a population file, but I was not able to see how it is formatted. I’m pasting a screenshot of my R code. Can you tell me the format of the file you are using to add population information? Also, is it possible to label individuals in the plots? I can see that I have a couple of outlier individuals, and I would like to look closer at the data to see if there is something fishy.

zhengxwen commented 3 years ago

pop_code is just a character vector. Your question is more about R programming itself than about SNPRelate. You can import pop_code from a text file with one line per individual, e.g., pop_code <- readLines("your_file"):

YRI
YRI
CEU
...
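As a minimal sketch of reading such a file in base R: the file is created here only so the example runs on its own, and the sample IDs (ind1, ind2, ind3) are hypothetical placeholders, not taken from any real data set. It assumes the labels are listed in the same order as the samples.

```r
# Hypothetical population file: one label per line, one line per individual.
# It is written to a temporary file here only so the sketch is self-contained.
pop_file <- tempfile(fileext = ".txt")
writeLines(c("YRI", "YRI", "CEU"), pop_file)

# Read the labels into a character vector and drop stray whitespace
# and empty lines (e.g. a trailing blank line at the end of the file).
pop_code <- trimws(readLines(pop_file))
pop_code <- pop_code[nzchar(pop_code)]

# Placeholder sample IDs, assumed to be in the same order as the labels.
sample.id <- c("ind1", "ind2", "ind3")

# Sanity check: there must be exactly one label per sample.
stopifnot(length(pop_code) == length(sample.id))

# Count the individuals per population.
table(pop_code)
```

The length check is worth keeping with real data, since a missing or extra line would silently shift every label after it.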

And finally merge it with sample ID and eigenvectors:

  sample.id pop         EV1         EV2
1   NA19152 YRI -0.08237338 -0.01091830
2   NA19139 YRI -0.08299277 -0.01035197
3   NA18912 YRI -0.08160415 -0.01412062
4   NA19160 YRI -0.08695621 -0.01391751
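A table like this can be built and plotted roughly as follows. This is only a sketch in base R: the pca object is a simulated stand-in for the list returned by snpgdsPCA() (which includes sample.id and eigenvect), and the sample IDs and site names are hypothetical. The text() call is one way to label individuals so that outliers can be identified by name.

```r
# Simulated stand-in for the result of snpgdsPCA(); all values are made up.
set.seed(1)
sample.id <- paste0("ind", 1:6)            # hypothetical IDs, in file order
pop_code  <- c("siteA", "siteA", "siteA",  # hypothetical labels, same order
               "siteB", "siteB", "siteB")
pca <- list(
  sample.id = rev(sample.id),              # deliberately in a different order
  eigenvect = matrix(rnorm(12), nrow = 6)  # 6 samples x 2 eigenvectors
)

# Merge sample ID, population, and the first two eigenvectors.
# match() lines the labels up with the sample order used in the PCA.
tab <- data.frame(
  sample.id = pca$sample.id,
  pop = factor(pop_code)[match(pca$sample.id, sample.id)],
  EV1 = pca$eigenvect[, 1],
  EV2 = pca$eigenvect[, 2],
  stringsAsFactors = FALSE
)

# Colour points by population and print each sample ID above its point.
plot(tab$EV2, tab$EV1, col = as.integer(tab$pop), pch = 19,
     xlab = "eigenvector 2", ylab = "eigenvector 1")
text(tab$EV2, tab$EV1, labels = tab$sample.id, pos = 3, cex = 0.7)
legend("bottomright", legend = levels(tab$pop), pch = 19,
       col = seq_len(nlevels(tab$pop)))
```

With many individuals the labels will overlap, so it may be clearer to label only a subset, for example by passing just the rows of interest to text().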