--- title: "Population genetics tutorial" subtitle: "Where do you come from?" author: Forensic Genetics and Legal Medicine date: 19/05/2020 output: #html_notebook rmdformats::readthedown: self_contained: false thumbnails: false lightbox: true gallery: false #highlight: tango fig_caption: yes use_bookdown: true highlight: pygments --- *** # Getting started -- *R and RStudio download* Before starting to work on some real data, we need to download [R statistical software](https://www.r-project.org) and its integrated development environment [RStudio](https://rstudio.com), which help us in programming more easily in R. ## Mac Users ### To Install R * Open an internet browser and go to www.r-project.org. * Click the "download R" link in the middle of the page under "Getting Started." * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the "Download R for (Mac) OS X" link at the top of the page. * Click on the file containing the latest version of R under "Files." * Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio. ### To Install RStudio * Go to www.rstudio.com and click on the "Download RStudio" button. * Click on "Download RStudio Desktop." * Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder. ## Windows Users ### To Install R: * Open an internet browser and go to www.r-project.org. * Click the "download R" link in the middle of the page under "Getting Started." * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the "Download R for Windows" link at the top of the page. * Click on the "install R for the first time" link at the top of the page. * Click "Download R for Windows" and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions. Now that R is installed, you need to download and install RStudio. ### To Install RStudio * Go to www.rstudio.com and click on the "Download RStudio" button. * Click on "Download RStudio Desktop." * Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions. *** # Worldwide populations and their DNA -- *genetic data download* In this tutorial we will analyse the genetic data from the [1000 Genomes Project](https://www.internationalgenome.org/home). The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data (Figure \@ref(fig:1000genome) and the map at [this link](https://www.internationalgenome.org/data-portal/population)). ```{r 1000genome, echo=FALSE,fig.cap="\\label{fig:1000genome}1000 Genome Project populations: African in yellow, American in red, East Asian in green, European in blue and South Asian in purple.",fig.show='hold',fig.align='center', fig.fullwidth = TRUE} knitr::include_graphics("1000g_map.png") ``` The 1000 Genomes Project took advantage of developments in sequencing technology, which sharply reduced the cost of sequencing. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases. Data from 1000 Genomes Project can bee freely available from this [webpage](https://www.internationalgenome.org/data#download). Unfortunately, we cannot work here the entire set of variants due to the limited computational resources of our laptops. In fact, in the phase3 of the 1000 Genomes Project more that 2000 individuals have been sequenced, generating a dataset of size around 50Gb. For this reason, I created for you a dataset comprising 786 individuals from eight worldwide populations and around 30,000 genetic variants ("**subsampled.bed**", "**subsampled.bim**", "**subsampled.fam**"). Moreover, you also need a file containing the population label for each individual ("**kgp3.pheno.tsv**"). Create a directory called "popgen_tutorial" and move these two files into the directory. *** # How DNA is coded in a text file? -- *genetic data format* ## File formats Let's have a look at the file. While *.fam* and *.bim* are text file, the *.bed* is a binary file and cannot be opened with a text editor. These are *Plink* formats currently used in population genetics because they can effectively store a huge amount of genetic data. *.fam* file contains the sample information, *.bim* contains the genetic variant information (see the table below) and *.bed* codes the genotype calls. More information about the file formats in *Plink* can be found [here](https://www.cog-genomics.org/plink/1.9/formats). ```{r echo=FALSE} fam = read.table("kgp.pruned_500_50_0.2.BEB_CHB_PEL_TSI_YRI_FIN_GBR_IBS.fam", col.names=c("FID", "IID", "MAT", "PAT", "SEX", "PHENO")) bim = read.table("kgp.pruned_500_50_0.2.BEB_CHB_PEL_TSI_YRI_FIN_GBR_IBS.bim", col.names=c("CHR", "RS", "CM", "POS", "A1", "A2")) knitr::kable( list( head(fam), head(bim) ), caption = '*.fam* (on the left) and *.bim* file (on the right).', booktabs = TRUE, valign = 't' ) ``` ## Set the working directory Before starting to work in R environment, we need to specify the working directory, which will be used by R to read and save files. In order to set our wirking directory we need to find the path leading to the directory called "popgen_tutorial". ```{r} ## For Mac users you path will be something like this: #setwd("/Users/I/Desktop/Uni/forensic_genetics/popgen_tutorial") ## Fow Windows users the pathe will be something like this: # setwd("c:/Documents/Desktop/Uni/forensic_genetics/popgen_tutorial") # My path is: setwd("/Users/bioserenina/Google Drive File Stream/Il mio Drive/lezioni_carlo/lezioni/esercitazione/popgen_tutorial") ``` Given that the genotype information are store in a binary file (*.bed*) which cannot be opened easily, we will make use of an R library specifically developed to work with genetic data: *SNPRelate*. You can find [here](http://corearray.sourceforge.net/tutorials/SNPRelate/) a tutorial on its functions. As a first step, you need to install the library, one of its dependency (the package "*gdsfmt*") as well as other packages we will use in this tutorial using the command below. ```{r eval=FALSE} # installing some packages source("http://bioconductor.org/biocLite.R") biocLite("gdsfmt") biocLite("SNPRelate") install.packages(c("ggplot2","RColorBrewer", "knitr", "kableExtra", "dplyr")) ``` Then, once you have installed a package, you need to "call" the library. ```{r message=FALSE, warning=FALSE} library(SNPRelate) library(gdsfmt) ``` As reported in the SNPRelate tutorial, this library and the gdsfmt package can store the huge amount of genetic information into a different and more efficient file firmat: *.gds*. To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for storing annotation data and SNP genotypes. In this format each byte encodes up to four SNP genotypes thereby reducing file size and access time. ```{r} # create the "test.gds" file starting from our .bed, .bim, and .fam. bed.fn <- "subsampled.bed" fam.fn <- "subsampled.fam" bim.fn <- "subsampled.bim" snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds") snpgdsSummary("test.gds") ``` Once we have created the "test.gds" file, we need to open it and extract the information of interest, which are: * samples information (*sample.id*) * genetic variant information (*snp.id*) * genotypes (*g*) ```{r} genofile <- snpgdsOpen("test.gds") sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) snp.id <- read.gdsn(index.gdsn(genofile, "snp.id")) g <- read.gdsn(index.gdsn(genofile, "genotype")) ``` # Spring cleaning -- *how to clean up your data*
**A question for you**: how is it possible that 30 randomly selected SNPs can relatively accurately infer the ancestry of an individual coming from Africa?
## Where does **NA18532** come from? Now, we will compare the classification performances of the random and the Nano set on the individual *NA18532*. ```{r echo=FALSE} unknown_sample = "NA18532" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "NA18532" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample NA18532 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample NA18532 is a Han Chinese individual from Beijing** and while the random sets cannot infer his ancestry with high confidence (see the reduced LR with the random SNPs), the Global AIMs Nano set can accurately find his origins. ## Where does **HG03585** come from? Now, we will compare the classification performances of tha random and the Nano set on the individual *HG03585*. ```{r echo=FALSE} unknown_sample = "HG03585" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "HG03585" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample HG03585 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample HG03585 comes from Bangladesh**. The Central-South Eurasia has been crossed for millennia by different populations and its populations show intermediate genetic frequencies between East Asia, Europe and Middle East. In the slide 11 of the pdf "Beyond identification (ancestry inference)" you can see that with K 4, 5 and 6 the Central/South Asia has the same colors of Europe (light blue) and East Asia (pink). However, the authors of the Global AIMs Nano set made an effort to distinguish this geographical area. ## Where does **NA20502** come from? Now, we will compare the classification performances of tha random and the Nano set on the individual *NA20502*. ```{r echo=FALSE} unknown_sample = "NA20502" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "NA20502" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample NA20502 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample NA20502 comes from Italy and particularly from Tuscany**. None of the marker sets is able to correctly infer his ancestry. A random set of markers could never solve the difficult problem of distinguishing different countries from the same continents (in this case, Europe) and neither the Global AIMs Nano set can do it, because this set was developed with the aim of distinguishing the different continents. In fact, this set has correctly understood that this sample comes from Europe, since all the European populations have around the same probabilities of being the population of origin of NA20502. However, none of them prevails.**A question for you**: where does the individual HG01620 come from?
*** # All populations at a glance -- *principal component analysis* Can we use the entire set of markers to understand where an individual comes from? The simple answer is yes, however, it is unthinkable to use in the current forensic pratice whole genome sequences or, like in our case, around 30,000 genetic markers for ancestry inference. ```{r} library(ggplot2) library(RColorBrewer) pca <- snpgdsPCA(genofile, num.thread=2) pc.percent <- pca$varprop*100 head(round(pc.percent, 2)) tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) # add pop labels target = tab$sample.id tab$POP = pop_labels[match(target,pop_labels$SAMPLE),"POP"] tab$SAMPLE = pop_labels[match(target,pop_labels$SAMPLE),"SAMPLE"] # add the inferred ancestry tab$POP = as.character(tab$POP) tab$POP[which(tab$sample.id=="NA18520")] = 'inf_YRI' tab$POP[which(tab$sample.id=="NA18532")] = 'inf_CHB' tab$POP[which(tab$sample.id=="HG03585")] = 'inf_BEB' tab$POP[which(tab$sample.id=="NA20502")] = 'inf_TSI' # create color palette colors = c(brewer.pal(8, "Set1"), rep("black",5)) names(colors) = c("BEB","CHB","FIN","GBR","IBS","PEL","TSI","YRI","inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2") # give different shapes to the unknown samples tab$pch=21 tab$pch[which(tab$sample.id=="NA18520")] = 22 tab$pch[which(tab$sample.id=="NA18532")] = 23 tab$pch[which(tab$sample.id=="HG03585")] = 24 tab$pch[which(tab$sample.id=="NA20502")] = 25 tab$pch[which(tab$sample.id=="HG01620")] = "" tab$POP = as.factor(tab$POP) # divide in tab_pop and tab_unk tab_pop = tab[!(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),] tab_pop = droplevels(tab_pop) tab_unk = tab[which(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),] tab_unk = droplevels(tab_unk) ggplot(tab_pop, aes(x=EV1, y=EV2, color=POP)) + geom_point() + theme_linedraw() + xlab(round(pc.percent[1], digits=2)) + ylab(round(pc.percent[2], digits=2)) + scale_color_manual(values = colors) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")], y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], fill = "darkgrey", color="black", shape=22, size=3) + annotate("text", label ="inf_BEB", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")] + 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], fill = "darkgrey", color="black", shape=23, size=3) + annotate("text", label= "inf_YRI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")] - 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], fill = "darkgrey", color="black", shape=24, size=3) + annotate("text", label= "inf_CHB", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], fill = "darkgrey", color="black", shape=25, size=3) + annotate("text", label= "inf_TSI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], color="black", size=3) + theme(legend.position="bottom") + guides(colour = guide_legend(override.aes = list(size=6))) + ggtitle("Principal component analysis") ``` ```{r} snpgdsClose(genofile) ```