--- title: "Population genetics tutorial" subtitle: "Where do you come from?" author: Forensic Genetics and Legal Medicine date: 19/05/2020 output: #html_notebook rmdformats::readthedown: self_contained: false thumbnails: false lightbox: true gallery: false #highlight: tango fig_caption: yes use_bookdown: true highlight: pygments --- *** # Getting started -- *R and RStudio download* Before starting to work on some real data, we need to download [R statistical software](https://www.r-project.org) and its integrated development environment [RStudio](https://rstudio.com), which help us in programming more easily in R. ## Mac Users ### To Install R * Open an internet browser and go to www.r-project.org. * Click the "download R" link in the middle of the page under "Getting Started." * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the "Download R for (Mac) OS X" link at the top of the page. * Click on the file containing the latest version of R under "Files." * Save the .pkg file, double-click it to open, and follow the installation instructions. Now that R is installed, you need to download and install RStudio. ### To Install RStudio * Go to www.rstudio.com and click on the "Download RStudio" button. * Click on "Download RStudio Desktop." * Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder. ## Windows Users ### To Install R: * Open an internet browser and go to www.r-project.org. * Click the "download R" link in the middle of the page under "Getting Started." * Select a CRAN location (a mirror site) and click the corresponding link. * Click on the "Download R for Windows" link at the top of the page. * Click on the "install R for the first time" link at the top of the page. * Click "Download R for Windows" and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions. Now that R is installed, you need to download and install RStudio. ### To Install RStudio * Go to www.rstudio.com and click on the "Download RStudio" button. * Click on "Download RStudio Desktop." * Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions. *** # Worldwide populations and their DNA -- *genetic data download* In this tutorial we will analyse the genetic data from the [1000 Genomes Project](https://www.internationalgenome.org/home). The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data (Figure \@ref(fig:1000genome) and the map at [this link](https://www.internationalgenome.org/data-portal/population)). ```{r 1000genome, echo=FALSE,fig.cap="\\label{fig:1000genome}1000 Genome Project populations: African in yellow, American in red, East Asian in green, European in blue and South Asian in purple.",fig.show='hold',fig.align='center', fig.fullwidth = TRUE} knitr::include_graphics("1000g_map.png") ``` The 1000 Genomes Project took advantage of developments in sequencing technology, which sharply reduced the cost of sequencing. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases. Data from 1000 Genomes Project can bee freely available from this [webpage](https://www.internationalgenome.org/data#download). Unfortunately, we cannot work here the entire set of variants due to the limited computational resources of our laptops. In fact, in the phase3 of the 1000 Genomes Project more that 2000 individuals have been sequenced, generating a dataset of size around 50Gb. For this reason, I created for you a dataset comprising 786 individuals from eight worldwide populations and around 30,000 genetic variants ("**subsampled.bed**", "**subsampled.bim**", "**subsampled.fam**"). Moreover, you also need a file containing the population label for each individual ("**kgp3.pheno.tsv**"). Create a directory called "popgen_tutorial" and move these two files into the directory. *** # How DNA is coded in a text file? -- *genetic data format* ## File formats Let's have a look at the file. While *.fam* and *.bim* are text file, the *.bed* is a binary file and cannot be opened with a text editor. These are *Plink* formats currently used in population genetics because they can effectively store a huge amount of genetic data. *.fam* file contains the sample information, *.bim* contains the genetic variant information (see the table below) and *.bed* codes the genotype calls. More information about the file formats in *Plink* can be found [here](https://www.cog-genomics.org/plink/1.9/formats). ```{r echo=FALSE} fam = read.table("kgp.pruned_500_50_0.2.BEB_CHB_PEL_TSI_YRI_FIN_GBR_IBS.fam", col.names=c("FID", "IID", "MAT", "PAT", "SEX", "PHENO")) bim = read.table("kgp.pruned_500_50_0.2.BEB_CHB_PEL_TSI_YRI_FIN_GBR_IBS.bim", col.names=c("CHR", "RS", "CM", "POS", "A1", "A2")) knitr::kable( list( head(fam), head(bim) ), caption = '*.fam* (on the left) and *.bim* file (on the right).', booktabs = TRUE, valign = 't' ) ``` ## Set the working directory Before starting to work in R environment, we need to specify the working directory, which will be used by R to read and save files. In order to set our wirking directory we need to find the path leading to the directory called "popgen_tutorial". ```{r} ## For Mac users you path will be something like this: #setwd("/Users/I/Desktop/Uni/forensic_genetics/popgen_tutorial") ## Fow Windows users the pathe will be something like this: # setwd("c:/Documents/Desktop/Uni/forensic_genetics/popgen_tutorial") # My path is: setwd("/Users/bioserenina/Google Drive File Stream/Il mio Drive/lezioni_carlo/lezioni/esercitazione/popgen_tutorial") ``` Given that the genotype information are store in a binary file (*.bed*) which cannot be opened easily, we will make use of an R library specifically developed to work with genetic data: *SNPRelate*. You can find [here](http://corearray.sourceforge.net/tutorials/SNPRelate/) a tutorial on its functions. As a first step, you need to install the library, one of its dependency (the package "*gdsfmt*") as well as other packages we will use in this tutorial using the command below. ```{r eval=FALSE} # installing some packages source("http://bioconductor.org/biocLite.R") biocLite("gdsfmt") biocLite("SNPRelate") install.packages(c("ggplot2","RColorBrewer", "knitr", "kableExtra", "dplyr")) ``` Then, once you have installed a package, you need to "call" the library. ```{r message=FALSE, warning=FALSE} library(SNPRelate) library(gdsfmt) ``` As reported in the SNPRelate tutorial, this library and the gdsfmt package can store the huge amount of genetic information into a different and more efficient file firmat: *.gds*. To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for storing annotation data and SNP genotypes. In this format each byte encodes up to four SNP genotypes thereby reducing file size and access time. ```{r} # create the "test.gds" file starting from our .bed, .bim, and .fam. bed.fn <- "subsampled.bed" fam.fn <- "subsampled.fam" bim.fn <- "subsampled.bim" snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds") snpgdsSummary("test.gds") ``` Once we have created the "test.gds" file, we need to open it and extract the information of interest, which are: * samples information (*sample.id*) * genetic variant information (*snp.id*) * genotypes (*g*) ```{r} genofile <- snpgdsOpen("test.gds") sample.id <- read.gdsn(index.gdsn(genofile, "sample.id")) snp.id <- read.gdsn(index.gdsn(genofile, "snp.id")) g <- read.gdsn(index.gdsn(genofile, "genotype")) ``` # Spring cleaning -- *how to clean up your data*
```{r, out.width = "400px",echo=F} knitr::include_graphics("2013.04.04_Cleanup_Data.jpg") ```
In order to make the computation steps fasible for our laptops I have already cleaned the data. Moreover, I have also sampled one SNP out of eight SNPs, in order to reduce the size of our files. In particular the cleaning steps you always need to do are: * check for missingness: removal of SNPs with missing call rates exceeding the provided value (0.05) and samples with missing call rates higher than 0.05. * MAF (minor allele frequency): removal of monomorphic or rare SNPs. SNPs with a very low allele frequency in your dataset are of course not informative for distinguishing different populations because they are shared between a few individuals or are private of a single individual. * LD (linkage-disequilibrium): according to the specific analysis we want to do, we may decide to prune (in other words, filter) the genetic variants for linkage disequilibrium. In particular, analyses requiring the independence of genetic markers (such as the Naive Bayes classification or STRUCTURE) will also require the LD pruning. * check for relatedness among samples: the frequency computation cannot be influenced by the presence of related samples. * autosomes only: in some analyses, like ancestry inference, we may prefer to work using only autosomes given the fact that they are equally shared between males and females (in other words, they have the same effective population size). ## Creation of the genotype matrix Once we have finished the cleaning steps (already perfomed), we set up the genotypes matrix. Using the R commands *dim(geno)* we can see matrix size (rows and columns number), while in order to see part of the matrix we can use the command *geno[1:5,1:5]*, which shows the first five rows and columns of the matrix. ```{r} geno = as.data.frame(g) colnames(geno) = snp.id rownames(geno) = sample.id ``` ## Missingness check We know that the genotypes have already been cleaned by me, but the first rule in bioinformatics is "never trust the guy giving you some data". ```{r} paste0("how many missing cells are there in our dataset? ", sum(geno==3)) ``` At this point, we need another information: the ancestry of the individuals in *sample.id* variable. You can find this information in the file "kgp3.pheno.tsv" and, with the commands below, you can add this information to the genotype matrix. ## Adding population labels ```{r} # read the population labels file pop_labels = read.table("kgp3.pheno.tsv", hea=T) # add the population labels to geno target = rownames(geno) #head(pop_labels[match(target,pop_labels$SAMPLE),]) geno$POP = pop_labels[match(target,pop_labels$SAMPLE),"POP"] geno$SAMPLE = pop_labels[match(target,pop_labels$SAMPLE),"SAMPLE"] # check if the order of samples is correct: if the answer is TRUE can be happy identical(rownames(geno), as.character(geno$SAMPLE)) geno = geno[,c("POP", snp.id)] #head(geno)[1:5] ``` How many populations do we have? As you can see with the R command *table(geno$POP)*, there are **8 worldwide populations** and **5 individuals of unknown ancestry**. | label | population description | size | |-------|---------------------------------|------| | BEB | Bengali from Bangladesh | 86 | | CHB | Han Chinese in Beijing, China | 102 | | FIN | Finnish in Finland | 98 | | GBR | British in England and Scotland | 91 | | IBS | Iberian Population in Spain | 106 | | TSI | Tuscans in Italy | 107 | | PEL | Peruvians from Lima, Peru | 84 | | YRI | Yoruba in Ibadan, Nigeria | 107 | ```{r} # Divide the dataset in known populations and unknown individuals geno_unk = geno[which(geno$POP %in% c("UNK1", "UNK2", "UNK3", "UNK4", "UNK5")),] geno_pop = geno[!(geno$POP %in% c("UNK1", "UNK2", "UNK3", "UNK4", "UNK5")),] ``` In the end, our final dataset will be something like this, where each row is a sample, the first column is the population the sample comes from and the other columns are the SNPs: ```{r echo=FALSE} knitr::kable( head(geno_pop)[1:8], caption = 'First 5 samples and 7 SNPs of the genotype matrix *geno_pop*.', booktabs = TRUE, valign = 't' ) ``` *** # Keep calm and do some math -- *allelic and genotypic frequencies* ## Allele frequencies At this point, the genotype matrix is ready and we can start the hard work: allele frequencies computation. The frequency of an allele is defined as the number of the copies of the allelel in the population divided by the total number of gene copies in the population. In a diploid individual (in which all individuals carry two copies of each chromosome) with *N* individual, there are *2N* gene copies. So the frequencies of alleles A1 and A2 are: $f_{A1}$ = $\frac{N_{A1}}{2N}$ and $f_{A2}$ = $\frac{N_{A2}}{2N}$ Let's have a look at our genotypes, by taking the first five rows and five columns. ```{r} geno_pop[1:5,1:5] ``` There are four possible values stored in *geno_pop*: 0, 1, 2 and 3. For bi-allelic SNP sites, "0" indicates two A2 alleles, "1" indicates one A1 allele and one A2 allele, "2" indicates two A1 alleles, and "3" is a missing genotype. Let's see the second SNP in the geno_pop matrix: *[rs115209712](http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=1:85831-86831;v=rs115209712;vdb=variation;vf=691267831)*. ```{r echo=FALSE} tab = geno_pop[25:30,c(1,3)] tab$rs115209712[tab$rs115209712==0] = 'AA' tab$rs115209712[tab$rs115209712==1] = "AG" tab$rs115209712[tab$rs115209712==2] = "GG" knitr::kable( list( geno_pop[25:30,c(1,3)], tab ), caption = 'SNP rs115209712 coded in the genotype matrix with the numbers 0, 1 and 2 (on the left) and the correspondent nucleotides (on the rigth).', booktabs = TRUE, valign = 't' ) ``` Given that the *geno_pop* dataset harbours the count of one allele for each genotype (the cell), in order to compute the allele frequency, we simply need to sum all values in each column (the SNP) and divide it by *2N*. We can compute the frequency of the allele A across all samples or for each population separately. ```{r} # A1 frequency of the SNP rs115209712 in the entire dataset sum(geno_pop$"rs115209712")/(2*dim(geno_pop)[1]) # A2 frequency of the SNP rs115209712 in the entire dataset 1 - sum(geno_pop$"rs115209712")/(2*dim(geno_pop)[1]) # Another way to compute the A2 frequency (sum(geno_pop$"rs115209712" == 0)*2 + sum(geno_pop$"rs115209712" == 1))/(2*dim(geno_pop)[1]) ``` ```{r} # A1 frequency of the SNP rs115209712 in the entire dataset af_A1 = data.frame(matrix(NA,nrow=1, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= "rs115209712" colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) snp ="rs115209712" af_A1[snp, pop] = sum(pop.df[, snp])/n.alleles } # A2 frequency of the SNP rs115209712 in the entire dataset af_A2 = 1 - af_A1 ``` The frequencies of the alleles A1 and A2 of SNP rs115209712 in each populations are: ```{r echo=FALSE} knitr::kable( af_A1, caption = 'Frequency of the minor allele G.', booktabs = TRUE, valign = 't' ) ``` ```{r echo=FALSE} knitr::kable( af_A2, caption = 'Frequency of the major allele A.', booktabs = TRUE, valign = 't' ) ``` If you compare the allele frequencies found in our dataset with the ones reported in [Ensembl](http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=1:85831-86831;v=rs115209712;vdb=variation;vf=691267831), you can see that they match. ## Genotype frequencies The netx step will be the computation of the genotype frequencies according to Hardy-Weinberg formulas. The genotype frequencies under Hardy-Weinberg Equilibrium are: | Genotype | AA | Aa | aa | |-----------|-------------|---------------|-------------| | Frequency | $f_{A}^{2}$ | $2f_{A}f_{a}$ | $f_{a}^{2}$ | ```{r} gf_rs115209712 = data.frame(matrix(NA,nrow=3, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(gf_rs115209712)= c("GG", "GA", "AA") colnames(gf_rs115209712) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ gf_rs115209712["GG",pop] = af_A1[pop]*af_A1[pop] gf_rs115209712["GA",pop] = 2*af_A1[pop]*af_A2[pop] gf_rs115209712["AA",pop] = af_A2[pop]*af_A2[pop] } ``` ```{r echo=FALSE, message=FALSE, warning=FALSE} library(dplyr) library(kableExtra) gf_rs115209712$genotype = rownames(gf_rs115209712) gf_rs115209712 = gf_rs115209712[c('genotype', "GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")] gf_rs115209712[c("GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")] = round(gf_rs115209712[c("GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")], digits = 4) gf_rs115209712 %>% mutate( GBR = cell_spec(GBR, "html", color = ifelse(GBR > 0.5, "red", "blue")), FIN = cell_spec(FIN, "html", color = ifelse(FIN > 0.5, "red", "blue")), IBS = cell_spec(IBS, "html", color = ifelse(IBS > 0.5, "red", "blue")), PEL = cell_spec(PEL, "html", color = ifelse(PEL > 0.5, "red", "blue")), BEB = cell_spec(BEB, "html", color = ifelse(BEB > 0.5, "red", "blue")), YRI = cell_spec(YRI, "html", color = ifelse(YRI > 0.5, "red", "blue")), CHB = cell_spec(CHB, "html", color = ifelse(CHB > 0.5, "red", "blue")), TSI = cell_spec(TSI, "html", color = ifelse(TSI > 0.5, "red", "blue")) ) %>% kable(format = "html", escape = F, row.names = F) %>% kable_styling("striped", full_width = T) ``` As you can see the snp rs115209712 is not variable across different worldwide populations: in fact the homozygous genotype AA has an allele frequency higher that 0.71 everywhere, while the other two genotypes show little or no variation. From this rapid "visual inspection", we can conclude that this SNP won't surely be informative for ancestry inference. Let's try with another SNP, this time known in the literature for being informative for ancestry inference. The SNP is [rs2814778](http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=1:159204393-159205393;v=rs2814778;vdb=variation;vf=1853499) and has been reported byt [this paper](https://www.fsigenetics.com/article/S1872-4973(16)30015-1/fulltext) as a genetic marker informative in distinguishing European individuals. Let's see if it works. ```{r} # allele frequencies af_A1 = data.frame(matrix(NA,nrow=1, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= "rs2814778" colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) snp ="rs2814778" af_A1[snp, pop] = sum(pop.df[, snp])/n.alleles } # A2 frequency of the SNP rs2814778 in the entire dataset af_A2 = 1 - af_A1 # genotype frequencies gf_rs2814778 = data.frame(matrix(NA,nrow=3, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(gf_rs2814778)= c("CC", "CT", "TT") colnames(gf_rs2814778) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ gf_rs2814778["CC",pop] = af_A1[pop]*af_A1[pop] gf_rs2814778["CT",pop] = 2*af_A1[pop]*af_A2[pop] gf_rs2814778["TT",pop] = af_A2[pop]*af_A2[pop] } ``` ```{r echo=FALSE, message=FALSE, warning=FALSE} library(dplyr) library(kableExtra) gf_rs2814778$genotype = rownames(gf_rs2814778) gf_rs2814778 = gf_rs2814778[c('genotype', "GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")] gf_rs2814778[c("GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")] = round(gf_rs2814778[c("GBR", "FIN", "IBS","TSI", "BEB", "CHB", "PEL", "YRI")], digits = 4) gf_rs2814778 %>% mutate( GBR = cell_spec(GBR, "html", color = ifelse(GBR > 0.5, "red", "blue")), FIN = cell_spec(FIN, "html", color = ifelse(FIN > 0.5, "red", "blue")), IBS = cell_spec(IBS, "html", color = ifelse(IBS > 0.5, "red", "blue")), PEL = cell_spec(PEL, "html", color = ifelse(PEL > 0.5, "red", "blue")), BEB = cell_spec(BEB, "html", color = ifelse(BEB > 0.5, "red", "blue")), YRI = cell_spec(YRI, "html", color = ifelse(YRI > 0.5, "red", "blue")), CHB = cell_spec(CHB, "html", color = ifelse(CHB > 0.5, "red", "blue")), TSI = cell_spec(TSI, "html", color = ifelse(TSI > 0.5, "red", "blue")) ) %>% kable(format = "html", escape = F, row.names = F) %>% kable_styling("striped", full_width = T) ``` AS you can see from the table above and its [Ensembl webpage](http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=1:159204393-159205393;v=rs2814778;vdb=variation;vf=1853499), this SNP is informative in distinguishing Yoruba (African from Nigeria) individuals, because only in the Yoruba population the allele *C* and the genotype *CC* have almost reached the fixation. *** # Where do you come from? -- *the Naive Bayes classification* As we have seen in the *Beyond identification (ancestry inference)* lecture, the steps we need to do in order to infer the ancestry of a sample are: * select and appropriate AIMs (Ancestral Informative Markers) panel, * compute the allele frequencies in the reference populations, * compute the genotypic frequencies in the reference populations of each unknown individual's genotype, * compute a RMP (Random Match Probability) multiplying the genotypic frequencies, * compute the Likelihood Ratio for the highest probability. In this tutorial, we cannot compute the informativeness for each marker due to the limited computational resources of our laptop. However, we will compare the classification performances of 30 randomly selected markers with the *Global AIMs Nano set*, a 31-plex assay of ancestry-informative SNPs developed in order to differentiate the five continents ([M.de la Puente et al., 2016, Forensic Science International: Genetics](https://www.sciencedirect.com/science/article/abs/pii/S1872497316300151)). The complete set of SNPs is reported in Fig \@ref(fig:nano). ```{r nano, echo=FALSE,fig.cap="\\label{fig:nano}The Global AIMs Nano set.",fig.show='hold',fig.align='center', fig.fullwidth = TRUE} knitr::include_graphics("table1_delapuente.png") ``` Now we will try to infer the ancestry of some of the unknown individuals. Let's start with "NA18520". ## Where does **NA18520** come from? ### What happens if we use 30 randomly choosen markers? Here we will try to infer the ancestry of the individual "NA18520". ```{r} unknown_sample = "NA18520" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP = sort(apply(geno_prob, 2, prod), decreasing = T) LR = RMP[1]/RMP[2] ``` ```{r echo=FALSE} RMP = t(as.data.frame(RMP)) RMP = format(RMP, digits = 3) knitr::kable( RMP, caption = 'RMP for the sample NA18520 using 30 random markers.', booktabs = TRUE, valign = 't' ) ``` ```{r echo=FALSE} paste0("The ", colnames(RMP)[1], " ancestry is ", LR, " times more probable than ", colnames(RMP)[2], " ancestry.") ``` ### What happens when we use the Global AIMs Nano set? ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "NA18520" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP = sort(apply(geno_prob, 2, prod), decreasing = T) LR = RMP[1]/RMP[2] ``` ```{r echo=FALSE} RMP = t(as.data.frame(RMP)) RMP = format(RMP, digits = 3) knitr::kable( RMP, caption = 'RMP for the sample NA18520 using the Global AIMs Nano set.', booktabs = TRUE, valign = 't' ) ``` ```{r echo= FALSE} paste0("The ", colnames(RMP)[1], " ancestry is ", LR, " times more probable than ", colnames(RMP)[2], " ancestry.") ``` As correctly inferred by both the random SNPs set and the Global AIMs Nano set, **NA18520 comes from the Yoruba population (Nigeria)**. It is pretty impressive that even a set of random markers can (with some uncertainty) infer his ancestry, but if we think about our history as human beings, we can find an explanation.

**A question for you**: how is it possible that 30 randomly selected SNPs can relatively accurately infer the ancestry of an individual coming from Africa?

## Where does **NA18532** come from? Now, we will compare the classification performances of the random and the Nano set on the individual *NA18532*. ```{r echo=FALSE} unknown_sample = "NA18532" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "NA18532" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample NA18532 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample NA18532 is a Han Chinese individual from Beijing** and while the random sets cannot infer his ancestry with high confidence (see the reduced LR with the random SNPs), the Global AIMs Nano set can accurately find his origins. ## Where does **HG03585** come from? Now, we will compare the classification performances of tha random and the Nano set on the individual *HG03585*. ```{r echo=FALSE} unknown_sample = "HG03585" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "HG03585" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample HG03585 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample HG03585 comes from Bangladesh**. The Central-South Eurasia has been crossed for millennia by different populations and its populations show intermediate genetic frequencies between East Asia, Europe and Middle East. In the slide 11 of the pdf "Beyond identification (ancestry inference)" you can see that with K 4, 5 and 6 the Central/South Asia has the same colors of Europe (light blue) and East Asia (pink). However, the authors of the Global AIMs Nano set made an effort to distinguish this geographical area. ## Where does **NA20502** come from? Now, we will compare the classification performances of tha random and the Nano set on the individual *NA20502*. ```{r echo=FALSE} unknown_sample = "NA20502" random_markers = sample(snp.id,30) snp_set = random_markers # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } # compute the genotypic frequencies of the genotypes in the unknown sample geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_r = sort(apply(geno_prob, 2, prod), decreasing = T) LR_r = RMP_r[1]/RMP_r[2] RMP_r = as.data.frame(RMP_r) RMP_r = format(RMP_r, digits = 3) ``` ```{r echo=FALSE} # compute the genotypic frequencies of the genotypes in the unknown sample unknown_sample = "NA20502" nano_set_31_df = read.table("found.txt", hea=F) nano_set_31 = as.character(nano_set_31_df$V2) snp_set = nano_set_31 # compute the allele frequency across all populations af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(af_A1)= snp_set colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ pop.df = geno_pop[geno_pop$POP == pop,] n.alleles = 2*nrow(pop.df) for (snp in snp_set){ af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles) } } geno_prob = data.frame(matrix(NA,nrow=27, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")))) rownames(geno_prob)= snp_set colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI") for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){ #pop.df = geno_pop[geno_pop$POP == pop,] #n.alleles = 2*nrow(pop.df) for (snp in snp_set){ p = af_A1[snp,pop] q = 1 - p if (geno_unk[unknown_sample,snp] == 0){ geno_prob[snp,pop] = q*q } else if (geno_unk[unknown_sample,snp] == 1){ geno_prob[snp,pop] = 2*p*q } else if (geno_unk[unknown_sample,snp] == 2){ geno_prob[snp,pop] = p*p } } } RMP_n = sort(apply(geno_prob, 2, prod), decreasing = T) LR_n = RMP_n[1]/RMP_n[2] RMP_n = as.data.frame(RMP_n) RMP_n = format(RMP_n, digits = 3) ``` ```{r echo=FALSE} colnames(RMP_r) = "random set" colnames(RMP_n) = "Nano set" RMP_r[9,"random set"] = LR_r rownames(RMP_r)[9] = "LR" RMP_n[9,"Nano set"] = LR_n rownames(RMP_n)[9] = "LR" knitr::kable( list( RMP_r, RMP_n ), caption = 'RMP for the sample NA20502 using a random (on the left) and the Nano (on the rigth) marker sets.', booktabs = TRUE, valign = 't' ) ``` **The sample NA20502 comes from Italy and particularly from Tuscany**. None of the marker sets is able to correctly infer his ancestry. A random set of markers could never solve the difficult problem of distinguishing different countries from the same continents (in this case, Europe) and neither the Global AIMs Nano set can do it, because this set was developed with the aim of distinguishing the different continents. In fact, this set has correctly understood that this sample comes from Europe, since all the European populations have around the same probabilities of being the population of origin of NA20502. However, none of them prevails.

**A question for you**: where does the individual HG01620 come from?

*** # All populations at a glance -- *principal component analysis* Can we use the entire set of markers to understand where an individual comes from? The simple answer is yes, however, it is unthinkable to use in the current forensic pratice whole genome sequences or, like in our case, around 30,000 genetic markers for ancestry inference. ```{r} library(ggplot2) library(RColorBrewer) pca <- snpgdsPCA(genofile, num.thread=2) pc.percent <- pca$varprop*100 head(round(pc.percent, 2)) tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], # the first eigenvector EV2 = pca$eigenvect[,2], # the second eigenvector stringsAsFactors = FALSE) # add pop labels target = tab$sample.id tab$POP = pop_labels[match(target,pop_labels$SAMPLE),"POP"] tab$SAMPLE = pop_labels[match(target,pop_labels$SAMPLE),"SAMPLE"] # add the inferred ancestry tab$POP = as.character(tab$POP) tab$POP[which(tab$sample.id=="NA18520")] = 'inf_YRI' tab$POP[which(tab$sample.id=="NA18532")] = 'inf_CHB' tab$POP[which(tab$sample.id=="HG03585")] = 'inf_BEB' tab$POP[which(tab$sample.id=="NA20502")] = 'inf_TSI' # create color palette colors = c(brewer.pal(8, "Set1"), rep("black",5)) names(colors) = c("BEB","CHB","FIN","GBR","IBS","PEL","TSI","YRI","inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2") # give different shapes to the unknown samples tab$pch=21 tab$pch[which(tab$sample.id=="NA18520")] = 22 tab$pch[which(tab$sample.id=="NA18532")] = 23 tab$pch[which(tab$sample.id=="HG03585")] = 24 tab$pch[which(tab$sample.id=="NA20502")] = 25 tab$pch[which(tab$sample.id=="HG01620")] = "" tab$POP = as.factor(tab$POP) # divide in tab_pop and tab_unk tab_pop = tab[!(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),] tab_pop = droplevels(tab_pop) tab_unk = tab[which(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),] tab_unk = droplevels(tab_unk) ggplot(tab_pop, aes(x=EV1, y=EV2, color=POP)) + geom_point() + theme_linedraw() + xlab(round(pc.percent[1], digits=2)) + ylab(round(pc.percent[2], digits=2)) + scale_color_manual(values = colors) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")], y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], fill = "darkgrey", color="black", shape=22, size=3) + annotate("text", label ="inf_BEB", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")] + 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], fill = "darkgrey", color="black", shape=23, size=3) + annotate("text", label= "inf_YRI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")] - 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], fill = "darkgrey", color="black", shape=24, size=3) + annotate("text", label= "inf_CHB", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], color="black", size=3) + annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], fill = "darkgrey", color="black", shape=25, size=3) + annotate("text", label= "inf_TSI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], color="black", size=3) + theme(legend.position="bottom") + guides(colour = guide_legend(override.aes = list(size=6))) + ggtitle("Principal component analysis") ``` ```{r} snpgdsClose(genofile) ```