1 Getting started – R and RStudio download

Before starting to work on some real data, we need to download R statistical software and its integrated development environment RStudio, which help us in programming more easily in R.

1.1 Mac Users

1.1.1 To Install R

Open an internet browser and go to www.r-project.org.
Click the “download R” link in the middle of the page under “Getting Started.”
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for (Mac) OS X” link at the top of the page.
Click on the file containing the latest version of R under “Files.”
Save the .pkg file, double-click it to open, and follow the installation instructions.

Now that R is installed, you need to download and install RStudio.

1.1.2 To Install RStudio

Go to www.rstudio.com and click on the “Download RStudio” button.
Click on “Download RStudio Desktop.”
Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.

1.2 Windows Users

1.2.1 To Install R:

Open an internet browser and go to www.r-project.org.
Click the “download R” link in the middle of the page under “Getting Started.”
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for Windows” link at the top of the page.
Click on the “install R for the first time” link at the top of the page.
Click “Download R for Windows” and save the executable file somewhere on your computer. Run the .exe file and follow the installation instructions.

Now that R is installed, you need to download and install RStudio.

1.2.2 To Install RStudio

Go to www.rstudio.com and click on the “Download RStudio” button.
Click on “Download RStudio Desktop.”
Click on the version recommended for your system, or the latest Windows version, and save the executable file. Run the .exe file and follow the installation instructions.

2 Worldwide populations and their DNA – genetic data download

In this tutorial we will analyse the genetic data from the 1000 Genomes Project. The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data (Figure 2.1 and the map at this link).

$\label{fig:1000genome}1000 Genome Project populations: African in yellow, American in red, East Asian in green, European in blue and South Asian in purple.$

Figure 2.1: 1000 Genome Project populations: African in yellow, American in red, East Asian in green, European in blue and South Asian in purple.

The 1000 Genomes Project took advantage of developments in sequencing technology, which sharply reduced the cost of sequencing. It was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. Data from the 1000 Genomes Project was quickly made available to the worldwide scientific community through freely accessible public databases.

Data from 1000 Genomes Project can bee freely available from this webpage. Unfortunately, we cannot work here the entire set of variants due to the limited computational resources of our laptops. In fact, in the phase3 of the 1000 Genomes Project more that 2000 individuals have been sequenced, generating a dataset of size around 50Gb.

For this reason, I created for you a dataset comprising 786 individuals from eight worldwide populations and around 30,000 genetic variants (“subsampled.bed”, “subsampled.bim”, “subsampled.fam”). Moreover, you also need a file containing the population label for each individual (“kgp3.pheno.tsv”).

Create a directory called “popgen_tutorial” and move these two files into the directory.

3 How DNA is coded in a text file? – genetic data format

3.1 File formats

Let’s have a look at the file. While .fam and .bim are text file, the .bed is a binary file and cannot be opened with a text editor. These are Plink formats currently used in population genetics because they can effectively store a huge amount of genetic data. .fam file contains the sample information, .bim contains the genetic variant information (see the table below) and .bed codes the genotype calls. More information about the file formats in Plink can be found here.

Table 3.1: .fam (on the left) and .bim file (on the right).

FID	IID	PHENO
KGP3	HG00096	-9
KGP3	HG00097	-9
KGP3	HG00099	-9
KGP3	HG00100	-9
KGP3	HG00101	-9
KGP3	HG00102	-9

CHR	RS	POS	A1	A2
1	rs575272151	11008	G	C
1	rs62635286	13116	G	T
1	rs531730856	13273	C	G
1	rs531646671	14599	A	T
1	rs75454623	14930	G	A
1	rs78601809	15211	T	G

3.2 Set the working directory

Before starting to work in R environment, we need to specify the working directory, which will be used by R to read and save files. In order to set our wirking directory we need to find the path leading to the directory called “popgen_tutorial”.

## For Mac users you path will be something like this:
#setwd("/Users/I/Desktop/Uni/forensic_genetics/popgen_tutorial")

## Fow Windows users the pathe will be something like this:
# setwd("c:/Documents/Desktop/Uni/forensic_genetics/popgen_tutorial")

# My path is:
setwd("/Users/bioserenina/Google Drive File Stream/Il mio Drive/lezioni_carlo/lezioni/esercitazione/popgen_tutorial")

Given that the genotype information are store in a binary file (.bed) which cannot be opened easily, we will make use of an R library specifically developed to work with genetic data: SNPRelate. You can find here a tutorial on its functions.

As a first step, you need to install the library, one of its dependency (the package “gdsfmt”) as well as other packages we will use in this tutorial using the command below.

# installing some packages
source("http://bioconductor.org/biocLite.R")
biocLite("gdsfmt")
biocLite("SNPRelate")
install.packages(c("ggplot2","RColorBrewer", "knitr", "kableExtra", "dplyr"))

Then, once you have installed a package, you need to “call” the library.

library(SNPRelate)
library(gdsfmt)

As reported in the SNPRelate tutorial, this library and the gdsfmt package can store the huge amount of genetic information into a different and more efficient file firmat: .gds.

To support efficient memory management for genome-wide numerical data, the gdsfmt package provides the genomic data structure (GDS) file format for array-oriented bioinformatic data, which is a container for storing annotation data and SNP genotypes. In this format each byte encodes up to four SNP genotypes thereby reducing file size and access time.

# create the "test.gds" file starting from our .bed, .bim, and .fam.
bed.fn <- "subsampled.bed"
fam.fn <- "subsampled.fam"
bim.fn <- "subsampled.bim"
snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds")

## Start snpgdsBED2GDS ...
##  BED file: "subsampled.bed" in the SNP-major mode (Sample X SNP)
##  FAM file: "subsampled.fam", DONE.
##  BIM file: "subsampled.bim", DONE.
## Wed May 20 09:51:52 2020     store sample id, snp id, position, and chromosome.
##  start writing: 786 samples, 29547 SNPs ...
##      Wed May 20 09:51:52 2020    0%
##      Wed May 20 09:51:52 2020    100%
## Wed May 20 09:51:52 2020     Done.
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
##     open the file 'test.gds' (5.7M)
##     # of fragments: 39
##     save to 'test.gds.tmp'
##     rename 'test.gds.tmp' (5.7M, reduced: 252B)
##     # of fragments: 18

snpgdsSummary("test.gds")

## The file name: /Volumes/GoogleDrive/Il mio Drive/lezioni_carlo/lezioni/esercitazione/popgen_tutorial/test.gds 
## The total number of samples: 786 
## The total number of SNPs: 29547 
## SNP genotypes are stored in SNP-major mode (Sample X SNP).

Once we have created the “test.gds” file, we need to open it and extract the information of interest, which are:

samples information (sample.id)
genetic variant information (snp.id)
genotypes (g)

genofile <- snpgdsOpen("test.gds")
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
snp.id <- read.gdsn(index.gdsn(genofile, "snp.id"))
g <- read.gdsn(index.gdsn(genofile, "genotype"))

4 Spring cleaning – how to clean up your data

In order to make the computation steps fasible for our laptops I have already cleaned the data. Moreover, I have also sampled one SNP out of eight SNPs, in order to reduce the size of our files.

In particular the cleaning steps you always need to do are:

check for missingness: removal of SNPs with missing call rates exceeding the provided value (0.05) and samples with missing call rates higher than 0.05.
MAF (minor allele frequency): removal of monomorphic or rare SNPs. SNPs with a very low allele frequency in your dataset are of course not informative for distinguishing different populations because they are shared between a few individuals or are private of a single individual.
LD (linkage-disequilibrium): according to the specific analysis we want to do, we may decide to prune (in other words, filter) the genetic variants for linkage disequilibrium. In particular, analyses requiring the independence of genetic markers (such as the Naive Bayes classification or STRUCTURE) will also require the LD pruning.
check for relatedness among samples: the frequency computation cannot be influenced by the presence of related samples.
autosomes only: in some analyses, like ancestry inference, we may prefer to work using only autosomes given the fact that they are equally shared between males and females (in other words, they have the same effective population size).

4.1 Creation of the genotype matrix

Once we have finished the cleaning steps (already perfomed), we set up the genotypes matrix. Using the R commands dim(geno) we can see matrix size (rows and columns number), while in order to see part of the matrix we can use the command geno[1:5,1:5], which shows the first five rows and columns of the matrix.

geno = as.data.frame(g)
colnames(geno) = snp.id
rownames(geno) = sample.id

4.2 Missingness check

We know that the genotypes have already been cleaned by me, but the first rule in bioinformatics is “never trust the guy giving you some data”.

paste0("how many missing cells are there in our dataset? ", sum(geno==3))

## [1] "how many missing cells are there in our dataset? 0"

At this point, we need another information: the ancestry of the individuals in sample.id variable. You can find this information in the file “kgp3.pheno.tsv” and, with the commands below, you can add this information to the genotype matrix.

4.3 Adding population labels

# read the population labels file
pop_labels = read.table("kgp3.pheno.tsv", hea=T)

# add the population labels to geno
target = rownames(geno)
#head(pop_labels[match(target,pop_labels$SAMPLE),])
geno$POP = pop_labels[match(target,pop_labels$SAMPLE),"POP"]
geno$SAMPLE = pop_labels[match(target,pop_labels$SAMPLE),"SAMPLE"]
# check if the order of samples is correct: if the answer is TRUE can be happy
identical(rownames(geno), as.character(geno$SAMPLE))

## [1] TRUE

geno = geno[,c("POP", snp.id)]
#head(geno)[1:5]

How many populations do we have? As you can see with the R command table(geno$POP), there are 8 worldwide populations and 5 individuals of unknown ancestry.

label	population description	size
BEB	Bengali from Bangladesh	86
CHB	Han Chinese in Beijing, China	102
FIN	Finnish in Finland	98
GBR	British in England and Scotland	91
IBS	Iberian Population in Spain	106
TSI	Tuscans in Italy	107
PEL	Peruvians from Lima, Peru	84
YRI	Yoruba in Ibadan, Nigeria	107

# Divide the dataset in known populations and unknown individuals
geno_unk = geno[which(geno$POP %in% c("UNK1", "UNK2", "UNK3", "UNK4", "UNK5")),]
geno_pop = geno[!(geno$POP %in% c("UNK1", "UNK2", "UNK3", "UNK4", "UNK5")),]

In the end, our final dataset will be something like this, where each row is a sample, the first column is the population the sample comes from and the other columns are the SNPs:

Table 4.1: First 5 samples and 7 SNPs of the genotype matrix *geno_pop*.
	POP	rs116400033	rs115209712	rs112455420	rs535902969	rs9283154	rs4520358
HG00096	GBR	2	0	0	0	0	1
HG00097	GBR	0	0	0	0	1	1
HG00099	GBR	0	1	0	0	0	1
HG00100	GBR	0	0	1	1	1	1
HG00101	GBR	0	0	0	0	2	1
HG00102	GBR	0	2	1	0	1	1

5 Keep calm and do some math – allelic and genotypic frequencies

5.1 Allele frequencies

At this point, the genotype matrix is ready and we can start the hard work: allele frequencies computation.

The frequency of an allele is defined as the number of the copies of the allelel in the population divided by the total number of gene copies in the population. In a diploid individual (in which all individuals carry two copies of each chromosome) with N individual, there are 2N gene copies. So the frequencies of alleles A1 and A2 are:

$f_{A1}$ = $\frac{N_{A1}}{2N}$

and

$f_{A2}$ = $\frac{N_{A2}}{2N}$

Let’s have a look at our genotypes, by taking the first five rows and five columns.

geno_pop[1:5,1:5]

##         POP rs116400033 rs115209712 rs112455420 rs375803869
## HG00096 GBR           2           0           0           0
## HG00097 GBR           0           0           0           0
## HG00099 GBR           0           1           0           0
## HG00100 GBR           0           0           1           0
## HG00101 GBR           0           0           0           0

There are four possible values stored in geno_pop: 0, 1, 2 and 3. For bi-allelic SNP sites, “0” indicates two A2 alleles, “1” indicates one A1 allele and one A2 allele, “2” indicates two A1 alleles, and “3” is a missing genotype.

Let’s see the second SNP in the geno_pop matrix: rs115209712.

Table 5.1: SNP rs115209712 coded in the genotype matrix with the numbers 0, 1 and 2 (on the left) and the correspondent nucleotides (on the rigth).

	POP	rs115209712
HG00122	GBR	1
HG00123	GBR	0
HG00125	GBR	0
HG00126	GBR	0
HG00127	GBR	0
HG00128	GBR	0

	POP	rs115209712
HG00122	GBR	AG
HG00123	GBR	AA
HG00125	GBR	AA
HG00126	GBR	AA
HG00127	GBR	AA
HG00128	GBR	AA

Given that the geno_pop dataset harbours the count of one allele for each genotype (the cell), in order to compute the allele frequency, we simply need to sum all values in each column (the SNP) and divide it by 2N.

We can compute the frequency of the allele A across all samples or for each population separately.

# A1 frequency of the SNP rs115209712 in the entire dataset
sum(geno_pop$"rs115209712")/(2*dim(geno_pop)[1])

## [1] 0.1101152

# A2 frequency of the SNP rs115209712 in the entire dataset
1 - sum(geno_pop$"rs115209712")/(2*dim(geno_pop)[1])

## [1] 0.8898848

# Another way to compute the A2 frequency
(sum(geno_pop$"rs115209712" == 0)*2 + sum(geno_pop$"rs115209712" == 1))/(2*dim(geno_pop)[1])

## [1] 0.8898848

# A1 frequency of the SNP rs115209712 in the entire dataset
af_A1 = data.frame(matrix(NA,nrow=1, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(af_A1)= "rs115209712"
colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  pop.df = geno_pop[geno_pop$POP == pop,]
  n.alleles = 2*nrow(pop.df)
  snp ="rs115209712"
  af_A1[snp, pop] = sum(pop.df[, snp])/n.alleles
}

# A2 frequency of the SNP rs115209712 in the entire dataset
af_A2 = 1 - af_A1

The frequencies of the alleles A1 and A2 of SNP rs115209712 in each populations are:

Table 5.2: Frequency of the minor allele G.
	GBR	FIN	IBS	PEL	BEB	YRI	CHB	TSI
rs115209712	0.1208791	0.1060606	0.0849057	0.0529412	0.1176471	0.1542056	0.0882353	0.1462264

Table 5.3: Frequency of the major allele A.
	GBR	FIN	IBS	PEL	BEB	YRI	CHB	TSI
rs115209712	0.8791209	0.8939394	0.9150943	0.9470588	0.8823529	0.8457944	0.9117647	0.8537736

If you compare the allele frequencies found in our dataset with the ones reported in Ensembl, you can see that they match.

5.2 Genotype frequencies

The netx step will be the computation of the genotype frequencies according to Hardy-Weinberg formulas.

The genotype frequencies under Hardy-Weinberg Equilibrium are:

Genotype	AA	Aa	aa
Frequency	$f_{A}^{2}$	$2f_{A}f_{a}$	$f_{a}^{2}$

gf_rs115209712 = data.frame(matrix(NA,nrow=3, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(gf_rs115209712)= c("GG", "GA", "AA")
colnames(gf_rs115209712) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  gf_rs115209712["GG",pop] = af_A1[pop]*af_A1[pop]
  gf_rs115209712["GA",pop] = 2*af_A1[pop]*af_A2[pop]
  gf_rs115209712["AA",pop] = af_A2[pop]*af_A2[pop]
}

genotype	GBR	FIN	IBS	TSI	BEB	CHB	PEL	YRI
GG	0.0146	0.0112	0.0072	0.0214	0.0138	0.0078	0.0028	0.0238
GA	0.2125	0.1896	0.1554	0.2497	0.2076	0.1609	0.1003	0.2609
AA	0.7729	0.7991	0.8374	0.7289	0.7785	0.8313	0.8969	0.7154

As you can see the snp rs115209712 is not variable across different worldwide populations: in fact the homozygous genotype AA has an allele frequency higher that 0.71 everywhere, while the other two genotypes show little or no variation. From this rapid “visual inspection”, we can conclude that this SNP won’t surely be informative for ancestry inference.

Let’s try with another SNP, this time known in the literature for being informative for ancestry inference. The SNP is rs2814778 and has been reported byt this paper as a genetic marker informative in distinguishing European individuals. Let’s see if it works.

# allele frequencies
af_A1 = data.frame(matrix(NA,nrow=1, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(af_A1)= "rs2814778"
colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  pop.df = geno_pop[geno_pop$POP == pop,]
  n.alleles = 2*nrow(pop.df)
  snp ="rs2814778"
  af_A1[snp, pop] = sum(pop.df[, snp])/n.alleles
}
# A2 frequency of the SNP rs2814778 in the entire dataset
af_A2 = 1 - af_A1

# genotype frequencies
gf_rs2814778 = data.frame(matrix(NA,nrow=3, ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(gf_rs2814778)= c("CC", "CT", "TT")
colnames(gf_rs2814778) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  gf_rs2814778["CC",pop] = af_A1[pop]*af_A1[pop]
  gf_rs2814778["CT",pop] = 2*af_A1[pop]*af_A2[pop]
  gf_rs2814778["TT",pop] = af_A2[pop]*af_A2[pop]
}

genotype	GBR	FIN	IBS	TSI	BEB	CHB	PEL	YRI
CC	0	0	4e-04	1e-04	0	0	0.0017	0.9907
CT	0	0	0.037	0.0187	0	0	0.079	0.0093
TT	1	1	0.9626	0.9812	1	1	0.9193	0

AS you can see from the table above and its Ensembl webpage, this SNP is informative in distinguishing Yoruba (African from Nigeria) individuals, because only in the Yoruba population the allele C and the genotype CC have almost reached the fixation.

6 Where do you come from? – the Naive Bayes classification

As we have seen in the Beyond identification (ancestry inference) lecture, the steps we need to do in order to infer the ancestry of a sample are:

select and appropriate AIMs (Ancestral Informative Markers) panel,
compute the allele frequencies in the reference populations,
compute the genotypic frequencies in the reference populations of each unknown individual’s genotype,
compute a RMP (Random Match Probability) multiplying the genotypic frequencies,
compute the Likelihood Ratio for the highest probability.

In this tutorial, we cannot compute the informativeness for each marker due to the limited computational resources of our laptop. However, we will compare the classification performances of 30 randomly selected markers with the Global AIMs Nano set, a 31-plex assay of ancestry-informative SNPs developed in order to differentiate the five continents (M.de la Puente et al., 2016, Forensic Science International: Genetics). The complete set of SNPs is reported in Fig 6.1.

$\label{fig:nano}The Global AIMs Nano set.$

Figure 6.1: The Global AIMs Nano set.

Now we will try to infer the ancestry of some of the unknown individuals. Let’s start with “NA18520”.

6.1 Where does NA18520 come from?

6.1.1 What happens if we use 30 randomly choosen markers?

Here we will try to infer the ancestry of the individual “NA18520”.

unknown_sample = "NA18520"

random_markers = sample(snp.id,30)
snp_set = random_markers

# compute the allele frequency across all populations
af_A1 = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(af_A1)= snp_set
colnames(af_A1) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  pop.df = geno_pop[geno_pop$POP == pop,]
  n.alleles = 2*nrow(pop.df)
  for (snp in snp_set){
    af_A1[snp, pop] = (1 + sum(pop.df[, snp]))/(2 + n.alleles)
  }
}

# compute the genotypic frequencies of the genotypes in the unknown sample
geno_prob = data.frame(matrix(NA,nrow=length(snp_set), ncol=length(c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI"))))
rownames(geno_prob)= snp_set
colnames(geno_prob) = c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")

for (pop in c("GBR", "FIN", "IBS", "PEL", "BEB", "YRI", "CHB", "TSI")){
  #pop.df = geno_pop[geno_pop$POP == pop,]
  #n.alleles = 2*nrow(pop.df)
  for (snp in snp_set){
    p = af_A1[snp,pop]
    q = 1 - p
    if (geno_unk[unknown_sample,snp] == 0){
      geno_prob[snp,pop] = q*q
    } else if (geno_unk[unknown_sample,snp] == 1){
      geno_prob[snp,pop] = 2*p*q
    } else if (geno_unk[unknown_sample,snp] == 2){
      geno_prob[snp,pop] = p*p
    }
  }   
}
RMP = sort(apply(geno_prob, 2, prod), decreasing = T)
LR = RMP[1]/RMP[2]

Table 6.1: RMP for the sample NA18520 using 30 random markers.
	YRI	BEB	TSI	IBS	CHB	GBR	FIN	PEL
RMP	3.61e-09	9.95e-13	2.36e-13	1.64e-13	9.48e-14	7.67e-14	4.67e-15	8.48e-18

## [1] "The YRI ancestry is 3627.38281716031 times more probable than BEB ancestry."

6.1.2 What happens when we use the Global AIMs Nano set?

Table 6.2: RMP for the sample NA18520 using the Global AIMs Nano set.
	YRI	BEB	PEL	IBS	TSI	FIN	GBR	CHB
RMP	1.20e-03	8.88e-21	3.82e-22	3.60e-23	4.78e-28	1.27e-28	1.53e-29	1.39e-31

## [1] "The YRI ancestry is 134892248993230752 times more probable than BEB ancestry."

As correctly inferred by both the random SNPs set and the Global AIMs Nano set, NA18520 comes from the Yoruba population (Nigeria). It is pretty impressive that even a set of random markers can (with some uncertainty) infer his ancestry, but if we think about our history as human beings, we can find an explanation.

A question for you: how is it possible that 30 randomly selected SNPs can relatively accurately infer the ancestry of an individual coming from Africa?

6.2 Where does NA18532 come from?

Now, we will compare the classification performances of the random and the Nano set on the individual NA18532.

Table 6.3: RMP for the sample NA18532 using a random (on the left) and the Nano (on the rigth) marker sets.

	random set
CHB	2.27e-11
FIN	8.46e-13
BEB	7.09e-13
TSI	4.51e-14
YRI	3.83e-14
GBR	3.64e-14
IBS	3.62e-14
PEL	2.96e-14
LR	26.8196563650414

	Nano set
CHB	2.11e-05
PEL	2.23e-13
BEB	5.35e-14
FIN	1.34e-28
IBS	5.11e-31
TSI	3.00e-32
GBR	7.21e-34
YRI	1.97e-38
LR	94622057.9740677

The sample NA18532 is a Han Chinese individual from Beijing and while the random sets cannot infer his ancestry with high confidence (see the reduced LR with the random SNPs), the Global AIMs Nano set can accurately find his origins.

6.3 Where does HG03585 come from?

Now, we will compare the classification performances of tha random and the Nano set on the individual HG03585.

Table 6.4: RMP for the sample HG03585 using a random (on the left) and the Nano (on the rigth) marker sets.

	random set
BEB	1.17e-10
TSI	7.66e-11
GBR	6.05e-11
IBS	3.95e-11
FIN	2.39e-11
PEL	1.06e-11
CHB	2.00e-12
YRI	2.46e-14
LR	1.52491589635186

	Nano set
BEB	4.80e-08
PEL	3.12e-10
CHB	8.89e-14
FIN	9.83e-19
IBS	6.04e-20
TSI	7.36e-22
GBR	3.14e-22
YRI	1.45e-29
LR	153.984528323084

The sample HG03585 comes from Bangladesh. The Central-South Eurasia has been crossed for millennia by different populations and its populations show intermediate genetic frequencies between East Asia, Europe and Middle East. In the slide 11 of the pdf “Beyond identification (ancestry inference)” you can see that with K 4, 5 and 6 the Central/South Asia has the same colors of Europe (light blue) and East Asia (pink). However, the authors of the Global AIMs Nano set made an effort to distinguish this geographical area.

6.4 Where does NA20502 come from?

Now, we will compare the classification performances of tha random and the Nano set on the individual NA20502.

Table 6.5: RMP for the sample NA20502 using a random (on the left) and the Nano (on the rigth) marker sets.

	random set
GBR	1.13e-10
TSI	8.54e-11
IBS	7.80e-11
FIN	3.09e-11
BEB	1.45e-12
PEL	8.08e-14
CHB	1.74e-16
YRI	1.11e-16
LR	1.32381794918444

	Nano set
GBR	4.85e-06
TSI	3.37e-06
IBS	1.85e-06
FIN	1.52e-06
BEB	3.29e-11
PEL	1.55e-16
YRI	4.39e-28
CHB	3.34e-30
LR	1.43899026619238

The sample NA20502 comes from Italy and particularly from Tuscany. None of the marker sets is able to correctly infer his ancestry. A random set of markers could never solve the difficult problem of distinguishing different countries from the same continents (in this case, Europe) and neither the Global AIMs Nano set can do it, because this set was developed with the aim of distinguishing the different continents. In fact, this set has correctly understood that this sample comes from Europe, since all the European populations have around the same probabilities of being the population of origin of NA20502. However, none of them prevails.

A question for you: where does the individual HG01620 come from?

7 All populations at a glance – principal component analysis

Can we use the entire set of markers to understand where an individual comes from? The simple answer is yes, however, it is unthinkable to use in the current forensic pratice whole genome sequences or, like in our case, around 30,000 genetic markers for ancestry inference.

library(ggplot2)
library(RColorBrewer)

pca <- snpgdsPCA(genofile, num.thread=2)

## Principal Component Analysis (PCA) on genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## Working space: 786 samples, 29,547 SNPs
##     using 2 (CPU) cores
## PCA:    the sum of all selected genotypes (0,1,2) = 11466412
## CPU capabilities: Double-Precision SSE2
## Wed May 20 09:52:52 2020    (internal increment: 452)
## 
[..................................................]  0%, ETC: ---        
[==================================================] 100%, completed in 3s
## Wed May 20 09:52:55 2020    Begin (eigenvalues and eigenvectors)
## Wed May 20 09:52:55 2020    Done.

pc.percent <- pca$varprop*100
head(round(pc.percent, 2))

## [1] 5.94 3.85 1.61 0.75 0.43 0.23

tab <- data.frame(sample.id = pca$sample.id,
                  EV1 = pca$eigenvect[,1],    # the first eigenvector
                  EV2 = pca$eigenvect[,2],    # the second eigenvector
                  stringsAsFactors = FALSE)

# add pop labels
target = tab$sample.id
tab$POP = pop_labels[match(target,pop_labels$SAMPLE),"POP"]
tab$SAMPLE = pop_labels[match(target,pop_labels$SAMPLE),"SAMPLE"]

# add the inferred ancestry
tab$POP = as.character(tab$POP)
tab$POP[which(tab$sample.id=="NA18520")] = 'inf_YRI'
tab$POP[which(tab$sample.id=="NA18532")] = 'inf_CHB'
tab$POP[which(tab$sample.id=="HG03585")] = 'inf_BEB'
tab$POP[which(tab$sample.id=="NA20502")] = 'inf_TSI'

# create color palette
colors = c(brewer.pal(8, "Set1"), rep("black",5))
names(colors) = c("BEB","CHB","FIN","GBR","IBS","PEL","TSI","YRI","inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")

# give different shapes to the unknown samples
tab$pch=21
tab$pch[which(tab$sample.id=="NA18520")] = 22
tab$pch[which(tab$sample.id=="NA18532")] = 23
tab$pch[which(tab$sample.id=="HG03585")] = 24
tab$pch[which(tab$sample.id=="NA20502")] = 25
tab$pch[which(tab$sample.id=="HG01620")] = ""

tab$POP = as.factor(tab$POP)

# divide in tab_pop and tab_unk
tab_pop = tab[!(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),]
tab_pop = droplevels(tab_pop)
tab_unk = tab[which(tab$POP %in% c("inf_TSI","inf_BEB","inf_YRI","inf_CHB","UNK2")),]
tab_unk = droplevels(tab_unk)

ggplot(tab_pop, aes(x=EV1, y=EV2, color=POP)) + geom_point() + theme_linedraw() + xlab(round(pc.percent[1], digits=2)) + ylab(round(pc.percent[2], digits=2)) + scale_color_manual(values = colors) +
  annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")], y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], fill = "darkgrey", color="black", shape=22, size=3) +
  annotate("text", label ="inf_BEB", x = tab_unk$EV1[which(tab_unk$sample.id=="HG03585")] + 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="HG03585")], color="black", size=3) +
  
  annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], fill = "darkgrey", color="black", shape=23, size=3) +
  annotate("text", label= "inf_YRI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18520")] - 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18520")], color="black", size=3) +
  
  annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], fill = "darkgrey", color="black", shape=24, size=3) +
  annotate("text", label= "inf_CHB", x = tab_unk$EV1[which(tab_unk$sample.id=="NA18532")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA18532")], color="black", size=3) +
  
  annotate("point", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")], y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], fill = "darkgrey", color="black", shape=25, size=3) +
  annotate("text", label= "inf_TSI", x = tab_unk$EV1[which(tab_unk$sample.id=="NA20502")]+ 0.005, y = tab_unk$EV2[which(tab_unk$sample.id=="NA20502")], color="black", size=3) +
  
  theme(legend.position="bottom") + guides(colour = guide_legend(override.aes = list(size=6))) + ggtitle("Principal component analysis")

snpgdsClose(genofile)

Population genetics tutorial

Where do you come from?