ADMIXTURE Tutorial | Approaching Archaeogenetics

Approaching Archaeogenetics

How to use ADMIXTURE

Includes Plotting Instructions, Can Also Be Used for Downstream Analyses (Options Described at End of Article)

As stated in this article, ADMIXTURE estimates the ancestry of a population using K (a given number) hypothetical populations. It does so in a model-based manner. It utilizes a given dataset containing samples (which can be either ancient or modern, and can be sequenced in any way), and estimates the amount of genetic ancestry of the samples derived from each of K populations, though it does not model genetic drift. These populations are hypothetical and not designated by the user, meaning that ADMIXTURE does not directly test for admixture between populations, examining individual samples instead of populations as a whole. The use of ADMIXTURE is beneficial when differentiating between clusters of populations. Additionally, unlike ADMIXTOOLS, there is little bias when it comes to the type of file (though the way that a sample was sequenced has the capability to affect the composition of a sample at higher values of K)

ADMIXTURE results can be plotted via a bar chart, or even used as PCA-like coordinates. Note that ADMIXTURE can only be used in Linux and MacOS. This tutorial will pertain to Linux only, as I have no experience with MacOS and do not want to disseminate any misinformation relating to the OS.

Generating Models

Requirements:

1. Linux OS (including WSL2 or VM) of your choice (terminal syntax may vary depending on the distro)

2. Download ADMIXTURE here

3. Download PLINK here for dataset construction and QC

4. Download EIGENSOFT here for converting files

5. Read the Curating Datasets tutorials here and here

6. Have a text editor that you can access, such as Notepad or Kate.

7.Optional: Have a spreadsheet editor that you can access.

1. Curate Dataset

ADMIXTURE takes a single dataset in bfile/PLINK format (bed/bim/fam) as its input. As a result, all samples that are to be included in the analysis should be included in the dataset.

Also prune using --geno, especially when running LINADMIX after. This ensures that SNPs shared by less than x% (given the number expressed after the --geno flag in decimal form) of samples will be removed, depending on the parameter.

Sex chromosomes and mtDNA should be removed from the dataset as well, as they are not used in ADMIXTURE analyses. As a result, when merging datasets, the flag --not-chrom X,Y,MT should be used.

2. Thin Dataset's Linkage Disequilibrium

Given that ADMIXTURE does not take linkage disequilibrium (where alleles at different locations are associated in a non-random manner) into account, and ADMIXTURE runs for a very long time when a higher number of SNPs are in a dataset, datasets should be pruned for linkage disequilibrium. This is done in PLINK by using the flag --indep-pairwise. For datasets only containing ancient samples, as established by Lazaridis et al., (2014), the command --indep-pairwise 200 25 0.5 should be used. For datasets involving modern populations, the command --indep-pairwise 50 10 0.1 should be used.

Running Unsupervised ADMIXTURE

Unsupervised ADMIXTURE is the most common mode of ADMIXTURE and utilizes the means of estimation mentioned at the beginning of this tutorial. Given a dataset, ADMIXTURE will estimate the admixture proportions of each sample using K hypothetical populations. Admixture proportions are dependent on the samples used and the value K selected.

1. Put your dataset files in the same folder as the ADMIXTURE program

2. Running via terminal

a. Set the directory to the ADMIXTURE folder. Then, run

./admixture yourdataset.bed (k value of your choosing) --cv=(folds of cross validation of your choosing, 5 is typically used)

If your computer can handle it, you can run multiple at a time (using different values of K, of course). When complete, the terminal will display the cross-validation error of the run at that value k, along with the distances between each hypothetical population. In the folder, you will find the output files, which will show the percentages of each hypothetical population that comprises the genome of each sample. Samples are unlabeled, but are delimited by row. The optimal value K that should be used to model the samples will display the lowest cross-validation error relative to other runs. Once the optimal value K for your dataset has been identified, you can visualize the coefficients modeling the samples in your dataset using that value K (among other things).

Graphing Unsupervised ADMIXTURE Models

Requirements:

1. R

2. Rstudio

3. Spreadsheet tool

Plotting Using Pophelper

Pophelper is an R package that can be used to plot multiple model-based genomic analyses, including ADMIXTURE. To download pophelper, run the following commands on R:

install.packages(c("ggplot2","gridExtra","label.switching","tidyr","remotes"),repos="https://cloud.r-project.org")

remotes::install_github('royfrancis/pophelper')

1. Load library for use

library(pophelper)

2. Set your working directory

setwd("~/nameofworkingdirectory")

3. Load the ADMIXTURE Q file

sfiles <- "~/pathtoyourdirectory/sample.(k value).Q"

4. Make a list using the Q file

slist <- readQ(files=sfiles)

5. Make sample labels

Make a txt file with the sample names or population names associated with each row in the .Q file. This would just be in the form of

Sample1

Sample2

Sample3

...etc

That corresponds to the rows on the Q file, which themselves are dictated by the order of samples in the .fam file. To efficiently make such a list, load the .fam file in a spreadsheet with space-delimited tabs, and choose the column containing the labels that you want. I tend to concatenate the individual sample ID and the population ID. Once the ind file has been made, run the following command in R.

inds <- read.delim(("~/indidfile"),header=FALSE,stringsAsFactors=F)

6. Applying individual labels

rownames(slist[["sample.(k value).Q"]]) <- inds$V1

7. Export the data frame as a CSV file (optional)

This step is very helpful when determining which components are maximized in each sample, so you know what each hypothetical population represents.

write.csv(slist, file = "whateveryouwannanameit.csv")

You can open this file in the spreadsheet editor of your choice. By sorting each column from Z to A, you can see which hypothetical component is maximized in each sample/population. This information can then be used to determine what each hypothetical population is supposed to represent. However, if the percentage at which a given component is maximized is not extremely high, then no sample is representative of the given population. Whether or not this is the case will depend on the value K, and maximized components typically decrease as the value of K increases.

8. Create population labels (optional)

Different populations can be labeled below the individual sample labels. Make a txt file with the group names associated with each row in the .Q file. This would just be in the form of

GroupforSample1

GroupforSample2

GroupforSample3

...etc

Once the .txt file is complete, run the following command in R:

onelabset <- read.delim(("~/pathtogrouplabels/grouplabels"), header=F,stringsAsFactors=F)

9. Plotting the data frame

plotQ(slist,showindlab=T,useindlab=T,indlabsize=whatever size you want the ind label to be, grplab (only if using group labels)=onelabset,grplabsize(only if using group labels)=whatever size you want the group label to be,divtype (if using group labels)=1,grplabspacer(if using group labels)=0,linepos=1,showsp=F,height=3.5,grplabheight(if using group labels)=0.003,grplabpos(if using group labels)=0.8,outputfilename="output",imgtype="png",showyaxis=T,linesize=0.1,pointsize=0.8,dpi=2000,showticks=T,exportpath=getwd()) ... or something like that. Look at the specific parameters on the pophelper website and choose how you want to plot the data.

10. Adding a legend (optional)

Add "legendkeysize=2,legendtextsize=1,legendlab=c("Component maximized in: ","Component maximized in: ","Component maximized in: ")" before "exportpath" in the previous command. That example would be for K=4, use as many labels as the value K.

Plotting multiple runs

Get all Q files from a folder- ensure that the path is to a specific folder with nothing but the Q files.

sfiles <- list.files("~/")

slist <- readQ(files=sfiles[1:whatever the number of sfiles imported is])

inds <- read.delim(("~/indidfile"),header=FALSE,stringsAsFactors=F)

rownames(slist[[1:whatever the number of sfiles imported is]]) <- inds$V1 If you get the error "recursive indexing failed at level 3", then change row names one by one

plotQ(slist[#:#], imgoutput="join",showindlab=T,useindlab=T,indlabsize=1.2, grplab=onelabset,grplabsize=0.5,divtype=1,grplabspacer=0,linepos=1,showsp=F,height=3.5,grplabheight=0.003,showlegend=T,grplabpos=0.8,outputfilename="output",imgtype="png",showyaxis=T,linesize=0.1,pointsize=0.8,showticks=T,legendkeysize=2,legendtextsize=1,dpi=1500,legendlab=c("blah","blah blah"...etc),exportpath=getwd()) ... or something like that. Look at the specific parameters on the pophelper website and choose how you want to plot the data.

The output files for ADMIXTURE can also be used to manually estimate admixture using LINADMIX. Learn how to do so here!

Sources

Github - LINADMIX

ADMIXTURE

Pophelper

GitHub - DReichLab/EIG: Eigen tools by Nick Patterson and Alkes Price lab

Standard data input - PLINK 1.9