Approaching Archaeogenetics




How to use LINADMIX

Tutorial for Estimating Mixing Coefficients for a Given Population Using LINADMIX

As stated in this article, LINADMIX was developed by Lily Agranat-Tamir, Shamam Waldman, Naomi Rosen, Benjamin Yakir, Shai Carmi, and Liran Carmel as an alternative to qpAdm, since qpAdm is not advised when modern and ancient samples are co-examined. LINADMIX works in tandem with ADMIXTURE and relies on ADMIXTURE's output. It uses a linear regression model to estimate the admixture proportions (mixing coefficients) of a target population from the ADMIXTURE results of the source populations, and it computes a plausibility value for each model, so it can also be used to decide which models are plausible. LINADMIX can be used to model modern populations, and it tolerates missing data and genetic drift (ADMIXTURE cannot model genetic drift, whereas LINADMIX is robust to it). Although LINADMIX performs better when the source populations are highly diverged, genetically similar source populations can still be used.
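To make the idea of "mixing coefficients" concrete, here is a toy sketch of my own (not LINADMIX's code) that fits a target as a mix of just two sources by least squares. The function name and the inputs (hypothetical per-population average ancestry-component vectors) are assumptions for illustration only; the real program handles more sources and adds the bootstrap steps described later.

```python
def two_source_mixing(target, source_a, source_b):
    """Toy illustration: share of source_a (between 0 and 1) that best fits
    the target in the least-squares sense, with source_b taking the rest."""
    diff = [a - b for a, b in zip(source_a, source_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("the two sources have identical profiles")
    numer = sum((t - b) * d for t, b, d in zip(target, source_b, diff))
    # Clip to [0, 1] so the two shares stay valid proportions.
    return min(1.0, max(0.0, numer / denom))
```

With source profiles [1.0, 0.0] and [0.0, 1.0], a target of [0.2, 0.8] comes out as 20% of the first source and 80% of the second.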

Though LINADMIX is a Python program, ADMIXTURE requires Linux or macOS, so it's recommended that only those running macOS or a Linux OS proceed. My command line examples are Linux-based.

Generating Necessary Files
Requirements:

1. Linux OS (including WSL2 or VM) of your choice (terminal syntax may vary depending on the distro)

2. Download ADMIXTURE here

3. Download PLINK here for dataset construction and QC

4. Download EIGENSOFT here for converting files

5. Read the Curating Datasets tutorials here and here

6. Read the ADMIXTURE tutorial here.

7. Have a text editor that you can access, such as Notepad or Kate.

8. Optional: Have a spreadsheet editor that you can access.

1. Generating ADMIXTURE-based Input Files

LINADMIX requires the input of both a P and a Q file for a given run for a given value K. See the ADMIXTURE tutorial here to learn how to do this.

To identify which ADMIXTURE output should be used, run each value of K with --cv=5. Run ADMIXTURE for K values from 4 up to whatever value you think would be best (ideally in the 20s), in parallel or successively. Once all of these have run, you can identify which runs should be used for LINADMIX (although, if possible, all should be run for the sake of comparison; see the bottom of this article for more information).
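If you'd rather generate the list of ADMIXTURE commands than type each one, a small sketch like this can help. It assumes the admixture binary is on your PATH and that your dataset is called myfile.bed (both hypothetical here); the log file names are just my own choice.

```python
def admixture_commands(bed_file, k_min, k_max, cv_folds=5):
    """Build one ADMIXTURE command line per K, each logged to its own file."""
    commands = []
    for k in range(k_min, k_max + 1):
        # tee keeps the CV error line in logK.out for later comparison
        commands.append(f"admixture --cv={cv_folds} {bed_file} {k} | tee log{k}.out")
    return commands


for command in admixture_commands("myfile.bed", 4, 6):
    print(command)
```

You can paste the printed lines into a shell script and run them successively, or split them across terminals if you have the resources.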

Theoretically, the K chosen for LINADMIX should have the lowest cross-validation error of all the runs. However, since ADMIXTURE sometimes begins to separate clusters by sampling method at higher values of K, this should not be done blindly. I recommend loading the Q file data (along with the ind list) into a spreadsheet tool (as described in the ADMIXTURE tutorial) to see which individuals each component is maximized in. If a component is maximized only in ancient samples, or only in samples produced with a certain sequencing method, do not use this K for LINADMIX, as it will skew the results when co-analyzing samples from different periods. In that case, pick the K value run(s) with the lowest CV error (using --cv=5) that do not divide clusters by sampling method.
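To compare cross-validation errors without reading each log by eye, here is a minimal sketch that pulls the "CV error (K=...)" line out of ADMIXTURE log text and ranks the K values. The function names are my own, and it assumes you saved each run's terminal output to a text file.

```python
import re

# ADMIXTURE prints a line such as: CV error (K=5): 0.51800
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def cv_errors(log_texts):
    """Collect {K: CV error} from the text of one or more ADMIXTURE logs."""
    errors = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors[int(k)] = float(err)
    return errors


def rank_by_cv(errors):
    """Return the K values ordered from lowest to highest CV error."""
    return sorted(errors, key=errors.get)
```

The ranking is only a starting point: work down the list until you reach a K that does not split clusters by sampling method.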

2. Other Necessary Input Files

LINADMIX also requires an .ind file (PACKEDANCESTRYMAP/EIGENSTRAT format) for the same dataset used in the ADMIXTURE runs. Use convertf on the dataset that you generated the ADMIXTURE data with to turn its .fam into .ind format (PACKEDANCESTRYMAP output will be faster than EIGENSTRAT), and use a spreadsheet tool to adjust the population labels if necessary.
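If you don't want to open a spreadsheet just to fix population labels, the relabeling can be sketched in a few lines. This assumes the usual three-column .ind layout (sample ID, sex, population label); the function name and the sample IDs in the usage note are hypothetical.

```python
def relabel_ind_lines(lines, new_labels):
    """Return .ind lines with the population label replaced for any sample ID
    found in new_labels (a dict of sample ID -> population label)."""
    relabeled = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            # Leave blank or malformed lines as they are.
            relabeled.append(line.rstrip("\n"))
            continue
        sample_id, sex, label = fields[0], fields[1], fields[2]
        relabeled.append(f"{sample_id} {sex} {new_labels.get(sample_id, label)}")
    return relabeled
```

For example, relabel_ind_lines(["S1 M ???", "S2 F English"], {"S1": "Somali"}) gives ["S1 M Somali", "S2 F English"]. Write the result back to a new .ind file rather than overwriting the original.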

Another required input is the .fam file that you used to run ADMIXTURE (this already should be generated given that it was needed to run ADMIXTURE).

LINADMIX also requires a .raw file (generated in PLINK) for the same dataset used in the ADMIXTURE runs. To generate it, use the --recode A flag in PLINK with the --bfile flag set to the dataset that you ran ADMIXTURE on.

Example (using Ubuntu syntax): ./plink --bfile myfile --recode A --out myfileraw

You should then have a .raw file ready to be used for LINADMIX. The Curating Datasets tutorials here have more information about the processes mentioned, as does the ADMIXTURE tutorial here.
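Since every input has to describe the same dataset, a quick row-count check before running anything can save a failed run. This is a sketch under my own assumptions: one individual per line in the .fam, .ind, and .Q files, and one header line plus one line per individual in the .raw file. The function names are hypothetical.

```python
def nonblank(lines):
    """Drop empty lines so a trailing newline does not affect the count."""
    return [line for line in lines if line.strip()]


def row_count_problems(fam_lines, ind_lines, q_lines, raw_lines):
    """Return a list of mismatch messages; an empty list means counts agree."""
    n_individuals = len(nonblank(fam_lines))
    problems = []
    if len(nonblank(ind_lines)) != n_individuals:
        problems.append(".ind and .fam list different numbers of individuals")
    if len(nonblank(q_lines)) != n_individuals:
        problems.append(".Q and .fam have different numbers of rows")
    # The .raw file starts with a header row, so subtract one.
    if len(nonblank(raw_lines)) - 1 != n_individuals:
        problems.append(".raw (minus its header) and .fam have different numbers of rows")
    return problems
```

To use it on real files, read each one with open(path).read().splitlines() and pass the four lists in.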

Downloading LINADMIX

LINADMIX can be downloaded from the terminal using git clone.

Here's an example (specific to Ubuntu):

cd git-repos/

git clone https://github.com/swidler/LINADMIX.git

cd LINADMIX

Then, with the path still set to the LINADMIX folder that was generated:

sudo apt install python3-pip

pip install numpy

pip install "qpsolvers[open_source_solvers]"

Alternatively, download the zip from GitHub and extract it to a folder designated for LINADMIX.

Since LINADMIX is just a Python script with input files, you don't need to build any programs. Whichever way you download it, you will still need pip, numpy, and qpsolvers if you haven't installed them already: use your distro's equivalent of the sudo apt command to install python3-pip, and then run pip install numpy and pip install "qpsolvers[open_source_solvers]".
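To confirm that the Python packages are actually visible to the interpreter you'll be running LINADMIX with, a small check like this can be run first. The helper name is my own; it only uses the standard library.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that this Python interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print(missing_modules(["numpy", "qpsolvers"]))
```

An empty list means both packages were found. Anything printed in the list still needs to be installed with pip.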

Running LINADMIX

1. Using the Example Config Files

Once everything is set up in a specific folder, you can move all of the input files that you've generated to the "example" folder within the LINADMIX master folder. Everything in the config file is basically already set up, except you'll need to change the input files, the output file, and the reference and test populations.

Base Config File

# input files

#dir = "inputdirectoryname/"

dir = "example/"

#indfile_main = dir + "indfilename.ind" # IND file

indfile_main = dir + "example_ind_file.ind" # IND file

#indfile_admix = dir + "famfilename.fam" # FAM file

indfile_admix = dir + "example_file.fam" # FAM file

#Qfile = dir + "Qfilename.K.Q" # Q file

Qfile = dir + "example_file.6.Q" # Q file

#Pfile = dir + "Pfilename.K.P" # P file

Pfile = dir + "example_file.6.P" # P file

#Gfile = dir + "genotypefilename.raw" # G file

Gfile = dir + "example_geno_file.raw" # G file

# output

#out_dir = "outputdirectoryname/"

out_dir = "example/"

#outfile = out_dir + "outputfilename.txt"

outfile = out_dir + "example_results.txt"

# user-supplied vars

#target_pops = ["Target1","Target2",...,"TargetN"]

target_pops = ["Som50Eng50"]

#source_pops_var = ["VaryingSource1","VaryingSource2",...,"VaryingSourceM"]

source_pops_var = ["Somali", "Spanish", "Iranian"]

#source_pops_const = ["ConstantSource1","ConstantSource2",...,"ConstantSourceL"]- you can use multiple source/test populations in the same run.

source_pops_const = ["English"]

num_reps = 1000 # number of repetitions desired for bootstrap in standard error estimation

num_reps_pval = 10000 # number of repetitions desired for bootstrap in empirical p value calculations

qval = 0.01 # threshold of values of ADMIXTURE ancestral populations to be considered in the bootstrap

Edited Config File

# input files

#dir = "inputdirectoryname/"

dir = "example/"

#indfile_main = dir + "indfilename.ind" # IND file

indfile_main = dir + "yourindfile.ind" # IND file

#indfile_admix = dir + "famfilename.fam" # FAM file

indfile_admix = dir + "yourfamfile.fam" # FAM file

#Qfile = dir + "Qfilename.K.Q" # Q file

Qfile = dir + "yourqfile.6.Q" # Q file

# Author's note: 6 is just a placeholder here. Use whatever K value is optimal (as mentioned at the beginning of the article).

#Pfile = dir + "Pfilename.K.P" # P file

Pfile = dir + "yourpfile.6.P" # P file

#Gfile = dir + "genotypefilename.raw" # G file

Gfile = dir + "yourrawfile.raw" # G file

# output

#out_dir = "outputdirectoryname/"

out_dir = "example/"

#outfile = out_dir + "outputfilename.txt"

outfile = out_dir + "myresults.txt"

# user-supplied vars

#target_pops = ["Target1","Target2",...,"TargetN"]

# Author's note: you can use multiple target populations. However, these target populations will all be modeled using the source populations below.

target_pops = ["Targetpop"]

#source_pops_var = ["VaryingSource1","VaryingSource2",...,"VaryingSourceM"]

source_pops_var = ["Sourcepopvar1", "Sourcepopvar2"]

#source_pops_const = ["ConstantSource1","ConstantSource2",...,"ConstantSourceL"]- you can use multiple source/test populations in the same run.

source_pops_const = ["Sourcepopconst1", "Sourcepopconst2"]

# Author's note: this setup assumes 3-way admixture models built from the 2 constant sources plus one of the 2 varying sources at a time, meaning that there will be 2 runs (in the same output file).

num_reps = 1000 # number of repetitions desired for bootstrap in standard error estimation

num_reps_pval = 10000 # number of repetitions desired for bootstrap in empirical p value calculations

qval = 0.01 # threshold of values of ADMIXTURE ancestral populations to be considered in the bootstrap
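To see which models a given pair of source lists implies, here is a sketch based on my reading of the note above (every constant source appears in each model, joined by one varying source at a time). The function name is hypothetical, and LINADMIX's own internals may organize this differently.

```python
def enumerate_models(source_pops_var, source_pops_const):
    """List the source sets implied by the config: all constant sources plus
    one varying source per model."""
    return [source_pops_const + [varying] for varying in source_pops_var]


models = enumerate_models(
    ["Sourcepopvar1", "Sourcepopvar2"],
    ["Sourcepopconst1", "Sourcepopconst2"],
)
for model in models:
    print(model)
```

With the edited config above this prints two 3-way models, which matches the 2 runs mentioned in the note.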

Be sure to use the population names for the populations involved in your analysis. Though the population names are referenced from the .ind file, I recommend having the same population names in both the .fam file and the .ind file.
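A misspelled population name is an easy mistake to make here, so it can be worth checking the config lists against the .ind file before running. This sketch assumes the population label is the third column of each .ind line; the function name is my own.

```python
def missing_populations(ind_lines, *pop_lists):
    """Return the configured population names that never appear as a label
    (third column) in the .ind file."""
    labels = set()
    for line in ind_lines:
        fields = line.split()
        if len(fields) >= 3:
            labels.add(fields[2])
    wanted = [pop for pops in pop_lists for pop in pops]
    return [pop for pop in wanted if pop not in labels]
```

Pass in the lines of your .ind file followed by target_pops, source_pops_var, and source_pops_const. An empty result means every name was found.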

To run LINADMIX with this config, run the corresponding run_linadmix_example file (either run_linadmix_example.py or run_linadmix_example2.py).

You can either run the file directly (by pressing "run") or you can run it via the terminal. I prefer to run things in the terminal so I can see my progress in real time. To do this, just use the run command for the specific run file (assuming that you're using the first example config file):

./run_linadmix_example.py

If the file is not marked as executable, run it through the interpreter instead: python3 run_linadmix_example.py

Running LINADMIX using these config files is not optimal (it's "hacky"), but if you don't have the means to troubleshoot errors, I'd recommend running LINADMIX this way.

Do NOT change anything in the run file unless you plan on duplicating the example config files to parallelize more than 3 runs (I do not recommend this, as it consumes a lot of RAM).

If you do duplicate an example config file and rename it to, say, config_example3, then also duplicate the run_linadmix_example file, name the copy "run_linadmix_example3", and in that copy change from config_example import * to from config_example3 import *
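If you end up making several copies, the import swap can be scripted. This is only a sketch: the function name is my own, and it works on the text of the run file so you can inspect the result before saving it.

```python
def retarget_run_file(run_source, old_config, new_config):
    """Return the run file's text with its config import pointed at new_config."""
    old_line = f"from {old_config} import *"
    new_line = f"from {new_config} import *"
    if old_line not in run_source:
        raise ValueError(f"could not find '{old_line}' in the run file")
    return run_source.replace(old_line, new_line)
```

Read run_linadmix_example.py as text, pass it through retarget_run_file(text, "config_example", "config_example3"), and write the result to run_linadmix_example3.py.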

2. Using the Actual Config File

In this case, you can move your source files to any designated directory within the LINADMIX-master folder. Just edit the parameters in the file titled "config". Do not change the file name for this, or else the configuration file will not be able to be imported into run_linadmix.py. If you want to change the title of the configuration file, this will have to be reflected in the run_linadmix.py file.

# input files

#dir = "inputdirectoryname/"

dir = "inputdirectoryname/" # can be "example/" if that's where your files are

indfile_main = dir + "indfilename.ind" # IND file

indfile_admix = dir + "famfilename.fam" # FAM file

Qfile = dir + "Qfilename.K.Q" # Q file

Pfile = dir + "Pfilename.K.P" # P file

Gfile = dir + "genotypefilename.raw" # G file

# output

out_dir = "outputdirectoryname/"

outfile = out_dir + "outputfilename.txt"

# user-supplied vars

target_pops = ["Target1","Target2",...,"TargetN"]

source_pops_var = ["VaryingSource1","VaryingSource2",...,"VaryingSourceM"]

source_pops_const = ["ConstantSource1","ConstantSource2",...,"ConstantSourceL"] # you can use multiple source/test populations in the same run.

num_reps = 1000 # number of repetitions desired for bootstrap in standard error estimation

num_reps_pval = 10000 # number of repetitions desired for bootstrap in empirical p value calculations

qval = 0.01 # threshold of values of ADMIXTURE ancestral populations to be considered in the bootstrap

The variable part of the run file (in run_linadmix_example.py, run_linadmix_example2.py or run_linadmix.py):

from config import *

If you change the name of the config file, this is the line to edit: replace "config" with the new file name WITHOUT the .py suffix. If you are using the example run files, do not touch this line unless you make a new config file, in which case change it to the name of the new config file.

Once your configuration parameters are set, then you can just run the run file. For example, if you configured everything in config_example2, then you'd run run_linadmix_example2- if you configured everything in config, then you'd run run_linadmix. To run the run file, you can either right click on it and press "run", or you can run it in the terminal. Both methods are outlined earlier in the article.

It's also possible to run multiple differently-configured files at once, either by starting them simultaneously or by running them in separate terminals. You can also write a script to parallelize the runs. However, LINADMIX consumes a significant amount of RAM, and the amount increases with the number of samples in your dataset. Before parallelizing, do a single full run (the RAM consumed increases over time as data is analyzed) to see what your computer can handle.
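For those who want a script rather than several terminals, here is a minimal sketch that launches run files with a cap on how many run at once. The function name and the max_parallel setting are my own; keep the cap at 1 until a single full run has shown you how much RAM your dataset needs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of run files, each launched with the current interpreter.
EXAMPLE_COMMANDS = [
    [sys.executable, "run_linadmix_example.py"],
    [sys.executable, "run_linadmix_example2.py"],
]


def run_scripts(commands, max_parallel=1):
    """Run each command, at most max_parallel at a time, and return the exit
    codes in the same order as the commands."""
    def run_one(command):
        return subprocess.run(command).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

Calling run_scripts(EXAMPLE_COMMANDS, max_parallel=2) from inside the LINADMIX folder would start both example runs together; an exit code of 0 means the run finished without an error.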

Analyzing Results

Once your output file has been created, you can now analyze your results!

A given model "passes" in LINADMIX when its P-value is above 0.05. Generally, the standard errors should not be very large, either (I use ~10% as a cutoff). I would recommend testing the same model at varying values of K (using the same configuration except for the Q and P input files, which will come from a different K run) to see whether the models stay consistent across inputs. Because the output of LINADMIX is reliant on the output of ADMIXTURE, results can and do vary.
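The pass rule above is simple enough to write down as a check. In this sketch you supply the P-value and standard errors yourself after reading them from the output file; the function name is my own, and the cutoffs are just the ones I use, not anything fixed by LINADMIX.

```python
def model_passes(p_value, std_errors, p_cutoff=0.05, se_cutoff=0.10):
    """True when the P-value is above the cutoff and every standard error
    (as a fraction, so 0.10 means 10%) is at or below its cutoff."""
    return p_value > p_cutoff and all(se <= se_cutoff for se in std_errors)
```

For example, a model with a P-value of 0.2 and standard errors of 3% and 4% passes, while the same P-value with a 15% standard error does not.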

Things to Note

LINADMIX is reliant on the output of ADMIXTURE.

This means that if an ADMIXTURE output has the "ancient signal" mentioned earlier in this article (a component maximized only in ancient samples), results from co-analyzing ancient and modern individuals will be inaccurate, because ADMIXTURE's assigned mixing coefficients are distorted by that signal. Regardless of whether the "ancient signal" is present, LINADMIX's mixing coefficients (along with the other parameters it reports) will vary depending on the K of the ADMIXTURE input files, because the mixing coefficients assigned by ADMIXTURE themselves vary with the number of clusters.
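The spreadsheet check for such a signal can also be sketched in code. Here each row of the Q file is a list of component proportions, and groups is a parallel list of labels that you supply yourself (for example "ancient" or "modern", or a sequencing method). The function names and labels are assumptions of mine.

```python
def components_by_group(q_rows, groups):
    """Map each component index to the set of groups whose individuals have
    that component as their largest one."""
    peaks = {}
    for row, group in zip(q_rows, groups):
        top = max(range(len(row)), key=row.__getitem__)
        peaks.setdefault(top, set()).add(group)
    return peaks


def single_group_components(q_rows, groups):
    """Return {component index: group} for components that are maximized only
    in individuals from a single group."""
    return {
        component: next(iter(found))
        for component, found in components_by_group(q_rows, groups).items()
        if len(found) == 1
    }
```

A component flagged as belonging only to "ancient" individuals is a reason to look at that K more closely. A flag is a prompt to inspect, not proof of a problem, since a real population can also be sampled from one period only.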

LINADMIX is not sensitive to recombination given that it is based on linear regression of admixture models.

This means that a population's admixture percentage is treated as applying uniformly to each hypothetical source population in the mixture. This does not reflect the recombinative nature of admixture: because ancestry is passed from parent to offspring through recombination, admixture proportions are not transmitted evenly.

Running LINADMIX requires a significant amount of RAM, especially with larger datasets.

Only parallelize runs if you have the RAM for it.

Despite this, LINADMIX is a powerful tool for estimating admixture proportions, and in many cases it is a better choice than qpAdm.

Sources

GitHub - LINADMIX: the code shown in this tutorial is from this repository; none is my own.

ADMIXTURE

GitHub - DReichLab/EIG: Eigen tools by Nick Patterson and Alkes Price lab

Standard data input - PLINK 1.9