Basic Dataset Curation with EIGENSOFT and PLINK | Approaching Archaeogenetics

All population genetics/archaeogenetics analyses require the use of a dataset. A dataset refers to the genetic data of a collection of individuals and other associated data. If you are to use any of the programs mentioned in this article, you will need to use a dataset. The format of the dataset will depend on the program being used.

For ease of understanding, this tutorial will begin with an exploration of the different types of datasets and how they are used.

Most dataset formats in use share the same three-file layout: an "individual" file listing the individuals in the sample, a "genotype" file containing the genetic data, and a "SNP" file describing the SNPs that were genotyped.

The formats that follow this layout are as follows:

EIGENSTRAT

This file form is associated with the EIGENSTRAT program that is included in the EIGENSOFT package. Genotype data is stored in a text file. This format can be used by ADMIXTOOLS and SmartPCA. This format cannot be used in PLINK (PLINK is a program that is used to manipulate and control the quality of datasets).

Individual File Suffix	.ind
Genotype File Suffix	.geno
SNP File Suffix	.snp

PACKEDANCESTRYMAP

This file form is almost identical to EIGENSTRAT, with the difference being that PACKEDANCESTRYMAP-formatted genotype data is stored in a packed binary file. This format can be used by ADMIXTOOLS and SmartPCA. This format cannot be used in PLINK.

Individual File Suffix	.ind
Genotype File Suffix	.geno
SNP File Suffix	.snp

PACKEDPED

This file form is the binary PLINK-associated format. Like PACKEDANCESTRYMAP, genotype data is stored in a packed binary file. This format can be used by ADMIXTURE, ADMIXTOOLS, LINADMIX, and SmartPCA, and can be used in PLINK.

Individual File Suffix	.fam
Genotype File Suffix	.bed
SNP File Suffix	.bim

However, there is one other commonly-used dataset format that does not follow this format. This is the .raw format, which can be generated by PLINK (the process of doing so will be explained in a later part of this tutorial) and is used by LINADMIX.
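To keep the suffixes straight, the three-file formats above can be summarized as a small lookup. This is only a sketch: the function name `dataset_files` and the prefix "example" are my own placeholders, not part of EIGENSOFT or PLINK.

```python
# Suffixes copied from the format descriptions above,
# in the order (individual, genotype, SNP).
FORMAT_SUFFIXES = {
    "EIGENSTRAT": (".ind", ".geno", ".snp"),
    "PACKEDANCESTRYMAP": (".ind", ".geno", ".snp"),
    "PACKEDPED": (".fam", ".bed", ".bim"),
}


def dataset_files(prefix, fmt):
    """Return the (individual, genotype, SNP) file names for a dataset prefix.

    `prefix` is the dataset name without any suffix; `fmt` is one of the
    format names above (case-insensitive).
    """
    ind_suffix, geno_suffix, snp_suffix = FORMAT_SUFFIXES[fmt.upper()]
    return (prefix + ind_suffix, prefix + geno_suffix, prefix + snp_suffix)


# Hypothetical dataset prefix, for illustration only.
print(dataset_files("example", "PACKEDPED"))
```

Note that EIGENSTRAT and PACKEDANCESTRYMAP share the same suffixes, so the file names alone do not tell you which of the two formats a dataset is in.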

When to Make a Custom Dataset

There are various published datasets available for use in archaeogenetics and population genetics. Most published datasets can be found here. However, different subjects of genetic analysis require different samples. If an analysis using the programs mentioned in the aforementioned article requires samples that are not all contained in a single dataset, then a custom dataset needs to be made. This can be done by merging files using PLINK and EIGENSOFT.

Many published datasets, such as the AADR (which serves as a compendium of multiple datasets), are in PACKEDANCESTRYMAP format. In order for these datasets to be merged with datasets of another format, or to be used by programs such as ADMIXTURE and LINADMIX, they must first be converted to an acceptable format (PACKEDPED for merging, ADMIXTURE, and LINADMIX).

Making a Custom Dataset Using EIGENSOFT and PLINK

As stated earlier, PLINK is a program that is used to manipulate and control the quality of datasets. It can be downloaded here. Merging datasets with PLINK typically gives the best results, although the process can be somewhat tedious.

Requirements:

1. Download PLINK and EIGENSOFT (Linux, Windows/Mac). Note that use of the Linux build of both programs is strongly preferred. If your computer runs on an OS that is not Linux-based, using a virtual machine is recommended, which would also entail placing the associated files within the virtual machine.

2. Download all of the datasets that will be merged, and the AADR dataset.

3. Have a spreadsheet tool that you can access, such as Excel, LibreOffice Calc, or Google Sheets.

4. Have a text editor that you can access, such as Notepad or Kate.

This step requires EIGENSOFT. The EIGENSOFT program used for this step is called "convertf".

a. The first substep is to place all of your dataset files in the folder containing the EIGENSOFT programs. On Windows, this is the "Win64" subfolder of the Eigensoft_Master folder. On Linux, the EIGENSOFT program folder is called "src". Note that the Linux build of EIGENSOFT can handle significantly larger files than the Win64 build. If your Win64 build of EIGENSOFT cannot handle a file, set up a Linux VM and do everything there.

b. The second substep is to set up a .txt parameter file for convertf and put it in the program folder as well. The file can be named anything. Note that convertf converts only one dataset per run, so each parameter file pertains to a single dataset.

Convertf parameter files are .txt files that should contain the following fields:

genotypename:

snpname:

indivname:

outputformat:

genotypeoutname:

snpoutname:

indivoutname:

familynames:

outputgroup:

Then, fill out the empty fields with respect to the file that is being converted.

The "genotypename" field should contain the name of the input genotype file (including the file extension, e.g. dataset.geno)

The "snpname" field should contain the name of the input SNP file (including the file extension, e.g. dataset.snp)

The "indivname" field should contain the name of the input individual file (including the file extension, e.g. dataset.ind)

Since the given dataset is being converted to PACKEDPED for use in PLINK, the "outputformat" field should contain "PACKEDPED".

The "genotypeoutname" field should contain the name of the output genotype file. It MUST end in ".bed", since the output format is PACKEDPED (e.g. dataset.bed)

The "snpoutname" field should contain the name of the output SNP file. It MUST end in ".bim", since the output format is PACKEDPED (e.g. dataset.bim)

The "indivoutname" field should contain the name of the output individual file. It MUST end in ".fam", since the output format is PACKEDPED (e.g. dataset.fam)

Fields only applicable when converting to PACKEDPED:

The "familynames" field affects the name of the sample ID if the output format is PACKEDPED. Set this to "NO" so sample IDs are not changed.

Like "familynames", the "outputgroup" field is only relevant when converting to PACKEDPED. When set to YES, it preserves the population labels from the PACKEDANCESTRYMAP .ind file in the last column of the .fam file.

What your parameter file should look like:

genotypename: example.geno

snpname: example.snp

indivname: example.ind

outputformat: PACKEDPED

genotypeoutname: example.bed

snpoutname: example.bim

indivoutname: example.fam

familynames: NO

outputgroup: YES

c. Once the parameter file is set up, open your terminal. Set the directory to the folder containing the EIGENSOFT programs, datasets, and parameter file. Then, run the command: convertf -p "name of parameter file"

For Windows, the parameter file should be specified as "name of parameter file".txt

You will know if your files have successfully converted if the terminal displays the message "PACKEDPED output" and the output files are in the folder. To convert another dataset, just alter the parameter file to be specific to the next dataset (by changing the file names specified), and run the convertf -p "name of parameter file" command again.

d. Open the .fam file in the spreadsheet editor of your choice (make sure that the columns are split on spaces or tabs, depending on how your .fam file is delimited). The first column contains the family names, which at this point should be numbers. The last column should contain the preserved population labels. Copy the last column and paste it over the first column so that the family names match the population labels, but do not cut or remove the last column, or the .fam file will no longer be in a valid format.
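If you would rather not do the copy and paste by hand, the same edit from substep d can be sketched in a few lines of Python. This is an assumption-laden sketch rather than a required step: the function names and the sample rows are made up for illustration, and it assumes that every row of the .fam file is whitespace-delimited and that no population label contains a space.

```python
def use_population_as_family(fam_lines):
    """Copy the last column of each .fam row into the first column.

    The last column is left in place, so the file keeps all of its columns.
    """
    fixed = []
    for line in fam_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        fields[0] = fields[-1]  # family name becomes the population label
        fixed.append(" ".join(fields))
    return fixed


def rewrite_fam(in_path, out_path):
    """Read a .fam file, apply the column copy, and write the result."""
    with open(in_path) as handle:
        fixed = use_population_as_family(handle)
    with open(out_path, "w") as handle:
        handle.write("\n".join(fixed) + "\n")


# Hypothetical rows, shaped like a .fam file written with outputgroup: YES.
rows = ["1 I0001 0 0 1 Pop_A", "2 I0002 0 0 2 Pop_B"]
for row in use_population_as_family(rows):
    print(row)
```

Writing to a new file name with `rewrite_fam` and checking the result before replacing the original .fam file is the safer way to use this.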

The PLINK program is located in a folder that should be titled "plink". Move all of the PACKEDPED files that you plan on merging into this folder.

The next step is to select one of the datasets that is being merged and generate a SNP list. This list will be used to ensure that all datasets contain the same SNPs. The file that should be used to make a SNP list is the AADR 1240K dataset. It is HIGHLY recommended that this dataset be used to make a SNP list, even if you do not plan on merging this dataset to make a custom dataset. As a result, this dataset should be downloaded regardless.

To make a SNP list, open your terminal. Set the directory to the PLINK program folder.

Linux: ./plink --bfile "name of 1240K dataset"(WITHOUT suffix) --write-snplist

Windows: plink --bfile "name of 1240K dataset"(WITHOUT suffix) --write-snplist

Then, if the run is successful, there should be a SNP list in your PLINK folder (named plink.snplist by default). Hold onto this file; it will be needed in a few steps.
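Before pruning anything, it can be useful to see how many SNPs of a given dataset actually appear in the SNP list. The sketch below is one way to check that, assuming the SNP list has one SNP ID per line and that the SNP ID is the second column of the .bim file. The function names and the example IDs are hypothetical.

```python
def bim_snp_ids(bim_lines):
    """Return the SNP IDs (second column) of a .bim file, in file order."""
    return [line.split()[1] for line in bim_lines if line.strip()]


def snplist_ids(snplist_lines):
    """Return the set of SNP IDs in a SNP list (one ID per line)."""
    return {line.strip() for line in snplist_lines if line.strip()}


def count_shared(bim_lines, snplist_lines):
    """Return (SNPs of the dataset found in the list, total SNPs in the dataset)."""
    wanted = snplist_ids(snplist_lines)
    ids = bim_snp_ids(bim_lines)
    shared = sum(1 for snp in ids if snp in wanted)
    return shared, len(ids)


# Made-up .bim rows and SNP list entries, for illustration only.
bim = ["1 rs1 0 100 A G", "1 rs2 0 200 C T", "2 rs3 0 300 G A"]
snplist = ["rs1", "rs3", "rs9"]
shared, total = count_shared(bim, snplist)
print(shared, "of", total, "SNPs are in the SNP list")
```

With real files, the lines could be read with `open("dataset.bim")` and `open("plink.snplist")`, where "dataset.bim" stands in for your own file name.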

4. Select samples that you want to use from each dataset (this step is optional)

If only specific samples are to be merged from a dataset and the entire dataset is not necessary, a smaller dataset containing only those samples can be made from the larger dataset using PLINK.

a. To make a smaller dataset, have the PACKEDPED dataset that you plan on selecting samples from in your PLINK folder.

b. Then, in the PLINK folder, make a new .txt file. This .txt file will contain a list of the samples that are to comprise the smaller dataset. This .txt file can be named anything.

family name of sample 1 (first column of the .fam file)[space/tab]individual ID of sample 1 (second column of the .fam file)

family name of sample 2 (first column of the .fam file)[space/tab]individual ID of sample 2 (second column of the .fam file)

family name of sample 3 (first column of the .fam file)[space/tab]individual ID of sample 3 (second column of the .fam file)

This list can be made by copying the rows of the .fam file that contain the desired samples and removing the last four columns from each row.

c. Once the sample list is made, open your terminal and set the directory to the PLINK folder.

Linux: ./plink --bfile "name of dataset file that select samples are from"(WITHOUT suffix) --keep "name of .txt file with list of select samples".txt --make-bed --out "whatever you want to name the new, smaller dataset"(WITHOUT suffix)

Windows: plink --bfile "name of dataset file that select samples are from"(WITHOUT suffix) --keep "name of .txt file with list of select samples".txt --make-bed --out "whatever you want to name the new, smaller dataset"(WITHOUT suffix)

Then, if the run is successful, in your PLINK folder there should be a new dataset with the designated name of the new, smaller dataset (there should be a bed, bim, and fam file). This is your new, smaller dataset! You can use this dataset to merge your select samples into another dataset.
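The sample list from substep b can also be built from the .fam file with a short script instead of by hand. This is a minimal sketch, assuming you already know the individual IDs (second column of the .fam file) of the samples you want. The function name and the IDs shown are placeholders.

```python
def keep_lines(fam_lines, wanted_ids):
    """Return 'family name<TAB>individual ID' rows for the wanted samples.

    Only the first two columns of each matching .fam row are kept, which is
    the two-column layout used for the sample list above.
    """
    wanted = set(wanted_ids)
    rows = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1] in wanted:
            rows.append(fields[0] + "\t" + fields[1])
    return rows


# Hypothetical .fam rows and sample IDs, for illustration only.
fam = [
    "Pop_A I0001 0 0 1 Pop_A",
    "Pop_A I0002 0 0 2 Pop_A",
    "Pop_B I0003 0 0 1 Pop_B",
]
for row in keep_lines(fam, ["I0001", "I0003"]):
    print(row)
```

The returned rows can then be written, one per line, to the .txt file that is passed to --keep.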

All of the datasets that are to be merged together should be located in the PLINK folder. There are two methods to merge files on PLINK. One entails merging two files at a time, and the other entails merging multiple files at a time.

Merging two files at a time:

Open your terminal and set the directory to the PLINK folder and run the following command:

Linux: ./plink --bfile "name of first dataset"(WITHOUT file suffix) --bmerge "name of second dataset".bed "name of second dataset".bim "name of second dataset".fam --make-bed --out "merged file name"(WITHOUT suffix)

Windows: plink --bfile "name of first dataset"(WITHOUT file suffix) --bmerge "name of second dataset".bed "name of second dataset".bim "name of second dataset".fam --make-bed --out "merged file name"(WITHOUT suffix)

If the run is successful, the new merged dataset (.bed, .bim, and .fam files) should appear in your PLINK folder.

Merging more than two files at a time:

1. Make a new .txt file in the PLINK folder. Ensure that this .txt file has a different name than the select samples list. This file will function as a merge parameter file.

2. In the .txt file, make a list of files in the following form. Do NOT include the first dataset in this list, since it is already passed to PLINK through --bfile in the command below.

"nameofseconddataset".bed "nameofseconddataset".bim "nameofseconddataset".fam

"nameofthirddataset".bed "nameofthirddataset".bim "nameofthirddataset".fam

"nameoffourthdataset".bed "nameoffourthdataset".bim "nameoffourthdataset".fam

...etc

Rows can be added or removed depending on the number of datasets involved.

3. Once the merge parameter .txt file is complete, open your terminal and set the directory to the PLINK folder and run the following command:

Linux: ./plink --bfile "name of first dataset"(WITHOUT file suffix) --merge-list "name of merge parameter file".txt --make-bed --out "merged file name"(WITHOUT suffix)

Windows: plink --bfile "name of first dataset"(WITHOUT file suffix) --merge-list "name of merge parameter file".txt --make-bed --out "merged file name"(WITHOUT suffix)

If the run is successful, the new merged dataset (.bed, .bim, and .fam files) should appear in your PLINK folder.
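When many datasets are involved, typing the merge parameter file by hand invites typos in the file names. The sketch below generates the rows from a list of dataset names. It is only an illustration: the function name and the dataset names are placeholders, and the list passed in should contain every dataset except the first one, which goes to --bfile.

```python
def merge_list_lines(other_prefixes):
    """Return one 'name.bed name.bim name.fam' row per dataset name.

    `other_prefixes` holds the dataset names WITHOUT suffix, excluding the
    first dataset that is passed to PLINK through --bfile.
    """
    return [
        prefix + ".bed " + prefix + ".bim " + prefix + ".fam"
        for prefix in other_prefixes
    ]


def write_merge_list(other_prefixes, out_path):
    """Write the merge parameter file, one dataset per line."""
    with open(out_path, "w") as handle:
        handle.write("\n".join(merge_list_lines(other_prefixes)) + "\n")


# Hypothetical dataset names, for illustration only.
for row in merge_list_lines(["second", "third", "fourth"]):
    print(row)
```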

Now that the merged dataset has been created, the SNPs included in the dataset must be pruned. This step requires the SNP list generated in an earlier step from the AADR 1240K dataset.

Have your terminal open and the directory set to the PLINK folder. Then, run the following command:

Linux: ./plink --bfile "name of merged dataset" (WITHOUT suffix) --extract plink.snplist --make-bed --out "whatever you want to name your pruned custom dataset"(WITHOUT suffix)

Windows: plink --bfile "name of merged dataset" (WITHOUT suffix) --extract plink.snplist --make-bed --out "whatever you want to name your pruned custom dataset"(WITHOUT suffix)

The 1240K dataset does not have to be used to generate the SNP list, though. I recommend generating a SNP list for the base dataset that you are merging, then pruning your other datasets to that SNP list. Then, if all datasets still don't have the same number of SNPs, generate a SNP list for the second dataset and prune all of the other datasets to that one.
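The goal of that back-and-forth pruning is a set of SNPs that every dataset shares. As a rough sketch, the same set can be computed directly by intersecting the SNP IDs of the .bim files and then handing the result to --extract. Note that this swaps the repeated generate-and-prune cycle for a single set intersection, which is my own shortcut and not a PLINK feature. It also matches on SNP ID only, so it does not check that positions or alleles agree between datasets. The function name and example rows are hypothetical.

```python
def shared_snp_ids(bim_files):
    """Return the SNP IDs present in every .bim file, in the first file's order.

    `bim_files` is a list in which each item is the list of lines of one
    .bim file. The SNP ID is taken from the second column of each line.
    """
    if not bim_files:
        return []
    id_sets = [
        {line.split()[1] for line in lines if line.strip()}
        for lines in bim_files
    ]
    common = set.intersection(*id_sets)
    first_order = [line.split()[1] for line in bim_files[0] if line.strip()]
    return [snp for snp in first_order if snp in common]


# Made-up .bim contents for three datasets, for illustration only.
first = ["1 rs1 0 100 A G", "1 rs2 0 200 C T", "2 rs3 0 300 G A"]
second = ["1 rs2 0 200 C T", "2 rs3 0 300 G A", "2 rs4 0 400 T C"]
third = ["1 rs1 0 100 A G", "2 rs3 0 300 G A", "1 rs2 0 200 C T"]
print(shared_snp_ids([first, second, third]))
```

Writing the returned IDs to a text file, one per line, gives a SNP list in the same layout as the one PLINK generates.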

Depending on the program being used, or your personal preferences, the new merged PACKEDPED dataset can be converted back to another format (EIGENSTRAT for ADMIXTOOLS and SmartPCA, PACKEDANCESTRYMAP for ADMIXTOOLS and SmartPCA).

a. To do so, move the merged PACKEDPED dataset back to the EIGENSOFT program folder.

b. Then, open the convertf parameter .txt file that you made earlier. Edit the "genotypename", "snpname", and "indivname" fields to correspond to the name of the merged PACKEDPED dataset.

c. Edit the "genotypeoutname", "snpoutname", and "indivoutname" fields to correspond to what you want to name the converted dataset.

d. Edit the "outputformat" field to say either PACKEDANCESTRYMAP or EIGENSTRAT depending on the format that you want to convert to.

genotypename: "nameofmergedfile".bed

snpname: "nameofmergedfile".bim

indivname: "nameofmergedfile".fam

outputformat: EIGENSTRAT/PACKEDANCESTRYMAP (pick one!)

genotypeoutname: example.geno

snpoutname: example.snp

indivoutname: example.ind

e. Once the parameter file is complete, open your terminal and set the directory to the EIGENSOFT program folder. Then, run the same command as before: convertf -p "name of parameter file"
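Since both conversions in this tutorial use the same parameter file layout, the file could also be generated instead of edited by hand each time. The following is a minimal sketch under a few assumptions: the function name and the dataset names "merged" and "example" are placeholders, and the "familynames" and "outputgroup" lines are only written when the output format is PACKEDPED, mirroring the example parameter file earlier in this tutorial.

```python
# Suffixes per format, in the order (genotype, SNP, individual).
SUFFIXES = {
    "PACKEDPED": (".bed", ".bim", ".fam"),
    "EIGENSTRAT": (".geno", ".snp", ".ind"),
    "PACKEDANCESTRYMAP": (".geno", ".snp", ".ind"),
}


def convertf_parameters(in_prefix, in_format, out_prefix, out_format):
    """Return the text of a convertf parameter file.

    The prefixes are dataset names WITHOUT suffix; the suffixes are filled
    in from the input and output formats.
    """
    in_geno, in_snp, in_ind = (in_prefix + s for s in SUFFIXES[in_format])
    out_geno, out_snp, out_ind = (out_prefix + s for s in SUFFIXES[out_format])
    lines = [
        "genotypename: " + in_geno,
        "snpname: " + in_snp,
        "indivname: " + in_ind,
        "outputformat: " + out_format,
        "genotypeoutname: " + out_geno,
        "snpoutname: " + out_snp,
        "indivoutname: " + out_ind,
    ]
    if out_format == "PACKEDPED":
        # Same settings as the example parameter file earlier in the tutorial.
        lines += ["familynames: NO", "outputgroup: YES"]
    return "\n".join(lines) + "\n"


# Hypothetical names: convert a merged PACKEDPED dataset back to EIGENSTRAT.
print(convertf_parameters("merged", "PACKEDPED", "example", "EIGENSTRAT"))
```

The returned text can be saved to a .txt file and passed to convertf with -p as described above. When converting between EIGENSTRAT and PACKEDANCESTRYMAP, use a different output name than the input name, since both formats share the same suffixes.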

Making RAW files using PLINK

PLINK's RAW files are a required input for LINADMIX, a linear regression-based tool to model admixture that I've mentioned in this article. To generate a RAW file for LINADMIX, open your terminal, set the directory to the PLINK folder, and run the following command. Note that LINADMIX can only be used on a Linux OS, so I am only including the Linux command. Only add --allow-no-sex if your dataset has samples of an unidentified sex.

Linux: ./plink --bfile "name of dataset that you want to convert"(WITHOUT suffix) --recode A --allow-no-sex --out "whatever you want to name your RAW file"(WITHOUT suffix)
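To check what ended up in the RAW file, it can be read back with a few lines of Python. This sketch assumes the usual layout of a --recode A file: a header row, six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE), then one column per SNP holding 0, 1, 2, or NA for a missing genotype. The function name and the example rows are made up for illustration.

```python
def parse_raw(raw_lines):
    """Return (snp_columns, genotypes) from the lines of a .raw file.

    `snp_columns` is the list of SNP column names from the header.
    `genotypes` maps (FID, IID) to a list of allele counts, with None
    standing in for a missing genotype (NA).
    """
    rows = [line.split() for line in raw_lines if line.strip()]
    if not rows:
        return [], {}
    header, body = rows[0], rows[1:]
    snp_columns = header[6:]  # everything after the six sample columns
    genotypes = {}
    for fields in body:
        counts = [None if value == "NA" else int(value) for value in fields[6:]]
        genotypes[(fields[0], fields[1])] = counts
    return snp_columns, genotypes


# Made-up .raw content, for illustration only.
raw = [
    "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_T",
    "Pop_A I0001 0 0 1 -9 0 2",
    "Pop_A I0002 0 0 2 -9 NA 1",
]
columns, calls = parse_raw(raw)
print(columns)
print(calls[("Pop_A", "I0002")])
```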

EIGENSOFT (Experimental for Windows and Mac, contains SmartPCA) NOTE: In my personal experience, this version is far less powerful than the original Linux build and cannot handle larger (multiple GB) files. As a result, I recommend that Windows and Mac users set up a Linux virtual machine if they want to utilize the full potential of EIGENSOFT.