Plink IBS MDS Tutorial

Getting started with PLink IBS

IBS-MDS Process Overview

![PLink IBS-MDS Activity Diagram](http://yuml.me/diagram/plain;dir:LR/activity/%28start%29->[Dataset %28text%29]->[Dataset' %28binary%29]->[Pairwise IBS Metrics], [Pairwise IBS Metrics]->[IBS Clustering Analysis],[Dataset' %28binary%29]->[IBS Clustering Analysis]->%28end%29, [Pairwise IBS Metrics]->[MDS Analysis],[Dataset' %28binary%29]->[MDS Analysis]->[MDS Plot %28Graph%29]->%28end%29)

The input and output file formats used by the above process are described in the wiki page on PLink file formats.

Datasets

The Example dataset can be downloaded here (15 MB). It contains 90 individuals and 230,000 variants.
The HapMap (phase II) dataset can be downloaded here (120 MB). It contains 270 individuals and 4M variants.
The HapMap (phase III) dataset can be downloaded here: MAP file (12 MB), PED file (850 MB). It contains 1184 individuals and 1.4M variants.

Start the tutorial with the Example dataset: all operations should complete within 1-2 minutes on a modern laptop.

Then go through the tutorial again with the HapMap dataset to see how the run time scales with dataset size. To reuse the same commands with both datasets, symlink the HapMap files to the file names used for the first dataset (in a separate folder, so that the two datasets do not overwrite each other), as follows:

ln -s hapmap_r23a.bed wgas2.bed
ln -s hapmap_r23a.bim wgas2.bim
ln -s hapmap_r23a.fam wgas2.fam
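
The three symlinks above can also be created in one go from a separate working folder. This is only a minimal sketch: the function name `link_fileset` is illustrative and not part of PLink. It resolves the source folder to an absolute path so that the links still work when they are created in a different folder.

```shell
# Hypothetical helper: symlink a binary fileset (.bed/.bim/.fam) under a new prefix.
# Usage: link_fileset <source-prefix> <target-prefix>
link_fileset() {
    # Resolve the source folder to an absolute path so the links do not dangle.
    src_dir=$(cd "$(dirname "$1")" && pwd) || return 1
    src_base=$(basename "$1")
    for ext in bed bim fam; do
        # Stop early if one of the three files is missing.
        [ -f "$src_dir/$src_base.$ext" ] || { echo "missing: $1.$ext" >&2; return 1; }
        ln -s "$src_dir/$src_base.$ext" "$2.$ext"
    done
}
```

For example, `mkdir hapmap && link_fileset hapmap_r23a hapmap/wgas2` would create the three `wgas2` links inside a new `hapmap` folder (the folder name is an assumption).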

IBS Tutorial

Generic PLink options

--file: indicates the prefix of the input files (.ped, .map).
--bfile: same as --file, but tells plink to use binary input files instead (.bed, .bim, .fam).
--out: indicates the prefix of the output files (output format depends on the command being used)

Note: You don't need to specify the extensions for any of these files.

Creating the binary dataset (`--make-bed`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed
Purpose: Make binary versions of the PED/MAP files
Input: wgas1.ped, wgas1.map
Output: wgas2.bed, wgas2.bim, wgas2.fam
Duration: 22s

(Durations were measured with the UNIX time command.)

Note: Skip this step when using the second dataset (HapMap phase II), which is already in binary format (.bed, .bim, .fam).

plink --file wgas1  --make-bed  --out wgas2
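
A quick sanity check after this step is to compare the line counts of the .fam and .bim files (one line per individual and one line per variant, respectively) with the dataset description, and to check the size of the .bed file. The sketch below assumes the default variant-major .bed layout (3 header bytes, then 2 bits per genotype, padded to whole bytes for each variant). The helper names are illustrative and not part of PLink.

```shell
# Expected .bed size in bytes: 3 header bytes + ceil(N / 4) bytes per variant.
# Assumes the default variant-major layout (2 bits per genotype).
expected_bed_size() {
    n=$1   # number of individuals
    m=$2   # number of variants
    echo $(( 3 + ( (n + 3) / 4 ) * m ))
}

# Hypothetical helper: summarize a binary fileset given its prefix.
check_fileset() {
    n=$(( $(wc -l < "$1.fam") ))       # one line per individual
    m=$(( $(wc -l < "$1.bim") ))       # one line per variant
    actual=$(( $(wc -c < "$1.bed") ))  # size of the genotype file in bytes
    echo "individuals=$n variants=$m bed_bytes=$actual expected=$(expected_bed_size "$n" "$m")"
}
```

Running `check_fileset wgas2` should then report 90 individuals for the Example dataset, with `bed_bytes` equal to `expected` if the layout assumption holds.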

Calculating the pairwise IBS metrics (`--genome`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster
Purpose: Pre-computes the most expensive step (pairwise IBS metrics) to allow running the cluster analysis multiple times (with different constraints).
Input: wgas2.bed, wgas2.bim, wgas2.fam
Output: ibs1.genome
Duration: 25s (example), 1h05m46s (hapmap_r23a)

plink --bfile wgas2  --genome  --out ibs1
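
One way to see why this step dominates the run time is that the number of pairs grows quadratically with the number of individuals. The sketch below computes the pair counts for the three datasets listed earlier. The "pairs times variants" work estimate is a rough assumption made for illustration, not a figure taken from the PLink documentation.

```shell
# Number of unordered pairs among n individuals: n * (n - 1) / 2.
pair_count() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# Rough work estimate (pairs x variants); an illustrative assumption only.
ibs_work() {
    echo $(( $(pair_count "$1") * $2 ))
}

pair_count 90     # Example dataset:   4005 pairs
pair_count 270    # HapMap phase II:   36315 pairs
pair_count 1184   # HapMap phase III:  700336 pairs
```

Under that assumption, and using the approximate variant counts given above (230,000 and 4M), the HapMap phase II run involves roughly 158 times more work than the Example dataset, which is in line with the measured 25s versus 1h05m46s.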

Performing an IBS clustering analysis (`--cluster`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster
Purpose: Performs a cluster analysis using the pre-computed pairwise IBS metrics.
Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
Output: ibs2.cluster[0-3], ibs2.hh
Duration: 3s (example), 2m12s (hapmap_r23a)

plink --bfile wgas2  --read-genome ibs1.genome  --out ibs2 \
      --cluster

Performing an IBS clustering analysis with two constraints (`--cluster --ppc --cc`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options
Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
Output: ibs3.cluster[0-3], ibs3.hh
Duration: 3s (example), 2m09s (hapmap_r23a)

plink --bfile wgas2  --read-genome ibs1.genome --out ibs3 \
      --cluster --ppc 0.01 --cc
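
To get a feel for how restrictive the `--ppc` constraint is, the PPC column of the pre-computed .genome file can be inspected directly. This is a minimal sketch, assuming the .genome file has a header row that includes a column named PPC; `count_low_ppc` is a hypothetical helper, not a PLink command.

```shell
# Hypothetical helper: count pairs in a .genome file whose PPC is below a threshold.
# The PPC column is located by its header name rather than by a fixed position.
# Usage: count_low_ppc <file.genome> <threshold>
count_low_ppc() {
    awk -v thr="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "PPC") col = i; next }
        col && ($col + 0) < (thr + 0) { low++ }
        END { printf "%d of %d pairs have PPC < %s\n", low, NR - 1, thr }
    ' "$1"
}
```

For example, `count_low_ppc ibs1.genome 0.01` reports how many pairs fall below the threshold used in the command above.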

Generating an IBS similarity matrix (`--cluster --matrix`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#matrix
Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
Output: ibs4.cluster[0-3], ibs4.hh, ibs4.mibs
Duration: 3s (example), 2m09s (hapmap_r23a)

plink --bfile wgas2  --read-genome ibs1.genome  --out ibs4 \
      --cluster  --matrix

Generating an IBS distance matrix (`--cluster --distance-matrix`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#matrix
Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
Output: ibs5.cluster[0-3], ibs5.hh, ibs5.mdist
Duration: 3s (example), 2m12s (hapmap_r23a)

plink --bfile wgas2  --read-genome ibs1.genome --out ibs5 \
      --cluster  --distance-matrix
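
The two matrix outputs can be cross-checked against each other. The sketch below assumes that both files are plain whitespace-delimited N x N matrices without a header row, and that each distance is 1 minus the corresponding IBS similarity; both helper names are illustrative.

```shell
# Print "ROWSxCOLS" for a whitespace-delimited matrix; fail if rows differ in length.
matrix_shape() {
    awk '
        NR == 1 { cols = NF }
        NF != cols { bad = 1 }
        END { if (bad) exit 1; printf "%dx%d\n", NR, cols }
    ' "$1"
}

# Largest absolute deviation of (similarity + distance) from 1, cell by cell.
# Usage: max_deviation <similarity-matrix> <distance-matrix>
max_deviation() {
    paste "$1" "$2" | awk '
        {
            n = NF / 2   # left half: similarity row, right half: distance row
            for (i = 1; i <= n; i++) {
                d = $i + $(i + n) - 1
                if (d < 0) d = -d
                if (d > max) max = d
            }
        }
        END { printf "%.6f\n", max }
    '
}
```

With the Example dataset, `matrix_shape ibs4.mibs` would be expected to print `90x90`, and `max_deviation ibs4.mibs ibs5.mdist` should be close to zero if the assumption about the two files holds.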

Performing an MDS analysis (`--mds-plot`)

Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#mds
Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
Output: mds1.cluster[0-3], mds1.hh, mds1.mds
Duration: 3s (example), 2m09s (hapmap_r23a)

plink --bfile wgas2  --read-genome ibs1.genome  --out mds1 \
      --cluster  --mds-plot 4
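
Before plotting, the .mds file can be summarized from the command line. This is a minimal sketch, assuming the file has a header row with the columns SOL (cluster solution), C1 and C2 (the same column names used by the R code in the next section); `summarize_mds` is a hypothetical helper.

```shell
# Hypothetical helper: for each cluster (SOL), print the number of individuals
# and the mean of the first two MDS components (C1, C2).
# Columns are located by header name rather than by fixed position.
summarize_mds() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) idx[$i] = i; next }
        {
            s = $(idx["SOL"])
            n[s]++
            c1[s] += $(idx["C1"])
            c2[s] += $(idx["C2"])
        }
        END {
            for (s in n)
                printf "%s %d %.4f %.4f\n", s, n[s], c1[s] / n[s], c2[s] / n[s]
        }
    ' "$1" | sort -n
}
```

Calling `summarize_mds mds1.mds` prints one line per cluster: the cluster label, its size, and the mean C1 and C2 values.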

Visualizing the MDS Plot (using `R software`)

Purpose: Generate a plot of the first 2 MDS components, with individuals coloured according to their cluster assignment based on the SNP data. In the resulting plot, the Chinese cluster appears on the left and the Japanese cluster on the right.
Input: mds1.mds
Output: mds1-strat.png
Duration: 1s

Copy and paste this code into an R shell:

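# Write the plot to a PNG file in the current directory.
png("mds1-strat.png")
# mds1.mds has a header row; SOL is the cluster assignment, C1 and C2 are MDS components.
p <- read.table("mds1.mds", header = TRUE)
# Plot the first two components; SOL + 1 is used as the colour index.
plot(p$C1, p$C2, pch = 20, cex = 2, col = p$SOL + 1)
dev.off()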

Duration with Plink2

It turns out that Plink2 (still in beta) is much faster: what took about 1 hour with the HapMap dataset now takes under 10 seconds using 8 threads. It may be worthwhile to look into Plink2's source code instead.
