
Plink IBS MDS Tutorial

David Lauzon edited this page Apr 27, 2015 · 1 revision

Getting started with PLink IBS

IBS-MDS Process Overview

![PLink IBS-MDS Activity Diagram](http://yuml.me/diagram/plain;dir:LR/activity/%28start%29->[Dataset %28text%29]->[Dataset' %28binary%29]->[Pairwise IBS Metrics], [Pairwise IBS Metrics]->[IBS Clustering Analysis],[Dataset' %28binary%29]->[IBS Clustering Analysis]->%28end%29, [Pairwise IBS Metrics]->[MDS Analysis],[Dataset' %28binary%29]->[MDS Analysis]->[MDS Plot %28Graph%29]->%28end%29)

The input and output file formats used by the above process are described in the wiki page on PLink file formats.

Datasets

  • The Example dataset can be downloaded here (15 MB). Contains 90 individuals and 230,000 variants.
  • The HapMap (phase II) dataset can be downloaded here (120 MB). Contains 270 individuals and 4M variants.
  • The HapMap (phase III) dataset can be downloaded here: MAP file (12 MB), PED file (850 MB). Contains 1184 individuals and 1.4M variants.

Start the tutorial with the Example dataset: all operations should complete within one to two minutes on a modern laptop.

Then, you can go through the tutorial again using the HapMap dataset to see how the running time scales. To use the same commands with both datasets, you can symlink the HapMap files to the file names of the first dataset (be sure to work in a separate folder so that nothing gets overwritten), as follows:

ln -s hapmap_r23a.bed wgas2.bed
ln -s hapmap_r23a.bim wgas2.bim
ln -s hapmap_r23a.fam wgas2.fam
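
The same three links can also be created in a loop, directly inside a dedicated folder. This is a minimal sketch: the folder name hapmap_run is an assumption made for illustration, and the hapmap_r23a.* files are assumed to sit in the current directory.

```shell
# Create a separate working folder (hypothetical name) for the HapMap run.
mkdir -p hapmap_run

# Link each binary file under the wgas2 prefix used throughout the tutorial.
# Absolute targets are used because a relative symlink target is resolved
# relative to the folder that contains the link, not the current directory.
for ext in bed bim fam; do
    ln -sf "$PWD/hapmap_r23a.$ext" "hapmap_run/wgas2.$ext"
done
```

Running the later commands from inside hapmap_run then picks up the HapMap files whenever wgas2 is given as the prefix.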

IBS Tutorial

Generic PLink options

  • --file: indicates the prefix of the input files (.ped, .map).
  • --bfile: same as --file, but tells plink to read binary input files instead (.bed, .bim, .fam).
  • --out: indicates the prefix of the output files (output format depends on the command being used)

Note: You don't need to specify the extensions for any of these files.

Creating the binary dataset (--make-bed)

(Time is benchmarked using the time UNIX command)

Note: Skip this step when using the second (HapMap) dataset, whose .bed, .bim and .fam files are already linked as wgas2.

plink --file wgas1  --make-bed  --out wgas2
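
Every later command reads the binary fileset through --bfile wgas2, so it can be worth confirming that all three files exist before starting a long run. The helper below is a sketch; check_bfile is a hypothetical name and is not part of plink.

```shell
# Hypothetical helper: succeed only if PREFIX.bed, PREFIX.bim and PREFIX.fam all exist.
check_bfile() {
    prefix=$1
    for ext in bed bim fam; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing $prefix.$ext" >&2
            return 1
        fi
    done
}

# Usage: report whether the binary fileset is ready.
if check_bfile wgas2; then
    echo "wgas2 binary fileset is complete"
else
    echo "run the --make-bed step (or create the symlinks) first"
fi
```

Because the -f test follows symlinks, the check also fails for a link whose target file is missing.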

Calculating the pairwise IBS metrics (--genome)

  • Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster
  • Purpose: Pre-computes the most expensive step (pairwise IBS metrics) to allow running the cluster analysis multiple times (with different constraints).
  • Input: wgas2.bed, wgas2.bim, wgas2.fam
  • Output: ibs1.genome
  • Duration: 25s (example), 1h05m46s (hapmap_r23a)

plink --bfile wgas2  --genome  --out ibs1
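
Because the metrics are pairwise, the number of comparisons grows quadratically with the number of individuals: n individuals give n(n-1)/2 pairs. The sketch below counts the pairs for the three datasets listed above; the function name pairs is only illustrative.

```shell
# Number of distinct pairs among n individuals: n * (n - 1) / 2.
pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

pairs 90      # Example dataset          -> 4005
pairs 270     # HapMap phase II dataset  -> 36315
pairs 1184    # HapMap phase III dataset -> 700336
```

The HapMap phase II dataset therefore has roughly nine times as many pairs as the Example dataset, and it also contains far more variants per pair, which together account for much of the gap between the two durations.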

Performing an IBS clustering analysis (--cluster)

plink --bfile wgas2  --read-genome ibs1.genome  --out ibs2 \
      --cluster

Performing an IBS clustering analysis with two constraints (--cluster --ppc --cc)

plink --bfile wgas2  --read-genome ibs1.genome --out ibs3 \
      --cluster --ppc 0.01 --cc

Computing an IBS similarity matrix (--cluster --matrix)

plink --bfile wgas2  --read-genome ibs1.genome  --out ibs4 \
      --cluster  --matrix

Computing an IBS distance matrix (--cluster --distance-matrix)

plink --bfile wgas2  --read-genome ibs1.genome --out ibs5 \
      --cluster  --distance-matrix

Performing an MDS analysis (--mds-plot)

plink --bfile wgas2  --read-genome ibs1.genome  --out mds1 \
      --cluster  --mds-plot 4

Visualizing the MDS Plot (using R software)

  • Purpose: Generate a plot of the first 2 MDS components, with individuals coloured according to the cluster assignment based on the SNP data. The Chinese cluster appears on the left and the Japanese cluster on the right.
  • Input: mds1.mds
  • Output: mds1-strat.png
  • Duration: 1s

Copy and paste this code into an R shell:

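# Write the plot to a PNG file.
png("mds1-strat.png")
# The columns used below are SOL (cluster assignment) and C1, C2 (MDS components).
p <- read.table("mds1.mds", header=TRUE)
# Colour each individual by cluster; +1 so that cluster 0 gets a visible colour.
plot(p$C1, p$C2, pch=20, cex=2, col=p$SOL+1)
dev.off()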
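
Before plotting, it can help to check how many individuals fall into each cluster, since that determines how many colours the plot will show. The sketch below assumes only what the R code relies on, namely that mds1.mds starts with a header row containing a SOL column; count_by_sol is a hypothetical helper name.

```shell
# Hypothetical helper: count the rows of an .mds file per value of the SOL column.
count_by_sol() {
    awk '
        # Header row: find which column is named SOL.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SOL") col = i; next }
        # Data rows: tally the SOL value (skipped if no SOL column was found).
        col     { n[$col]++ }
        END     { for (k in n) print k, n[k] }
    ' "$1" | sort -n
}

# Usage: only runs if the MDS step has already produced the file.
if [ -f mds1.mds ]; then
    count_by_sol mds1.mds
fi
```

Each output line holds a cluster number followed by the number of individuals assigned to it.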

Duration with Plink2

It turns out Plink2 (still in beta) is much faster: the pairwise IBS step that took over 1 hour on the HapMap dataset now takes under 10 seconds using 8 threads. It may be worthwhile to look into Plink2's source code instead.