Plink IBS MDS Tutorial

The input and output file formats used in this tutorial are described in the wiki page on Plink file formats.
- The Example dataset can be downloaded here (15 MB). Contains 90 individuals and 230,000 variants.
- The HapMap (phase II) dataset can be downloaded here (120 MB). Contains 270 individuals and 4M variants.
- The HapMap (phase III) dataset can be downloaded here: MAP file (12 MB), PED file (850 MB). Contains 1184 individuals and 1.4M variants.
Start the tutorial with the Example dataset, as all operations should complete in under 2 minutes on a modern laptop.
Then, you can go through the tutorial again using the HapMap (phase II) dataset to see how the running time scales. To use the same commands with both datasets, you can symlink the files to match the names of the first dataset (be sure to work in a separate folder), as follows:
ln -s hapmap_r23a.bed wgas2.bed
ln -s hapmap_r23a.bim wgas2.bim
ln -s hapmap_r23a.fam wgas2.fam
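The three symlinks above can also be created by a small shell function that refuses to run when a source file is missing. This is only a sketch: `link_fileset` is a hypothetical helper name, not part of plink, and it assumes both prefixes live in the current directory so that the relative link targets resolve.

```shell
# Hypothetical helper: expose a binary fileset (.bed/.bim/.fam) under a new prefix.
# Usage: link_fileset SRC_PREFIX DST_PREFIX
link_fileset() {
    src=$1
    dst=$2
    for ext in bed bim fam; do
        # Stop early if a source file is missing.
        if [ ! -e "$src.$ext" ]; then
            echo "missing input: $src.$ext" >&2
            return 1
        fi
        ln -s "$src.$ext" "$dst.$ext" || return 1
    done
}
```

With the HapMap files in place, `link_fileset hapmap_r23a wgas2` would be equivalent to the three `ln -s` commands.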
- --file: indicates the prefix of the input files (.ped, .map).
- --bfile: same as --file, but tells plink to use binary input files instead (.bed, .bim, .fam).
- --out: indicates the prefix of the output files (the output format depends on the command being used).
Note: You don't need to specify the extensions for any of these files.
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed
- Purpose: Make binary versions of the PED/MAP files
- Input: wgas1.ped, wgas1.map
- Output: wgas2.bed, wgas2.bim, wgas2.fam
- Duration: 22s (timings are measured with the UNIX time command)
Note: Skip this step when using the second dataset.
plink --file wgas1 --make-bed --out wgas2
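As a quick sanity check after the conversion, the line counts of the .fam and .bim files give the number of individuals and variants, since each file holds one record per line. The function below is a minimal sketch; `fileset_counts` is a hypothetical name, not a plink command.

```shell
# Hypothetical helper: report how many individuals and variants a binary fileset holds.
# Usage: fileset_counts PREFIX
fileset_counts() {
    prefix=$1
    for ext in fam bim; do
        if [ ! -r "$prefix.$ext" ]; then
            echo "cannot read $prefix.$ext" >&2
            return 1
        fi
    done
    # .fam has one line per individual, .bim one line per variant.
    n_ind=$(wc -l < "$prefix.fam" | tr -d ' ')
    n_var=$(wc -l < "$prefix.bim" | tr -d ' ')
    echo "$n_ind individuals, $n_var variants"
}
```

For the Example dataset, `fileset_counts wgas2` should report roughly the 90 individuals and 230,000 variants quoted in the dataset list.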
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster
- Purpose: Pre-computes the most expensive step (pairwise IBS metrics) to allow running the cluster analysis multiple times (with different constraints).
- Input: wgas2.bed, wgas2.bim, wgas2.fam
- Output: ibs1.genome
- Duration: 25s (example), 1h05m46s (hapmap_r23a)
plink --bfile wgas2 --genome --out ibs1
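To get a feel for the pre-computed metrics, the sketch below lists the pairs with the highest DST value (the pairwise IBS measure in the .genome file), that is, the most similar pairs. It assumes the usual whitespace-separated .genome layout with a header row naming the IID1, IID2 and DST columns; `top_ibs_pairs` is a hypothetical helper, not a plink command.

```shell
# Hypothetical helper: print the N pairs with the highest DST from a .genome file.
# Usage: top_ibs_pairs FILE [N]   (N defaults to 10)
top_ibs_pairs() {
    genome=$1
    n=${2:-10}
    # Look columns up by header name rather than hard-coding their positions.
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
         { print $(col["IID1"]), $(col["IID2"]), $(col["DST"]) }' "$genome" |
        LC_ALL=C sort -k3,3nr |
        head -n "$n"
}
```

For example, `top_ibs_pairs ibs1.genome 5` prints the five closest pairs as `IID1 IID2 DST` lines.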
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#cluster
- Purpose: Performs a cluster analysis on the pre-computed IBS metrics.
- Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
- Output: ibs2.cluster[0-3], ibs2.hh
- Duration: 3s (example), 2m12s (hapmap_r23a)
plink --bfile wgas2 --read-genome ibs1.genome --out ibs2 \
--cluster
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#options
- Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
- Output: ibs3.cluster[0-3], ibs3.hh
- Duration: 3s (example), 2m09s (hapmap_r23a)
plink --bfile wgas2 --read-genome ibs1.genome --out ibs3 \
--cluster --ppc 0.01 --cc
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#matrix
- Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
- Output: ibs4.cluster[0-3], ibs4.hh, ibs4.mibs
- Duration: 3s (example), 2m09s (hapmap_r23a)
plink --bfile wgas2 --read-genome ibs1.genome --out ibs4 \
--cluster --matrix
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#matrix
- Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
- Output: ibs5.cluster[0-3], ibs5.hh, ibs5.mdist
- Duration: 3s (example), 2m12s (hapmap_r23a)
plink --bfile wgas2 --read-genome ibs1.genome --out ibs5 \
--cluster --distance-matrix
- Reference: http://pngu.mgh.harvard.edu/~purcell/plink/strat.shtml#mds
- Input: wgas2.bed, wgas2.bim, wgas2.fam, ibs1.genome
- Output: mds1.cluster[0-3], mds1.hh, mds1.mds
- Duration: 3s (example), 2m09s (hapmap_r23a)
plink --bfile wgas2 --read-genome ibs1.genome --out mds1 \
--cluster --mds-plot 4
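Before plotting, it can help to check how many individuals fall into each cluster. The sketch below counts the rows per value of the SOL column in the .mds file. It assumes a whitespace-separated file with a header row that contains a SOL column; `cluster_sizes` is a hypothetical helper, not a plink command.

```shell
# Hypothetical helper: count individuals per cluster solution (SOL) in a .mds file.
# Usage: cluster_sizes FILE
cluster_sizes() {
    # Find the SOL column from the header, then tally its values.
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SOL") c = i; next }
         c { count[$c]++ }
         END { for (k in count) print k, count[k] }' "$1" |
        LC_ALL=C sort -n
}
```

Running `cluster_sizes mds1.mds` prints one `SOL count` line per cluster, sorted by cluster code.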
- Purpose: Generate a plot of the first 2 MDS components, with individuals coloured according to their cluster assignment based on the SNP data. The Chinese cluster appears on the left and the Japanese cluster on the right.
- Input: mds1.mds
- Output: mds1-strat.png
- Duration: 1s
Copy and paste this code into an R shell:
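# Read the MDS coordinates written by plink (columns include SOL, C1, C2).
p <- read.table("mds1.mds", header = TRUE)
png("mds1-strat.png")
# Plot the first two components; SOL + 1 keeps a solution code of 0 on a visible colour.
plot(p$C1, p$C2, pch = 20, cex = 2, col = p$SOL + 1)
dev.off()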
It turns out Plink2 (still in beta) is much faster: the IBS computation that took about 1 hour on the HapMap dataset now takes under 10 seconds using 8 threads. It may be worthwhile to look into Plink2's source code instead.