A soup to nuts example

Andrew Kern edited this page Nov 23, 2021 · 14 revisions

diploSHIC usage

In the normal workflow of supervised machine learning, one creates a training set, trains an algorithm on that set, validates the accuracy of the trained algorithm, and finally applies the trained algorithm to real data. Here we want to take you, the user, through that full process using diploSHIC.

We have created a rather large directory that contains everything you need to replicate the mosquito genome results shown in Kern and Schrider (2018). The directory contains raw simulation output from discoal for hard sweeps, soft sweeps, and regions unlinked to sweeps, all conditioned on the demographic model for the Burkina Faso sample (BFS) from our recent AG1000G consortium paper. It also contains a large VCF file of genotype data from that paper, a masking file that shows which regions of the genome were sequenced, and a simple text file that maps individual sample IDs to populations, which is used to fish the relevant genotypes out of the VCF file.

Start by downloading and unpacking the example directory. Throughout this example I'm assuming that you are in the diploSHIC directory that you cloned from GitHub.

$ wget http://sesame.uoregon.edu/~adkern/diploSHIC/exampleApplication.tar.gz && tar zxvf exampleApplication.tar.gz

The relevant files are as follows

  • ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz -- VCF genotype file for a region of Chr3R from AG1000G phase1
  • Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz -- the masking file
  • samples_pops.txt -- file that maps individual samples to populations
  • hard_*.msOut.gz -- simulations for hard sweeps; the number refers to the subwindow where the sweep occurred
  • soft_*.msOut.gz -- simulations for soft sweeps; the number is as above
  • neut.msOut.gz -- simulations for regions unlinked to sweeps

A brief note on simulations

The way we have set up S/HIC (Schrider and Kern 2016) and diploS/HIC (Kern and Schrider 2018), the classifier takes five classes of data as its input: hard sweeps, soft sweeps, regions linked to hard sweeps, regions linked to soft sweeps, and regions unlinked to sweeps (neutral loci). We get the first four classes by performing coalescent simulations of a large chromosomal region containing a single hard or soft sweep, varying the position of the sweep from one set of simulations to the next. Specifically, we place the sweep at the center of one of our 11 subwindows, numbered 0 through 10. Simulations in which the sweep occurs in subwindow 5, i.e. the center of the larger simulated region, are treated as our hard or soft sweep cases. Those in which it occurs in subwindows 0-4 or 6-10 are considered the linked classes.
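
As a minimal sketch of that labeling scheme, the Python helper below maps a simulation's sweep type and subwindow index to one of the five class names. The function name and constants are hypothetical and chosen only for illustration; the class names follow those that appear in the prediction output at the end of this page.

```python
N_SUBWINDOWS = 11   # subwindows per simulated region, numbered 0..10
CENTER = 5          # index of the central subwindow

def class_label(sweep_type, subwindow=None):
    """Return the class name for one set of simulations.

    sweep_type: "hard", "soft", or "neut" (as in the example file names).
    subwindow:  index of the subwindow holding the sweep; ignored for "neut".
    """
    if sweep_type == "neut":
        return "neutral"
    if sweep_type not in ("hard", "soft"):
        raise ValueError(f"unknown sweep type: {sweep_type!r}")
    if subwindow is None or not 0 <= subwindow < N_SUBWINDOWS:
        raise ValueError(f"subwindow must be in 0..{N_SUBWINDOWS - 1}")
    if subwindow == CENTER:
        # Sweep in the central subwindow: a true hard or soft sweep example.
        return sweep_type
    # Sweep in a flanking subwindow: the central subwindow is only linked to it.
    return "linked" + sweep_type.capitalize()
```

For instance, `class_label("hard", 5)` gives `"hard"`, while `class_label("soft", 0)` gives `"linkedSoft"`.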

The discoal command line used for each simulation is given in the first line of the corresponding *.msOut.gz file. Let's look quickly at hard_5.msOut.gz:

$ zcat exampleApplication/hard_5.msOut.gz | head -n 1
discoal 162 2000 55000 -Pt 1750.204699 17502.046985 -Pre 19252.251684 57756.755051 \
-en 0.000038 0 1.001158 -en 0.000077 0 1.004693 -en 0.000117 0 1.017850 -en 0.000157 0 1.019335 \ 
-en 0.000198 0 1.022229 -en 0.000240 0 1.025899 -en 0.000282 0 1.030126 -en 0.000326 0 1.030824 \ 
-en 0.000369 0 1.031031 -en 0.000459 0 1.031171 -en 0.000891 0 1.030126 -en 0.000943 0 1.027106 \ 
-en 0.000995 0 1.019246 -en 0.001047 0 1.019335 -en 0.001100 0 1.019086 -en 0.001154 0 1.019246 \ 
-en 0.001209 0 1.020601 -en 0.001265 0 1.018926 -en 0.001321 0 1.013910 -en 0.001378 0 1.004683 \ 
-en 0.001435 0 1.000118 -en 0.001493 0 0.985278 -en 0.001551 0 0.967523 -en 0.001608 0 0.621680 \ 
-en 0.001646 0 0.599478 -en 0.001683 0 0.583576 -en 0.001719 0 0.576138 -en 0.001756 0 0.564006 \ 
-en 0.001792 0 0.559137 -en 0.001829 0 0.555749 -en 0.001866 0 0.552085 -en 0.001904 0 0.551380 \ 
-en 0.001942 0 0.550215 -en 0.002059 0 0.550213 -en 0.002226 0 0.550215 -en 0.002751 0 0.550213 \ 
-en 0.004234 0 0.550215 -en 0.004410 0 0.549391 -en 0.004502 0 0.549190 -en 0.004596 0 0.523247 \ 
-en 0.004688 0 0.219782 -en 0.004727 0 0.218488 -en 0.005243 0 0.218480 -en 0.005416 0 0.217284 \
-en 0.005477 0 0.217170 -en 0.005607 0 0.217036 -en 0.005675 0 0.216631 -en 0.005818 0 0.216410 \ 
-en 0.006054 0 0.215218 -en 0.006138 0 0.215008 -en 0.006226 0 0.214357 -en 0.006317 0 0.214284 \ 
-en 0.006412 0 0.213176 -en 0.006510 0 0.213172 -en 0.006613 0 0.212340 -en 0.006721 0 0.211736 \ 
-en 0.006833 0 0.211688 -en 0.006950 0 0.211334 -en 0.007630 0 0.127176 -en 0.007725 0 0.124824 \ 
-en 0.008040 0 0.128665 -en 0.008162 0 0.133725 -en 0.008297 0 0.137004 -en 0.008444 0 0.140183 \ 
-en 0.008606 0 0.143539 -en 0.008782 0 0.147709 -en 0.008978 0 0.155927 -en 0.009200 0 0.160223 \ 
-en 0.009446 0 0.165811 -en 0.009723 0 0.168947 -en 0.010029 0 0.177921 -en 0.010380 0 0.211421 \ 
-en 0.010838 0 0.230486 -en 0.011387 0 0.238488 -en 0.012014 0 0.215713 -en 0.012645 0 0.155886 \
-en 0.013155 0 0.078653 -en 0.013444 0 0.042227 -en 0.013620 0 0.018138 -en 0.013706 0 0.015204 \
-en 0.013790 0 0.014786 -en 0.013884 0 0.014356 -en 0.014124 0 0.017850 -en 0.014322 0 0.039856 \
-en 0.014876 0 0.100438 -en 0.016669 0 0.262716 -en 0.022924 0 0.049834 -en 0.024585 0 0.009407 \
-en 0.025056 0 0.001776 -en 0.025204 0 0.007266 -en 0.026415 0 0.038498 \
-ws 0 -Pa 2500.292426 250029.242646 -Pu 0.000000 0.000040 -x 0.5

This is a bit ugly because of all the -en flags, which specify the demographic size change history of the BFS sample, so I have added backslashes and line breaks to make it appear nicer on this wiki. Ignore those for a minute and instead focus on the options before and after the demography. In particular, you can see that -x 0.5 has been given, indicating that the hard sweep is at the middle of our simulated region. If we look instead at hard_0.msOut.gz, we see -x 0.045454545454545456 (that is, 0.5/11), indicating that the sweep occurred at the center of the leftmost of the 11 subwindows.
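
The two -x values quoted above are consistent with placing the sweep at the midpoint of a subwindow. Here is a small Python sketch of that arithmetic, assuming 11 equally sized subwindows numbered from 0; the function name is hypothetical and this is not taken from the diploSHIC or discoal source.

```python
def sweep_position(subwindow, n_subwindows=11):
    """Relative position (between 0 and 1) of the center of a subwindow."""
    if not 0 <= subwindow < n_subwindows:
        raise ValueError(f"subwindow must be in 0..{n_subwindows - 1}")
    return (subwindow + 0.5) / n_subwindows

# Print the -x value implied for each of the 11 hard sweep simulation files.
for i in range(11):
    print(f"hard_{i}.msOut.gz: -x {sweep_position(i)}")
```

Under that assumption, subwindow 5 gives 0.5 and subwindow 0 gives 0.5/11, matching hard_5.msOut.gz and hard_0.msOut.gz.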

Hopefully this simulation output will give you a good head start on creating your own simulations with discoal to generate your own training set for diploSHIC.
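
If you want to inspect the command lines of many simulation files at once, something along these lines may help. It is only a sketch: the function names are hypothetical, and it reads the three values after `discoal` as sample size, number of replicates, and number of sites, following discoal's usage.

```python
import gzip

def read_discoal_command(path):
    """Return the first line of a gzipped msOut file as a list of tokens."""
    with gzip.open(path, "rt") as handle:
        return handle.readline().split()

def summarize_command(tokens):
    """Pull out the leading positional arguments and the -x value, if present."""
    summary = {
        "sample_size": int(tokens[1]),
        "n_replicates": int(tokens[2]),
        "n_sites": int(tokens[3]),
        "sweep_position": None,   # stays None when no -x flag is given
    }
    if "-x" in tokens:
        summary["sweep_position"] = float(tokens[tokens.index("-x") + 1])
    return summary
```

For example, `summarize_command(read_discoal_command("exampleApplication/hard_5.msOut.gz"))` should report a sample size of 162, 2000 replicates, 55000 sites, and a sweep position of 0.5, given the command line shown above.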

Calculating feature vectors from simulations

Once we have simulation output, we need to compute summary statistics and feature vectors from it. To do that we will use diploSHIC in its fvecSim mode. These simulations are large, so the computation will take a while. I usually do this on a cluster, but to keep things simple I'll give an example of launching all the jobs on a multicore machine:

$ for f in exampleApplication/*.msOut.gz; do diploSHIC fvecSim diploid $f $f.diploid.fvec --totalPhysLen 55000 --maskFileName exampleApplication/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz --chrArmsForMasking 3R & done

This will launch all the jobs in the background. Go get lunch; this will take a couple of hours to complete.
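
The shell loop starts every job at once, which can be too much for a machine with few cores. One possible alternative, sketched here in Python, runs the same fvecSim commands but caps how many run concurrently. The function names and the `max_workers` parameter are hypothetical; the diploSHIC arguments are exactly those from the loop above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

MASK = "exampleApplication/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz"

def fvecsim_command(ms_file, mask=MASK, arm="3R", phys_len=55000):
    """Build the same fvecSim call as the shell loop, as an argument list."""
    ms_file = str(ms_file)
    return [
        "diploSHIC", "fvecSim", "diploid", ms_file, ms_file + ".diploid.fvec",
        "--totalPhysLen", str(phys_len),
        "--maskFileName", mask,
        "--chrArmsForMasking", arm,
    ]

def run_all(sim_dir="exampleApplication", max_workers=4):
    """Run one fvecSim job per *.msOut.gz file, at most max_workers at a time.

    Returns a dict mapping each input file to its process return code.
    """
    commands = [fvecsim_command(p) for p in sorted(Path(sim_dir).glob("*.msOut.gz"))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
    return {cmd[3]: code for cmd, code in zip(commands, codes)}
```

Calling `run_all(max_workers=8)` from the diploSHIC directory would then process all simulation files eight at a time; any nonzero return code in the result flags a job that failed.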

Make a balanced training set

Once this is complete we will create a balanced training set (i.e. the same number of examples per class) using the makeTrainingSets mode of diploSHIC.

$ mkdir rawFVFiles && mv exampleApplication/*.fvec rawFVFiles/
$ mkdir trainingSets
$ diploSHIC makeTrainingSets rawFVFiles/neut.msOut.gz.diploid.fvec rawFVFiles/soft \
rawFVFiles/hard 5 0,1,2,3,4,6,7,8,9,10 trainingSets/

Et voilà! Our training sets are ready.
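
A quick sanity check that the training set really is balanced can be sketched in Python as follows. This assumes that each .fvec file in trainingSets/ is a plain-text table with a single header line followed by one row per example; that layout is an assumption here, and the function names are hypothetical.

```python
from pathlib import Path

def count_examples(directory, header_lines=1):
    """Count data rows in every *.fvec file in a directory.

    Assumes plain-text files with `header_lines` header rows (an assumption,
    adjust if your files differ). Blank lines are ignored.
    """
    counts = {}
    for path in sorted(Path(directory).glob("*.fvec")):
        with open(path) as handle:
            n_rows = sum(1 for line in handle if line.strip())
        counts[path.name] = max(n_rows - header_lines, 0)
    return counts

def is_balanced(counts):
    """True if every file holds the same number of examples."""
    return len(set(counts.values())) <= 1
```

Running `is_balanced(count_examples("trainingSets"))` should return `True` if each class file ended up with the same number of examples.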

Train diploS/HIC

We will now move on to training a classifier. The hard work is done, so this step is rather simple. Training may take a while depending on your hardware setup. A GPU would make things really fast, but a multicore machine will do fine for our CNN:

$ diploSHIC train trainingSets/ trainingSets/ bfsModel

Note that I have specified the feature vector files in trainingSets/ to be used for both training and testing. This is fine; diploSHIC knows how to handle this. After 18 epochs of training, my optimization run is complete, yielding the following:

total time spent fitting and evaluating: 5230.030000 secs
evaluation on test set:
diploSHIC loss: 0.249277
diploSHIC accuracy: 0.930000

which looks quite good. The trained model is stored in bfsModel.json and bfsModel.weights.hdf5. That's it for training. Now let's apply this to real data.

Feature vectors for example mosquito data

Applying the trained model has two steps: calculating feature vectors from data and then prediction from our trained CNN. diploSHIC will compute feature vectors from VCF files. We will do that on the ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz file that is supplied in the exampleApplication directory.

$ diploSHIC fvecVcf diploid \
exampleApplication/ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz 3R 53200684 \
exampleApplication/ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz.diploid.fvec \
--targetPop BFS --sampleToPopFileName exampleApplication/samples_pops.txt --winSize 55000 \
--maskFileName exampleApplication/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP3.accessible.fa.gz

This will take a while to run, so go hit the pool. Once complete, exampleApplication/ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz.diploid.fvec holds data that is ready for prediction.

Prediction on empirical feature vectors

The last step is to feed the feature vectors from the empirical data back to the trained CNN to predict which regions of the genome fit into each of our five classes. This will be quite quick:

$ diploSHIC predict bfsModel.json bfsModel.weights.hdf5 exampleApplication/ag1000g.phase1.ar3.pass.biallelic.3R.vcf.28000000-29000000.gz.diploid.fvec mossie.preds

Let's look at the output from that call

$ head mossie.preds
chrom	classifiedWinStart	classifiedWinEnd	bigWinRange	predClass	prob(neutral)	prob(likedSoft)	prob(linkedHard)	prob(soft)	prob(hard)
3R	28022501	28027500	28000001-28055000	linkedSoft	0.002331	0.997667	0.000002	0.000000	0.000000
3R	28027501	28032500	28005001-28060000	linkedSoft	0.000168	0.999832	0.000000	0.000000	0.000000
3R	28032501	28037500	28010001-28065000	linkedSoft	0.000116	0.999856	0.000027	0.000001	0.000000
3R	28037501	28042500	28015001-28070000	linkedSoft	0.001110	0.998243	0.000539	0.000105	0.000002
3R	28042501	28047500	28020001-28075000	soft	0.014143	0.219835	0.001996	0.758655	0.005371
3R	28047501	28052500	28025001-28080000	linkedSoft	0.018158	0.743658	0.000130	0.237950	0.000102
3R	28052501	28057500	28030001-28085000	linkedSoft	0.008697	0.617615	0.000480	0.372664	0.000545
3R	28057501	28062500	28035001-28090000	linkedSoft	0.000932	0.998939	0.000123	0.000005	0.000000
3R	28062501	28067500	28040001-28095000	linkedSoft	0.000031	0.999955	0.000014	0.000000	0.000000

Each row here represents a given window of the genome that has been classified. The predicted class is given along with the probabilities of class membership for that window for each of the five classes.
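
To summarize a predictions file like the one above, a short Python sketch such as the following may be useful. It relies only on the tab-delimited layout and column names shown in the header line; the function names are hypothetical.

```python
import csv
from collections import Counter

def read_predictions(lines):
    """Parse tab-delimited prediction rows (with a header line) into dicts."""
    return list(csv.DictReader(lines, delimiter="\t"))

def class_counts(rows):
    """Number of classified windows per predicted class."""
    return Counter(row["predClass"] for row in rows)

def prob_total(row):
    """Sum of the class membership probabilities in one row (should be near 1)."""
    return sum(float(v) for k, v in row.items() if k.startswith("prob("))

def sweep_windows(rows, classes=("hard", "soft")):
    """Windows predicted to contain a sweep, as (chrom, start, end, class)."""
    return [
        (r["chrom"], int(r["classifiedWinStart"]), int(r["classifiedWinEnd"]), r["predClass"])
        for r in rows
        if r["predClass"] in classes
    ]
```

With `rows = read_predictions(open("mossie.preds"))`, `class_counts(rows)` tallies the windows per class and `sweep_windows(rows)` lists the windows called as hard or soft sweeps, such as the 3R window 28042501-28047500 in the rows shown above.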

That's it!