
Local PCA/population structure (lostruct)

To install the package, make sure you have devtools (install it with install.packages("devtools") if needed), then run:

install.packages("data.table")
devtools::install_github("petrelharp/local_pca/lostruct")
library(lostruct)

Note: the library is called lostruct.

Using the R package

The example scripts in this repository's directories mostly work without the R package. To start using the code on your own data, have a look at these files:

  • A quick example: in four lines of code, reads in chromosome 22 from a TPED file and does local PCA.

  • Setting up the medicago data: documents where the data are from, then does local PCA on a small subset of the whole dataset to show how the functions work.

  • Script for medicago analysis: an Rscript that runs the same analysis on the medicago data, with parameters set by command-line options. Run Rscript run_on_medicago.R --help for a list.

  • Report summarizing an analysis: an Rmarkdown file that can be compiled with templater to produce visualizations of the results of the above.

Prerequisites

  • To use the functions that read windows from a BCF file, you will need bcftools.
  • To compile the example report, you probably want templater.

Standalone code

Also included is the code we used to analyze the datasets in the paper (before the R package was written). In each directory, the code is meant to be read in this order:

  1. recode: turn bases into numbers
  2. PCA: find local PCs
  3. distance: compute distance matrix between windows from local PCs
  4. MDS: visualize the result

There are standalone examples for each of the three datasets studied:

POPRES (Homo sapiens, SNP chip data from a few worldwide populations)

Chromosome 1 is the example given. See also popres_example.R for an example of some steps using the package.

DPGP (Drosophila melanogaster population genome project)

Chromosome 3L is the example given.

Medicago (Medicago truncatula hapmap)

For Medicago, the code computes pairwise distances between windows for all 8 chromosomes together, applies MDS to the combined distance matrix, and then uses the subset of the whole MDS result that belongs to each chromosome.
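The "joint MDS, then subset by chromosome" idea can be sketched as follows. The data here are made up: chrom is a hypothetical vector labelling each window with its chromosome, and feat is a random stand-in for whatever per-window summaries the distances are computed from.

```r
set.seed(2)

# Hypothetical layout: 12 windows spread over 3 chromosomes.
chrom <- rep(paste0("chr", 1:3), times = c(4, 5, 3))

# Stand-in for per-window summaries; real distances would come from local PCs.
feat <- matrix(rnorm(length(chrom) * 6), nrow = length(chrom))
win_dist <- as.matrix(dist(feat))

# One MDS over all windows from all chromosomes together...
mds_all <- cmdscale(win_dist, k = 2)

# ...then pull out the rows belonging to each chromosome.
mds_by_chrom <- split(as.data.frame(mds_all), chrom)
```

Because every chromosome is embedded in the same MDS space, coordinates can be compared across chromosomes, which would not be true if MDS were run separately on each one.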

A note on implementation:

This method works through the genome one window at a time, running PCA on each window's covariance matrix. Loading the entire dataset into memory first can therefore be frustratingly slow. Several methods are implemented here to avoid this; for instance, vcf_windower(), which is used to compute PCs for the medicago data. The interface is a function that takes an integer, n, and returns a data frame of the genomic data in the nth window.
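To make the interface concrete, here is a minimal sketch of such a function for a plain whitespace-delimited text file with one SNP per row. make_file_windower is a hypothetical helper written for this illustration; it is not part of lostruct and says nothing about how vcf_windower() is implemented.

```r
# Returns a function that reads only the nth window of rows from disk,
# so the whole file never needs to be held in memory at once.
make_file_windower <- function(path, win_size, n_rows) {
  n_win <- n_rows %/% win_size
  function(n) {
    stopifnot(n >= 1, n <= n_win)
    read.table(path, skip = (n - 1) * win_size, nrows = win_size)
  }
}

# Simulated genotype file: 100 SNPs (rows) by 5 samples (columns).
set.seed(3)
geno <- matrix(sample(0:2, 100 * 5, replace = TRUE), nrow = 100)
geno_file <- tempfile(fileext = ".txt")
write.table(geno, geno_file, row.names = FALSE, col.names = FALSE)

get_window <- make_file_windower(geno_file, win_size = 10, n_rows = 100)
w3 <- get_window(3)   # data frame holding rows 21 to 30 of the file
```

Any per-window computation can then be written against get_window(n) without caring whether the data come from memory, a text file, or a BCF file.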
