Skip to content

Latest commit

 

History

History
63 lines (49 loc) · 1.87 KB

genotyping-inversions.md

File metadata and controls

63 lines (49 loc) · 1.87 KB

Genotyping Inversions

Once you've determined that some of the PCs capture inversions, you can proceed to genotyping the samples. You can cluster samples using their coordinates along the PCs. The clusters can reveal the genotypes of the samples. You can use the Manhattan plots from localizing inversions to figured out which PCs capture which inversions.

Unsupervised Genotyping with Clustering

For each inversion, we can cluster the samples using the coordinates along the PCs that capture the inversion.

For example, if an inversion is captured by PC 1, we can cluster the samples as so:

$ asaph_genotype \
	cluster \
	--workdir <workdir> \
	--components 1 \
	--n-clusters 3 \
	--predicted-labels-fl predicted_labels.pops

If an inversion is captured by PCs 1 and 2, we can cluster the samples using both PCs:

$ asaph-genotype \
	cluster \
	--workdir <workdir> \
	--components 1 2 \
	--n-clusters 3 \
	--predicted-labels-fl predicted_labels.pops

Optimizing Clustering Parameters

If you have known labels, you can find the parameters that optimize the clustering. Each clustering is scored by how well the cluster labels predict the ground truth labels and vice versa.

$ asaph-genotype \
	sweep-parameters \
	--workdir <workdir> \
    --labels-fl known_labels.pops \
	--components 1 2 3 4 \
	kmeans \
	--n-clusters 2 3 4 5 \

The output will look like so:

Best score: 0.9722845902897717
Best parameters: {'n_clusters': 3, 'components': [1], 'feature_scaling': False}
Best centroids: [[-0.3983714 ]
 [ 3.01494853]
 [ 1.08898097]]

Evaluating Predictions

If you happen to know the genotypes for your samples, you can test the cluster and other labels for agreement:

$ asaph_genotype \
	evaluate-predicted-genotypes \
	--predicted-labels-fl predicted_labels.pops \
	--output-labels-fl known_labels.pops