Support pedigreed populations #18

lvclark · 2021-01-20T16:46:58Z

I'd like to make a new pipeline that uses pedigree information. Genotype estimates of parents and offspring will iteratively influence genotype priors of parents and offspring. Even for biparental populations, this could perform much better than the existing pipeline, which doesn't handle segregation distortion well.

I probably won't tackle this until I have added support for multiploid populations (Issue #17), in order to avoid having to rewrite a lot of code after the fact.

If you have a good test dataset for this sort of population, please contact me!

lvclark · 2021-09-15T13:10:09Z

Leaving some notes here for myself:

There needs to be a way to deal with errors in the pedigree, since greenhouse mixups, wayward pollen, and unexpected self-fertilization are so common. Maybe assign a prior probability that each connection in the pedigree is correct, then do a Bayesian comparison of the hypothesis that the connection is correct vs. the hypothesis that the individual is just a random individual from the population.
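A minimal sketch of how that comparison could look, assuming per-marker genotype likelihoods from the reads are already available. The function names, the 0.95 prior, and the example genotype frequencies are all hypothetical, not part of the package:

```python
import math

FLOOR = 1e-300  # avoids log(0) when a hypothesis gives the reads zero probability


def marginal_likelihood(read_likelihood, genotype_prior):
    """Probability of the reads, summed over genotypes under a given prior."""
    return sum(l * p for l, p in zip(read_likelihood, genotype_prior))


def prob_connection_correct(markers, prior_correct=0.95):
    """Posterior probability that a pedigree connection is correct.

    markers: iterable of (read_likelihood, pedigree_prior, population_prior),
    one tuple per marker, each a list indexed by allele copy number.
    prior_correct: assumed prior that the connection is right (hypothetical).
    """
    log_ped = math.log(prior_correct)
    log_rand = math.log(1.0 - prior_correct)
    for read_lik, ped_prior, pop_prior in markers:
        log_ped += math.log(max(marginal_likelihood(read_lik, ped_prior), FLOOR))
        log_rand += math.log(max(marginal_likelihood(read_lik, pop_prior), FLOOR))
    # Normalize in log space so many markers do not underflow
    top = max(log_ped, log_rand)
    w_ped = math.exp(log_ped - top)
    w_rand = math.exp(log_rand - top)
    return w_ped / (w_ped + w_rand)


pop = [0.25, 0.5, 0.25]   # made-up population genotype frequencies
ped = [1.0, 0.0, 0.0]     # pedigree says both parents carry zero copies
consistent = [([0.98, 0.01, 0.01], ped, pop)] * 3
mismatched = [([0.01, 0.98, 0.01], ped, pop)] * 3
p_consistent = prob_connection_correct(consistent)
p_mismatched = prob_connection_correct(mismatched)
```

With these made-up numbers, three markers that agree with the pedigree push the probability close to one, and three that contradict it push it close to zero.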

Alternatively, get a set of inter-individual distances using read depth ratios, and let the user interactively identify pedigree errors.

For missing parents, we can simply add individuals with zero read depth.

lvclark · 2021-10-03T16:07:04Z

All individuals start with even priors, then as information is added across the pedigree, priors get multiplied by the new information and normalized to sum to one.
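As a small illustration of that multiply-and-normalize step (the function name and the example weights are made up for this sketch):

```python
def update_prior(prior, new_info):
    """Multiply a prior by new information element-wise, then renormalize."""
    product = [p * w for p, w in zip(prior, new_info)]
    total = sum(product)
    if total == 0:
        raise ValueError("new information rules out every genotype")
    return [x / total for x in product]


# Diploid example: allele copy number 0, 1, or 2
even_prior = [1 / 3, 1 / 3, 1 / 3]
# Hypothetical information from the pedigree: two copies are impossible
pedigree_weights = [0.5, 0.5, 0.0]
updated = update_prior(even_prior, pedigree_weights)
```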

The unit of analysis should be a single pair of parents and their offspring. Keep a list that gives the sample names of the parents and offspring in each family. Then, for each marker and each family, jointly estimate the probability of both parent genotypes, using what we already know about parent and offspring genotypes. For each ploidy combination, have a list set up in advance of every possible parental genotype combination, along with the progeny genotypes possible under each. The probability that a given parental genotype combination is the true one is the product of the probability of each parent having that genotype and the probability of each offspring having a genotype that is possible under that cross (ignoring expected genotype frequencies, because we could have segregation distortion!). That then goes back to inform the priors of individuals: basically the probability of each genotype under each parental genotype combination, weighted by the probability of that parental genotype combination.
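The precomputed list of parental genotype combinations and their possible progeny might be built along these lines. This is only a sketch: it assumes even parental ploidies, gametes that carry half of the parent's chromosomes, and no double reduction, and the function names are hypothetical.

```python
from itertools import product


def gamete_dosages(genotype, ploidy):
    """Allele copy numbers a gamete can carry, given the parent's dosage."""
    half = ploidy // 2
    return range(max(0, genotype - half), min(genotype, half) + 1)


def build_cross_table(ploidy1, ploidy2):
    """Map each (parent1, parent2) genotype pair to the possible progeny dosages."""
    table = {}
    for g1, g2 in product(range(ploidy1 + 1), range(ploidy2 + 1)):
        progeny = {
            a + b
            for a in gamete_dosages(g1, ploidy1)
            for b in gamete_dosages(g2, ploidy2)
        }
        table[(g1, g2)] = sorted(progeny)
    return table


diploid_table = build_cross_table(2, 2)  # diploid x diploid
mixed_table = build_cross_table(2, 4)    # diploid x tetraploid, triploid progeny
```

Only which progeny genotypes are possible is stored, not their expected frequencies, in keeping with the point about segregation distortion.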

So, in essence:

1. Estimate individual genotype posterior probabilities under even priors
2. Estimate probabilities of parental genotype combinations, given parent and offspring genotype posterior probabilities
3. Update individual priors based on probabilities of parental genotype combinations
4. Re-estimate individual genotype posterior probabilities under the new priors
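A rough single-pass sketch of those four steps for one diploid family at one marker. Everything here is an assumption made for illustration: the function names, the read likelihoods, and in particular the choice to spread each combination's probability evenly over its possible progeny genotypes when building offspring priors. It also does nothing to keep an individual's own reads from being counted twice.

```python
from itertools import product

GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}  # diploid dosage -> possible gamete dosages
GENOTYPES = (0, 1, 2)


def posterior(likelihood, prior):
    """Multiply likelihood by prior and normalize to sum to one."""
    raw = [l * p for l, p in zip(likelihood, prior)]
    total = sum(raw)
    return [r / total for r in raw]


def allowed_progeny(g1, g2):
    return {a + b for a in GAMETES[g1] for b in GAMETES[g2]}


def combo_probs(post_p1, post_p2, post_offspring):
    """Step 2: probability of each parental genotype combination."""
    probs = {}
    for g1, g2 in product(GENOTYPES, repeat=2):
        allowed = allowed_progeny(g1, g2)
        p = post_p1[g1] * post_p2[g2]
        for child in post_offspring:
            p *= sum(child[g] for g in allowed)  # child is possible under cross
        probs[(g1, g2)] = p
    total = sum(probs.values())
    return {combo: p / total for combo, p in probs.items()}


def priors_from_combos(probs):
    """Step 3: priors for parent 1, parent 2, and offspring."""
    prior_p1, prior_p2, prior_off = [0.0] * 3, [0.0] * 3, [0.0] * 3
    for (g1, g2), p in probs.items():
        prior_p1[g1] += p
        prior_p2[g2] += p
        allowed = allowed_progeny(g1, g2)
        for g in allowed:
            prior_off[g] += p / len(allowed)  # even weight, no segregation ratios
    return prior_p1, prior_p2, prior_off


def one_pass(lik_p1, lik_p2, lik_offspring):
    even = [1 / 3] * 3
    # Step 1: posteriors under even priors
    post_p1 = posterior(lik_p1, even)
    post_p2 = posterior(lik_p2, even)
    post_off = [posterior(l, even) for l in lik_offspring]
    # Steps 2 and 3
    new_priors = priors_from_combos(combo_probs(post_p1, post_p2, post_off))
    # Step 4: re-estimate under the new priors
    prior_p1, prior_p2, prior_off = new_priors
    return (
        posterior(lik_p1, prior_p1),
        posterior(lik_p2, prior_p2),
        [posterior(l, prior_off) for l in lik_offspring],
        new_priors,
    )


# Made-up read likelihoods: both parents look like dosage 0, one child is ambiguous
post_p1, post_p2, post_offspring, new_priors = one_pass(
    [0.98, 0.01, 0.01],
    [0.98, 0.01, 0.01],
    [[0.5, 0.5, 0.0], [0.9, 0.1, 0.0]],
)
```

In this example the ambiguous child, whose reads alone split evenly between zero and one copies, ends up strongly favoring zero copies once the parents are taken into account.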

Perform as many iterations as the maximum number of generations between individuals, or find some other way to make sure grandparents are influenced by grandchildren genotypes etc.
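One way to get that iteration count is the longest chain of ancestors in the pedigree. A sketch, assuming the pedigree is stored as a dictionary from individual to a pair of parent names (with `None` for unknown parents) and contains no cycles; the data structure and names are hypothetical:

```python
from functools import lru_cache

# Hypothetical pedigree: individual -> (parent 1, parent 2)
pedigree = {
    "A": (None, None),
    "B": (None, None),
    "C": ("A", "B"),
    "D": (None, None),
    "E": ("C", "D"),
}


def max_generations(pedigree):
    """Length of the longest ancestor chain in an acyclic pedigree."""

    @lru_cache(maxsize=None)
    def depth(individual):
        parents = [p for p in pedigree.get(individual, ()) if p is not None]
        if not parents:
            return 0  # founder, or parent not listed in the pedigree
        return 1 + max(depth(p) for p in parents)

    return max(depth(individual) for individual in pedigree)


n_iterations = max_generations(pedigree)
```

Here `E` is a grandchild of `A` and `B`, so two iterations would be needed for information to travel between them.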

lvclark · 2021-10-09T15:45:52Z

The internal Rcpp function ThirdDimProd may be helpful.

lvclark added the enhancement label Jan 20, 2021
