Support pedigreed populations #18

Open
lvclark opened this issue Jan 20, 2021 · 3 comments

@lvclark
Owner

lvclark commented Jan 20, 2021

I'd like to make a new pipeline that uses pedigree information. Genotype estimates of parents and offspring will iteratively influence genotype priors of parents and offspring. Even for biparental populations, this could perform much better than the existing pipeline, which doesn't handle segregation distortion well.

I probably won't tackle this until I have added support for multiploid populations (Issue #17), in order to avoid having to rewrite a lot of code after the fact.

If you have a good test dataset for this sort of population, please contact me!

@lvclark
Owner Author

lvclark commented Sep 15, 2021

Leaving some notes here for myself:

There needs to be a way to deal with errors in the pedigree, since greenhouse mixups, wayward pollen, and unexpected self-fertilization are so common. Maybe have some prior that each connection in the pedigree is correct. Then do a Bayesian comparison of the hypothesis that the pedigree is correct vs. the hypothesis that the individual is just a random individual in the population.
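A minimal sketch of that comparison for one individual, assuming the two log-likelihoods (reads given the recorded parents, and reads given a random member of the population) have already been computed elsewhere. The function name and the default prior of 0.9 are hypothetical, not anything in the package.

```python
import math

def pedigree_link_posterior(loglik_pedigree, loglik_random, prior_correct=0.9):
    """Posterior probability that the recorded pedigree link is correct.

    loglik_pedigree: log-likelihood of the individual's reads if its recorded
        parents are the true parents.
    loglik_random: log-likelihood of the same reads if the individual is just
        a random member of the population.
    prior_correct: assumed prior probability (strictly between 0 and 1) that
        the link is correct.
    """
    log_a = math.log(prior_correct) + loglik_pedigree
    log_b = math.log(1.0 - prior_correct) + loglik_random
    # Subtract the larger term before exponentiating to avoid underflow.
    m = max(log_a, log_b)
    a = math.exp(log_a - m)
    b = math.exp(log_b - m)
    return a / (a + b)
```

When the two hypotheses explain the reads equally well, the posterior equals the prior; a much lower likelihood under the pedigree pulls it toward zero and flags the link as suspect.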

Alternatively, get a set of inter-individual distances using read depth ratios, and let the user interactively identify pedigree errors.
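One possible way to get such distances, sketched under assumptions: reads are given per marker as (reference, alternative) counts, and the distance is the root-mean-square difference in alternative-allele read depth ratio over markers where both individuals have reads. The distance measure and function names are illustrative choices, not something decided in this issue.

```python
import math

def depth_ratio(ref_reads, alt_reads):
    """Proportion of reads carrying the alternative allele, or None if no reads."""
    total = ref_reads + alt_reads
    return None if total == 0 else alt_reads / total

def ratio_distance(reads_a, reads_b):
    """RMS difference in read depth ratio between two individuals.

    reads_a, reads_b: lists of (ref_reads, alt_reads) tuples, one per marker,
    in the same marker order. Markers missing in either individual are skipped.
    """
    squared = []
    for (ref_a, alt_a), (ref_b, alt_b) in zip(reads_a, reads_b):
        ratio_a = depth_ratio(ref_a, alt_a)
        ratio_b = depth_ratio(ref_b, alt_b)
        if ratio_a is not None and ratio_b is not None:
            squared.append((ratio_a - ratio_b) ** 2)
    if not squared:
        return float("nan")  # no shared markers, distance undefined
    return math.sqrt(sum(squared) / len(squared))
```

A parent and its recorded offspring that sit unusually far apart under a measure like this would be candidates for the user to inspect.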

For missing parents, we can simply add individuals with zero read depth.
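To see why zero read depth works as a placeholder, here is a small sketch using a plain binomial read model with a fixed error rate. That model is a simplification chosen for illustration and is not claimed to be the one the package uses; the function name and defaults are hypothetical.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, ploidy=2, error=0.001):
    """Binomial likelihood of the read counts for each alt-allele dosage 0..ploidy."""
    n = ref_reads + alt_reads
    liks = []
    for dosage in range(ploidy + 1):
        frac = dosage / ploidy
        # Expected proportion of alt reads, allowing a small sequencing error.
        p_alt = frac * (1 - error) + (1 - frac) * error
        liks.append(comb(n, alt_reads) * p_alt ** alt_reads * (1 - p_alt) ** ref_reads)
    return liks

# With no reads, every genotype is equally likely.
flat = genotype_likelihoods(0, 0)
```

Because the likelihood is flat, a parent added this way has a posterior equal to whatever prior the rest of the pedigree gives it, so its genotype ends up inferred entirely from its relatives.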

@lvclark
Owner Author

lvclark commented Oct 3, 2021

All individuals start with even priors, then as information is added across the pedigree, priors get multiplied by the new information and normalized to sum to one.
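As a small illustration of the multiply-and-normalize step, assuming each piece of new information arrives as a vector of weights over the same genotypes (the function name is hypothetical):

```python
def update_prior(prior, *evidence):
    """Multiply a prior elementwise by each evidence vector, then normalize."""
    weights = list(prior)
    for ev in evidence:
        weights = [w * e for w, e in zip(weights, ev)]
    total = sum(weights)
    if total == 0:
        raise ValueError("evidence is incompatible with every genotype")
    return [w / total for w in weights]

even = [1 / 3] * 3  # even prior over diploid dosages 0, 1, 2
updated = update_prior(even, [0.5, 0.5, 0.0], [0.1, 0.8, 0.1])
# updated is approximately [1/9, 8/9, 0]
```

The zero-total check matters in practice: two sources that rule out every genotype between them would otherwise cause a division by zero.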

The unit of analysis should be a single pair of parents and their offspring. Have a list that gives the sample names of the parents and offspring in each family. Then for each marker and each family, we need to jointly estimate the probabilities of both parents' genotypes, using what we already know about parent and offspring genotypes. For each ploidy combination, have a list already set up of every possible parental genotype combination, along with the progeny genotypes that each combination can produce. The probability of a given genotype combination being the true one is proportional to the product of the probability of each parent having that genotype and the probability of each offspring having a genotype that is possible under that cross (ignoring expected genotype frequencies, because we could have segregation distortion!). Then that goes back to inform the priors of individuals; basically the probability of each genotype under each parental genotype combination, weighted by the probability of the parental genotype combination.
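A rough sketch of this calculation for one marker in one family. Several things here are assumptions for illustration only: genotypes are coded as alt-allele dosages, ploidies are even, gametes carry half the parental ploidy with no double reduction, and each progeny genotype possible under a cross is treated as equally likely when building the offspring prior (one reading of "ignoring expected genotype frequencies"). All function names are hypothetical.

```python
from itertools import product

def gamete_dosages(dosage, ploidy):
    """Alt-allele counts a parent can transmit in a gamete (no double reduction)."""
    half = ploidy // 2
    return range(max(0, dosage - half), min(dosage, half) + 1)

def possible_progeny(d1, p1, d2, p2):
    """Set of progeny dosages possible from a cross of the two parents."""
    return {a + b for a in gamete_dosages(d1, p1) for b in gamete_dosages(d2, p2)}

def parent_combo_probs(post_p1, post_p2, post_offspring, ploidy1=2, ploidy2=2):
    """Normalized probability of each (parent 1, parent 2) dosage combination."""
    weights = {}
    for d1, d2 in product(range(ploidy1 + 1), range(ploidy2 + 1)):
        allowed = possible_progeny(d1, ploidy1, d2, ploidy2)
        w = post_p1[d1] * post_p2[d2]
        for post in post_offspring:
            # Probability that this offspring has any genotype the cross allows.
            w *= sum(post[g] for g in allowed)
        weights[(d1, d2)] = w
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no parental combination is compatible with the data")
    return {combo: w / total for combo, w in weights.items()}

def offspring_prior(combo_probs, ploidy1=2, ploidy2=2):
    """Offspring genotype prior: possible genotypes weighted by combination probability."""
    prior = [0.0] * ((ploidy1 + ploidy2) // 2 + 1)
    for (d1, d2), pr in combo_probs.items():
        allowed = possible_progeny(d1, ploidy1, d2, ploidy2)
        for g in allowed:
            prior[g] += pr / len(allowed)  # assumed uniform over possible genotypes
    return prior
```

The parents' updated priors could be taken as the marginals of `parent_combo_probs` over the other parent, although care would be needed not to count each parent's own posterior twice when those priors are reapplied.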

So in essence

  1. Estimate individual genotype posterior probabilities under even priors
  2. Estimate probabilities of parental genotype combinations, given parent and offspring genotype posterior probabilities
  3. Update individual priors based on probabilities of parental genotype combinations
  4. Re-estimate individual genotype posterior probabilities under the new priors

Perform as many iterations as the maximum number of generations between individuals, or find some other way to make sure grandparents are influenced by grandchildren genotypes etc.
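One way the iteration count could be derived from the pedigree, sketched with a hypothetical representation: a dict mapping each individual to a tuple of its recorded parents, with founders mapped to an empty tuple. The pedigree is assumed to contain no cycles.

```python
def max_generations(parents):
    """Longest ancestor-to-descendant chain in the pedigree, in generations.

    parents: dict mapping each individual to a tuple of its recorded parents.
    Founders map to an empty tuple; parents absent from the dict are treated
    as founders.
    """
    depth = {}

    def generation(ind):
        if ind not in depth:
            # Founders get depth 0; everyone else is one below their deepest parent.
            depth[ind] = 1 + max((generation(p) for p in parents.get(ind, ())), default=-1)
        return depth[ind]

    return max((generation(ind) for ind in parents), default=0)

pedigree = {
    "gp1": (), "gp2": (),
    "p1": ("gp1", "gp2"), "p2": (),
    "kid": ("p1", "p2"),
}
n_iter = max_generations(pedigree)  # grandparents to grandchild: 2
```

Running steps 2 to 4 that many times gives information from the youngest generation a chance to reach the oldest and back, which is the concern raised above.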

@lvclark
Owner Author

lvclark commented Oct 9, 2021

The internal Rcpp function ThirdDimProd may be helpful.
