
Basic ancestry functionality #143

Merged: 29 commits into apriha:develop on Oct 22, 2021

Conversation

arvkevi
Contributor

arvkevi commented Sep 21, 2021

This PR adds basic functionality to predict genetic ancestry using ezancestry. @apriha please feel free to make suggestions/direct edits as you see fit, this is just to get the concept moving forward. Here's how a user could utilize this functionality from snps.
[Screenshot: example of calling the new ancestry prediction functionality from snps]
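Since the screenshot did not survive, here is an illustrative stub of the usage it showed: calling ancestry prediction from a snps SNPs object. The real call requires `pip install snps[ezancestry]` and a genotype file; in this sketch SNPs is stubbed out, and the method name, filename, and returned keys are assumptions drawn from this thread, not the actual screenshot.

```python
# Illustrative stub only: SNPs here is a toy stand-in for snps.SNPs so the
# shape of the API is visible without ezancestry or a genotype file installed.
class SNPs:
    """Stand-in for snps.SNPs (toy implementation for illustration only)."""

    def __init__(self, file):
        self.file = file

    def predict_ancestry(self):
        # The real method delegates to ezancestry; values below are made up.
        return {
            "predicted_population_superpopulation": "EUR",
            "EUR": 0.90,
        }


s = SNPs("genotype.txt")  # hypothetical genotype filename
result = s.predict_ancestry()
print(result["predicted_population_superpopulation"])  # EUR
```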

.github/workflows/ci.yml (review comment; outdated, resolved)
@codecov

codecov bot commented Oct 5, 2021

Codecov Report

Merging #143 (d29743f) into develop (8ca5d75) will increase coverage by 0.07%.
The diff coverage is 100.00%.


@@             Coverage Diff             @@
##           develop     #143      +/-   ##
===========================================
+ Coverage    93.44%   93.52%   +0.07%     
===========================================
  Files            8        8              
  Lines         1540     1559      +19     
  Branches       273      274       +1     
===========================================
+ Hits          1439     1458      +19     
  Misses          54       54              
  Partials        47       47              
Impacted Files Coverage Δ
src/snps/snps.py 95.94% <100.00%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8ca5d75...d29743f. Read the comment docs.

@apriha
Owner

apriha commented Oct 5, 2021

@arvkevi I think we're close to getting the initial tests working. However, pip is taking a long time to search for compatible packages. I can fix this via a two-step install, e.g.:

pip install ezancestry
pip install .

However, that defeats the simplicity of just pip install .[ezancestry]. Any ideas on how this can be improved?
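For context, the single-command install relies on the package declaring ezancestry as an optional extra. A hypothetical setup.cfg fragment (illustrative; not the actual snps packaging config) would look like:

```ini
# Hypothetical setup.cfg fragment declaring ezancestry as an optional extra,
# so that `pip install .[ezancestry]` pulls it in alongside snps itself.
[options.extras_require]
ezancestry =
    ezancestry
```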

@arvkevi
Contributor Author

arvkevi commented Oct 6, 2021

Thank you for hacking on this PR, Andrew! I cut a release to ezancestry that supports 3.7, which is why I triggered the build yesterday w/ an empty commit. I am confused as to why this is taking so long to resolve dependencies. I'll spend some more time with it.

@apriha
Owner

apriha commented Oct 8, 2021

Hi Kevin, same here. FYI, I tried running the test-extras job locally via act, and dependencies were resolved quickly and without any issues...

@apriha
Owner

apriha commented Oct 9, 2021

Hey @arvkevi , turns out pip couldn't find the correct version of snps since the tag version history was not available after checkout; 4582b51 fixed it! Pretty close now... looks like some issues with finding ezancestry data.
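A common fix for missing tag history in GitHub Actions (a guess at what the referenced commit did; this YAML is illustrative, not the actual ci.yml change) is to fetch the full history in the checkout step:

```yaml
# Hypothetical checkout step: fetch-depth: 0 fetches all history and tags,
# so tag-based version detection can find the correct snps version.
- uses: actions/checkout@v2
  with:
    fetch-depth: 0
```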

@apriha
Owner

apriha commented Oct 11, 2021

I did some more testing with act and listed the contents of the equivalent of the /home/runner/.ezancestry/data/ directory... It looks like the ezancestry Python code is looking up filenames with a different case to what's actually on the filesystem; e.g., aisnps/Kidd.AISNP.txt (Python) vs aisnps/KIDD.AISNP.txt (actual). Same for models/knn.PCA.Kidd.population.bin and models/knn.PCA.Kidd.superpopulation.bin.
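On a case-sensitive filesystem (like the Linux CI runners), `Kidd.AISNP.txt` and `KIDD.AISNP.txt` are different paths. One way to sketch the mismatch, and a tolerant lookup that would sidestep it, is below; this helper is hypothetical and not ezancestry's actual fix.

```python
# Sketch: resolve a filename against a directory listing, ignoring case.
# Illustrates the Kidd.AISNP.txt vs KIDD.AISNP.txt mismatch described above.
import tempfile
from pathlib import Path
from typing import Optional


def resolve_case_insensitive(directory: Path, wanted: str) -> Optional[Path]:
    """Return the entry in `directory` whose name matches `wanted`, ignoring case."""
    for candidate in directory.iterdir():
        if candidate.name.lower() == wanted.lower():
            return candidate
    return None


# Demonstrate with a throwaway directory containing only the uppercase file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "KIDD.AISNP.txt").touch()
    resolved = resolve_case_insensitive(data_dir, "Kidd.AISNP.txt")
    print(resolved.name)  # KIDD.AISNP.txt
```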

Hopefully that helps speed the troubleshooting along. 🙂

@arvkevi
Contributor Author

arvkevi commented Oct 13, 2021

Thanks, Andrew. I will cut a new release this weekend with a fix for the filenames. I'll also set up my own CI in ezancestry so we don't languish on this branch. Thanks for being so patient with this.

@arvkevi
Contributor Author

arvkevi commented Oct 19, 2021

I think I fixed the issue with the new release. The new errors are likely due to the newly trained models in the release. We can probably just update the asserted value.

@apriha
Owner

apriha commented Oct 19, 2021

I think we're good @arvkevi! What are your thoughts on also exposing the raw predictions dataframe?

@arvkevi
Contributor Author

arvkevi commented Oct 19, 2021

@apriha I think that's a good idea. I will put together some documentation with column descriptions.

@arvkevi
Contributor Author

arvkevi commented Oct 20, 2021

I'll leave this here and feel free to modify and incorporate wherever you like.

Populations described below are defined here.

'component1', 'component2', 'component3':
The coordinates of the sample in the dimensionality-reduced component space. Can be used as (x, y, z) coordinates for plotting in a 3D scatter plot.

'predicted_population_population':
The max predicted population for the sample.

'ACB', 'ASW', 'BEB', 'CDX', 'CEU', 'CHB', 'CHS', 'CLM', 'ESN', 'FIN', 'GBR', 'GIH', 'GWD', 'IBS', 'ITU', 'JPT', 'KHV', 'LWK', 'MSL', 'MXL', 'PEL', 'PJL', 'PUR', 'STU', 'TSI', 'YRI':
Predicted probabilities for each of the populations. These sum to 1.0.

'predicted_population_superpopulation':
The max predicted superpopulation (continental) for the sample.

'AFR', 'AMR', 'EAS', 'EUR', 'SAS':
Predicted probabilities for each of the superpopulations. These sum to 1.0.

'population_description', 'superpopulation_name':
Descriptive names of the population and superpopulation.
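A toy illustration of how these columns relate: the predicted (super)population is the argmax of the per-(super)population probabilities, which sum to 1.0. The probability values below are made up, not real ezancestry output.

```python
# Made-up superpopulation probabilities for a single sample; the predicted
# superpopulation column is simply the highest-probability key.
superpop_probs = {"AFR": 0.02, "AMR": 0.05, "EAS": 0.01, "EUR": 0.90, "SAS": 0.02}

# The probabilities sum to 1.0, as described above.
assert abs(sum(superpop_probs.values()) - 1.0) < 1e-9

predicted_population_superpopulation = max(superpop_probs, key=superpop_probs.get)
print(predicted_population_superpopulation)  # EUR
```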

@apriha
Owner

apriha commented Oct 22, 2021

@arvkevi updates incorporated. Please let me know what you think... If you agree, I think it's ready to merge. Thanks again for developing this awesome capability!

@arvkevi
Contributor Author

arvkevi commented Oct 22, 2021

LGTM @apriha, thank you for all your hard work on this PR!

@apriha apriha merged commit 2e8ccfe into apriha:develop Oct 22, 2021