Feature/parse large files low memory #33

Merged


willgdjones
Contributor

@willgdjones willgdjones commented Oct 22, 2019

This PR implements functionality to efficiently parse large VCF files without needing to decompress them. It also provides functionality to specify a set of rsids to extract from the file:

df = SNPs(bytes_data, rsids=["rs1", "rs2"])

This functionality only works when passing in a bytes_data object, which must be a valid, gzip-compressed byte string.

I am able to extract rsids from the output of an imputation pipeline (a VCF file, ~450 MB compressed) in ~1 minute.
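
To illustrate the general idea, here is a minimal sketch using only the standard library. It is not the code in this PR: `iter_vcf_records` is a hypothetical helper, and it assumes the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT). It streams the gzip-compressed bytes line by line, so the decompressed file is never held in memory in full, and keeps only the requested rsids.

```python
import gzip
import io


def iter_vcf_records(bytes_data, rsids=None):
    """Yield (rsid, chrom, pos, ref, alt) tuples from gzip-compressed VCF bytes.

    Hypothetical helper for illustration; not part of the snps API.
    """
    # A set gives constant-time membership checks for the rsid filter.
    wanted = set(rsids) if rsids else None

    # gzip.open accepts a file object, so the bytes are decompressed
    # incrementally as lines are read.
    with gzip.open(io.BytesIO(bytes_data), "rt") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # skip malformed or empty lines
            chrom, pos, rsid, ref, alt = fields[:5]
            if wanted is not None and rsid not in wanted:
                continue
            yield rsid, chrom, int(pos), ref, alt


# Tiny in-memory example (made-up data, not from the imputation output).
sample = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\trs1\tA\tG\n"
    "1\t200\trs3\tC\tT\n"
)
bytes_data = gzip.compress(sample.encode())
print(list(iter_vcf_records(bytes_data, rsids=["rs1", "rs2"])))
# → [('rs1', '1', 100, 'A', 'G')]
```

Because the function is a generator, memory use stays roughly constant regardless of file size; only the matching records are materialized by the caller.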

This should be merged after #32.

@willgdjones
Contributor Author

I think I need to match my black settings with yours. What editor settings do you use?

@apriha
Owner

apriha commented Oct 23, 2019

> I think I need to match my black settings with yours. What editor settings do you use?

Just the default settings, integrated with PyCharm as an external tool: https://github.com/psf/black#editor-integration . You can run black --check --diff . before a commit to see if it would reformat any files.

@codecov-io

codecov-io commented Oct 23, 2019

Codecov Report

Merging #33 into develop will increase coverage by 1.4%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           develop      #33     +/-   ##
==========================================
+ Coverage    89.15%   90.56%   +1.4%     
==========================================
  Files            5        5             
  Lines         1097     1123     +26     
  Branches       196      204      +8     
==========================================
+ Hits           978     1017     +39     
+ Misses          73       60     -13     
  Partials        46       46

| Impacted Files | Coverage Δ |
| --- | --- |
| src/snps/__init__.py | 94.11% <100%> (+0.07%) ⬆️ |
| src/snps/io.py | 87.32% <95.74%> (+1.69%) ⬆️ |
| src/snps/resources.py | 94.33% <0%> (+3.77%) ⬆️ |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a876341...f891388. Read the comment docs.

@willgdjones
Contributor Author

Ok! I've matched the settings now.

@willgdjones
Contributor Author

Hi @apriha - the refactoring that you've made to this PR looks good to me, thanks for doing so.

@apriha
Owner

apriha commented Oct 25, 2019

Thanks again, @willgdjones! Let me know if you agree with the latest commits and I'll merge the PR.

@willgdjones
Contributor Author

Those changes look good @apriha !

@apriha apriha merged commit ba9ef91 into apriha:develop Oct 30, 2019
@apriha apriha mentioned this pull request Dec 20, 2019
apriha pushed a commit that referenced this pull request Jun 24, 2021