defaultdict(int) vs Counter() and memory usage #905

Closed
peterjc opened this issue Mar 10, 2023 · 2 comments

@peterjc
Contributor

peterjc commented Mar 10, 2023

While submitting #904, I was looking at this part of the parse_uc function in biom/parse.py (Create a Table object from a uclust/usearch/vsearch uc file):

https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L282

I think it would be more memory efficient to replace defaultdict(int) with Counter(), see PyCQA/flake8-bugbear#323

However, that probably deserves a little benchmarking by someone familiar with the code, to see whether it actually matters for the memory overhead here.

@wasade
Member

wasade commented Mar 10, 2023

That's a really interesting observation. I think we're safe here, though, because we never look up keys that might be missing; we only increment specific keys (see https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L338). In PyCQA/flake8-bugbear#323, I think what's driving the memory bloat is the `if threshold < counts[str(x)]` check, which forces the creation of a key/default-value pair for every missing key it reads.
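
To illustrate the distinction, here is a minimal sketch. The data and variable names are hypothetical and not taken from biom/parse.py; the threshold check only mirrors the pattern quoted from the flake8-bugbear issue.

```python
from collections import defaultdict

# Hypothetical observation names, not real parse_uc input.
observations = ["otu1", "otu2", "otu1"]

counts = defaultdict(int)
for name in observations:
    # Increment-only access: a key is created only for items actually seen.
    counts[name] += 1

keys_after_increment = len(counts)

# Reading a missing key with [] triggers __missing__, which stores a 0 entry.
threshold = 5
for x in range(1000):
    if threshold < counts[str(x)]:
        pass

keys_after_lookup = len(counts)
print(keys_after_increment, keys_after_lookup)
```

With increment-only access the dictionary holds one entry per distinct item; the lookup loop then adds one entry for every key it probes.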

@peterjc
Contributor Author

peterjc commented Mar 10, 2023

Probably OK then. Yes, the memory bloat appears when you access missing entries (because each access adds a 0 entry under that key), and it is made worse by long string keys, as in my own code.
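
For comparison, a small sketch of how the two types behave on a missing key, using made-up keys. Counter returns 0 without storing anything, and `dict.get` gives the same effect on a defaultdict:

```python
from collections import Counter, defaultdict

dd = defaultdict(int)
c = Counter()
for key in ["a", "b", "a"]:  # hypothetical keys for illustration
    dd[key] += 1
    c[key] += 1

# Counter.__missing__ returns 0 and does not insert the key.
missing_from_counter = c["zzz"]

# dict.get bypasses __missing__, so the defaultdict is left unchanged.
missing_via_get = dd.get("zzz", 0)
sizes_before = (len(c), len(dd))

# A plain [] read on the defaultdict does insert the key.
_ = dd["zzz"]
sizes_after = (len(c), len(dd))
print(sizes_before, sizes_after)
```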

@peterjc peterjc closed this as completed Mar 10, 2023