defaultdict(int) vs Counter() and memory usage #905

peterjc · 2023-03-10T14:53:01Z

While submitting #904, I was looking at this bit of the parse_uc function in biom/parse.py (Create a Table object from a uclust/usearch/vsearch uc file):

https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L282

I think it would be more memory efficient to replace defaultdict(int) with Counter(), see PyCQA/flake8-bugbear#323

However, that probably deserves a little benchmarking by someone familiar with the code to see if it actually matters for the memory overhead here?

wasade · 2023-03-10T16:15:20Z

That's a really interesting observation. I think we're safe here though, as we are not testing for the presence of missing keys but only incrementing specific keys (see https://github.com/biocore/biom-format/blob/2.1.14/biom/parse.py#L338). In PyCQA/flake8-bugbear#323, I think what's driving the memory bloat is the if threshold < counts[str(x)] check, which forces the creation of a key: default value pair for every key it looks up.
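To illustrate the increment-only case, here is a minimal sketch. It is not the actual parse_uc code; the `records` list and its (observation, sample) tuples are hypothetical stand-ins for parsed uc lines. When every access is an increment, both containers should end up holding exactly the keys that were seen and nothing more.

```python
from collections import Counter, defaultdict

# Hypothetical (observation, sample) pairs standing in for parsed uc records.
records = [("otu1", "s1"), ("otu1", "s1"), ("otu2", "s1"), ("otu1", "s2")]

dd_counts = defaultdict(int)
c_counts = Counter()

for key in records:
    # Increment only: each key touched here is one we genuinely want to store.
    dd_counts[key] += 1
    c_counts[key] += 1

# Both containers hold the same keys and the same counts.
same_contents = dict(dd_counts) == dict(c_counts)
n_keys = len(dd_counts)
```

With this access pattern there is no spurious entry for either type to avoid, so any memory difference would have to come from the container objects themselves rather than from extra keys.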

peterjc · 2023-03-10T16:20:16Z

Probably OK then - yes, the memory bloat happens if you access missing entries (because each lookup then adds a 0 entry under that key), and it is made worse if you use long strings as keys, as I was in my own code.
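The read-access behaviour described above can be sketched as follows. The `threshold` value and the range of keys are hypothetical, chosen only to mirror the shape of the check quoted from PyCQA/flake8-bugbear#323.

```python
from collections import Counter, defaultdict

threshold = 5  # hypothetical cut-off, for illustration only

dd = defaultdict(int)
c = Counter()

for x in range(1000):
    # Reading a missing key from a defaultdict calls the default factory
    # and stores the result, so every lookup here adds a 0 entry.
    if threshold < dd[str(x)]:
        pass
    # Counter returns 0 for a missing key without storing anything.
    if threshold < c[str(x)]:
        pass

dd_size = len(dd)  # one stored entry per key looked up
c_size = len(c)    # nothing stored

# A plain read that avoids insertion on a defaultdict: dict.get with a default.
probe = dd.get("not-present", 0)
dd_size_after_get = len(dd)
```

If a defaultdict has to be kept, reading through `.get(key, 0)` or testing `key in dd` first avoids the extra entries.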

peterjc closed this as completed Mar 10, 2023
