Use category dtype
This is more memory efficient for columns with many duplicate values,
which can be expected in most use cases.

¹ <https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/scale.html#use-efficient-datatypes>
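The memory claim can be checked with a minimal sketch. It assumes the reader's `dtype` argument is passed through to `pandas.read_csv`, which this commit does not show. The TSV content and the `columns` list are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical metadata: 40,000 rows drawn from only four distinct regions.
regions = ["Asia", "Europe", "Africa", "Oceania"] * 10_000
tsv = "strain\tregion\n" + "\n".join(
    f"strain_{i}\t{region}" for i, region in enumerate(regions)
)
columns = ["region"]  # stand-in for useful_metadata_columns

# Old behaviour: every column is read with the "string" dtype.
as_string = pd.read_csv(
    io.StringIO(tsv), sep="\t", usecols=columns, dtype="string"
)

# New behaviour: every column is read with the "category" dtype.
as_category = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    usecols=columns,
    dtype={col: "category" for col in columns},
)

string_bytes = int(as_string.memory_usage(deep=True).sum())
category_bytes = int(as_category.memory_usage(deep=True).sum())

print(f"string:   {string_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct value once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.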
victorlin committed Mar 7, 2024
1 parent 7bb4650 commit 8d7206c
Showing 3 changed files with 4 additions and 4 deletions.
4 changes: 2 additions & 2 deletions augur/filter/_run.py
@@ -175,7 +175,7 @@ def run(args):
columns=useful_metadata_columns,
id_columns=[metadata_object.id_column],
chunk_size=args.metadata_chunk_size,
-dtype="string",
+dtype={col: 'category' for col in useful_metadata_columns},
)
for metadata in metadata_reader:
duplicate_strains = (
@@ -297,7 +297,7 @@ def run(args):
columns=useful_metadata_columns,
id_columns=args.metadata_id_columns,
chunk_size=args.metadata_chunk_size,
-dtype="string",
+dtype={col: 'category' for col in useful_metadata_columns},
)
for metadata in metadata_reader:
# Recalculate groups for subsampling as we loop through the
2 changes: 1 addition & 1 deletion tests/functional/filter/cram/filter-query-errors.t
@@ -22,7 +22,7 @@ Some error messages from Pandas may be useful, so they are exposed:
> --query "region >= 0.50" \
> --output-strains filtered_strains.txt > /dev/null
ERROR: Internal Pandas error when applying query:
-'>=' not supported between instances of 'str' and 'float'
+Unordered Categoricals can only compare equality or not
Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
[2]

2 changes: 1 addition & 1 deletion tests/functional/filter/cram/filter-query-numerical.t
@@ -30,7 +30,7 @@ The 'category' column will fail when used with a numerical comparison.
> --query "category >= 0.95" \
> --output-strains filtered_strains.txt
ERROR: Internal Pandas error when applying query:
-'>=' not supported between instances of 'str' and 'float'
+Unordered Categoricals can only compare equality or not
Ensure the syntax is valid per <https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query>.
[2]
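The changed error text in both test files follows from how pandas treats unordered categoricals. The sketch below reproduces it on a plain `Series` rather than through `augur filter --query`. The `region` values are made up for illustration.

```python
import pandas as pd

# Hypothetical column, read as an unordered categorical.
region = pd.Series(["Asia", "Europe", "Asia"], dtype="category", name="region")

# Equality comparisons are still supported on unordered categoricals.
is_asia = region == "Asia"

# Ordering comparisons are rejected by the categorical itself, so the
# older str-versus-float message is never reached.
try:
    region >= 0.50
    message = None
except TypeError as error:
    message = str(error)

print(is_asia.tolist())
print(message)
```

Queries that test equality should therefore behave as before, while ordering comparisons on these columns fail with the new message.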
