rename `counts` to `count` in struct field using `value_counts` #11462

Julian-J-S · 2023-10-02T09:32:53Z

Description

Best practice (in my experience) for naming columns in tabular data such as a dataframe is to use singular nouns, because each name refers to a single value in each row.

good: age, height, weight, name, address, city, state, country
bad: ages, heights, weights, names, addresses, cities, states, countries

The value_counts method violates this convention by returning a "counts" field, which suggests that it contains multiple values per row.

Recommended solution:

- rename the "counts" field to "count"

Benefits:

- follows best practices for naming columns
- makes it clear that the field contains a single value per row
- in line with the pl.count method

Example:

import polars as pl

s = pl.Series('x', [1, 1, 2])

s.value_counts()
┌─────┬────────┐
│ x   ┆ counts │ <<< rename this to "count" (we only have a single "count" in each row)
│ --- ┆ ---    │
│ i64 ┆ u32    │
╞═════╪════════╡
│ 2   ┆ 1      │
│ 1   ┆ 2      │
└─────┴────────┘

s.to_frame().group_by('x').agg(pl.count())
┌─────┬───────┐
│ x   ┆ count │ <<< like this
│ --- ┆ ---   │
│ i64 ┆ u32   │
╞═════╪═══════╡
│ 1   ┆ 2     │
│ 2   ┆ 1     │
└─────┴───────┘


ritchie46 · 2023-10-02T09:48:28Z

I don't think this is worth the breaking change.

There is something to be said for both. But naming your series in the plural feels very natural to me in relational data.

Julian-J-S · 2023-10-02T09:57:27Z

Yeah, I am with you that it is hard to justify a breaking change for this small change :/

Was also thinking about instead adding an alternative count_values that does this... :D

lyngc · 2023-10-02T20:09:40Z

@ritchie46 You break tons of other stuff, which is good, so why not make this right? Could also add a drive-by normalize parameter.

stinodego · 2023-10-03T08:48:25Z

I agree, this should be fixed in my opinion. We can't really deprecate it nicely though. Still, I'd vote for including this with the next breaking release.

alexander-beedie · 2023-10-04T06:38:16Z

On the same topic: I'd be strongly in favour of changing .str.lengths() to .str.length(). This one has always felt really peculiar to me (almost everything else in the string namespace is singular, e.g. to_date, not to_dates) 😅

Julian-J-S · 2023-10-04T12:03:14Z

happy to hear that this idea was accepted. =)

I am curious what you think about also renaming the function name.
I think the current value_counts is okay (copied from pandas), but not perfect.

Most function names fall into one of these categories:

verbs: describe, sort, filter, rename, drop, explode, melt, join
nouns: columns, shape, size, head, mean, sin
verbs + nouns: drop_nulls, get_column, concat_list, map_elements, extract_groups, count_matches

Following this very rough categorization, a new name could be:

- count_values (verb + noun)

Or ChatGPT suggestions ("how would you call a python function that counts the occurrences of each item in a list?"):

- count_items (verb + noun)
- count_occurrences (verb + noun)

stinodego · 2023-10-04T12:07:49Z

If we were to rename this it should be count_distinct, in my opinion.

That would actually make it easier to rename the column as well (deprecate the old method with some message that the new version has a different column name).
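
A minimal sketch of that deprecation path, in plain Python rather than the polars codebase. The name count_values is only one of the candidates floated in this thread, and the list-of-dicts return shape is an assumption made for illustration; neither reflects an actual polars API.

```python
import warnings
from collections import Counter


def count_values(values):
    """Hypothetical new method: one row per value, with a singular "count" field."""
    return [{"value": v, "count": n} for v, n in Counter(values).items()]


def value_counts(values):
    """Old name kept as a deprecated wrapper that still emits the old "counts" field."""
    warnings.warn(
        "value_counts is deprecated; use count_values, "
        "which names the field 'count' instead of 'counts'.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Translate the new field name back to the old one for existing callers.
    return [{"value": r["value"], "counts": r["count"]} for r in count_values(values)]
```

Under this approach, existing callers keep the old field name and see a warning, while new code opts in to the singular name by switching methods.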

mcrumiller · 2023-10-04T12:16:21Z

> If we were to rename this it should be count_distinct, in my opinion.

This one is a bit hard to rename. Coming from the outside, I would think that count_distinct would be the same as n_unique, i.e. we are counting how many distinct items there are. count_per_distinct is a bit more descriptive but also starts getting too verbose. I also don't really like how the words "unique" and "distinct" essentially mean the same thing and we use both in the API. It's hard to distinguish between counting the number of uniques and counting the number of items within each distinct value. Maybe group_sizes()?
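
To make the two readings concrete, here is a small plain-Python sketch (standard library only, not the polars API). It reuses the values from the example in the issue description; the variable names are made up for illustration.

```python
from collections import Counter

values = [1, 1, 2]  # same values as the Series in the issue description

# Reading 1: how many distinct values exist (what n_unique suggests).
n_distinct = len(set(values))

# Reading 2: how often each distinct value occurs (what value_counts returns).
count_per_value = dict(Counter(values))

print(n_distinct)       # 2
print(count_per_value)  # {1: 2, 2: 1}
```

A name like count_distinct could plausibly describe either result, which is the ambiguity raised here.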

stinodego · 2023-10-04T12:25:37Z

> I would think that count_distinct would be the same as n_unique

Ah yea fair point...

Julian-J-S · 2023-10-04T12:59:52Z

I personally like count_occurrences because it will "count" how often elements "occur".

@mcrumiller with regard to unique/distinct, you speak from my soul.
I have tried so many times to point out that they are different things but are usually used synonymously.
I don't know if I have the strength to tackle this again...

But here is a very simple example that shows the problem:

mcrumiller · 2023-10-04T13:27:10Z

For clarification, "unique" means the value occurs only once in the dataset, while "distinct" describes the values that remain once all duplicates are removed. In the above example, there is only one value that is unique (2 is not unique, since it is repeated).

Most programming languages use unique and distinct interchangeably. If I call s.unique() I typically expect to get a list with duplicates removed. I'm guessing pandas wanted to avoid this ambiguity, which is why they called their function drop_duplicates. I don't know what polars' take on the matter is; I recall an is_unique discussion a while back that used the "occurs only once" definition, although the unique function returns distinct elements.
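
The difference between the two definitions can be sketched in plain Python (standard library only, not the polars API). The sample data and variable names below are made up for illustration and are not the example referred to above.

```python
from collections import Counter

values = ["a", "a", "b", "c"]  # hypothetical sample data
counts = Counter(values)

# "Distinct": every value that remains once duplicates are removed.
distinct = list(counts)

# "Unique" in the strict sense: values that occur exactly once.
strictly_unique = [v for v, n in counts.items() if n == 1]

print(distinct)         # ['a', 'b', 'c']
print(strictly_unique)  # ['b', 'c']
```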

stinodego · 2023-11-16T09:15:16Z

I have a PR that will address the original issue - we will merge this in the next breaking release.

Please open a separate issue for renaming value_counts to something else.

Julian-J-S added the enhancement label (New feature or an improvement of an existing feature) Oct 2, 2023

stinodego added the accepted label (Ready for implementation) Oct 4, 2023

github-project-automation bot added this to Backlog Oct 4, 2023

github-project-automation bot moved this to Ready in Backlog Oct 4, 2023

stinodego self-assigned this Oct 4, 2023

stinodego moved this from Ready to Next in Backlog Oct 4, 2023

stinodego mentioned this issue Oct 6, 2023

Rename .list.lengths and .str.lengths #11577

Closed

stinodego added this to the 1.0.0 milestone Nov 14, 2023

stinodego mentioned this issue Nov 16, 2023

feat!: Change value_counts resulting column name from counts to count #12506

Merged

stinodego modified the milestones: 1.0.0, 0.20.0 Nov 16, 2023

stinodego moved this from Next to Candidate in Backlog Nov 30, 2023

stinodego closed this as completed in #12506 Dec 3, 2023

github-project-automation bot moved this from Candidate to Done in Backlog Dec 3, 2023

cmdlineluser mentioned this issue Mar 22, 2024

Rename rle() struct fields to len and value #15230

Closed
