Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rename counts to count in struct field using value_counts #11462

Closed
Julian-J-S opened this issue Oct 2, 2023 · 12 comments · Fixed by #12506
Closed

rename counts to count in struct field using value_counts #11462

Julian-J-S opened this issue Oct 2, 2023 · 12 comments · Fixed by #12506
Assignees
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Milestone

Comments

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Oct 2, 2023

Description

Best practices (in my experience) for naming columns in tabular data like a dataframe is to use singular nouns which refer to a single value in each row.

  • good: age, height, weight, name, address, city, state, country
  • bad: ages, heights, weights, names, addresses, cities, states, countries

The value_counts method violates this convention by returning a "counts" field which suggests that it contains multiple values per row

Recommended solution:

rename the "counts" field to "count"
Benefits:

  • follows best practices for naming columns
  • makes it clear that the field contains a single value per row
  • in line with the pl.count method

Example:

s = pl.Series('x', [1, 1, 2])

s.value_counts()
┌─────┬────────┐
│ xcounts<<< rename this to "count" (we only have a single "count" in each row)
│ ------    │
│ i64u32    │
╞═════╪════════╡
│ 21      │
│ 12      │
└─────┴────────┘

s.to_frame().group_by('x').agg(pl.count())
┌─────┬───────┐
│ xcount<<< like this------   │
│ i64u32   │
╞═════╪═══════╡
│ 12     │
│ 21     │
└─────┴───────┘
@Julian-J-S Julian-J-S added the enhancement New feature or an improvement of an existing feature label Oct 2, 2023
@ritchie46
Copy link
Member

I don't think this is worth the breaking change.

There is something to say for both. But naming you series plural to me feels very natural in relational data.

@Julian-J-S
Copy link
Contributor Author

Yeah, I am with you that it is hard to justify a breaking change for this small change :/

Was also thinking about instead adding an alternative count_values that does this... :D

@lyngc
Copy link

lyngc commented Oct 2, 2023

@ritchie46 You break tons of other stuff, which is good, why not make this right? Could also add a drive-by normalize parameter

@stinodego
Copy link
Member

I agree, this should be fixed in my opinion. We can't really deprecate it nicely though. Still, I'd vote for including this with the next breaking release.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 4, 2023

On the same topic; I'd be strongly in favour of changing .str.lengths() to .str.length(); this one has always felt really peculiar to me (almost everything else in the string namespace is singular, eg: to_date, not to_dates, etc) 😅

@stinodego stinodego added the accepted Ready for implementation label Oct 4, 2023
@github-project-automation github-project-automation bot moved this to Ready in Backlog Oct 4, 2023
@stinodego stinodego self-assigned this Oct 4, 2023
@stinodego stinodego moved this from Ready to Next in Backlog Oct 4, 2023
@Julian-J-S
Copy link
Contributor Author

happy to hear that this idea was accepted. =)

I am curious what you think about also renaming the function name.
I think the current value_counts is okay (copied from pandas), but not perfect.

Most function names fall into one of these categories:

  • verbs: describe, sort, filter, rename, drop, explode, melt, join
  • nouns: columns, shape, size, head, mean, sin
  • verbs + nouns: drop_nulls, get_column, concat_list, map_elements, extract_groups, count_matches

Following this very rough categorization, a new name could be:

  • count_values (verb + noun)
    Or ChatGPT suggestions ("how would you call a python function that counts the occurrences of each item in a list?"):
  • count_items (verb + noun)
  • count_occurrences (verb + noun)

@stinodego
Copy link
Member

stinodego commented Oct 4, 2023

If we were to rename this it should be count_distinct, in my opinion.

That would actually make it easier to rename the column as well (deprecate the old method with some message that the new version has a different column name).

@mcrumiller
Copy link
Contributor

If we were to rename this it should be count_distinct, in my opinion.

This one is a bit hard to rename. Coming from the outside, I would think that count_distinct would be the same as n_unique, i.e. we are counting how many distinct items there are. count_per_distinct is a bit more descriptive but also starts getting too verbose. I also don't really like how the word "unique" and "distinct" mean the same thing essentially and we have both uses in the API. It's hard to distinguish between counting the number of uniques and the number of items within each distinct item. Maybe group_sizes()?

@stinodego
Copy link
Member

I would think that count_distinct would be the same as n_unique

Ah yea fair point...

@Julian-J-S
Copy link
Contributor Author

I personally like count_occurences because it will "count" how often elements "occur".

@mcrumiller with regard to unique/distinct, you speak from my soul.
I have tried so many times to point out that they are different things but are usually used synonymously.
I don't know if I have the strength to tackle this again...

But here is a very simple example that shows the problem:
image

@mcrumiller
Copy link
Contributor

For clarification, "unique" means the value only occurs once in the dataset, distinct indicates how many values there are once all duplicates are removed. IN the above example, there is only one value that is unique (2 is not unique, since it is repeated).

Most programming languages use unique and distinct interchangeably. If I call s.unique() I typically expect to get list with duplicates removed. I'm guessing pandas wanted to avoid this ambiguity, which is why they called their function drop_duplicates. I don't know what polars' take on the matter is, I recall an is_unique discussion a while back that used the "occurs only once" definition, although the unique function returns distinct elements.

@stinodego
Copy link
Member

I have a PR that will address the original issue - we will merge this in the next breaking release.

Please open a separate issue for renaming value_counts to something else.

@stinodego stinodego modified the milestones: 1.0.0, 0.20.0 Nov 16, 2023
@stinodego stinodego moved this from Next to Candidate in Backlog Nov 30, 2023
@github-project-automation github-project-automation bot moved this from Candidate to Done in Backlog Dec 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants