Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: make dense ranks results scale to 100 percent #21203

Merged
merged 2 commits into from
May 31, 2018

Conversation

peterpanmj
Copy link
Contributor

@peterpanmj peterpanmj commented May 25, 2018

@@ -167,11 +167,11 @@ def test_infs_n_nans(grps, vals, ties_method, ascending, na_option, exp):
('dense', True, 'keep', False,
[1., 1., np.nan, 3., 1., 2., np.nan, np.nan]),
('dense', True, 'keep', True,
[0.2, 0.2, np.nan, 0.6, 0.2, 0.4, np.nan, np.nan]),
[1. / 3., 1. / 3., np.nan, 3. / 3., 1. / 3., 2. / 3., np.nan, np.nan]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use literals instead of expressions here?

Copy link
Contributor Author

@peterpanmj peterpanmj May 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@that will be 0.3333333333333333. A little hard to read. Still use literal ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yea don’t do that. If it’s minimal effort to replace the data with something more easily divisible please do so. Otherwise then I’m ok with it as is

grp_na_count = 0
val_start = i + 1
grp_start = i + 1
lab_start = i + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same thing as grp_start?

grp_sizes[_as[j], 0] = i - grp_start + 1 - grp_na_count
dups = sum_ranks = 0
if pct:
if tiebreak != TIEBREAK_DENSE:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this here is it not possible to alter the handling of the tie_count variable incrementing above depending on whether or not we are using pct and TIEBREAK_DENSE together? Seems like it would be simpler to do that then to implement another branch for assigning percents, if that's possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean move the logic to here ?

                elif tiebreak == TIEBREAK_DENSE:
                    for j in range(i - dups + 1, i + 1):
                        out[_as[j], 0] = grp_vals_seen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm well I suppose that that would alter the non-pct items, but my overall point is I feel like this can be done more succinctly. In either case (TIEBREAK_DENSE or not) it's a matter of dividing by the right denominator to get the correct value.

So I think it would be cleaner to do something along the lines of:

if TIEBREAK_DENSE:
    denominator = ...
else:
    denominator = ...

for j in range(lab_start, i + 1):
    out[_as[j], 0] = out[_as[j], 0] / denominator

@codecov
Copy link

codecov bot commented May 27, 2018

Codecov Report

Merging #21203 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21203   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49543    49543           
=======================================
  Hits        45504    45504           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.24% <ø> (ø) ⬆️
#single 41.87% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5348e06...c5b2756. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note. otherwise lgtm.

@jreback jreback added the Bug label May 29, 2018
@jreback
Copy link
Contributor

jreback commented May 29, 2018

I guess this should be in 0.24 as its a numeric change. any takers for 0.23.1?

@WillAyd
Copy link
Member

WillAyd commented May 29, 2018

I think its minor enough to be in 0.23.1 but don't have that strong a preference either way

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, let's move to 0.23.1 @WillAyd merge when ready.

grp_sizes[_as[j], 0] = i - grp_start + 1 - grp_na_count
else:
for j in range(grp_start, i + 1):
grp_sizes[_as[j], 0] = (grp_tie_count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, but can you put the - on the previous line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break line after the operator ?

                    for j in range(grp_start, i + 1):
                        grp_sizes[_as[j], 0] = (grp_tie_count -
                                                (grp_na_count > 0))

@@ -111,7 +111,7 @@ Offsets
Numeric
^^^^^^^

-
- Bug in :func:`pandas.core.groupby.GroupBy.rank` where results did not scale to 100% when specifying ``method='dense'`` and ``pct=True``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok let's move to 0.23.1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@jreback jreback added this to the 0.23.1 milestone May 31, 2018
@WillAyd WillAyd changed the title BUG: make dense ranks results scale to 100 percent (#20731) BUG: make dense ranks results scale to 100 percent May 31, 2018
@WillAyd WillAyd merged commit b237b11 into pandas-dev:master May 31, 2018
@WillAyd
Copy link
Member

WillAyd commented May 31, 2018

Thanks @peterpanmj !

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018
jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018
david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby.rank results do not scale to 100% for dense
4 participants