Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Handle various parameter combinations in replace API #7207

Merged
merged 26 commits into from
Feb 1, 2021

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Jan 26, 2021

Fixes: #7206

The replace API has two parameters to_replace & value which are overloaded and support different types of inputs for each of these two parameters have different behaviors. These changes introduce clear code-flow for each type of possible parameter combination. This way it would be easier to support newer parameters in future like regex & nested dict types, which would change the behaviour of to_replace & value parameters..

  • Ensure all combinations are covered for to_replace & value for both DataFrame.replace & Series.replace.
  • Document changes inline & Update func docs.
  • Add tests to include coverage for all combinations that are not yet covered.

@galipremsagar galipremsagar added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels Jan 26, 2021
@galipremsagar galipremsagar self-assigned this Jan 26, 2021
@codecov
Copy link

codecov bot commented Jan 26, 2021

Codecov Report

Merging #7207 (a68e754) into branch-0.18 (8860baf) will increase coverage by 0.16%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #7207      +/-   ##
===============================================
+ Coverage        82.09%   82.25%   +0.16%     
===============================================
  Files               97       99       +2     
  Lines            16474    16890     +416     
===============================================
+ Hits             13524    13893     +369     
- Misses            2950     2997      +47     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/_typing.py 92.30% <ø> (ø)
python/cudf/cudf/core/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/abc.py 91.48% <ø> (+4.25%) ⬆️
python/cudf/cudf/core/buffer.py 80.00% <ø> (+0.95%) ⬆️
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 92.55% <ø> (-0.80%) ⬇️
python/cudf/cudf/core/column/column.py 87.75% <ø> (-0.39%) ⬇️
... and 71 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b608832...a68e754. Read the comment docs.

@galipremsagar galipremsagar added the non-breaking Non-breaking change label Jan 28, 2021
@galipremsagar galipremsagar marked this pull request as ready for review January 28, 2021 18:43
@galipremsagar galipremsagar requested a review from a team as a code owner January 28, 2021 18:43
@galipremsagar galipremsagar changed the title [WIP] Handle various parameter combinations in replace API [REVIEW] Handle various parameter combinations in replace API Jan 28, 2021
@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels Jan 28, 2021
@galipremsagar galipremsagar added the bug Something isn't working label Jan 28, 2021
if not isinstance(to_replace_col, NumericalColumn) or not isinstance(
replacement_col, NumericalColumn
):
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If replacing numbers with e.g., strings, should we fail instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should allow this operation. Reason is that a dataframe that could have a mix of string and numeric values will be called with some generic to_replace & value params like:

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1, 2, 3, 4], 'strings':['a','b','d','e']})
>>> df
   a strings
0  1       a
1  2       b
2  3       d
3  4       e
>>> df.replace(1, 111)
     a strings
0  111       a
1    2       b
2    3       d
3    4       e
>>> df.replace("a", "aaa")
   a strings
0  1     aaa
1  2       b
2  3       d
3  4       e

In such cases failing with an error message asking the user to drop the rest of the columns and do replace and again the caller of the API will have to perform the reordering/rejoining of the dataframe columns would be painful. Instead, we can simply ignore if we detect mismatching dtypes are passed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. What happens when we do the same with a Series? For example:

In [2]: a = cudf.Series([1, 2, 3])

In [3]: a.replace(1, "a")

Do we fail or silently return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return silently in case of series aswell.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas actually does the replace here, which makes me think we should either raise or clearly document the difference in behaviour with an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example and mentioned the difference in behavior with respect to pandas.

Copy link
Collaborator

@kkraus14 kkraus14 Jan 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think we should throw in this situation. If the dtype of to_replace and value aren't the same with reason (allow different integers / floats for example), we're in a situation we don't want to support in cuDF and we should throw accordingly.

This is different than the case where to_replace and value are the same type but a different type than self, in which case silently returning a copy at least somewhat makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so we want to error loudly for:

>>> s = cudf.Series(['a', 'b', 'c'])
>>> s.replace(['a', 'b'], [0, 1])

and return silently for:

>>> s = cudf.Series(['a', 'b', 'c'])
>>> s.replace([1, 2], [3, 4])

Did I get it right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, those are my thoughts at least. What do you think @galipremsagar and @shwina?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to me. I have added the validation and tests validating them.

Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
python/cudf/cudf/core/column/numerical.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/numerical.py Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@galipremsagar galipremsagar added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 1, 2021
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b8cb8c7 into rapidsai:branch-0.18 Feb 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] replace is not handling dict + scalar combination for to_replace & value params respectively
4 participants