-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Handle various parameter combinations in replace
API
#7207
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-0.18 #7207 +/- ##
===============================================
+ Coverage 82.09% 82.25% +0.16%
===============================================
Files 97 99 +2
Lines 16474 16890 +416
===============================================
+ Hits 13524 13893 +369
- Misses 2950 2997 +47
Continue to review full report at Codecov.
|
replace
APIreplace
API
if not isinstance(to_replace_col, NumericalColumn) or not isinstance( | ||
replacement_col, NumericalColumn | ||
): | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If replacing numbers with e.g., strings, should we fail instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should allow this operation. Reason is that a dataframe that could have a mix of string and numeric values will be called with some generic to_replace
& value
params like:
>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1, 2, 3, 4], 'strings':['a','b','d','e']})
>>> df
a strings
0 1 a
1 2 b
2 3 d
3 4 e
>>> df.replace(1, 111)
a strings
0 111 a
1 2 b
2 3 d
3 4 e
>>> df.replace("a", "aaa")
a strings
0 1 aaa
1 2 b
2 3 d
3 4 e
In such cases failing with an error message asking the user to drop the rest of the columns and do replace and again the caller of the API will have to perform the reordering/rejoining of the dataframe columns would be painful. Instead, we can simply ignore if we detect mismatching dtypes are passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. What happens when we do the same with a Series
? For example:
In [2]: a = cudf.Series([1, 2, 3])
In [3]: a.replace(1, "a")
Do we fail or silently return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return silently in case of series aswell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas actually does the replace here, which makes me think we should either raise or clearly document the difference in behaviour with an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an example and mentioned the difference in behavior with respect to pandas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think we should throw in this situation. If the dtype of to_replace
and value
aren't the same with reason (allow different integers / floats for example), we're in a situation we don't want to support in cuDF and we should throw accordingly.
This is different than the case where to_replace
and value
are the same type but a different type than self
, in which case silently returning a copy at least somewhat makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, so we want to error loudly for:
>>> s = cudf.Series(['a', 'b', 'c'])
>>> s.replace(['a', 'b'], [0, 1])
and return silently for:
>>> s = cudf.Series(['a', 'b', 'c'])
>>> s.replace([1, 2], [3, 4])
Did I get it right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, those are my thoughts at least. What do you think @galipremsagar and @shwina?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to me. I have added the validation and tests validating them.
Co-authored-by: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
@gpucibot merge |
Fixes: #7206
The
replace
API has two parametersto_replace
&value
which are overloaded and support different types of inputs for each of these two parameters have different behaviors. These changes introduce clear code-flow for each type of possible parameter combination. This way it would be easier to support newer parameters in future likeregex
& nested dict types, which would change the behaviour ofto_replace
&value
parameters..to_replace
&value
for bothDataFrame.replace
&Series.replace
.