ARROW-2432: [Python] Fix Pandas decimal type conversion with None values #1878
Conversation
@@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc
 }

 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
I don't think it's OK to do this in an optimized build. DecimalMetadata expects you to pass a decimal object. @cpcloud may confirm.
This isn't necessary because I added a check before calling Update, but it does prevent a segfault if for some reason it's called with non-Decimal objects, which is not nice to get. If it hurts an optimization, though, I can remove it.
Right now we are doing the check twice in optimized builds, which is not nice IMHO. DecimalMetadata::Update is a private API, so it's up to the caller to provide appropriate input.
So you mean remove PyDecimal_Check altogether? This is only called when the type is not specified by the user, and then yes, it will end up doing two passes over the objects and check both times whether they are decimal. It might be possible to do fewer checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it.
Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-)
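For illustration only, here is a rough Python analogue (not Arrow code) of what the first inference pass being discussed computes: scan the objects, skip nulls, and track the digits needed before and after the decimal point. The helper name and the simplifications (no positive exponents, no NaN handling) are assumptions made for this sketch, not the library's actual logic.

from decimal import Decimal

def max_decimal_metadata(objects):
    # Track the largest number of digits seen before and after the decimal
    # point; together they bound the precision and scale of the column.
    max_whole, max_frac = 0, 0
    for obj in objects:
        if obj is None:  # analogous to skipping null-like entries
            continue
        _sign, digits, exponent = obj.as_tuple()
        frac = max(-exponent, 0)
        whole = max(len(digits) - frac, 0)
        max_whole = max(max_whole, whole)
        max_frac = max(max_frac, frac)
    return max_whole + max_frac, max_frac  # (precision, scale)

print(max_decimal_metadata([Decimal("1.0"), None, Decimal("123.456")]))  # (6, 3)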
@@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {

   if (type_ == NULLPTR) {
     for (PyObject* object : objects) {
-      RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      if (!internal::PandasObjectIsNull(object)) {
Do we care about accepting other NULL-like objects such as float('nan')? Otherwise object != Py_None is a much faster check.
I'm not sure, is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow?
Python decimal objects can be nan, unfortunately:
>>> import decimal
>>> decimal.Decimal('nan')
Decimal('NaN')
Seems like it could be NaN also:
In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])
In [6]: s2 = pd.Series([Decimal('2.0'), None])
In [7]: s1 / s2
Out[7]:
0    0.5
1    NaN
dtype: object
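As a small illustration (not part of the PR, and pandas' pd.isna is not the same function as Arrow's internal PandasObjectIsNull), these are the kinds of NULL-like values under discussion and how pandas itself classifies them; the exact results may vary with the pandas version:

import decimal
import numpy as np
import pandas as pd

# None, a float NaN, a NumPy NaN and a Decimal NaN are all "NULL-like"
# in the sense discussed above; print how pandas classifies each one.
for obj in [None, float("nan"), np.nan, decimal.Decimal("NaN")]:
    print(repr(obj), pd.isna(obj))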
-      RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      if (!internal::PandasObjectIsNull(object)) {
+        RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      }
     }

     type_ =
By the way, what happens here if all items are None? Do we have a test for that?
I'll add that
done
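Something along these lines can exercise the all-None case; this is only a sketch, not the exact test added in this PR, and the inferred type for an all-None column depends on the conversion path and the pyarrow version:

import pandas as pd
import pyarrow as pa

# All-None object Series: convert with an explicit decimal type and check
# that every value comes through as null.
s = pd.Series([None, None], dtype=object)
arr = pa.Array.from_pandas(s, type=pa.decimal128(12, 5))
print(arr.type, arr.null_count)  # expected: decimal128(12, 5) 2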
        Decimal128 value;
        RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
        RETURN_NOT_OK(builder.Append(value));
      } else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {
Same question as above: do we care about other NULL-like values than simply None?
        series = pd.Series(data)
        _check_series_roundtrip(series, type_=pa.decimal128(12, 5))

    def test_decimal_with_None_infer_type(self):
Can you also check that the expected type is inferred?
sure
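A minimal sketch of such a check, shown directly rather than through the test suite's _check_series_roundtrip helper; the exact precision and scale depend on the values in the series:

import pandas as pd
import pyarrow as pa
from decimal import Decimal

# Let Arrow infer the type from the Decimal values (no type passed) and
# then inspect the inferred type.
s = pd.Series([Decimal("3.14"), None])
arr = pa.Array.from_pandas(s)
print(arr.type)  # e.g. decimal128(3, 2)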
        _check_series_roundtrip(s, type_=pa.binary())
        # Infer type from bytearrays
        _check_series_roundtrip(s)
But you should pass expected_pa_type here.
Oops, right - thanks for catching that!
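The same kind of assertion applies to the bytearray case; a minimal sketch under the assumption that inference maps bytearray values to a binary Arrow type, as the existing test implies:

import pandas as pd
import pyarrow as pa

# bytearray values should round-trip through an Arrow binary type when the
# type is inferred rather than passed explicitly.
s = pd.Series([bytearray(b"ab"), bytearray(b"cd")])
arr = pa.Array.from_pandas(s)
print(arr.type)  # binary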
LGTM.
Thanks @pitrou!
This fixes conversion of Pandas decimal types to Arrow with None values. Previously, if the type was specified, an error would occur when checking whether the object was a Decimal. If the type was not specified, a segmentation fault would occur when attempting to find the maximum precision and scale.
Added new tests that include None values for both of the above cases.
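A short repro sketch of the two scenarios described above (with this fix both conversions should succeed; the column values and variable names are illustrative):

import pandas as pd
import pyarrow as pa
from decimal import Decimal

s = pd.Series([Decimal("1.23"), None])

# Case 1: type specified explicitly (previously raised an error on None).
arr_explicit = pa.Array.from_pandas(s, type=pa.decimal128(12, 5))

# Case 2: type inferred from the values (previously could segfault on None).
arr_inferred = pa.Array.from_pandas(s)

print(arr_explicit.type, arr_inferred.type)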