Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2432: [Python] Fix Pandas decimal type conversion with None values #1878

Conversation

BryanCutler
Copy link
Member

This fixes conversion of Pandas decimal types to Arrow with None values. Previously, if the type was specified, an error would occur when checking if the object was Decimal. If the type was not specified, a segmentation fault would occur when attempting to find the max precision and scale.

Added new tests which include None values for both the above cases.

@BryanCutler BryanCutler requested a review from cpcloud April 10, 2018 18:44
@BryanCutler
Copy link
Member Author

Ping @cpcloud and @pitrou, please take a look when you can, thanks!

@@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc
}

Status DecimalMetadata::Update(PyObject* object) {
DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
bool is_decimal = PyDecimal_Check(object);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's ok to do this in optimized build. DecimalMetadata expects you to pass a decimal object. @cpcloud may confirm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't necessary because I added a check before calling Update but it does prevent a seg fault if for some reason it's called with non Decimal objects - which is not nice to get. If it's hurts an optimization though, I can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we are doing the check twice in optimized builds, which is not nice IMHO. DecimalMetadata::Update is a private API so it's up to the caller to provide appropriate input.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean remove PyDecimal_Check all together? This is only called when the type is not specified by the user and then yes, it will end doing 2 passes over the objects and checks both times if they are decimal. It might be possible to do less checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-)

@@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {

if (type_ == NULLPTR) {
for (PyObject* object : objects) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
if (!internal::PandasObjectIsNull(object)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about accepting other NULL-like objects such as float('nan')? Otherwise object != Py_None is a much faster check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python decimal objects can be nan, unfortunately:

>>> import decimal
>>> decimal.Decimal('nan')
Decimal('NaN')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it could be NaN also:

In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])

In [6]: s2 = pd.Series([Decimal('2.0'), None])

In [7]: s1 / s2
Out[7]: 
0    0.5
1    NaN
dtype: object

RETURN_NOT_OK(max_decimal_metadata.Update(object));
if (!internal::PandasObjectIsNull(object)) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
}
}

type_ =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, what happens here if all items are None? Do we have a test for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Decimal128 value;
RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
RETURN_NOT_OK(builder.Append(value));
} else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above: do we care about other NULL-like values than simply None?

series = pd.Series(data)
_check_series_roundtrip(series, type_=pa.decimal128(12, 5))

def test_decimal_with_None_infer_type(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check that the expected type is inferred?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

_check_series_roundtrip(s, type_=pa.binary())
# Infer type from bytearrays
_check_series_roundtrip(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you should pass expected_pa_type here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooops, right - thanks for catching that!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@pitrou pitrou closed this in 9ad8602 Apr 12, 2018
@BryanCutler
Copy link
Member Author

Thanks @pitrou!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants