ARROW-2432: [Python] Fix Pandas decimal type conversion with None values #1878
Conversation
@@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc
 }

 Status DecimalMetadata::Update(PyObject* object) {
-  DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
+  bool is_decimal = PyDecimal_Check(object);
I don't think it's OK to do this in an optimized build. DecimalMetadata expects you to pass a decimal object. @cpcloud may confirm.
This isn't necessary because I added a check before calling Update, but it does prevent a segfault if for some reason it's called with non-Decimal objects, which is not nice to get. If it hurts an optimization, though, I can remove it.
Right now we are doing the check twice in optimized builds, which is not nice IMHO. DecimalMetadata::Update is a private API, so it's up to the caller to provide appropriate input.
So you mean remove PyDecimal_Check altogether? This is only called when the type is not specified by the user, and then yes, it will end up doing two passes over the objects and check both times whether they are decimal. It might be possible to do fewer checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it.
Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-)
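For illustration only, here is a rough Python analogue (not Arrow code) of what the first inference pass being discussed computes: scan the objects, skip nulls, and track the digits needed before and after the decimal point. The helper name and the simplifications (no positive exponents, no NaN handling) are assumptions made for this sketch, not the library's actual logic.

from decimal import Decimal

def max_decimal_metadata(objects):
    # Track the largest number of digits seen before and after the decimal
    # point; together they bound the precision and scale of the column.
    max_whole, max_frac = 0, 0
    for obj in objects:
        if obj is None:  # analogous to skipping null-like entries
            continue
        _sign, digits, exponent = obj.as_tuple()
        frac = max(-exponent, 0)
        whole = max(len(digits) - frac, 0)
        max_whole = max(max_whole, whole)
        max_frac = max(max_frac, frac)
    return max_whole + max_frac, max_frac  # (precision, scale)

print(max_decimal_metadata([Decimal("1.0"), None, Decimal("123.456")]))  # (6, 3)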
@@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {

   if (type_ == NULLPTR) {
     for (PyObject* object : objects) {
-      RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      if (!internal::PandasObjectIsNull(object)) {
Do we care about accepting other NULL-like objects such as float('nan')? Otherwise object != Py_None is a much faster check.
I'm not sure, is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow?
Python decimal objects can be nan, unfortunately:
>>> import decimal
>>> decimal.Decimal('nan')
Decimal('NaN')
Seems like it could be NaN also:
In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])
In [6]: s2 = pd.Series([Decimal('2.0'), None])
In [7]: s1 / s2
Out[7]:
0    0.5
1    NaN
dtype: object
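As a small illustration (not part of the PR, and pandas' pd.isna is not the same function as Arrow's internal PandasObjectIsNull), these are the kinds of NULL-like values under discussion and how pandas itself classifies them; the exact results may vary with the pandas version:

import decimal
import numpy as np
import pandas as pd

# None, a float NaN, a NumPy NaN and a Decimal NaN are all "NULL-like"
# in the sense discussed above; print how pandas classifies each one.
for obj in [None, float("nan"), np.nan, decimal.Decimal("NaN")]:
    print(repr(obj), pd.isna(obj))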
-      RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      if (!internal::PandasObjectIsNull(object)) {
+        RETURN_NOT_OK(max_decimal_metadata.Update(object));
+      }
     }

     type_ =
By the way, what happens here if all items are None? Do we have a test for that?
I'll add that
done
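Something along these lines can exercise the all-None case; this is only a sketch, not the exact test added in this PR, and the inferred type for an all-None column depends on the conversion path and the pyarrow version:

import pandas as pd
import pyarrow as pa

# All-None object Series: convert with an explicit decimal type and check
# that every value comes through as null.
s = pd.Series([None, None], dtype=object)
arr = pa.Array.from_pandas(s, type=pa.decimal128(12, 5))
print(arr.type, arr.null_count)  # expected: decimal128(12, 5) 2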
        Decimal128 value;
        RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
        RETURN_NOT_OK(builder.Append(value));
      } else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {
Same question as above: do we care about other NULL-like values than simply None?
        series = pd.Series(data)
        _check_series_roundtrip(series, type_=pa.decimal128(12, 5))

    def test_decimal_with_None_infer_type(self):
Can you also check that the expected type is inferred?
sure
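A minimal sketch of such a check, shown directly rather than through the test suite's _check_series_roundtrip helper; the exact precision and scale depend on the values in the series:

import pandas as pd
import pyarrow as pa
from decimal import Decimal

# Let Arrow infer the type from the Decimal values (no type passed) and
# then inspect the inferred type.
s = pd.Series([Decimal("3.14"), None])
arr = pa.Array.from_pandas(s)
print(arr.type)  # e.g. decimal128(3, 2)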
        _check_series_roundtrip(s, type_=pa.binary())
        # Infer type from bytearrays
        _check_series_roundtrip(s)
But you should pass expected_pa_type here.
Oops, right - thanks for catching that!
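The same kind of assertion applies to the bytearray case; a minimal sketch under the assumption that inference maps bytearray values to a binary Arrow type, as the existing test implies:

import pandas as pd
import pyarrow as pa

# bytearray values should round-trip through an Arrow binary type when the
# type is inferred rather than passed explicitly.
s = pd.Series([bytearray(b"ab"), bytearray(b"cd")])
arr = pa.Array.from_pandas(s)
print(arr.type)  # binary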
LGTM.
Thanks @pitrou!
This fixes conversion of Pandas decimal types to Arrow with None values. Previously, if the type was specified, an error would occur when checking whether the object was a Decimal. If the type was not specified, a segmentation fault would occur when attempting to find the maximum precision and scale.
Added new tests that include None values for both of the above cases.
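A short repro sketch of the two scenarios described above (with this fix both conversions should succeed; the column values and variable names are illustrative):

import pandas as pd
import pyarrow as pa
from decimal import Decimal

s = pd.Series([Decimal("1.23"), None])

# Case 1: type specified explicitly (previously raised an error on None).
arr_explicit = pa.Array.from_pandas(s, type=pa.decimal128(12, 5))

# Case 2: type inferred from the values (previously could segfault on None).
arr_inferred = pa.Array.from_pandas(s)

print(arr_explicit.type, arr_inferred.type)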