Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: merging DataFrames on a column containing just NaN values triggers address violation in safe_sort #59421

Closed
3 tasks done
jsspencer opened this issue Aug 5, 2024 · 4 comments · Fixed by #59489
Closed
3 tasks done
Labels
Bug Needs Info Clarification about behavior needed to assess issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jsspencer
Copy link
Contributor

jsspencer commented Aug 5, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [np.nan, np.nan, np.nan], 'z': [4, 5, 6]}
)
df2 = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [np.nan, np.nan, np.nan], 'zz': [4, 5, 6]}
)
df1.merge(df2, on=['x', 'y'], how='outer')

Issue Description

Related to #55984

Merging DataFrames on a column containing all NaN values results in a
This was not present in 2.1.4 and I think was introduced in #55984 (which fixed other address violations).

Found using asan, can also seen by enabling bounds_checking on take_1d_* in algos_take_helper.pxi.in

My understanding of the cause is:

  1. uniques is an empty array in _factorize_keys - https://github.com/pandas-dev/pandas/blob/main/pandas/core/reshape/merge.py#L2706
  2. The mask set in safe_sort assumes that the array being sorted is at least size 1 - https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531
  3. The masked indices are set to 0.
  4. take_nd assumes the indexer contains no out-of-bounds indices, but an index of 0 is out of bounds in this case.

I am not familiar with pandas internals but changing the mask on https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531 to

mask = (codes < min(-len(values), -1)) | (codes >= len(values))

avoids this out-of-bounds access. Is this a suitable fix? If so, I can prepare a pull request.

Expected Behavior

No array bounds access errors, should produce

   x   y  z  zz
0  1 NaN  4   4
1  2 NaN  5   5
2  3 NaN  6   6

Installed Versions

INSTALLED VERSIONS

commit : 642d244
python : 3.11.9
python-bits : 64
OS : Linux
OS-release : 6.6.15-2rodete2-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 6.6.15-2rodete2 (2024-03-19)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1287.g642d244606.dirty
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 24.0
Cython : 3.0.11
sphinx : 8.0.2
IPython : 8.26.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.0
fastparquet : 2024.5.0
fsspec : 2024.6.1
html5lib : 1.1
hypothesis : 6.108.8
gcsfs : 2024.6.1
jinja2 : 3.1.4
lxml.etree : 5.2.2
matplotlib : 3.9.0
numba : 0.60.0
numexpr : 2.10.1
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 17.0.0
pyreadstat : 1.2.7
pytest : 8.3.2
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.6.1
scipy : 1.14.0
sqlalchemy : 2.0.31
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.7.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

@jsspencer jsspencer added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 5, 2024
@rhshadrach
Copy link
Member

rhshadrach commented Aug 5, 2024

Thanks for the report. I am not able to reproduce on main with Linux / Python3.12. Can you double check on a clean branch?

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 5, 2024
@jsspencer
Copy link
Contributor Author

Sorry, I haven't been able to reproduce on main using gcc with asan flags (previously I was using clang and a slightly odd setup). However, I can reliably reproduce with enabling cython bounds checking in _libs/algos_take_helper.pxi.in:

$ git rev-parse HEAD
772fbc0ca5cb7835a3cab647981adadf086b3afa
$ git diff
diff --git a/pandas/_libs/algos_take_helper.pxi.in b/pandas/_libs/algos_take_helper.pxi.in
index e6b3989628..72418b1a16 100644
--- a/pandas/_libs/algos_take_helper.pxi.in
+++ b/pandas/_libs/algos_take_helper.pxi.in
@@ -64,7 +64,7 @@ def get_dispatch(dtypes):


 @cython.wraparound(False)
-@cython.boundscheck(False)
+@cython.boundscheck(True)
 {{if c_type_in != "object"}}
 def take_1d_{{name}}_{{dest}}(const {{c_type_in}}[:] values,
 {{else}}

I compiled inside a clean virtualenv using:

$ python -m venv ~/venv/pandas-head
$ . ~/venv/pandas-head/bin/activate
$ pip install -r requirements-dev.txt
$ rm -rf debug/
$ python -m pip install -ve . --no-build-isolation --config-settings=builddir="debug" --config-settings=setup-args="-Dbuildtype=debug"

The repro code above then results in the traceback:

Traceback (most recent call last):
  File "/home/jss/local/src/example/pd_merge.py", line 12, in <module>
    print(df1.merge(df2, on=['x', 'y'], how='outer'))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/frame.py", line 10807, in merge
    return merge(
           ^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 382, in merge
    return op.get_result()
           ^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 1082, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 1345, in _get_join_info
    (left_indexer, right_indexer) = self._get_join_indexers()
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 1319, in _get_join_indexers
    return get_join_indexers(
           ^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 1934, in get_join_indexers
    zipped = zip(*mapped)
             ^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 1931, in <genexpr>
    _factorize_keys(left_keys[n], right_keys[n], sort=sort)
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 2786, in _factorize_keys
    llab, rlab = _sort_labels(uniques, llab, rlab)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/reshape/merge.py", line 2847, in _sort_labels
    _, new_labels = algos.safe_sort(uniques, labels, use_na_sentinel=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/algorithms.py", line 1535, in safe_sort
    new_codes = take_nd(order2, codes, fill_value=-1)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/array_algos/take.py", line 115, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jss/local/src/pandas/pandas/core/array_algos/take.py", line 160, in _take_nd_ndarray
    func(arr, indexer, out, fill_value)
  File "pandas/_libs/algos_take_helper.pxi", line 1879, in pandas._libs.algos.take_1d_int64_int64
IndexError: Out of bounds on buffer access (axis 0)

The IndexError is not triggered with the change I proposed to https://github.com/pandas-dev/pandas/blob/main/pandas/core/algorithms.py#L1531 as codes in the call to take_nd remains having only elements with value -1 rather than being set to 0 in the case of order being an empty array.

@rhshadrach
Copy link
Member

Thanks - can reproduce on my end with the change. I think this might be due to this line:

uniques = rizer.uniques.to_array()

When all values are NA, we end up trying to take from an empty array because uniques does not include any NA values. This leads to the out of bounds in _sort_labels. Further investigations and PRs to fix are welcome!

@jsspencer
Copy link
Contributor Author

I agree that uniques does not include any NA values but this does match the described behaviour in Factorizer classes.

safe_sort(..., verify=True) handles out of bounds accesses (matching behaviour described in its docstring) so long as the array is not empty. I think we should fix this to also work for the empty array case.

PR #59489 fixes this.

jsspencer added a commit to jsspencer/pandas that referenced this issue Aug 12, 2024
mroeschke pushed a commit that referenced this issue Aug 12, 2024
…n-empty codes (#59489)

* Fix out-of-bounds violations in safe_sort for empty arrays.

Previously we masked `codes` referring to out-of-bounds elements to 0 and
then fixed them after to -1 using `np.putmask`. However, this results in
out-of-bounds access in `take_nd` if the array is empty.

Instead, set all out-of-bounds indices in `codes` to -1 immediately, as
these can be handled by `take_nd`.

* Remove dead code.

`use_na_sentinel` cannot be truthy inside an else branch where it is
falsy.

* Add test based upon #59421
shreyas-dev pushed a commit to shreyas-dev/pandas that referenced this issue Aug 13, 2024
…n-empty codes (pandas-dev#59489)

* Fix out-of-bounds violations in safe_sort for empty arrays.

Previously we masked `codes` referring to out-of-bounds elements to 0 and
then fixed them after to -1 using `np.putmask`. However, this results in
out-of-bounds access in `take_nd` if the array is empty.

Instead, set all out-of-bounds indices in `codes` to -1 immediately, as
these can be handled by `take_nd`.

* Remove dead code.

`use_na_sentinel` cannot be truthy inside an else branch where it is
falsy.

* Add test based upon pandas-dev#59421
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Info Clarification about behavior needed to assess issue Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants