Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(16905): Fix a number of edge cases where assignment corrupted a Series #16930

Merged
merged 10 commits into from
Jun 15, 2024

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented Jun 13, 2024

Fixes #16905

Previously various interactions in Python of assigning to a Series would clear the data from the Series if an error occurred. In interactive usage, for example in a Jupyter notebook, this is undesirable behavior because typos or just playing around shouldn't result in data disappearing.

Example of previous, buggy behavior that this PR fixes:

>>> import polars as pl
>>> s = pl.Series([1, 2, 3])
>>> s[-200] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/itamarst/devel/sandbox/venv311/lib/python3.11/site-packages/polars/series/series.py", line 1250, in __setitem__
    self.scatter(key, value)
  File "/home/itamarst/devel/sandbox/venv311/lib/python3.11/site-packages/polars/series/series.py", line 4895, in scatter
    self._s.scatter(indices._s, values._s)
polars.exceptions.OutOfBoundsError: indices are out of bounds
>>> s
shape: (0,)
Series: 'default' [null]
[
]

s should not be empty just because an OutOfBoundsError happens...

@github-actions github-actions bot added the fix Bug fix label Jun 13, 2024
@ritchie46
Copy link
Member

The operation is in place and saves a memory allocation. I think we should keep the original code, but do the bound check before we do the in place operation.

Copy link

codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 69.23077% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.09%. Comparing base (cd68100) to head (82067cb).
Report is 15 commits behind head on main.

Files Patch % Lines
py-polars/src/series/scatter.rs 68.42% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16930      +/-   ##
==========================================
- Coverage   81.44%   81.09%   -0.36%     
==========================================
  Files        1425     1435      +10     
  Lines      187970   189585    +1615     
  Branches     2704     2712       +8     
==========================================
+ Hits       153091   153741     +650     
- Misses      34382    35344     +962     
- Partials      497      500       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@itamarst
Copy link
Contributor Author

The operation is in place and saves a memory allocation. I think we should keep the original code, but do the bound check before we do the in place operation.

I wrote a little mini-benchmark to make sure there's no performance hit in non-error case for this change:

from time import time

import polars as pl

series = pl.Series(list(range(1_000_000)))

start = time()
for i in range(1000):
    series[500_000] = 17
print(time() - start)

Some of my first attempts made it 100× slower, admittedly in non-release builds.

@itamarst itamarst marked this pull request as ready for review June 13, 2024 14:11
@itamarst
Copy link
Contributor Author

itamarst commented Jun 13, 2024

Oh, and to be clear, it is still in-place, I didn't change that! I just made sure it gets restored if an error occurs.

@ritchie46
Copy link
Member

Can you confirm that you still hit the get_mut_values branch in our tests. That is important because it is very finicky. If we keep a ref count too many we will copy the underlying buffer instead of mutate it.

@itamarst
Copy link
Contributor Author

Will investigate.

How would you feel about adding a test based on https://crates.io/crates/cov-mark? It's a good technique for this sort of situation where the external behavior is the same, but you really want to hit a particular branch in your test.

@itamarst
Copy link
Contributor Author

The get_mut_values() branch does git hit, yeah, I added a debug print:

tests/unit/series/test_scatter.py get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
...get_mut_values() SOME BRANCH HIT
.get_mut_values() SOME BRANCH HIT
.

But a cov_mark test would actually validate this is in an automated fashion. Not sure if py-polars does cargo test at the moment, though.

@ritchie46
Copy link
Member

Great! Yeah, actually testing it would be better indeed. Though it is important that it only is used in test compilations.

@ritchie46 ritchie46 merged commit 9a3e032 into pola-rs:main Jun 15, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In a variety of cases, errors in PySeries::scatter mean Python Series.__setitem__ will corrupt the Series
3 participants