fix(16905): Fix a number of edge cases where assignment corrupted a Series #16930

itamarst · 2024-06-13T13:19:20Z

Previously, various ways of assigning to a Series from Python would clear the data from the Series if an error occurred. In interactive usage, for example in a Jupyter notebook, this is undesirable because typos or just playing around shouldn't result in data disappearing.

Example of previous, buggy behavior that this PR fixes:

>>> import polars as pl
>>> s = pl.Series([1, 2, 3])
>>> s[-200] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/itamarst/devel/sandbox/venv311/lib/python3.11/site-packages/polars/series/series.py", line 1250, in __setitem__
    self.scatter(key, value)
  File "/home/itamarst/devel/sandbox/venv311/lib/python3.11/site-packages/polars/series/series.py", line 4895, in scatter
    self._s.scatter(indices._s, values._s)
polars.exceptions.OutOfBoundsError: indices are out of bounds
>>> s
shape: (0,)
Series: 'default' [null]
[
]

s should not be empty just because an OutOfBoundsError happens...

ritchie46 · 2024-06-13T14:06:56Z

The operation is in place and saves a memory allocation. I think we should keep the original code, but do the bound check before we do the in place operation.
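
The "check bounds first, then mutate in place" ordering suggested here can be illustrated with a minimal pure-Python sketch. It is not the Polars implementation: `scatter_in_place` is a hypothetical helper operating on a plain list, and the negative-index handling is an assumption made for the illustration.

```python
def scatter_in_place(values: list, indices: list[int], new_values: list) -> None:
    """Write new_values into values at indices, mutating values in place.

    Hypothetical helper for illustration only; not a Polars API.
    """
    n = len(values)
    if len(indices) != len(new_values):
        raise ValueError("indices and new_values must have the same length")

    # Phase 1: validate every index before touching the data.
    normalized = []
    for idx in indices:
        pos = idx + n if idx < 0 else idx  # assumed: negative indices count from the end
        if not 0 <= pos < n:
            raise IndexError(f"index {idx} is out of bounds for length {n}")
        normalized.append(pos)

    # Phase 2: only now mutate. Nothing in this loop can fail on bounds,
    # so an invalid call never leaves the list partially updated.
    for pos, value in zip(normalized, new_values):
        values[pos] = value
```

Because validation finishes before the first write, a bad index in the middle of `indices` leaves the list exactly as it was, and no extra copy of the data is needed.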

codecov · 2024-06-13T14:08:03Z

Codecov Report

Attention: Patch coverage is 69.23077% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.09%. Comparing base (cd68100) to head (82067cb).
Report is 15 commits behind head on main.

Files	Patch %	Lines
py-polars/src/series/scatter.rs	68.42%	12 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16930      +/-   ##
==========================================
- Coverage   81.44%   81.09%   -0.36%     
==========================================
  Files        1425     1435      +10     
  Lines      187970   189585    +1615     
  Branches     2704     2712       +8     
==========================================
+ Hits       153091   153741     +650     
- Misses      34382    35344     +962     
- Partials      497      500       +3


itamarst · 2024-06-13T14:11:06Z

> The operation is in place and saves a memory allocation. I think we should keep the original code, but do the bound check before we do the in place operation.

I wrote a little mini-benchmark to make sure there's no performance hit in the non-error case for this change:

from time import time

import polars as pl

series = pl.Series(list(range(1_000_000)))

start = time()
for i in range(1000):
    series[500_000] = 17
print(time() - start)

Some of my first attempts made it 100× slower, admittedly in non-release builds.

itamarst · 2024-06-13T14:11:41Z

Oh, and to be clear, it is still in-place, I didn't change that! I just made sure it gets restored if an error occurs.
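
A rough pure-Python sketch of the "still in place, but restored on error" idea follows. It assumes the mutation temporarily takes the buffer out of the container and leaves an empty placeholder behind; that detail, the `Column` class, and its `scatter` method are hypothetical and are not taken from the PR's Rust code.

```python
class Column:
    """Toy container standing in for a Series; hypothetical, not a Polars type."""

    def __init__(self, values):
        self._values = list(values)

    def scatter(self, index: int, value) -> None:
        # Take the buffer out, leaving an empty placeholder behind
        # (assumed to mirror how the data could end up cleared on error).
        taken, self._values = self._values, []
        try:
            if not -len(taken) <= index < len(taken):
                raise IndexError("indices are out of bounds")
            taken[index] = value  # mutate the original buffer, no copy
        finally:
            # Put the same buffer back whether or not the update succeeded.
            self._values = taken
```

On the success path the same list object is mutated and put back, so no copy is made; on the error path the untouched buffer is put back instead of the empty placeholder.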

ritchie46 · 2024-06-14T13:53:40Z

Can you confirm that you still hit the get_mut_values branch in our tests? That is important because it is very finicky. If we keep one ref count too many, we will copy the underlying buffer instead of mutating it.
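
To make the concern concrete, here is a small pure-Python model of "mutate if uniquely owned, otherwise copy". The explicit `refcount` field, `SharedBuffer`, `Handle`, and `get_mut` are all hypothetical stand-ins for the illustration; they are not the Rust types or the real `get_mut_values` signature.

```python
class SharedBuffer:
    """Data plus an explicit reference count (hypothetical model)."""

    def __init__(self, data):
        self.data = list(data)
        self.refcount = 1


class Handle:
    """One owner of a SharedBuffer (hypothetical model)."""

    def __init__(self, buffer: SharedBuffer):
        self.buffer = buffer

    def clone(self) -> "Handle":
        # A cheap clone shares the buffer and bumps the count.
        self.buffer.refcount += 1
        return Handle(self.buffer)

    def get_mut(self):
        # Mutable access is only granted to a unique owner.
        return self.buffer.data if self.buffer.refcount == 1 else None

    def set(self, index: int, value) -> bool:
        """Set one element; return True if a copy had to be made."""
        data = self.get_mut()
        copied = data is None
        if copied:
            # Someone else still references the buffer: detach and copy.
            self.buffer.refcount -= 1
            self.buffer = SharedBuffer(self.buffer.data)
            data = self.buffer.data
        data[index] = value
        return copied
```

In this model a single stray `clone()` that is still alive is enough to turn every subsequent `set` from an in-place write into a full copy, while the visible result stays the same. That is why the branch taken cannot be verified from the output alone.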

itamarst · 2024-06-14T14:41:29Z

Will investigate.

How would you feel about adding a test based on https://crates.io/crates/cov-mark? It's a good technique for this sort of situation where the external behavior is the same, but you really want to hit a particular branch in your test.

itamarst · 2024-06-14T14:44:11Z

The get_mut_values() branch does get hit, yeah, I added a debug print:

tests/unit/series/test_scatter.py get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
get_mut_values() SOME BRANCH HIT
...get_mut_values() SOME BRANCH HIT
.get_mut_values() SOME BRANCH HIT
.

But a cov_mark test would actually validate this in an automated fashion. Not sure if py-polars does cargo test at the moment, though.
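
The coverage-mark technique itself can be sketched in a few lines of Python. This is only a model of the idea, not the cov-mark crate's API: `hit`, `check`, the mark name, and the `update` function are all made up for the illustration.

```python
from collections import Counter
from contextlib import contextmanager

_hits = Counter()  # mark name -> number of times the marked branch ran


def hit(name: str) -> None:
    """Called from production code inside the branch of interest."""
    _hits[name] += 1


@contextmanager
def check(name: str):
    """Used in a test: fail unless the mark is hit inside the block."""
    before = _hits[name]
    yield
    if _hits[name] == before:
        raise AssertionError(f"mark {name!r} was never hit")


def update(values: list, index: int, value, uniquely_owned: bool) -> list:
    """Hypothetical function with a fast path and a copying fallback."""
    if uniquely_owned:
        hit("in_place_update")  # the branch the test must exercise
        values[index] = value
        return values
    copied = list(values)
    copied[index] = value
    return copied
```

A test wraps the call in `with check("in_place_update"):` and fails if the fast path was skipped, even though both branches return equal data. Unlike this sketch, which always counts hits, a real setup would need the marks to be active only in test builds.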

ritchie46 · 2024-06-15T07:45:29Z

Great! Yeah, actually testing it would be better indeed. Though it is important that it is only used in test compilations.

pythonspeed added 6 commits June 12, 2024 11:48

Tests for various bugs that corrupt the Series.

bc3c355

Split off the issue.

a693a98

Restore Series if something goes wrong.

01e7f4a

Correct the message

756a1ee

More thorough tests.

db520d7

Match more accurate message.

7b2ca8c

github-actions bot added the fix Bug fix label Jun 13, 2024

pythonspeed added 4 commits June 13, 2024 09:22

Add a performance check.

ba5eef8

Support varying index sizes.

2bbb2f5

Correct the documentation.

0ceb572

Expand explanation, remove assert that isn't always true.

82067cb

itamarst marked this pull request as ready for review June 13, 2024 14:11

itamarst requested review from ritchie46, stinodego, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners June 13, 2024 14:11

ritchie46 merged commit 9a3e032 into pola-rs:main Jun 15, 2024
26 checks passed
