polars vet? #7721

Open
MarcoGorelli opened this issue Sep 30, 2023 · 4 comments
Labels
plugin Implementing a known but unsupported plugin

Comments

@MarcoGorelli

MarcoGorelli commented Sep 30, 2023

Hello,

I've noticed that Ruff has a pandas-vet plugin. Would you be open to adding a polars-vet one?

It could make suggestions such as

- pl.read_csv(file_name).lazy()
+ pl.scan_csv(file_name)

or

- df.select(pl.col('a').map_elements(lambda lst: ' '.join([str(x) for x in lst])))
+ df.select(pl.col('a').list.join(' '))

which can have a real impact on performance.

I could try putting something together if you'd be open to it.

@charliermarsh
Member

I'm happy to add something like this! I'd prefer if we had a non-trivial set of rules in mind (e.g., at least five or so?) before we started to add them. I want to avoid a situation in which we create a new category, add one rule, then fail to expand it to a meaningful set.

charliermarsh added the plugin label Sep 30, 2023
@MarcoGorelli
Author

MarcoGorelli commented Sep 30, 2023

Sure, thanks! For a start, there's all the rewrites from pola-rs/polars#9968, such as

- pl.col('a').map_elements(lambda x: np.sin(x))
+ pl.col('a').sin()

- pl.col('a').map_elements(lambda x: x+1)
+ (pl.col('a') + 1)

- pl.col('a').map_elements(lambda x: json.loads(x))
+ pl.col("a").str.json_extract()

- pl.col('a').map_elements(lambda x: dt.datetime.strptime(x, "%Y-%m-%d"))
+ pl.col('a').str.to_datetime(format='%Y-%m-%d')

- pl.col('a').map_elements(lambda x: x.upper())
+ pl.col("a").str.to_uppercase()

Within Polars, warnings are emitted for some of these by parsing the bytecode of the passed function. Since Ruff works on the AST, I'd expect it to be able to cover a lot more from that list.

The full list of test cases is here, there's quite a few already:

https://github.com/pola-rs/polars/blob/f3142ccd321873d7be5b339fd6ec5536e8c3153e/py-polars/tests/test_udfs.py#L28-L144
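To illustrate what an AST-based check for these rewrites might look like, here is a minimal sketch using Python's standard `ast` module. It is not Ruff's implementation (Ruff is written in Rust), and the function name `find_map_elements_rewrites` and the lookup table are hypothetical. It only covers the `lambda x: x.upper()` style of rewrite, where the lambda body is a zero-argument method call on its own parameter.

```python
import ast

# Hypothetical lookup table for this sketch:
# Python str method name -> equivalent Polars expression method.
STR_METHOD_REWRITES = {
    "upper": "str.to_uppercase",
    "lower": "str.to_lowercase",
}


def find_map_elements_rewrites(source: str) -> list[tuple[int, str]]:
    """Return (line, suggestion) pairs for `map_elements(lambda x: x.<method>())`."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Match a call of the form <expr>.map_elements(<single lambda>).
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "map_elements"
            and len(node.args) == 1
            and isinstance(node.args[0], ast.Lambda)
        ):
            continue
        lam = node.args[0]
        if len(lam.args.args) != 1:
            continue
        param = lam.args.args[0].arg
        body = lam.body
        # Match a lambda body of the form <param>.<method>() with no arguments.
        if (
            isinstance(body, ast.Call)
            and not body.args
            and not body.keywords
            and isinstance(body.func, ast.Attribute)
            and isinstance(body.func.value, ast.Name)
            and body.func.value.id == param
            and body.func.attr in STR_METHOD_REWRITES
        ):
            receiver = ast.unparse(node.func.value)
            suggestion = f"{receiver}.{STR_METHOD_REWRITES[body.func.attr]}()"
            findings.append((node.lineno, suggestion))
    return findings
```

Because the check is purely syntactic, it cannot confirm that the receiver is a Polars expression or that the column holds strings, so a real rule would probably need extra heuristics or would have to be offered as a suggestion rather than a safe fix.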

@ritchie46

Any read operation followed by .lazy() is very fishy.

E.g. pl.read_parquet(..).lazy() should suggest pl.scan_parquet(..).

The same goes for all file types that have a scan function.
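As a rough sketch of how the read-then-lazy pattern could be detected syntactically, the snippet below walks a Python AST and proposes the matching scan call. The helper name `suggest_scan` and the `READ_TO_SCAN` table are assumptions made for illustration; the table lists only a few read/scan pairs and is not meant to be complete. It also assumes Polars is imported under the alias `pl`.

```python
import ast
import copy

# Read functions with a lazy counterpart. Illustrative subset only,
# not an exhaustive list of what Polars supports.
READ_TO_SCAN = {
    "read_csv": "scan_csv",
    "read_parquet": "scan_parquet",
    "read_ipc": "scan_ipc",
    "read_ndjson": "scan_ndjson",
}


def suggest_scan(source: str, module_alias: str = "pl") -> list[tuple[int, str]]:
    """Return (line, suggestion) pairs for `pl.read_*(...).lazy()` calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Outer call: <something>.lazy() with no arguments.
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "lazy"
            and not node.args
            and not node.keywords
        ):
            continue
        inner = node.func.value
        # Inner call: <module_alias>.read_<format>(...).
        if not (
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and isinstance(inner.func.value, ast.Name)
            and inner.func.value.id == module_alias
            and inner.func.attr in READ_TO_SCAN
        ):
            continue
        # Build the suggestion on a copy so the parsed tree stays untouched.
        fixed = copy.deepcopy(inner)
        fixed.func.attr = READ_TO_SCAN[inner.func.attr]
        findings.append((node.lineno, ast.unparse(fixed)))
    return findings
```

The sketch copies the arguments over unchanged. The read and scan functions do not necessarily accept the same keyword arguments, so an automatic fix would need to check the arguments before rewriting.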

@stinodego
Contributor

stinodego commented Oct 19, 2023

One more suggestion in the 'lazy' category:

- DataFrame(...).lazy()
+ LazyFrame(...)

Maybe one for assertions (the equality statements would result in an error):

- assert s1 == s2
+ assert_series_equal(s1, s2)

- assert df1 == df2
+ assert_frame_equal(df1, df2)

- assert lf1 == lf2
+ assert_frame_equal(lf1, lf2)

- assert s1 != s2
+ assert_series_not_equal(s1, s2)
...

One for select/with_columns:

- df.select(pl.all(), ...)
+ df.with_columns(...)

- df.select(pl.col("*"), ...)
+ df.with_columns(...)
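A possible shape for the select/with_columns check, sketched with Python's `ast` module: flag a `select` call whose first positional argument is `pl.all()` or `pl.col("*")` and which has at least one further argument. The names `suggest_with_columns` and `_selects_everything` are hypothetical, and the sketch assumes the `pl` import alias.

```python
import ast
import copy


def _selects_everything(expr: ast.expr, module_alias: str) -> bool:
    """True if expr is `pl.all()` or `pl.col("*")`."""
    if not (
        isinstance(expr, ast.Call)
        and isinstance(expr.func, ast.Attribute)
        and isinstance(expr.func.value, ast.Name)
        and expr.func.value.id == module_alias
        and not expr.keywords
    ):
        return False
    if expr.func.attr == "all":
        return not expr.args
    if expr.func.attr == "col":
        return (
            len(expr.args) == 1
            and isinstance(expr.args[0], ast.Constant)
            and expr.args[0].value == "*"
        )
    return False


def suggest_with_columns(source: str, module_alias: str = "pl") -> list[tuple[int, str]]:
    """Return (line, suggestion) pairs for `x.select(pl.all(), ...)` calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "select"
            and node.args
            and len(node.args) + len(node.keywords) >= 2
            and _selects_everything(node.args[0], module_alias)
        ):
            # Rewrite on a copy: rename the method and drop the leading pl.all().
            fixed = copy.deepcopy(node)
            fixed.func.attr = "with_columns"
            del fixed.args[0]
            findings.append((node.lineno, ast.unparse(fixed)))
    return findings
```

The two forms are not always interchangeable (for example, when a new expression reuses an existing column name), so this is better treated as a suggestion than as a guaranteed-safe fix.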

Keyword syntax in select/with_columns:

- df.select(pl.col('a').abs().alias('abs'))
+ df.select(abs=pl.col('a').abs())

Keyword syntax in filter:

- df.filter(pl.col('a') == 'foo')
+ df.filter(a='foo')

Using positional args instead of lists where possible:

- df.sort(['a', 'b'])
+ df.sort('a', 'b')
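The positional-args suggestion is a good candidate for an automatic fix, since the rewrite only unpacks a list literal. Below is a minimal sketch using Python's `ast` module; `suggest_positional_sort` is a hypothetical name, and the check is purely syntactic, so it cannot tell whether the receiver is actually a Polars frame.

```python
import ast
import copy


def suggest_positional_sort(source: str) -> list[tuple[int, str]]:
    """Return (line, suggestion) pairs for `x.sort([...])` calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Match <expr>.sort(<list literal>) with exactly one positional argument.
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "sort"
            and len(node.args) == 1
            and isinstance(node.args[0], ast.List)
        ):
            continue
        elements = node.args[0].elts
        # Skip empty lists and lists containing unpacking such as [*cols].
        if not elements or any(isinstance(e, ast.Starred) for e in elements):
            continue
        # Rewrite on a copy: spread the list elements as positional arguments.
        fixed = copy.deepcopy(node)
        fixed.args = list(fixed.args[0].elts)
        findings.append((node.lineno, ast.unparse(fixed)))
    return findings
```

Keyword arguments are carried over unchanged, so a call such as `df.sort(['a', 'b'], descending=True)` would be rewritten to `df.sort('a', 'b', descending=True)`.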

...I'm sure I can come up with more 😄
@MarcoGorelli Is this enough input?


4 participants