docs(python): Finish upgrade guide for 1.0.0 (#17257)
stinodego authored Jun 30, 2024
1 parent 59d2529 commit 227b350
Showing 1 changed file with 266 additions and 36 deletions.
302 changes: 266 additions & 36 deletions docs/releases/upgrade/1.md
@@ -1,9 +1,5 @@
# Version 1

!!! warning "Work in progress"

This upgrade guide is not yet complete. Check back when 1.0.0 is released for the full overview of breaking changes.

## Breaking changes

### Properly apply `strict` parameter in Series constructor
@@ -184,6 +180,55 @@ Traceback (most recent call last):
polars.exceptions.InvalidOperationError: conversion from `i64` to `u8` failed in column 'a' for 1 out of 3 values: [300]
```

### Update `read/scan_parquet` to disable Hive partitioning by default for file inputs

Parquet reading functions now also support directory inputs.
Hive partitioning is enabled by default for directories, but is now _disabled_ by default for file inputs.
File inputs include single files, globs, and lists of files.
Explicitly pass `hive_partitioning=True` to restore previous behavior.

**Example**

Before:

```pycon
>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘
```

After:

```pycon
>>> pl.read_parquet("dataset/a=1/foo.parquet")
shape: (2, 1)
┌─────┐
│ x   │
│ --- │
│ f64 │
╞═════╡
│ 1.0 │
│ 2.0 │
└─────┘
>>> pl.read_parquet("dataset/a=1/foo.parquet", hive_partitioning=True)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ x   │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 1.0 │
│ 1   ┆ 2.0 │
└─────┴─────┘
```
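
The `a` column in the example comes from the `a=1` directory name rather than from the file itself. As a rough illustration of that idea, the plain-Python sketch below extracts `key=value` segments from a path. The helper name `hive_partition_values` is hypothetical, and this is not how Polars implements partition discovery; in particular, values stay strings here, whereas Polars inferred `i64` in the output above.

```python
from pathlib import PurePosixPath


def hive_partition_values(path: str) -> dict[str, str]:
    """Collect `key=value` directory segments from a file path (illustrative only)."""
    values: dict[str, str] = {}
    # Only directory components are inspected; the file name itself is skipped.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


print(hive_partition_values("dataset/a=1/foo.parquet"))  # {'a': '1'}
```

With Hive partitioning disabled, none of these path-derived values are added as columns, which is why the default result now contains only `x`.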

### Update `reshape` to return Array types instead of List types

`reshape` now returns an Array type instead of a List type.
@@ -218,6 +263,83 @@ Series: '' [array[i64, 3]]
]
```

### Read 2D NumPy arrays as `Array` type instead of `List`

The Series constructor now parses 2D NumPy arrays as an `Array` type rather than a `List` type.

**Example**

Before:

```pycon
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [list[i64]]
[
	[1, 2]
	[3, 4]
]
```

After:

```pycon
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [array[i64, 2]]
[
	[1, 2]
	[3, 4]
]
```

### Split `replace` functionality into two separate methods

The API for `replace` has proven to be confusing to many users, particularly with regard to the `default` argument and the resulting data type.

It has been split up into two methods: `replace` and `replace_strict`.
`replace` now always keeps the existing data type _(breaking, see example below)_ and is meant for replacing some values in your existing column.
Its parameters `default` and `return_dtype` have been deprecated.

The new method `replace_strict` is meant for creating a new column, mapping some or all of the values of the original column, and optionally specifying a default value. If no default is provided, it raises an error if any non-null values are not mapped.

**Example**

Before:

```pycon
>>> s = pl.Series([1, 2, 3])
>>> s.replace(1, "a")
shape: (3,)
Series: '' [str]
[
	"a"
	"2"
	"3"
]
```

After:

```pycon
>>> s.replace(1, "a")
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
>>> s.replace_strict(1, "a", default=s)
shape: (3,)
Series: '' [str]
[
	"a"
	"2"
	"3"
]
```

### Preserve nulls in `ewm_mean`, `ewm_std`, and `ewm_var`

Polars will no longer forward-fill null values in `ewm` methods.
@@ -291,38 +413,6 @@ shape: (3, 1)
└──────┘
```

### Change `str.to_datetime` to default to microsecond precision for format specifiers `"%f"` and `"%.f"`

In `.str.to_datetime`, when specifying `%.f` as the format, the default was to set the resulting datatype to nanosecond precision. This has been changed to microsecond precision.
@@ -701,6 +791,39 @@ Series: '' [i64]
]
```

### Change default engine for `read_excel` to `"calamine"`

The `calamine` engine (available through the `fastexcel` package) has been added to Polars relatively recently.
It's much faster than the other engines, and was already the default for `xlsb` and `xls` files.
We have now made it the default for all Excel files.

There may be subtle differences between this engine and the previous default (`xlsx2csv`).
One clear difference is that the `calamine` engine does not support the `engine_options` parameter.
If you cannot get your desired behavior with the `calamine` engine, specify `engine="xlsx2csv"` to restore previous behavior.

**Example**

Before:

```pycon
>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
```

After:

```pycon
>>> pl.read_excel("data.xlsx", engine_options={"skip_empty_lines": True})
Traceback (most recent call last):
...
TypeError: read_excel() got an unexpected keyword argument 'skip_empty_lines'
```

Instead, explicitly specify the `xlsx2csv` engine or omit the `engine_options`:

```pycon
>>> pl.read_excel("data.xlsx", engine="xlsx2csv", engine_options={"skip_empty_lines": True})
```
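
If the same call site sometimes passes `engine_options` and sometimes does not, the engine choice can be derived from the options. The sketch below is one way to do that; `excel_engine_kwargs` is a hypothetical helper, not part of Polars, and it relies only on the statement above that `engine_options` requires the `xlsx2csv` engine.

```python
from typing import Any, Optional


def excel_engine_kwargs(engine_options: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build keyword arguments for `pl.read_excel` (hypothetical helper)."""
    if engine_options:
        # Per this guide, `engine_options` is not supported by the new default
        # engine, so the previous default is selected explicitly.
        return {"engine": "xlsx2csv", "engine_options": dict(engine_options)}
    # No options requested: rely on the new default engine.
    return {}


print(excel_engine_kwargs({"skip_empty_lines": True}))
```

The result can then be unpacked into the call, for example `pl.read_excel("data.xlsx", **excel_engine_kwargs(options))`.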

### Remove class variables from some DataTypes

Some DataType classes had class variables.
@@ -779,6 +902,52 @@ shape: (3, 2)
└────────────┴───────────┘
```

### Change default serialization format of `LazyFrame/DataFrame/Expr`

Previously, the only serialization format available for the `serialize/deserialize` methods on Polars objects was JSON.
We added a more optimized binary format and made this the default.
JSON serialization is still available by passing `format="json"`.

**Example**

Before:

```pycon
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
'{"MapFunction":{"input":{"DataFrameScan":{"df":{"columns":[{"name":...'
>>> from io import StringIO
>>> pl.LazyFrame.deserialize(StringIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
```

After:

```pycon
>>> lf = pl.LazyFrame({"a": [1, 2, 3]}).sum()
>>> serialized = lf.serialize()
>>> serialized
b'\xa1kMapFunction\xa2einput\xa1mDataFrameScan\xa4bdf...'
>>> from io import BytesIO # Note: using BytesIO instead of StringIO
>>> pl.LazyFrame.deserialize(BytesIO(serialized)).collect()
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 6   │
└─────┘
```
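
Code that has to read both old JSON payloads and new binary payloads can pick the buffer type from the payload's Python type. The following is a minimal sketch; `deserialize_args` is a hypothetical helper, and passing `format=` to `deserialize` is an assumption based on the `format="json"` option mentioned above.

```python
from io import BytesIO, StringIO
from typing import IO, Tuple, Union


def deserialize_args(serialized: Union[str, bytes, bytearray]) -> Tuple[IO, str]:
    """Return a file-like object and the matching format name (hypothetical helper)."""
    if isinstance(serialized, (bytes, bytearray)):
        # New default: binary payloads need a bytes buffer.
        return BytesIO(bytes(serialized)), "binary"
    # Old default: JSON payloads are text.
    return StringIO(serialized), "json"


source, fmt = deserialize_args("{}")
print(type(source).__name__, fmt)  # StringIO json
```

A call would then look like `pl.LazyFrame.deserialize(source, format=fmt)`, assuming that keyword is accepted by your Polars version.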

### Constrain access to globals from `DataFrame.sql` in favor of `pl.sql`

The `sql` methods on `DataFrame` and `LazyFrame` can no longer access global variables.
@@ -831,3 +1000,64 @@ shape: (4, 2)
│ 2   ┆ 4   │
└─────┴─────┘
```

### Remove re-export of type aliases

We have a lot of type aliases defined in the `polars.type_aliases` module.
Some of these were re-exported at the top-level and in the `polars.datatypes` module.
These re-exports have been removed.

We plan on adding a public `polars.typing` module in the future with a number of curated type aliases.
Until then, please define your own type aliases, or import from our `polars.type_aliases` module.
Note that the `type_aliases` module is not technically public, so use at your own risk.

**Example**

Before:

```python
def foo(dtype: pl.PolarsDataType) -> None: ...
```

After:

```python
PolarsDataType = pl.DataType | type[pl.DataType]

def foo(dtype: PolarsDataType) -> None: ...
```

### Streamline optional dependency definitions in `pyproject.toml`

We revisited the optional dependency definitions and made some minor changes.
If you were using the extras `fastexcel`, `gevent`, `matplotlib`, or `async`, this is a breaking change.
Please update your Polars installation to use the new extras.

**Example**

Before:

```bash
pip install 'polars[fastexcel,gevent,matplotlib]'
```

After:

```bash
pip install 'polars[calamine,async,graph]'
```
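
For projects with many requirement strings, the rename can be scripted. The sketch below rewrites the extras of a `polars[...]` requirement. The `RENAMED_EXTRAS` mapping is an assumption inferred from the order of the before/after commands above, and `update_extras` is a hypothetical helper; verify the mapping against the Polars `pyproject.toml` before relying on it.

```python
import re

# Assumed mapping, inferred from the before/after example in this section.
RENAMED_EXTRAS = {"fastexcel": "calamine", "gevent": "async", "matplotlib": "graph"}


def update_extras(requirement: str) -> str:
    """Rewrite old extras in a `polars[...]` requirement string (illustrative only)."""
    match = re.fullmatch(r"(polars)\[([^\]]*)\](.*)", requirement.strip())
    if match is None:
        # Not a Polars requirement with extras: leave it untouched.
        return requirement
    name, extras, rest = match.groups()
    renamed: list[str] = []
    for extra in extras.split(","):
        new = RENAMED_EXTRAS.get(extra.strip(), extra.strip())
        # Skip empty entries and avoid duplicates after renaming.
        if new and new not in renamed:
            renamed.append(new)
    return f"{name}[{','.join(renamed)}]{rest}"


print(update_extras("polars[fastexcel,gevent,matplotlib]"))
# polars[calamine,async,graph]
```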

## Deprecations

### Issue `PerformanceWarning` when LazyFrame properties `schema/dtypes/columns/width` are used

Recent improvements to the correctness of schema resolution in the lazy engine have significantly increased the cost of resolving the schema.
It is no longer 'free' - in fact, in complex pipelines with lazy file reading, resolving the schema can be relatively expensive.

Because of this, the schema-related properties on LazyFrame were no longer good API design.
Properties represent information that is already available, and just needs to be retrieved.
However, accessing these LazyFrame properties may have a significant performance cost.

To solve this, we added the `LazyFrame.collect_schema` method, which retrieves the schema and returns a `Schema` object.
The properties now issue a `PerformanceWarning` and tell the user to use `collect_schema` instead.
We chose not to deprecate the properties for now to facilitate writing code that is generic for both DataFrames and LazyFrames.
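
Generic code can prefer `collect_schema` whenever it is available and fall back to the property otherwise. The sketch below uses duck typing so it does not depend on a specific Polars version. `column_names` is a hypothetical helper, and it assumes the returned `Schema` object exposes a `names()` method.

```python
from typing import Any


def column_names(frame: Any) -> list[str]:
    """Return column names, preferring an explicit schema resolution."""
    collect_schema = getattr(frame, "collect_schema", None)
    if callable(collect_schema):
        # Resolves the schema explicitly; this may be expensive for a LazyFrame,
        # so call it once and reuse the result where possible.
        return list(collect_schema().names())
    # Fallback for objects that only expose a `columns` attribute.
    return list(frame.columns)
```

Calling `column_names(lf)` on a LazyFrame then makes the cost of schema resolution visible at the call site instead of hiding it behind a property access.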
