docs: Add pandas strictness API difference (#21312)
Co-authored-by: Lawrence Mitchell <lmitchell@nvidia.com>
ritchie46 and wence- authored Feb 19, 2025
1 parent f97b46a commit 67f4da4
Showing 1 changed file with 29 additions and 4 deletions.
33 changes: 29 additions & 4 deletions docs/source/user-guide/migration/pandas.md
@@ -26,17 +26,28 @@ technique.

### Polars adheres to the Apache Arrow memory format to represent data in memory while pandas uses NumPy arrays

Polars represents data in memory according to the Arrow memory spec while pandas by default
represents data in memory with NumPy arrays. Apache Arrow is an emerging standard for in-memory
columnar analytics that can accelerate data load times, reduce memory usage and accelerate
calculations.

Polars can convert data to NumPy format with the `to_numpy` method.

### Polars has more support for parallel operations than pandas

Polars exploits the strong support for concurrency in Rust to run many operations in parallel. While
some operations in pandas are multi-threaded, the core of the library is single-threaded, and an
additional library such as `Dask` must be used to parallelize operations. Polars is faster than all
open source solutions that parallelize pandas code.

### Polars has support for different engines

Polars has native support for an engine optimized for in-memory processing and a streaming engine
optimized for large-scale data processing. Furthermore, Polars has native integration with a
cuDF-backed engine. All these engines benefit from Polars' query optimizer, and Polars ensures
semantic correctness across all of them. In pandas the implementation can dispatch between NumPy and
PyArrow, but because pandas' strictness guarantees are loose, the output data types and semantics
can differ between those backends. This can lead to subtle bugs.

### Polars can lazily evaluate queries and apply query optimization

@@ -50,6 +61,20 @@ examines the query plan and looks for ways to accelerate the query or reduce mem

`Dask` also supports lazy evaluation when it generates a query plan.

### Polars is strict

Polars is strict about data types. Data type resolution in Polars depends on the operation graph,
whereas pandas converts types loosely (e.g. new missing data can lead to integer columns being
converted to floats). This strictness leads to fewer bugs and more predictable behavior.

### Polars has a more versatile API

Polars is built on expressions and accepts expression inputs in almost all operations. This means
that once you understand how expressions work, that knowledge carries over to the rest of the Polars
API. Pandas doesn't have an expression system and often requires Python `lambda`s to express the
complexity you want. Polars sees the requirement of a Python `lambda` as a lack of expressiveness of
its API, and tries to give you native support whenever possible.

## Key syntax differences

Users coming from pandas generally need to know one thing...