PERF: DataFrame(ndarray) constructor ensure to copy to column-major layout #57459

jorisvandenbossche · 2024-02-16T17:01:18Z

With CoW enabled now, the default behaviour of DataFrame(ndarray) is to copy the numpy array (before 3.0, this would not copy the data). However, if we do a copy, we should also make sure we copy it to the optimal layout (typically column major). Before CoW, this copy would often happen later on anyway, and an explicit copy() actually does ensure column-major layout:

On main:

>>> arr = np.random.randn(10, 3)
>>> df = pd.DataFrame(arr)  # does actually make a copy of `arr` under the hood
>>> df._mgr.blocks[0].values.flags.c_contiguous
False
>>> df2 = df.copy()  # explicit copy method
>>> df2._mgr.blocks[0].values.flags.c_contiguous
True
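
To illustrate what the flag check above is measuring, here is a minimal NumPy-only sketch. It assumes (as the `c_contiguous` check suggests) that the block holds the values transposed relative to the frame, i.e. with shape `(n_cols, n_rows)`. The names `block_view`, `col_major` and `block` are only illustrative and are not pandas internals.

```python
import numpy as np

# A freshly created 2D array is row-major (C-contiguous), shape (n_rows, n_cols).
arr = np.random.randn(10, 3)
assert arr.flags.c_contiguous

# Assumption: the block stores the data transposed, shape (n_cols, n_rows).
# Transposing a row-major array gives a view that is NOT C-contiguous,
# so each column of the frame is strided in memory.
block_view = arr.T
print(block_view.flags.c_contiguous)

# Copying into column-major (Fortran) order first makes the transposed
# view C-contiguous: each column of the frame is one contiguous run.
col_major = arr.copy(order="F")
block = col_major.T
print(block.flags.c_contiguous)

# The transpose is still a view on the column-major copy, not another copy.
print(np.shares_memory(col_major, block))
```

The point is that the copy itself costs the same either way, so choosing the column-major order at copy time gives the better layout for free.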

jorisvandenbossche · 2024-02-16T17:02:14Z

With the change, the `arithmetic.FrameWithFrameWide.time_op_different_blocks` benchmark with shape=(1000000, 10), reproduced here with the `>` comparison:

import numpy as np
import pandas as pd

n_rows, n_cols = 1_000_000, 10

# construct dataframe with 2 blocks (float64 and float32)
arr1 = np.random.randn(n_rows, n_cols // 2).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("f4")
df = pd.concat([pd.DataFrame(arr1), pd.DataFrame(arr2)], axis=1, ignore_index=True)
# consolidate same-dtype columns into single blocks (private API)
df._consolidate_inplace()

# construct a second dataframe whose blocks do not line up with the first
arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
df2 = pd.concat(
    [pd.DataFrame(arr1), pd.DataFrame(arr2), pd.DataFrame(arr3)],
    axis=1,
    ignore_index=True,
)
df2._consolidate_inplace()

%timeit df > df2

gives

In [4]: %timeit df > df2
21.3 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   # main
8.18 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   # PR
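
As a rough illustration of why the layout matters for column-wise work, here is a NumPy-only sketch that times a per-column reduction on a row-major and a column-major copy of the same data. This is not the pandas code path exercised by the benchmark above; the helper `column_sums` and the array sizes are made up for the example, and the timings will vary by machine.

```python
import timeit

import numpy as np


def column_sums(a):
    """Reduce each column separately, mimicking column-wise access."""
    return [a[:, j].sum() for j in range(a.shape[1])]


rng = np.random.default_rng(0)
row_major = rng.standard_normal((200_000, 10))  # C order: columns are strided
col_major = np.asfortranarray(row_major)        # F order: columns are contiguous

# Time a few repetitions of the same column-wise operation on each layout.
t_row = timeit.timeit(lambda: column_sums(row_major), number=5)
t_col = timeit.timeit(lambda: column_sums(col_major), number=5)
print(f"row-major: {t_row:.4f}s, column-major: {t_col:.4f}s")
```

Both layouts hold identical values, so any difference in the printed timings comes only from how the columns are laid out in memory.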

jorisvandenbossche · 2024-02-16T17:04:22Z

We actually discussed before that we should do this for CoW -> #50756, and it was also at some point included in PR #51731 but then removed again before merging.

jbrockmendel · 2024-02-16T21:39:28Z

Looks reasonable to me. The perf improvement is real nice.

mroeschke · 2024-03-05T00:57:04Z

Thanks @jorisvandenbossche

DeaMariaLeon · 2024-03-07T08:33:06Z

Conbench is complaining about these benchmarks:

  `io.hdf.HDF.time_write_hdf` (Python) with format='fixed'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='int', op='median'
  `arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0` (Python) with me='pow'
  `io.hdf.HDFStoreDataFrame.time_write_store_table_wide` (Python)
  `frame_methods.Where.time_where` (Python) with dtype=True, param2='float64'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='median'
  `reshape.Unstack.time_without_last_row` (Python) with param1='int'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='kurt'
  `groupby.GroupManyLabels.time_sum` (Python) with ncols=1000
  `frame_methods.Rank.time_rank` (Python) with dtype='uint'
  `rolling.TableMethod.time_apply` (Python) with method='table'
  `reshape.Unstack.time_full_product` (Python) with param1='int'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='skew'

Please ignore if this was expected.

@jorisvandenbossche

PERF: DataFrame(ndarray) constructor ensure to copy to column-major layout

acbdcff

jorisvandenbossche added the Performance (Memory or execution speed performance) and Copy / view semantics labels Feb 16, 2024

jorisvandenbossche requested review from phofl and rhshadrach February 16, 2024 17:04

jorisvandenbossche requested a review from jbrockmendel February 16, 2024 17:06

jorisvandenbossche mentioned this pull request Feb 16, 2024

PERF: Inefficient data representation when building dataframe from 2D NumPy array #50756

Closed

rhshadrach mentioned this pull request Feb 16, 2024

BUG: Endianness problem with weird data types and shape #57457

Closed

3 tasks

fixup

bad7dd6

jorisvandenbossche mentioned this pull request Feb 19, 2024

Potential perf regressions introduced by Copy-on-Write #57431

Open

50 tasks

Merge branch 'main' into cow-frame-constuctor-copy

a07a08e

mroeschke added this to the 3.0 milestone Feb 28, 2024

mroeschke approved these changes Mar 5, 2024

mroeschke merged commit 59e5d93 into pandas-dev:main Mar 5, 2024
47 checks passed

jorisvandenbossche deleted the cow-frame-constuctor-copy branch March 5, 2024 08:22
