PERF: DataFrame(ndarray) constructor ensure to copy to column-major layout #57459

Merged


jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Feb 16, 2024

Closes #50756

xref #57431

With CoW now enabled, the default behaviour of DataFrame(ndarray) is to copy the NumPy array (before 3.0, this would not copy the data). However, if we make a copy anyway, we should also make sure we copy it to the optimal layout (typically column-major). Before CoW, this copy would often happen later on anyway, and an explicit copy() does in fact ensure column-major layout:

On main:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.random.randn(10, 3)
>>> df = pd.DataFrame(arr)  # does actually make a copy of `arr` under the hood
>>> df._mgr.blocks[0].values.flags.c_contiguous
False
>>> df2 = df.copy()  # explicit copy method
>>> df2._mgr.blocks[0].values.flags.c_contiguous
True

@jorisvandenbossche jorisvandenbossche added Performance Memory or execution speed performance Copy / view semantics labels Feb 16, 2024
@jorisvandenbossche
Member Author

With this change, here is the result for the `arithmetic.FrameWithFrameWide.time_op_different_blocks` benchmark (the `>` comparison; shape=(1000000, 10)), which boils down to:

import numpy as np
import pandas as pd
from pandas import DataFrame

n_rows, n_cols = 1_000_000, 10

# construct dataframe with 2 blocks
arr1 = np.random.randn(n_rows, n_cols // 2).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("f4")
df = pd.concat([DataFrame(arr1), DataFrame(arr2)], axis=1, ignore_index=True)
df._consolidate_inplace()

arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
df2 = pd.concat(
    [DataFrame(arr1), DataFrame(arr2), DataFrame(arr3)],
    axis=1,
    ignore_index=True,
)
df2._consolidate_inplace()

%timeit df > df2

gives

In [4]: %timeit df > df2
21.3 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   # main
8.18 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)   # PR
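As a rough intuition for why the layout matters here, the sketch below compares two 2D arrays one column at a time, once with row-major and once with column-major storage. This is a plain NumPy analogy under my own assumptions, not the pandas code path exercised by the benchmark. The helper `compare_columns` is hypothetical, and timings will vary by machine.

```python
import timeit

import numpy as np

n_rows, n_cols = 100_000, 10
rng = np.random.default_rng(0)

# Same values stored in two different memory orders.
row_major = rng.standard_normal((n_rows, n_cols))  # C order
col_major = np.asfortranarray(row_major)           # F order copy

other_row = rng.standard_normal((n_rows, n_cols))
other_col = np.asfortranarray(other_row)


def compare_columns(left, right):
    """Hypothetical helper: compare two 2D arrays column by column."""
    return [left[:, j] > right[:, j] for j in range(left.shape[1])]


# Time the column-wise comparison for each layout (no fixed result expected).
t_row = timeit.timeit(lambda: compare_columns(row_major, other_row), number=20)
t_col = timeit.timeit(lambda: compare_columns(col_major, other_col), number=20)
print(f"row-major: {t_row:.4f}s, column-major: {t_col:.4f}s")
```

Both layouts give the same boolean results; only the memory access pattern per column differs.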

@jorisvandenbossche
Member Author

jorisvandenbossche commented Feb 16, 2024

We actually discussed before that we should do this for CoW (see #50756). It was also included at some point in PR #51731, but then removed again before merging.

@jbrockmendel
Member

Looks reasonable to me. The perf improvement is real nice.

@mroeschke mroeschke added this to the 3.0 milestone Feb 28, 2024
@mroeschke mroeschke merged commit 59e5d93 into pandas-dev:main Mar 5, 2024
47 checks passed
@mroeschke
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the cow-frame-constuctor-copy branch March 5, 2024 08:22
@DeaMariaLeon
Member

Conbench is complaining about these benchmarks:

  `io.hdf.HDF.time_write_hdf` (Python) with format='fixed'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='int', op='median'
  `arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0` (Python) with me='pow'
  `io.hdf.HDFStoreDataFrame.time_write_store_table_wide` (Python)
  `frame_methods.Where.time_where` (Python) with dtype=True, param2='float64'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='median'
  `reshape.Unstack.time_without_last_row` (Python) with param1='int'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='kurt'
  `groupby.GroupManyLabels.time_sum` (Python) with ncols=1000
  `frame_methods.Rank.time_rank` (Python) with dtype='uint'
  `rolling.TableMethod.time_apply` (Python) with method='table'
  `reshape.Unstack.time_full_product` (Python) with param1='int'
  `stat_ops.FrameOps.time_op` (Python) with axis=None, dtype='float', op='skew'

Please ignore if this was expected.


@jorisvandenbossche

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…ayout (pandas-dev#57459)

* PERF: DataFrame(ndarray) constructor ensure to copy to column-major layout

* fixup

---------

Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Labels
Copy / view semantics Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

PERF: Inefficient data representation when building dataframe from 2D NumPy array
4 participants