
Python: variables pane updates for data frames with many columns is slow #2174

Closed
wesm opened this issue Jan 31, 2024 · 5 comments
Labels
area: variables · lang: python · performance

Comments

wesm commented Jan 31, 2024

Positron Version: git main

Steps to reproduce the issue:

  • Load a data frame with over 100,000 rows
  • Transpose the data frame into a new variable (df2 = df.T)

For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000000)})  # 1,000,000 rows, 1 column
df  # display the frame
df2 = df.T  # transpose: 1 row, 1,000,000 columns

(without Positron in the loop, df.T is virtually instantaneous)

In [4]: %timeit df.T
20.6 µs ± 76.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The variables pane takes over 10 seconds for the update to come through.

We should look at the variables logic around data frames to make sure that we don't do excess computations for data frames with > 10,000 columns. For some users, this case is less unusual than one might think, and so we don't want the UI to get blocked in this scenario.

seeM commented Feb 2, 2024

It turns out that the delay is not due to the change detection logic. It happens at the sys.getsizeof call here: https://github.com/posit-dev/positron-python/blob/3e7d7cf181449d537913cdea68c1b14ab9bb2957/pythonFiles/positron/inspectors.py#L124. It seems to be slow for really wide dataframes. We use it to calculate the size field of the Variable (which I don't think we're using in the UI yet).

I wondered if we could use pd.DataFrame.memory_usage instead but that's also really slow for wide dataframes.
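
For reference, a minimal sketch of the timing (assuming the same 1-row by 1,000,000-column frame as in the original report; pandas appears to implement DataFrame.__sizeof__ in terms of memory_usage, which would explain why both paths are slow):

import sys
import time

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': np.arange(1_000_000)}).T  # 1 row x 1,000,000 columns

start = time.perf_counter()
sys.getsizeof(df2)  # dispatches to DataFrame.__sizeof__
print(f"sys.getsizeof took {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
df2.memory_usage().sum()  # per-column usage; slow with ~1M columns
print(f"memory_usage().sum() took {time.perf_counter() - start:.1f}s")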

seeM added this to the Private Alpha 2024 Q1 milestone Feb 5, 2024
seeM commented Feb 5, 2024

How we use size in the UI

After taking a look, I see that we use size in the UI for three things:

  1. Users can sort variables by size (or name, which is the default).
  2. When variables are sorted by size, their respective sizes are shown on the right-hand side instead of the display type. However, for tabular variables, a table icon is always shown instead of the display type or size, so users can't actually see the size of tabular data at the moment.
  3. Users can group variables by size (or kind, which is the default).

Before looking deeper, I didn't actually know that any of these features existed and didn't have a need for them. How important are these features, and would we consider dropping them?

Possible solutions

It's a surprisingly tricky situation. My CPU stays maxed out for the full duration of pd.DataFrame.memory_usage, so we can't solve this by pushing the work out into a separate coroutine or thread. Even using a separate process (forked to avoid copying the dataframe) wouldn't be great, since the user would have a process unexpectedly maxing out the CPU on re-assignments, and it would still take 10+ seconds for the variables pane to update.

My thoughts on solutions:

  1. My suggestion is that we calculate a very crude estimate of the size, such that we get reasonable results for sorting and grouping (using the groups here). Since we don't currently show the size of dataframes, it may not affect the UX at all. We could also later handle estimated sizes more elegantly in the UI, e.g. add a flag to indicate that the size is approximate and show a ~ or a hover message.
  2. It might be possible to find faster ways to calculate (or estimate) memory usage for wide pandas dataframes. There might be hard blockers depending on how pandas is architected, since I would imagine that they're already trying to do this as efficiently as possible, though that might not be true.

cc'ing Python folks since I think this will be a huge pain for anyone working with wide datasets. Any re-assignment will be this slow. @isabelizimm @jgutman @juliasilge @petetronic

wesm commented Feb 6, 2024

I think we can just return the number of data cells * 8 as a rough estimate of size for now, until we need something more precise.
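
A minimal sketch of that idea (estimate_size is a hypothetical helper name, not the actual inspector code):

import pandas as pd

def estimate_size(df: pd.DataFrame) -> int:
    # Crude O(1) estimate: assume one 8-byte (64-bit) value per cell.
    # Ignores dtype widths and object overhead, but never touches the data,
    # so it is equally fast for a 1x1 frame and a 1 x 1,000,000 frame.
    nrows, ncols = df.shape
    return nrows * ncols * 8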

jgutman commented Feb 6, 2024

Agreed -- it's not even displayed anywhere, so this seems like it would be fine to me. If we decide to move pandas Series to the data group instead of the values group, as we've discussed, it would be the same (number of values in the series * constant).

jgutman added a commit that referenced this issue Feb 12, 2024
bump positron-python, includes fixes for #2206, #2174, #2120
juliasilge commented
In Positron 2024.02.0 (Universal) build 1563 I can make a data frame with lots and lots of columns, and the Variables pane updates instantaneously to the correct state:

[screenshot: many-columns]

wesm added lang: python, area: variables, and performance labels and removed the performance label Feb 29, 2024
github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2024