
Python: variables pane updates for data frames with many columns is slow #2174

Closed
wesm opened this issue Jan 31, 2024 · 5 comments
Labels
area: variables · lang: python · performance

Comments

wesm commented Jan 31, 2024

Positron Version: git main

Steps to reproduce the issue:

  • Load a data frame with over 100,000 rows
  • Transpose the data frame into a new variable (df2 = df.T)

For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000000)})  # 1,000,000 rows, 1 column
df  # display the frame
df2 = df.T  # transpose: 1 row, 1,000,000 columns

(without Positron in the loop, df.T is virtually instantaneous)

In [4]: %timeit df.T
20.6 µs ± 76.6 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The variables pane takes over 10 seconds for the update to come through.

We should look at the variables logic around data frames to make sure that we don't do excess computations for data frames with > 10,000 columns. For some users, this case is less unusual than one might think, and so we don't want the UI to get blocked in this scenario.

seeM commented Feb 2, 2024

It turns out that the delay is not due to the change detection logic. It happens at the sys.getsizeof call here: https://github.com/posit-dev/positron-python/blob/3e7d7cf181449d537913cdea68c1b14ab9bb2957/pythonFiles/positron/inspectors.py#L124. It seems to be slow for really wide dataframes. We use it to calculate the size field of the Variable (which I don't think we're using in the UI yet).

I wondered if we could use pd.DataFrame.memory_usage instead but that's also really slow for wide dataframes.
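
For reference, a minimal sketch of the timing (assuming the same 1-row by 1,000,000-column frame as in the original report; pandas appears to implement DataFrame.__sizeof__ in terms of memory_usage, which would explain why both paths are slow):

import sys
import time

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'a': np.arange(1_000_000)}).T  # 1 row x 1,000,000 columns

start = time.perf_counter()
sys.getsizeof(df2)  # dispatches to DataFrame.__sizeof__
print(f"sys.getsizeof took {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
df2.memory_usage().sum()  # per-column usage; slow with ~1M columns
print(f"memory_usage().sum() took {time.perf_counter() - start:.1f}s")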

seeM added this to the Private Alpha 2024 Q1 milestone Feb 5, 2024
seeM commented Feb 5, 2024

How we use size in the UI

After taking a look, I see that we use size in the UI for three things:

  1. Users can sort variables by size (or name, which is the default).
  2. When variables are sorted by size, their respective sizes are shown on the right-hand side instead of the display type. However, for tabular variables, a table icon is always shown instead of the display type or size, so users can't actually see the size of tabular data at the moment.
  3. Users can group variables by size (or kind, which is the default).

Before looking deeper, I didn't actually know that any of these features existed and didn't have a need for them. How important are these features, and would we consider dropping them?

Possible solutions

It's a surprisingly tricky situation. My CPU stays maxed out for the full duration of pd.DataFrame.memory_usage, so we can't solve this by pushing the work out into a separate coroutine or thread. Even using a separate process (forked to avoid copying the dataframe) wouldn't be great, since the user would have a process unexpectedly maxing out the CPU on re-assignments, and it would still take 10+ seconds for the variables pane to update.

My thoughts on solutions:

  1. My suggestion is that we calculate a very crude estimate of the size, such that we get reasonable results for sorting and grouping (using the groups here). Since we don't currently show the size of dataframes, it may not affect the UX at all. We could also later handle estimated sizes more elegantly in the UI, e.g. add a flag to indicate that the size is approximate and show a ~ or a hover message.
  2. It might be possible to find faster ways to calculate (or estimate) memory usage for wide pandas dataframes. There might be hard blockers depending on how pandas is architected, since I would imagine that they're already trying to do this as efficiently as possible, though that might not be true.

cc'ing Python folks since I think this will be a huge pain for anyone working with wide datasets. Any re-assignment will be this slow. @isabelizimm @jgutman @juliasilge @petetronic

wesm commented Feb 6, 2024

I think we can just return the number of data cells * 8 as a rough estimate of size for now, until we need something more precise.
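
A minimal sketch of that idea (estimate_size is a hypothetical helper name, not the actual inspector code):

import pandas as pd

def estimate_size(df: pd.DataFrame) -> int:
    # Crude O(1) estimate: assume one 8-byte (64-bit) value per cell.
    # Ignores dtype widths and object overhead, but never touches the data,
    # so it is equally fast for a 1x1 frame and a 1 x 1,000,000 frame.
    nrows, ncols = df.shape
    return nrows * ncols * 8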

jgutman commented Feb 6, 2024

Agreed -- it's not even displayed anywhere, so this seems like it would be fine to me. If we decide to move pandas Series to the data group instead of the values group, as we've discussed, it would be the same (number of values in the series * constant).

jgutman added a commit that referenced this issue Feb 12, 2024
bump positron-python, includes fixes for #2206, #2174, #2120
juliasilge commented
In Positron 2024.02.0 (Universal) build 1563 I can make a data frame with lots and lots of columns, and the Variables pane updates instantaneously to the correct state:

[screenshot: many-columns]

wesm added lang: python, area: variables, and performance labels and removed the performance label Feb 29, 2024
github-actions bot locked as resolved and limited conversation to collaborators Aug 22, 2024