
Variable deletion consumes a lot of memory #17092

Closed
ivallesp opened this issue Jul 27, 2017 · 8 comments
Labels
Performance (memory or execution speed), Usage Question

Comments

@ivallesp
Contributor

ivallesp commented Jul 27, 2017

Hi team,

I have been having issues with pandas memory management. Specifically, there is a (for me, at least) unavoidable memory peak that occurs when attempting to remove variables from a data set. It should be (almost) free! I am getting rid of part of the data, yet pandas still needs to allocate a large amount of memory, producing MemoryErrors.

Just to give you a little bit of context, I am working with a DataFrame that contains 33M rows and 500 columns (just a big one!), almost all of them numeric, on a machine with 360 GB of RAM. The whole data set fits in memory and I can successfully apply some transformations to the variables. The problem comes when I need to drop 10% of the columns in the table. It produces a big memory peak leading to a MemoryError, even though more than 80 GB of memory are available before the operation!

I tried the following methods for removing the columns, and all of them failed.

  • drop() with or without inplace parameter
  • pop()
  • reindex()
  • reindex_axis()
  • del df[column] in a loop over the columns to be removed
  • __delitem__(column) in a loop over the columns to be removed
  • pop() and drop() in a loop over the columns to be removed.
  • I also tried reassigning the columns, overwriting the DataFrame using indexing with loc and iloc, but it does not help.

I found that the drop method with inplace is the most efficient one, but it still generates a huge peak.

I would like to discuss whether there is any way of implementing (or whether there already exists, by any chance) a method for removing variables more efficiently, without generating extra memory consumption...
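[Editor's note] As an illustrative sketch (not part of the original report, with sizes shrunk), the peak can be made visible with the standard library's tracemalloc, which counts NumPy's buffer allocations in recent NumPy versions:

```python
# Hypothetical repro: measure the allocation peak while dropping half of
# a frame's columns. Sizes are illustrative, much smaller than the
# 33M x 500 frame described above.
import tracemalloc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 50),
                  columns=[f"VAR_{i}" for i in range(50)])

tracemalloc.start()
df = df.drop(df.columns[:25], axis=1)  # drop half the columns
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The peak is roughly the size of the *kept* columns: drop() builds a
# new frame from the surviving data rather than freeing in place.
print(f"peak allocated during drop: {peak / 1e6:.1f} MB")
```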

Thank you
Iván

@gfyoung gfyoung added the Low-Memory and Needs Discussion (requires discussion from core team before further action) labels Jul 27, 2017
@gfyoung
Member

gfyoung commented Jul 27, 2017

xref #16529 : This touches upon a larger question of whether we want to deprecate / remove the inplace parameter, which has been a point of contention in terms of the future of pandas.

@ivallesp : Do you by any chance have code / data that could be used to replicate this issue?

@ivallesp
Contributor Author

ivallesp commented Jul 27, 2017

@gfyoung Sure, find it attached. Just to make it clear, the usage of the inplace parameter does not change anything in terms of memory usage. Can I help with something? Are there any ideas on how to improve the drop function, or how to design a more efficient one? I would like to collaborate on this :D

I profiled it using the memory profiler extension for Jupyter Notebooks.

import pandas as pd
from sklearn.datasets import make_classification

N_FEATURES = 100
N_SAMPLES = 1000000
x = make_classification(n_samples=N_SAMPLES, n_features=N_FEATURES)[0]
df = pd.DataFrame(x, columns=["VAR_%s" % i for i in range(N_FEATURES)])

# Beginning of code to profile -----------------------------------
df.drop(df.columns[0:50], inplace=True, axis=1)
# End of code to profile -----------------------------------------

@jreback
Contributor

jreback commented Jul 27, 2017

Just to make it clear, the usage of the inplace parameter does not change anything in terms of memory usage.

Where is it stated that this actually does anything w.r.t. memory usage? Virtually all inplace operations make a copy and then re-assign the data.

It may release the memory, depending on whether the underlying data was a view or a copy.
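[Editor's note] Whether a result still shares buffers with its parent can be checked directly; a minimal sketch (not from the thread) using np.shares_memory:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 4), columns=list("abcd"))
col = df["b"].to_numpy()  # view of the original column's buffer

dropped = df.drop(["d"], axis=1)

# Whether the surviving columns still share memory with the original
# depends on the pandas version and copy-on-write settings; this check
# tells you which case you are in.
print(np.shares_memory(col, dropped["b"].to_numpy()))
```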

In [32]: df = pd.DataFrame(np.random.randn(100000, 10))

In [33]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0    100000 non-null float64
1    100000 non-null float64
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(10)
memory usage: 7.6 MB

In [34]: df.drop([0, 1], axis=1, inplace=True)

In [35]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(8)
memory usage: 6.1 MB

You are much more likely, though, to release memory if you use the more idiomatic:

df = df.drop(..., axis=1)

This removes the top-level reference to the original frame. Note that none of this actually triggers garbage collection (and nothing will release the memory back to the OS).
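[Editor's note] A sketch of that pattern (sizes are illustrative), with an explicit collector pass added:

```python
import gc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 10))

# Rebinding the name drops the last reference to the original frame...
df = df.drop([0, 1], axis=1)

# ...and an explicit collection pass reclaims any cycle-held objects now
# rather than at the next automatic run. Note that the OS may still
# report the freed pages as belonging to the process.
gc.collect()

print(df.shape)  # -> (100000, 8)
```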

@jreback jreback closed this as completed Jul 27, 2017
@jreback jreback added this to the No action milestone Jul 27, 2017
@jreback jreback added the Performance (memory or execution speed) and Usage Question labels and removed the Low-Memory and Needs Discussion labels Jul 27, 2017
@ivallesp
Contributor Author

ivallesp commented Jul 27, 2017

I know the inplace parameter does not help avoid the memory increase. I just measured it! Although the inplace name suggests that no copy is made.

Anyway, this was not the topic of this conversation. Closing the issue does not help solve it; it is just sweeping the dirt under the rug... It would be better to read my main message. The problem is that there is no way of deleting variables in a big DataFrame without generating a huge memory peak, and this is a big problem, guys.

In addition, again, regarding your comment @jreback, I do not have problems releasing memory; I have a highly unexpected memory peak.

Best,
Iván

@jreback
Contributor

jreback commented Jul 27, 2017

this is not going to be solved in pandas 1. Data of a single dtype is blocked; creating a view on that does not release the memory (and that is what you are doing). You can do this:

df = ....

df2 = df.drop(...., axis=1)
del df

@alvarouc

Is there any update on this issue? So far, two contradicting solutions have been proposed.

You are much more likely though to release memory if you use a more idiomatic.

df = df.drop(..., axis=1)
This removes the top-level reference to the original frame. Note that none of this actually will garbage collect (and nothing will release the memory back to the os).

and

You can do this.

df = ....

df2 = df.drop(...., axis=1)
del df

What is the best way to delete a column without running out of memory?

@giangdaotr

We encountered the same issue, and just to reiterate: it is a problem with the huge memory peak during the drop, which leads to a MemoryError, NOT a problem with memory release.

@ianozsvald
Contributor

ianozsvald commented May 14, 2021

@giangdaotr I've made a demo to show the cost of using del df[col] vs df.drop(...), the del solution in my example is indeed very expensive. I wonder if the block manager is duplicating RAM under certain conditions (which @jreback notes above). Demo here https://github.com/ianozsvald/ipython_memory_usage/blob/master/src/ipython_memory_usage/examples/example_usage_np_pd.ipynb (see In[16] onwards).

Personally I'm keen to know more, because reasoning about memory usage in pandas (and when/if you get a view or a copy) is pretty tricky. I'm using my ipython_memory_usage tool to try to build up some demos. I'm happy to collect use cases here: ianozsvald/ipython_memory_usage#30
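[Editor's note] A rough way to compare the two approaches (an illustrative sketch, not the linked notebook) is to record the allocation peak of each with the stdlib tracemalloc:

```python
import tracemalloc

import numpy as np
import pandas as pd

COLS = [f"c{i}" for i in range(40)]

def peak_mb(fn):
    """Run fn and return the peak traced allocation in MB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

def del_loop():
    df = pd.DataFrame(np.random.randn(50_000, 40), columns=COLS)
    for c in COLS[:20]:
        del df[c]  # one block operation per deleted column

def single_drop():
    df = pd.DataFrame(np.random.randn(50_000, 40), columns=COLS)
    df = df.drop(COLS[:20], axis=1)  # one copy of the kept columns

print(f"del loop peak:    {peak_mb(del_loop):.1f} MB")
print(f"single drop peak: {peak_mb(single_drop):.1f} MB")
```

Which variant peaks higher can vary with the pandas version and how the block manager consolidates, which is exactly why measuring on your own workload is worthwhile.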
