
Variable deletion consumes a lot of memory #17092

Closed
ivallesp opened this issue Jul 27, 2017 · 8 comments
Labels
Performance (memory or execution speed), Usage Question

Comments

@ivallesp
Contributor

ivallesp commented Jul 27, 2017

Hi team,

I have been having issues with pandas memory management. Specifically, there is a (for me, at least) unavoidable memory peak that occurs when attempting to remove variables from a data set. It should be (almost) free! I am getting rid of part of the data, yet pandas still needs to allocate a large amount of memory, producing MemoryErrors.

Just to give you a little bit of context, I am working with a DataFrame that contains 33M rows and 500 columns (just a big one!), almost all of them numeric, on a machine with 360 GB of RAM. The whole data set fits in memory and I can successfully apply some transformations to the variables. The problem comes when I need to drop 10% of the columns in the table. It produces a big memory peak leading to a MemoryError, even though more than 80 GB of memory are available before the operation!

I tried the following methods for removing the columns, and all of them failed.

  • drop() with or without inplace parameter
  • pop()
  • reindex()
  • reindex_axis()
  • del df[column] in a loop over the columns to be removed
  • __delitem__(column) in a loop over the columns to be removed
  • pop() and drop() in a loop over the columns to be removed.
  • I also tried reassigning the columns, overwriting the DataFrame using indexing with loc and iloc, but it does not help.

I found that the drop method with inplace is the most efficient one, but it still generates a huge peak.

I would like to discuss whether there is any way of implementing (or whether there already exists, by any chance) a method for removing variables more efficiently, without generating extra memory consumption...
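[Editor's note] As an illustrative sketch (not part of the original report, with sizes shrunk), the peak can be made visible with the standard library's tracemalloc, which counts NumPy's buffer allocations in recent NumPy versions:

```python
# Hypothetical repro: measure the allocation peak while dropping half of
# a frame's columns. Sizes are illustrative, much smaller than the
# 33M x 500 frame described above.
import tracemalloc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 50),
                  columns=[f"VAR_{i}" for i in range(50)])

tracemalloc.start()
df = df.drop(df.columns[:25], axis=1)  # drop half the columns
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The peak is roughly the size of the *kept* columns: drop() builds a
# new frame from the surviving data rather than freeing in place.
print(f"peak allocated during drop: {peak / 1e6:.1f} MB")
```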

Thank you
Iván

@gfyoung gfyoung added the Low-Memory and Needs Discussion (requires discussion from core team before further action) labels Jul 27, 2017
@gfyoung
Member

gfyoung commented Jul 27, 2017

xref #16529 : This touches upon a larger question of whether we want to deprecate / remove the inplace parameter, which has been a point of contention in terms of the future of pandas.

@ivallesp : Do you by any chance have code / data that could be used to replicate this issue?

@ivallesp
Contributor Author

ivallesp commented Jul 27, 2017

@gfyoung Sure, find it attached. Just to make it clear, the usage of the inplace parameter does not change anything in terms of memory usage. Can I help with something? Are there any ideas on how to improve the drop function, or how to design a more efficient one? I would like to collaborate on this :D

I profiled it using the memory profiler extension for Jupyter Notebooks.

import pandas as pd
from sklearn.datasets import make_classification

N_FEATURES = 100
N_SAMPLES = 1000000
x = make_classification(n_samples=N_SAMPLES, n_features=N_FEATURES)[0]
df = pd.DataFrame(x, columns=["VAR_%s" % i for i in range(N_FEATURES)])

# Beginning of code to profile -----------------------------------
df.drop(df.columns[0:50], inplace=True, axis=1)
# End of code to profile -----------------------------------------

@jreback
Contributor

jreback commented Jul 27, 2017

Just to make it clear, the usage of the inplace parameter does not change anything in terms of memory usage.

Where is it stated that this actually does anything w.r.t. memory usage? Virtually all inplace operations make a copy and then re-assign the data.

It may release the memory, depending on whether the underlying data was a view or a copy.
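[Editor's note] Whether a result still shares buffers with its parent can be checked directly; a minimal sketch (not from the thread) using np.shares_memory:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000, 4), columns=list("abcd"))
col = df["b"].to_numpy()  # view of the original column's buffer

dropped = df.drop(["d"], axis=1)

# Whether the surviving columns still share memory with the original
# depends on the pandas version and copy-on-write settings; this check
# tells you which case you are in.
print(np.shares_memory(col, dropped["b"].to_numpy()))
```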

In [32]: df = pd.DataFrame(np.random.randn(100000, 10))

In [33]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0    100000 non-null float64
1    100000 non-null float64
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(10)
memory usage: 7.6 MB

In [34]: df.drop([0, 1], axis=1, inplace=True)

In [35]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(8)
memory usage: 6.1 MB

You are much more likely, though, to release memory if you use the more idiomatic:

df = df.drop(..., axis=1)

This removes the top-level reference to the original frame. Note that none of this actually triggers garbage collection (and nothing will release the memory back to the OS).
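[Editor's note] A sketch of that pattern (sizes are illustrative), with an explicit collector pass added:

```python
import gc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 10))

# Rebinding the name drops the last reference to the original frame...
df = df.drop([0, 1], axis=1)

# ...and an explicit collection pass reclaims any cycle-held objects now
# rather than at the next automatic run. Note that the OS may still
# report the freed pages as belonging to the process.
gc.collect()

print(df.shape)  # -> (100000, 8)
```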

@jreback jreback closed this as completed Jul 27, 2017
@jreback jreback added this to the No action milestone Jul 27, 2017
@jreback jreback added the Performance (memory or execution speed) and Usage Question labels and removed the Low-Memory and Needs Discussion labels Jul 27, 2017
@ivallesp
Contributor Author

ivallesp commented Jul 27, 2017

I know the inplace parameter does not help avoid the memory increase. I just measured it! Although the inplace name suggests that no copy is made.

Anyway, this was not the topic of this conversation. Closing the issue does not help solve it; it is just sweeping the dirt under the rug... It would be better to read my main message. The problem is that there is no way of deleting variables in a big DataFrame without generating a huge memory peak, and this is a big problem, guys.

In addition, again, regarding your comment @jreback, I do not have problems releasing memory; I have a highly unexpected memory peak.

Best,
Iván

@jreback
Contributor

jreback commented Jul 27, 2017

this is not going to be solved in pandas 1. Data of a single dtype is blocked; creating a view on that does not release the memory (and that is what you are doing). You can do this:

df = ....

df2 = df.drop(...., axis=1)
del df

@alvarouc

Is there any update on this issue? So far, two contradicting solutions have been proposed.

You are much more likely though to release memory if you use a more idiomatic.

df = df.drop(..., axis=1)
This removes the top-level reference to the original frame. Note that none of this actually will garbage collect (and nothing will release the memory back to the os).

and

You can do this.

df = ....

df2 = df.drop(...., axis=1)
del df

What is the best way to delete a column without running out of memory?

@giangdaotr

We encountered the same issue, and just to reiterate: it is a problem with the huge memory peak during the drop, which leads to a MemoryError, NOT a problem with memory release.

@ianozsvald
Contributor

ianozsvald commented May 14, 2021

@giangdaotr I've made a demo to show the cost of using del df[col] vs df.drop(...), the del solution in my example is indeed very expensive. I wonder if the block manager is duplicating RAM under certain conditions (which @jreback notes above). Demo here https://github.com/ianozsvald/ipython_memory_usage/blob/master/src/ipython_memory_usage/examples/example_usage_np_pd.ipynb (see In[16] onwards).

Personally I'm keen to know more, because reasoning about memory usage in pandas (and when/if you get a view or a copy) is pretty tricky. I'm using my ipython_memory_usage tool to try to build up some demos. I'm happy to collect use cases here: ianozsvald/ipython_memory_usage#30
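[Editor's note] A rough way to compare the two approaches (an illustrative sketch, not the linked notebook) is to record the allocation peak of each with the stdlib tracemalloc:

```python
import tracemalloc

import numpy as np
import pandas as pd

COLS = [f"c{i}" for i in range(40)]

def peak_mb(fn):
    """Run fn and return the peak traced allocation in MB."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

def del_loop():
    df = pd.DataFrame(np.random.randn(50_000, 40), columns=COLS)
    for c in COLS[:20]:
        del df[c]  # one block operation per deleted column

def single_drop():
    df = pd.DataFrame(np.random.randn(50_000, 40), columns=COLS)
    df = df.drop(COLS[:20], axis=1)  # one copy of the kept columns

print(f"del loop peak:    {peak_mb(del_loop):.1f} MB")
print(f"single drop peak: {peak_mb(single_drop):.1f} MB")
```

Which variant peaks higher can vary with the pandas version and how the block manager consolidates, which is exactly why measuring on your own workload is worthwhile.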
