This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Memory leak in Naive engine when profiling #15375

Closed
larroy opened this issue Jun 26, 2019 · 18 comments · Fixed by #15405

@larroy
Contributor

larroy commented Jun 26, 2019

I believe there's a memory leak in Naive engine here:

https://github.com/apache/incubator-mxnet/blob/master/src/engine/naive_engine.cc#L164

I think this would be a great opportunity to refactor how the operator returned by NewOperator is released, as it is also difficult to reason about the release of operator memory in the threaded engine (it looks correct to me, but it takes quite a bit of digging to understand).

I would propose returning a unique_ptr directly, with a custom deleter whose virtual operator() is specialized for the engine type; that gets the same effect as the way operators are released now, and it would make this part of the memory management easier to reason about. A rough sketch follows below the link.

https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.cc#L270
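
Roughly what I have in mind, as a minimal sketch (the names and signatures below are illustrative only, not the actual engine API; here the virtual dispatch goes through the engine's release method, which achieves the same per-engine specialization):

```cpp
#include <memory>

// Illustrative only: simplified stand-ins for the engine types.
struct Opr {};  // stand-in for the engine's operator handle

class Engine {
 public:
  virtual ~Engine() = default;

  // Existing-style allocation/release pair (each engine specializes these).
  virtual Opr* NewOperatorRaw() = 0;
  virtual void DeleteOperator(Opr* opr) = 0;

  // Deleter that dispatches back to the engine that created the operator,
  // so the release path stays exactly what each engine implements today.
  struct OprDeleter {
    Engine* engine;
    void operator()(Opr* opr) const {
      if (opr != nullptr) engine->DeleteOperator(opr);
    }
  };
  using OprPtr = std::unique_ptr<Opr, OprDeleter>;

  // Proposed RAII wrapper: callers hold an owning pointer and cannot leak it.
  OprPtr NewOperator() { return OprPtr(NewOperatorRaw(), OprDeleter{this}); }
};
```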

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Bug

@vdantu
Contributor

vdantu commented Jun 26, 2019

@mxnet-label-bot add [bug]

marcoabreu added the Bug label Jun 26, 2019
@anirudh2290
Member

I agree with you about fixing the memory leak in the NaiveEngine, and thanks for raising that. But I disagree about the need to refactor the release of operator memory in the threaded engine. It was a conscious design decision to use an object pool for operator objects, which are manually taken from and added back to the pool. This removal and addition is supposed to be managed by the graph executors and other modules that push operators to the engine. If readability is really the issue, that is very subjective, and I don't see the benefit outweighing the cost here.
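
For context, the recycle pattern being described is roughly the following (a generic sketch only, not the actual MXNet object pool implementation):

```cpp
#include <new>
#include <utility>
#include <vector>

// Generic illustration of the recycle pattern described above; the real
// object pool is more elaborate (thread safety, page-based storage, etc.).
template <typename T>
class ObjectPool {
 public:
  template <typename... Args>
  T* New(Args&&... args) {
    void* mem;
    if (!free_list_.empty()) {   // reuse a slot that was handed back earlier
      mem = free_list_.back();
      free_list_.pop_back();
    } else {
      mem = ::operator new(sizeof(T));
    }
    return new (mem) T(std::forward<Args>(args)...);
  }

  // The caller explicitly hands the object back; the slot is recycled,
  // so no heap free happens on the hot path.
  void Delete(T* obj) {
    obj->~T();
    free_list_.push_back(obj);
  }

 private:
  std::vector<void*> free_list_;
};
```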

@cjolivier01
Member

I agree with Anirudh.

I didn't write the object pool code, but the reason it was done that way was to avoid a new/delete operation inline with the profiling. Depending on the allocation library used, those calls can introduce extra locks (and associated heap code), causing an "observer effect", which is not desired in profiling of all things. Ideally, the naive engine would use the same mechanism (a better change, imho), but since it's not used as much for production workloads, I suppose they didn't think it was worth it to code it.

@larroy
Contributor Author

larroy commented Jun 28, 2019

@anirudh2290 I proposed refactoring around a unique_ptr; I don't think it changes the way memory is released, so maybe you got a different impression of what I was suggesting. I don't know why you are opposing the proposed change right away without considering the benefits. Having a pointer released in a call hierarchy three levels down leads to bugs and memory leaks and is difficult to reason about. It is best to use RAII and return a unique_ptr or similar which holds ownership. I don't think the way resources are released or performance would be impacted in any way, but if the proposal is blocked from the start it makes no sense to even propose a PR, which is not very encouraging.

@larroy
Contributor Author

larroy commented Jun 28, 2019

@cjolivier01 I didn't understand a word of what you are writing. I'm not proposing to change the object pool, just make NewOp return a smart pointer.

@anirudh2290
Member

@larroy even less encouraging is making a PR and then having it blocked. I was trying to save you some time. Ownership of the pointer rests with the module that pushes it, so the imperative or graph executor should handle it. If you are really keen on doing this, do it in the graph executor. There is no need to touch the engine code here.

> Having a pointer released in a call hierarchy three levels down leads to bugs and memory leaks and is difficult to reason about.

This is the design: the engine relies on the object pool for allocation, and the modules depend on the engine to handle this; a module just tells the engine when it wants an operator created or destroyed.
I think the code is very logical and it makes sense to me. Also, the memory leak you found is not related to the ThreadedEngine code at all.

I did consider the benefit of your approach, "readability/understandability", in my first comment.

@larroy
Contributor Author

larroy commented Jun 28, 2019

I will make a small fix to the leak in Naive engine, and a separate PR with a refactor, so we can have a more focused and productive conversation then. Thanks for your feedback.

@larroy
Contributor Author

larroy commented Jun 28, 2019

@anirudh2290 it makes sense to you because you know the code. I would encourage you to read about the benefits of RAII. This is orthogonal to having an object pool for allocation, which I was not proposing to change.

@anirudh2290
Member

> it makes sense to you because you know the code

So who are you targeting, someone unfamiliar with the code? Everyone needs to spend some time with the code to understand it, but after they spend that time it makes sense.

I know the benefits of RAII, but they don't apply to this use case in the threaded engine. I am saying there is no need to change the threaded engine code here, which is what you are proposing to do. RAII won't help here, because the engine itself is only responsible for indicating to the object pool that the object has to be deleted; the rest is taken care of by the object pool itself. As I said, if you really want to add RAII, do it in the graph executor (although I don't think it is really needed there either).

@larroy
Contributor Author

larroy commented Jun 28, 2019

Maybe you know the code better and what I propose doesn't make sense, but my proposal was to wrap the operator in a unique_ptr with a custom deleter with a virtual operator(), so delete doesn't need to be called here:

https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.cc#L472

But it would still use the object pool; that wouldn't change.

Did I understand correctly that you think this is not possible, or that it is better to explicitly delete the operator in this case?

The leak would not have happened in the first place if a smart pointer to the operator were always passed around; that was my point. Maybe I'm missing something that really needs the raw pointer, but couldn't you pass around a unique_ptr that gets returned to the pool when needed? Having the pointer deleted manually is not exactly exception safe.
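
Concretely, something along these lines, where the deleter returns the block to the pool rather than freeing it (all names below are made up for illustration, not the real code):

```cpp
#include <memory>

// Purely illustrative stand-ins; not the real OprBlock/object-pool types.
struct OprBlock { /* engine op state */ };

struct OprBlockPool {
  void Delete(OprBlock* blk) { /* return blk to the pool's free list */ }
};

// Deleter that hands the block back to its pool instead of calling `delete`.
struct PoolDeleter {
  OprBlockPool* pool;
  void operator()(OprBlock* blk) const {
    if (blk != nullptr) pool->Delete(blk);
  }
};

using OprBlockPtr = std::unique_ptr<OprBlock, PoolDeleter>;

// Whoever holds an OprBlockPtr returns the block to the pool automatically
// when the pointer goes out of scope (including on early returns), so no
// manual delete/return call is needed at the call site.
```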

@anirudh2290
Member

First, when you talk about the leak, you are talking about the naive engine, which does things differently from the threaded engine.

Coming to the threaded engine: what will be the scope and lifetime of the unique_ptr that you will be creating, and how will the dependency engine know when it has to be deleted? It has an individual view of operators, not the full graph view.

I have spent a lot of time here, having this discussion, on something that hardly adds any value. I have said all I have to say.

larroy added a commit to larroy/mxnet that referenced this issue Jun 28, 2019
@larroy
Contributor Author

larroy commented Jun 28, 2019

> I have spent a lot of time here, having this discussion, on something that hardly adds any value. I have said all I have to say.

Hi @anirudh2290
I don't think it is very constructive to conclude your reply with that if you are asking questions. I was trying to understand your point and explain my proposal. If you don't have time to follow up, there is no need to start a discussion in the first place. I just made a suggestion on this ticket to prevent similar kinds of leaks in other places using this code. I respect that you have your own opinion, but for me, having unmanaged pointers in many places with a complex delete pattern creates an environment in which reasoning about the code is hard and prone to bugs. Saying that something hardly adds any value is not very constructive if you don't try to understand the problem first. Cheers.

@anirudh2290
Member

Okay, I will say it again: the leak is in the naive engine and not the threaded engine. Also, let me say again: for the threaded engine, there is no problem here. I would request you to spend time fixing some real issues that mxnet has.

I am going to close this issue, please feel free to open a new one for the naive engine memory leak.

@larroy
Contributor Author

larroy commented Jun 28, 2019

Did you read the title of the issue? Why are you closing this? It says the leak is in the NaiveEngine when profiling. I have said the leak is in the naive engine since the beginning.

@anirudh2290
Member

This thread has derailed a lot from the original issue, and the whole discussion has been about the threaded engine, where there is no issue. I will keep this closed; feel free to open a new issue and keep the discussion focused on the naive engine.

@larroy
Contributor Author

larroy commented Jun 29, 2019

Attaching valgrind run which shows the leaked memory.

(attached image: naive_engine_leak)

@larroy
Contributor Author

larroy commented Jun 29, 2019

(attachment: valgrind.txt)
