-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Memory leak in Naive engine when profiling #15375
Comments
Hey, this is the MXNet Label Bot. |
@mxnet-label-bot add [bug] |
I agree with you about fixing the memory leak in the NaiveEngine and thanks for raising that. But i disagree about the need to refactor the release of operator memory in the threaded engine. This was a conscious design decision to use Object pool for operator objects which would be manually removed from and added back to the pool. This removal and addition was supposed to be managed by graph executors and other modules that push operators to the engine. If readability is really the issue that is very subjective and I don't really see the benefit trump the cost here. |
I agree with Anirudh. I didn’t write the object pool code, but the reason it was done that way was to |
@anirudh2290 I proposed refactoring around a unique_ptr I don't think is going to change the way memory is released, maybe you got a different impression of what I was suggesting to do. I don't know why are you oposing a proposed change right away without considering the benefits. Having a pointer released in a call hiearchy 3 levels down leads to bugs and memory leaks, is difficult to reason about it. Best is to use RAII and return a unique_ptr or similar which holds ownership. I don't think the way the release of resources is going to be modified or performance impacted in any way, but If the proposal is blocked from the start makes no sense to even propose a PR, this is not very encouraging. |
@cjolivier01 I didn't understand a word of what you are writing. I'm not proposing to change the object pool, just make NewOp return a smart pointer. |
@larroy even less encouraging is having you make a PR and then blocking it. I was trying to save you some time. Ownership of the pointer rests with the module that pushes it. So imperative or graph executor should handle it. If you are really keen on doing this do it in the graph executor. There is no need to touch the engine code here.
This is the design, Engine relies on object pool for allocation, and the modules depend on engine to handle this, the modules just tell them when it wants a new operator creator or destroyed. I considered the benefits of your approach: "readability/understandability" in the first comment. |
I will make a small fix to the leak in Naive engine, and a separate PR with a refactor, so we can have a more focused and productive conversation then. Thanks for your feedback. |
@anirudh2290 it makes sense to you because you know the code, I would encourage you to read about the benfits of RAII. This is orthogonal to having an object pool for allocation, which I was not proposing to change. |
So who are you targeting, someone unfamiliar with the code. Everyone needs to spend some time with the code to understand it, but after they spend that time it makes sense. I know the benefits of RAII, but it doesn't apply to this use case of threaded engine. I am saying there is no need to change threaded engine code here which you are proposing to do. RAII won't benefit here, because engine itself is only responsible for indicating to the object pool that the object has to be deleted. The rest is taken care by the object pool itself. As I said if you really want to add RAII, do it in the graph executor (although I don't think it is really needed there either). |
Maybe you know the code better and what I propose doesn't make sense, my proposal was to wrap the operator in a unique_ptr with a custom deleter with a virtual apply operator, so delete doesn't need to be called here: https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.cc#L472 But it will still use the object pool, this wouldn't be changed. Did I understand correctly that you think this is not possible or that is better to explicitly delete the operator in this case? The leak would not be in the first place if you would always pass around a smart pointer from the operator, that was my point, maybe I'm missing something which really needs the raw pointer around, but couldn't you pass around a unique_ptr that gets freed from the pool when needed? having the pointer deleted manually is not exactly exception safe. |
First when you talk about leak, you talk about naive, which does things differently from threaded. Coming to threaded, What will be the scope and lifetime of the unique_ptr, that you will be creating, how will dependency engine know when it has to be deleted ? It has individual view of operators not the full graph view. I have spent a lot of time here, having this discussion, on something that hardly adds any value. I have said all I have to say. |
Hi @anirudh2290 |
Okay, I will say it again: the leak is in naive engine and not the threaded engine. Also, let me say again for the threaded engine, there is no problem here. I would request you to spend time fixing some real issues that mxnet has. I am going to close this issue, please feel free to open a new one for the naive engine memory leak. |
Did you read the title of the issue? Why are you closing this? it says the leak is in the NaiveEngine when profiling. I said the leak is in Naive Engine since the beginning. |
This thread has been derailed a lot from the original issue and the whole discussion has been about threaded engine where there is no issue. I will keep this closed, feel free to open a new issue and keep the discussion focused to naive engine. |
* Fix memory leak in NaiveEngine Fixes #15375 * Fix lint
I believe there's a memory leak in Naive engine here:
https://github.com/apache/incubator-mxnet/blob/master/src/engine/naive_engine.cc#L164
I think it would be a great opportunity to refactor how NewOperator is released. As it's also difficult to reason about the release of the operator memory in the threaded engine (looks correct to me, but it takes quite a bit to dig it up and understand).
I would propose returning a unique_ptr directly with a custom deleter using a virtual operator() that is specialized for the engine type which will get the same effect as how operators are being released now. This would make easier to reason about this memory management part.
https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.cc#L270
The text was updated successfully, but these errors were encountered: