
Investigate memory leak #2403

Closed
carlfm01 opened this issue Oct 7, 2019 · 31 comments · Fixed by #2448


carlfm01 commented Oct 7, 2019

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 Enterprise
  • TensorFlow version of the Nuget: v1.13.1-13-g174b4760eb
  • DeepSpeech version of the Nuget: v0.6.0-alpha.1-0-g90e7980
  • Python version: NA
  • Bazel version (if compiling from source): NA
  • GCC/Compiler version (if compiling from source): NA
  • CUDA/cuDNN version: NA
  • GPU model and memory: NA
  • Exact command to reproduce: NA. Using C# client
  • Nuget version: 0.6.0-alpha.1

Description:
Memory is not released when the model is destroyed while using the language model.

The issue was introduced in 0.6.0-alpha.1; 0.6.0-alpha.0 works just fine.

@lissyx the LAZY config change is not the source of the issue; the lazy change is #2385.

carlfm01 self-assigned this Oct 7, 2019
carlfm01 added the bug label Oct 7, 2019

lissyx commented Oct 7, 2019

The issue is coming from 0.6.0-alpha.1, 0.6.0-alpha.0 works just fine.

Thanks for that. Do you reproduce on Linux as well?


carlfm01 commented Oct 7, 2019

Thanks for that. Do you reproduce on Linux as well?

Not yet, I was testing the Nugets first


lissyx commented Oct 7, 2019

The only big changes in that window are indeed related to the LM: 31afc68 cc @reuben


reuben commented Oct 7, 2019

In your tests are you specifying an empty trie path and relying on the Scorer to create it dynamically? In other words, is Scorer::fill_dictionary being called?


carlfm01 commented Oct 7, 2019

In your tests are you specifying an empty trie path and relying on the Scorer to create it dynamically?

No, I'm using a trie file path, let me see without it.


carlfm01 commented Oct 7, 2019

Without the trie, it stays in the range of 2 GB to 4 GB on every run, with no noticeable difference in resource usage between the first run and the last.


lissyx commented Oct 7, 2019

So far I'm having a hard time reproducing the leak. @carlfm01, do you have specific code to share for reproducing it? Amounts of memory leaked?


lissyx commented Oct 7, 2019

@carlfm01 Your previous message mentioned growing from 200MB to 700MB over 20 iterations, which would mean we are losing 25MB per run.

So far, I can only account for ~2MB at best:

==22384== LEAK SUMMARY:
==22384==    definitely lost: 24 bytes in 1 blocks
==22384==    indirectly lost: 0 bytes in 0 blocks
==22384==      possibly lost: 332,124 bytes in 1,521 blocks
==22384==    still reachable: 2,655,432 bytes in 33,076 blocks
==22384==                       of which reachable via heuristic:
==22384==                         newarray           : 52,576 bytes in 196 blocks
==22384==         suppressed: 0 bytes in 0 blocks

Most of it is from TensorFlow itself, and it might be a false positive from valgrind.

The only items that would connect to the language model / decoder account for only a few bytes (10 occurrences, between 24 bytes and 32 bytes, so at worst it's < 320 bytes per run). Obviously, very far away from what you experience. I do fear this might be Windows-specific. Or at least, not reproducible on Linux / under valgrind.


carlfm01 commented Oct 7, 2019

do you have specific code to share for reproducing?

Just the console example with a for loop over the same file: https://gist.github.com/carlfm01/fd69a8ca2784837dabf9375d35258953#file-test-cs-L59
To see the memory usage I'm using the VS profiler (it gives poor detail on the unmanaged side).
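
Roughly, the reproduction loop is just create → decode → free in a cycle while watching process memory. A minimal C++ sketch of the same idea (my own illustration, not the gist's C# code; the exact DS_* signatures vary between 0.6 alphas, so treat them as assumptions):

#include <cstdio>
#include <vector>
#include "deepspeech.h"

int main(int argc, char** argv) {
  if (argc < 4) return 1;
  const char* model_path = argv[1];
  const char* lm_path    = argv[2];
  const char* trie_path  = argv[3];
  // 5 seconds of silence at 16 kHz, just enough to exercise the decoder.
  std::vector<short> audio(16000 * 5, 0);

  for (int i = 0; i < 20; ++i) {
    ModelState* ctx = nullptr;
    if (DS_CreateModel(model_path, 500 /* beam width */, &ctx) != 0) return 1;
    DS_EnableDecoderWithLM(ctx, lm_path, trie_path, 0.75f, 1.85f);
    char* text = DS_SpeechToText(ctx, audio.data(),
                                 static_cast<unsigned int>(audio.size()));
    DS_FreeString(text);
    DS_FreeModel(ctx);  // with the leak, resident memory grows after every iteration
    printf("iteration %d done\n", i);
  }
  return 0;
}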

Now I want to compile the console client for Linux to perform the same test, or is that exactly what you did?


carlfm01 commented Oct 8, 2019

Finally compiled and now testing:

Versions:

TensorFlow: v1.14.0-16-g3b4ce374f5
DeepSpeech: v0.6.0-alpha.8-16-gfb611ef

Valgrind report for 1 run:

==32502== LEAK SUMMARY:
==32502==    definitely lost: 24 bytes in 1 blocks
==32502==    indirectly lost: 214 bytes in 5 blocks
==32502==      possibly lost: 331,620 bytes in 1,500 blocks
==32502==    still reachable: 2,115,668 bytes in 39,762 blocks
==32502==                       of which reachable via heuristic:
==32502==                         stdstring          : 465,999 bytes in 11,883 blocks
==32502==                         newarray           : 42,880 bytes in 194 blocks
==32502==         suppressed: 0 bytes in 0 blocks

20 runs:

==45309==
==45309== LEAK SUMMARY:
==45309==    definitely lost: 7,520 bytes in 42 blocks
==45309==    indirectly lost: 7,959,270 bytes in 102,263 blocks
==45309==      possibly lost: 2,973,857 bytes in 34,726 blocks
==45309==    still reachable: 2,154,783 bytes in 40,201 blocks
==45309==                       of which reachable via heuristic:
==45309==                         stdstring          : 938,728 bytes in 17,033 blocks
==45309==                         newarray           : 208,224 bytes in 988 blocks

I do fear this might be Windows-specific.

Given the results, it looks like you are correct.

Now looking at kkm000/openfst#8


lissyx commented Oct 8, 2019

Now looking at kkm000/openfst#8

So it would mean ConstFst is not really doing mmap() on Windows, and thus we leak from there?


carlfm01 commented Oct 8, 2019

not really doing mmap() on windows

Instead it uses a custom implementation based on a buffer: https://github.com/kkm000/openfst/blob/989affd3043b6357e6047a395565c3e0d979c01f/src/lib/mapped-file.cc#L48

I'll compile and debug that file. I also want to test a few things with https://code.google.com/archive/p/mman-win32/ and see if we can get rid of the buffer implementation.
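
For comparison, here is a minimal RAII sketch of a real file mapping with the Win32 APIs (my own illustration, not the openfst or mman-win32 code). The point is that everything acquired in the constructor is released in the destructor, so if the destructor never runs, the whole mapping (or, in the buffer fallback, the heap copy of the file) stays resident:

#include <windows.h>

class Win32MappedFile {
 public:
  explicit Win32MappedFile(const char* path) {
    file_ = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file_ == INVALID_HANDLE_VALUE) return;
    mapping_ = CreateFileMappingA(file_, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping_ == nullptr) return;
    data_ = MapViewOfFile(mapping_, FILE_MAP_READ, 0, 0, 0);
  }
  ~Win32MappedFile() {
    // Everything is released here; a destructor that never runs leaks it all.
    if (data_ != nullptr) UnmapViewOfFile(data_);
    if (mapping_ != nullptr) CloseHandle(mapping_);
    if (file_ != INVALID_HANDLE_VALUE) CloseHandle(file_);
  }
  const void* data() const { return data_; }
 private:
  HANDLE file_ = INVALID_HANDLE_VALUE;
  HANDLE mapping_ = nullptr;
  void* data_ = nullptr;
};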


carlfm01 commented Oct 9, 2019

The destructor of MappedFile is never called, so this line never executes:

operator delete(static_cast<char *>(region_.data) - region_.offset);

Before I found that the destructor is not called, I tried to replicate with the Python client on Windows, and it turns out that the Python client does not contain FreeModel. Why is that, @lissyx? Am I missing something?
Windows Python version: deepspeech==0.6.0a8


lissyx commented Oct 9, 2019

it turns out that the Python client does not contain FreeModel

It is handled by the wrapper:

def __del__(self):
    if self._impl:
        deepspeech.impl.FreeModel(self._impl)
        self._impl = None


lissyx commented Oct 9, 2019

The destructor of MappedFile is never called

Just to make sure, this is not only when using the Python code, this happens always?


carlfm01 commented Oct 9, 2019

Just to make sure, this is not only when using the Python code, this happens always?

Yes, always, with both the Python client and the C# client.


lissyx commented Oct 9, 2019

c# client.

Can you replicate with the C++ basic client? Just to see if the .NET bindings play a part in the equation.


lissyx commented Oct 9, 2019

MappedFile, as much as I can read of the Windows part, is all std::unique_ptr<> scoped; is it possible we are missing something at an upper level?


carlfm01 commented Oct 9, 2019

Can you replicate with the C++ basic client?

Yes, I'll test the basic C++ client; allow me some time to complete my builds and switch back to r1.14.


carlfm01 commented Oct 10, 2019

yes I'll test the basic C++ client,

Dealing with make: \bin\amd64\cl.exe: Command not found, I'll read the cluster examples again and try carefully.


lissyx commented Oct 14, 2019

yes I'll test the basic C++ client,

Dealing with make: \bin\amd64\cl.exe: Command not found, I'll read the cluster examples again and try carefully.

Have you been able to sort this out?

carlfm01 commented:

Hello @lissyx, unfortunately not; last week was a busy week working on TTS. I tried in short time windows but did not have any luck.

I'm back to it :)


carlfm01 commented Oct 16, 2019

I just realized I wasted my time: Bazel is not detecting changes inside header files :). I manually removed fst.obj under _objs and now I see the execution path printed (with this I was trying to see where mapped-file is allocated but not released).

Bazel version: 0.24.1

At this point I don't know which changes were actually applied for the tests, so I'm testing again... :/

Now about the C++ basic client:

Can you replicate with the C++ basic client?

Yes, it is eating the same amount of memory as the .NET client.

make: \bin\amd64\cl.exe: Command not found

I solved this by replacing:

TOOLCHAIN := '$(VCINSTALLDIR)\bin\amd64\'

with the full path to cl.exe in my VS installation, and then running vcvars64 before the make command.


lissyx commented Oct 16, 2019

with my full path to the cl.exe of my VS and then running vcvars64 before the make command

Right, this reminds me of things I had to do on TaskCluster. I assumed that on developer systems this would already be dealt with.

Can you replicate with the C++ basic client?

Yes, it is eating the same amount of memory as the .NET client.

Good, at least that confirms it's not coming from the bindings. Do you think you can investigate why the destructor is not called?

We have some code that triggers some lost-but-still-reachable memory under valgrind on Linux, and it deals with what calls this, so I'm wondering whether this is indeed the root cause and we are simply luckier on Linux / going through another path that frees it.


carlfm01 commented Oct 17, 2019

Do you think you can investigate why the destructor is not called?

Yes, I'm already trying to spot the issue, but due to my newbie eyes for C++ I'm not making any significant progress.

The only thing that seems wrong, apart from the destructor of MappedFile never being called, is that the destructor of ConstFstImpl is only called for the first run, then never again. I want to see if this also happens on Linux.

scoped, is it possible we are missing something at a upper level?

Above ConstFstImpl is ConstFst, which is used by PathTrie as FstType. I sort of feel the issue is coming from PathTrie and the usage of shared_ptr:

new_path->matcher_ = matcher_;
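
To illustrate my worry, a toy sketch (my own, not the DeepSpeech sources): every child copies the parent's shared_ptr, so the matcher (and the FST it references) is released only when the last trie node is destroyed, and a single node that is never destroyed keeps everything alive.

#include <iostream>
#include <memory>
#include <vector>

struct Matcher {
  ~Matcher() { std::cout << "matcher freed\n"; }
};

struct Node {
  std::shared_ptr<Matcher> matcher_;
  std::vector<Node*> children_;
  ~Node() {
    for (Node* child : children_) delete child;  // mirrors PathTrie::~PathTrie
  }
};

int main() {
  Node* root = new Node{std::make_shared<Matcher>(), {}};
  root->children_.push_back(new Node{root->matcher_, {}});  // like new_path->matcher_ = matcher_
  delete root;  // "matcher freed" prints only because the child is deleted too;
                // a node that is never destroyed keeps the matcher (and the FST) alive
  return 0;
}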

What do you think?


lissyx commented Oct 17, 2019

@carlfm01 Long shot, but doing this does not seem to trigger issues here:

diff --git a/native_client/ctcdecode/path_trie.cpp b/native_client/ctcdecode/path_trie.cpp
index 51f75ff..dee792d 100644
--- a/native_client/ctcdecode/path_trie.cpp
+++ b/native_client/ctcdecode/path_trie.cpp
@@ -33,6 +33,8 @@ PathTrie::~PathTrie() {
   for (auto child : children_) {
     delete child.second;
   }
+
+  matcher_ = nullptr;
 }
 
 PathTrie* PathTrie::get_path_trie(int new_char, int new_timestep, float cur_log_prob_c, bool reset) {

This should make sure that any PathTrie destruction frees the matching allocation of matcher_.


lissyx commented Oct 17, 2019

@carlfm01 I have more suspicions (and @reuben shares them) about dictionary_ there, which is a std::unique_ptr<> in native_client/ctcdecode/scorer.h and which we ->Copy(true) in native_client/ctcdecode/ctc_beam_search_decoder.cpp. This Copy() call triggers a new behind the scenes.
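
For context, the pattern in question looks roughly like this (a simplified sketch under my own assumptions, not the literal decoder code): OpenFst's Copy() hands back a newly allocated FST, so the caller has to own and eventually delete the copy, or it leaks on every decode.

#include <memory>
#include <fst/fstlib.h>

// Fst::Copy() allocates a new FST with `new`, so the caller owns the result.
void decode_with_dictionary(const fst::StdFst& dictionary) {
  // Leaky shape: the raw copy is never deleted once decoding finishes.
  // fst::StdFst* dict_copy = dictionary.Copy(true);

  // Non-leaky shape: give the copy an owner so it is released at scope exit.
  std::unique_ptr<fst::StdFst> dict_copy(dictionary.Copy(true));
  // ... run the beam search against *dict_copy ...
}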

lissyx pushed a commit to lissyx/STT that referenced this issue Oct 18, 2019

lissyx commented Oct 18, 2019

I'm keeping this open until @carlfm01 can confirm it is fixed.

carlfm01 commented:

Thanks @lissyx

Using:
TensorFlow: v1.14.0-16-g3b4ce374f5
DeepSpeech: v0.6.0-alpha.10-2-g469ddd2

First run 53 MB, last run 45 MB (.NET client).

I can confirm that the issue is fixed for the .NET client.

Testing with the Python and C++ clients; it looks like the C++ client is not releasing memory completely.


lissyx commented Oct 19, 2019

I'm going to close it then; we can still fix the C++ client, but the leak in the lib being fixed is the most important part :)

lissyx closed this as completed Oct 19, 2019