Fix ncurses-related memory leak; various updates to help debugging #121
Conversation
LGTM.
Important for the raw env which isn't weakref-protected.
Force-pushed from 251935f to f9511f0.
Even after recent fixes, we were still seeing memory leaks. This commit fixes another one.

The terminfo/termcap library is a weird 1980s mess. Among other things, it cannot clean up after itself. The tgetent(3) call allocates some memory; it also sets some global state that tells it to re-use that memory in future calls (most programs call tgetent only once; NLE and screen/tmux might be exceptions).

However, our dlopen/dlclose trick dynamically loaded AND UNLOADED libncurses, which probably reset its global state, creating new memory allocations for each tgetent call. Or perhaps calling tgetent several times just leaks in every case.

This change moves the tgetent call up one level into nledl.c and only does it once (per environment). This appears to fix any memory issues I could see with valgrind (although ncurses will always show up there) and appears to not make ./rlmain r leak on Ubuntu, measured with pmap -x $!.

On MacOS, valgrind appears to show a number of entries, but they all (1) look like false positives, (2) are dlopen/dlclose-related, or (3) are tgetent-via-nle_start related. The practical memory behavior on MacOS is harder to ascertain: measuring with ps -axm -o pid,%mem,rss,comm | grep rlmain shows a slooow increase with occasional drops. Certainly slower than before :D

An alternative to this change would be to rip out ncurses altogether, e.g. by #undef'ing TERMLIB and #define'ing ANSI_DEFAULT. Doing so creates other memory leaks in the hilites array in termcap.c, which can probably be fixed too.
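For readers skimming the thread, the shape of the moved call is roughly the following. This is a minimal sketch only: the wrapper name and the error handling are illustrative, and the real change lives inline in nle_start in nledl.c, as shown in the diff below.

```c
/* Sketch of the once-only termcap setup this PR moves into nledl.c.
 * Assumes ncurses' termcap emulation; the wrapper name and the error
 * handling here are illustrative, not the merged code. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <term.h> /* declares tgetent(3); <termcap.h> on some systems */

static void
nle_init_termcap_once(void)
{
    static bool tc_started = false;

    if (!tc_started) {
        const char *term = getenv("TERM");
        /* ncurses ignores the buffer argument, so NULL is fine here;
         * tgetent returns 1 on success. */
        if (!term || tgetent(NULL, term) < 1) {
            fprintf(stderr, "Could not load termcap entry for TERM=%s\n",
                    term ? term : "(unset)");
            exit(EXIT_FAILURE);
        }
        tc_started = true;
    }
}
```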
Mostly cool, two small questions.
@@ -737,52 +735,10 @@ static const short tmspc10[] = { /* from termcap */
#endif
#endif

/* delay 50 ms */
/* NLE: Don't delay ever. */
Is this change somehow related to the bug fix? Or is it a drive-by perf win?
Related in the sense that if one doesn't link against ncurses/terminfo/termcap by #undef'ing TERMLIB, there are symbols in tty_delay_output that the linker complains about.
Also, we never called that function in the first place.
static bool tc_started = FALSE;
if (!tc_started) {
    const char *term = getenv("TERM");
    if (!term || tgetent(NULL, term) < 1) {
Why are we sending in NULL here, instead of a buffer we can free?
Welcome to the depths of the various versions of termcap. Stallman's documentation for GNU termcap says:
If you are using the GNU version of termcap, you can alternatively ask tgetent to allocate enough space. Pass a null pointer for buffer, and tgetent itself allocates the storage using malloc. There is no way to get the address that was allocated, and you shouldn't try to free the storage.
With the Unix version of termcap, you must allocate space for the description yourself and pass the address of the space as the argument buffer. There is no way you can tell how much space is needed, so the convention is to allocate a buffer 2048 characters long and assume that is enough. (Formerly the convention was to allocate 1024 characters and assume that was enough. But one day, for one kind of terminal, that was not enough.)
No matter how the space to store the description has been obtained, termcap records its address internally for use when you later interrogate the description with tgetnum, tgetstr or tgetflag. If the buffer was allocated by termcap, it will be freed by termcap too if you call tgetent again. If the buffer was provided by you, you must make sure that its contents remain unchanged for as long as you still plan to interrogate the description.
(This is part of the reason termcap will always show up in memleak-checking tools like valgrind.)
However, I believe GNU termcap isn't used by modern distributions. Instead, the manpage for tgetent(3) says:
The tgetent routine loads the entry for name. It returns 1 on success, 0 if there is no such entry, and -1 if the terminfo database could not be found. The emulation ignores the buffer pointer bp.
Perhaps NetHack tried to be compatible with an older, non-ncurses-based version? However, they too never look at the pointer they allocate (and it's totally unclear why that buffer couldn't have been on the stack).
Long story short: This argument doesn't matter.
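To make the two conventions concrete, here is a small standalone sketch (not code from this PR) showing both calling styles side by side; the header choice and error handling are assumptions.

```c
/* Illustrative only: the two tgetent(3) calling conventions discussed
 * above, not code from this PR. */
#include <stdio.h>
#include <stdlib.h>
#include <term.h> /* or <termcap.h>, depending on the system */

int
main(void)
{
    const char *term = getenv("TERM");
    if (!term)
        return EXIT_FAILURE;

    /* Classic Unix termcap: the caller supplies the buffer (2048 bytes
     * by convention) and must keep it alive while interrogating the
     * entry with tgetstr/tgetnum/tgetflag. */
    static char buf[2048];
    if (tgetent(buf, term) < 1)
        return EXIT_FAILURE;

    /* GNU termcap / ncurses emulation: the buffer pointer may be NULL.
     * GNU termcap mallocs storage internally (and frees it on the next
     * tgetent call); ncurses ignores the pointer entirely. */
    if (tgetent(NULL, term) < 1)
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}
```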
@@ -57,6 +60,16 @@ nle_ctx_t *
nle_start(const char *dlpath, nle_obs *obs, FILE *ttyrec,
          nle_seeds_init_t *seed_init)
{
    static bool tc_started = FALSE;
    if (!tc_started) {
Why do we need this check?
Otherwise we'd call tgetent several times again, which we don't want to do.
@@ -737,52 +735,10 @@ static const short tmspc10[] = { /* from termcap */
#endif
#endif

/* delay 50 ms */
/* NLE: Don't delay ever. */
Need for speed?
Yes (although we never called this function); but also, this code references symbols that aren't present if TERMLIB isn't defined, and doesn't properly protect against that with #ifdefs.
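For context, the kind of guard being referred to would look roughly like this; a hypothetical sketch, not the actual tty.c code, and the function bodies shown are invented.

```c
/* Hypothetical sketch of a TERMLIB guard of the kind discussed above;
 * not the actual NetHack source. The point is that code referring to
 * termcap-only symbols must be compiled only when TERMLIB is defined,
 * otherwise the linker complains once ncurses is no longer linked in. */
#ifdef TERMLIB
void
tty_delay_output(void)
{
    /* ...termcap-based 50 ms delay using terminal capabilities... */
}
#else
void
tty_delay_output(void)
{
    /* NLE: Don't delay ever. */
}
#endif
```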
This will break on Ubuntu 20.04 if ncurses6 is installed (I guess). I'm not sure why ${NCURSES_LIB} doesn't work here; it seems related to this unanswered question: https://stackoverflow.com/questions/63055293/cmake-transitive-dependency-linked-by-name-only
The previous version chose ncurses.so.6 on Ubuntu 20.04.
Force-pushed from ccf0ae3 to 69efb1f.
(Appears to have fixed #120.)
Fix a termcap/terminfo/termlib/ncurses related memory leak.