Skip to content
This repository has been archived by the owner on May 6, 2024. It is now read-only.

Fix ncurses-related memory leak; various updates to help debugging #121

Merged
merged 12 commits into from
Nov 12, 2020

Conversation

heiner
Copy link
Contributor

@heiner heiner commented Nov 9, 2020

Fix a termcap/terminfo/termlib/ncurses related memory leak.

Even after recent fixes, we were still seeing memory leaks. This
commit fixes another one.

The terminfo/termcap library is a weird 1980s mess. Among other
things, it cannot clean up after itself. The tgetent(3) call allocs
some memory; it also sets some global state that tells it to re-use
that memory in future calls (most programs will call tgetent only
once; NLE and screen/tmux might be exceptions).

However, or dlopen/dlclose trick dynamically loaded AND UNLOADED
libncurses, which probably reset its global state, creating new
memory allocations for each tgetent call. Or perhaps calling tgetent
several times just leaks in every case.

This change moves the tgetent call up one level into nledl.c, and
only does it once (per environment). This appears to fix any memory
issues I could see with valgrind (although ncurses will always show
up there) and appears to not make ./rlmain r leak on Ubuntu, measured
with

pmap -x $!.

On MacOS, (1) valgrind appears to show a number of entries, but they all
look like false positives or (2) are dlopen/dlclose-related or (3)
are tgetent-via-nle_start related. The practial memory behavior on
MacOS is harder to ascertain. Measuring with

ps -axm -o pid,%mem,rss,comm | grep rlmain

shows a slooow increase with occasional drops. Certainly slower
than before :D

An alternative to this change would be to rip out ncurses altogether,
e.g. by #undef'ing TERMLIB and #define'ing ANSI_DEFAULT. Doing so
creates other memory leaks in the hilites array in termcap.c which
can probably be fixed too.

@heiner heiner requested a review from yobibyte November 9, 2020 13:43
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 9, 2020
Copy link
Contributor

@yobibyte yobibyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@heiner heiner changed the title Close env at end of play.py. Various updates to help debugging Nov 9, 2020
Heinrich Kuttler added 3 commits November 10, 2020 00:13
Even after recent fixes, we were still seeing memory leaks. This
commit fixes another one.

The terminfo/termcap library is a weird 1980s mess. Among other
things, it cannot clean up after itself. The tgetent(3) call allocs
some memory; it also sets some global state that tells it to re-use
that memory in future calls (most programs will call tgetent only
once; NLE and screen/tmux might be exceptions).

However, or dlopen/dlclose trick dynamically loaded AND UNLOADED
libncurses, which probably reset its global state, creating new
memory allocations for each tgetent call. Or perhaps calling tgetent
several times just leaks in every case.

This change moves the tgetent call up one level into nledl.c, and
only does it once (per environment). This appears to fix any memory
issues I could see with valgrind (although ncurses will always show
up there) and appears to not make ./rlmain r leak on Ubuntu, measured
with
  pmap -x $!.

On MacOS, (1) valgrind appears to show a number of entries, but they all
look like false positives or (2) are dlopen/dlclose-related or (3)
are tgetent-via-nle_start related. The practial memory behavior on
MacOS is harder to ascertain. Measuring with
  ps -axm -o pid,%mem,rss,comm | grep rlmain
shows a slooow increase with occasional drops. Certainly slower
than before :D

An alternative to this change would be to rip out ncurses altogether,
e.g. by #undef'ing TERMLIB and #define'ing ANSI_DEFAULT. Doing so
creates other memory leaks in the hilites array in termcap.c which
can probably be fixed too.
@heiner heiner changed the title Various updates to help debugging Fix ncurses-related memory leak; various updates to help debugging Nov 11, 2020
@heiner heiner requested a review from cdmatters November 11, 2020 15:49
Copy link
Contributor

@cdmatters cdmatters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly cool, two small questions.

@@ -737,52 +735,10 @@ static const short tmspc10[] = { /* from termcap */
#endif
#endif

/* delay 50 ms */
/* NLE: Don't delay ever. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change somehow related to the bug fix? Or is it a drive-by perf win?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related in the sense that if one doesn't link against ncurses/terminfo/termcap by #undef'ing TERMLIB, there's symbols in tty_delay_output that the linker complains about.

Also we never called that function in the first place.

static bool tc_started = FALSE;
if (!tc_started) {
const char *term = getenv("TERM");
if (!term || tgetent(NULL, term) < 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we sending in NULL here, instead of a buffer we can free?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welcome to the depth of various versions of termcap. Stallman's doc on GNU termcap says:

If you are using the GNU version of termcap, you can alternatively ask tgetent to allocate enough space. Pass a null pointer for buffer, and tgetent itself allocates the storage using malloc. There is no way to get the address that was allocated, and you shouldn't try to free the storage.

With the Unix version of termcap, you must allocate space for the description yourself and pass the address of the space as the argument buffer. There is no way you can tell how much space is needed, so the convention is to allocate a buffer 2048 characters long and assume that is enough. (Formerly the convention was to allocate 1024 characters and assume that was enough. But one day, for one kind of terminal, that was not enough.)

No matter how the space to store the description has been obtained, termcap records its address internally for use when you later interrogate the description with tgetnum, tgetstr or tgetflag. If the buffer was allocated by termcap, it will be freed by termcap too if you call tgetent again. If the buffer was provided by you, you must make sure that its contents remain unchanged for as long as you still plan to interrogate the description.

(This is part of the reason termcap will always show up in memleak-checking tools like valgrind.)

However, GNU termcap isn't being used by modern distributions I believe. Instead, the manpage for tgetent(3) says:

The tgetent routine loads the entry for name. It returns 1 on success, 0 if there is no such entry, and -1 if the terminfo database could not be found. The emulation ignores the buffer pointer bp.

Perhaps NetHack tried to be compatible with an older / non-ncurses based version? However, they too never looked at the pointer they allocate (and it's totally unclear why that couldn't have been on the stack).

Long story short: This argument doesn't matter.

@@ -57,6 +60,16 @@ nle_ctx_t *
nle_start(const char *dlpath, nle_obs *obs, FILE *ttyrec,
nle_seeds_init_t *seed_init)
{
static bool tc_started = FALSE;
if (!tc_started) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise we call tgetent several times again which we don't want to do.

@@ -737,52 +735,10 @@ static const short tmspc10[] = { /* from termcap */
#endif
#endif

/* delay 50 ms */
/* NLE: Don't delay ever. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need for speed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (although we didn't call this function); but also this includes symbols that aren't present if TERMLIB isn't defined, and doesn't properly protect against that with #ifdefs.

Heinrich Kuttler added 2 commits November 11, 2020 19:23
This will break on Ubuntu 20.04 if ncurses6 is installed (I guess).

I'm not sure why ${NCURSES_LIB} doesn't work here, seems related to
this unanswered question: https://stackoverflow.com/questions/63055293/cmake-transitive-dependency-linked-by-name-only
The previous version chose ncurses.so.6 on Ubuntu 20.04.
@heiner heiner merged commit 2bd2360 into master Nov 12, 2020
@heiner heiner deleted the yobibyte-fixes branch November 12, 2020 10:09
@heiner
Copy link
Contributor Author

heiner commented Nov 12, 2020

(Appears to have fixed #120.)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants