[Upgrade] QLRU 12 DBs Upgrade from 2.18.4 to 2024.1.0.0-b123 causes master leader to go unreachable and throughput never comes back #22262

Closed
shamanthchandra-yb opened this issue May 4, 2024 · 0 comments
Labels
2024.1_blocker area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

Comments


shamanthchandra-yb commented May 4, 2024

Jira Link: DB-11181

Description

Please find the detailed conversation in the Slack thread; it is attached in JIRA.

I upgraded two universes from version 2.18.4 to 2024.1. Both times, the master leader node became unreachable. Even after bringing the node back by stopping and starting it from the AWS console, the dropped connections never returned. Interestingly, this did not happen in our earlier 18-DB experiments, but we hit it here with 12 DBs, even though the workload is lighter than before.

The difference is that the 18-DB test was upgraded to 2024.1.0.0-b105 while the 12-DB test was upgraded to 2024.1.0.0-b123. The better defaults from Mark used to be enabled, but starting from 2024.1.0.0-b116 they are behind a gflag and off by default (https://phorge.dev.yugabyte.com/D34565). It looks like the better defaults were masking this issue.

Regarding the upgrade from 2.18.4 -> 2.20.3 with 12 DBs, I did not observe the above issue; none of the nodes became unavailable. As in our previous experiments, I did not observe a throughput drop in this experiment either.

(Two screenshots attached, dated May 4, 2024.)

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added area/docdb YugabyteDB core features priority/high High Priority status/awaiting-triage Issue awaiting triage labels May 4, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug 2024.1_blocker labels May 4, 2024
@yugabyte-ci yugabyte-ci assigned mdbridge and unassigned rthallamko3 May 6, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label May 6, 2024
@yugabyte-ci yugabyte-ci assigned myang2021 and unassigned mdbridge May 7, 2024
myang2021 added a commit that referenced this issue May 9, 2024
Summary:
The memory leak was identified via code inspection while debugging this issue.

```
static void
YbInitUpdateRelationCacheState(YbUpdateRelationCacheState *state)
{
    YbLoadTupleCache(&state->pg_attrdef_cache, AttrDefaultRelationId,
                     &YbExtractAttrDefTupleCacheKey, "pg_attrdef local cache");
    YbLoadTupleCache(&state->pg_constraint_cache, ConstraintRelationId,
                     &YbExtractConstraintTupleCacheKey,
                     "pg_constraint local cache");
}
```

and

```
static void
YbLoadTupleCache(YbTupleCache *cache, Oid relid,
                 YbTupleCacheKeyExtractor key_extractor, const char *cache_name)
{
    Assert(!(cache->rel || cache->data));
    cache->rel = heap_open(relid, AccessShareLock);
    HASHCTL ctl = {0};
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(YbTupleCacheEntry);
    cache->data = hash_create(cache_name, 32, &ctl, HASH_ELEM | HASH_BLOBS);
```

`hash_create` is called in `YbLoadTupleCache`, but the corresponding
`YbCleanupTupleCache` has no matching call to `hash_destroy`:

```
static void
YbCleanupTupleCache(YbTupleCache *cache)
{
    if (!cache->rel)
        return;

    heap_close(cache->rel, AccessShareLock);
}
```

This diff fixes the memory leak by calling `hash_destroy` in
`YbCleanupTupleCache`.
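
For reference, a minimal sketch of the cleanup with the fix applied, based on the snippets above (not necessarily the exact committed code):

```
static void
YbCleanupTupleCache(YbTupleCache *cache)
{
	if (!cache->rel)
		return;

	heap_close(cache->rel, AccessShareLock);

	/* Release the HTAB allocated by hash_create in YbLoadTupleCache. */
	if (cache->data)
	{
		hash_destroy(cache->data);
		cache->data = NULL;
	}
}
```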

Jira: DB-11181

Test Plan:
Manual test.
(1) ./bin/yb-ctl create --rf 1 --tserver_flags ysql_catalog_preload_additional_tables=true
(2) before the fix

```
yugabyte=# select used_bytes,total_bytes from pg_get_backend_memory_contexts() where name like '%local cache';
 used_bytes | total_bytes
------------+-------------
       6616 |        8192
       5576 |        8192
(2 rows)

```
(3) after the fix

```
yugabyte=# select used_bytes,total_bytes from pg_get_backend_memory_contexts() where name like '%local cache';
 used_bytes | total_bytes
------------+-------------
(0 rows)

```

Reviewers: kfranz

Reviewed By: kfranz

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D34845
@yugabyte-ci yugabyte-ci added area/ysql Yugabyte SQL (YSQL) and removed area/docdb YugabyteDB core features labels May 9, 2024
myang2021 added a commit that referenced this issue May 13, 2024
Summary:
In a recently conducted upgrade stress test, we found that when upgrading from
2.18.4.0 to 2.20.3.0, we can successfully complete the upgrade of a 12-DB,
50-connections-per-DB, 7-node cluster configuration, but the same upgrade from
2.18.4.0 to 2024.1-b123 failed. We noticed that the PG memory spike was higher
in the 2024.1-b123 case. Debugging showed that the upgrade to 2024.1-b123
involved running more DDLs that increment the catalog version. After each
catalog cache refresh, the memory size of `CacheMemoryContext` grows, which
suggests a memory leak. The experiment below demonstrates the leak (between
each `select`, run `alter user yugabyte superuser` in another session to
trigger a catalog cache refresh on the next execution of the `select`
statement):

```
yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
      524288 |     409160
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10254152
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10507912
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10761672
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   11015432
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   11269192
(1 row)
```

With each catalog cache refresh, `used_bytes` increases by an average of
(11269192 - 10254152) / 4 = 253760 bytes.

Because of the memory leak, the more catalog-version-bumping DDLs are
executed, the more memory `CacheMemoryContext` uses. That is why the upgrade
to 2024.1-b123 showed a higher PG memory spike.

This diff fixes two identified memory leaks in `CacheMemoryContext` that
account for most of the leaked memory:

* ybctid memory needs to be released in `CatCacheRemoveCTup`.
* yb_table_properties needs to be released in `RelationDestroyRelation`.

Jira: DB-11181
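
A rough sketch of the kind of cleanup the two bullets describe; the field names
(`ct->ybctid`, `rel->yb_table_properties`) and placement are assumptions for
illustration only, not the actual diff:

```
/* In CatCacheRemoveCTup(): free the copied ybctid together with the entry
 * (field name assumed for illustration). */
if (ct->ybctid != (Datum) 0)
	pfree(DatumGetPointer(ct->ybctid));

/* In RelationDestroyRelation(): free the cached YB table properties
 * (field name assumed for illustration). */
if (rel->yb_table_properties != NULL)
	pfree(rel->yb_table_properties);
```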

Test Plan:
Manual test. After the fix

```
yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
      524288 |     409000
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10253160
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10256456
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10259752
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10263048
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name
= 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10266344
(1 row)

```

With each catalog cache refresh, `used_bytes` increases by an average of
(10266344 - 10253160) / 4 = 3296 bytes.

Reviewers: kfranz, fizaa

Reviewed By: fizaa

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D34969
myang2021 added a commit that referenced this issue May 16, 2024
Summary:
This diff fixes the remaining identified memory leaks in `CacheMemoryContext`,
so that cache memory stays stable after each catalog cache refresh. Two more
memory leaks were identified:

* `YbCompleteAttrProcessingImpl`: `relation->rd_att->constr` needs to be freed.
* `YbPreloadPgInheritsCache`: tighter scope of `MemoryContextSwitchTo(CacheMemoryContext)`

For the first leak: we do not clean up all relations from the rel cache on a
catalog cache refresh; some are considered `YbIsNonAlterableRelation` and
remain. However, the overall processing flow of `YbCompleteAttrProcessingImpl`
processes them anyway and sets their `relation->rd_att->constr` to the newly
allocated `constr`, leaking the previous one.

For the second leak, in `YbPreloadPgInheritsCache`: PG code already
allocates/copies into hcxt (`CacheMemoryContext` in this case) whenever
required. If we also allocate temporary memory in `CacheMemoryContext`, that
memory is never freed after its contents are copied into
`CacheMemoryContext`. The related code is:

```
static HTAB *YbPgInheritsCache;
```
and

```
void
YbInitPgInheritsCache()
{
    Assert(YbPgInheritsCache == NULL);
    HASHCTL ctl = {0};
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(YbPgInheritsCacheEntryData);
    ctl.hcxt = CacheMemoryContext;
```

Note that `ctl.hcxt = CacheMemoryContext` already ensures the copying into
`CacheMemoryContext`. By narrowing the scope of
`MemoryContextSwitchTo(CacheMemoryContext)` to where it is truly required, the
temporary memory is allocated from the transaction memory context and is
automatically freed when the transaction (created just for the cache refresh)
ends.
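
The scoping pattern, sketched; this is illustrative of the approach rather than
the exact diff:

```
/* Build any temporary lists/tuples in the current (transaction) context,
 * which is freed automatically when the refresh transaction ends. */
MemoryContext oldcxt;

oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
/* Allocate only the long-lived cache entry while in CacheMemoryContext. */
MemoryContextSwitchTo(oldcxt);
```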

With these two leaks fixed, I am able to add a new unit test that checks that
the cache memory size stays the same after a cache refresh.

Jira: DB-11181

Test Plan: ./yb_build.sh release --cxx-test pg_libpq-test --gtest_filter PgLibPqTest.CatalogCacheMemoryLeak -n 20

Reviewers: fizaa, kfranz

Reviewed By: fizaa

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D35017
myang2021 added a commit that referenced this issue May 16, 2024
…ttrdef local cache

Jira: DB-11181

Original commit: 18754b1 / D34845

Differential Revision: https://phorge.dev.yugabyte.com/D35137
myang2021 added a commit that referenced this issue May 17, 2024
Jira: DB-11181

Original commit: 166ef51 / D34969

Differential Revision: https://phorge.dev.yugabyte.com/D35139
myang2021 added a commit that referenced this issue May 20, 2024
…e refresh

Jira: DB-11181

Original commit: a9b9a2b / D35017

Differential Revision: https://phorge.dev.yugabyte.com/D35199
svarnau pushed a commit that referenced this issue May 25, 2024
Differential Revision: https://phorge.dev.yugabyte.com/D34845
svarnau pushed a commit that referenced this issue May 25, 2024
Differential Revision: https://phorge.dev.yugabyte.com/D34969
svarnau pushed a commit that referenced this issue May 25, 2024
Differential Revision: https://phorge.dev.yugabyte.com/D35017
svarnau pushed a commit that referenced this issue May 29, 2024
…ttrdef local cache

Differential Revision: https://phorge.dev.yugabyte.com/D35137
svarnau pushed a commit that referenced this issue May 29, 2024
Differential Revision: https://phorge.dev.yugabyte.com/D35139
svarnau pushed a commit that referenced this issue May 29, 2024
…e refresh

Differential Revision: https://phorge.dev.yugabyte.com/D35199
@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue and removed priority/high High Priority labels Aug 19, 2024