[Upgrade] QLRU 12 DBs Upgrade from 2.18.4 to 2024.1.0.0-b123 causes master leader to go unreachable and throughput never comes back #22262
Labels: 2024.1_blocker, area/ysql (Yugabyte SQL (YSQL)), kind/bug (This issue is a bug), priority/medium (Medium priority issue)
Comments
shamanthchandra-yb added the area/docdb (YugabyteDB core features), priority/high (High Priority), and status/awaiting-triage (Issue awaiting triage) labels on May 4, 2024
myang2021 added a commit that referenced this issue on May 9, 2024:
Summary: The memory leak was identified via code inspection while debugging this issue.

```
static void
YbInitUpdateRelationCacheState(YbUpdateRelationCacheState *state)
{
    YbLoadTupleCache(&state->pg_attrdef_cache, AttrDefaultRelationId,
                     &YbExtractAttrDefTupleCacheKey, "pg_attrdef local cache");
    YbLoadTupleCache(&state->pg_constraint_cache, ConstraintRelationId,
                     &YbExtractConstraintTupleCacheKey, "pg_constraint local cache");
}
```

and

```
static void
YbLoadTupleCache(YbTupleCache *cache, Oid relid,
                 YbTupleCacheKeyExtractor key_extractor, const char *cache_name)
{
    Assert(!(cache->rel || cache->data));
    cache->rel = heap_open(relid, AccessShareLock);

    HASHCTL ctl = {0};
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(YbTupleCacheEntry);

    cache->data = hash_create(cache_name, 32, &ctl, HASH_ELEM | HASH_BLOBS);
```

`YbLoadTupleCache` calls `hash_create`, but the corresponding `YbCleanupTupleCache` never calls `hash_destroy`:

```
static void
YbCleanupTupleCache(YbTupleCache *cache)
{
    if (!cache->rel)
        return;

    heap_close(cache->rel, AccessShareLock);
}
```

This diff fixes the memory leak by calling `hash_destroy` in `YbCleanupTupleCache`.

Jira: DB-11181

Test Plan: Manual test.

(1) `./bin/yb-ctl create --rf 1 --tserver_flags ysql_catalog_preload_additional_tables=true`

(2) Before the fix:

```
yugabyte=# select used_bytes,total_bytes from pg_get_backend_memory_contexts() where name like '%local cache';
 used_bytes | total_bytes
------------+-------------
       6616 |        8192
       5576 |        8192
(2 rows)
```

(3) After the fix:

```
yugabyte=# select used_bytes,total_bytes from pg_get_backend_memory_contexts() where name like '%local cache';
 used_bytes | total_bytes
------------+-------------
(0 rows)
```

Reviewers: kfranz
Reviewed By: kfranz
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D34845
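For reference, here is a minimal sketch of the cleanup fix described in the commit message (the actual change is in D34845); it assumes that adding `hash_destroy` is the only change needed in the cleanup path:

```
/* Sketch of the fix from D34845, reconstructed from the commit message above.
 * YbLoadTupleCache creates cache->data with hash_create, so the cleanup path
 * must release it with hash_destroy or the table leaks on every cleanup. */
static void
YbCleanupTupleCache(YbTupleCache *cache)
{
    if (!cache->rel)
        return;

    heap_close(cache->rel, AccessShareLock);

    if (cache->data)
    {
        hash_destroy(cache->data);  /* frees the dynahash table and its entries */
        cache->data = NULL;
    }
}
```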
yugabyte-ci added the area/ysql (Yugabyte SQL (YSQL)) label and removed the area/docdb (YugabyteDB core features) label on May 9, 2024
myang2021 added a commit that referenced this issue on May 13, 2024:
Summary: In newly conducted upgrade stress testing, we found that an upgrade from 2.18.4.0 to 2.20.3.0 completes successfully on a cluster with 12 DBs, 50 connections per DB, and 7 nodes, but an upgrade from 2.18.4.0 to 2024.1-b123 fails, and the PG memory spike with 2024.1-b123 is higher. Debugging showed that the upgrade to 2024.1-b123 runs more DDLs that increment the catalog version, and that after each catalog cache refresh the memory size of `CacheMemoryContext` grows, which suggests a memory leak.

The experiment below demonstrates the leak (between each `select`, run `alter user yugabyte superuser` in another session to trigger a catalog cache refresh on the next execution of the `select` statement):

```
yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
      524288 |     409160
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10254152
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10507912
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10761672
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   11015432
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   11269192
(1 row)
```

Per catalog cache refresh, `used_bytes` increases by an average of (11269192 - 10254152) / 4 = 253760 bytes. Because of the memory leak, the more catalog-version-bumping DDLs are executed, the more memory `CacheMemoryContext` uses, which is why the upgrade to 2024.1-b123 saw a higher PG memory spike.

This diff fixes two identified memory leaks in `CacheMemoryContext` that account for most of the leaked memory:
* ybctid memory needs to be released in `CatCacheRemoveCTup`.
* yb_table_properties needs to be released in `RelationDestroyRelation`.

Jira: DB-11181

Test Plan: Manual test.

After the fix:

```
yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
      524288 |     409000
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10253160
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10256456
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10259752
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10263048
(1 row)

yugabyte=# select total_bytes, used_bytes from pg_get_backend_memory_contexts() where name = 'CacheMemoryContext';
 total_bytes | used_bytes
-------------+------------
    12879448 |   10266344
(1 row)
```

Per catalog cache refresh, `used_bytes` now increases by an average of (10266344 - 10253160) / 4 = 3296 bytes.

Reviewers: kfranz, fizaa
Reviewed By: fizaa
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D34969
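The two releases listed in this commit follow the usual PostgreSQL rule that anything palloc'd into `CacheMemoryContext` while building a cache entry must be pfree'd when that entry is destroyed. The snippet below is only an illustrative sketch of that pattern; the struct, field, and function names are hypothetical and it is not the literal D34969 diff:

```
/* Illustrative pattern only; names are hypothetical. The point is that every
 * long-lived allocation made for a cache entry needs a matching pfree when
 * the entry is torn down, otherwise each catalog cache refresh leaks it. */
typedef struct ExampleCacheEntry
{
    void   *ybctid_copy;        /* palloc'd in CacheMemoryContext */
    void   *table_properties;   /* palloc'd in CacheMemoryContext */
} ExampleCacheEntry;

static void
example_cache_entry_destroy(ExampleCacheEntry *entry)
{
    if (entry->ybctid_copy)
        pfree(entry->ybctid_copy);       /* mirrors the CatCacheRemoveCTup fix */
    if (entry->table_properties)
        pfree(entry->table_properties);  /* mirrors the RelationDestroyRelation fix */
    pfree(entry);
}
```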
myang2021 added a commit that referenced this issue on May 16, 2024:
Summary: This diff fixes the remaining identified memory leaks in `CacheMemoryContext`, so that cache memory stays stable after each catalog cache refresh. Two more memory leaks were identified:

* `YbCompleteAttrProcessingImpl`: `relation->rd_att->constr` needs to be freed.
* `YbPreloadPgInheritsCache`: tighten the scope of `MemoryContextSwitchTo(CacheMemoryContext)`.

For the first leak, not all relations are cleaned out of the relation cache on a catalog cache refresh; some are considered `YbIsNonAlterableRelation` and remain. The overall processing flow of `YbCompleteAttrProcessingImpl` processes them anyway and sets their `relation->rd_att->constr` to the newly allocated `constr`, leaking the previous one.

For the second leak in `YbPreloadPgInheritsCache`, the PG code already allocates/copies into `hcxt` (`CacheMemoryContext` in this case) whenever required. If we additionally allocate temporary memory in `CacheMemoryContext`, that memory is never freed after being copied into `CacheMemoryContext`. The related code is:

```
static HTAB *YbPgInheritsCache;
```

and

```
void
YbInitPgInheritsCache()
{
    Assert(YbPgInheritsCache == NULL);
    HASHCTL ctl = {0};
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(YbPgInheritsCacheEntryData);
    ctl.hcxt = CacheMemoryContext;
```

Note that `ctl.hcxt = CacheMemoryContext` ensures the copy is made into `CacheMemoryContext`. By moving `MemoryContextSwitchTo(CacheMemoryContext)` to the minimum scope where it is truly required, the temporary memory is allocated from transaction memory, which is freed automatically when the transaction (created just for the cache refresh) ends.

With these two leaks fixed, a new unit test can verify that the cache memory size stays the same after a cache refresh.

Jira: DB-11181

Test Plan: ./yb_build.sh release --cxx-test pg_libpq-test --gtest_filter PgLibPqTest.CatalogCacheMemoryLeak -n 20

Reviewers: fizaa, kfranz
Reviewed By: fizaa
Subscribers: yql
Differential Revision: https://phorge.dev.yugabyte.com/D35017
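As a rough illustration of the "tighter scope" part of this fix (not the literal D35017 diff; the helper function and its name below are hypothetical), the switch into `CacheMemoryContext` is narrowed to just the allocation that must outlive the refresh, so scratch allocations land in the per-transaction context and are freed automatically when the refresh transaction ends:

```
/* Hypothetical helper illustrating the narrowed MemoryContextSwitchTo scope.
 * Only the copy that must survive the catalog cache refresh is made in
 * CacheMemoryContext; all other work stays in the caller's (transaction)
 * context and is released when that context is reset. */
static HeapTuple
copy_tuple_for_cache(HeapTuple tuple)
{
    MemoryContext oldcxt;
    HeapTuple   copy;

    /* Any scratch work happens here, in the caller's memory context. */

    oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
    copy = heap_copytuple(tuple);   /* long-lived copy for the preloaded cache */
    MemoryContextSwitchTo(oldcxt);

    return copy;
}
```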
myang2021 added a commit that referenced this issue on May 16, 2024:
…ttrdef local cache

(Backport of D34845; the commit message is identical to the May 9 commit above, with the following additional metadata.)

Jira: DB-11181
Original commit: 18754b1 / D34845
Reviewers: kfranz
Reviewed By: kfranz
Subscribers: yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D35137
myang2021 added a commit that referenced this issue on May 17, 2024:
(Backport of D34969; the commit message is identical to the May 13 commit above, with the following additional metadata.)

Jira: DB-11181
Original commit: 166ef51 / D34969
Reviewers: kfranz, fizaa
Reviewed By: fizaa
Subscribers: yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D35139
myang2021 added a commit that referenced this issue on May 20, 2024:
…e refresh

(Backport of D35017; the commit message is identical to that commit's message above, with the following metadata.)

Jira: DB-11181
Original commit: a9b9a2b / D35017
Reviewers: fizaa, kfranz
Reviewed By: kfranz
Subscribers: yql
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D35199
svarnau pushed a commit that referenced this issue on May 25, 2024:
(Same commit message as D34845 above.)
svarnau pushed a commit that referenced this issue on May 25, 2024:
(Same commit message as D34969 above.)
svarnau pushed a commit that referenced this issue on May 25, 2024:
(Same commit message as D35017 above.)
svarnau pushed a commit that referenced this issue on May 29, 2024:
(Same commit message as the D35137 backport above.)
svarnau pushed a commit that referenced this issue on May 29, 2024:
(Same commit message as the D35139 backport above.)
svarnau pushed a commit that referenced this issue on May 29, 2024:
(Same commit message as the D35199 backport above.)
yugabyte-ci added the priority/medium (Medium priority issue) label and removed the priority/high (High Priority) label on Aug 19, 2024
Jira Link: DB-11181
Description
Please find the detailed conversation in the Slack thread; it is attached in Jira.
I upgraded two universes from version 2.18.4 to 2024.1. Both times, the master leader node became unreachable. Even after bringing it back by stopping and starting the instance from the AWS console, the dropped connections never recovered and throughput did not return. Interestingly, this did not happen in our earlier 18-DB experiments, yet we hit it here with 12 DBs, even though the workload is lighter than before.
The difference: the 18-DB test was upgraded to 2024.1.0.0-b105, while the 12-DB test was upgraded to 2024.1.0.0-b123. The better defaults from Mark were previously in effect, but since 2024.1.0.0-b116 they are behind a gflag that is off by default (https://phorge.dev.yugabyte.com/D34565). It looks like those better defaults were masking this issue.
For the upgrade from 2.18.4 to 2.20.3 with 12 DBs, I did not observe the above issue; none of the nodes became unavailable, and as in our previous experiments, there was no throughput drop.
Issue Type
kind/bug