
TServer-level mem-tracker consumption could be much lower than tcmalloc reports. #2566

Closed
ttyusupov opened this issue Oct 10, 2019 · 6 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug)

@ttyusupov (Contributor)

During the scenario explained in #2563, I ran into the following issue.
The last memory consumption metrics for components occupying more than 100MB before the read buffer overflow at 16:47:11 UTC were saved to Prometheus at 16:47:05 UTC:

3.26GB: mem_tracker{metric_id="yb.tabletserver",metric_type="server"}
  2.83GB: _Call{metric_id="yb.tabletserver",metric_type="server"}
    2.83GB: _Inbound_RPC{metric_id="yb.tabletserver",metric_type="server"}
  232MB: _Read_Buffer{metric_id="yb.tabletserver",metric_type="server"}
    232MB: _Inbound_RPC{metric_id="yb.tabletserver",metric_type="server"}
      232MB: _Reading{metric_id="yb.tabletserver",metric_type="server"}
  186MB: _Tablets{metric_id="yb.tabletserver",metric_type="server"}
    186MB: mem_tracker{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}
      186MB:  _RegularDB{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}
        186MB: _MemTable{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}

We see only a 3.26GB peak memory consumption in Prometheus before the read buffer re-allocation error, but the error message shows that consumption was ~6GB.

cat yb-tserver.log.INFO.20190918-164602.23652.before.readbuffer.overlow | egrep -i 'memory|consumption'
I0918 16:46:01.930274 23652 mem_tracker.cc:250] MemTracker: hard memory limit is 6.201147 GB
I0918 16:46:01.930284 23652 mem_tracker.cc:252] MemTracker: soft memory limit is 5.270975 GB
I0918 16:46:01.934630 23652 shared_mem.cc:152] Using directory /dev/shm to store shared memory objects
I0918 16:47:06.514689 23678 maintenance_manager.cc:342] we have exceeded our soft memory limit (current capacity is 87.26%).  However, there are no ops currently runnable which would free memory.

87.26% × 6.20 GiB = 5.41 GiB ≈ 5.8 GB

W0918 16:47:11.144232 23664 binary_call_parser.cc:72] Unable to allocate read buffer because of limit, required: 25101547, blocked by: 0x0000000002162010 -> root, consumption: 6658375803 of 6658431180
W0918 16:47:11.175078 23892 operation_tracker.cc:153] Operation failed, tablet 2ff7aa789ae94044bc7bee791649e4cc operation memory consumption (38987981) has exceeded its limit (1073741824) or the limit of an ancestral tracker
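
For reference, a quick back-of-the-envelope check of the numbers above (a standalone C++ sketch, not YugabyteDB code; the constants are copied from the log lines):

#include <cstdio>

int main() {
  const double kGiB = 1024.0 * 1024.0 * 1024.0;
  const double kGB = 1e9;

  // Hard limit taken from the error message ("... of 6658431180").
  const double hard_limit_bytes = 6658431180.0;
  // Consumption reported by the failed read buffer allocation.
  const double consumption_bytes = 6658375803.0;
  // The maintenance manager warned at 87.26% of capacity.
  const double soft_warning_bytes = 0.8726 * hard_limit_bytes;

  std::printf("hard limit:  %.2f GiB (%.2f GB)\n",
              hard_limit_bytes / kGiB, hard_limit_bytes / kGB);
  std::printf("87.26%% mark: %.2f GiB (%.2f GB)\n",
              soft_warning_bytes / kGiB, soft_warning_bytes / kGB);
  std::printf("at overflow: %.2f GiB (%.2f GB)\n",
              consumption_bytes / kGiB, consumption_bytes / kGB);
  // Both of the last two are far above the 3.26GB peak that the
  // server-level mem-tracker exported to Prometheus.
  return 0;
}
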
ttyusupov added the kind/bug and area/docdb labels Oct 10, 2019
@bmatican (Contributor)

Adding another instance of this: we recently saw it in a different customer deployment, even on the master side. A seemingly benign network blip on the TS side triggered Raft group changes and the corresponding TSHeartbeat requests to the master; the master server mem-tracker reported 100MB while the total heap was 2GB...

@bmatican (Contributor) commented Nov 4, 2019

Adapted from @ttyusupov; here is how I ended up easily reproducing this issue, plus an OOM, in a 3-node cluster.

n1-highcpu-4 (4 CPUs, <4 GB of memory)

--memstore_size_mb=16
--db_block_cache_size_percentage=25
--use_priority_thread_pool_for_compactions=true
--allow_preempting_compactions=false
--rocksdb_max_background_compactions=2
--enable_ondisk_compression=false
--yb_num_shards_per_tserver=8
--memory_limit_hard_bytes=3000000000
--rocksdb_compact_flush_rate_limit_bytes_per_sec=256000000
--rocksdb_level0_slowdown_writes_trigger=8

The flow then is: start a batch workload with java -jar ~/code/yb-sample-apps/target/yb-sample-apps.jar --workload CassandraBatchTimeseries --nodes $IP1:9042,$IP2:9042,$IP3:9042 --num_threads_read 1 --num_threads_write 16 --min_metrics_count 10000 --max_metrics_count 20000 --num_unique_keys 1000000000 --nouuid --value_size 100

After a couple of minutes of running the workload, stop one TS.
Leaders will be bouncing for a bit.
Once leaders stabilize, they will get timeouts to the failed TS and memory will start growing. This should already be good enough for this issue.

If we then also restart the TS after some 30-60s, it will end up OOMing.

cc @rajukumaryb

@bmatican (Contributor) commented Nov 8, 2019

Did some more experiments last night. I think much (all?) of the unexpected gap between server and root might be accounted for by internal tcmalloc memory:

Comparing our /mem-trackers output and the verbose tcmalloc stats in /memz:

server    294.09M (303.01M)
root    465.87M

VS

MALLOC:      332713024 (  317.3 MiB) Bytes in use by application
MALLOC: =    488669184 (  466.0 MiB) Actual memory used (physical + swap)

This lines up pretty well, with likely some allocations by our app not being tracked by our mem-trackers.

I think we should add all of the freelist etc. stats from tcmalloc as a special memtracker under root:

------------------------------------------------
MALLOC:     1315894112 ( 1254.9 MiB) Bytes in use by application
MALLOC: +     96780288 (   92.3 MiB) Bytes in page heap freelist
MALLOC: +    527491976 (  503.1 MiB) Bytes in central cache freelist
MALLOC: +      4984320 (    4.8 MiB) Bytes in transfer cache freelist
MALLOC: +    147798808 (  141.0 MiB) Bytes in thread cache freelists
MALLOC: +     19529728 (   18.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2112479232 ( 2014.6 MiB) Actual memory used (physical + swap)

Basically, the first two children in our root memtracker UI should be tcmalloc internals and server, with the tcmalloc one then having all the stats that we see at the top of /memz. With that, it should be much easier to validate that this is indeed the main culprit!

Source for tcmalloc: https://github.com/gperftools/gperftools/blob/master/src/tcmalloc.cc#L403

Relevant stats:

  const uint64_t virtual_memory_used = (stats.pageheap.system_bytes
                                        + stats.metadata_bytes);
  const uint64_t physical_memory_used = (virtual_memory_used
                                         - stats.pageheap.unmapped_bytes);
  const uint64_t bytes_in_use_by_app = (physical_memory_used
                                        - stats.metadata_bytes
                                        - stats.pageheap.free_bytes
                                        - stats.central_bytes
                                        - stats.transfer_bytes
                                        - stats.thread_bytes);
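
As a rough illustration of what such a tcmalloc-internals tracker could report, here is a minimal sketch built on gperftools' MallocExtension numeric properties (the helper name is hypothetical; the MallocExtension API and the property strings are gperftools', and this is not YugabyteDB code):

#include <gperftools/malloc_extension.h>
#include <cstdio>

// Hypothetical helper: print the same breakdown that /memz shows, using the
// generic/tcmalloc numeric properties exposed by gperftools.
void ReportTCMallocBreakdown() {
  auto get = [](const char* name) {
    size_t value = 0;
    MallocExtension::instance()->GetNumericProperty(name, &value);
    return value;
  };

  const size_t allocated   = get("generic.current_allocated_bytes");
  const size_t heap        = get("generic.heap_size");
  const size_t pageheap    = get("tcmalloc.pageheap_free_bytes");
  const size_t unmapped    = get("tcmalloc.pageheap_unmapped_bytes");
  const size_t central     = get("tcmalloc.central_cache_free_bytes");
  const size_t transfer    = get("tcmalloc.transfer_cache_free_bytes");
  const size_t thread_free = get("tcmalloc.thread_cache_free_bytes");

  std::printf("bytes in use by app:     %zu\n", allocated);
  std::printf("pageheap freelist:       %zu\n", pageheap);
  std::printf("central cache freelist:  %zu\n", central);
  std::printf("transfer cache freelist: %zu\n", transfer);
  std::printf("thread cache freelists:  %zu\n", thread_free);
  // generic.heap_size excludes malloc metadata; unmapped pages are not resident.
  std::printf("physical heap (approx.): %zu\n", heap - unmapped);
}

Surfacing these next to the server tracker would make the root-vs-server gap directly visible in the mem-trackers UI.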

cc @ttyusupov @spolitov @rajukumaryb @kmuthukk

rajukumaryb added a commit that referenced this issue Nov 20, 2019
Summary: TCMalloc stats are not automatically refreshed. Add poll from root MemTracker.

Test Plan: Manual - check if TCMalloc stats in `lynx 127.0.0.1:9000/mem-trackers` change with load

Reviewers: sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D7563
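
As the summary notes, TCMalloc stats are not refreshed automatically, so a consumer has to poll them. A minimal sketch of such a poll from a root-level tracker (hypothetical names, assuming gperftools is linked; not the actual D7563 change) might look like:

#include <gperftools/malloc_extension.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical poller: periodically snapshot tcmalloc's view of the heap so a
// root-level tracker can report an up-to-date consumption.
std::atomic<size_t> g_tcmalloc_physical_bytes{0};

void PollTCMallocStats(std::chrono::milliseconds period) {
  while (true) {
    size_t heap = 0, unmapped = 0;
    MallocExtension::instance()->GetNumericProperty("generic.heap_size", &heap);
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_unmapped_bytes", &unmapped);
    g_tcmalloc_physical_bytes.store(heap - unmapped, std::memory_order_relaxed);
    std::this_thread::sleep_for(period);
  }
}
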
rajukumaryb reopened this Nov 20, 2019
ttyusupov self-assigned this Dec 24, 2019
@ttyusupov (Contributor, Author)

Re-checked on the latest master and found that the delta between generic_current_allocated_bytes and the tserver mem-tracker is much smaller: ~5%.

@ttyusupov (Contributor, Author)

Found the commit which changed the delta from ~30% to ~5%.
aa35010:

generic_heap_size: 6.49G
generic_current_allocated_bytes: 5.91G
tserver mem_tracker: 5.63G
delta: (5.91-5.63)/5.91=4.7%

Previous commit: 4843cf2:

generic_heap_size: 6.49G
generic_current_allocated_bytes: 6.26G
tserver mem_tracker: 4.5G
delta: (6.26-4.5)/6.26=28%

@ttyusupov (Contributor, Author)

It might be related to the space overhead of the old (pre-C++11, copy-on-write) std::string implementation in GCC.
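
One way to check that kind of overhead is to compare the bytes the application nominally stores against the heap delta tcmalloc reports; a minimal standalone sketch (assuming gperftools is linked; not YugabyteDB code):

#include <gperftools/malloc_extension.h>
#include <cstdio>
#include <string>
#include <vector>

// Fill a vector with small strings and compare the nominal payload with the
// heap growth tcmalloc observes. With the old copy-on-write std::string every
// instance owns a separate heap block (header + data, rounded up to a tcmalloc
// size class), so the real footprint can noticeably exceed what a
// byte-counting mem-tracker charges; with a modern SSO string the 10-byte
// payload would not heap-allocate at all. The vector's own buffer is included
// in the delta.
int main() {
  auto allocated = [] {
    size_t v = 0;
    MallocExtension::instance()->GetNumericProperty(
        "generic.current_allocated_bytes", &v);
    return v;
  };

  const size_t kCount = 100000;
  const size_t before = allocated();

  std::vector<std::string> strings;
  strings.reserve(kCount);
  for (size_t i = 0; i < kCount; ++i) {
    strings.emplace_back("0123456789");  // 10-byte payload per string
  }

  const size_t after = allocated();
  std::printf("nominal payload bytes: %zu\n", kCount * strings.front().size());
  std::printf("tcmalloc heap delta:   %zu\n", after - before);
  return 0;
}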
