
TServer-level mem-tracker consumption could be much lower than tcmalloc reports. #2566

Closed
ttyusupov opened this issue Oct 10, 2019 · 6 comments
Labels: area/docdb (YugabyteDB core features), kind/bug (This issue is a bug)

@ttyusupov (Contributor)

During the scenario explained in #2563, I ran into the following issue.
The last memory consumption metrics for components occupying more than 100MB before the read buffer overflow at 16:47:11 UTC were saved to Prometheus at 16:47:05 UTC:

3.26GB: mem_tracker{metric_id="yb.tabletserver",metric_type="server"}
  2.83GB: _Call{metric_id="yb.tabletserver",metric_type="server"}
    2.83GB: _Inbound_RPC{metric_id="yb.tabletserver",metric_type="server"}
  232MB: _Read_Buffer{metric_id="yb.tabletserver",metric_type="server"}
    232MB: _Inbound_RPC{metric_id="yb.tabletserver",metric_type="server"}
      232MB: _Reading{metric_id="yb.tabletserver",metric_type="server"}
  186MB: _Tablets{metric_id="yb.tabletserver",metric_type="server"}
    186MB: mem_tracker{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}
      186MB:  _RegularDB{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}
        186MB: _MemTable{metric_type="tablet",table_id="11fd1f27765b46658112ede0cd881ba4",table_name="batch_ts_metrics_raw"}

We see only a 3.26GB peak memory consumption in Prometheus before the read buffer re-allocation error, but the error message shows that consumption was ~6GB.

cat yb-tserver.log.INFO.20190918-164602.23652.before.readbuffer.overlow | egrep -i 'memory|consumption'
I0918 16:46:01.930274 23652 mem_tracker.cc:250] MemTracker: hard memory limit is 6.201147 GB
I0918 16:46:01.930284 23652 mem_tracker.cc:252] MemTracker: soft memory limit is 5.270975 GB
I0918 16:46:01.934630 23652 shared_mem.cc:152] Using directory /dev/shm to store shared memory objects
I0918 16:47:06.514689 23678 maintenance_manager.cc:342] we have exceeded our soft memory limit (current capacity is 87.26%).  However, there are no ops currently runnable which would free memory.

87.26% × 6.20 GiB = 5.41 GiB ≈ 5.8 GB

W0918 16:47:11.144232 23664 binary_call_parser.cc:72] Unable to allocate read buffer because of limit, required: 25101547, blocked by: 0x0000000002162010 -> root, consumption: 6658375803 of 6658431180
W0918 16:47:11.175078 23892 operation_tracker.cc:153] Operation failed, tablet 2ff7aa789ae94044bc7bee791649e4cc operation memory consumption (38987981) has exceeded its limit (1073741824) or the limit of an ancestral tracker
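
For reference, a quick back-of-the-envelope check of the numbers above (a standalone C++ sketch, not YugabyteDB code; the constants are copied from the log lines):

#include <cstdio>

int main() {
  const double kGiB = 1024.0 * 1024.0 * 1024.0;
  const double kGB = 1e9;

  // Hard limit taken from the error message ("... of 6658431180").
  const double hard_limit_bytes = 6658431180.0;
  // Consumption reported by the failed read buffer allocation.
  const double consumption_bytes = 6658375803.0;
  // The maintenance manager warned at 87.26% of capacity.
  const double soft_warning_bytes = 0.8726 * hard_limit_bytes;

  std::printf("hard limit:  %.2f GiB (%.2f GB)\n",
              hard_limit_bytes / kGiB, hard_limit_bytes / kGB);
  std::printf("87.26%% mark: %.2f GiB (%.2f GB)\n",
              soft_warning_bytes / kGiB, soft_warning_bytes / kGB);
  std::printf("at overflow: %.2f GiB (%.2f GB)\n",
              consumption_bytes / kGiB, consumption_bytes / kGB);
  // Both of the last two are far above the 3.26GB peak that the
  // server-level mem-tracker exported to Prometheus.
  return 0;
}
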
ttyusupov added the kind/bug and area/docdb labels Oct 10, 2019
@bmatican (Contributor)

Adding another instance of this: we recently saw it in a different customer deployment, even on the master side. A seemingly benign network blip on the TS side triggered Raft group changes and the corresponding TSHeartbeat requests to the master; the master server mem-tracker reported 100MB while the total heap was 2GB...

@bmatican (Contributor) commented Nov 4, 2019

Adapted from @ttyusupov; here is how I ended up easily reproducing this issue, plus an OOM, in a 3-node cluster.

n1-highcpu-4 (4 CPUs, <4 GB of memory)

--memstore_size_mb=16
--db_block_cache_size_percentage=25
--use_priority_thread_pool_for_compactions=true
--allow_preempting_compactions=false
--rocksdb_max_background_compactions=2
--enable_ondisk_compression=false
--yb_num_shards_per_tserver=8
--memory_limit_hard_bytes=3000000000
--rocksdb_compact_flush_rate_limit_bytes_per_sec=256000000
--rocksdb_level0_slowdown_writes_trigger=8

The flow then is: start a batch workload with java -jar ~/code/yb-sample-apps/target/yb-sample-apps.jar --workload CassandraBatchTimeseries --nodes $IP1:9042,$IP2:9042,$IP3:9042 --num_threads_read 1 --num_threads_write 16 --min_metrics_count 10000 --max_metrics_count 20000 --num_unique_keys 1000000000 --nouuid --value_size 100

After a couple of minutes of running the workload, stop one TS.
Leaders will be bouncing for a bit.
Once leaders stabilize, they will get timeouts to the failed TS and memory will start growing. This should already be good enough for this issue.

If we then also restart the TS after some 30-60s, it will end up OOMing.

cc @rajukumaryb

@bmatican (Contributor) commented Nov 8, 2019

Did some more experiments last night. I think much (all?) of the unexpected gap between server and root might be accounted for by internal tcmalloc memory:

Comparing our /mem-trackers output and the verbose tcmalloc stats in /memz:

server    294.09M (303.01M)
root    465.87M

VS

MALLOC:      332713024 (  317.3 MiB) Bytes in use by application
MALLOC: =    488669184 (  466.0 MiB) Actual memory used (physical + swap)

This lines up pretty well, with likely some allocations by our app not being tracked by our mem-trackers.

I think we should add all of the freelist etc. stats from tcmalloc as a special memtracker under root:

------------------------------------------------
MALLOC:     1315894112 ( 1254.9 MiB) Bytes in use by application
MALLOC: +     96780288 (   92.3 MiB) Bytes in page heap freelist
MALLOC: +    527491976 (  503.1 MiB) Bytes in central cache freelist
MALLOC: +      4984320 (    4.8 MiB) Bytes in transfer cache freelist
MALLOC: +    147798808 (  141.0 MiB) Bytes in thread cache freelists
MALLOC: +     19529728 (   18.6 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2112479232 ( 2014.6 MiB) Actual memory used (physical + swap)

Basically, the first two children in our root memtracker UI should be tcmalloc internals and server, with the tcmalloc one then having all the stats that we see at the top of /memz. With that, it should be much easier to validate that this is indeed the main culprit!

Source for tcmalloc: https://github.com/gperftools/gperftools/blob/master/src/tcmalloc.cc#L403

Relevant stats:

  const uint64_t virtual_memory_used = (stats.pageheap.system_bytes
                                        + stats.metadata_bytes);
  const uint64_t physical_memory_used = (virtual_memory_used
                                         - stats.pageheap.unmapped_bytes);
  const uint64_t bytes_in_use_by_app = (physical_memory_used
                                        - stats.metadata_bytes
                                        - stats.pageheap.free_bytes
                                        - stats.central_bytes
                                        - stats.transfer_bytes
                                        - stats.thread_bytes);
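
As a rough illustration of what such a tcmalloc-internals tracker could report, here is a minimal sketch built on gperftools' MallocExtension numeric properties (the helper name is hypothetical; the MallocExtension API and the property strings are gperftools', and this is not YugabyteDB code):

#include <gperftools/malloc_extension.h>
#include <cstdio>

// Hypothetical helper: print the same breakdown that /memz shows, using the
// generic/tcmalloc numeric properties exposed by gperftools.
void ReportTCMallocBreakdown() {
  auto get = [](const char* name) {
    size_t value = 0;
    MallocExtension::instance()->GetNumericProperty(name, &value);
    return value;
  };

  const size_t allocated   = get("generic.current_allocated_bytes");
  const size_t heap        = get("generic.heap_size");
  const size_t pageheap    = get("tcmalloc.pageheap_free_bytes");
  const size_t unmapped    = get("tcmalloc.pageheap_unmapped_bytes");
  const size_t central     = get("tcmalloc.central_cache_free_bytes");
  const size_t transfer    = get("tcmalloc.transfer_cache_free_bytes");
  const size_t thread_free = get("tcmalloc.thread_cache_free_bytes");

  std::printf("bytes in use by app:     %zu\n", allocated);
  std::printf("pageheap freelist:       %zu\n", pageheap);
  std::printf("central cache freelist:  %zu\n", central);
  std::printf("transfer cache freelist: %zu\n", transfer);
  std::printf("thread cache freelists:  %zu\n", thread_free);
  // generic.heap_size excludes malloc metadata; unmapped pages are not resident.
  std::printf("physical heap (approx.): %zu\n", heap - unmapped);
}

Surfacing these next to the server tracker would make the root-vs-server gap directly visible in the mem-trackers UI.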

cc @ttyusupov @spolitov @rajukumaryb @kmuthukk

rajukumaryb added a commit that referenced this issue Nov 20, 2019
Summary: TCMalloc stats are not automatically refreshed. Add poll from root MemTracker.

Test Plan: Manual - check if TCMalloc stats in `lynx 127.0.0.1:9000/mem-trackers` change with load

Reviewers: sergei

Reviewed By: sergei

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D7563
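
As the summary notes, TCMalloc stats are not refreshed automatically, so a consumer has to poll them. A minimal sketch of such a poll from a root-level tracker (hypothetical names, assuming gperftools is linked; not the actual D7563 change) might look like:

#include <gperftools/malloc_extension.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical poller: periodically snapshot tcmalloc's view of the heap so a
// root-level tracker can report an up-to-date consumption.
std::atomic<size_t> g_tcmalloc_physical_bytes{0};

void PollTCMallocStats(std::chrono::milliseconds period) {
  while (true) {
    size_t heap = 0, unmapped = 0;
    MallocExtension::instance()->GetNumericProperty("generic.heap_size", &heap);
    MallocExtension::instance()->GetNumericProperty(
        "tcmalloc.pageheap_unmapped_bytes", &unmapped);
    g_tcmalloc_physical_bytes.store(heap - unmapped, std::memory_order_relaxed);
    std::this_thread::sleep_for(period);
  }
}
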
rajukumaryb reopened this Nov 20, 2019
ttyusupov self-assigned this Dec 24, 2019
@ttyusupov (Contributor, Author)

Re-checked on the latest master and found that the delta between generic_current_allocated_bytes and the tserver mem-tracker is much smaller: ~5%.

@ttyusupov (Contributor, Author)

Found the commit which changed the delta from ~30% to ~5%.
aa35010:

generic_heap_size: 6.49G
generic_current_allocated_bytes: 5.91G
tserver mem_tracker: 5.63G
delta: (5.91-5.63)/5.91=4.7%

Previous commit: 4843cf2:

generic_heap_size: 6.49G
generic_current_allocated_bytes: 6.26G
tserver mem_tracker: 4.5G
delta: (6.26-4.5)/6.26=28%

@ttyusupov (Contributor, Author)

It might be related to the space overhead of the old (pre-C++11, copy-on-write) std::string implementation in GCC.
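
One way to check that kind of overhead is to compare the bytes the application nominally stores against the heap delta tcmalloc reports; a minimal standalone sketch (assuming gperftools is linked; not YugabyteDB code):

#include <gperftools/malloc_extension.h>
#include <cstdio>
#include <string>
#include <vector>

// Fill a vector with small strings and compare the nominal payload with the
// heap growth tcmalloc observes. With the old copy-on-write std::string every
// instance owns a separate heap block (header + data, rounded up to a tcmalloc
// size class), so the real footprint can noticeably exceed what a
// byte-counting mem-tracker charges; with a modern SSO string the 10-byte
// payload would not heap-allocate at all. The vector's own buffer is included
// in the delta.
int main() {
  auto allocated = [] {
    size_t v = 0;
    MallocExtension::instance()->GetNumericProperty(
        "generic.current_allocated_bytes", &v);
    return v;
  };

  const size_t kCount = 100000;
  const size_t before = allocated();

  std::vector<std::string> strings;
  strings.reserve(kCount);
  for (size_t i = 0; i < kCount; ++i) {
    strings.emplace_back("0123456789");  // 10-byte payload per string
  }

  const size_t after = allocated();
  std::printf("nominal payload bytes: %zu\n", kCount * strings.front().size());
  std::printf("tcmalloc heap delta:   %zu\n", after - before);
  return 0;
}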
