Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

集群压测出现很缓慢的内存泄漏 #101

Open
mefly2012 opened this issue Sep 18, 2024 · 4 comments
Open

集群压测出现很缓慢的内存泄漏 #101

mefly2012 opened this issue Sep 18, 2024 · 4 comments

Comments

@mefly2012
Copy link

集群压测出现很缓慢的内存泄漏
集群压测20天后,gc日志显示内存使用率有明显升高,导出的堆直方图显示某些数据对象增长异常,压测停掉后恢复正常。


压测方式:
bifromq 集群3个节点,1.5w 客户端通过 clb 随机连接这3个节点,随眠 5s 发送一次 300 byts的消息,然后其中1000个客户端断开连接,1分钟后,再连接上,如此循环往复。发送 topic 是 test_topic1、test_topic2、test_topic3 , 另外有 60个客户端共享订阅这3个topic
压测期间,没有 warn、error 日志,共享订阅数据也正常。
节选的 gc 日志

[2024-09-02T23:28:32.549+0800][22808][gc] GC(9771) Garbage Collection (Allocation Rate) 1578M(79%)->430M(22%)
[2024-09-02T23:28:56.615+0800][22808][gc] GC(9772) Garbage Collection (Allocation Rate) 1540M(77%)->426M(21%)
[2024-09-02T23:29:26.213+0800][22808][gc] GC(9773) Garbage Collection (Allocation Rate) 1710M(86%)->408M(20%)
[2024-09-02T23:29:57.005+0800][22808][gc] GC(9774) Garbage Collection (Allocation Rate) 1722M(86%)->462M(23%)
[2024-09-02T23:30:17.440+0800][22808][gc] GC(9775) Garbage Collection (Allocation Rate) 1640M(82%)->432M(22%)
[2024-09-02T23:30:45.364+0800][22808][gc] GC(9776) Garbage Collection (Allocation Rate) 1680M(84%)->426M(21%)
[2024-09-02T23:31:06.724+0800][22808][gc] GC(9777) Garbage Collection (Allocation Rate) 1438M(72%)->434M(22%)
[2024-09-02T23:31:37.394+0800][22808][gc] GC(9778) Garbage Collection (Allocation Rate) 1768M(88%)->414M(21%)
[2024-09-02T23:32:06.925+0800][22808][gc] GC(9779) Garbage Collection (Allocation Rate) 1718M(86%)->468M(23%)

[2024-09-03T07:41:17.484+0800][22808][gc] GC(11004) Garbage Collection (Allocation Rate) 1760M(88%)->442M(22%)
[2024-09-03T07:41:47.059+0800][22808][gc] GC(11005) Garbage Collection (Allocation Rate) 1684M(84%)->518M(26%)
[2024-09-03T07:41:53.827+0800][22808][gc] GC(11006) Garbage Collection (Allocation Rate) 854M(43%)->550M(28%)
[2024-09-03T07:42:06.380+0800][22808][gc] GC(11007) Garbage Collection (Allocation Rate) 1206M(60%)->450M(22%)
[2024-09-03T07:42:21.776+0800][22808][gc] GC(11008) Garbage Collection (Allocation Rate) 1112M(56%)->454M(23%)
[2024-09-03T07:42:50.316+0800][22808][gc] GC(11009) Garbage Collection (Allocation Rate) 1676M(84%)->446M(22%)
[2024-09-03T07:43:17.470+0800][22808][gc] GC(11010) Garbage Collection (Allocation Rate) 1664M(83%)->444M(22%)

[2024-09-04T11:28:05.505+0800][22808][gc] GC(15953) Garbage Collection (Allocation Rate) 664M(33%)->624M(31%)
[2024-09-04T11:28:06.608+0800][22808][gc] GC(15954) Garbage Collection (Allocation Rate) 628M(31%)->514M(26%)
[2024-09-04T11:28:21.477+0800][22808][gc] GC(15955) Garbage Collection (Allocation Rate) 1162M(58%)->510M(26%)
[2024-09-04T11:28:47.829+0800][22808][gc] GC(15956) Garbage Collection (Allocation Rate) 1608M(80%)->520M(26%)
[2024-09-04T11:29:06.812+0800][22808][gc] GC(15957) Garbage Collection (Allocation Rate) 1388M(69%)->532M(27%)
[2024-09-04T11:29:32.606+0800][22808][gc] GC(15958) Garbage Collection (Allocation Rate) 1694M(85%)->520M(26%)
[2024-09-04T11:29:57.911+0800][22808][gc] GC(15959) Garbage Collection (Allocation Rate) 1644M(82%)->522M(26%)
[2024-09-04T11:30:10.347+0800][22808][gc] GC(15960) Garbage Collection (Allocation Rate) 1336M(67%)->600M(30%)

[2024-09-17T09:39:40.035+0800][22808][gc] GC(113427) Garbage Collection (Allocation Rate) 972M(49%)->940M(47%)
[2024-09-17T09:39:41.513+0800][22808][gc] GC(113428) Garbage Collection (Allocation Rate) 942M(47%)->798M(40%)
[2024-09-17T09:39:42.934+0800][22808][gc] GC(113429) Garbage Collection (Allocation Rate) 798M(40%)->788M(39%)
[2024-09-17T09:39:51.773+0800][22808][gc] GC(113430) Garbage Collection (Allocation Rate) 1160M(58%)->768M(38%)
[2024-09-17T09:40:06.767+0800][22808][gc] GC(113431) Garbage Collection (Allocation Rate) 1424M(71%)->768M(38%)
[2024-09-17T09:40:21.783+0800][22808][gc] GC(113432) Garbage Collection (Allocation Rate) 1414M(71%)->770M(38%)
[2024-09-17T09:40:36.799+0800][22808][gc] GC(113433) Garbage Collection (Allocation Rate) 1412M(71%)->768M(38%)
[2024-09-17T09:40:47.321+0800][22808][gc] GC(113434) Garbage Collection (Allocation Rate) 1228M(61%)->788M(39%)

[2024-09-18T10:17:21.988+0800][22808][gc] GC(122063) Garbage Collection (Allocation Rate) 980M(49%)->802M(40%)
[2024-09-18T10:17:23.464+0800][22808][gc] GC(122064) Garbage Collection (Allocation Rate) 804M(40%)->798M(40%)
[2024-09-18T10:17:36.865+0800][22808][gc] GC(122065) Garbage Collection (Allocation Rate) 1372M(69%)->792M(40%)
[2024-09-18T10:17:51.822+0800][22808][gc] GC(122066) Garbage Collection (Allocation Rate) 1434M(72%)->792M(40%)
[2024-09-18T10:18:06.857+0800][22808][gc] GC(122067) Garbage Collection (Allocation Rate) 1440M(72%)->792M(40%)
[2024-09-18T10:18:21.883+0800][22808][gc] GC(122068) Garbage Collection (Allocation Rate) 1430M(72%)->794M(40%)
[2024-09-18T10:18:27.338+0800][22808][gc] GC(122069) Garbage Collection (Allocation Rate) 1028M(51%)->806M(40%)
[2024-09-18T10:18:37.049+0800][22808][gc] GC(122070) Garbage Collection (Allocation Rate) 1202M(60%)->790M(40%)

jmap -histo 命令导出的 jmap 直方图对比如下:
2024-09-06的

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       1216624       68130944  java.util.LinkedHashMap$Entry (java.base@17.0.10)
   2:        151789       58781424  [Ljava.lang.Object; (java.base@17.0.10)
   3:        933437       48740648  [B (java.base@17.0.10)
   4:        872736       27927552  java.lang.String (java.base@17.0.10)
   5:        186231       27850976  [Ljava.util.HashMap$Node; (java.base@17.0.10)
   6:        329554       15818592  com.baidu.bifromq.sessiondict.rpc.proto.Session
   7:        187767       15021360  java.util.LinkedHashMap (java.base@17.0.10)
   8:        329553       10545696  io.grpc.internal.RetriableStream$1SendMessageEntry
   9:        171809        9621304  com.baidu.bifromq.type.ClientInfo
  10:        180231        8651088  com.google.protobuf.MapField
  11:        329553        7909272  io.grpc.internal.RetriableStream$1FlushEntry
  12:         30825        5983504  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
  13:        180231        5767392  com.google.protobuf.MapField$MutabilityAwareMap
  14:        180231        4325544  com.google.protobuf.MapField$ImmutableMessageConverter
  15:        107165        4286600  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  16:         43235        4150560  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
  17:        168732        4049568  java.util.LinkedHashMap$LinkedEntrySet (java.base@17.0.10)
  18:         45366        3992208  io.netty.channel.DefaultChannelHandlerContext
  19:         16723        3112744  java.lang.Class (java.base@17.0.10)
  20:        118420        2842080  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  21:          8215        2168760  io.grpc.internal.ManagedChannelImpl$ChannelStreamProvider$1RetryStream
  22:         67624        2163968  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@17.0.10)
  23:         67048        2145536  io.grpc.PersistentHashArrayMappedTrie$Leaf
  24:         85893        2061432  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  25:          9619        1692944  com.baidu.bifromq.baserpc.ManagedRequestPipeline
  26:          7601        1679856  [I (java.base@17.0.10)
  27:          5022        1526688  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  28:         44011        1408352  java.util.ArrayDeque (java.base@17.0.10)
  29:         10974        1316880  io.grpc.internal.MessageDeframer
  30:         20485        1311040  io.reactivex.rxjava3.operators.SpscLinkedArrayQueue
  31:         23366        1308496  io.reactivex.rxjava3.internal.observers.LambdaObserver
  32:         31966        1278640  java.util.HashMap$Node (java.base@17.0.10)
  33:         38596        1235072  java.util.concurrent.ConcurrentLinkedQueue (java.base@17.0.10)

2024-09-18的

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:       1986343      111235208  java.util.LinkedHashMap$Entry (java.base@17.0.10)
   2:       1368530       68770704  [B (java.base@17.0.10)
   3:        151417       63513224  [Ljava.lang.Object; (java.base@17.0.10)
   4:        295559       43598464  [Ljava.util.HashMap$Node; (java.base@17.0.10)
   5:       1307849       41851168  java.lang.String (java.base@17.0.10)
   6:        550335       26416080  com.baidu.bifromq.sessiondict.rpc.proto.Session
   7:        296722       23737760  java.util.LinkedHashMap (java.base@17.0.10)
   8:        550334       17610688  io.grpc.internal.RetriableStream$1SendMessageEntry
   9:        281768       15779008  com.baidu.bifromq.type.ClientInfo
  10:        290193       13929264  com.google.protobuf.MapField
  11:        550334       13208016  io.grpc.internal.RetriableStream$1FlushEntry
  12:        290193        9286176  com.google.protobuf.MapField$MutabilityAwareMap
  13:        290193        6964632  com.google.protobuf.MapField$ImmutableMessageConverter
  14:        278892        6693408  java.util.LinkedHashMap$LinkedEntrySet (java.base@17.0.10)
  15:         29155        5745072  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
  16:        103608        4144320  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  17:         41231        3958176  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
  18:         42350        3726800  io.netty.channel.DefaultChannelHandlerContext
  19:         16799        3127040  java.lang.Class (java.base@17.0.10)
  20:        118110        2834640  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
  21:          8217        2169288  io.grpc.internal.ManagedChannelImpl$ChannelStreamProvider$1RetryStream
  22:         67517        2160544  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@17.0.10)
  23:         67071        2146272  io.grpc.PersistentHashArrayMappedTrie$Leaf
  24:         83556        2005344  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  25:          9619        1692944  com.baidu.bifromq.baserpc.ManagedRequestPipeline
  26:          7611        1680064  [I (java.base@17.0.10)
  27:          4686        1424544  com.baidu.bifromq.mqtt.handler.v3.MQTT3TransientSessionHandler
  28:         43692        1398144  java.util.ArrayDeque (java.base@17.0.10)
  29:         10977        1317240  io.grpc.internal.MessageDeframer
  30:         20487        1311168  io.reactivex.rxjava3.operators.SpscLinkedArrayQueue
  31:         23370        1308720  io.reactivex.rxjava3.internal.observers.LambdaObserver
  32:         31691        1267640  java.util.HashMap$Node (java.base@17.0.10)
  33:         38607        1235424  java.util.concurrent.ConcurrentLinkedQueue (java.base@17.0.10)
  34:         12384        1188864  io.grpc.CallOptions
  35:         11210        1165840  io.netty.handler.codec.http2.WeightedFairQueueByteDistributor$State
  36:          4769        1068256  io.netty.channel.epoll.EpollSocketChannel
  37:         32975        1055200  java.util.concurrent.ConcurrentHashMap$KeySetView (java.base@17.0.10)
  38:         10977        1053792  io.grpc.internal.MessageFramer
  39:          8217        1051776  io.grpc.internal.ClientCallImpl
  40:          1520        1033600  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue

2024-09-18 压测停止后恢复正常

num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        130994       48524392  [Ljava.lang.Object; (java.base@17.0.10)
   2:        172121       18939280  [B (java.base@17.0.10)
   3:           525        4444576  [C (java.base@17.0.10)
   4:        122091        3906912  java.lang.String (java.base@17.0.10)
   5:         18769        3490856  java.lang.Class (java.base@17.0.10)
   6:        100860        2420640  java.util.concurrent.atomic.AtomicReference (java.base@17.0.10)
   7:          7102        2056960  [Ljava.util.HashMap$Node; (java.base@17.0.10)
   8:         36292        2032352  java.util.LinkedHashMap$Entry (java.base@17.0.10)
   9:         49112        1964480  java.util.concurrent.ConcurrentHashMap$Node (java.base@17.0.10)
  10:          2954        1942496  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@17.0.10)
  11:         59873        1915936  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@17.0.10)
  12:          7232        1909248  io.grpc.internal.ManagedChannelImpl$ChannelStreamProvider$1RetryStream
  13:         57881        1852192  io.grpc.PersistentHashArrayMappedTrie$Leaf
  14:          8369        1846720  [I (java.base@17.0.10)
  15:          9612        1691712  com.baidu.bifromq.baserpc.ManagedRequestPipeline
  16:         19495        1247680  io.reactivex.rxjava3.operators.SpscLinkedArrayQueue
  17:         21344        1195264  io.reactivex.rxjava3.internal.observers.LambdaObserver
  18:         48789        1170936  java.util.concurrent.atomic.AtomicLong (java.base@17.0.10)
  19:          9645        1157400  io.grpc.internal.MessageDeframer
  20:         12024        1154304  io.grpc.CallOptions
  21:         27506        1100240  java.util.HashMap$Node (java.base@17.0.10)
  22:         34343        1098976  java.util.ArrayDeque (java.base@17.0.10)
  23:         33882        1084224  java.util.concurrent.ConcurrentLinkedQueue (java.base@17.0.10)
  24:          1520        1033600  io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue
  25:          9879        1027416  io.netty.handler.codec.http2.WeightedFairQueueByteDistributor$State
  26:         10295         988320  java.util.concurrent.ConcurrentHashMap (java.base@17.0.10)
  27:         29276         936832  java.util.ArrayList (java.base@17.0.10)
  28:          9645         925920  io.grpc.internal.MessageFramer
  29:          7232         925696  io.grpc.internal.ClientCallImpl
  30:          7204         864480  com.baidu.bifromq.basekv.server.QueryPipeline
  31:          4821         848496  io.grpc.netty.NettyClientTransport$3
  32:         20302         812080  java.util.concurrent.ConcurrentLinkedDeque$Node (java.base@17.0.10)

BifroMQ

  • Version: [3.2.1]
  • Deployment: [Cluster]

To Reproduce

Expected behavior

Logs

Configurations

OS(please complete the following information):

JVM:

  • Version: [ 17]

Performance Related

  • HOST:
    • Cluster node count: [3]
    • CPU: [2]
    • Memory: [4]
  • Network:
    • Bandwidth: []
    • Latency: []
  • Load:
    • PUB count: []
    • SUB count: []
    • PUB QPS per connection: []
    • Payload size: []

Additional context
Add any other context about the problem here.

@popduke
Copy link
Member

popduke commented Sep 18, 2024

请尝试用3.3.1版本复现

@masterOcean
Copy link

我们停掉压测后分析导出的 dump文件,发现有 6000多个 com.baidu.bifromq.sessiondict.rpc.proto.Session 对象,但这个是 gRpc 调用的参数,按理说应该在 gRpc 结束后会释放掉的呀。然后使用 eclips mat 分析了它的 gc root,怀疑与 gRpc 有关。
dump 文件如下:
链接:https://pan.baidu.com/s/1_A9ux4o4MXOSyzWO-wqMuA?pwd=vuq0
提取码:vuq0
session GC ROOT

@masterOcean
Copy link

masterOcean commented Sep 19, 2024

请尝试用3.3.1版本复现

限于公司项目要求,我们没法随时升级呢,而且这个复现时间还挺长的。

@popduke
Copy link
Member

popduke commented Sep 20, 2024

BifroMQ开源版本目前只对最新的minor版本提供支持,你可以把复现过程剥离,以代码或项目的方式提供出来。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants