
Memory leak with Doobie + Task #934

Open

mdedetrich opened this issue Jul 2, 2019 · 18 comments

mdedetrich commented Jul 2, 2019

Note that this may be related to #913

I finally managed to diagnose a memory leak that is happening with Doobie + monix.eval.Task/cats.effect.IO. Note that this only occurs with Doobie.

The repo reproducing the leak can be found here https://github.com/mdedetrich/task-doobie-memory-leak

guymers commented Jul 7, 2019

I ran both example projects for half an hour and both of them seem to have a slowly increasing old gen. Maybe the reason it is visible in the graph is that Doobie GCs 5x more often?

Quill

At GC  9 - concurrent mark-sweep generation used 5186K to 6155K
At GC 57 - concurrent mark-sweep generation used 8691K to 8802K

Doobie

At GC   8 - concurrent mark-sweep generation used 18217K to 19217K
At GC  57 - concurrent mark-sweep generation used 20245K to 20245K
At GC 281 - concurrent mark-sweep generation used 20457K to 20686K

Quill (screenshots: quill_mem, quill_mem_2)

Doobie (screenshots: doobie_mem, doobie_mem_2)

@mdedetrich (Author)

Hmm, I can't really replicate your results. What OS are you running on? I will also try running the tests for a longer period of time.

guymers commented Jul 8, 2019

I am running Linux 5.1.15 and using GraalVM:

$ java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-20190523183340.buildslave.jdk8u-src-tar--b03)
OpenJDK 64-Bit GraalVM CE 19.1.0 (build 25.212-b03-jvmci-20-b04, mixed mode)

mdedetrich commented Jul 8, 2019

Hmm, I am not using GraalVM; I was using OpenJDK 11 (also on Linux). When I get home I will run taskQuill for more than a few hours to verify.

guymers commented Jul 8, 2019

This is what I get running Java 11 with these JVM options: -Xmx128m, -XX:+PrintGCDetails, -XX:+HeapDumpBeforeFullGC

$ java -version
openjdk version "11.0.3" 2019-04-16 LTS
OpenJDK Runtime Environment Zulu11.31+11-CA (build 11.0.3+7-LTS)
OpenJDK 64-Bit Server VM Zulu11.31+11-CA (build 11.0.3+7-LTS, mixed mode)
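
For reference, here is a minimal sketch of how those flags could be passed when running the example projects through sbt; the fork/javaOptions settings are an assumption about the build, not taken from the repro repo:

// build.sbt (sketch only; javaOptions apply only to a forked JVM)
fork in run := true
javaOptions in run ++= Seq(
  "-Xmx128m",                   // cap the heap so the behaviour surfaces quickly
  "-XX:+PrintGCDetails",        // log details of each GC cycle
  "-XX:+HeapDumpBeforeFullGC"   // dump the heap before every full GC for inspection
)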

Quill (screenshots: 2019-07-08-194918_1515x569_scrot, 2019-07-08-194929_1516x721_scrot)

Doobie (screenshots: 2019-07-08-194937_1515x569_scrot, 2019-07-08-194947_1514x715_scrot)

mdedetrich commented Jul 9, 2019

Okay, so I verified a couple of things. Yes, it's true that the Quill version has memory increasing slowly over time, but that is just the OldGen, and it actually gets cleaned up once you approach the current heap size; whereas the Doobie version continuously increases over time (while bumping the current heap), which is what we are experiencing in production.

I just modified the repository to increase the number of transactions being made (so it's quicker to reproduce) and I also lowered the initial heap size to 64M so we don't have to wait so long. These are the results:

For Quill, as you can see, the memory does increase over time, but once you approach the initial heap size the OldGen GC sweep kicks in and releases the memory.

(screenshot: quill)

On the other hand, when we run the Doobie version, once it approaches the initial heap size the heap just keeps growing, which indicates a leak.

(screenshot: doobie)

mdedetrich commented Jul 9, 2019

For further evidence, I also ran the tests with an Xmx of 128 MB, which is the maximum heap size the JVM is allowed to use. When using the Doobie version, it eventually runs out of memory and you get errors like the following:

[info] 13:12:04.192 [INFO ] ProxyLeakTask - Previously reported leaked connection org.postgresql.jdbc.PgConnection@3e1be64 on thread pool-1-thread-13 was returned to the pool (unleaked)
[info] 13:12:04.764 [INFO ] Main$ - Transaction complete!
[info] 13:12:04.764 [INFO ] Main$ - Transaction complete!
[info] 13:12:04.763 [INFO ] Main$ - Transaction complete!
[info] 13:12:09.575 [WARN ] ProxyLeakTask - Connection leak detection triggered for org.postgresql.jdbc.PgConnection@3e1be64 on thread pool-1-thread-13, stack trace follows
[info] java.lang.Exception: Apparent connection leak detected
[info] 	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:100)
[info] 	at doobie.util.transactor$Transactor$fromDataSource$FromDataSourceUnapplied.$anonfun$apply$10(transactor.scala:258)
[info] 	at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
[info] 	at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
[info] 	at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:86)
[info] 	at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
[info] 	at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
[info] 	at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
[info] 	at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
[info] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info] 	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
[info] 	at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
[info] 	at cats.effect.internals.Trampoline.execute(Trampoline.scala:43)
[info] 	at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
[info] 	at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:72)
[info] 	at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:52)
[info] 	at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:136)
[info] 	at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
[info] 	at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
[info] 	at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
[info] 	at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
[info] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] 	at java.lang.Thread.run(Thread.java:748)
[info] 13:12:09.575 [WARN ] ProxyLeakTask - Connection leak detection triggered for org.postgresql.jdbc.PgConnection@42b65b3c on thread pool-1-thread-16, stack trace follows
[info] java.lang.Exception: Apparent connection leak detected
[info] 	... (stack trace identical to the first leak warning above)
[info] 13:12:09.575 [WARN ] ProxyLeakTask - Connection leak detection triggered for org.postgresql.jdbc.PgConnection@33bec223 on thread pool-1-thread-31, stack trace follows
[info] java.lang.Exception: Apparent connection leak detected
[info] 	... (stack trace identical to the first leak warning above)

The associated graph is here:

(screenshot: doobie2)

As you can see, memory keeps being allocated until it hits the max heap size, at which point you start getting memory-related errors.

The same happens with Quill:

(screenshot: quill2)

We do end up getting close to the Xmx limit, but, as expected, the JVM then frees a lot of memory.

guymers commented Jul 9, 2019

Okay, I get the same results with the changes.

It's because the connect executor has an unbounded queue and the number of database operations produced exceeds the number that can be processed. I seem to be able to process ~3,900 operations a second, but there are 5*500*3 = 7500 operations being produced. So you end up with this:

(screenshot: 2019-07-09-205713_1527x385_scrot)
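
For context, this is roughly what an unbounded connect executor looks like; a minimal sketch under the assumption that the repro project uses a fixed pool along these lines, not code taken from it:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Executors.newFixedThreadPool is backed by an unbounded LinkedBlockingQueue,
// so every submitted operation is queued and nothing is ever rejected.
// If operations arrive faster than the 32 threads can drain them, the queue
// (and therefore the heap) keeps growing.
val connectEC: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))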

In production I run with a bounded queue

import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}
import scala.concurrent.ExecutionContext

val poolSize = ???
val queueSize = ???
val executorService = new ThreadPoolExecutor(
  poolSize, poolSize,                           // fixed number of threads
  0L, TimeUnit.MILLISECONDS,                    // no keep-alive for extra threads
  new LinkedBlockingQueue[Runnable](queueSize)  // bounded queue
)
val connectEC = ExecutionContext.fromExecutor(executorService)

and handle the resulting java.util.concurrent.RejectedExecutionException (a sketch of one way to do that is below).
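
For illustration, here is a minimal sketch of one way such a rejection could be handled at the call site; withBackpressure, runQuery and the fallback behaviour are hypothetical, not code from the repro project:

import java.util.concurrent.RejectedExecutionException
import monix.eval.Task

// runQuery stands in for whatever query.transact(xa) call the application makes.
// When the bounded queue is full, the submission fails and the Task completes
// with a RejectedExecutionException, which we intercept here.
def withBackpressure[A](runQuery: Task[A], fallback: A): Task[A] =
  runQuery.onErrorRecoverWith {
    case _: RejectedExecutionException =>
      // the queue is full: back off, retry, or fall back -- here we just log
      Task { println("transaction rejected: executor queue is full"); fallback }
  }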

@mdedetrich (Author)

Okay, so the question is: we are having this leak in production and we have nowhere near that number of transactions per second (we typically get 200 transactions per second max; let's say 500 max to be generous). I am going to tweak the connect executor to see if I can reproduce it, but I think it's something more.

Furthermore, for whatever reason, Quill doesn't have this problem at all. It can handle this load and it also doesn't slow down.

guymers commented Jul 9, 2019

When I run the Quill example I get a java.lang.OutOfMemoryError: unable to create new native thread after 30 seconds.

mdedetrich commented Jul 9, 2019

Hmm, I don't get this; I even posted the graph, which was from a 20-minute run. Let me run it once more.

guymers commented Jul 9, 2019

Strange. There are over 11,000 monix-io threads listed in VisualVM when it crashes.

mdedetrich commented Jul 9, 2019

(screenshot: quill3)

Latest run; I am also running the suite on a friend's laptop and it's working fine.

In any case, this is somewhat concerning, because what you appear to be saying is that Doobie can't process transactions as fast as they are coming in (I am talking about our production server where we are experiencing the problem, not this reproduction repo, which deliberately runs a lot of transactions to speed up the leak). Using a bounded queue isn't really an option because we would then be dropping transactions.

mdedetrich commented Jul 9, 2019

Okay, so I ran taskDoobie/run with a bounded queue, i.e.

  val transactEC: ExecutionContext = ExecutionContext.fromExecutor(
    new ThreadPoolExecutor(
      100,
      Integer.MAX_VALUE,
      60,
      TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable](1000)
    )
  )

And I am still getting the problem.

(screenshot: doobie3)

Also, here is a screenshot of taskQuill/run on my friend's laptop:

(screenshot: Screenshot from 2019-07-09 14-05-36)

Not sure what is happening with Quill; maybe it's specific to OpenJDK? In any case, there are still problems with Doobie even when using a bounded queue (or should I use more appropriate settings?).

EDIT: I forgot the second parameter; let me re-run.

mdedetrich commented Jul 9, 2019

Okay, so I reran with the correct parameters, i.e.

  val transactEC: ExecutionContext = ExecutionContext.fromExecutor(
    new ThreadPoolExecutor(
      100,
      100,
      60,
      TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable](1000)
    )
  )

And still the same result:

(screenshot: doobie4)

guymers commented Jul 10, 2019

Yeah, I have no idea. If I run with

import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

val poolSize = 32
val executorService = new ThreadPoolExecutor(
  poolSize, poolSize,
  0L, TimeUnit.MILLISECONDS,
  new LinkedBlockingQueue[Runnable](1000)  // bounded queue of 1000
)

I get this:

(screenshots: 2019-07-10-212415_1732x1190_scrot, 2019-07-10-212549_1732x1190_scrot)

I have removed references to IO, if that makes a difference (and commented out the logging):

val xa = HikariTransactor[Task](hikariDatasource, connectEC, Scheduler.io("transact"))

.toTask -> .transact(xa)
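
For illustration, a minimal sketch of what running a query directly against the Task-based transactor looks like; the SELECT 1 query is a stand-in, not code from the repro project:

import doobie.implicits._
import monix.eval.Task

// With xa: HikariTransactor[Task] as above, transact yields a monix Task
// directly, with no intermediate cats-effect IO and no .toTask conversion.
val check: Task[Int] = sql"SELECT 1".query[Int].unique.transact(xa)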

@mdedetrich (Author)

Okay, thanks, I will try to replicate this tomorrow. Maybe it's the timeout of 60 seconds that is causing the queue to fill.

This still begs the question though: having an unbounded queue seems like a bandaid to me, because essentially it is saying that Doobie can't process transactions as fast as it's receiving them. This is understandable for the repo, which has the isolated leak, but in production we honestly do not have that much load, and if I understand correctly I definitely do not want to be dropping transactions on a bounded queue because it's filling up (and I am also curious as to why Quill is not exhibiting this behavior, at least when you slow down the transaction frequency in the reproduction repo).

mdedetrich commented Aug 13, 2019

Sorry for taking a while, but I tried replicating your results.

The good: I can sometimes replicate your results, and interestingly it doesn't leak in those runs.
The bad: When I can't replicate your results, it's still the same original behavior I got before, where there is a leak (which mirrors what happens in prod).
The ugly: It's completely nondeterministic. Even worse, I set up a custom RejectedExecutionHandler to see if I get rejected tasks (which is what I expect when the queue is full); however, it appears to never reject tasks, which is probably why the leak is still happening in the cases where it does. This means that, for whatever reason, Doobie is not rejecting transactions when it should be. (See the sketch after the settings below for when a ThreadPoolExecutor actually rejects work.)

I am attaching screenshots of the past three runs I did:
(screenshots: 2019-08-13-095620_3840x1080_scrot, 2019-08-13-095607_3840x1080_scrot, 2019-08-13-093340_3840x1080_scrot)

The second run is the ideal one; however, it doesn't happen deterministically. These are the settings I am using for the transact executor:

  import java.util.concurrent.{Executors, LinkedBlockingQueue, RejectedExecutionHandler, ThreadPoolExecutor, TimeUnit}
  import scala.concurrent.ExecutionContext

  // logger is the application's logger, defined elsewhere in the repro project
  class MyRejectedExecutionHandler extends RejectedExecutionHandler {
    override def rejectedExecution(worker: Runnable, executor: ThreadPoolExecutor): Unit = {
      logger.error(s"Worker ${worker.toString} is rejected")
    }
  }

  val connectEC: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))

  val executor = new ThreadPoolExecutor(
    100,                                      // corePoolSize
    100,                                      // maximumPoolSize
    60,                                       // keepAliveTime
    TimeUnit.SECONDS,
    new LinkedBlockingQueue[Runnable](1000),  // bounded queue
    new MyRejectedExecutionHandler
  )
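
As a side note on why the handler may never fire: a ThreadPoolExecutor only rejects work once its queue is full and all of its threads (up to maximumPoolSize) are busy. A minimal, self-contained sketch of that behaviour, not code from the repro project:

import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object RejectionDemo extends App {
  val pool = new ThreadPoolExecutor(
    1, 1,                                 // a single worker thread
    0L, TimeUnit.MILLISECONDS,
    new LinkedBlockingQueue[Runnable](1), // queue capacity of one
    (r: Runnable, _: ThreadPoolExecutor) => println(s"rejected: $r")
  )

  // Task 1 occupies the worker, task 2 fills the queue, task 3 is rejected
  // and handed to the handler above instead of being run.
  (1 to 3).foreach { i =>
    pool.execute(() => { Thread.sleep(1000); println(s"ran $i") })
  }
  pool.shutdown()
}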
