
Optimize v2/groups request performance #5973

Merged: 8 commits merged into master from ad/optimize-fetch-groups on Feb 9, 2018

Conversation

@zen-dog (Contributor) commented on Feb 2, 2018:

Improves the performance of `v2/groups?embed=...` requests by introducing dedicated thread pools for fetching group and app info instead of using the global `ForkJoinPool`.
Additionally, it introduces dedicated execution contexts for the `GroupManagerModule`, `StorageModule`, and `SchedulerActions` components.
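
For context, here is a minimal sketch of what a helper like `NamedExecutionContext.fixedThreadPoolExecutionContext` (used throughout this PR) can look like: a fixed `Executors` pool behind a `ThreadFactory` that names threads `<namePrefix>-thread-<n>`. The daemon flag and counter details are assumptions, not necessarily Marathon's actual implementation:

```
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext

// Sketch only: a fixed-size pool whose threads are named
// `<namePrefix>-thread-<n>`, e.g. `app-info-module-thread-3`.
object NamedExecutionContext {
  def fixedThreadPoolExecutionContext(numThreads: Int, namePrefix: String): ExecutionContext = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$namePrefix-thread-${counter.incrementAndGet()}")
        t.setDaemon(true) // assumption: daemon threads, so the pool doesn't block JVM shutdown
        t
      }
    }
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(numThreads, factory))
  }
}
```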

Benchmark results on master:

Running 1m test @ http://localhost:8080/v2/groups?embed=...
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.17s   354.93ms   2.00s    70.33%
    Req/Sec     0.12      0.33     1.00     87.76%
  237 requests in 1.00m, 217.12MB read
  Socket errors: connect 0, read 0, write 0, timeout 146
Requests/sec:      3.95
Transfer/sec:      3.62MB

This branch:

Running 1m test @ http://localhost:8080/v2/groups?embed=...
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   249.39ms   62.67ms 580.40ms   65.82%
    Req/Sec     4.01      1.82    10.00     85.60%
  2403 requests in 1.00m, 2.15GB read
Requests/sec:     40.00
Transfer/sec:     36.64MB

Summary: ~10x throughput, ~5x lower latency, and ~5x lower latency standard deviation.
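
The numbers above are `wrk` output; the run shape is consistent with an invocation along these lines (the exact `embed` parameters are elided in the original):

```
wrk -c 10 -t 10 -d 1m "http://localhost:8080/v2/groups?embed=..."
```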

JIRA: COPS-2224

Aleksey Dukhovniy added 2 commits on February 2, 2018:
… and reduced the amount of work done on the futures.
… for `StorageModule`, `SchedulerActions` and `GroupManagerModule` components.

@zen-dog requested a review from @meln1k on February 2, 2018.
@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-1.

@mesosphere-ci left a comment:

✗ Build of #5973 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-2.

@mesosphere-ci left a comment:

✗ Build of #5973 failed.

See the logs and test results for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-3.

@mesosphere-ci left a comment:

✔ Build of #5973 completed successfully.

See details at jenkins-marathon-pipelines-PR-5973-3.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/marathon-1.6.0-pre-299-gd8338be.tgz",
"sha1": "327099a92157ef863287b936441bb18399e5b623"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 5973.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

@meln1k (Contributor) left a comment:

LGTM!

… to name pooled threads, e.g. `app-info-module-thread-3`
@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-4.

@mesosphere-ci left a comment:

✔ Build of #5973 completed successfully.

See details at jenkins-marathon-pipelines-PR-5973-4.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/marathon-1.6.0-pre-300-g658e102.tgz",
"sha1": "83869c6e353ee6242927b3ab4d51845de240daee"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 5973.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-5.

@mesosphere-ci left a comment:

✔ Build of #5973 completed successfully.

See details at jenkins-marathon-pipelines-PR-5973-5.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/marathon-1.6.0-pre-301-gf3b44f4.tgz",
"sha1": "a0c3a032e865bf308ed9b306ec4999a69b051c52"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 5973.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

@kensipe (Contributor) left a comment:

LGTM

preference for more readable naming of vars but nothing blocking

@@ -21,8 +21,11 @@ class AppInfoModule @Inject() (
healthCheckManager: HealthCheckManager,
marathonSchedulerService: MarathonSchedulerService,
taskFailureRepository: TaskFailureRepository) {

val ec = NamedExecutionContext.fixedThreadPoolExecutionContext(8, "app-info-module")
A contributor commented:

ec? are we trying to stay within a linter line length rule :)

could we make this more readable pls?

@zen-dog (author) replied:

There are a few Akka/Scala-specific and accepted "short names", e.g. `ec` for `ExecutionContext`, `mat` for `Materializer`, etc. It falls into the same category as the loop counter in `for (i = 0; ...)`.

A contributor commented:

Should this thread pool be ideally related to the number of threads available?

Should we make it configurable? (even if via an undocumented environment variable)?

It seems like we should ideally create the execution context in the module and share it with the pods status controller?

A contributor commented:

ec is fine for me for the reason @zen-dog pointed out.
I agree with @timcharper that the number of threads could as well be configurable.


private[this] val log = LoggerFactory.getLogger(getClass)
implicit val ec = NamedExecutionContext.fixedThreadPoolExecutionContext(8, "default-info-service")
A contributor commented:

ditto

@timcharper (Contributor) left a comment:

Nice work!

What do you think about those questions?

@@ -199,10 +200,11 @@ class CoreModuleImpl @Inject() (

// GROUP MANAGER

val groupManagerExecutionContext = NamedExecutionContext.fixedThreadPoolExecutionContext(8, "group-manager-module")
A contributor commented:

Why 8 threads? Did we measure that 8 is a meaningful number? I'm not trying to overcomplicate things, but it seems to me that making these configurable would allow us to also easily run different tests to verify that the number makes sense.

@zen-dog (author) replied on Feb 7, 2018:

8 seems like a low enough number that it shouldn't matter. More important is the fact that we separate the thread pools from each other.

* Returns an execution context backed by a fixed thread pool. All threads in the pool are prefixed with `namePrefix`
* e.g. `slow-io-pool-thread-1`.
*
* @param numThreads number of threads in the pol
A contributor commented:

nit: s/pol/pool

@jeschkies (Contributor) left a comment:

Easily the greatest change ever. I do have a question about the usage of the fixed worker thread pool, but it should not block this PR. Thanks a ton!


private[this] val log = LoggerFactory.getLogger(getClass)
implicit val ec = NamedExecutionContext.fixedThreadPoolExecutionContext(8, "default-info-service")
A contributor commented:

I wonder if this really has an impact if app info and the group manager have their own thread pools. I'd like to understand where we benefit from a fixed worker thread pool and where we should be using the fork join thread pool.

@zen-dog (author) replied:

I tested the scenario with a single shared thread pool, and it adds roughly 20% to response times. Not that much, but I think dividing them still makes sense - Dīvide et Imperā!
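
To illustrate the isolation argument, here is a self-contained sketch (plain `Executors` pools standing in for the dedicated contexts in this PR): saturating one pool with blocking work leaves the other fully responsive, whereas a single shared pool would queue everything behind the blockers.

```
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolIsolationDemo extends App {
  // Two independent fixed pools, analogous to the per-component contexts here.
  val groupManagerEc = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
  val appInfoEc      = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  // Saturate one pool with blocking calls...
  (1 to 8).foreach(_ => Future(Thread.sleep(10000))(groupManagerEc))

  // ...the other pool still answers immediately.
  println(Await.result(Future("still responsive")(appInfoEc), 1.second))
}
```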

@zen-dog (Contributor, Author) commented on Feb 7, 2018:

> Should this thread pool be ideally related to the number of threads available?
> Should we make it configurable? (even if via an undocumented environment variable)?

I don't think there is much value in setting it to `availableProcessors` or the like. We already have Chaos creating a thread for each connection, Akka dispatcher pools, and a bunch of other threads. And back when we used to block everywhere, our thread count would at times grow beyond 1k threads.
But I do agree that it makes sense to make it configurable.

> It seems like we should ideally create the execution context in the module and share it with the pods status controller?

A shared execution context showed ~20% higher response times (for whatever reason).

@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-6.

@mesosphere-ci left a comment:

✔ Build of #5973 completed successfully.

See details at jenkins-marathon-pipelines-PR-5973-6.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/marathon-1.6.0-pre-311-ge500dcc.tgz",
"sha1": "84c29052bc915820edf9ce47b83b901606253078"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 5973.
The job then reports back to this PR.

\\ ٩( ᐛ )و //


trait AppInfoConfig extends ScallopConf {

lazy val appInfoModuleExecutionContextSize = opt[Int](
A contributor commented:

Let's base the default on Runtime.getRuntime().availableProcessors() * 2? (or... just the number of availableProcessors?)
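
A hedged sketch of how the Scallop option above might look with that default; the flag name, description, and `noshort` choice are assumptions, not the actual Marathon config:

```
import org.rogach.scallop.ScallopConf

trait AppInfoConfig extends ScallopConf {
  lazy val appInfoModuleExecutionContextSize = opt[Int](
    "app_info_module_executor_thread_count", // hypothetical flag name
    descr = "Number of threads in the AppInfoModule's fixed thread pool.",
    default = Some(Runtime.getRuntime.availableProcessors() * 2),
    noshort = true
  )
}
```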

@mesosphere-ci left a comment:

I'm building your change at jenkins-marathon-pipelines-PR-5973-7.

@mesosphere-ci left a comment:

✔ Build of #5973 completed successfully.

See details at jenkins-marathon-pipelines-PR-5973-7.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/marathon-1.6.0-pre-313-ga4ee6d2.tgz",
"sha1": "5127a89c0a07c72c70b6c15fd0192ead1f749abf"

You can run system integration test changes of this PR against Marathon
master by triggering this Jenkins job with the Pull_Request_id 5973.
The job then reports back to this PR.

\\ ٩( ᐛ )و //

@zen-dog merged commit 9c0ad47 into master on Feb 9, 2018.
zen-dog pushed a commit that referenced this pull request Feb 9, 2018
Backport of the original PR #5973

JIRA: COPS-2224
@zen-dog deleted the ad/optimize-fetch-groups branch on April 7, 2018.
jeschkies added a commit that referenced this pull request Feb 8, 2019
Summary:
This replaces the Akka stream with a simple `Future.sequence` and uses
the same `AppInfoBaseData` instance for each pod. This way we utilize the
cached instances and are not limited to eight parallel threads. To
avoid overloading Marathon we use the fixed-size thread pool from #5973.

However, even with #5973 we might allocate a lot of futures and
runnables.

JIRA issues: MARATHON-8563
jeschkies added a commit that referenced this pull request Feb 12, 2019
Summary:
This replaces the Akka stream with a simple `Future.sequence` and uses
the same `AppInfoBaseData` instance for each pod. This way we utilize the
cached instances and are not limited to eight parallel threads. To
avoid overloading Marathon we use the fixed-size thread pool from #5973.

However, even with #5973 we might allocate a lot of futures and
runnables.

For 100 pods with 1 instance each I got the following:

## Master
```
› wrk -c 10 -t 10 -d 2m http://localhost:8080/v2/pods/::status
Running 2m test @ http://localhost:8080/v2/pods/::status
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   136.52ms   44.93ms 859.05ms   93.64%
    Req/Sec     7.71      2.49    20.00     69.81%
  8918 requests in 2.00m, 1.34GB read
Requests/sec:     74.26
Transfer/sec:     11.45MB
```

## This change
```
Running 2m test @ http://localhost:8080/v2/pods/::status
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    62.27ms   23.24ms 220.39ms   70.17%
    Req/Sec    15.94      6.09    50.00     69.12%
  19345 requests in 2.00m, 2.91GB read
Requests/sec:    161.10
Transfer/sec:     24.85MB 
```

JIRA issues: MARATHON-8563
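
A rough sketch of the `Future.sequence` pattern this commit describes, with a generic lookup standing in for `AppInfoBaseData` (the names here are illustrative, not the actual Marathon API):

```
import scala.concurrent.{ExecutionContext, Future}

// Gather every pod's status in parallel on the dedicated pool,
// reusing one lookup instance instead of an Akka stream stage.
def podStatuses[Pod, Status](pods: Seq[Pod])(status: Pod => Future[Status])(
    implicit ec: ExecutionContext): Future[Seq[Status]] =
  Future.sequence(pods.map(status))
```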