
feat(router): support for isolation modes using limiters #3379

Merged: 3 commits merged into master from feat.routerIsolation on Jun 8, 2023

Conversation

@atzoum (Contributor) commented on May 23, 2023

Description

Introducing isolation strategies in the router, following the same pattern that we are already using in the processor & batchrouter.

[Diagram: Router Isolation (drawio)]

The router is unique in that it now uses not one but two different types of workers (see the sketch below):

  1. partition worker: multiple partition workers are created by the worker pool, according to the configured isolation strategy.
  2. (internal) worker: each partition worker has a pool of (internal) workers which ultimately process the jobs and deliver them to their destination. By default, 64 internal workers are spawned for every partition worker.
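
To make the two-tier layout concrete, here is a minimal, self-contained Go sketch of the idea. The type names (partitionWorker, newPartitionWorker) and the channel-based hand-off are illustrative assumptions, not the PR's actual code:

```go
// Illustrative sketch only — not the router's real implementation.
package main

import (
	"fmt"
	"sync"
)

// partitionWorker owns one partition (a workspace or destination, depending
// on the isolation mode) and fans its jobs out to a pool of internal workers.
type partitionWorker struct {
	partition string
	jobs      chan string
	wg        sync.WaitGroup
}

func newPartitionWorker(partition string, internalWorkers int) *partitionWorker {
	pw := &partitionWorker{partition: partition, jobs: make(chan string)}
	for i := 0; i < internalWorkers; i++ {
		pw.wg.Add(1)
		go func(id int) { // one internal worker
			defer pw.wg.Done()
			for job := range pw.jobs {
				// deliver the job to its destination
				fmt.Printf("partition=%s worker=%d delivered %s\n", pw.partition, id, job)
			}
		}(i)
	}
	return pw
}

// Stop drains the job channel and waits for all internal workers to exit.
func (pw *partitionWorker) Stop() {
	close(pw.jobs)
	pw.wg.Wait()
}

func main() {
	pw := newPartitionWorker("workspace-1", 64) // 64 internal workers per partition, as in the PR's default
	pw.jobs <- "job-1"
	pw.Stop()
}
```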

Isolation modes

  1. none - a single partition worker picks up and processes jobs for all workspaces & destinations.
  2. workspace - a separate partition worker per active workspace (the default mode in multi-tenant deployments).
  3. destination - a separate partition worker per active destination (the default mode in non-multi-tenant deployments).
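
As an illustration of how jobs could be bucketed under each mode, here is a small hedged sketch. The mode strings mirror the list above (and the Router.isolationMode key seen in the diff below), but the partitionKey function itself is a hypothetical stand-in, not the router's actual API:

```go
package main

import "fmt"

// partitionKey derives the key a job is routed by under each isolation mode.
func partitionKey(mode, workspaceID, destinationID string) string {
	switch mode {
	case "workspace":
		return workspaceID // one partition worker per active workspace
	case "destination":
		return destinationID // one partition worker per active destination
	default: // "none"
		return "" // every job shares a single partition worker
	}
}

func main() {
	fmt.Println(partitionKey("workspace", "ws-1", "dest-1")) // ws-1
}
```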

Limiters

  1. pickup - for picking up jobs from jobsdb and assigning them to workers (limit: 50 goroutines/destType)
  2. transform - for controlling concurrency while transforming jobs (limit: 100 goroutines/destType)
  3. batch - for controlling concurrency while batching jobs (limit: 100 goroutines/destType)
  4. process - for controlling concurrency while processing jobs within workers (limit: 100 goroutines/destType)
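
The real limiters come from rudder-go-kit; the following standalone sketch only illustrates the semantics of such a cap (N concurrent goroutines per destination type) using a buffered-channel semaphore. The limiter type and its Do method are hypothetical stand-ins:

```go
// Minimal sketch of the limiter idea — illustrative only.
package main

import "sync"

type limiter chan struct{}

func newLimiter(limit int) limiter { return make(limiter, limit) }

// Do runs f once a slot is free, mirroring the pickup/transform/batch/process caps.
func (l limiter) Do(f func()) {
	l <- struct{}{}        // acquire a slot (blocks when the limiter is saturated)
	defer func() { <-l }() // release the slot
	f()
}

func main() {
	transform := newLimiter(100) // e.g. 100 goroutines per destType for transform
	var wg sync.WaitGroup
	for i := 0; i < 500; i++ {
		wg.Add(1)
		go transform.Do(func() {
			defer wg.Done()
			// transform a batch of jobs here
		})
	}
	wg.Wait()
}
```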

Dynamic limiter priorities

Router uses dynamic priorities for executing tasks inside its limiters. Each partition keeps track of its throughput and error rate for every limiter-enabled operation, using partition.Stats. These statistics are used to calculate a performance score from 0 to 100, which in turn is mapped to a limiter priority from 1 to 4. This priority matters most when the limiter is congested (goroutines requesting more slots than the limiter's total cap). The limiter itself avoids starvation of lower-priority jobs by periodically incrementing their priority (by default every 1 sec, see WithLimiterDynamicPeriod).
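
A hedged sketch of the score-to-priority mapping described above. The PR specifies a 0-100 score mapped onto priorities 1-4; the band boundaries below are assumed for illustration only:

```go
package main

import "fmt"

// priorityFromScore maps a partition's 0-100 performance score to a 1-4
// limiter priority. The thresholds are illustrative, not the PR's.
func priorityFromScore(score int) int {
	switch {
	case score >= 75:
		return 4 // healthy partition: highest priority
	case score >= 50:
		return 3
	case score >= 25:
		return 2
	default:
		return 1 // degraded partition: lowest priority (periodically bumped to avoid starvation)
	}
}

func main() {
	fmt.Println(priorityFromScore(90)) // 4
}
```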

Future cleanup work

The existing fair pickup algorithm has been retained and will be used only when JobsDB.fairPickup is enabled and the isolation mode is set to none, so that we can revert if an unexpected issue occurs during the rollout. The final plan, however, is to remove MultiTenantJobsDB (unionQuery.go) and MultiTenantI (tenantstats.go) in the next release, after router isolation has been released, tested and proven with production workloads.

Performance

Unsurprisingly, performance remains on par with (and at times better than) the previous solution for up to 200 partitions. Beyond that threshold, throughput starts deteriorating.

The relevant benchmark can be found inside router_isolation_test.go and a visualisation of one execution of this benchmark is available here.

Other notable changes

  • Reorganised the package's code into separate, logically grouped files to make it easier to read and follow.
  • Removed global variables and the Init function from the package.
  • AdaptivePayloadLimiter.enabled is now true by default.
  • Setting a soft memory limit (aka GOMEMLIMIT) to 80% of the container's available memory.
  • Using bytesize, queue, mem, profiler and sync packages from rudder-go-kit.
  • Capturing pipeline delay statistics through pipeline_delay_min & pipeline_delay_max.

Notion Ticket

Link

Security

  • The code changed/added as part of this pull request won't create any security issues with how the software is being used.

codecov bot commented on May 23, 2023

Codecov Report

Patch coverage: 81.59% and project coverage change: -0.03 ⚠️

Comparison is base (9e32802) 68.47% compared to head (8c0f9d7) 68.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3379      +/-   ##
==========================================
- Coverage   68.47%   68.44%   -0.03%     
==========================================
  Files         330      327       -3     
  Lines       53042    52801     -241     
==========================================
- Hits        36319    36140     -179     
+ Misses      14358    14312      -46     
+ Partials     2365     2349      -16     
Impacted Files Coverage Δ
app/apphandlers/processorAppHandler.go 9.20% <0.00%> (-0.04%) ⬇️
jobsdb/unionQuery.go 84.94% <ø> (-3.09%) ⬇️
processor/stash/stash.go 52.88% <ø> (ø)
...hema-forwarder/internal/forwarder/baseforwarder.go 81.57% <ø> (ø)
...hema-forwarder/internal/forwarder/jobsforwarder.go 67.32% <ø> (ø)
services/alert/pagerduty.go 0.00% <0.00%> (ø)
services/alert/victorops.go 0.00% <0.00%> (ø)
services/alerta/client.go 82.97% <ø> (ø)
services/multitenant/tenantstats.go 92.83% <ø> (ø)
warehouse/archive/archiver.go 65.86% <0.00%> (ø)
... and 38 more

... and 5 files with indirect coverage changes


@atzoum atzoum changed the title [WIP] feat(router): isolation modes with limiters [WIP] feat(router): support for isolation modes using limiters May 24, 2023
@atzoum atzoum force-pushed the feat.routerIsolation branch 11 times, most recently from 0bc1a2c to b552588 Compare May 26, 2023 08:05
}

// TODO: delete this once we remove the old fair pickup algorithm and move MultiTenantLegacy#GetAllJobs inside JobsDB
[note] added some todos for cleaning up in the next release

Comment on lines +24 to +28
if memStat, err := mem.Get(); err == nil {
	// GOMEMLIMIT: cap the Go runtime's heap at 80% of the memory available to the container
	memoryLimit := int64(80 * memStat.Total / 100)
	logger.NewLogger().Infow("Setting memory limit to", "limit", memoryLimit)
	debug.SetMemoryLimit(memoryLimit)
}
[note] setting a soft memory limit (aka GOMEMLIMIT) to 80% of the container's available memory.

Comment on lines +102 to +107
go func() {
	// stop each worker in its own goroutine so shutdown happens in parallel
	wstart := time.Now()
	w.Stop()
	wg.Done()
	wp.logger.Debugf("worker %s stopped in %s", w.partition, time.Since(wstart))
}()
[note] stopping workers in parallel

@@ -1289,7 +1268,7 @@ func Unique(stringSlice []string) []string {
}

func UseFairPickup() bool {
-	return config.GetBool("JobsDB.fairPickup", false) || config.GetBool("EnableMultitenancy", false)
+	return config.GetBool("JobsDB.fairPickup", false) && config.GetString("Router.isolationMode", "default") == "none"
@atzoum commented on May 26, 2023
[note] fair pickup will be used only if it is enabled and isolation mode is none

status *jobsdb.JobStatusT
}

type reloadableConfig struct {
[note] since we will be starting new routers after switching from degraded to normal mode, all reloadable configuration is moved to a separate struct, so that there is minimal leakage! I promise to prepare a solution for non-leaking reloadable configuration in rudder-go-kit in the future!
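
For illustration, here is a minimal sketch of the pattern this note describes: all hot-reloadable values live in one struct, so a stopped router instance leaks at most that struct rather than a set of package-level subscriptions. The field names and the atomic-based reload mechanism are assumptions, not rudder-server's actual config API:

```go
package main

import (
	"sync/atomic"
	"time"
)

// reloadableConfig gathers every hot-reloadable setting in one place.
type reloadableConfig struct {
	jobsBatchTimeout   atomic.Int64 // stored as nanoseconds
	noOfJobsPerChannel atomic.Int64
}

func (c *reloadableConfig) JobsBatchTimeout() time.Duration {
	return time.Duration(c.jobsBatchTimeout.Load())
}

// reload would be wired to the config system's change notifications.
func (c *reloadableConfig) reload(timeout time.Duration, jobsPerChannel int64) {
	c.jobsBatchTimeout.Store(int64(timeout))
	c.noOfJobsPerChannel.Store(jobsPerChannel)
}

func main() {
	var c reloadableConfig
	c.reload(10*time.Second, 100)
	_ = c.JobsBatchTimeout()
}
```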

@atzoum atzoum force-pushed the feat.routerIsolation branch 3 times, most recently from b6f698d to 5131cd0 Compare May 29, 2023 07:40
@atzoum atzoum changed the title [WIP] feat(router): support for isolation modes using limiters feat(router): support for isolation modes using limiters May 29, 2023
@atzoum atzoum requested a review from cisse21 May 29, 2023 15:13
@atzoum atzoum force-pushed the feat.routerIsolation branch 13 times, most recently from db7358c to 9a87e48 Compare June 7, 2023 05:59
@@ -22,7 +22,7 @@ import (
"github.com/ory/dockertest/v3"
"github.com/stretchr/testify/require"

-	kitHelper "github.com/rudderlabs/rudder-go-kit/testhelper"
+	kit_helper "github.com/rudderlabs/rudder-go-kit/testhelper"
Member: Aren't we adhering to the outcome of the poll :P

Author: enjoy!

Member: We voted for the same thing but lost :'(

@@ -289,7 +249,7 @@ func (w *worker) WorkerProcess() {
}

case <-timeout:
-	timeout = time.After(jobsBatchTimeout)
+	timeout = time.After(w.rt.reloadableConfig.jobsBatchTimeout)
Member: Why do we have this both here and at line 60?

Author: further refactoring is indeed required, but I'll refrain from doing it now as the changes are already substantial :)

Resolved review threads (outdated): router/worker.go, router/batchrouter/handle_observability.go
@atzoum atzoum merged commit fbe109f into master Jun 8, 2023
@atzoum atzoum deleted the feat.routerIsolation branch June 8, 2023 10:07