improve processor stability #473

frairon · 2025-02-25T06:52:30Z

What this PR tries to improve

Stability of processors in the face of a restarting or unstable Kafka cluster.

Background

Our Kafka cluster restarts or rebalances from time to time, which makes all processors restart. Since the processors will rebalance, this PR uses reconnecting views for the join/lookup tables.

frairon · 2025-02-25T06:56:14Z

partition_processor.go

@@ -257,7 +257,7 @@ func (pp *PartitionProcessor) Start(setupCtx, ctx context.Context) error {
 		join := join
 		pp.runnerGroup.Go(func() error {
 			defer pp.state.SetState(PPStateStopping)
-			return join.CatchupForever(runnerCtx, false)
+			return join.CatchupForever(runnerCtx, true)


This makes the join table try to reconnect while the processor is running (together with the other table a few lines below).

frairon · 2025-02-25T06:57:11Z

partition_processor.go

 	if pp.cancelRunnerGroup != nil {
 		pp.cancelRunnerGroup()
 	}

 	// wait for the runner to be done
 	runningErrs := multierror.Append(pp.runnerGroup.Wait().ErrorOrNil())

+	close(pp.input)


The channels are now closed only after the runner group is done. To make this safe, visitors attach to the runner group.
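The ordering matters because sending on a closed channel panics in Go. A minimal sketch of the pattern, assuming a `sync.WaitGroup` in place of the PR's runner group; the `worker` type and its methods are hypothetical and only mirror the idea:

```go
package main

import "sync"

// worker is a hypothetical stand-in for a partition processor.
type worker struct {
	input   chan int
	stop    chan struct{}
	runners sync.WaitGroup
}

func newWorker() *worker {
	return &worker{input: make(chan int, 1), stop: make(chan struct{})}
}

// visit starts a writer goroutine and registers it with the runner group,
// so that shutdown can wait for it before closing the input channel.
func (w *worker) visit(values []int) {
	w.runners.Add(1)
	go func() {
		defer w.runners.Done()
		for _, v := range values {
			select {
			case w.input <- v:
			case <-w.stop:
				return
			}
		}
	}()
}

// shutdown signals all writers to stop, waits for them, and only then closes
// the channel. Closing earlier could make a writer panic on send.
func (w *worker) shutdown() {
	close(w.stop)
	w.runners.Wait()
	close(w.input)
}
```

Because every writer is counted in `runners`, `close(w.input)` cannot race with a pending send.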

frairon · 2025-02-25T06:59:43Z

partition_processor.go

@@ -637,15 +637,6 @@ func (pp *PartitionProcessor) VisitValues(ctx context.Context, name string, meta

 	var wg sync.WaitGroup

-	// drains the channel and drops out when closed.


There was actually no point in distinguishing between draining until the channel is closed and draining until it is empty, because this function is the one writing to the channel.
If two visitors are started at the same time and one of them panics or is stopped, it will drain the other's messages too. That issue existed before this change, so we'll ignore it here :)
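For reference, draining "until empty" can be written as a non-blocking receive loop. This is a generic sketch, not the code from `VisitValues`; `drainUntilEmpty` is a hypothetical helper name.

```go
package main

// drainUntilEmpty discards everything currently buffered in ch and returns
// the number of discarded values. It never blocks: it returns as soon as the
// channel is empty or closed.
func drainUntilEmpty(ch <-chan int) int {
	n := 0
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				// Channel is closed and fully drained.
				return n
			}
			n++
		default:
			// Nothing buffered right now.
			return n
		}
	}
}
```

The loop handles both cases, an empty channel and a closed one, so a writer that owns the channel does not need a separate "drain until close" variant.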

frairon · 2025-02-25T07:00:00Z

processor.go

@@ -421,11 +421,11 @@ func (g *Processor) handleSessionErrors(ctx, sessionCtx context.Context, session
 			)

 			if errors.As(err, &errProc) {
-				g.log.Debugf("error processing message (non-transient), shutting down processor: %v", err)
+				g.log.Printf("error processing message (non-transient), shutting down processor: %v", err)


Let's not log these important errors at debug level.

frairon · 2025-02-25T07:00:57Z

topic_manager.go

@@ -89,7 +89,7 @@ func checkBroker(broker Broker, config *sarama.Config) error {
 	}

 	err := broker.Open(config)
-	if err != nil {
+	if err != nil && !errors.Is(err, sarama.ErrAlreadyConnected) {


According to the docs, Open may return this error if the broker is already connected, which is not a failure.

frairon · 2025-02-25T07:00:58Z

systemtest/proc_disconnect_test.go

@@ -18,6 +18,7 @@ func TestProcessorShutdown_KafkaDisconnect(t *testing.T) {
 	brokers := initSystemTest(t)
 	var (
 		topic = goka.Stream(fmt.Sprintf("goka_systemtest_proc_shutdown_disconnect-%d", time.Now().Unix()))
+		join  = goka.Stream(fmt.Sprintf("goka_systemtest_proc_shutdown_disconnect-%d-join", time.Now().Unix()))


Adding some join tables to the tests so we can test the reconnecting-joins change from above.

improve processor stability

17448ff

frairon force-pushed the processor-stability branch from 2de9c0e to 17448ff Compare February 25, 2025 08:00

mmreza79 approved these changes Feb 25, 2025

norbertklawikowski approved these changes Feb 25, 2025

frairon merged commit 3a26dac into master Feb 26, 2025
5 checks passed

frairon deleted the processor-stability branch February 26, 2025 05:36
