Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dataflow): handle pipeline errors and clear kafka streams state #5358

Merged
merged 4 commits into from
Feb 26, 2024

Conversation

lc525
Copy link
Member

@lc525 lc525 commented Feb 23, 2024

What this PR does / why we need it:
Because we're now continuously retrying the subscription to the scheduler when the subscription terminates, we need to also clean the kafka streams state of failed pipelines when the subscription is terminated with an error.

This commit also introduces a new way of handling errors and the changes in pipeline status. It moves us towards being able to inspect the status of the underlying kafka streams from outside the Pipeline object. The end-goal (not archieved here) is to react to pipelines reaching error states during their operation, rather than just when they are created for the first time (what's currently implemented). However, this PR does get us to a point where the pipeline creation status, including possible errors/exceptions are reported to the scheduler and can show up either in the seldon CLI or in k8s.

Instead of propagating exceptions, the idea is to return and handle common errors in a more similar way to golang.

Which issue(s) this PR fixes:

  • Fixes #INFRA-755 (internal): Exceptions in pipeline creation/deletion are unrecoverable because of uncleaned Kafka Streams state
  • Progress on #INFRA-617 (internal): Investigate high priority unhandled errors
  • Progress on #INFRA-648 (internal): Handle dataflow [...] exceptions and present meaningful error messages

** PR Merge Sequencing GROUP 663 **
This is part of a sequence of PRs building on each-other, but split in order to simplify reviewing.
This PR has sequence no: G663/1

Next to review in sequence: lc525#47 : fix(dataflow): wait for kafka topic creation

Fixes INFRA-755 (internal): Exceptions in pipeline creation/deletion are
unrecoverable because of uncleaned Kafka Streams state

Because we're now continuously retrying the subscription to the scheduler
when the subscription terminates, we need to also clean the kafka streams
state of failed pipelines when the subscription is terminated with an error.

This commit also introduces a new way of handling errors and the changes
in pipeline status. It moves us towards being able to inspect the status of the
status of the underlying kafka streams from outside the Pipeline object.
The end-goal (not archieved here) is to react to pipelines reaching error
states during their operation, rather than just when they are created for the
first time (what's currently implemented).

Instead of propagating exceptions, the idea is to return and handle common
errors in a more similar way to golang.
@lc525 lc525 requested a review from sakoush as a code owner February 23, 2024 12:04
@lc525 lc525 added the v2 label Feb 23, 2024
@lc525 lc525 marked this pull request as draft February 23, 2024 14:42
@lc525
Copy link
Member Author

lc525 commented Feb 23, 2024

Moved to draft as I've realised a couple more improvements can/should be made before merging

- propose a new way of reporting error states and returning them from function
calls
- deal with edge-cases and document them in code
Revamps error handling for pipelines so that:

1. Scheduler subscription is not ended when kafka stream errors/exceptions
appear
2. Any error messages are propagated towards the scheduler so that it can show
the reason for a broad array of cases where the kafka streams app for a given
pipeline gets stopped.
3. Lay groundwork so that errors are propagated to the scheduler not only
at pipeline creation time but on transitioning to an error state.
@lc525 lc525 marked this pull request as ready for review February 25, 2024 21:59
@lc525 lc525 changed the title fix(dataflow): clear kafka stream state for failed pipelines fix(dataflow): handle pipeline errors and clear kafka streams state Feb 25, 2024
Copy link
Member

@sakoush sakoush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! I have added comments / questions to the extent I can read kotlin.

@lc525 lc525 merged commit 91e36f2 into SeldonIO:v2 Feb 26, 2024
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants