Discovery: Implement MVP of an outbox pattern for the event bus #251

Open
2 of 3 tasks
robrap opened this issue Jul 21, 2023 · 10 comments
Labels
event-bus Work related to the Event Bus.

Comments

@robrap
Contributor

robrap commented Jul 21, 2023

A/Cs:

At this time, there is not a great story for recovery from event producing issues if the event bus (Kafka, Redis, etc.) were to temporarily go down.

A common solution for this is an Outbox Pattern, where event data is first sent to the database, and a separate process sends events from the outbox to the event broker, maintaining order. This ticket is for implementing an MVP of this pattern.
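
As a rough illustration of the producing side described above, here is a minimal sketch using Python's built-in `sqlite3`. The `enrollment` and `outbox` tables, the `enrollment-created` topic, and the function names are all hypothetical and are not part of openedx-events. The point is only that the business row and the event row are written in one transaction, so either both are saved or neither is.

```python
import json
import sqlite3


def init_db(conn):
    """Create a hypothetical business table and a hypothetical outbox table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enrollment "
        "(id INTEGER PRIMARY KEY, user_id INTEGER, course TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT NOT NULL, payload TEXT NOT NULL)"
    )


def enroll_and_queue_event(conn, user_id, course, fail=False):
    """Write the business change and its event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO enrollment (user_id, course) VALUES (?, ?)",
            (user_id, course),
        )
        payload = json.dumps({"user_id": user_id, "course": course})
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("enrollment-created", payload),
        )
        if fail:
            # Simulated error: neither row above should survive the rollback.
            raise RuntimeError("simulated failure before commit")
```

Nothing here talks to the broker, so a Kafka or Redis outage cannot cause the request to fail or the event to be lost; the event simply waits in the outbox table.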

Notes/Questions:

  • Another ticket, "Ability to resend events to Kafka that errored while producing" (edx/edx-arch-experiments#354), covers documenting/implementing a less ideal workaround, but that fix is likely to be much more dependent on environment.
  • To maintain order, the simplest implementation is to have a singleton process that reads from the outbox and produces the events.
    • What can this handle for typical load?
    • What can this handle for a large incident?
    • When would we need to invest in being able to have multiple processors (e.g. that handle different topics)?
    • Would we need a capability to move or keep certain topics with large load and less essential events off the outbox?
  • When do we delete events from the outbox?
  • Where is the code for this going to live? Do we want this to be in a library?
  • Note: there are many articles we can review to learn from others.
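
To make the singleton-processor and deletion questions above more concrete, here is one possible sketch of a single relay pass, again using `sqlite3`. The `outbox` table layout, `relay_once`, and the `publish` callable are assumptions for illustration, not an existing API. In this sketch, rows are read in `id` order, each row is deleted only after `publish` returns, and the pass stops at the first failure so that later events cannot overtake an earlier one.

```python
import sqlite3


def relay_once(conn, publish, batch_size=100):
    """Send pending outbox rows in insertion order; return how many were sent.

    `publish(topic, payload)` is a hypothetical callable that raises on failure.
    """
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        try:
            publish(topic, payload)
        except Exception:
            # Stop at the first failure; retrying later preserves ordering.
            break
        with conn:  # delete only after the broker accepted the event
            conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

A singleton process could call `relay_once` in a loop with a short sleep between passes. Because the delete happens after the publish, a crash between the two steps would resend that event on the next pass, so this gives at-least-once delivery and consumers would need to tolerate duplicates. Deleting immediately is only one answer to "when do we delete"; marking rows as sent and purging later is another.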
@robrap
Contributor Author

robrap commented Jul 27, 2023

We are waiting on @davidjoy (or someone) to do some outreach to help us determine priority. The thought was to talk with Colin about a potential subscriptions event, because it is financial. We can also reach out to owners of existing events: https://2u-internal.atlassian.net/wiki/spaces/AT/pages/174555142/How+to+Use+the+Event+Bus+edX.org+2+of+2#Current-event-bus-usage.

@robrap
Contributor Author

robrap commented Jul 27, 2023

[inform] I discussed with Kelly, and they are probably fine with the more temporary solution of edx/edx-arch-experiments#354 for a while. However, she also recommended reaching out to Colin about the Program Credential events.

@davidjoy

davidjoy commented Aug 1, 2023

So after talking to @colinbrash, we don't feel there's an immediate need for this to be done. If we have the ability to manually recover via the more temporary solution in edx/edx-arch-experiments#354 then that's sufficient for now.

The risks associated with the program credential event not sending are relatively small, in that it's only being used to automatically shut off a subscription and/or send a reminder email.

That said, Colin indicates that he sees the Outbox pattern as an important part of giving users of the event bus peace of mind, and that not having it gives him pause as his team continues work on commerce-coordinator. For them to consider adopting the event bus in that work, they'd like to have the resiliency that the outbox would provide.

@davidjoy davidjoy removed their assignment Aug 1, 2023
@davidjoy davidjoy removed the status in Arch-BOM Aug 1, 2023
@robrap
Contributor Author

robrap commented Aug 1, 2023

That said, Colin indicates that he sees the Outbox pattern as an important part of giving users of the event bus peace of mind, and that not having it gives him pause... [emphasis added]

Is this for any events, or specifically commerce-related events? Either way, this seems like something we need to get ahead of if we want people to use it when they need it. Maybe it isn't the highest priority, but it still feels like high priority work if we are concerned with event bus adoption.

@davidjoy

davidjoy commented Aug 2, 2023

Commerce-related events that commerce coordinator might produce or rely on.

@robrap
Contributor Author

robrap commented Aug 10, 2023

  1. This ticket was prioritized as Future, but I'd like us to revisit.
  2. There was a Prod incident caused by an arch-bom related change.
    a. We will separately look to see how this type of event might be avoided.
    b. At the same time, this is a reminder that we may cause event producing errors, and not just Kafka going down.
  3. As part of resolving, all existing events have backup plans. If events did not, this would have been a more major issue.

[proposal] We implement sooner rather than later to avoid an RCA that we know is avoidable.
[alternative proposal] [not recommended] we communicate that the event bus is risky for any event that doesn't have its own backup plan, until we implement this ticket, and have teams communicate with us when and if they wish to use the event bus and do not want to take on this risk. This type of blocker may just deter people from using the event bus.

@robrap robrap moved this to Prioritized in Arch-BOM Aug 14, 2023
@robrap
Contributor Author

robrap commented Aug 21, 2023

We should definitely check in with Confluent around this for solutions. I also wonder if there is any way we could use a Connector for the processing of the outbox itself.

@dianakhuang dianakhuang changed the title from "Implement MVP of an outbox pattern for the event bus" to "Discovery: Implement MVP of an outbox pattern for the event bus" Aug 29, 2023
@dianakhuang dianakhuang moved this from Prioritized to Groomed in Arch-BOM Aug 29, 2023
@robrap
Contributor Author

robrap commented Sep 13, 2023

The ticket openedx/event-bus-kafka#178 already existed, and is possibly redundant to this one. Whoever takes on this ticket might review that ticket, see if there is anything important to move over, and close it as necessary.

@jmbowman

jmbowman commented Jan 3, 2024

There's a thread in the Platform Engineering Slack workspace that seems relevant: https://platformengin-b0m7058.slack.com/archives/C038M6NBFLY/p1704230480747129 . Initial post:

Hello, I have a Platform team who has created a Kafka producer/consumer service following the Transactional Outbox Pattern. The outbox tables live within the databases owned by the service teams. I had hoped / expected that the service teams would own this table and outbox relay since it is a per service database setup in databases they own. We created documentation expecting this. I am getting pushback because some Kafka expertise is required for maintenance. We are open to supporting them with knowledge/consulting but they are pushing back for my Platform team to own them. I am reluctant because I am not sure how scalable it is for my team to own X outbox tables that live in databases owned by the service teams. Anyone else ever hit this problem or similar issue?

First reply:

This happens all the time. Unfortunately, it sounds like expectations weren't established before building this service for the dev team.
The good news is you're not stuck with everything...yet. I would use this as an opportunity to clarify and tease apart the responsibilities between your platform and dev team.
Here are some thoughts that may help your conversations (these are based on a platform engineering mindset of "you build it, you run it"):

  • You mentioned contention around who owns the table. Dev teams usually own all data that they produce. So, the data in the table would be owned by the dev teams. If the table schema needs to be changed, I think this would fall under maintenance of the reusable service (see next note).
  • Platform teams usually maintain reusable libraries/frameworks/services. This means feature updates, bug fixes, security updates.
  • Some engineering orgs have shared services that are a single instance of production resources while some will require each dev team to stand up an instance of a reusable service. It's unclear where your org stands on this, but I believe it's within the platform team's responsibility to (at minimum) figure out how to perform upgrades on this service
  • Deciding who should perform these upgrades/updates requires a little more knowledge and context. I would do some critical thinking and work with the dev team to figure out a strategy that works for both.

@timmc-edx
Contributor

The ADR has been accepted in Provisional status: https://docs.openedx.org/projects/openedx-events/en/latest/decisions/0015-outbox-pattern-and-production-modes.html -- I won't be ticketing up that work at the moment, though.

@timmc-edx timmc-edx moved this from In Progress to Prioritized in Arch-BOM Feb 20, 2024
@timmc-edx timmc-edx removed their assignment Feb 20, 2024
@jristau1984 jristau1984 moved this from Prioritized to Backlog in Arch-BOM Jul 1, 2024

4 participants