Improve ELT performance by running multiple tap stream processes in parallel ("Melturbo") #2677

Closed
MeltyBot opened this issue Apr 26, 2021 · 4 comments

Comments

@MeltyBot

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/2727

Originally created by @kgpayne on 2021-04-26 19:13:52


For integrations with large numbers of streams (e.g. monolithic application databases) it is often advantageous to break the integration up into multiple Meltano pipelines. Today, this is supported (using tap inheritance and selection criteria) but very manual. In an ideal world, Meltano would be able to use a combination of context collected in discovery, context from previous pipeline runs (e.g. records transferred) and hints/overrides provided by end-users in config to optimise the 'sharding' of pipelines. Like the turbocharger on a diesel engine, this is in aid of eking out every bit of available performance, and may also serve as a way of parallelising taps and targets that do not implement multithreading.

Discovery Context:

  • Availability of Primary Keys (large FULL_TABLE replications in their own pipelines)
  • Data Volume (count of records)
  • Data Velocity (records per hour, if a created_at field is available)
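
As a rough illustration of the velocity signal, the sketch below estimates records per hour from a row count plus the earliest and latest created_at values. The function name and its inputs are hypothetical; nothing like this exists in Meltano today, and a real implementation would need to obtain these values from the source during discovery.

```python
from datetime import datetime


def records_per_hour(count: int, earliest: datetime, latest: datetime) -> float:
    """Estimate average records per hour over the observed created_at span.

    Hypothetical helper: `count`, `earliest` and `latest` are assumed to come
    from something like COUNT(*), MIN(created_at) and MAX(created_at).
    """
    hours = (latest - earliest).total_seconds() / 3600
    if count == 0 or hours <= 0:
        # Not enough information to estimate a rate.
        return 0.0
    return count / hours
```

An average over the whole table hides bursts and seasonality, so a windowed estimate over recent data may be a better input in practice.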

Run-history Context:

  • Avg. time taken for Stream sync
  • Records per Stream sync (could be far more than the records-per-hour figure above, e.g. for weekly schedules)
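
One way the average sync times could be used, sketched under assumptions: spread streams across a fixed number of pipelines so that each pipeline's total expected sync time is roughly equal. This uses a greedy "longest first" heuristic; the function name, the input shape and the choice of heuristic are illustrative only and are not part of Meltano.

```python
import heapq


def balance_streams(avg_sync_seconds: dict[str, float], pipelines: int) -> list[list[str]]:
    """Greedily assign streams to pipelines, slowest streams first.

    Each stream goes to whichever pipeline currently has the smallest
    total expected sync time.
    """
    # Min-heap of (total_seconds_so_far, pipeline_index).
    heap = [(0.0, i) for i in range(pipelines)]
    heapq.heapify(heap)
    groups: list[list[str]] = [[] for _ in range(pipelines)]

    slowest_first = sorted(avg_sync_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for stream, seconds in slowest_first:
        total, idx = heapq.heappop(heap)
        groups[idx].append(stream)
        heapq.heappush(heap, (total + seconds, idx))
    return groups
```

With made-up timings of 600, 500, 200 and 100 seconds for four streams and two pipelines, both pipelines end up with 700 seconds of expected work. The heuristic is simple and not guaranteed to be optimal.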

Hint/Override Context:

  • Composable Table/Stream groups (we have several pipelines that collect subsets of tables that correspond to a 'domain' in our monolith)
  • Custom grouping rules (e.g. all FULL_TABLE replication Streams under 10k rows in one pipeline)
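
The hints above could combine into a rule-based assignment along these lines. This is a minimal sketch, assuming each stream carries a replication method, a row count and a user-supplied domain hint; the Stream class, the pipeline naming scheme and the rule order are all hypothetical, with the 10k threshold taken from the example rule above.

```python
from collections import defaultdict
from dataclasses import dataclass

SMALL_TABLE_ROWS = 10_000  # threshold from the example grouping rule


@dataclass(frozen=True)
class Stream:
    name: str
    replication_method: str  # e.g. "FULL_TABLE" or "INCREMENTAL"
    row_count: int
    domain: str  # user-supplied hint, e.g. "billing"


def assign_pipeline(stream: Stream) -> str:
    """Return a (hypothetical) pipeline name for one stream."""
    if stream.replication_method == "FULL_TABLE":
        if stream.row_count < SMALL_TABLE_ROWS:
            # All small FULL_TABLE streams share one pipeline.
            return "small-full-table"
        # Large FULL_TABLE streams each get their own pipeline.
        return f"full-table-{stream.name}"
    # Everything else is grouped by its domain hint.
    return f"domain-{stream.domain}"


def shard(streams: list[Stream]) -> dict[str, list[str]]:
    """Group stream names by the pipeline they are assigned to."""
    groups: dict[str, list[str]] = defaultdict(list)
    for stream in streams:
        groups[assign_pipeline(stream)].append(stream.name)
    return dict(groups)
```

Each resulting group would then map to one inherited tap with its own selection criteria, which is the part that is manual today.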

Initially I would expect to have to run a meltano optimise extractor <extractor-name> command. In future, it would be amazing to delegate optimisation to Meltano in some sort of 'auto-pilot' mode, where optimisation happens unsupervised as run history is accumulated and in response to upstream changes (changes in data volumes and velocities, and new tables being added).


@aaronsteers aaronsteers moved this to Up Next in Office Hours Aug 17, 2022
@aaronsteers aaronsteers moved this from Up Next to Discussed in Office Hours Aug 17, 2022
@tayloramurphy tayloramurphy changed the title from 'Improve ELT performance by running multiple tap processes in parallel ("Melturbo")' to 'Improve ELT performance by running multiple tap stream processes in parallel ("Melturbo")' on Feb 24, 2023

stale bot commented Jun 25, 2023

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Jun 25, 2023
@tayloramurphy

Still relevant

@stale stale bot removed the stale label Jun 26, 2023

stale bot commented Dec 25, 2024

This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.

@stale stale bot added the stale label Dec 25, 2024
@stale stale bot closed this as completed Jan 21, 2025