Improved MV and index fill #586

Open
purplefox opened this issue Sep 17, 2022 · 1 comment
purplefox commented Sep 17, 2022

The current MV/index fill runs as part of DDL execution. It can take considerable time, and if Raft leadership changes during this period the fill is cancelled. The fill logic is also very complex.

Here is how we can improve it:

  • Remove the MV/index fill from DDL execution. DDL execution will initiate the fill but will complete before the fill does.
  • To fill an MV, we create it and attach it to its feeders so it receives new rows.
  • We also create a special source called a "fill_source" - this can be created through the normal DDL execution mechanism, so it is persistent.
  • A fill_source, instead of obtaining rows from Kafka messages, obtains them by scanning a set of shards in the database (we can use an iterator for this).
  • When the fill_source is created it is initialised with a start and end sequence for each shard, obtained by taking a snapshot of the feeder table and looking at its oldest and newest rows.
  • The fill_source will load rows in chunks and send them to the MV's handleRows() method. As rows are processed it will persist the last processed offset, so it can resume from there after a restart or failure (see the sketch after this list).
  • The MV will receive rows from both newly ingested data and the fill.
  • Each row will need a sequence number or timestamp so the MV can reject older rows when it has already seen a newer row for the same key.
  • To handle deletes: if the MV receives a delete it will need to store it in a deleted_rows sub-table. A row received from the fill can then be ignored if a row exists in deleted_rows for the same key with a later timestamp/sequence.
  • The deleted keys can be cached in memory (up to an LRU size limit) to avoid too many gets.
  • The deleted_rows entries can be removed when the fill is complete.
  • When the fill_source has processed rows up to its end sequence, it will drop itself.
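
To make the chunked fill concrete, here is a minimal Go sketch of the per-shard loop and the deleted-key check. All names here (Row, Storage, MV, ScanChunk, PutLastOffset, HandleRows, deletedCache) are hypothetical stand-ins for illustration, not the project's actual API:

```go
package fill

import "sync"

// Row is a simplified stand-in for a table row; Seq is the per-row sequence
// number/timestamp the proposal relies on for deduplication.
type Row struct {
	Key []byte
	Seq uint64
	// column data elided
}

// Storage abstracts the per-shard store. These method names are assumptions
// for illustration only.
type Storage interface {
	// ScanChunk returns up to maxRows rows with sequence in [fromSeq, endSeq].
	ScanChunk(shardID, fromSeq, endSeq uint64, maxRows int) ([]Row, error)
	// PutLastOffset persists the last processed sequence for a shard.
	PutLastOffset(shardID, seq uint64) error
}

// MV is the target materialised view; HandleRows mirrors the handleRows()
// method mentioned above.
type MV interface {
	HandleRows(rows []Row) error
}

// FillShard drives the fill for one shard: scan a chunk, feed it to the MV,
// checkpoint the offset, repeat until endSeq is reached.
func FillShard(st Storage, mv MV, shardID, startSeq, endSeq uint64, chunkSize int) error {
	from := startSeq
	for from <= endSeq {
		rows, err := st.ScanChunk(shardID, from, endSeq, chunkSize)
		if err != nil {
			return err
		}
		if len(rows) == 0 {
			break // nothing left in range: this shard's fill is done
		}
		if err := mv.HandleRows(rows); err != nil {
			return err
		}
		last := rows[len(rows)-1].Seq
		// Persist the offset so a restarted fill_source resumes from here
		// rather than rescanning from the start.
		if err := st.PutLastOffset(shardID, last); err != nil {
			return err
		}
		from = last + 1
	}
	return nil
}

// deletedCache is a toy in-memory map of deleted key -> delete sequence,
// standing in for the bounded LRU over the deleted_rows sub-table.
type deletedCache struct {
	mu sync.Mutex
	m  map[string]uint64
}

// ShouldIgnore reports whether a fill row is superseded by a later delete.
func (c *deletedCache) ShouldIgnore(r Row) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	delSeq, ok := c.m[string(r.Key)]
	return ok && delSeq >= r.Seq
}
```

The key design point is that the offset is checkpointed after each chunk, so a crashed or restarted fill_source resumes from the last checkpoint instead of replaying the whole range.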
purplefox (Author) commented:
Update: better method.

We introduce versioning on the data key; then we can create a true snapshot of the data.
Then, when creating an MV, we insert a new executor type "FillExecutor" in front of the MV.
The FillExecutor maintains an iterator on the feeding source. The iterator at any one time has a version associated with it, and only iterates over keys whose highest version is <= that version. Once it has iterated over all keys for a particular version it moves to the next version and repeats. This continues until there are no more rows, at which point the fill is complete and the FillExecutor switches over to live records.
The FillExecutor will store its offset persistently with each batch that is processed, so that on failure it can carry on where it left off.
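
A minimal Go sketch of the version-by-version iteration, again with hypothetical types (versionedStore, keysAt, emit) standing in for the real storage and executor plumbing:

```go
package fill

import "sort"

// versionedStore models a store where each key may be written at multiple
// versions. The names and layout are illustrative assumptions only; version
// lists are assumed non-empty and sorted ascending.
type versionedStore struct {
	entries map[string][]uint64 // key -> ascending list of write versions
}

// keysAt returns, in key order, the keys whose highest version is exactly v.
// Iterating v upwards therefore visits each key once, at its newest version
// within the snapshot - the "highest version <= iterator version" rule.
func (s *versionedStore) keysAt(v uint64) []string {
	var keys []string
	for k, versions := range s.entries {
		if versions[len(versions)-1] == v {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// FillExecutor sits in front of the MV and replays the snapshot version by
// version; once the snapshot is exhausted it switches over to live records.
type FillExecutor struct {
	src         *versionedStore
	lastVersion uint64 // persisted with each processed batch in the real design
	live        bool
}

// RunFill replays versions (lastVersion, maxVersion], emitting each key at
// the version it is visited. On restart, lastVersion lets it resume.
func (f *FillExecutor) RunFill(maxVersion uint64, emit func(key string, version uint64)) {
	for v := f.lastVersion + 1; v <= maxVersion; v++ {
		for _, k := range f.src.keysAt(v) {
			emit(k, v)
		}
		f.lastVersion = v // checkpoint progress after the batch for this version
	}
	f.live = true // fill complete: subsequent rows come from the live stream
}
```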
