Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Online DDL via VReplication #7419

Merged
merged 74 commits into from
Mar 1, 2021

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jan 31, 2021

Ready for review

This PR implements a "forward" alter table via VReplication.

General design described in #6926 (comment), which this PR implements.

Notable bullets about the changes in this PR:

  • The main logic of this PR is in go/vt/vttablet/onlineddl/vrepl.go and go/vt/vttablet/onlineddl/executor.go

  • go/vt/vttablet/onlineddl/executor.go can now invoke, cancel and terminate VReplication based migrations. Some of this management is symmetric to how we invoke, cancel and terminate gh-ost or pt-osc, but there are also significant differences:

    • while gh-ost and pt-osc can only work within the context of this tablet (invoked as an OS Exec), VReplication migrations can execute by one tablet and continued on another, even in face of failover.
    • gh-ost and pt-osc have self-logic to run the migration. VReplication needs to be told what exactly it should run. The tablet knows about a alter table statement, and needs to create a matching rule/filter query for VReplication.
    • gh-ost and pt-osc have self-logic to cut-over. Vreplication does not. Executor identifies the occasion for cut-over and implements the cut-over logic, which entails stopping writes to original tables, waiting for pos, stopping VReplication, replacing tables.
  • Imported a bunch of code from https://github.com/github/gh-ost. These are found under go/vt/vttablet/onlineddl/vrepl. The kind of functionality in those files:

    • Parsing an ALTER TABLE statement
    • Structures to define columns
    • Dealing with datatypes
      We don't strictly need all the above as-is. We can replace parsing with vitess's parsing. We don't need to track datatypes because vreplication does. But I copied&pasted those files because they are mature and stable and get the work done.
      I intend to refactor them later on and remove redundant/unrelated code, or replace with existing Vitess functionality. But I'd like to do so on a separate PR so as to focus on the main overall functionality here.
  • Split onlineddl/endtoend tests into endtoend_ghost and endtoend_vrepl. They are mostly similar but with some changes (like how do you throttle a migration). The tests are OK-ish but need to be more elaborate. I again wish to follow up in a future PR as I am yet to design a satisfactory endtoend test for schema migrations and failure scenarios.

  • Due to some earlier attempt to use wrangler inside tabletserver, I created a new interface called vexec.Executor, and refactored ddlExecutor to be vexec.Executor instead of onlineddl.Executor. I ended up not using wrangler inside tabletserver but I liked the refactor and kept it.


initial PR comment when this was Work In Progress:

Description

Implementing Online DDL via VReplication.
There's a bunch of code to wrap the VReplication mechanism. I import a lot of logic from gh-ost (analyzing tables, analyzing primary key, sanity checks). Some of this should be replaced with vitess's sqlparser, but for now it's faster to go with gh-ost imported code as I know it is stable.

This is work in progrss. the idea is that we run VReplication in a materialize-like fashion, where both source and target are the shard's PRIMARY.

Vreplication is kicked by the online DDL executor, and not by vtctl/wrangler. this is very different from the normal VReplication flow, and reasoning is that we want the tablets to validate and to schedule the migration at its own good time. So we're using the existing Online DDL Executor logic to start/stop migrations. Vreplication is different in that it can survive restarts or failovers. This is something the executor will take into account.

VReplication only works through PRIMARY KEYs. In that aspect, gh-ost is more flexible as it can iterate any UNIQUE KEY. Another thing for me to check is what are valid PRIMARY KEY changes: is it OK to add a column to a PRIMARY KEY? Remove a column?

Another difference to normal VReplication flow is how we do the cut-over. We will automate the cut-over, and will then rename the tables. No change to routing rules. Later, we'd need to see if we can also reverse VReplication for renamed tables. I'm not sure how simple/hard that would be.

anyway, this is really work in progress and no guarantees. Will comment as progress is made.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…parser

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…t on this branch's radar

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ap query

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…writes to table, waiting for updated pos, renaming tables, releasing table, releasing locks

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great :-)

Left some comments/queries: in general the functionality looks good.

"strategy": sqltypes.StringBindVariable(string(schema.DDLStrategyPTOSC)),
// isVReplMigrationReadyToCutOver sees if the vreplication migration has completed the row copy
// and is up to date with the binlogs.
func (e *Executor) isVReplMigrationReadyToCutOver(ctx context.Context, s *VReplStream) (isReady bool, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For low write qps it is possible that there are no binlog events for a table for a long time. If I read this logic correctly, in such cases the Online DDL workflow may not cut over automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Isn't there a heartbeat mechanism for vreplication though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time_updated is updated by the heartbeat. However transaction_timestamp is only when a new binlog event is seen, not for a heartbeat. This is quite rare since it means there is no writes on the source at all. But it has been reported by some users. Also if a database is taken "offline" for some reason (in the sense that no app is writing to it) and the user starts a workflow this can happen.

Practically we may not need to address this now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed evaluation of transaction-timestamp. I want to be on the safer side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeUpdated will always be current because of the heartbeat so the moment the copy state is empty it will say that the migrate workflow is ready for cutover. Am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something?

Ermm... Not sure why you're asking? Am I missing something? 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate a bit, though I'm still unsure what you meant:

timeUpdated will always be current

Unless there's some lag, or vreplication is stopped

so the moment the copy state is empty

We also validate that pos in non-empty. Were you suggesting the migration might cut-over before starting copying rows?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having discussed this: meanwhile checking transaction_timestamp is the safer approach. time_updated gets update regardless of incoming events.
So the risk of transaction_timestamp is in completely stale servers. However, it's enough that some table is updated and the problem goes away. But if the server is completely stale, cut-over will not happen.

To solve that we can either:

  • force injection of some dummy change, or
  • compare GTID pos (likely not a good idea, because _vt tables are getting changed, affecting GTID, but scrubbed in the streamer)
  • modify streamer to include more information that we can use on target side.

// isVReplMigrationReadyToCutOver sees if the vreplication migration has completed the row copy
// and is up to date with the binlogs.
func (e *Executor) isVReplMigrationReadyToCutOver(ctx context.Context, s *VReplStream) (isReady bool, err error) {
// Check all the cases where migration is still running:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an online ddl completes, vttablets and vtgates will need to reload their schema as per the current mechanisms, before they see the changes made by the ddl correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I issue a ReloadSchema before running the migration. Should I run ReloadSchema immediately after completing the migration, or should I actually do that during the cut-over (when tables have been switched, and writes are still blocked)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing it when writes are blocked might be better since binlog events after the write will then see the new schema. Otherwise we may have a race: new events for the altered table may be processed before the new schema is reloaded.

But the ReloadSchema is just for this tablet, right?

Copy link
Contributor Author

@shlomi-noach shlomi-noach Feb 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/vitessio/vitess/pull/7419/files#diff-059c9f46e8d270d9c5514ef2b08679035eb0daaa8d95074e34ef43a81d50dc37R637-R639

		if err := tmClient.ReloadSchema(ctx, tablet.Tablet, ""); err != nil {
			return err
		}

correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just clarifying, that this goes through gRPC; so I want to make sure this isn't an expensive event, because meanwhile we're blocking (rather -- rejecting) writes to the table, so it's a critical point in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define expensive...: ReloadSchema can take a while since it parses the entire db schema. It has been speeded up recently but for a large number of tables it will probably take a non-trivial amount of time. I would think, still in the order of milliseconds, but not entirely sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-trivial amount of time.

milliseconds is just fine. Would it take more than 1 second (I'd define "expensive" at 1 second) ? e.g. on a schema with 10,000 tables? Is it possible to ReloadSchema for just one table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(having discussed this in sync with @rohit-nayak-ps) reloading schema can take a substantial time if many thousands of tables are involved. For now, we issue an async ReloadSchema (with goroutine). In the future, we will look into reloading a single table.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…tion-online-ddl

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

I have merged #7492 here. This adds a new endtoend/online_ddl_vrepl_stress test, which tests vreplication/online DDL under concurrent load, where a mockup "app" opens multiple connections, runs some randomly spread queries, and keeps track of changes. A migration takes place during that workload. At th eend of the migration, the app looks at migrated table and expects some metrics in the table to match its own expected metrics. See more in #7492 (comment)

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

7e2a074 introduces dummy statement injection for when we run vreplication based online DDL, and the server is utterly stale. In this situation, transaction_timestamp is never updated, and we may never know whether we should cut-over. See #7419 (comment)

The solution is hacky: if we notice transaction_timestamp is too old (over 1 minute by default), then we set a timer to inject a futuristic dummy statement, in the form of DROP VIEW IF EXISTS ... (name of view which surely does not exist). We inject several of these statements at a schedule when we expect the next online DDL cycle to run, a point where we check for running vreplication migrations. We expect those injections to be intercepted in the binary log (assuming no lag) an thus pave the way for a cut-over.

@shlomi-noach
Copy link
Contributor Author

@rohit-nayak-ps actually, re: discussion about cut-over and stale server -- the tests disagree. The tests are happy to cut-over when the copy is complete, even when there's no traffic on the table. Did some local tests, inserted two rows on a table, then slept, then migrated -- cut-over took place.
Let me take a closer look on what happened.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

I notice that it is impossible to ADD UNIQUE KEY if there's a data conflict. This needs to be supported. Gonna look into it on a different PR, will possibly import a few tests from gh-ost.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Contributor Author

@sougou confirmed changes to _vt tables are intercepted on the target, and transaction_timestamp gets updated. There is indeed an infinite loop -- but it's rate limited: only once event per second is registered in transaction_timestamp.

I've reverted 7e2a074 ; there is no need to inject dummy transaction. The existing mechanism works even on stale servers.

@shlomi-noach
Copy link
Contributor Author

Looking at ADD UNIQUE KEY, the changes required to support it are small, but also begin the rolling of the snowball that is reworking the copier+player flow. I will address the idea in an issue and a followup PR.

This PR, meanwhile, is good to review, notwithstanding the fact that vrepl online DDL does not support ADD UNIQUE KEY and a bit of other UNIQUE KEY.PRIMARY KEY changes. We will work on those in iterations.

@shlomi-noach
Copy link
Contributor Author

I should clarify. Adding a UNIQUE KEY is possible, when unique constraints are met. There is a use case for adding a UNIQUE KEY when the constraint is not met; I'll discuss this elsewhere.

Copy link
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants