From e72c1503c684307e2cfc77f146cc9840380c8d3e Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Thu, 29 Sep 2022 10:35:10 +0200 Subject: [PATCH 1/6] docs/design: Initial commit of REORGANIZE PARTITION design doc --- .../design/2022-09-29-reorganize-partition.md | 132 ++++++++++++++++++ 1 file changed, 132 insertions(+) create mode 100644 docs/design/2022-09-29-reorganize-partition.md diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md new file mode 100644 index 0000000000000..1c66bc657a8f9 --- /dev/null +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -0,0 +1,132 @@ +# TiDB Design Documents + +- Author(s): [Mattias Jonsson](http://github.com/mjonss) +- Discussion PR: https://github.com/pingcap/tidb/pull/XXX +- Tracking Issue: https://github.com/pingcap/tidb/issues/15000 + +## Table of Contents + +* [Introduction](#introduction) +* [Motivation or Background](#motivation-or-background) +* [Detailed Design](#detailed-design) +* [Test Design](#test-design) + * [Functional Tests](#functional-tests) + * [Scenario Tests](#scenario-tests) + * [Compatibility Tests](#compatibility-tests) + * [Benchmark Tests](#benchmark-tests) +* [Impacts & Risks](#impacts--risks) +* [Investigation & Alternatives](#investigation--alternatives) +* [Unresolved Questions](#unresolved-questions) + +## Introduction + +Support ALTER TABLE t REORGANIZE PARTITION p1,p2 INTO (partition pNew1 values...) + +## Motivation or Background + +TiDB is currently lacking the support of changing the partitions of a partitioned table, it only supports adding and dropping LIST/RANGE partitions. +Supporting REORGANIZE PARTITIONs will allow RANGE partitioned tables to have a MAXVALUE partition to catch all values and split it into new ranges. Similar with LIST partitions where one can split or merge different partitions. + +When this is implemented, it will also allow transforming a non-partitioned table into a partitioned table as well as remove partitioning and make a partitioned table a normal non-partitioned table, which is different ALTER statements but can use the same implementation as REORGANIZE PARTITION + +The operation should be online, and must handle multiple partitions as well as large data sets. + +Possible usage scenarios: +- Full table copy + - merging all partitions to a single table (ALTER TABLE t REMOVE PARTITIONING) + - splitting data from many to many partitions, like change the number of hash partitions + - splitting a table to many partitions (ALTER TABLE t PARTITION BY ...) +- Partial table copy (not full table/all partitions) + - split one or more partitions + - merge two or more partitions + +These different use cases can have different optimizations, but the generic form must still be solved: +- N partitions, where each partition has M indexes + +Should we implement both the ingest (lightning way) as well as the merge-txn (row-by-row internal insert batches)? + +## Detailed Design + +There are two parts of the design: +- Schema change states throughout the operation +- Reorganization implementation + +Where the schema change states will clarify which different steps that will be done in which schema state transisions. + +### Schema change states for REORGANIZE PARTITION + +Since this operation will: +- create new partitions +- drop existing partitions +- copy data from dropped partitions to new partitions +it will use all these schema change stages: + // StateNone means this schema element is absent and can't be used. + StateNone SchemaState = iota + // StateDeleteOnly means we can only delete items for this schema element. + StateDeleteOnly + // StateWriteOnly means we can use any write operation on this schema element, + // but outer can't read the changed data. + StateWriteOnly + // StateWriteReorganization means we are re-organizing whole data after write only state. + StateWriteReorganization + // StateDeleteReorganization means we are re-organizing whole data after delete only state. + StateDeleteReorganization + // StatePublic means this schema element is ok for all write and read operations. + StatePublic + // StateReplicaOnly means we're waiting tiflash replica to be finished. + StateReplicaOnly + // StateGlobalTxnOnly means we can only use global txn for operator on this schema element + StateGlobalTxnOnly + + +Explaie the design in enough detail that: it is reasonably clear how the feature would be implemented, corner cases are dissected by example, how the feature is used, etc. + +It's better to describe the pseudo-code of the key algorithm, API interfaces, the UML graph, what components are needed to be changed in this section. + +Compatibility is important, please also take into consideration, a checklist: +- Compatibility with other features, like partition table, security&privilege, collation&charset, clustered index, async commit, etc. +- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc. +- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc. +- Upgrade compatibility +- Downgrade compatibility + +## Test Design + +A brief description of how the implementation will be tested. Both the integration test and the unit test should be considered. + +### Functional Tests + +It's used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered. + +### Scenario Tests + +It's used to ensure this feature works as expected in some common scenarios. + +### Compatibility Tests + +A checklist to test compatibility: +- Compatibility with other features, like partition table, security & privilege, charset & collation, clustered index, async commit, etc. +- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc. +- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc. +- Upgrade compatibility +- Downgrade compatibility + +### Benchmark Tests + +The following two parts need to be measured: +- The performance of this feature under different parameters +- The performance influence on the online workload + +## Impacts & Risks + +Describe the potential impacts & risks of the design on overall performance, security, k8s, and other aspects. List all the risks or unknowns by far. + +Please describe impacts and risks in two sections: Impacts could be positive or negative, and intentional. Risks are usually negative, unintentional, and may or may not happen. E.g., for performance, we might expect a new feature to improve latency by 10% (expected impact), there is a risk that latency in scenarios X and Y could degrade by 50%. + +## Investigation & Alternatives + +How do other systems solve this issue? What other designs have been considered and what is the rationale for not choosing them? + +## Unresolved Questions + +What parts of the design are still to be determined? From 090eab3cf2fdf380c4eb9ed3c1f0abd3156ce104 Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Fri, 7 Oct 2022 13:54:25 +0200 Subject: [PATCH 2/6] Update 2022-09-29-reorganize-partition.md --- .../design/2022-09-29-reorganize-partition.md | 75 ++++++++++++++----- 1 file changed, 56 insertions(+), 19 deletions(-) diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md index 1c66bc657a8f9..69492e51cb880 100644 --- a/docs/design/2022-09-29-reorganize-partition.md +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -1,7 +1,7 @@ # TiDB Design Documents - Author(s): [Mattias Jonsson](http://github.com/mjonss) -- Discussion PR: https://github.com/pingcap/tidb/pull/XXX +- Discussion PR: https://github.com/pingcap/tidb/pull/38314 - Tracking Issue: https://github.com/pingcap/tidb/issues/15000 ## Table of Contents @@ -43,7 +43,8 @@ Possible usage scenarios: These different use cases can have different optimizations, but the generic form must still be solved: - N partitions, where each partition has M indexes -Should we implement both the ingest (lightning way) as well as the merge-txn (row-by-row internal insert batches)? +First implementation should be based on the merge-txn (row-by-row read, update record, write) transactional batches. +Later we can implement the ingest (lightning way) optimization, since it is not yet completely done. ## Detailed Design @@ -59,51 +60,72 @@ Since this operation will: - create new partitions - drop existing partitions - copy data from dropped partitions to new partitions +- it will use all these schema change stages: + // StateNone means this schema element is absent and can't be used. StateNone SchemaState = iota + - Check if the table structure after the ALTER is valid + - Generate physical table ids to each new partition + - Update the meta data with the new partitions and which partitions to be dropped (so that new transaction can double write) + - TODO: Should we also set placement rules? (Lable Rules) + - Set the state to StateDeleteOnly + // StateDeleteOnly means we can only delete items for this schema element. StateDeleteOnly + - Set the state to StateWriteOnly + // StateWriteOnly means we can use any write operation on this schema element, // but outer can't read the changed data. StateWriteOnly + - Set the state to StateWriteReorganization + // StateWriteReorganization means we are re-organizing whole data after write only state. StateWriteReorganization - // StateDeleteReorganization means we are re-organizing whole data after delete only state. - StateDeleteReorganization - // StatePublic means this schema element is ok for all write and read operations. - StatePublic + - Copy the data from the partitions to be dropped and insert it into the new partitions [1] + - Write the index data for the new partitions [2] + - Replace the old partitions with the new partitions in the metadata + - Add a job for dropping the old partitions data (which will also remove its placement rules etc.) + // StateReplicaOnly means we're waiting tiflash replica to be finished. StateReplicaOnly - // StateGlobalTxnOnly means we can only use global txn for operator on this schema element - StateGlobalTxnOnly + - TODO: Do we need this stage? How to handle TiFlash for REORGANIZE PARTITIONS? +- [1] will read all rows, update their physical table id (partition id) and write them back. +- [2] can be done together with [1] to avoid duplicating the work for both reading and getting the new partition id -Explaie the design in enough detail that: it is reasonably clear how the feature would be implemented, corner cases are dissected by example, how the feature is used, etc. -It's better to describe the pseudo-code of the key algorithm, API interfaces, the UML graph, what components are needed to be changed in this section. +During the reorganization happens in the background the normal write path needs to check if there are any new partitions in the metadata and also check if the updated/deleted/inserted row would match a new partition, and if so, also do the same operation in the new partition, just like during adding index or modify column operations currently does. (To be implemented in `(*partitionedTable) AddRecord/UpdateRecord/RemoveRecord`) -Compatibility is important, please also take into consideration, a checklist: -- Compatibility with other features, like partition table, security&privilege, collation&charset, clustered index, async commit, etc. -- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc. -- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc. -- Upgrade compatibility -- Downgrade compatibility +Note that parser support already exists. +There should be no issues with upgrading, and downgrade will not be supported during the DDL. + +TODO: +- How does DDL affect statistics? Look into how to add statistics for the new partitions (and that the statistics for the old partitions are removed when they are dropped) +- How to handle TiFlash? ## Test Design -A brief description of how the implementation will be tested. Both the integration test and the unit test should be considered. +Re-use tests from other DDLs like Modify column, but adjust them for Reorganize partition. ### Functional Tests +TODO + It's used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered. ### Scenario Tests +Note that this DDL will be online, while MySQL is blocking on MDL. + +TODO + It's used to ensure this feature works as expected in some common scenarios. ### Compatibility Tests +TODO + A checklist to test compatibility: - Compatibility with other features, like partition table, security & privilege, charset & collation, clustered index, async commit, etc. - Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc. @@ -113,15 +135,29 @@ A checklist to test compatibility: ### Benchmark Tests +Correctness and functionality is higher priority than performance, also better to not influence performance of normal load during the DDL. + +TODO + The following two parts need to be measured: - The performance of this feature under different parameters - The performance influence on the online workload ## Impacts & Risks -Describe the potential impacts & risks of the design on overall performance, security, k8s, and other aspects. List all the risks or unknowns by far. +Impacts: +- better usability of partitioned tables +- online alter in TiDB, where MySQL is blocking +- all affected data needs to be read (CPU/IO/Network load on TiDB/PD/TiKV) +- all data needs to be writted (duplicated, both row-data and indexes), including transaction logs (more disk space on TiKV, CPU/IO/Network load on TiDB/PD/TiKV and TiFlash if configured on the table). -Please describe impacts and risks in two sections: Impacts could be positive or negative, and intentional. Risks are usually negative, unintentional, and may or may not happen. E.g., for performance, we might expect a new feature to improve latency by 10% (expected impact), there is a risk that latency in scenarios X and Y could degrade by 50%. +Risks: +- introduction of bugs + - in the DDL code + - in the write path (double writing the changes for transactions running during the DDL) +- out of disk space +- out of memory +- general resource usage, resulting in lower performance of the cluster ## Investigation & Alternatives @@ -130,3 +166,4 @@ How do other systems solve this issue? What other designs have been considered a ## Unresolved Questions What parts of the design are still to be determined? +- TiFlash handling From 6be116305ad6c1b6e2f1b006ddb09cf944865bdf Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Mon, 10 Oct 2022 10:14:06 +0200 Subject: [PATCH 3/6] Update docs/design/2022-09-29-reorganize-partition.md Co-authored-by: Benjamin2037 --- docs/design/2022-09-29-reorganize-partition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md index 69492e51cb880..baf15b0f897d9 100644 --- a/docs/design/2022-09-29-reorganize-partition.md +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -44,7 +44,7 @@ These different use cases can have different optimizations, but the generic form - N partitions, where each partition has M indexes First implementation should be based on the merge-txn (row-by-row read, update record, write) transactional batches. -Later we can implement the ingest (lightning way) optimization, since it is not yet completely done. +Later we can implement the ingest (lightning way) optimization, since DDL module are on the way of evolution to do reorg tasks more efficiency. ## Detailed Design From 10dcf46299064d3ab163e8be4d2453fbde255a4e Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Mon, 10 Oct 2022 10:14:20 +0200 Subject: [PATCH 4/6] Update docs/design/2022-09-29-reorganize-partition.md Co-authored-by: Benjamin2037 --- docs/design/2022-09-29-reorganize-partition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md index baf15b0f897d9..9b7414506b9c7 100644 --- a/docs/design/2022-09-29-reorganize-partition.md +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -52,7 +52,7 @@ There are two parts of the design: - Schema change states throughout the operation - Reorganization implementation -Where the schema change states will clarify which different steps that will be done in which schema state transisions. +Where the schema change states will clarify which different steps that will be done in which schema state transitions. ### Schema change states for REORGANIZE PARTITION From 7e140c756d7fb4982bdbc084a6003ab0f0e047d6 Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Mon, 10 Oct 2022 10:14:27 +0200 Subject: [PATCH 5/6] Update docs/design/2022-09-29-reorganize-partition.md Co-authored-by: Benjamin2037 --- docs/design/2022-09-29-reorganize-partition.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md index 9b7414506b9c7..86be04f5ea389 100644 --- a/docs/design/2022-09-29-reorganize-partition.md +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -60,7 +60,6 @@ Since this operation will: - create new partitions - drop existing partitions - copy data from dropped partitions to new partitions -- it will use all these schema change stages: // StateNone means this schema element is absent and can't be used. From 71a23d954ca2c36eadaf7a1e264a158a699fcb34 Mon Sep 17 00:00:00 2001 From: Mattias Jonsson Date: Mon, 10 Oct 2022 10:14:53 +0200 Subject: [PATCH 6/6] Update docs/design/2022-09-29-reorganize-partition.md Co-authored-by: Benjamin2037 --- docs/design/2022-09-29-reorganize-partition.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md index 86be04f5ea389..70b6e605f3c91 100644 --- a/docs/design/2022-09-29-reorganize-partition.md +++ b/docs/design/2022-09-29-reorganize-partition.md @@ -70,7 +70,7 @@ it will use all these schema change stages: - TODO: Should we also set placement rules? (Lable Rules) - Set the state to StateDeleteOnly - // StateDeleteOnly means we can only delete items for this schema element. + // StateDeleteOnly means we can only delete items for this schema element (the new partition). StateDeleteOnly - Set the state to StateWriteOnly