Add docs/Balancer.md #611

Merged: 5 commits, Apr 11, 2024
56 changes: 56 additions & 0 deletions docs/Balancer.md
@@ -0,0 +1,56 @@
## Data Balancing

We considered different options for moving data. The most popular approach is to make a copy via logical replication and then delete one half of the data on one node and the other half on the other node. We decided that logical replication does not yet work well enough for this. Instead, the coordinator performs an ε-split: it cuts off a small part of the data. Since that part is small, the whole operation is very fast.

![ε-split](e-split.png "ε-split")
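
As a toy illustration of the idea (this is not the actual coordinator code, and it assumes integer sharding keys and a cut from the upper end of the range purely for simplicity), an ε-split cuts the key range at a bound chosen so that only a small fraction of the keys is transferred:

```
package main

import "fmt"

// epsilonSplitBound returns the bound at which the key range [lo, hi) is cut
// so that roughly a fraction eps of it is split off from the upper end.
// Toy example with integer keys; real bounds are arbitrary byte strings.
func epsilonSplitBound(lo, hi int64, eps float64) int64 {
    cut := int64(float64(hi-lo) * eps)
    if cut < 1 {
        cut = 1 // always cut off at least something
    }
    return hi - cut
}

func main() {
    // Instead of copying and deleting half of [0, 1000000),
    // move only the small slice [990000, 1000000).
    bound := epsilonSplitBound(0, 1000000, 0.01)
    fmt.Printf("move keys in [%d, %d)\n", bound, 1000000)
}
```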



## Algorithm

This is a brief summary of the stages that balancing consists of (a simplified Go sketch follows the list):

1. **Collecting statistics**: The balancer collects statistics on the workload of the shards using [pg_comment_stats](#pg_comment_stats) to measure CPU and disk usage.
2. **Finding the most heavily loaded shard**: Based on the collected statistics, the balancer identifies the shard with the highest workload.
3. **Selecting the most significant load criterion**: Among all the workload criteria, the one with the greatest impact on the overall workload is chosen.
4. **Checking whether data migration is needed**: The workload on the key range is compared to a threshold value. If it exceeds the threshold, it is time to migrate the data.
5. **Finding the key range with the heaviest load**: On the identified shard, the key range with the highest workload is determined.
6. **Choosing a destination**: The shard and key range to which the data will be migrated are chosen.
7. **Data movement**: A data movement operation is initiated, which may involve splitting the data into smaller chunks, if necessary, and transferring them to the destination shard. For more details, see [data movement internals](#data-movement-internals).
8. **Synchronization**: The changes are synchronized with the etcd cluster to ensure data consistency.
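
For illustration only, here is a self-contained Go sketch of how these stages could fit together. The types and the hard-coded numbers are hypothetical and only mirror the list above; the real balancer considers several load criteria, while this toy version uses a single CPU metric.

```
package main

import "fmt"

// Hypothetical types for illustration; they do not match the real SPQR code.
type KeyRangeLoad struct {
    KeyRangeID string
    CPU        float64 // e.g. user_time reported by pg_comment_stats
}

type ShardLoad struct {
    ShardID   string
    KeyRanges []KeyRangeLoad
}

func totalCPU(s ShardLoad) float64 {
    var sum float64
    for _, kr := range s.KeyRanges {
        sum += kr.CPU
    }
    return sum
}

func main() {
    // Stage 1: statistics per shard and key range (hard-coded here).
    shards := []ShardLoad{
        {ShardID: "sh1", KeyRanges: []KeyRangeLoad{{"kr1", 12.0}, {"kr2", 3.0}}},
        {ShardID: "sh2", KeyRanges: []KeyRangeLoad{{"kr3", 2.0}}},
    }
    threshold := 5.0

    // Stage 2: find the most heavily loaded shard.
    hot := shards[0]
    for _, s := range shards[1:] {
        if totalCPU(s) > totalCPU(hot) {
            hot = s
        }
    }

    // Stage 5: find the key range with the heaviest load on that shard.
    hotKR := hot.KeyRanges[0]
    for _, kr := range hot.KeyRanges[1:] {
        if kr.CPU > hotKR.CPU {
            hotKR = kr
        }
    }

    // Stage 4: compare against the threshold.
    if hotKR.CPU <= threshold {
        fmt.Println("no migration needed")
        return
    }

    // Stages 6-8 would choose a destination shard and key range,
    // move the data (possibly in chunks), and sync the changes with etcd.
    fmt.Printf("would move %s off %s\n", hotKR.KeyRangeID, hot.ShardID)
}
```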

## pg_comment_stats

We forked pg_stat_statements and modified it slightly. The original version of the extension records stats for each SQL statement, while [pg_comment_stats](https://github.com/munakoiso/pg_comment_stats) keeps track of queries that have specific keys mentioned in the statement comments.

```
> /* a: 1 c: hmm*/ select 1;
> select comment_keys, query_count, user_time from pgcs_get_stats() limit 1;
-[ RECORD 1 ]+----------------------
comment_keys | {"a": "1"}
query_count | 1
user_time | 6.000000000000363e-06
```
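
For example, a consumer such as the balancer can read these statistics with a plain SQL query against `pgcs_get_stats()`. The sketch below is an illustration rather than SPQR code: it uses Go's `database/sql` with the `lib/pq` driver and a made-up connection string.

```
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/lib/pq" // PostgreSQL driver; any driver would do
)

func main() {
    // Hypothetical connection string; adjust for your setup.
    db, err := sql.Open("postgres", "host=localhost dbname=postgres sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // One row per distinct set of comment keys, as in the psql example above.
    rows, err := db.Query("SELECT comment_keys, query_count, user_time FROM pgcs_get_stats()")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var keys string
        var count int64
        var userTime float64
        if err := rows.Scan(&keys, &count, &userTime); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("keys=%s query_count=%d user_time=%g\n", keys, count, userTime)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}
```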

## Data movement internals

The balancer is a separate binary that executes the algorithm. It runs the algorithm exactly once, without cyclic repetition, and its running time is on the order of seconds. If the task queue is not empty, the balancer instead performs a task from the queue. A data transport task is actually a group of tasks that can contain many actions, and between actions the task state is synchronized with etcd. After a task completes, it is removed from the task group.

For clarity, here is how it is defined [in the code](https://github.com/pg-sharding/spqr/blob/master/pkg/models/tasks/tasks.go):

```
type Task struct {
    ShardFromId string
    ShardToId   string
    KrIdFrom    string
    KrIdTo      string
    Bound       []byte
    KrIdTemp    string
    State       TaskState // Planned, Split, Moved
}

type TaskGroup struct {
    Tasks    []*Task
    JoinType JoinType // JoinNone, JoinLeft, JoinRight
}
```
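
To make the flow concrete, here is a hedged sketch of how such a task group might be drained. The `TaskState` stand-in, its constants, the `Storage` interface, and the comments on each transition are illustrative assumptions based on the structure above, not the actual implementation.

```
// Illustrative stand-ins: the real code defines TaskState (and JoinType) elsewhere,
// and its constant names may differ.
type TaskState int

const (
    taskPlanned TaskState = iota
    taskSplit
    taskMoved
)

// Storage stands in for the etcd-backed store that persists task state.
type Storage interface {
    WriteTaskGroup(group *TaskGroup) error
}

func executeTaskGroup(group *TaskGroup, store Storage) error {
    for len(group.Tasks) > 0 {
        task := group.Tasks[0]

        switch task.State {
        case taskPlanned:
            // Split off the chunk up to task.Bound into the temporary key range KrIdTemp.
            task.State = taskSplit
        case taskSplit:
            // Physically move the rows of KrIdTemp from ShardFromId to ShardToId.
            task.State = taskMoved
        case taskMoved:
            // Merge KrIdTemp into the destination according to JoinType,
            // then drop the finished task from the group.
            group.Tasks = group.Tasks[1:]
        }

        // The state is synchronized (e.g. with etcd) between every two actions.
        if err := store.WriteTaskGroup(group); err != nil {
            return err
        }
    }
    return nil
}
```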
16 changes: 0 additions & 16 deletions docs/Coordinator.md
@@ -14,19 +14,3 @@ Then, run `SHOW routers;`. The coordinator will reply with the list of known routers in the cluster

It is possible to run the coordinator as a separate entity or together with the router using the `with_coordinator` flag

## Data Balancing

We considered different options for moving data. The most popular way is to make a copy via logical replication, then delete half of the data on one node and delete another half of the data on the other node. We decided that logical replication does not work well enough yet. Instead, the coordinator makes ε-split - cut off a small part of the data. Since it is small, it all works very quickly.

![ε-split](e-split.png "ε-split")

The load from the moving key ranges measured with [pg_comment_stats](https://github.com/munakoiso/pg_comment_stats) extension.

```
> /* a: 1 c: hmm*/ select 1;
> select comment_keys, query_count, user_time from pgcs_get_stats() limit 1;
-[ RECORD 1 ]+----------------------
comment_keys | {"a": "1"}
query_count | 1
user_time | 6.000000000000363e-06
```
Binary file modified docs/e-split.png
Binary file added docs/spqr-logo.png