Add (almost complete) EPaxos implementation & fixes (#24)

* minor updates to plotting scripts * add distributed machines helper scripts * minor updates to plotting scripts * make toml dependency optional for workflow * staging progress on physical net expers * add tidb workload profile results * staging progress on changed bench parameters * update gitignore * minor updates to tla+ formatting * minor updates to plotting scripts * minor updates to plotting scripts * minor updates to plotting scripts * minor updates to plotting scripts * add open_tcp_ports.sh script * add auto iperf & ping script * staging progress on physical net expers * staging progress on physical net expers * staging progress on physical net expers * finished physical network exper =) * finished physical network exper =) * update motivation profiling plots * add banal formal dump script * minor updates to plotting scripts * add WAN delays modeling calculation * minor updates to WAN perf modeling script * reorganize scripts into paper-specific folders * reorganize scripts into paper-specific folders * mv artifact memos to separate file * adding SMR-style TLA+ spec * adding command line TLC helper scripts * optimizing SMR-style TLA+ spec & linearizability constraint * completing SMR-style TLA+ spec * completing SMR-style TLA+ spec * complete SMR-style MultiPaxos TLA+ spec * add README to MultiPaxos spec folder * add SMR-style MultiPaxos spec blog * minor updates to MultiPaxos spec README * update SMR-style MultiPaxos spec and TLC script * add node failure injection to MultiPaxos spec * update Crossword TLA+ spec * update Crossword TLA+ spec * update Crossword TLA+ spec .gitignore * extending SMR-style MultiPaxos spec * extending SMR-style MultiPaxos spec * finish extended SMR-style MultiPaxos spec * staging progress on Bodega TLA+ spec * staging progress on Bodega TLA+ spec * finish Bodega TLA+ spec * finish Bodega TLA+ spec * finish Bodega TLA+ spec * finish Bodega TLA+ spec * test change * scanning through the codebase * scanning through the codebase * add back committed condition check in Bodega spec * cleaning up bench client * add comments on port usage mess * remove useless perf sim parameters * minor updates to README * add server module sync APIs * fix wal_offset interference bug * refactor MultiPaxos code to fix Prepare phase bug * comment out the snapshotting test * fixing the Prepare phase for RSPaxos * fixing the Prepare phase for RSPaxos * fixed the Prepare phase for RSPaxos * fixed the Prepare phase for Crossword * fix formatting issues * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * updating CloudLab experiments support * add remote_hosts.toml grouping * restructured scripts utils library * making better benchmarking scripts * making better benchmarking scripts * making better benchmarking scripts * making better benchmarking scripts * making better benchmarking scripts * making better benchmarking scripts * making better benchmarking scripts * finish follower read stalenss exper * enable keyspace partitioning run mode * enable keyspace partitioning for ChainPaxos * enable keyspace partitioning for ChainPaxos * enable keyspace partitioning for ChainPaxos * enable keyspace partitioning for ChainPaxos * minor updates to artifact readme * working on improved YCSB exper * working on improved YCSB exper * finished improved YCSB trace exper * finished improved YCSB trace exper * update Crossword publish crop scripts * wrapping up the improved evaluation * update results archiving script * add wan quorums model plotting for Bodega * use k reqs unit for ycsb plot * staging progress on chain replication impl * staging progress on chain rep impl * finished chain replication implementation * finished chain replication implementation * finished chain replication implementation * fixing CI workflow issues * fixing CI workflow issues * minor update to README * finishing final plotting updates for Crossword * finishing final plotting updates for Crossword * minor updates to bench script comments * use clone_from() according to clippy * update Bodega TLA+ model * minor updates to cargo metadata * modifications for code quality improvement * modifications for code quality improvement * use static OnceLock for logging identifier * let formatter reorder imports automatically * minor updates to README * minor fix for logging & script ports * minor updates to README * minor updates to README * adding public repo publish script * adding public repo publish script * adding public repo publish script * adding public repo publish script * adding public repo publish script * adding public repo publish script * bump rust dependencies versions * replace messagepack with bincode * remove unnecessary bind ports requirements * bump tokio version and minor plot changes * update setup scripts * update setup scripts for cockroach * update archive_results.sh script * working on extra expers for reviews * working on gossiping batching exper * finish bw utilization exper * finish bw utilization exper * add cockroach automation scripts * enabling TPC-C workload bench for cockroach * working on cockroach tpcc experiment * minor updates to comments * minor updates to mirror script * working on cockroach tpcc experiment * working on cockroach tpcc experiment * correct cockroach tpcc payload profiling * correct cockroach tpcc payload profiling * add protocol env var to cockroach script * improve plot dimensions and spacing * minor modifications to rscoding util * minor updates to mirror script * add RS coding timing logging switch * minor updates to cockroach script * minor updates to cockroach script * adding settings to cockroach scripts * minor updates to cockroach script * adding settings to cockroach scripts * add cockroach auto benchmark script * finalize Crossword benchmark scripts * cleaning up helper scripts * cleaning up helper scripts * cleaning up helper scripts * adding zookeeper scripts * enable zookeeper cluster helper scripts * remove unnecessary home dir copy in geni scripts * remove unnecessary home dir copy in geni scripts * minor updates to toml parsing helper * minor updates to toml parsing helper * finishing up cocoroachdb experiment * adding zookeeper client support * added zookeeper clients script * update and finish zookeeper clients * adding etcd cluster scripts * adding etcd cluster scripts * adding etcd cluster scripts * added etcd clients script * update bodega tla+ gitignore * staging progress on leasing module * staging progress on leasing module * staging progress on leasing module * finish leasing module implementation * staging progress on MultiPaxos leader leases * staging progress on MultiPaxos leader leases * better Summerset module tasks code structure * better heartbeats management & fix Raft metadata serde * staging progress on MultiPaxos leader leases * staging progress on MultiPaxos leader leases * staging progress on MultiPaxos leader leases * proper mutual_leases test & improve timed unit tests * minor updates to leaseman module printing * typo fixes across the codebase * typo fixes across the codebase * fix github workflow dependencies issues * staging progress on near quorum reads * finish implementation of near quorum reads * minor updates to near quorum reads logging * bench client sequential keys & preloading feature * staging progress on loc_wr_grid bench script * staging progress on loc_wr_grid bench script * staging progress on microbenchmark scripts * add QuorumLeases and Bodega protocols skeletons * staging progress on QuorumLeases implementation * add config changes support to client-facing external APIs * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * staging progress on QuorumLeases implementation * fix excessive qleasing actions bug for QuorumLeases * fix blocking by higher_number at ensure_qlease_cleared * fix missing NoGrants handling on leader itself * minor fixes to typos * add urgent commit notifications option to QuorumLeases * clean up skeleton code for Bodega * clean up skeleton code for EPaxos * add peer_accept_bars guard to leader leases solutions * minor modifications to comments & structure * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * staging progress on EPaxos implementation * finish EPaxos implementation & various fixes * fix heartbeater module missing cancellation bug * fix EPaxos normal operation bugs * fix EPaxos explicit prepare bugs * minor updates to README * minor updates to formatting * minor updates to README & formatting * trimming for public repo --------- Co-authored-by: Guanzhou Hu <josehu@Joses-MacBook-Pro.local>
josehu07 · Nov 10, 2024 · 7e59a61 · 7e59a61
1 parent eaa8738
commit 7e59a61
Show file tree

Hide file tree

Showing 102 changed files with 10,252 additions and 839 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -37,10 +37,12 @@ log = { workspace = true }
 env_logger = { workspace = true }
 async-trait = "0.1"
 fixedbitset = { version = "0.5", features = ["serde"] }
+atomic_refcell = "0.1"
 flashmap = "0.1"
 futures = "0.3"
 bincode = "1.3"
 reed-solomon-erasure = { version = "6.0" }
+petgraph = "0.6"
 get-size = { version = "0.1", features = ["derive"] }
 linreg = "0.2"
 statistical = "1.0"

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@
 [![Proc tests status](https://github.com/josehu07/summerset/actions/workflows/tests_proc.yml/badge.svg)](https://github.com/josehu07/summerset/actions?query=josehu07%3Atests_proc)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
 
-Summerset is a distributed, replicated, protocol-generic key-value store supporting a wide range of state machine replication (SMR) protocols for research purposes. More protocols are actively being added.
+Summerset is a distributed, replicated, protocol-generic key-value store supporting a wide range of state machine replication (SMR) protocols for research purposes.
 
 <p align="center">
   <img width="360" src="./README.png">
@@ -20,24 +20,30 @@ Summerset is a distributed, replicated, protocol-generic key-value store support
 | `RepNothing` | Simplest protocol w/o any replication |
 | `SimplePush` | Pushing to peers w/o consistency guarantees |
 | `ChainRep` | Bare implementation of Chain Replication ([paper](https://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf)) |
-| `MultiPaxos` | Classic MultiPaxos protocol ([paper](https://www.microsoft.com/en-us/research/uploads/prod/2016/12/paxos-simple-Copy.pdf)) |
-| `RSPaxos` | MultiPaxos w/ Reed-Solomon erasure code sharding ([paper](https://madsys.cs.tsinghua.edu.cn/publications/HPDC2014-mu.pdf)) |
+| `MultiPaxos` | Classic MultiPaxos ([paper](https://www.microsoft.com/en-us/research/uploads/prod/2016/12/paxos-simple-Copy.pdf)) w/ modern features |
+| `EPaxos` | Leaderless-style Egalitarian Paxos ([paper](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf)) |
 | `Raft` | Raft with explicit log and strong leadership ([paper](https://raft.github.io/raft.pdf)) |
-| `CRaft` | Raft w/ erasure code sharding and fallback support ([paper](https://www.usenix.org/system/files/fast20-wang_zizhong.pdf)) |
+| `RSPaxos` | MultiPaxos w/ RS erasure code sharding ([paper](https://madsys.cs.tsinghua.edu.cn/publications/HPDC2014-mu.pdf)) |
+| `CRaft` | Raft w/ erasure code sharding and fallback ([paper](https://www.usenix.org/system/files/fast20-wang_zizhong.pdf)) |
+| `QuorumLeases` | Local reads at leaseholders when quiescent ([paper](https://www.cs.cmu.edu/~imoraru/papers/qrl.pdf)) |
 
 Formal TLA+ specification of some protocols are provided in `tla+/`.
 
+More exciting protocols are actively being added!
+
 </details>
 
 <details>
 <summary>Why is Summerset different from other codebases...</summary>
 
-- **Async Rust**: Summerset is written in Rust and demonstrates canonical usage of async programming structures backed by the [`tokio`](https://tokio.rs/) framework;
-- **Event-based**: Summerset adopts a channel-oriented, event-based system architecture; each replication protocol is basically just a set of event handlers plus a `tokio::select!` loop;
-- **Modularized**: Common components of a distributed KV store, e.g. network transport and durable logger, are cleanly separated from each other and connected through channels.
-- **Protocol-generic**: With the above two points combined, Summerset is able to support a set of different replication protocols in one codebase, with common functionalities abstracted out.
+- **Async Rust**: Summerset is written in Rust and demonstrates canonical usage of async programming structures backed by the [`tokio`](https://tokio.rs/) framework.
+- **Channel/event-based**: Summerset adopts a channel-oriented, event-based system architecture; each replication protocol is basically just a set of event handlers plus a `tokio::select!` loop. The entire codebase contains 0 explicit usage of `Mutex`.
+- **Modularized**: Common components of a distributed KV store, e.g. network transport and durable logger, as well as protocol-specific parallelism tasks, are cleanly separated from each other and connected through channels. This extends Go's philosophy of doing "synchronization by (low-cost) communication (of ownership transfers)".
+- **Protocol-generic**: With the above points combined, Summerset is able (and strives) to support a set of different replication protocols in one codebase, with common functionalities abstracted out, leaving each protocol's implementation concise and to-the-point.
+
+These design choices make protocol implementation in Summerset rather straight-forward and understandable, without making a sacrifice on performance.
 
-These design choices make protocol implementation in Summerset straight-forward and understandable, without any sacrifice on performance. Comments / issues / PRs are always welcome!
+Comments / issues / PRs are always welcome!
 
 </details>
 

diff --git a/scripts/distr_clients.py b/scripts/distr_clients.py
@@ -27,12 +27,24 @@
         "put_ratio",
         "ycsb_trace",
         "length_s",
+        "use_random_keys",
+        "skip_preloading",
         "norm_stdev_ratio",
         "unif_interval_ms",
         "unif_upper_bound",
     ],
-    "tester": ["test_name", "keep_going", "logger_on"],
-    "mess": ["pause", "resume"],
+    "tester": [
+        "test_name",
+        "keep_going",
+        "logger_on",
+    ],
+    "mess": [
+        "pause",
+        "resume",
+        "grantor",
+        "grantee",
+        "leader",
+    ],
 }
 
 
@@ -88,7 +100,9 @@ def glue_params_str(cli_args, params_list):
     return "+".join(params_strs)
 
 
-def compose_client_cmd(protocol, manager, config, utility, timeout_ms, params, release):
+def compose_client_cmd(
+    protocol, manager, config, utility, timeout_ms, params, release, near_id=None
+):
     cmd = [f"./target/{'release' if release else 'debug'}/summerset_client"]
     cmd += [
         "-p",
@@ -98,7 +112,19 @@ def compose_client_cmd(protocol, manager, config, utility, timeout_ms, params, r
         "--timeout-ms",
         str(timeout_ms),
     ]
+
     if config is not None and len(config) > 0:
+        # if dist_machs is set, near_id will be the node ID that's considered
+        # the closest to this client
+        # NOTE: dumb overwriting near_server_id field here for simplicity
+        if near_id is not None and "near_server_id" in config:
+            bi = config.index("near_server_id=") + 15
+            epi = config[bi:].find("+")
+            esi = config[bi:].find(" ")
+            ei = epi if esi == -1 else esi if epi == -1 else min(epi, esi)
+            ei = len(config) if ei == -1 else ei
+            assert config[bi:ei] == "0"
+            config = config[:bi] + str(near_id) + config[ei:]
         cmd += ["--config", config]
 
     cmd += ["-u", utility]
@@ -115,6 +141,7 @@ def compose_client_cmd(protocol, manager, config, utility, timeout_ms, params, r
 def run_clients(
     remotes,
     ipaddrs,
+    hosts,
     me,
     man,
     cd_dir,
@@ -137,10 +164,9 @@ def run_clients(
 
     # if dist_machs set, put clients round-robinly across this many machines
     # starting from me
-    hosts = list(remotes.keys())
-    hosts = hosts[hosts.index(me) :] + hosts[: hosts.index(me)]
+    td_hosts = hosts[hosts.index(me) :] + hosts[: hosts.index(me)]
     if dist_machs > 0:
-        hosts = hosts[:dist_machs]
+        td_hosts = td_hosts[:dist_machs]
 
     client_procs = []
     for i in range(num_clients):
@@ -153,6 +179,7 @@ def run_clients(
             timeout_ms,
             params,
             release,
+            near_id=hosts.index(me if dist_machs <= 1 else td_hosts[i % len(td_hosts)]),
         )
 
         proc = None
@@ -161,8 +188,8 @@ def run_clients(
                 i, cmd, capture_stdout=capture_stdout, cores_per_proc=pin_cores
             )
         else:
-            host = hosts[i % len(hosts)]
-            local_i = i // len(hosts)
+            host = td_hosts[i % len(td_hosts)]
+            local_i = i // len(td_hosts)
             if host == me:
                 # run my responsible clients locally
                 proc = run_process_pinned(
@@ -263,6 +290,12 @@ def run_clients(
         action="store_true",
         help="if set, expect there'll be a service halt",
     )
+    parser_bench.add_argument(
+        "--use_random_keys", action="store_true", help="if set, generate random keys"
+    )
+    parser_bench.add_argument(
+        "--skip_preloading", action="store_true", help="if set, skip preloading phase"
+    )
     parser_bench.add_argument(
         "--norm_stdev_ratio", type=float, help="normal dist stdev ratio"
     )
@@ -303,11 +336,26 @@ def run_clients(
     parser_mess.add_argument(
         "--resume", type=str, help="comma-separated list of servers to resume"
     )
+    parser_mess.add_argument(
+        "--grantor",
+        type=str,
+        help="comma-separated list of servers as configured grantors",
+    )
+    parser_mess.add_argument(
+        "--grantee",
+        type=str,
+        help="comma-separated list of servers as configured grantees",
+    )
+    parser_mess.add_argument(
+        "--leader",
+        type=str,
+        help="string form of configured leader ID (or empty string)",
+    )
 
     args = parser.parse_args()
 
     # parse hosts config file
-    base, repo, _, remotes, _, ipaddrs = utils.config.parse_toml_file(
+    base, repo, hosts, remotes, _, ipaddrs = utils.config.parse_toml_file(
         TOML_FILENAME, args.group
     )
     cd_dir = f"{base}/{repo}"
@@ -359,6 +407,7 @@ def run_clients(
     client_procs = run_clients(
         remotes,
         ipaddrs,
+        hosts,
         args.me,
         args.man,
         cd_dir,

diff --git a/scripts/distr_cluster.py b/scripts/distr_cluster.py
@@ -1,5 +1,6 @@
 import sys
 import os
+import time
 import signal
 import argparse
 
@@ -28,9 +29,9 @@
 
 
 class ProtoFeats:
-    def __init__(self, may_snapshot, has_heartbeats, extra_defaults):
+    def __init__(self, may_snapshot, has_hb_leader, extra_defaults):
         self.may_snapshot = may_snapshot
-        self.has_heartbeats = has_heartbeats
+        self.has_hb_leader = has_hb_leader
         self.extra_defaults = extra_defaults
 
 
@@ -39,9 +40,11 @@ def __init__(self, may_snapshot, has_heartbeats, extra_defaults):
     "SimplePush": ProtoFeats(False, False, None),
     "ChainRep": ProtoFeats(False, False, None),
     "MultiPaxos": ProtoFeats(True, True, None),
-    "Raft": ProtoFeats(True, True, None),
+    "EPaxos": ProtoFeats(True, False, lambda n, _: f"optimized_quorum=true"),
     "RSPaxos": ProtoFeats(True, True, lambda n, _: f"fault_tolerance={(n//2)//2}"),
+    "Raft": ProtoFeats(True, True, None),
     "CRaft": ProtoFeats(True, True, lambda n, _: f"fault_tolerance={(n//2)//2}"),
+    "QuorumLeases": ProtoFeats(True, True, lambda n, _: f"sim_read_lease=false"),
 }
 
 
@@ -120,7 +123,7 @@ def config_dict_to_str(d):
     if config is not None and len(config) > 0:
         config_dict.update(config_str_to_dict(config))
 
-    if PROTOCOL_FEATURES[protocol].has_heartbeats and hb_timer_off:
+    if PROTOCOL_FEATURES[protocol].has_hb_leader and hb_timer_off:
         config_dict["disable_hb_timer"] = "true"
 
     return config_dict_to_str(config_dict)
@@ -208,6 +211,7 @@ def launch_servers(
     file_midfix,
     fresh_files,
     pin_cores,
+    launch_wait,
 ):
     if num_replicas != len(remotes):
         raise ValueError(f"invalid num_replicas: {num_replicas}")
@@ -258,6 +262,9 @@ def launch_servers(
             )
         server_procs.append(proc)
 
+        if launch_wait:
+            time.sleep(1)  # NOTE: helps enforce server ID assignment
+
     return server_procs
 
 
@@ -311,6 +318,9 @@ def launch_servers(
     parser.add_argument(
         "--pin_cores", type=int, default=0, help="if > 0, set CPU cores affinity"
     )
+    parser.add_argument(
+        "--launch_wait", action="store_true", help="if set, wait 1s between launches"
+    )
     parser.add_argument(
         "--skip_build", action="store_true", help="if set, skip cargo build"
     )
@@ -414,6 +424,7 @@ def launch_servers(
         file_midfix,
         not args.keep_files,
         args.pin_cores,
+        args.launch_wait,
     )
 
     # register termination signals handler

diff --git a/scripts/local_clients.py b/scripts/local_clients.py
@@ -26,12 +26,24 @@
         "put_ratio",
         "ycsb_trace",
         "length_s",
+        "use_random_keys",
+        "skip_preloading",
         "norm_stdev_ratio",
         "unif_interval_ms",
         "unif_upper_bound",
     ],
-    "tester": ["test_name", "keep_going", "logger_on"],
-    "mess": ["pause", "resume"],
+    "tester": [
+        "test_name",
+        "keep_going",
+        "logger_on",
+    ],
+    "mess": [
+        "pause",
+        "resume",
+        "grantor",
+        "grantee",
+        "leader",
+    ],
 }
 
 
@@ -197,6 +209,12 @@ def run_clients(
         action="store_true",
         help="if set, expect there'll be a service halt",
     )
+    parser_bench.add_argument(
+        "--use_random_keys", action="store_true", help="if set, generate random keys"
+    )
+    parser_bench.add_argument(
+        "--skip_preloading", action="store_true", help="if set, skip preloading phase"
+    )
     parser_bench.add_argument(
         "--norm_stdev_ratio", type=float, help="normal dist stdev ratio"
     )
@@ -237,6 +255,21 @@ def run_clients(
     parser_mess.add_argument(
         "--resume", type=str, help="comma-separated list of servers to resume"
     )
+    parser_mess.add_argument(
+        "--grantor",
+        type=str,
+        help="comma-separated list of servers as configured grantors",
+    )
+    parser_mess.add_argument(
+        "--grantee",
+        type=str,
+        help="comma-separated list of servers as configured grantees",
+    )
+    parser_mess.add_argument(
+        "--leader",
+        type=str,
+        help="string form of configured leader ID (or empty string)",
+    )
 
     args = parser.parse_args()
 

diff --git a/scripts/local_cluster.py b/scripts/local_cluster.py
@@ -27,9 +27,9 @@
 
 
 class ProtoFeats:
-    def __init__(self, may_snapshot, has_heartbeats, extra_defaults):
+    def __init__(self, may_snapshot, has_hb_leader, extra_defaults):
         self.may_snapshot = may_snapshot
-        self.has_heartbeats = has_heartbeats
+        self.has_hb_leader = has_hb_leader
         self.extra_defaults = extra_defaults
 
 
@@ -38,9 +38,11 @@ def __init__(self, may_snapshot, has_heartbeats, extra_defaults):
     "SimplePush": ProtoFeats(False, False, None),
     "ChainRep": ProtoFeats(False, False, None),
     "MultiPaxos": ProtoFeats(True, True, None),
-    "Raft": ProtoFeats(True, True, None),
+    "EPaxos": ProtoFeats(True, False, lambda n, _: f"optimized_quorum=true"),
     "RSPaxos": ProtoFeats(True, True, lambda n, _: f"fault_tolerance={(n//2)//2}"),
+    "Raft": ProtoFeats(True, True, None),
     "CRaft": ProtoFeats(True, True, lambda n, _: f"fault_tolerance={(n//2)//2}"),
+    "QuorumLeases": ProtoFeats(True, True, lambda n, _: f"sim_read_lease=false"),
 }
 
 
@@ -105,7 +107,7 @@ def config_dict_to_str(d):
     if config is not None and len(config) > 0:
         config_dict.update(config_str_to_dict(config))
 
-    if PROTOCOL_FEATURES[protocol].has_heartbeats and hb_timer_off:
+    if PROTOCOL_FEATURES[protocol].has_hb_leader and hb_timer_off:
         config_dict["disable_hb_timer"] = "true"
 
     return config_dict_to_str(config_dict)