Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
109801: sql: implement an interlock to modify unsafe settings r=dt a=knz

Fixes #109810.
Epic: CRDB-28893

As discussed [here](https://docs.google.com/document/d/11mWsfORExZxKqyMJfa6vg7LUzLEYhJ295NkaP1-bvL4/edit?disco=AAAA3lp44WY).

Prior to this patch, it was possible to easily automate `SET CLUSTER
SETTING` for unsafe cluster settings. This is undesirable; we want to
strongly incentivize a human operator paying attention to changes to
these settings.

This patch implements an *interlock*: a mechanism through which the
operator needs to perform two concurrent, related actions for the
change to take effect.

This works as follows:

1. the operator attempts to change a cluster setting from a SQL shell,
   for example:

   ```sql
   SET CLUSTER SETTING kv.raft_log.synchronization.unsafe.disabled = true;
   ```

2. the server fails the execution, with an error:

   ```
   ERROR: changing cluster setting
   "kv.raft_log.synchronization.unsafe.disabled" may cause cluster
   instability or data corruption. To confirm the change, run the
   following command before trying again:

   SET unsafe_setting_interlock_key = 'B7TxIA==';

   ```

3. the operator can then perform the recommended action, then
   try SET CLUSTER SETTING again. Because the key is properly
   set, the SET CLUSTER SETTING statement succeeds.

Also, `RESET` statements (or `SET CLUSTER SETTING ... = DEFAULT`) are
not subject to the interlock, as we assume that the default value is
safe for use.

111336: roachprod-microbench: update error tolerance r=renatolabs,srosenberg a=herkolategan

Previously the `lenient` flag that allowed errors during microbenchmarks to be tolerated would also result in the exit status being 0 even if errors occurred.

The error tolerance should only allow the run to continue, if errors are encountered, but still report the failures by signalling an exit code 1 so that failures can be tracked and reported on.

Release Note: None
Epic: None

111639: kvserver: skip `TestStoreRangeMergeRaftSnapshot` under metamorphic tests r=erikgrinaker a=erikgrinaker

Touches #111624.
Epic: none
Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Herko Lategan <herko@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
  • Loading branch information
4 people committed Oct 3, 2023
4 parents 412a6c3 + d9a02c9 + 36eafc3 + 74e8f0b commit 20d0094
Show file tree
Hide file tree
Showing 22 changed files with 298 additions and 25 deletions.
1 change: 1 addition & 0 deletions pkg/bench/tpcc/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ go_test(
embed = [":tpcc"],
deps = [
"//pkg/base",
"//pkg/kv/kvserver/logstore",
"//pkg/security/securityassets",
"//pkg/security/securitytest",
"//pkg/server",
Expand Down
16 changes: 8 additions & 8 deletions pkg/bench/tpcc/subprocess_commands_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,8 +19,9 @@ import (
"time"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
"github.com/cockroachdb/cockroach/pkg/util/envutil"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/workload"
Expand Down Expand Up @@ -61,17 +62,16 @@ var (
require.True(t, ok)

defer log.Scope(t).Close(t)
tc := testcluster.StartTestCluster(t, 1, base.TestClusterArgs{
ServerArgs: base.TestServerArgs{
StoreSpecs: []base.StoreSpec{{Path: storeDir}},
},
srv, db, _ := serverutils.StartServer(t, base.TestServerArgs{
StoreSpecs: []base.StoreSpec{{Path: storeDir}},
})
defer tc.Stopper().Stop(ctx)
defer srv.Stopper().Stop(ctx)

// Make the generation faster.
logstore.DisableSyncRaftLog.Override(context.Background(), &srv.SystemLayer().ClusterSettings().SV, true)

db := tc.ServerConn(0)
tdb := sqlutils.MakeSQLRunner(db)
tdb.Exec(t, "CREATE DATABASE "+databaseName)
tdb.Exec(t, "SET CLUSTER SETTING kv.raft_log.synchronization.disabled = true")
tdb.Exec(t, "USE "+databaseName)
tpcc, err := workload.Get("tpcc")
require.NoError(t, err)
Expand Down
18 changes: 18 additions & 0 deletions pkg/bench/tpcc/tpcc_bench_options_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ import (
"testing"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
)

type option interface {
Expand Down Expand Up @@ -45,6 +46,7 @@ func (o options) apply(cfg *benchmarkConfig) {
type benchmarkConfig struct {
workloadFlags []string
argsGenerator serverArgs
setupServer []func(b testing.TB, s serverutils.TestServerInterface)
setupStmts []string
}

Expand Down Expand Up @@ -80,4 +82,20 @@ func setupStmt(stmt string) option {
return setupStmtOption(stmt)
}

var _ = setupStmt // silence unused linter

func (s setupStmtOption) String() string { return string(s) }

func setupServer(fn func(tb testing.TB, s serverutils.TestServerInterface)) option {
return setupServerOption{fn}
}

type setupServerOption struct {
fn func(tb testing.TB, s serverutils.TestServerInterface)
}

func (s setupServerOption) apply(cfg *benchmarkConfig) {
cfg.setupServer = append(cfg.setupServer, s.fn)
}

func (s setupServerOption) String() string { return "setup server" }
9 changes: 7 additions & 2 deletions pkg/bench/tpcc/tpcc_bench_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ import (
"testing"

"github.com/cockroachdb/cockroach/pkg/base"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore"
"github.com/cockroachdb/cockroach/pkg/testutils"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
Expand Down Expand Up @@ -66,8 +67,9 @@ func BenchmarkTPCC(b *testing.B) {
StoreSpecs: []base.StoreSpec{{Path: td}},
}, cleanup
}),
setupStmt(`
SET CLUSTER SETTING kv.raft_log.synchronization.disabled = true`),
setupServer(func(tb testing.TB, s serverutils.TestServerInterface) {
logstore.DisableSyncRaftLog.Override(context.Background(), &s.SystemLayer().ClusterSettings().SV, true)
}),
}

for _, opts := range []options{
Expand Down Expand Up @@ -142,6 +144,9 @@ func (bm *benchmark) startCockroach(b testing.TB) {
s.Stopper().Stop(context.Background())
})

for _, fn := range bm.setupServer {
fn(b, s)
}
for _, stmt := range bm.setupStmts {
sqlutils.MakeSQLRunner(db).Exec(b, stmt)
}
Expand Down
10 changes: 6 additions & 4 deletions pkg/cmd/allocsim/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -538,13 +538,15 @@ func main() {
os.Exit(exitStatus)
}()

for i := 0; i < c.Cfg.NumNodes; i++ {
cfg := c.Cfg.PerNodeCfg[i]
cfg.ExtraEnv = append(cfg.ExtraEnv, "COCKROACH_DISABLE_RAFT_LOG_SYNCHRONIZATION_UNSAFE=true")
c.Cfg.PerNodeCfg[i] = cfg
}

c.Start(context.Background())
defer c.Close()
c.UpdateZoneConfig(1, 1<<20)
_, err := c.Nodes[0].DB().Exec("SET CLUSTER SETTING kv.raft_log.synchronization.disabled = true")
if err != nil {
log.Fatalf(context.Background(), "%v", err)
}
if len(config.Localities) != 0 {
a.runWithConfig(config)
} else {
Expand Down
6 changes: 1 addition & 5 deletions pkg/cmd/roachprod-microbench/executor.go
Original file line number Diff line number Diff line change
Expand Up @@ -439,11 +439,7 @@ func (e *executor) executeBenchmarks() error {

e.log.Printf("Completed benchmarks, results located at %s", e.outputDir)
if errorCount != 0 {
if e.lenient {
e.log.Printf("Ignoring errors in benchmark results (lenient flag was set)")
} else {
return errors.Newf("Found %d errors during remote execution", errorCount)
}
return errors.Newf("Found %d errors during remote execution", errorCount)
}
return nil
}
Expand Down
2 changes: 1 addition & 1 deletion pkg/cmd/roachprod-microbench/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@ func makeRunCommand() *cobra.Command {
cmd.Flags().StringSliceVar(&config.excludeList, "exclude", []string{}, "comma-separated regex of packages and benchmarks to exclude e.g. 'pkg/util/.*:BenchmarkIntPool,pkg/sql:.*'")
cmd.Flags().IntVar(&config.iterations, "iterations", config.iterations, "number of iterations to run each benchmark")
cmd.Flags().BoolVar(&config.copyBinaries, "copy", config.copyBinaries, "copy and extract test binaries and libraries to the target cluster")
cmd.Flags().BoolVar(&config.lenient, "lenient", config.lenient, "tolerate errors in the benchmark results")
cmd.Flags().BoolVar(&config.lenient, "lenient", config.lenient, "tolerate errors while running benchmarks")
cmd.Flags().BoolVar(&config.affinity, "affinity", config.affinity, "run benchmarks with iterations and binaries having affinity to the same node, only applies when more than one archive is specified")
cmd.Flags().BoolVar(&config.quiet, "quiet", config.quiet, "suppress roachprod progress output")

Expand Down
2 changes: 2 additions & 0 deletions pkg/kv/kvserver/client_merge_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3718,6 +3718,8 @@ func TestStoreRangeMergeRaftSnapshot(t *testing.T) {
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)

skip.UnderMetamorphicWithIssue(t, 111624)

// We will be testing the SSTs written on store2's engine.
var receivingEng, sendingEng storage.Engine
// All of these variables will be populated later, after starting the cluster..
Expand Down
8 changes: 5 additions & 3 deletions pkg/kv/kvserver/logstore/logstore.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,15 +38,17 @@ import (
"go.etcd.io/raft/v3/raftpb"
)

var disableSyncRaftLog = settings.RegisterBoolSetting(
// DisableSyncRaftLog disables raft log synchronization and can cause data loss.
var DisableSyncRaftLog = settings.RegisterBoolSetting(
settings.SystemOnly,
"kv.raft_log.disable_synchronization_unsafe",
"disables synchronization of Raft log writes to persistent storage. "+
"Setting to true risks data loss or data corruption on process or OS crashes. "+
"This not only disables fsync, but also disables flushing writes to the OS buffer. "+
"The setting is meant for internal testing only and SHOULD NOT be used in production.",
envutil.EnvOrDefaultBool("COCKROACH_DISABLE_RAFT_LOG_SYNCHRONIZATION_UNSAFE", false),
settings.WithName("kv.raft_log.synchronization.disabled"),
settings.WithName("kv.raft_log.synchronization.unsafe.disabled"),
settings.WithUnsafe,
)

var enableNonBlockingRaftLogSync = settings.RegisterBoolSetting(
Expand Down Expand Up @@ -232,7 +234,7 @@ func (s *LogStore) storeEntriesAndCommitBatch(
stats.PebbleBegin = timeutil.Now()
stats.PebbleBytes = int64(batch.Len())
wantsSync := len(m.Responses) > 0
willSync := wantsSync && !disableSyncRaftLog.Get(&s.Settings.SV)
willSync := wantsSync && !DisableSyncRaftLog.Get(&s.Settings.SV)
// Use the non-blocking log sync path if we are performing a log sync ...
nonBlockingSync := willSync &&
// and the cluster setting is enabled ...
Expand Down
1 change: 1 addition & 0 deletions pkg/server/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -140,6 +140,7 @@ go_library(
"//pkg/kv/kvserver/kvstorage",
"//pkg/kv/kvserver/liveness",
"//pkg/kv/kvserver/liveness/livenesspb",
"//pkg/kv/kvserver/logstore",
"//pkg/kv/kvserver/loqrecovery",
"//pkg/kv/kvserver/loqrecovery/loqrecoverypb",
"//pkg/kv/kvserver/protectedts",
Expand Down
5 changes: 4 additions & 1 deletion pkg/server/import_ts.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/keys"
"github.com/cockroachdb/cockroach/pkg/kv"
"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
"github.com/cockroachdb/cockroach/pkg/kv/kvserver/logstore"
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/server/status/statuspb"
"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
Expand Down Expand Up @@ -73,9 +74,11 @@ func maybeImportTS(ctx context.Context, s *topLevelServer) (returnErr error) {
// is not in use in the data set to import.
s.node.suppressNodeStatus.Set(true)

// Disable raft log synchronization to make the server generally faster.
logstore.DisableSyncRaftLog.Override(ctx, &s.cfg.Settings.SV, true)

// Disable writing of new timeseries, as well as roll-ups and deletion.
for _, stmt := range []string{
"SET CLUSTER SETTING kv.raft_log.synchronization.disabled = 'true';",
"SET CLUSTER SETTING timeseries.storage.enabled = 'false';",
"SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '99999h';",
"SET CLUSTER SETTING timeseries.storage.resolution_30m.ttl = '99999h';",
Expand Down
12 changes: 12 additions & 0 deletions pkg/settings/common.go
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ type common struct {
name SettingName
description string
visibility Visibility
unsafe bool
slot slotIdx
nonReportable bool
retired bool
Expand Down Expand Up @@ -119,6 +120,17 @@ func (c *common) setName(name SettingName) {
c.name = name
}

// setUnsafe is used to override the unsafe status of the setting.
// Refer to the WithUnsafe option for details.
func (c *common) setUnsafe() {
c.unsafe = true
}

// IsUnsafe indicates whether the setting is unsafe.
func (c *common) IsUnsafe() bool {
return c.unsafe
}

// SetOnChange installs a callback to be called when a setting's value changes.
// `fn` should avoid doing long-running or blocking work as it is called on the
// goroutine which handles all settings updates.
Expand Down
28 changes: 28 additions & 0 deletions pkg/settings/lint/lint_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,13 @@ func TestLintClusterSettingNames(t *testing.T) {
}
}

if strings.Contains(settingName, "unsafe") && !setting.IsUnsafe() {
return errors.Errorf("%s: setting name contains \"unsafe\" but is not marked unsafe (hint: use option settings.WithUnsafe)", settingName)
}
if setting.IsUnsafe() && !strings.Contains(settingName, "unsafe") {
return errors.Errorf("%s: setting marked as unsafe but its name does not contain \"unsafe\"", settingName)
}

return nil
}()
if nameErr != nil {
Expand Down Expand Up @@ -221,3 +228,24 @@ func TestLintClusterSettingDescriptions(t *testing.T) {
}
}
}

func TestLintClusterVisibility(t *testing.T) {
defer leaktest.AfterTest(t)()

skip.UnderRace(t, "lint only test")
skip.UnderDeadlock(t, "lint only test")
skip.UnderStress(t, "lint only test")

keys := settings.Keys(true /* forSystemTenant */)
for _, settingKey := range keys {
setting, ok := settings.LookupForLocalAccessByKey(settingKey, true /* forSystemTenant */)
if !ok {
t.Errorf("registry bug: setting with key %q not found", settingKey)
continue
}

if setting.IsUnsafe() && setting.Visibility() == settings.Public {
t.Errorf("%s: unsafe settings must not be public", setting.Name())
}
}
}
5 changes: 5 additions & 0 deletions pkg/settings/masked.go
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,11 @@ func (s *maskedSetting) Class() Class {
return s.setting.Class()
}

// IsUnsafe returns whether the underlying setting is unsafe.
func (s *maskedSetting) IsUnsafe() bool {
return s.setting.IsUnsafe()
}

// TestingIsReportable is used in testing for reportability.
func TestingIsReportable(s Setting) bool {
if _, ok := s.(*maskedSetting); ok {
Expand Down
6 changes: 6 additions & 0 deletions pkg/settings/options.go
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,12 @@ func WithVisibility(v Visibility) SettingOption {
}}
}

// WithUnsafe indicates that the setting is unsafe.
// Unsafe settings need an interlock to be updated. They also cannot be public.
var WithUnsafe SettingOption = SettingOption{commonOpt: func(c *common) {
c.setUnsafe()
}}

// WithPublic sets public visibility.
var WithPublic SettingOption = WithVisibility(Public)

Expand Down
4 changes: 4 additions & 0 deletions pkg/settings/setting.go
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,10 @@ type Setting interface {
// settings are still accessible to users, but they don't get listed out when
// retrieving all settings.
Visibility() Visibility

// IsUnsafe returns whether the setting is unsafe, and thus requires
// a special interlock to set.
IsUnsafe() bool
}

// NonMaskedSetting is the exported interface of non-masked settings. A
Expand Down
4 changes: 4 additions & 0 deletions pkg/sql/exec_util.go
Original file line number Diff line number Diff line change
Expand Up @@ -3649,6 +3649,10 @@ func (m *sessionDataMutator) SetSharedLockingForSerializable(val bool) {
m.data.SharedLockingForSerializable = val
}

func (m *sessionDataMutator) SetUnsafeSettingInterlockKey(val string) {
m.data.UnsafeSettingInterlockKey = val
}

// Utility functions related to scrubbing sensitive information on SQL Stats.

// quantizeCounts ensures that the Count field in the
Expand Down
4 changes: 4 additions & 0 deletions pkg/sql/sessiondatapb/local_only_session_data.proto
Original file line number Diff line number Diff line change
Expand Up @@ -449,6 +449,10 @@ message LocalOnlySessionData {
// forms of DDL inside explicit txns).
bool strict_ddl_atomicity = 111 [(gogoproto.customname) = "StrictDDLAtomicity"];

// UnsafeSettingInterlockKey needs to be set to a special string
// before SET CLUSTER SETTING is allowed on an unsafe setting.
string unsafe_setting_interlock_key = 113;

///////////////////////////////////////////////////////////////////////////
// WARNING: consider whether a session parameter you're adding needs to //
// be propagated to the remote nodes. If so, that parameter should live //
Expand Down
Loading

0 comments on commit 20d0094

Please sign in to comment.