Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: effectively eliminate "intent missing and record aborted" ambiguous errors #52566

Closed
nvanbenschoten opened this issue Aug 10, 2020 · 0 comments · Fixed by #53156
Closed
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@nvanbenschoten
Copy link
Member

Address the following TODO:

// TODO(nvanbenschoten): QueryTxn will materialize an ABORTED transaction
// record if one does not already exist. If we are certain that no actor
// will ever persist an ABORTED transaction record after a COMMIT record is
// GCed and we returned whether the record was synthesized in the QueryTxn
// response then we could use the existence of an ABORTED transaction record
// to further isolates the ambiguity caused by the loss of information
// during intent resolution. If this error becomes a problem, we can explore
// this option.

While here, we'll also want to make sure that only a transaction's own coordinator will ever GC its record below the TxnCleanupThreshold (1 hour). This seems to already be the case everywhere except for in the rangefeedTxnPusher, which both pushes transactions and GCs their record. We should stop doing the second part.

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. A-kv-transactions Relating to MVCC and the transactional model. labels Aug 10, 2020
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches cockroachdb#52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches cockroachdb#52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 20, 2020
This patch eliminates an AmbiguousResultError case. When a committer
finds an intent to be missing after having STAGED the txn, it tries to
figure out if the transaction has been committed or aborted behind its
back. It could be either: another actor might have performed recovery
and found the txn to be implicitly committed, or the recovery might
succeed in preventing an intent from being written and thus the
transaction will be aborted.
The commiter thus reads the txn record and tries to figure out what the
deal is. The case in which a COMMITed txn record is found is
unambiguous. The case when the txn record is not found is ambiguous: it
could have been GCed, and the status is lost. This case stays ambiguous.
The case where the transaction record is found to be ABORTED used to be
treated as ambiguous, but this patch makes it unambiguous. It used to be
ambiguous for a bad reason: the QueryTxn request used to read the txn
record conflated an ABORTED record with a missing record that cannot be
created anymore because of the tscache protections.

This patch adds more output to QueryTxn, specifying if the record has
been found or if it has been synthesized. This allows the DistSender to
eliminate the ambiguity when the record is found.

This patch relies on the fact that nobody writes an ABORTED record (or,
any record) after the record is GC'ed.

Fixes cockroachdb#52566
Fixes cockroachdb#52186

One of the issues above is a TPCC failure with an ambiguous error. I'm
not sure that this was the ambiguity in question, but it seems likely.

Release note (general change): Some cases where AmbiguousResultErrors
were produced under high load now return non-ambiguous, retriable
errors.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Aug 20, 2020
Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches cockroachdb#52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.
craig bot pushed a commit that referenced this issue Aug 20, 2020
52938: sql: implement idle_in_transaction_session_timeout r=RichardJCai a=RichardJCai

Release note (sql change): Implement IdleInTransactionSessionTimeout
variable to allow terminating sessions that are idle in a transaction
past the provides threshold.

Set the variable by using SET idle_in_transaction_session_timeout = 'time'.

Sessions that are idle in OPEN, ABORTED and DONE(COMMITWAIT) transaction states
will be terminated if the user idles longer than the threshold time.

Fixes #5924

53041: cli/debug: add SQL-decoding to printed keys r=dt a=dt

This adds a `--decode-as-table` flag to `debug keys --values`.

When passed a base64 encoded Descriptor, which can be obtained via
`select encode(descriptor, 'base64') from system.descriptor where id = x`,
it will attempt to decode KVs using a SQL row-fetcher backed by that
descriptor and print them as col1=val1, col2=val2,...

Release note: none.

53060: backupccl: refactor backup destination resolution r=dt a=pbardea

During backup planning we need to resolve the backup destination. This
means that we need to:
1. Determine if we're going to resolve to a backup collection or not
2. Determine if we should auto-append
3. Read any encryption options we may need if we determine that we're
donig an incremental backup
4. Find the previous backups in the chain if we're doing an incremental
backup.

Release note: None

53066: opt: add rule to replace ST_Distance with ST_DWithin r=DrewKimball a=DrewKimball

`ST_DWithin` is equivalent to the expression `ST_Distance <= x`.
Similar reformulations can be made for different comparison operators
(<, >, >=). 

This PR consists of two commits. The first commit adds an internal builtin
function `ST_DWithinExclusive`, which is equivalent to `ST_Distance < x`
but is able to exit early.

The second commit adds two rules that replace expressions of the form 
`ST_Distance <= x` with either `ST_DWithin` or `ST_DWithinExclusive` 
depending on the comparison operator used. This replacement is desirable 
because the `ST_DWithin` function can exit early, while `ST_Distance` cannot
do the same. `ST_DWithin` can also be used to generate an inverted index scan.

Fixes #52028

Release note: None

53085: colexec: fix dynamic batch behavior in hash joiner with low mem limit r=yuzefovich a=yuzefovich

The disk-spilling logic in the hash joiner relies on the assumption that
a memory limit error can only occur during the building of the hash
table. However, this assumption was recently broken in the case of the
low `workmem` limit setting by the dynamic batch size work - it now
became possible for an error to occur during output population. As
a result, we might have already consumed some batches from the left
input and have emitted partial results before we fall back to the
external hash joiner, and this would result in some tuples not being
emitted at all (if we have a pending batch from the left).

This problem is now fixed by restoring the invariant with the
introduction of an unlimited allocator into the in-memory hash joiner.
The unlimited allocator is used only to populate the output (meaning
that we have already successfully built the hash table from the right
side) at the phase of the hash join algorithm when we consume the left
side one batch at a time and populate the output also one batch at
a time, so such choice is acceptable. This additionally allows to
properly account for the memory used by the output batch.

Note that the only other disk-spilling operator (different sorters that
fallback to the external sort) weren't impacted by this because they are
able to correctly export the buffered tuples regardless of when the
memory limit error occurs.

Fixes: #52859.

Release note: None

53146: rangefeed: don't delete txn records from the rngfeed pusher r=andreimatei a=andreimatei

Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches #52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.

Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Drew Kimball <drewk@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 21, 2020
52704: sql: add support for transaction level stats collection  r=solongordon a=arulajmani

This patch adds the framework to start collecting transaction level
statistics and return them on the `statements` end point. Previously,
the response on the end point was being faked. This patch replaces the
fake response by an actual response with some fields correctly
populated. Other fields will be populated in upcoming patches, and
currently contain garbage data.

The correctly populated fields on `CollectedTransactionStatistics` are:
1. StatementIDs
2. Count
3. ServiceLat
4. App
5. RetryLat
6. MaxRetries
7. CommitLat


Release note: None

53156: kvserver,kvclient: disambiguate aborted txn case r=andreimatei a=andreimatei

This patch eliminates an AmbiguousResultError case. When a committer
finds an intent to be missing after having STAGED the txn, it tries to
figure out if the transaction has been committed or aborted behind its
back. It could be either: another actor might have performed recovery
and found the txn to be implicitly committed, or the recovery might
succeed in preventing an intent from being written and thus the
transaction will be aborted.
The commiter thus reads the txn record and tries to figure out what the
deal is. The case in which a COMMITed txn record is found is
unambiguous. The case when the txn record is not found is ambiguous: it
could have been GCed, and the status is lost. This case stays ambiguous.
The case where the transaction record is found to be ABORTED used to be
treated as ambiguous, but this patch makes it unambiguous. It used to be
ambiguous for a bad reason: the QueryTxn request used to read the txn
record conflated an ABORTED record with a missing record that cannot be
created anymore because of the tscache protections.

This patch adds more output to QueryTxn, specifying if the record has
been found or if it has been synthesized. This allows the DistSender to
eliminate the ambiguity when the record is found.

This patch relies on the fact that nobody writes an ABORTED record (or,
any record) after the record is GC'ed.

Fixes #52566
Fixes #52186

One of the issues above is a TPCC failure with an ambiguous error. I'm
not sure that this was the ambiguity in question, but it seems likely.

Release note (general change): Some cases where AmbiguousResultErrors
were produced under high load now return non-ambiguous, retriable
errors.

53157: build: add libgeos libs to docker image for publish release r=pbardea a=jlinder

Before: The script was written to be backportable (without adding the
libgeos libs to the docker image).

Why: Geospatial arrived in master (20.2) and the libgeos libs need to be
included with the docker image.

Now: The libs are included in the docker image generated in the
teamcity-publish-release.sh script.

Release note: None

Co-authored-by: Solon Gordon <solon@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: James H. Linder <jamesl@cockroachlabs.com>
@craig craig bot closed this as completed in 3dcb6f1 Aug 21, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2020
Fixes cockroachdb#53189.
Fixes cockroachdb#53282.
Fixes cockroachdb#53285.
Fixes cockroachdb#53469.

3dcb6f1 improved the detection of missing intents during parallel commit
attempts to distinguish between certain classes of ambiguous errors and
transaction aborted errors. This was a nice improvement, as it
dramatically reduced the number of situations where we returned
ambiguous errors during normal operation (see cockroachdb#52566).

However, in introducing a new location where transaction retry errors
could be generated, it accidentally violated the invariant that all
transaction retry errors have transaction protos attached to them. This
was causing panics in TPC-C roachtests. This commit fixes this issue by
properly attaching transaction protos to these new errors, along with
any others returned from `detectIntentMissingDueToIntentResolution`.

Release justification: bug fix
craig bot pushed a commit that referenced this issue Sep 1, 2020
53589: jobs: improve job adoption r=ajwerner a=ajwerner

#### jobs: don't hold mutex during adoption, launch in parallel

#### jobs: break up new stages of job lifecycle movement

In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into the same stage and we invoked that stage on each adoption
interval and on each sent to the adoption channel.

These stages are:

 * Cancel jobs
 * Serve pause and cancel requests
 * Delete claims due to dead sessions
 * Claim jobs
 * Process claimed jobs

This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.

After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop as it was exceedingly frequent at 1s for no obvious reason.

Much of the testing changes are due to this cancelation loop duration
change. The tests in this package now run 3x faster (10s vs 30s).

Then, upon sends on the adoption channel, we just process claimed jobs.
When the adoption interval rolls around, then we attempt to both claim
and process jobs.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

53697: kv: attach txn to error from detectIntentMissingDueToIntentResolution r=nvanbenschoten a=nvanbenschoten

Fixes #53189.
Fixes #53282.
Fixes #53285.
Fixes #53469.

3dcb6f1 improved the detection of missing intents during parallel commit attempts to distinguish between certain classes of ambiguous errors and transaction aborted errors. This was a nice improvement, as it dramatically reduced the number of situations where we returned ambiguous errors during normal operation (see #52566).

However, in introducing a new location where transaction retry errors could be generated, it accidentally violated the invariant that all transaction retry errors have transaction protos attached to them. This was causing panics in TPC-C roachtests. This commit fixes this issue by properly attaching transaction protos to these new errors, along with any others returned from `detectIntentMissingDueToIntentResolution`.

Release justification: bug fix

53708: builtins: implement ST_MemCollect and ST_MemUnion r=otan a=erikgrinaker

These are implemented as aliases for ST_Collect and ST_Union. In PostGIS
these are memory-optimized versions, but for now it should be sufficient
to simply make them functional.

Release justification: low risk, high benefit changes to existing functionality

Release note (sql change): Implement the geometry aggregate builtins
`ST_MemCollect` and `ST_MemUnion`.

Closes #48984.
Closes #48986.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Erik Grinaker <erik@grinaker.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants