
Fix deadlock when write queue full #1569

Merged
Dieterbe merged 4 commits into master from fix_deadlock_when_write_queue_full on Dec 12, 2019

Conversation

@replay (Contributor) commented Dec 10, 2019

There is a deadlock: when the Cassandra store consumes its write queue, it needs to call a callback which acquires a lock on the AggMetric that generated the chunk. When this AggMetric pushes new chunk write requests into the queue, it is holding that same lock. This means that if the queue is full, we can end up in a situation where the AggMetric is blocked on pushing into the write queue, while the write queue consumer is blocked on calling the callback, which is trying to acquire that same AggMetric lock.

By switching the AggMetric properties that the callback updates to atomics, we avoid such a deadlock.

The property aggmetric.lastSaveFinish is never used, so there is no point maintaining its value. This PR removes the property and all places which maintain it.

Related #726
Fixes https://github.com/grafana/metrictank-ops/issues/536
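For illustration, here is a minimal sketch of the pattern, with hypothetical, simplified names (the real code is the SyncChunkSaveState callback and the AggMetric write path): before, the save callback needed the AggMetric mutex, which the writer already holds while it blocks on a full write queue; after, the callback updates the field atomically and never takes the lock.

package sketch

import (
	"sync"
	"sync/atomic"
)

type AggMetric struct {
	sync.Mutex
	lastSaveStart uint32 // updated atomically after this change
}

// Before (simplified): the write-queue consumer's callback takes the AggMetric lock.
// If the AggMetric is blocked pushing into a full write queue while holding its lock,
// and the consumer is blocked here waiting for that lock, neither side can progress.
func (a *AggMetric) syncSaveStateLocked(ts uint32) {
	a.Lock()
	defer a.Unlock()
	if ts > a.lastSaveStart {
		a.lastSaveStart = ts
	}
}

// After (simplified): the callback updates the field atomically without the lock,
// so the consumer can always drain the queue and the blocked writer gets unstuck.
func (a *AggMetric) syncSaveStateAtomic(ts uint32) {
	prev := atomic.SwapUint32(&a.lastSaveStart, ts)
	for prev > ts { // another goroutine stored a higher value; put it back
		ts = prev
		prev = atomic.SwapUint32(&a.lastSaveStart, ts)
	}
}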

@replay requested a review from Dieterbe, Dec 10, 2019 17:06
-	a.lastSaveFinish = ts
+	lastSaveStart := atomic.LoadUint32(&a.lastSaveStart)
+	if ts > lastSaveStart {
+		atomic.StoreUint32(&a.lastSaveStart, ts)
@Dieterbe (Contributor) commented:
race condition!
I think you need something like

newVal := ts
prev := atomic.SwapUint32(&a.lastSaveStart, newVal)
for prev > newVal {
	newVal = prev
	prev = atomic.SwapUint32(&a.lastSaveStart, newVal)
}

(I took this from idx/memory.bumpLastUpdate())
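For reference, the same "only move forward" update can also be written with a CompareAndSwap loop; this is a sketch of an equivalent alternative (hypothetical helper name, not what this PR ends up using), assuming sync/atomic is imported:

// bumpForward advances *addr to val only if val is higher, retrying if another
// goroutine updated the value between the load and the compare-and-swap.
func bumpForward(addr *uint32, val uint32) {
	for {
		prev := atomic.LoadUint32(addr)
		if val <= prev || atomic.CompareAndSwapUint32(addr, prev, val) {
			return
		}
	}
}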

@replay (Contributor, Author) replied:
Good catch, I'll update that.

@Dieterbe (Contributor) commented:
> The property aggmetric.lastSaveFinish is never used, so there is no point maintaining its value. This PR removes the property and all places which maintain it.

Remarkable find. This field was introduced in January 2017 (c596193) and we have basically never used it!

@@ -410,7 +408,7 @@ func (a *AggMetric) persist(pos int) {
 	}

 	// Every chunk with a T0 <= this chunks' T0 is now either saved, or in the writeQueue.
-	a.lastSaveStart = chunk.Series.T0
+	atomic.StoreUint32(&a.lastSaveStart, chunk.Series.T0)
@Dieterbe (Contributor) commented:
Similar to my comment in SyncChunkSaveState: we need to account for other goroutines setting higher values by using a swap loop.

@@ -492,8 +490,7 @@ func (a *AggMetric) add(ts uint32, val float64) {
 		log.Debugf("AM: %s Add(): created first chunk with first point: %v", a.key, a.chunks[0])
 		a.lastWrite = uint32(time.Now().Unix())
 		if a.dropFirstChunk {
-			a.lastSaveStart = t0
-			a.lastSaveFinish = t0
+			atomic.StoreUint32(&a.lastSaveStart, t0)
 		}
@Dieterbe (Contributor) commented:
Same here: we need a swap loop to account for chunk saves coming in concurrently.

@Dieterbe (Contributor) commented Dec 11, 2019

My main concern here is the mixing of mutexes with atomics; this may not be valid under the Go memory model, which is very loosely defined when it comes to atomics.
I believe Go may do certain optimizations under the hood that are not compatible with our assumptions. This needs more research.

edit:
see https://groups.google.com/forum/#!msg/golang-dev/vVkH_9fl1D8/azJa10lkAwAJ and golang/go#5045 (comment)

So I think because you changed all access to a.lastSaveStart to be atomic, we should be good!

When we ran into this in the past, we mixed atomic and non-atomic access, which was the more problematic case. (#945)

@replay (Contributor, Author) commented Dec 11, 2019

> My main concern here is the mixing of mutexes with atomics; this may not be valid under the Go memory model, which is very loosely defined when it comes to atomics. I believe Go may do certain optimizations under the hood that are not compatible with our assumptions. This needs more research.

The mixing of mutexes and atomics is definitely not nice; it introduces the risk that somebody making further changes in the future reads one of these properties without using atomics, not being aware that reads must be atomic. But, similar to the case of the .LastUpdate property in the index, I think there are cases where it makes sense.
I'm not sure what you mean by "I believe Go may do certain optimizations under the hood that are not compatible with our assumptions. This needs more research." What is the question that needs to be answered?

@Dieterbe (Contributor) commented:
There is no further question to be answered. I linked in my comment to the discussions that led me to believe we're fine.

@replay (Contributor, Author) commented Dec 11, 2019

Oh sorry, I hadn't reloaded this page in a while and didn't see that you had updated the comment.

replay changed the title from "[WIP] Fix deadlock when write queue full" to "Fix deadlock when write queue full" on Dec 11, 2019
@replay (Contributor, Author) commented Dec 11, 2019

I moved the swap loop implementations into a helper function, as this is now a repeating pattern. Unfortunately it won't get inlined due to the for loop, but I guess that's still fine; the additional overhead of the function call shouldn't lead to a relevant slowdown when ingesting.
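The helper itself is not shown in this thread; as a rough sketch of what such a shared swap-loop helper could look like (the name is hypothetical and the actual implementation in the PR may differ):

// atomicBumpUint32 stores val into *addr unless *addr already holds a higher value.
// If the swap displaced a larger value written concurrently, it swaps that value back in.
func atomicBumpUint32(addr *uint32, val uint32) {
	prev := atomic.SwapUint32(addr, val)
	for prev > val {
		val = prev
		prev = atomic.SwapUint32(addr, val)
	}
}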

@Dieterbe (Contributor) commented:

fix #726

Dieterbe added this to the sprint-5 milestone on Dec 12, 2019
Dieterbe merged commit 3320cb3 into master on Dec 12, 2019
Dieterbe deleted the fix_deadlock_when_write_queue_full branch on Dec 12, 2019 at 17:46