Conversation
There is a deadlock: when the cassandra store consumes its write queue, it needs to call a callback which acquires a lock on the aggmetric that generated the chunk being written. When that aggmetric pushes new chunk write requests into the queue, it is holding that same lock. This means that if the queue is full, we can end up in a situation where the aggmetric is blocked on pushing into the write queue while the write queue consumer is blocked on calling the callback, which is trying to acquire that same aggmetric lock. By switching the aggmetric properties that the callback updates to atomics, we should be able to avoid such a deadlock.
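To make the failure mode concrete, here is a minimal, self-contained Go sketch of that pattern. All names (aggMu, writeQueue, consumer) are invented for the illustration and are not the actual metrictank identifiers; a small buffered channel stands in for the bounded write queue.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative stand-ins, not the actual metrictank types: aggMu plays the
// role of the AggMetric lock, writeQueue the role of the store's bounded
// write queue.
var (
	aggMu      sync.Mutex
	writeQueue = make(chan int, 2) // small capacity so it fills up quickly
)

// consumer drains the write queue; for every request it invokes a callback
// that needs the AggMetric lock (like the chunk-save callback).
func consumer() {
	for req := range writeQueue {
		aggMu.Lock() // blocks forever once the producer below holds aggMu
		fmt.Println("chunk saved:", req)
		aggMu.Unlock()
	}
}

func main() {
	go consumer()

	// Producer: the AggMetric holds its own lock while queueing chunk writes.
	aggMu.Lock()
	for i := 0; i < 10; i++ {
		// Once the queue is full this send blocks while aggMu is still held,
		// and the consumer is blocked on aggMu.Lock(): a deadlock, which the
		// Go runtime reports as "all goroutines are asleep".
		writeQueue <- i
	}
	aggMu.Unlock()
}
```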
The property aggmetric.lastSaveFinish is never used, so there is no point maintaining its value. This removes the property and all places which maintain it.
mdata/aggmetric.go
Outdated
-	a.lastSaveFinish = ts
+	lastSaveStart := atomic.LoadUint32(&a.lastSaveStart)
+	if ts > lastSaveStart {
+		atomic.StoreUint32(&a.lastSaveStart, ts)
race condition!
I think you need something like
newVal := ts
prev := atomic.SwapUint32(&a.lastSaveStart, newVal)
for prev > newVal {
	newVal = prev
	prev = atomic.SwapUint32(&a.lastSaveStart, newVal)
}
(I took this from idx/memory.bumpLastUpdate())
Good catch, I'll update that.
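To spell out why the load/compare/store version races: two concurrent savers can both read the same old value, and the one carrying the smaller timestamp can then overwrite the larger one after it has been stored. A small sketch contrasting the two variants as standalone functions (the function names are made up for illustration):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// naiveBump is the racy load/compare/store pattern. With two goroutines the
// following interleaving loses the larger value:
//   G1 (ts=200): Load -> 50, 200 > 50
//   G2 (ts=100): Load -> 50, 100 > 50
//   G1: Store 200
//   G2: Store 100   <- 200 is overwritten by the smaller timestamp
func naiveBump(addr *uint32, ts uint32) {
	if ts > atomic.LoadUint32(addr) {
		atomic.StoreUint32(addr, ts)
	}
}

// swapBump is the suggested swap loop: swap unconditionally, and if a larger
// value was displaced, put it back. Regardless of interleaving, the stored
// value only ever grows, ending at the maximum of all values passed in.
func swapBump(addr *uint32, ts uint32) {
	prev := atomic.SwapUint32(addr, ts)
	for prev > ts {
		ts = prev
		prev = atomic.SwapUint32(addr, ts)
	}
}

func main() {
	var lastSaveStart uint32 = 50
	swapBump(&lastSaveStart, 200)
	swapBump(&lastSaveStart, 100) // no effect, 200 is kept
	fmt.Println(atomic.LoadUint32(&lastSaveStart)) // 200
}
```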
Remarkable find. This field was introduced in January 2017 (c596193) and we have basically never used it!
mdata/aggmetric.go
Outdated
@@ -410,7 +408,7 @@ func (a *AggMetric) persist(pos int) {
 	}

 	// Every chunk with a T0 <= this chunks' T0 is now either saved, or in the writeQueue.
-	a.lastSaveStart = chunk.Series.T0
+	atomic.StoreUint32(&a.lastSaveStart, chunk.Series.T0)
Similar to my comment in SyncChunkSaveState, we need to account for others setting higher values by using a swap loop.
@@ -492,8 +490,7 @@ func (a *AggMetric) add(ts uint32, val float64) {
 		log.Debugf("AM: %s Add(): created first chunk with first point: %v", a.key, a.chunks[0])
 		a.lastWrite = uint32(time.Now().Unix())
 		if a.dropFirstChunk {
-			a.lastSaveStart = t0
-			a.lastSaveFinish = t0
+			atomic.StoreUint32(&a.lastSaveStart, t0)
 		}
Same here, we need a swap loop to account for chunk saves coming in concurrently.
My main concern here is the mixing of mutexes with atomics; this may not be valid with the Go memory model, which is very loosely defined when it comes to atomics. Edit: I think that because you changed all access to a.lastSaveStart to be atomic, we should be good! When we ran into this in the past, we mixed atomic and non-atomic access, which was the more problematic case. (#945)
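For illustration, a sketch of the access discipline being described here; the type and field names are simplified stand-ins, not the actual AggMetric code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// aggMetric is a simplified stand-in for the real type. The discipline under
// discussion: lastSaveStart is touched only through sync/atomic, even on code
// paths that already hold the mutex, so atomic and non-atomic access to the
// same word are never mixed.
type aggMetric struct {
	sync.Mutex
	lastWrite     uint32 // protected by the mutex, plain access is fine
	lastSaveStart uint32 // accessed exclusively via sync/atomic, never directly
}

// add runs with the lock held, but still stores lastSaveStart atomically.
func (a *aggMetric) add(t0 uint32) {
	a.Lock()
	defer a.Unlock()
	a.lastWrite = t0                         // plain write, mutex-protected
	atomic.StoreUint32(&a.lastSaveStart, t0) // atomic even under the lock
}

func main() {
	m := &aggMetric{}
	m.add(100)
	// Readers outside the lock (e.g. the chunk-save callback path) also go
	// through sync/atomic, never through a plain read of the field.
	fmt.Println(atomic.LoadUint32(&m.lastSaveStart))
}
```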
The mixing of mutexes and atomics is definitely not nice; it introduces the risk that somebody who makes further changes in the future would just read one of those properties without using atomics, not being aware that reads must also be atomic. But similarly to the case of the …
There is no more question to be answered. I linked in my comment to the discussions that led me to believe we're fine.
Oh sorry, I hadn't reloaded this page in a while and didn't see that you updated the comment.
I moved the swap loop implementations into a helper function, as this is now a repeating pattern. Unfortunately they won't get inlined due to the …
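A minimal sketch of what such a helper could look like; the name bumpToMax and the call sites in the comments are assumptions for illustration, not taken from the actual change:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// bumpToMax is a hypothetical name for the shared helper: it atomically
// raises *loc to val if val is larger and never lowers it, regardless of how
// concurrent calls interleave. Note that, at least historically, the Go
// inliner does not inline functions containing loops.
func bumpToMax(loc *uint32, val uint32) {
	prev := atomic.SwapUint32(loc, val)
	for prev > val {
		val = prev
		prev = atomic.SwapUint32(loc, val)
	}
}

func main() {
	var lastSaveStart uint32

	// The call sites discussed above would then collapse to one-liners, e.g.
	// bumpToMax(&a.lastSaveStart, ts) in SyncChunkSaveState, or
	// bumpToMax(&a.lastSaveStart, chunk.Series.T0) in persist.
	bumpToMax(&lastSaveStart, 100)
	bumpToMax(&lastSaveStart, 90) // no effect, the value only grows
	fmt.Println(lastSaveStart)    // 100
}
```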
fix #726
Related #726
Fixes https://github.com/grafana/metrictank-ops/issues/536