-
Notifications
You must be signed in to change notification settings - Fork 624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
executeBatch can cause a dead lock to a nil deref causing c.prepMu not to be unlocked. #179
executeBatch can cause a dead lock to a nil deref causing c.prepMu not to be unlocked. #179
Conversation
… then subsequent deadlock can occur
This is a an interesting corner case - do you know if there is a tractable way of reproducing the issue? |
Because golang map iterator is randomized the code path becomes pretty nondeterministic. To reproduce this you can (while being faithful to production conditions) Due to the random iteration it may not always happen but essentially what you want is while one routine is in the middle of executing a prepare request for |
The change looks good to me - I might have a look to see if your instructions can reproduce the symptom. |
So I had a go at trying to replicate your steps. Here is the injected sleep:
And here is the test routine:
I've run this against the current master (i.e. still containing the bug) but unfortunately I haven't been able to reproduce the error - any idea why? |
You actually have to muck around a bit more with the internals. For example you need to clear
Now even then using your function the panic may still not occur because the map iterator may always select the right prepared statement sidestepping the error. Still we can cause the error by removing the
Now since every
Should panic with
on the first try. |
If I comment out the break statement on line 514, I do indeed get the same panic on line 511. |
That said, I've been running the test with break commented back in, and it exceeds the maximum timeout of the testing framework (600 secs). |
I'm not 100% sure that changing the program flow (by commenting out the break statement) is a good way to verify this bug. I'm happier to accept the fact that we need to inject a pause into the actual production code to reproduce the symptom, since adverse scheduling is an issue that you're much more likely to see under high load. So whilst the change you are suggesting looks pretty good, it would make me hesitate less to merge the patch if either:
Sorry that I appear to be blocking this change, but my primary focus is maintaining the stability of the library (even though it could well be more stable with this change, as it seems to be helping you in a real world deployment). |
The problem isn't the presence of the break, the problem is the iteration order. I'm having a hard time reproducing the error because I can't influence the iteration order of a map. Consider we have this map (iterated in this order): and lets say we are looking for Now if we were to iterate in this order (maps aren't guaranteed to be any order): and we are still looking for This function will also crash (doesn't require the removal of the break) but depends on func TestNilFlight(t *testing.T) {
session := createSession(t)
defer session.Close()
for i := 0; i < 10000; i++ {
var w sync.WaitGroup
w.Add(2)
go func() {
query := session.Query("SELECT host_id from system.local")
if err := session.Pool.Pick(nil).executeQuery(query).Close(); err != nil {
w.Done()
t.Fatalf("Failed to execute query: %v", err)
}
w.Done()
}()
stmt, conn := injectInvalidPreparedStatement(t, session, "test_reprepare_statement")
go func() {
time.Sleep(1 * time.Second)
batch := session.NewBatch(UnloggedBatch)
batch.Query(stmt, "bar")
if err := conn.executeBatch(batch); err != nil {
w.Done()
t.Fatalf("Failed to execute query for reprepare statement: %v", err)
}
w.Done()
}()
w.Wait()
conn.executeQuery(session.Query("DROP TABLE test_reprepare_statement"))
}
} |
Secondly, in the batch goroutine, if you |
I think gocql is probably better off with this change, so what I propose is that we merge this if nobody is against the idea. So if either @phillipCouto, @tux21b or @Zariel don't raise any objections within a reasonable timeframe, I'll merge this. |
Ok so I just reviewed it and I think this can fly without a test case. Reason is flight being nil or flight.info being nil shouldn't trigger a panic and gocql should continue to operate and treat it as the flight information as missing and worst case reprepare the statement. This LGTM. Really sorry for the long delay on reviewing this. |
executeBatch can cause a dead lock to a nil deref causing c.prepMu not to be unlocked.
Currently in a high load situation
flight
orflight.info
can be nil when execute batch is called (the prepareStatement call can take a long time or even timeout, during this time flight.info can be nil).As executebatch searches for the prepared statement to reset, it can come across such a statement and panic. This panic causes
c.prepMu
not to be unlocked causing a dead lock.