Possible latency regression in latest release #72
At high load, every 30 seconds when synchronisation with the MongoDB cluster happens, many connections start taking longer than a second.
Hi @sshipkov

This change was added by PR #52 to help mitigate a different issue - can you confirm it's definitely this change that causes your issue? As part of the release process we tested this change against a sharded, replicated MongoDB cluster and we didn't see anything unusual.

What exactly is happening? Can you reproduce the issue with debug logging enabled and post the relevant output?

Dom
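For reference, turning on mgo's debug logging looks roughly like this - a minimal sketch that assumes the github.com/globalsign/mgo import path and a local MongoDB instance; adjust it to match however you import and dial mgo:

```go
package main

import (
	"log"
	"os"

	mgo "github.com/globalsign/mgo" // assumption: adjust to your mgo import path
)

func main() {
	// Send mgo's internal log output to stderr and enable verbose debug tracing.
	mgo.SetLogger(log.New(os.Stderr, "[mgo] ", log.LstdFlags))
	mgo.SetDebug(true)

	// Dial and use the session as normal; socket and cluster-sync activity
	// will now show up in the log output.
	session, err := mgo.Dial("localhost:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```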
I'm sorry, I won't be able to include debug output.
Hi @sshipkov

Thanks for taking the time to put together some annotated graphs, however they're not as helpful as you might think - they show that something was working and now it isn't. Anything from the application code through to the OS, CPU load, network card, drivers, switches, network utilisation, etc. could be causing what I assume is a throughput dropout (not a latency drop as your graphs show, which would be a good thing) - what has led you to believe #52 is the cause?

I'd suggest running a version of mgo without the socket change (instructions below) on N-1 machines, and leaving a single machine as a canary deployment to help isolate the problem to the changes introduced in #52. Avoid making any other changes, as that will make the results harder to reason about. Can you provide some details to help reproduce the issue?

On a side note, during our last testing run we saw ~35,000 op/s (mixed workload) and our 99.9th percentile was 19.1ms - we have not experienced any problems. Your first graph shows the 99.9% latency varying between 100ms and 1 second - if you're within a datacenter (not over a WAN link) this is a hugely worrying sign; it's likely there are problems elsewhere in your infrastructure. If that latency graph appeared in our environment, the right-hand side would be considered normal, and the left would be the worry.

I'm happy to revert the change if it is causing an issue, but not without proof it is the cause. The above questions will help us reproduce the issue in our testing environment and diagnose further.

Thanks,

To build your application without #52:
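Roughly, assuming a GOPATH-style checkout of the github.com/globalsign/mgo source; the revision below is a placeholder - substitute the last commit before #52 was merged:

```sh
# Pin your local mgo checkout to the revision immediately before PR #52
# was merged. <pre-52-revision> is a placeholder commit hash.
cd "$GOPATH/src/github.com/globalsign/mgo"
git checkout <pre-52-revision>
```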
Then build your application as normal. If you vendor mgo (you should), then check your dependency management tool's documentation.
Many measures were taken before changing mgo. Unfortunately, I do not have time to investigate mgo thoroughly.
Hi @sshipkov

Sorry for the delay, but it's the new year holidays! ;)

I might have misunderstood your graphs above - is the left of your first graph from before the change, or after? Either way, I replicated your setup as closely as I reasonably could, and I can see no latency regression from #52 - only a reduction in the latency variance (a good thing). Note the 99th percentile in particular. Values in milliseconds:

While the change in average latency is statistically significant, it's tiny. The throughput, however, has changed in an interesting way. Values are reads per second:

Because of what I imagine is a reduced lock duration, the average throughput has increased by ~2213 reads per second. It wasn't till after I ran the tests that I saw your …

I run a lot of benchmarks, but I'm not a statistician - so take all of the analysis above with a pinch of salt! If you would like to look at the latency measurements yourself, there's grouped data and a histogram (on the second sheet) here.
mgo/socket.go, line 552 in 5be15cc:

socket.Unlock()

must be before the return, after socket.conn.Write().
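A rough sketch of the ordering being suggested - illustrative only, using a simplified stand-in type rather than the actual socket.go source:

```go
package example

import (
	"net"
	"sync"
)

// Simplified stand-in for mgo's socket type, for illustration only.
type mongoSocket struct {
	sync.Mutex
	conn net.Conn
}

// The suggested shape: release the mutex immediately after
// socket.conn.Write() and before any return, so the lock is never
// held (or left locked) across an early-return path.
func (socket *mongoSocket) writeOp(buf []byte) error {
	socket.Lock()
	_, err := socket.conn.Write(buf)
	socket.Unlock() // before return, directly after socket.conn.Write()
	return err
}
```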