Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible latency regression in latest release #72

Closed
ghost opened this issue Dec 29, 2017 · 8 comments
Closed

Possible latency regression in latest release #72

ghost opened this issue Dec 29, 2017 · 8 comments
Labels

Comments

@ghost
Copy link

ghost commented Dec 29, 2017

mgo/socket.go

Line 552 in 5be15cc

socket.Unlock()

socket.Unlock() must be before return, after socket.conn.Write()

@ghost
Copy link
Author

ghost commented Dec 29, 2017

At a high load, every 30 seconds, when there is synchronization with the MongoDB cluster, many connections start to work longer than a second.
FailFast partially solves this problem.

@domodwyer
Copy link

Hi @sshipkov

This change was added by PR #52 to help mitigate a different issue - can you confirm it's definitely this change that causes your issue? As part of the release process we tested this change against a sharded, replicated MongoDB cluster but we didn't see anything unusual.

What exactly is happening? Can you reproduce the issue with debug logging enabled and post the relevant output?

Dom

@domodwyer domodwyer assigned ghost Dec 29, 2017
@domodwyer domodwyer changed the title Serious mistake Socket locking during cluster synchronisation Dec 29, 2017
@ghost
Copy link
Author

ghost commented Dec 29, 2017

mongo
999 percentile, ~ 3500rps read requests

@ghost
Copy link
Author

ghost commented Dec 29, 2017

network
smoother monogdb servers tx

@ghost
Copy link
Author

ghost commented Dec 29, 2017

I'm sorry, I will not include debugging

@domodwyer
Copy link

Hi @sshipkov

Thanks for taking the time to put together some annotated graphs, however they're not as helpful as you might think - they show something was working and then it isn't.

Anything between the application code through to OS, CPU load, network card, drivers, switches, network utilisation, etc could be causing what I assume is a throughput dropout (not latency drop as your graphs show, which would be a good thing) - what has led you to believe #52 is the cause?

I'd suggest running an version of mgo without the socket change (instructions below) on N-1 machines, and leaving a single machine as a canary deployment to help isolate the problem to the changes introduced in #52. Avoid making any other changes, as it will make the results harder to reason about.

Can you provide any details? Some helpful information to reproduce the issue, such as:

  • MongoDB version, storage engine, number of shards/replicas, failover frequency, etc
  • Query workload (read only, 10%r/90%w, from cache or "random" reads, etc)
  • Approximate dataset size, document size/complexity
  • Network (typical latency, loss percentage, max throughput, background throughput, WAN/LAN, etc)
  • Environment (OS, VM/physical machines, number of nodes, NUMA, transparent huge pages, etc)
  • Go version, mgo usage patterns (I assume you do Copy() and Close() rather than using a single session
  • Debug log

On a side note, during our last testing run we had ~35,000op/s (mixed workload) and our 99.9 percentile was 19.1ms - we have not experienced any problems. Your first graph shows a 99.9% latency varying between 100ms and 1 second - if you're within a datacenter (not over a WAN link) this is a hugely worrying sign - it's likely there are problems elsewhere in your infrastructure. If the latency graph appeared in our environment, the right hand side would be considered normal, and the left would be the worry.

I'm happy to revert the change if it is causing an issue, but not without proof it is the cause. The above questions will help reproduce the issue in our testing environment and diagnose further.

Thanks,
Dom


To build your application without #52:

cd $GOPATH/src/github.com/globalsign/mgo
git pull
git revert 5be15cc6a44937b6937a9487e1633f8a5fb25e4a

Then build your application as normal. If you vendor mgo (you should) then check your dependency management tool's documentation.

@ghost
Copy link
Author

ghost commented Dec 30, 2017

  • MongoDB 3.4.9 Community, WT, 1 replica
  • 95% read, from cache, random reads
  • 3mb, 88b, 32000 documents
  • 0, 0, ~10gb, lan (own data center, high-performance network)
  • MongoDB is located on 2x physical machines (10 cores, 380 gb of RAM), debian, lxc, cpu load <2%, la < 1
  • golang 1.9.2, gomaxproc 16, mgo Copy/Close (poolLimit 8), apps in kubernetes cluster, 6 pods on different physical nodes (pod cpu load <0.2m)

Many measures have been taken before changing mgo.
Changes in the mgo were made at night, the influence of other factors is minimal.
Now all requests are executed faster than 0ms.

Unfortunately, I do not have time to thoroughly investigate mgo.
I will later try to test the mgo update on other services, if the result is the same, we will just use 10gen/mgo fork ;)

@domodwyer
Copy link

Hi @sshipkov

Sorry for the delay, but it's the new year holidays! ;)

I might have misunderstood your graphs above - is the left of move socket.Unlock() with #52 and the right without #52? Or the other way around?

Either way, I replicated your setup as close as I reasonably could, and I can see no latency regression from #52 - only a reduction in the latency variance (a good thing). Note the 99th percentile (p99) is the same for both unlike the graphs above:

Values in milliseconds:

x => r2017.10.11-select-zipfian-latency.log 
y => r2017.11.06-select-zipfian-latency.log 

          n min   max median   average    stddev  p99
x 200000009 0.5 240.2    0.5 0.5159448 0.1934402 1.05
y 200000009 0.5 240.2    0.5 0.5163596 0.1721280 1.05

.3                 0.4                 0.5                0.6                   
+|------------------+-------------------+------------------+-------------------|
                                        +                                       
1                                       |                                       
                                        +                                       
                                        +                                       
2                                       |                                       
                                        +                                       
Legend: 1=data$x, 2=data$y

At 95% probablitiy:
===> average is statistically significant     (p=0.000000, diff ~0.000415)
===> variance is statistically significant    (p=0.000000)

While the change in average latency is statistically significant, it's tiny (0.0004148ms) so is ignored. From the above I see no changes in average latency between c4a7121 and 5be15cc. The variance (spread of measurements) is reduced - this is a good thing, and means we have less requests taking a relatively long time.

The throughput however has changed in an interesting way:

Values are reads per second:

x => r2017.10.11-select-zipfian-throughput.log 
y => r2017.11.06-select-zipfian-throughput.log 

     n   min   max median  average   stddev      p99
x 2865 56970 72599  67608 67584.57 2382.520 72044.52
y 2865 57506 74826  69783 69797.62 2435.384 74466.88

                      65000                    70000                    75000   
 |----------------------+------------------------+------------------------+----|
                             +-------+-------+                                  
1    ------------------------|       |       |-----------------                 
                             +-------+-------+                                  
                                        +-------+-------+                       
2                   --------------------|       |       |-----------------      
                                        +-------+-------+                       
Legend: 1=data$x, 2=data$y

At 95% probablitiy:
===> average is statistically significant     (p=0.000000, diff ~2213.048866)
===> variance not statistically significant   (p=0.240275)

Because of what I imagine is due to a reduced lock duration, the average throughput has increased by ~2213 reads per second.

It wasn't till after I ran the tests that I saw your poolLimit 8 comment - that is a really small number of connections to allow and to me is a really bad idea - are you sure this is the right thing to do for your workload? Once those 8 connections are busy, mgo blocks waiting for a connection to become free - this could easily be the source of your high 99th percentile measurements. It is not a good idea to use SetPoolLimit() as a concurrency throttle - see this post by @niemeyer (the original mgo author) for a clear explanation why.

I run a lot of benchmarks, but I'm not a statistician - so take it all the analysis above with a pinch of salt! If you would like to look at the latency measurements yourself, there's grouped data and a histogram (on the second sheet) here.

@domodwyer domodwyer changed the title Socket locking during cluster synchronisation Possible latency regression in latest release Jan 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant