Possible latency regression in latest release #72
At high load, every 30 seconds when synchronisation with the MongoDB cluster happens, many connections start taking longer than a second.
Hi @sshipkov

This change was added by PR #52 to help mitigate a different issue - can you confirm it's definitely this change that causes your issue? As part of the release process we tested this change against a sharded, replicated MongoDB cluster and we didn't see anything unusual.

What exactly is happening? Can you reproduce the issue with debug logging enabled and post the relevant output?

Dom
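For reference, turning on mgo's debug logging looks roughly like this - a minimal sketch that assumes the github.com/globalsign/mgo import path and a local MongoDB instance; adjust it to match however you import and dial mgo:

```go
package main

import (
	"log"
	"os"

	mgo "github.com/globalsign/mgo" // assumption: adjust to your mgo import path
)

func main() {
	// Send mgo's internal log output to stderr and enable verbose debug tracing.
	mgo.SetLogger(log.New(os.Stderr, "[mgo] ", log.LstdFlags))
	mgo.SetDebug(true)

	// Dial and use the session as normal; socket and cluster-sync activity
	// will now show up in the log output.
	session, err := mgo.Dial("localhost:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```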
I'm sorry, I won't be able to include debug output.
Hi @sshipkov

Thanks for taking the time to put together some annotated graphs, however they're not as helpful as you might think - they show that something was working and now it isn't. Anything from the application code through to the OS, CPU load, network card, drivers, switches, network utilisation, etc. could be causing what I assume is a throughput dropout (not a latency drop as your graphs show, which would be a good thing) - what has led you to believe #52 is the cause?

I'd suggest running a version of mgo without the socket change (instructions below) on N-1 machines, and leaving a single machine as a canary deployment to help isolate the problem to the changes introduced in #52. Avoid making any other changes, as that will make the results harder to reason about. Can you provide some details to help reproduce the issue?

On a side note, during our last testing run we saw ~35,000 op/s (mixed workload) and our 99.9th percentile was 19.1ms - we have not experienced any problems. Your first graph shows the 99.9% latency varying between 100ms and 1 second - if you're within a datacenter (not over a WAN link) this is a hugely worrying sign; it's likely there are problems elsewhere in your infrastructure. If that latency graph appeared in our environment, the right-hand side would be considered normal, and the left would be the worry.

I'm happy to revert the change if it is causing an issue, but not without proof it is the cause. The above questions will help us reproduce the issue in our testing environment and diagnose further.

Thanks,

To build your application without #52:
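Roughly, assuming a GOPATH-style checkout of the github.com/globalsign/mgo source; the revision below is a placeholder - substitute the last commit before #52 was merged:

```sh
# Pin your local mgo checkout to the revision immediately before PR #52
# was merged. <pre-52-revision> is a placeholder commit hash.
cd "$GOPATH/src/github.com/globalsign/mgo"
git checkout <pre-52-revision>
```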
Then build your application as normal. If you vendor mgo (you should), then check your dependency management tool's documentation.
Many measures were taken before changing mgo. Unfortunately, I do not have time to investigate mgo thoroughly.
Hi @sshipkov

Sorry for the delay, but it's the new year holidays! ;)

I might have misunderstood your graphs above - is the left of your first graph from before the change, or after? Either way, I replicated your setup as closely as I reasonably could, and I can see no latency regression from #52 - only a reduction in the latency variance (a good thing). Note the 99th percentile in particular. Values in milliseconds:

While the change in average latency is statistically significant, it's tiny. The throughput, however, has changed in an interesting way. Values are reads per second:

Because of what I imagine is a reduced lock duration, the average throughput has increased by ~2213 reads per second. It wasn't till after I ran the tests that I saw your …

I run a lot of benchmarks, but I'm not a statistician - so take all of the analysis above with a pinch of salt! If you would like to look at the latency measurements yourself, there's grouped data and a histogram (on the second sheet) here.
mgo/socket.go, line 552 in 5be15cc:

socket.Unlock()

must be before the return, after socket.conn.Write().
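A rough sketch of the ordering being suggested - illustrative only, using a simplified stand-in type rather than the actual socket.go source:

```go
package example

import (
	"net"
	"sync"
)

// Simplified stand-in for mgo's socket type, for illustration only.
type mongoSocket struct {
	sync.Mutex
	conn net.Conn
}

// The suggested shape: release the mutex immediately after
// socket.conn.Write() and before any return, so the lock is never
// held (or left locked) across an early-return path.
func (socket *mongoSocket) writeOp(buf []byte) error {
	socket.Lock()
	_, err := socket.conn.Write(buf)
	socket.Unlock() // before return, directly after socket.conn.Write()
	return err
}
```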