excessive 2-message latency #45
I straced both sides and the output is not exactly shocking:

p1, with call_fetch:
    write(4, "\2\ncall_fetch\24\2O\7\23\0\0\0\23\0\2\1+\24\2OO", 29) = 29
p1, with call then fetch:
    write(4, "'\24\2O\7\24\0\0\0\23\0\2\1+\24\2OO", 18) = 18

p2, with call_fetch:
    select(15, [0 11 12 13 14], NULL, NULL, {10, 0}) = 1 (in [0], left {8, 622846})
p2, with call then fetch:
    select(15, [0 11 12 13 14], NULL, NULL, {10, 0}) = 1 (in [0], left {8, 853315})

I can get rid of the 0-timeout selects by avoiding select entirely when the workqueue is non-empty, but this makes no difference.
On Mac, I see reasonable results. Perhaps something to do with some buffer setting in the Linux kernel - one of those /proc entries?

    julia> tic();println(remote_call_fetch(2,+,1,1));toc()
    julia> tic();println(fetch(remote_call(2,+,1,1)));toc()

Darwin The-Surfing-Burrito.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386
Tried on 64-bit Linux (Opteron - not that it should matter), and I see the same 60x slowdown.

    Linux neumann.cs.ucsb.edu 2.6.18-8.1.6.el5 #1 SMP Thu Jun 14 17:29:04 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
I'm not sure any of those are the issue. I think this is odder. Maybe one of these Linux-specific TCP bugs? Unfortunately, I could only find the Google cached version.
Jeff, I assume you're holding the TCP connection open, right? If so, can you try doing the latter in a loop? If only the first one is slow, then I'm more inclined to finger slow start as the culprit.
When I loop it N times it always seems to take exactly N times longer:
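(For reference, a hedged sketch of that loop benchmark against today's Distributed API - the thread's remote_call_fetch / remote_call are the 2011-era names for remotecall_fetch / remotecall; the worker pid and iteration count below are just placeholders.)

```julia
using Distributed
addprocs(1)                 # start one worker process
w = workers()[1]            # its pid
n = 100

# One combined request/fetch message per iteration:
t1 = @elapsed for _ in 1:n
    remotecall_fetch(+, w, 1, 1)
end

# A call message followed by a separate fetch message per iteration:
t2 = @elapsed for _ in 1:n
    fetch(remotecall(+, w, 1, 1))
end

println("remotecall_fetch: $t1 s   fetch(remotecall): $t2 s")
```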
And looping the first one is proportionally faster? Can you confirm that the same TCP connection gets used throughout?
Looping the first one 100x seems to take only 50x longer. Interesting. It's almost exactly like somebody inserted a sleep call in the second case.
Yeah, it should show up in the strace output since doing anything with opening or closing sockets requires a system call.
This problem is officially shitty.
I have a new idea for fixing this: implement timer events, and use it to combine messages sent within, say, 2ms of each other into a single write. Then at least on the sending side both cases will have the same strace signature.
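A rough sketch of that idea, with hypothetical names (PendingWriter, send_msg, flush_pending) and a 2 ms coalescing window - the shape of the fix rather than the actual implementation:

```julia
# Stage outgoing messages and flush them together after a short delay,
# so a `call` followed immediately by a `fetch` leaves in one write().
mutable struct PendingWriter
    io::IO                          # the underlying socket
    buf::IOBuffer                   # bytes staged for the next combined write
    timer::Union{Timer,Nothing}     # pending flush, if any
end
PendingWriter(io::IO) = PendingWriter(io, IOBuffer(), nothing)

function send_msg(w::PendingWriter, bytes::Vector{UInt8}; delay = 0.002)
    write(w.buf, bytes)
    if w.timer === nothing
        # One-shot timer: everything staged within `delay` seconds
        # goes out in the same flush.
        w.timer = Timer(_ -> flush_pending(w), delay)
    end
    return nothing
end

function flush_pending(w::PendingWriter)
    data = take!(w.buf)
    isempty(data) || write(w.io, data)   # a single write for all staged messages
    w.timer = nothing
end
```

Anything staged while the timer is armed rides along in the same write, so both cases end up with the same strace signature on the sending side.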
Ugh. This is such a hack. Not a complaint about the approach really, but it's just so fucked up that this has to be done. Why is the Linux TCP implementation this buggy?
I have always had problems with latency on Linux, for as long as I can remember. For the last 10 years, one has always had to look for some kernel patches and /proc entries, and nothing really addresses it. -viral
It might be a good thing to do anyway; delaying and aggregating messages is pretty common.
I think this should be moved to 2.0.
OK, get excited. I just tried this and it actually worked:
Turns out we can do even better by doing message aggregation only on the requesting side:
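A minimal sketch of that requester-side variant, assuming the caller already has both serialized messages in hand (the function name and arguments are hypothetical):

```julia
# The requester knows the fetch follows the call immediately, so it stages
# both messages and issues one write; the responding side keeps replying
# right away and needs no delay logic at all.
function call_then_fetch(sock::IO, callmsg::Vector{UInt8}, fetchmsg::Vector{UInt8})
    buf = IOBuffer()
    write(buf, callmsg)
    write(buf, fetchmsg)
    write(sock, take!(buf))     # both messages leave the process in one write
end
```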
Which commit fixed this?
Not committed yet.
Delaying and aggregating messages to reduce the number of writes; probably fixing issue #110. The problem was doing I/O during GC in the RemoteRef finalizer - obviously this writes data at a random time and really screws things up. This is an epic commit.
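The gist of that finalizer fix, as a hedged sketch (pending_del, DummyRef and flush_del_msgs are made-up names): the finalizer only records which reference died, and the actual delete message is written later from normal code rather than in the middle of GC.

```julia
const pending_del = Int[]           # ids of references whose finalizer has run

mutable struct DummyRef
    id::Int
    function DummyRef(id)
        r = new(id)
        # No I/O here: the finalizer runs at an arbitrary point during GC,
        # so it only queues the id for later.
        finalizer(x -> push!(pending_del, x.id), r)
        return r
    end
end

# Called from the event loop at a well-defined point, not from GC.
function flush_del_msgs(io::IO)
    while !isempty(pending_del)
        id = pop!(pending_del)
        write(io, "del $id\n")      # one deterministic, batched write site
    end
end
```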
This performance problem is back. The fix needs to be restored within the new I/O system.
I think we had dropped the distinction between send_msg and send_msg_now when IOBuffer didn't exist. This can probably be restored fairly easily now.
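A rough sketch of how that split could look with an IOBuffer as the staging area; the names and message encoding below are assumptions, not the actual Base code:

```julia
using Serialization

struct Worker
    socket::IO          # connection to the remote process
    sendbuf::IOBuffer   # staging area for buffered messages
end

# Buffered path: stage the message; the event loop flushes the buffer once it
# runs out of work, so back-to-back messages share a single write().
function send_msg(w::Worker, msg)
    serialize(w.sendbuf, msg)
    return nothing
end

# Immediate path for latency-critical messages: drain anything already staged
# (to preserve ordering), then write straight to the socket.
function send_msg_now(w::Worker, msg)
    flush_sendbuf(w)
    serialize(w.socket, msg)
end

function flush_sendbuf(w::Worker)
    data = take!(w.sendbuf)
    isempty(data) || write(w.socket, data)
end
```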
See also f9637a2. Threads may be necessary to make this work properly.
I used wireshark to capture network packets for both cases and compared the captures. Is it possible it is not a network issue? In the second case the response to the first packet arrives after 0.039 seconds, while the request-response for the second packet hardly takes any time. Does it have anything to do with creating a RemoteRef in the second case (which is not required with remotecall_fetch)?
This is a really weird bug. Notice the timings below:
No such issues here.
I was able to fix this before by aggregating messages into a single write.
I don't notice any of these on the Mac.
This is a bit of a shot in the dark, but is it at all possible that this is related to #2816? I believe the discovery there was an unnecessary buffer copy in a very low-level function right above the TCP stack.
This says it all:
This is all between two processes on the same machine. A single message roundtrip (request, response) is fast. But if we do the same thing in two messages (send call, send fetch, get response) it takes 60x longer. This is crazy. And there is still only one round trip, just a single extra one-way message from p1 to p2.
We already tried setting TCP_NODELAY and it did not make any difference. So far I'm stumped.
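(For anyone reproducing this today: TCP_NODELAY corresponds to disabling Nagle's algorithm, which the current Sockets stdlib exposes roughly as below; the address and port are placeholders.)

```julia
using Sockets

sock = connect("127.0.0.1", 9009)   # hypothetical worker endpoint
Sockets.nagle(sock, false)          # false disables Nagle, i.e. sets TCP_NODELAY
```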