-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assertion failed: !more (fq.cpp:99) #939
Comments
Can you make a minimal test case for this? On Wed, Mar 19, 2014 at 9:41 PM, TacoScott notifications@github.com wrote:
|
I will try to come up with something reproducible, but I'm not sure it's curve related. I'll tweak the ironhouse example to provide a bad cert and let it sit for a couple days and report back. |
Thanks! On Thu, Mar 20, 2014 at 3:33 PM, TacoScott notifications@github.com wrote:
|
Test case I'm working on: It's a simple client/server where the server sits and listens for connections from a client with a key the server doesn't like, and the client just spawns 400 dealer sockets and sends ONE multipart message that should fail auth. It will then hammer the server with repeated attempts. My working theory is that after N number of fail events or N number of seconds it libzmq runs out of something and dies. This assertion only occurs in the test case scenario but it does not yet occur with the code in the above repo. It might take some time to appear. I'll report back with how long it takes to generate this issue with this test case. In my own experience, the SERVER (REP) is what generates this assertion, never the client (DEALER). |
confirmed! [DEBUG] [2014-03-20 23:11:41,660] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=xEe)EKOjA3{O6Frd6XSK<[uYuO#gO[q7i2?q):K} real 409m46.792s |
I have so more info that may be useful. It appears that the error occurs during zap authentication. It appears to occur with the zmq.PAIR message pattern when receiving multipart messages from a zmq.PAIR socket. You can see more info/testing I've done over at the pyzmq issue I opened about this: What are some example conditions that may cause that assertion so I can do some more digging on my own. Thanks! |
most recent test failure after: |
we switched pyzmq to not use PAIR sockets and it still failed: [DEBUG] [2014-03-25 04:01:05,799] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key real 410m1.487s |
Can you apply the followinf patch https://gist.github.com/hurtonm/9784872 ad report back. Thanks. |
I'll see what I can do, stay tuned. Thanks! |
patch applied, libzmq.so recompiled, testing now. I'll report back. why do you think this is a possible fix if I may ask? |
we're about about 20hours in of testing, the assertion has not shown back up. 14.5 hours was the previous record. Off to a good start. |
about 24 hours in, still solid. I think you may have found it. |
still no assertion after 30 some plus hours. Looking very promising. |
still no issues. I think it's safe to say this resolved the issue. I'll let my test case run for a couple more days just for the fun of it, but it seems reasonable your patch fixed issue #939 |
Thanks for testing this. Really appreciated. |
if there can be any doubt, my test case is still running strong. No assertions. |
Will someone make a pull request, and I'll backport this to 4.0 stable. On Wed, Apr 2, 2014 at 5:03 PM, TacoScott notifications@github.com wrote:
|
Sure, will do. |
This bug is caused by fq which can terminate pipe before sending all frames of a message first. So sometimes two messages were mixed and this confused ZAP handler. This patch just modifies one pipe parameter so that socket consumes all messages before dropping the pipe.
Has some bits commented out due to zeromq#939, now work. Note: there is an issue in libzmq when binding/unbinding or connecting/disconnecting (I did not investigate deeper) the same socket several times. Even closing the socket with zero linger, zmq_ctx_term will block. The workaround in this test case is to close the sockets for each test step.
OK, backported to 4.0.x. |
This bug is caused by fq which can terminate pipe before sending all frames of a message first. So sometimes two messages were mixed and this confused ZAP handler. This patch just modifies one pipe parameter so that socket consumes all messages before dropping the pipe.
Using the bundled/static compiled version of pyzmq with python Python 2.7.3 and zeromq 4.0.4 and libsodium 0.4.5.
This may very well be related to #882 and or #883. It seems that when a zmq node, or multiple nodes start hammering with DENIED auth requests, or get hammered with DENIED auth requests, over time I speculate that we run out of "handles" or some such and zeromq "crashes".
[DEBUG] [2014-03-18 23:38:38,482] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,482] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
[DEBUG] [2014-03-18 23:38:38,729] [base.py:134] [handle_zap_message] version: 1.0, sequence: 1, domain: , address: IPHIDDEN, identity: , mechanism: CURVE
[DEBUG] [2014-03-18 23:38:38,729] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,729] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
FAST FORWARD IN TIME with the above error message occuring 1000's of times
Assertion failed: !more (fq.cpp:99)
Those error messages are expected as I am confirming that zap doesn't allow bad keys. It doesn't, so that part is good and expected. This error message may have nothing to do with curve at all, may have just been incidental.
Looks like the error comes from the zmq::fq_t::recvpipe function, and looks to have something to do with multi-part messages best I can tell. Every single message I send from my client thread is a multi-part message with a blank "first" part as it's using the zmq.DEALER message pattern. I use a method of determining if the connection succeed based on if I get a reply. If I don't get a reply in N seconds I close the socket with no linger, do some other python cleanup. I then wait N seconds and restart the process of creating a new socket, starting a connection, and sending the first "hello" message.
The text was updated successfully, but these errors were encountered: