Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failed: !more (fq.cpp:99) #939

Closed
TheTacoScott opened this issue Mar 19, 2014 · 20 comments
Closed

Assertion failed: !more (fq.cpp:99) #939

TheTacoScott opened this issue Mar 19, 2014 · 20 comments
Assignees

Comments

@TheTacoScott
Copy link

Using the bundled/static compiled version of pyzmq with python Python 2.7.3 and zeromq 4.0.4 and libsodium 0.4.5.

This may very well be related to #882 and or #883. It seems that when a zmq node, or multiple nodes start hammering with DENIED auth requests, or get hammered with DENIED auth requests, over time I speculate that we run out of "handles" or some such and zeromq "crashes".

[DEBUG] [2014-03-18 23:38:38,482] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,482] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
[DEBUG] [2014-03-18 23:38:38,729] [base.py:134] [handle_zap_message] version: 1.0, sequence: 1, domain: , address: IPHIDDEN, identity: , mechanism: CURVE
[DEBUG] [2014-03-18 23:38:38,729] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,729] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key

FAST FORWARD IN TIME with the above error message occuring 1000's of times

Assertion failed: !more (fq.cpp:99)

Those error messages are expected as I am confirming that zap doesn't allow bad keys. It doesn't, so that part is good and expected. This error message may have nothing to do with curve at all, may have just been incidental.

Looks like the error comes from the zmq::fq_t::recvpipe function, and looks to have something to do with multi-part messages best I can tell. Every single message I send from my client thread is a multi-part message with a blank "first" part as it's using the zmq.DEALER message pattern. I use a method of determining if the connection succeed based on if I get a reply. If I don't get a reply in N seconds I close the socket with no linger, do some other python cleanup. I then wait N seconds and restart the process of creating a new socket, starting a connection, and sending the first "hello" message.

@hintjens
Copy link
Member

Can you make a minimal test case for this?

On Wed, Mar 19, 2014 at 9:41 PM, TacoScott notifications@github.com wrote:

Using the bundled/static compiled version of pyzmq with python Python
2.7.3 and zeromq 4.0.4 and libsodium 0.4.5.

After many curve "error" messages with debug logging turned on.
This may very well be related to #882https://github.com/zeromq/libzmq/issues/882and or
#883 #883. It seems that when a
zmq node, or multiple nodes start hammering with DENIED auth requests, over
time I speculate that we run out of "handles" or some such.

[DEBUG] [2014-03-18 23:38:38,482] [base.py:242] [_authenticate_curve]
DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,482] [base.py:253] [_send_zap_reply] ZAP
reply code=400 text=Unknown key
[DEBUG] [2014-03-18 23:38:38,729] [base.py:134] [handle_zap_message]
version: 1.0, sequence: 1, domain: , address: IPHIDDEN, identity: ,
mechanism: CURVE
[DEBUG] [2014-03-18 23:38:38,729] [base.py:242] [_authenticate_curve]
DENIED (CURVE) domain=* client_key=KEYSTRINGHEREOBSCURED
[DEBUG] [2014-03-18 23:38:38,729] [base.py:253] [_send_zap_reply] ZAP
reply code=400 text=Unknown key

FAST FORWARD IN TIME with the above error message occuring 1000's of times

Assertion failed: !more (fq.cpp:99)

Those error messages are expected as I am confirming that zap doesn't
allow bad keys. It doesn't, so that part is good and expected. This error
message may have nothing to do with curve at all, may have just been
incidental.

Looks like the error comes from the zmq::fq_t::recvpipe function, and
looks to have something to do with multi-part messages best I can tell.
Every single message I send from my client thread is a multi-part message
with a blank "first" part as it's using the zmq.DEALER message pattern. I
use a method of determining if the connection succeed based on if I get a
reply. If I don't get a reply in N seconds I close the socket with no
linger, do some other python cleanup. I then wait N seconds and restart the
process.

Reply to this email directly or view it on GitHubhttps://github.com//issues/939
.

@TheTacoScott
Copy link
Author

I will try to come up with something reproducible, but I'm not sure it's curve related. I'll tweak the ironhouse example to provide a bad cert and let it sit for a couple days and report back.

@hintjens
Copy link
Member

Thanks!

On Thu, Mar 20, 2014 at 3:33 PM, TacoScott notifications@github.com wrote:

I will try to come up with something reproducible, but I'm not sure it's
curve related. I'll tweak the ironhouse example to provide a bad cert and
let it sit for a couple days and report back.

Reply to this email directly or view it on GitHubhttps://github.com//issues/939#issuecomment-38166589
.

@TheTacoScott
Copy link
Author

Test case I'm working on:
https://github.com/withorwithoutgod/libzmq-issue-939

It's a simple client/server where the server sits and listens for connections from a client with a key the server doesn't like, and the client just spawns 400 dealer sockets and sends ONE multipart message that should fail auth. It will then hammer the server with repeated attempts. My working theory is that after N number of fail events or N number of seconds it libzmq runs out of something and dies. This assertion only occurs in the test case scenario but it does not yet occur with the code in the above repo. It might take some time to appear.

I'll report back with how long it takes to generate this issue with this test case.

In my own experience, the SERVER (REP) is what generates this assertion, never the client (DEALER).

@TheTacoScott
Copy link
Author

confirmed!
the client works just fine, server side my test case generates the error:

[DEBUG] [2014-03-20 23:11:41,660] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=xEe)EKOjA3{O6Frd6XSK<[uYuO#gO[q7i2?q):K}
[DEBUG] [2014-03-20 23:11:41,660] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
[DEBUG] [2014-03-20 23:11:41,685] [base.py:134] [handle_zap_message] version: 1.0, sequence: 1, domain: , address: 127.0.0.1, identity: , mechanism: CURVE
[DEBUG] [2014-03-20 23:11:41,685] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=xEe)EKOjA3{O6Frd6XSK<[uYuO#gO[q7i2?q):K}
[DEBUG] [2014-03-20 23:11:41,685] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
[DEBUG] [2014-03-20 23:11:41,697] [base.py:134] [handle_zap_message] version: 1.0, sequence: 1, domain: , address: 127.0.0.1, identity: , mechanism: CURVE
[DEBUG] [2014-03-20 23:11:41,697] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=xEe)EKOjA3{O6Frd6XSK<[uYuO#gO[q7i2?q):K}
[DEBUG] [2014-03-20 23:11:41,697] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
[DEBUG] [2014-03-20 23:11:41,729] [base.py:134] [handle_zap_message] version: 1.0, sequence: 1, domain: , address: 127.0.0.1, identity: , mechanism: CURVE
[DEBUG] [2014-03-20 23:11:41,729] [base.py:242] [_authenticate_curve] DENIED (CURVE) domain=* client_key=xEe)EKOjA3{O6Frd6XSK<[uYuO#gO[q7i2?q):K}
[DEBUG] [2014-03-20 23:11:41,729] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
Assertion failed: !more (bundled/zeromq/src/fq.cpp:99)
Aborted

real 409m46.792s
user 159m27.414s
sys 1m11.748s

@TheTacoScott
Copy link
Author

I have so more info that may be useful.

It appears that the error occurs during zap authentication. It appears to occur with the zmq.PAIR message pattern when receiving multipart messages from a zmq.PAIR socket.

You can see more info/testing I've done over at the pyzmq issue I opened about this:
zeromq/pyzmq#504

What are some example conditions that may cause that assertion so I can do some more digging on my own. Thanks!

@TheTacoScott
Copy link
Author

most recent test failure after:
real 867m4.538s

@TheTacoScott
Copy link
Author

we switched pyzmq to not use PAIR sockets and it still failed:

[DEBUG] [2014-03-25 04:01:05,799] [base.py:253] [_send_zap_reply] ZAP reply code=400 text=Unknown key
Assertion failed: !more (bundled/zeromq/src/fq.cpp:99)
Aborted

real 410m1.487s
user 158m43.999s
sys 1m8.508s

@hurtonm hurtonm self-assigned this Mar 25, 2014
@hurtonm
Copy link
Member

hurtonm commented Mar 26, 2014

Can you apply the followinf patch https://gist.github.com/hurtonm/9784872 ad report back. Thanks.

@hurtonm hurtonm closed this as completed Mar 26, 2014
@hurtonm hurtonm reopened this Mar 26, 2014
@TheTacoScott
Copy link
Author

I'll see what I can do, stay tuned. Thanks!

@TheTacoScott
Copy link
Author

patch applied, libzmq.so recompiled, testing now. I'll report back.

why do you think this is a possible fix if I may ask?

@TheTacoScott
Copy link
Author

we're about about 20hours in of testing, the assertion has not shown back up. 14.5 hours was the previous record. Off to a good start.

@TheTacoScott
Copy link
Author

about 24 hours in, still solid. I think you may have found it.

@TheTacoScott
Copy link
Author

still no assertion after 30 some plus hours. Looking very promising.

@TheTacoScott
Copy link
Author

still no issues. I think it's safe to say this resolved the issue. I'll let my test case run for a couple more days just for the fun of it, but it seems reasonable your patch fixed issue #939

@hurtonm
Copy link
Member

hurtonm commented Mar 29, 2014

Thanks for testing this. Really appreciated.

@TheTacoScott
Copy link
Author

if there can be any doubt, my test case is still running strong. No assertions.

@hintjens
Copy link
Member

hintjens commented Apr 2, 2014

Will someone make a pull request, and I'll backport this to 4.0 stable.

On Wed, Apr 2, 2014 at 5:03 PM, TacoScott notifications@github.com wrote:

if there can be any doubt, my test case is still running strong. No
assertions.

Reply to this email directly or view it on GitHubhttps://github.com//issues/939#issuecomment-39341202
.

@hurtonm
Copy link
Member

hurtonm commented Apr 2, 2014

Sure, will do.

hurtonm added a commit to hurtonm/libzmq that referenced this issue Apr 3, 2014
This bug is caused by fq which can terminate pipe before sending all
frames of a message first. So sometimes two messages were mixed and this
confused ZAP handler.

This patch just modifies one pipe parameter so that socket consumes all
messages before dropping the pipe.
hintjens added a commit that referenced this issue Apr 3, 2014
hintjens added a commit to hintjens/libzmq that referenced this issue Apr 29, 2014
Has some bits commented out due to zeromq#939, now work.

Note: there is an issue in libzmq when binding/unbinding or
connecting/disconnecting (I did not investigate deeper) the
same socket several times. Even closing the socket with zero
linger, zmq_ctx_term will block. The workaround in this test
case is to close the sockets for each test step.
@hintjens
Copy link
Member

OK, backported to 4.0.x.

bluca pushed a commit that referenced this issue Oct 31, 2023
This bug is caused by fq which can terminate pipe before sending all
frames of a message first. So sometimes two messages were mixed and this
confused ZAP handler.

This patch just modifies one pipe parameter so that socket consumes all
messages before dropping the pipe.
bluca pushed a commit that referenced this issue Oct 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants