Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV on program exit when RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS #265

Closed
5 of 7 tasks
ghost opened this issue Oct 16, 2017 · 13 comments
Closed
5 of 7 tasks

SIGSEGV on program exit when RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS #265

ghost opened this issue Oct 16, 2017 · 13 comments
Labels

Comments

@ghost
Copy link

ghost commented Oct 16, 2017

Description

I have an app that throws an exception following a consumer.poll before commit. In turn my python program ends with SIGSEGV and a core dump. Here's the output from gdb:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  CallState_get (h=h@entry=0x7f0578bd3978) at confluent_kafka/src/confluent_kafka.c:1390

warning: Source file is more recent than executable.
1390		PyEval_RestoreThread(cs->thread_state);
(gdb) bt
#0  CallState_get (h=h@entry=0x7f0578bd3978) at confluent_kafka/src/confluent_kafka.c:1390
#1  0x00007f0579f88241 in Consumer_rebalance_cb (rk=0x27880d0, err=RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS, c_parts=0x7f0544002730, opaque=0x7f0578bd3978) at confluent_kafka/src/Consumer.c:656
#2  0x00007f0579cb8774 in rd_kafka_poll_cb (rk=0x27880d0, rkq=0x260b650, rko=0x7f0544002780, cb_type=2184411728, cb_type@entry=1140860800, opaque=opaque@entry=0x0) at rdkafka.c:2530
#3  0x00007f0579cb97f8 in rd_kafka_consumer_close (rk=0x27880d0) at rdkafka.c:2088
#4  0x00007f0579cb98f4 in rd_kafka_destroy_app (blocking=<optimized out>, rk=<optimized out>) at rdkafka.c:669
#5  rd_kafka_destroy (rk=0x27880d0) at rdkafka.c:712
#6  0x00007f0579f8919a in Consumer_dealloc (self=0x7f0578bd3978) at confluent_kafka/src/Consumer.c:65
#7  0x00007f0581ca2b47 in dict_dealloc (mp=0x7f0578bf3120) at Objects/dictobject.c:2014
#8  0x00007f0581cc9701 in subtype_dealloc (self=0x7f0578bf11d0) at Objects/typeobject.c:1207
#9  0x00007f0581c8cfc2 in frame_tp_clear (f=0x2779a08) at Objects/frameobject.c:509
#10 0x00007f0581d9f912 in delete_garbage (old=<optimized out>, collectable=<optimized out>) at Modules/gcmodule.c:867
#11 collect (generation=generation@entry=2, n_collected=n_collected@entry=0x7ffc85866530, n_uncollectable=n_uncollectable@entry=0x7ffc85866538, nofail=nofail@entry=0) at Modules/gcmodule.c:1019
#12 0x00007f0581da0559 in collect_with_callback (generation=2) at Modules/gcmodule.c:1128
#13 PyGC_Collect () at Modules/gcmodule.c:1592
#14 0x00007f0581d7d343 in Py_FinalizeEx () at Python/pylifecycle.c:603
#15 0x00007f0581d9e33a in Py_Main (argc=argc@entry=3, argv=argv@entry=0x21b1010) at Modules/main.c:829
#16 0x0000000000400ae1 in main (argc=3, argv=<optimized out>) at ./Programs/python.c:69

How to reproduce

https://gist.github.com/sbv-siteimprove/90965a8b16e9f542f3fe3aa0f01355e0

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()):
>>> import confluent_kafka
>>> confluent_kafka.version()
('0.11.0', 720896)
>>> confluent_kafka.libversion()
('0.11.0', 721151)
  • Apache Kafka broker version:
    wurstmeister/kafka:0.11.0.1 docker image
  • Client configuration: {...}
{                               
            'bootstrap.servers': brokers,                                         
            'group.id': 'somethinghere'                    
             'api.version.request': True,                                          
            'socket.keepalive.enable': True,                                      
             'metadata.request.timeout.ms': 3000,                                  
             'topic.metadata.refresh.interval.ms': 3000,                           
             'socket.timeout.ms': 3000,                                            
             'enable.auto.commit': False,                                          
             'auto.commit.interval.ms': 0,                                         
             'enable.auto.offset.store': True,                                     
             'default.topic.config': {                                             
                 'auto.offset.reset': 'earliest'                                   
             },                                                                    
             'message.max.bytes': _MAX_MESSAGE_BYTES
 }
  • Operating system:
    debian jessie

  • Provide client logs (with 'debug': '..' as necessary)
    I'm not sure how to

  • Provide broker log excerpts
    [2017-10-16 14:44:13,870] INFO [GroupCoordinator 1001]: Member rdkafka-bcb360c4-e259-4b8b-9329-1d589e626d7a in group somethinghere has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
    [2017-10-16 14:44:13,870] INFO [GroupCoordinator 1001]: Preparing to rebalance group somethinghere with old generation 1 (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)
    [2017-10-16 14:44:13,871] INFO [GroupCoordinator 1001]: Group somethinghere with generation 2 is now empty (__consumer_offsets-42) (kafka.coordinator.group.GroupCoordinator)

  • Critical issue

@edenhill
Copy link
Contributor

👍 for gdb backtrace!

It looks like the garbage collector kicks in, triggering Consumer deallocation and an implicit consumer close(), that in turn triggers the on_revoke callback, which expects to have been triggered from a call to poll() or close() that sets up the thread state, but no such luck.

This is a bug in the client which we'll fix, but in the meantime I suggest that you explicitly close() the consumer before letting the GC delete it.

@edenhill edenhill added the bug label Oct 16, 2017
@ghost
Copy link
Author

ghost commented Oct 16, 2017

I just tried making a simple reproduction and you are right. If I make sure to have try: except and close the consumer before the finalizer gets the chance all is well. Here's the repro https://gist.github.com/sbv-siteimprove/90965a8b16e9f542f3fe3aa0f01355e0

@edenhill
Copy link
Contributor

I can't reproduce this on Python 2.7.14:

Traceback (most recent call last):
  File "examples/repro_265.py", line 17, in _poll
    yield got
  File "examples/repro_265.py", line 46, in <module>
    raise Exception('this is my exception')
Exception: this is my exception
Traceback (most recent call last):
  File "examples/repro_265.py", line 46, in <module>
    raise Exception('this is my exception')
Exception: this is my exception

What Python version are you on?

@ghost
Copy link
Author

ghost commented Jan 16, 2018

3.6 :-)

@edenhill
Copy link
Contributor

Thanks, still no reproduction on Python 3.6 with librdkafka master.

@ghost
Copy link
Author

ghost commented Jan 16, 2018

We are running with git clone --branch v0.11.0 https://github.com/edenhill/librdkafka.git

@edenhill
Copy link
Contributor

Nope, still nothing.

@edenhill
Copy link
Contributor

Are you running latest confluent-kafka-python from master?

@ghost
Copy link
Author

ghost commented Jan 16, 2018

confluent_kafka.version()
('0.11.0', 720896)

@ghost
Copy link
Author

ghost commented Jan 16, 2018

Except for the typo (comit vs commit in consumer.py) the repro in the docker-compose still works:

consumer_1   | INFO:consumer:Got 15b109e1-3cc0-443f-91e0-8f54b23fb45a
consumer_1   | ERROR:consumer:Exception during yield
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/src/app/consumer.py", line 14, in _poll
consumer_1   |     yield got
consumer_1   |   File "/usr/src/app/consumer.py", line 41, in <module>
consumer_1   |     raise Exception('this is my exception')
consumer_1   | Exception: this is my exception
consumer_1   | Traceback (most recent call last):
consumer_1   |   File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
consumer_1   |     "__main__", mod_spec)
consumer_1   |   File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
consumer_1   |     exec(code, run_globals)
consumer_1   |   File "/usr/src/app/consumer.py", line 41, in <module>
consumer_1   |     raise Exception('this is my exception')
consumer_1   | Exception: this is my exception
90965a8b16e9f542f3fe3aa0f01355e0_consumer_1 exited with code 139
kafka_1      | [2018-01-16 11:47:26,303] INFO [GroupCoordinator 1001]: Member rdkafka-d23f4b18-808f-4a46-b4fd-e4dbb511bbea in group mygroup has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
kafka_1      | [2018-01-16 11:47:26,304] INFO [GroupCoordinator 1001]: Preparing to rebalance group mygroup with old generation 1 (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
kafka_1      | [2018-01-16 11:47:26,305] INFO [GroupCoordinator 1001]: Group mygroup with generation 2 is now empty (__consumer_offsets-39) (kafka.coordinator.group.GroupCoordinator)
producer_1   | DEBUG:producer:Producing c5d28c03-a6cc-4096-9093-ad1aae7cadc9

@edenhill
Copy link
Contributor

It doesn't seem to SIGSEGV/core though, right?

@ghost
Copy link
Author

ghost commented Jan 16, 2018

I'm not sure about the translation of the return codes via docker/docker-compose with regards to signals etc. but a core file appears which is what i use for the gdb tracing
(As far as I can read it's 128 + fatal signal => 11)

@edenhill
Copy link
Contributor

Is this reproducible with librdkafka v0.11.4 and confluent-kafka-python v0.11.4? (use latest binary wheel)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant