Simple cluster resilience test exhibits (re)connection failure and CRASH_REPORT #390
Environment

- VerneMQ version: 1.0.1
- OS: debian/jessie

Firstly, I'm new to MQTT and VerneMQ and have little Erlang experience.

Whilst investigating VerneMQ's characteristics in the event of node loss and partitioning, I've encountered Erlang CRASH REPORT occurrences in the debug logs. These occur when a subscriber reconnects following a node outage (see detailed steps).

I'd like to understand whether this is expected behaviour or a bug. Do occurrences of such reports compromise the integrity and service level of a cluster in any way?
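For reference, a minimal version of the reconnecting subscriber from the test could look like the sketch below (assuming the paho-mqtt 1.x Python client; the broker address, topic, and client id are illustrative placeholders, not taken from the actual test):

```python
# A sketch, assuming the paho-mqtt 1.x Python client; broker address,
# topic, and client id are illustrative, not from the original test.
import paho.mqtt.client as mqtt

BROKER = "vernemq-node1.example.com"  # hypothetical cluster node

def on_connect(client, userdata, flags, rc):
    # flags["session present"] shows whether the broker resumed the session.
    print("connected rc=%s session_present=%s" % (rc, flags.get("session present")))
    client.subscribe("test/topic", qos=1)

def on_disconnect(client, userdata, rc):
    # rc != 0 signals an unexpected disconnect, e.g. the node went down;
    # loop_forever() below keeps retrying the connection afterwards.
    print("disconnected rc=%s" % rc)

def on_message(client, userdata, msg):
    print("message on %s: %r" % (msg.topic, msg.payload))

client = mqtt.Client(client_id="resilience-test-sub", clean_session=False)
client.on_connect = on_connect
client.on_disconnect = on_disconnect
client.on_message = on_message
client.connect(BROKER, 1883)
client.loop_forever(retry_first_connection=True)
```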
Thank you very much for your efforts describing this issue the way you did! Very much appreciated! A first quick response:

Assuming

The explanation for this crash log is the following:

Todo for us:

Does this answer the most urgent questions for you?
Happy to be of assistance and thank you for the fast response.
In the test scenario I linked to,

Hypothetically, we have a 5-node cluster with 10 subscribers to a single shared topic (load-balanced), maintaining a throughput of around 40K TPS (which is pure speculation on my part). If a node goes down, I don't want to suffer a 20% loss of throughput during the healing period, which could be many minutes for EC2 and up to 1 minute for ECS. It's very important to me that message loss is minimised; message order isn't as critical.

Given --

My other concern relates to how retained messages are stored, assuming QoS=1:

At the moment VerneMQ looks like a promising solution for a variety of use cases I'm looking at, and thank you for your great work in developing it to its present level of refinement.
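To make the scenario concrete, the shared-topic setup described above could look roughly like this (a sketch assuming paho-mqtt 1.x and a broker with `$share/<group>/<topic>` shared-subscription support enabled; hosts, topics, group name, and client ids are illustrative):

```python
# A sketch, assuming paho-mqtt 1.x and shared-subscription support in the
# broker; hosts, topics, group name, and client ids are illustrative.
import paho.mqtt.client as mqtt

# Subscriber side: each of the 10 workers joins the same share group, so
# the broker load-balances messages on orders/# across the group members.
sub = mqtt.Client(client_id="worker-1", clean_session=False)
sub.connect("vernemq-node2.example.com", 1883)
sub.subscribe("$share/workers/orders/#", qos=1)
sub.loop_start()

# Publisher side: a retained QoS-1 publish; the broker keeps the last
# retained message per topic and replays it to late subscribers.
pub = mqtt.Client(client_id="producer-1")
pub.connect("vernemq-node1.example.com", 1883)
pub.loop_start()
info = pub.publish("orders/eu/status", payload="open", qos=1, retain=True)
info.wait_for_publish()  # block until the QoS-1 publish has gone out
```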
Let's consider the situations where message loss could occur in general:

To answer your questions:

Replication of retained messages is shielded by an in-memory write cache. The cache is flushed every second to the underlying distributed storage, which replicates the retained messages to every other node (n replicas = n nodes), similar to the subscriptions, in an eventually consistent manner. This write cache is required to protect the distributed storage from load, as the retain feature can be very chatty. Due to this trade-off, remote clients might have an inconsistent view with respect to retained messages (and also subscriptions).

Data loss in the event of sudden termination of an AWS container/instance is indeed a problem, as with most other stateful systems. As the current message store isn't replicated, we strongly advise going through a controlled shutdown procedure that uses a proper cluster leave, which ensures that the queues/sessions are properly migrated to the other cluster nodes. Otherwise message loss is guaranteed. Subscriptions and retained messages are replicated to all other nodes, so you might only lose subscriptions/retained messages that haven't been committed to the distributed storage.

In general, start by tuning the obvious system limits (e.g. max file descriptors) and TCP buffer sizes. Some of the recommended settings are part of our documentation: https://vernemq.com/docs/misc/.

Hope this helps.
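A controlled decommission along those lines could be scripted roughly as follows, for example from an EC2/ECS shutdown hook (a sketch; `vmq-admin cluster leave` is VerneMQ's cluster-leave command, and the node name shown is a placeholder):

```python
# Sketch of a controlled node decommission. The Erlang node name is a
# placeholder; look up real names with `vmq-admin cluster show`.
import subprocess

NODE = "VerneMQ@10.0.0.5"  # hypothetical node to take out of the cluster

# `cluster leave` makes the node hand off its queues/sessions to the
# remaining cluster nodes before it goes away, avoiding the message loss
# described above for hard terminations.
subprocess.run(["vmq-admin", "cluster", "leave", "node=%s" % NODE], check=True)
```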
Closing in favour of #413