
Simple cluster resilience test exhibits (re)connection failure and CRASH_REPORT #390

Closed
codeasone opened this issue May 19, 2017 · 4 comments

@codeasone

Environment

  • VerneMQ Version: 1.0.1
  • OS: debian/jessie

Firstly, I'm new to MQTT and VerneMQ and have little Erlang experience.

Whilst investigating VerneMQ's characteristics in the event of node loss and partitioning, I've encountered Erlang CRASH REPORT occurrences in the debug logs.

CRASH REPORT Process <0.1065.0> with 0 neighbours crashed with reason: no case clause matching {{case_clause,{badrpc,nodedown}},[{vmq_reg,'-register_subscriber/4-fun-0-',2,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,182}]},{vmq_reg,block_until_migrated,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,249}]},{vmq_reg,register_subscriber,4,[{file,"/opt/vernemq/distdir/1.0.1/_build/default/lib/vmq_server/src/vmq_reg.erl"},{line,194}]},{vmq_reg_sync_action,'-init/1-fun-0-',1,[{file,"/opt/vernemq/distdi..."},...]}]} in vmq_mqtt_fsm:check_user/2 line 553

This occurs when a subscriber reconnects following a node outage (see detailed steps).

I'd like to understand whether this is expected behaviour or a bug.

Do the occurrences of such reports compromise the integrity and service-level of a cluster in any way?

dergraf commented May 19, 2017

Thank you very much for your efforts describing this issue the way you did! Very much appreciated!

A first quick response:

Assuming allow_register_during_netsplit = off, the crash(es) you're seeing don't compromise integrity. However, instead of crashing we should gracefully terminate the client by returning the proper CONNACK response.
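
For reference, here is a minimal vernemq.conf sketch of that setting (file location and defaults depend on your packaging, so please double-check against the docs):

    # vernemq.conf (sketch)
    # Reject new client registrations while the cluster is inconsistent,
    # favouring consistency over availability during a netsplit:
    allow_register_during_netsplit = off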

The explanation for this crash log is the following:
At the moment you stop a node or simulate a netsplit, the cluster is in a state we call the Window of Uncertainty. During this time frame the cluster state is actually unhealthy, whereas the remaining nodes still believe it is healthy. After some time (at most 10 seconds, but typically a lot earlier) several keepalive checks kick in and eventually all cluster nodes become aware that the cluster is unhealthy. I think your test case has hit exactly that Window of Uncertainty.
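
If you want to observe this window from the outside, a rough sketch is to poll the cluster view from one of the surviving nodes (the exact output format may differ between releases):

    # run on any remaining node; the stopped/partitioned node should show up
    # as not running once the keepalive checks have fired
    vmq-admin cluster show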

Todo for us:

Does this answer the most urgent questions for you?

@codeasone (Author)

Happy to be of assistance and thank you for the fast response.

Assuming allow_register_during_netsplit = off those crash(es) you're seeing don't compromise the integrity

In the test scenario I linked to, allow_register_during_netsplit = on because I'm trying to maintain availability and high throughput. Here's what I mean by that (and I hope it's sensible):

Hypothetically, we have a 5-node cluster with 10 subscribers to a single shared topic (load-balanced), maintaining a throughput of around 40K TPS (which is pure speculation on my part).

If a node goes down, I don't want to suffer a 20% loss of throughput during the healing period, which could be many minutes for EC2 and up to a minute for ECS.

It's very important to me that message loss is minimised; message order isn't as critical.

Given allow_register_during_netsplit = on, is message loss going to be an issue in this setting?

--

My other concerns relate to how retained messages are stored, assuming QoS=1:

  • Can you confirm that multiple replicas of a retained message exist across different cluster nodes before a PUBACK is sent to a publisher? If so, is the number of replicas configurable?
  • Assuming write caching is in effect between LevelDB and disk, should I be concerned about data loss in the event of sudden termination of an AWS container/instance?
  • Do you have any general sysctl setting tips/advice? I'll be running VerneMQ on c4.2xlarge instances under Amazon Linux.

At the moment VerneMQ looks like a promising solution for a variety of use cases I'm looking at. Thank you for your great work in developing it to its present level of refinement.

dergraf commented May 20, 2017

Let's consider the situations where message loss could occur in general:

  1. An undetected (window of uncertainty) netsplit or node crash delays the replication of a potential new subscription. Due to the eventually consistent nature of our subscriber storage, VerneMQ runs with an out-of-sync routing table. During this time messages are not routed to the 'new' subscriber, hence loss of messages.
  2. Messages that target a node which isn't reachable due to a netsplit or a node crash are temporarily buffered. The size of this buffer is controlled via outgoing_clustering_buffer_size (see the config sketch after this list). Once this buffer is full, messages are dropped (hence message loss). The metric cluster_bytes_dropped covers the number of bytes that were dropped due to a full buffer.
  3. Messages using QoS 1 & 2 are stored to disk as soon as they are routed to the proper queue. This means you can restart a node without losing messages. If the queue has an online session attached (online queue), a replica of the message is always kept in RAM. For offline queues (queues without an attached session) no in-memory replica exists, and the messages are read from disk upon reconnect. Obviously, if the underlying storage crashes, you lose messages.
  4. Queues have a configured upper bound on the number of messages they can store while online or offline. This mechanism applies load shedding to protect the system. If you reach those limits, VerneMQ drops the messages.
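
To make points 2 and 4 a bit more concrete, here's an illustrative vernemq.conf sketch of the knobs involved (the values are examples, not recommendations; please take the actual defaults from the docs of your release), plus how to watch the drop counter:

    # vernemq.conf (illustrative values only)
    # bytes buffered per unreachable remote node (point 2):
    outgoing_clustering_buffer_size = 10000
    # per-queue upper bounds for stored messages (point 4):
    max_online_messages = 1000
    max_offline_messages = 1000

    # watch for drops caused by a full cluster buffer:
    vmq-admin metrics show | grep cluster_bytes_dropped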

To answer your questions:

Replication of retained messages is shielded/protected by an in-memory write cache. The cache is flushed every second to the underlying distributed storage, which replicates the retained messages to every other node (n replicas = n nodes), similar to the subscriptions, in an eventually consistent manner. This write cache is required to protect the distributed storage from load, as the retain feature can be very chatty. Due to this trade-off, remote clients might have an inconsistent view with respect to retained messages (and also subscriptions). The PUBACK is sent out once the message is written to the write cache.
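
As a concrete illustration using the mosquitto command line clients (purely an example, not part of VerneMQ), the PUBACK for a retained QoS 1 publish does not wait for replication, so reading the message back from another node right away may briefly return nothing:

    # publish a retained message at QoS 1 via node A
    mosquitto_pub -h nodeA -q 1 -r -t sensors/temp -m "21.5"
    # subscribing on node B immediately afterwards may miss the retained copy
    # for roughly a second, until the write cache has been flushed and replicated
    mosquitto_sub -h nodeB -q 1 -t sensors/temp -C 1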

Data loss in the event of sudden termination of an AWS container/instance is indeed a problem, as with most other stateful systems. As the current message store isn't replicated, we strongly advise going through a controlled shutdown procedure that uses a proper cluster leave, which ensures that the queues/sessions are properly migrated to the other cluster nodes. Otherwise message loss is guaranteed. Subscriptions and retained messages are replicated to all other nodes, so you might only lose the subscriptions/retained messages that haven't been committed to the distributed storage.
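
A sketch of such a controlled shutdown (the node name is just an example; please verify the exact procedure and flags against the clustering docs of your version):

    # make the node leave the cluster so its queues/sessions get migrated
    vmq-admin cluster leave node=VerneMQ@10.0.0.1
    # once the migration has finished, stop the node
    vernemq stop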

In general, start by tuning the obvious system limits (e.g. max file descriptors) and the TCP buffer sizes. Some of the recommended settings are part of our documentation: https://vernemq.com/docs/misc/.
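
For example, the usual suspects look something like this (generic Linux settings, illustrative values only; take the actual numbers from the linked docs and your own load tests):

    # /etc/sysctl.d/vernemq.conf (illustrative)
    fs.file-max = 262144
    net.core.somaxconn = 4096
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216

    # and raise the per-process file descriptor limit,
    # e.g. in /etc/security/limits.conf:
    vernemq soft nofile 65536
    vernemq hard nofile 65536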

Hope this helps.

dergraf commented Jun 15, 2017

Closing in favour of #413

dergraf closed this as completed on Jun 15, 2017