gms 3.3 spec
Eucalyptus uses jgroups [1] as a virtually synchronous [2] Group Membership System (GMS) [3] to bootstrap and handle host membership in the system.
Initially, isolated hosts are running a `eucalyptus-cloud` process and need to become part of the distributed system. The act of joining the group triggers a reconfiguration of the system as a whole, based on which services are registered on the newly joining host (e.g., enable the Walrus service, sync the DB, etc.). Conversely, when the process is stopped or killed, or the host shuts down, a complementary leave of the group occurs for the host.
Lastly, various faults can impact a host's ability to be useful for processing requests. Host-level failure detection is driven by jgroups. Faults can cause a running `eucalyptus-cloud` process to become unreachable from other hosts in the system (a partition), leave it unable to receive messages (e.g., null route, NIC failure, MTU problem), or cause the hosting node to fail. In each of these cases the system must react by removing that host from the system, no longer depending on the services which had been running there, and potentially promoting services running elsewhere as replacements.
More background on group membership systems is available in "A History of the Virtual Synchrony Replication Model"[4].
In Eucalyptus we depend upon the jgroups GMS to do a number of important things:
When starting the `eucalyptus-cloud` process we don't tell it anything about the rest of the system. Localized configuration of distributed state is intentionally avoided. Instead, the process uses IPv4 multicast (IGMP) to join a multicast group specific to that cloud installation. The effect is to store the state of the hosts in the system "in the network" (as opposed to on any particular host).
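As a rough illustration of what joining a multicast group involves at the socket level, here is a minimal sketch. jgroups handles this internally; the group address and port below are placeholder values chosen for the example, not the ones Eucalyptus or jgroups use.

```python
import socket
import struct

# Hypothetical values for illustration only.
GROUP_ADDR = "239.1.2.3"
GROUP_PORT = 45000


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Build the ip_mreq structure that IP_ADD_MEMBERSHIP expects."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def join_group(group: str = GROUP_ADDR, port: int = GROUP_PORT) -> socket.socket:
    """Open a UDP socket and ask the kernel to join the multicast group.

    The kernel issues the IGMP membership report; no peer addresses are configured.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    return sock
```

Note that the only input is the group address itself: nothing about the other hosts appears in the code, which is the sense in which membership lives "in the network".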
As part of the GMS information we carry three bits along with each host in the system:
- Has the host finished starting up? That is, is the web services stack up and can it respond to web-services requests?
- Does the host run a database?
- If it does run a database, is it synchronized?
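The three bits above could be modelled as flags, as in the following sketch. The names `HostFlags` and `is_usable_database` are hypothetical, and the rule that a database host is only usable once all three bits are set is an assumption made for the example, not something this spec states.

```python
from enum import IntFlag


class HostFlags(IntFlag):
    """Hypothetical encoding of the three per-host bits."""
    NONE = 0
    BOOTED = 1        # web services stack is up and can answer requests
    HAS_DATABASE = 2  # host runs a database
    DB_SYNCED = 4     # database is synchronized (only meaningful with HAS_DATABASE)


def is_usable_database(flags: HostFlags) -> bool:
    """Assumed rule: a database host is usable only when all three bits are set."""
    wanted = HostFlags.BOOTED | HostFlags.HAS_DATABASE | HostFlags.DB_SYNCED
    return (flags & wanted) == wanted
```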
When a host joins or leaves a Eucalyptus system some kind of action results. Based on the above host state information the system will reconfigure itself appropriately:
- Host w/o a database joins the group: it will set up database connections to all the hosts which do run a database.
- Host w/ a database joins the group: there are two cases.
  - There is no other host with a database: the newly joined host will start up normally.
  - There is another host with a database: all hosts will block database operations, the newly joined host will synchronize from the other host with the database and mark its state as synchronized when that completes, then all hosts will set up connections to the new database.
- Host w/o a database leaves the group: if the host was running an ENABLED service, the system will try to perform a failover if a spare is available.
- Host w/ a database leaves the group: everyone tears down their database connections to that host.
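The join and leave cases above can be summarised as a small decision function. This is a sketch only: `Host`, `on_join` and `on_leave` are hypothetical names, actions are returned as descriptive strings rather than performed, and choosing the first database host as the synchronization source is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    name: str
    has_database: bool = False
    runs_enabled_service: bool = False


def on_join(new_host: Host, members: List[Host]) -> List[str]:
    """Actions taken when new_host joins; members are the hosts already in the group."""
    db_hosts = [h.name for h in members if h.has_database]
    if not new_host.has_database:
        return [f"connect {new_host.name} -> {db}" for db in db_hosts]
    if not db_hosts:
        return [f"start {new_host.name} normally"]
    return [
        "block database operations on all hosts",
        f"synchronize {new_host.name} from {db_hosts[0]}",  # assumed: first DB host is the source
        f"mark {new_host.name} synchronized",
        f"connect all hosts -> {new_host.name}",
    ]


def on_leave(host: Host, spare_available: bool) -> List[str]:
    """Actions taken when host leaves the group."""
    if host.has_database:
        return [f"tear down database connections to {host.name}"]
    if host.runs_enabled_service and spare_available:
        return [f"fail over services from {host.name} to a spare"]
    return []
```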
It is the nature of networks that message delivery is not reliable. Devices fail, misconfiguration occurs, cables get pulled. When the network stops delivering messages between two hosts within the Eucalyptus system, a partition has occurred. The GMS will detect the faulty host and remove it from the group. This brings the system into a state with two independently operating regimes. Subsequent repairs to the network will restore communication between the affected hosts, resulting in a merge. At this point, the GMS will provide each host with a view of the system before the merge, which contains at least two subgroups. Each host then determines whether the merge would result in a potential inconsistency (e.g., when two CLCs are merged back together, one of them must stop accepting writes).
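One way to picture the per-host check after a merge is sketched below. The criterion used here (more than one pre-merge subgroup contained a database host) is an assumption chosen to match the two-CLC example; the function names are hypothetical and the sketch does not decide which side must stop accepting writes.

```python
from typing import List, Set


def subgroups_with_database(subgroups: List[Set[str]], db_hosts: Set[str]) -> List[Set[str]]:
    """Return the pre-merge subgroups that contained at least one database host."""
    return [group for group in subgroups if group & db_hosts]


def merge_is_risky(subgroups: List[Set[str]], db_hosts: Set[str]) -> bool:
    """Assumed rule: writes may have diverged if two or more subgroups had a database."""
    return len(subgroups_with_database(subgroups, db_hosts)) > 1
```

For example, `merge_is_risky([{"clc1", "cc1"}, {"clc2"}], {"clc1", "clc2"})` is true because each side of the partition kept a database host.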
Eucalyptus uses jgroups [1], which is very configurable. The chosen configuration affects two characteristics of the system in particular:
- Host bootstrap is Discovery vs. Configured: When
- Failure detection possible with Stateless vs. Stateful protocols:
- Multicast/UDP: Discovered & Stateless
- Unicast/UDP: Configured & Stateless
- Unicast/TCP: Configured & Stateful
- Gossip/TCP: Partially Configured & Stateful
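The four options can be read as a lookup from transport to its two characteristics. The structure below is only a restatement of the list for illustration; `Transport`, `TRANSPORTS` and `options_with` are hypothetical names.

```python
from typing import List, NamedTuple


class Transport(NamedTuple):
    bootstrap: str          # how a host finds the group
    failure_detection: str  # whether detection relies on connection state


TRANSPORTS = {
    "Multicast/UDP": Transport("discovered", "stateless"),
    "Unicast/UDP": Transport("configured", "stateless"),
    "Unicast/TCP": Transport("configured", "stateful"),
    "Gossip/TCP": Transport("partially configured", "stateful"),
}


def options_with(failure_detection: str) -> List[str]:
    """List the transport options that use the given failure detection style."""
    return [name for name, t in TRANSPORTS.items() if t.failure_detection == failure_detection]
```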
A configuration option is to use a gossip server to add new group members. The gossip server maintains a list of the current members. New members contact the gossip server, which then distributes updated membership information to the new member as well as the rest of the group.
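The behaviour described above can be sketched with a toy in-memory registry. This is not the jgroups GossipRouter API; `GossipRegistry` and its methods are hypothetical, and real gossip servers communicate over the network rather than through callbacks.

```python
from typing import Callable, Dict, List


class GossipRegistry:
    """Toy stand-in for a gossip server that tracks members and pushes views."""

    def __init__(self) -> None:
        self._members: Dict[str, Callable[[List[str]], None]] = {}

    def register(self, address: str, on_view: Callable[[List[str]], None]) -> None:
        """Add a member, then send the updated membership to everyone, including it."""
        self._members[address] = on_view
        self._broadcast()

    def unregister(self, address: str) -> None:
        """Remove a member and notify the remaining ones."""
        self._members.pop(address, None)
        self._broadcast()

    def _broadcast(self) -> None:
        view = sorted(self._members)
        for callback in list(self._members.values()):
            callback(list(view))
```

The registry is a single point that every member depends on, which is why a failure of the gossip server affects membership management for the whole group.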
Gossip servers are somewhat more complex to manage but work in cases where cluster membership changes regularly and multicast is not available. The additional complexity comes from needing to configure the system, but has the attribute of only needing to have the possible coordinator hosts (CLCs) configured.
That said, gossip server failures can cause problems with membership management.
Additionally, this setup implies running multiple GossipRouters, which allows for "split-brain" scenarios where hosts are able to reach different CLCs (e.g., during a network partition).
- Misconfigurations which make a host unreachable.
- Misconfigurations which make a host reachable but it cannot respond.
- Misconfigurations which make a host unreachable but it can send.
- Multicast throttling or blocking.
- MTU misconfigurations.
- Null routing/black hole.
- Stateful connection timeouts.
- Configured: Inconsistent host addresses on different nodes will lead to unpredictable system behavior. Identifying this as the root cause is difficult, and the configuration cannot be verified while it is misconfigured. Whenever hosts are added or removed, the addresses must be changed consistently on all hosts in the system.
- Stateful: If a fault causes the process to be unreachable but the TCP sockets are not closed, reacting involves waiting for the socket's keepalive timeout, and this might take 2 hrs.
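The two-hour figure follows from the default TCP keepalive settings on Linux (`net.ipv4.tcp_keepalive_time` = 7200 s, `tcp_keepalive_intvl` = 75 s, `tcp_keepalive_probes` = 9). The sketch below works through the arithmetic and shows how a single socket's keepalive could be shortened; the shorter values are illustrative, not recommendations, and whether Eucalyptus or jgroups tunes these is not stated here.

```python
import socket

# Linux kernel defaults for TCP keepalive.
DEFAULT_IDLE_S = 7200     # idle time before the first probe
DEFAULT_INTERVAL_S = 75   # gap between probes
DEFAULT_PROBES = 9        # unanswered probes before the connection is dropped


def keepalive_detection_seconds(idle_s: int = DEFAULT_IDLE_S,
                                interval_s: int = DEFAULT_INTERVAL_S,
                                probes: int = DEFAULT_PROBES) -> int:
    """Approximate time from last traffic until keepalive declares the peer dead."""
    return idle_s + interval_s * probes


def enable_fast_keepalive(sock: socket.socket, idle_s: int = 30,
                          interval_s: int = 5, probes: int = 3) -> None:
    """Turn on keepalive for one socket with shorter, illustrative timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options are platform-specific, so set only those that exist.
    for opt, value in (("TCP_KEEPIDLE", idle_s),
                       ("TCP_KEEPINTVL", interval_s),
                       ("TCP_KEEPCNT", probes)):
        if hasattr(socket, opt):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, opt), value)
```

With the defaults, `keepalive_detection_seconds()` gives 7200 + 9 × 75 = 7875 seconds, a little over two hours.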
The impact of the above problems on system behaviour varies across configuration options.
- Unicast and Gossip configurations: Asymmetric misconfigurations impact membership.
- Only Multicast configurations: Throttling/blocking impacts membership.
- Only TCP configurations: Stateful connection timeouts, null routing/black holes, and MTU misconfigurations impact failure detection and membership.
- [1] http://www.jgroups.org
- [2] http://en.wikipedia.org/wiki/Virtual_synchrony
- [3] http://www.springer.com/computer/communication+networks/book/978-0-387-21509-9
- [4] History of GMS
tag:rls-3.3 tag:gms
- Contact Info
- email: architecture@eucalyptus.com
- IRC: #eucalyptus-devel (freenode)
- Eucalyptus Links