Persistence

Kévin LOVATO edited this page May 26, 2014 · 26 revisions

Because Zebus is a peer-to-peer bus, it cannot rely on a central broker to deliver messages that were sent to a peer while it was down. We worked around this problem by creating the Persistence Service, a peer that stores transmitted messages so that it can replay them to a peer when that peer comes back up.

Normal behavior

During normal operations, a peer transmits a message to a destination peer directly, but it also sends a copy of that message to the persistence peer.

persistence_normal_behavior_first_step

Once the destination peer has processed a message, it sends an acknowledgement to the Persistence Service to signal that the message was processed.

persistence_normal_behavior_first_step

This means that if a message is not processed, for example because the destination peer is down, it remains stored in the Persistence Service until it can be delivered.
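The bookkeeping described above can be sketched as follows. This is only an illustration in Python, not the actual Zebus implementation: the `PersistenceStore` class, its method names, and the use of a message id as the key are all assumptions made for the example.

```python
from collections import defaultdict


class PersistenceStore:
    """Toy in-memory stand-in for the Persistence Service storage (hypothetical)."""

    def __init__(self):
        # peer_id -> {message_id: payload}; dicts keep insertion order
        self._pending = defaultdict(dict)

    def persist(self, peer_id, message_id, payload):
        """Record the copy of a message that was sent to peer_id."""
        self._pending[peer_id][message_id] = payload

    def ack(self, peer_id, message_id):
        """Forget a message once the destination peer has processed it."""
        self._pending[peer_id].pop(message_id, None)

    def pending_for(self, peer_id):
        """Messages that were sent to peer_id but never acknowledged."""
        return list(self._pending.get(peer_id, {}).items())


store = PersistenceStore()
store.persist("Peer.B", "m1", "first")
store.persist("Peer.B", "m2", "second")
store.ack("Peer.B", "m1")  # Peer.B processed m1, m2 is still pending
```

In this sketch, anything left in `pending_for("Peer.B")` is exactly what would have to be replayed to that peer later.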

Peer restart

When a Peer restarts, it needs to process the messages that were sent to it during its downtime. Those messages are stored in the Persistence Service storage.

persistence_restart_replay_phase

Replay phase

Upon restart, the Peer connects to the Persistence and asks for the messages that were sent to its PeerId during its downtime. The Persistence then sends all the missed messages to the starting Peer. Because the lookup is keyed on the PeerId, migrating a Peer from one machine to another is seamless as long as you keep the same PeerId.

persistence_restart_replay_phase
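As a rough illustration of the replay step, the Python sketch below looks up stored messages by PeerId only. The `replay` function and the dictionary layout are hypothetical, chosen for the example; they are not the Zebus API.

```python
def replay(pending_by_peer, peer_id, handler):
    """Deliver every stored message for peer_id in storage order.

    pending_by_peer maps a PeerId to a {message_id: payload} dict
    (a hypothetical layout). Returns the number of messages replayed.
    """
    missed = list(pending_by_peer.get(peer_id, {}).items())
    for message_id, payload in missed:
        handler(message_id, payload)
    return len(missed)


# The lookup never mentions a machine or an address, only the PeerId.
pending = {"Peer.B": {"m1": "first", "m2": "second"}}
received = []
count = replay(pending, "Peer.B", lambda mid, body: received.append((mid, body)))
```

Since nothing in the lookup depends on where the Peer runs, a Peer restarted on a different machine under the same PeerId would receive the same messages.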

Safety phase

Once the Replay is over, a Peer should in principle be able to switch to its normal mode right away. But since some network links can be slower than others, a Peer A that is not yet aware that the starting Peer B is up could still send a message to the Persistence instead of to Peer B. If Peer B had switched to Normal mode in the meantime, it would never receive that message.

This is why we have a temporary phase during which we process messages normally AND through the Persistence, after deduplicating them on reception.

persistence_restart_replay_phase
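The page does not say how deduplication is done. One minimal way to sketch it, assuming every message carries a unique id (an assumption, as are the class and method names below), is to remember the ids already handled and drop any repeat, whichever channel it arrives on:

```python
class SafetyPhaseReceiver:
    """Handles messages from both the direct link and the Persistence link,
    processing each message id at most once (hypothetical sketch)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # ids of messages already handled

    def receive(self, message_id, payload):
        """Handle a message; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self._handler(payload)
        return True


handled = []
receiver = SafetyPhaseReceiver(handled.append)
receiver.receive("m1", "hello")  # arrives directly from Peer A
receiver.receive("m1", "hello")  # same message relayed by the Persistence, ignored
```

A real implementation would also need to bound the size of the seen set, for example by discarding it when the Safety phase ends.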

Normal phase

After an arbitrarily chosen 30 seconds of Safety phase, the link to the Persistence is closed and the Peer functions normally.

persistence_restart_replay_phase
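Taken together, the three phases form a small state machine. The Python sketch below is one possible reading of the description above, not the Zebus code: the class, the method names, and the explicit `now` parameter (used so the timing can be checked without waiting) are assumptions.

```python
from enum import Enum

SAFETY_PHASE_SECONDS = 30.0  # the arbitrary duration mentioned above


class Phase(Enum):
    REPLAY = "replay"
    SAFETY = "safety"
    NORMAL = "normal"


class StartupPhases:
    """Tracks which phase a restarting Peer is in (hypothetical sketch)."""

    def __init__(self):
        self.phase = Phase.REPLAY
        self._safety_started_at = None

    def replay_finished(self, now):
        """Called when the Persistence has sent all missed messages."""
        if self.phase is Phase.REPLAY:
            self.phase = Phase.SAFETY
            self._safety_started_at = now

    def tick(self, now):
        """Leave the Safety phase once its duration has elapsed."""
        if (self.phase is Phase.SAFETY
                and now - self._safety_started_at >= SAFETY_PHASE_SECONDS):
            self.phase = Phase.NORMAL
        return self.phase

    def uses_persistence_link(self):
        """The link to the Persistence is only needed before the Normal phase."""
        return self.phase is not Phase.NORMAL
```

In practice `now` could come from a monotonic clock such as `time.monotonic()`, with `tick` called periodically or from a timer.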