Architecture
It would be a good idea to read the Megastore paper before trying to comprehend this design. In particular, the key abstractions like "entity groups," the roles of the various servers, and the purpose of Paxos and write-ahead logging are important.
You should understand the Megalon data model before reading this page.
I know that some of this code is not great. It will gradually become more fabulous with time. The author P.G. Wodehouse once described his writing process, where he would attach his pages to the wall starting at floor level, and gradually move them up as he improved them, until the finished pages were at eye level.
Most of this code is at about waist level.
Megalon is a clone of Megastore.
Megalon is a transaction management layer that runs on top of HBase. This is exactly analogous to the way that Google's Megastore runs on top of BigTable. Megalon controls access to the underlying HBase instances in a way that provides all the fancy transactional properties that we want.
A single Megalon database spans several data centers (or possibly just one). In Megastore terminology, each data center is called a "replica." Each data center runs at least one Megalon replication server, at least one Megalon coordinator server, and possibly some Megalon clients. The replication server(s) and coordinator(s) can run on the same machine or on different machines.
Each replica contains an HBase cluster. These HBase clusters don't communicate with each other at all. Megalon is the only thing communicating between replicas. Since Megalon commits transactions in the same order at each replica, the HBase databases at each replica will be exact copies of each other, even though HBase's built-in replication functionality isn't used.
The database is divided into "entity groups." A single Megalon transaction can read from and write to only one entity group. The client application that runs on top of Megalon must be aware of these entity groups, and must make sure that no single entity group receives excessive write throughput. An application might store all the data for a single user in one entity group, for example.
Why does Megalon use HBase? Besides the simple fact that HBase is very similar to BigTable, it offers certain key features that allow transactional behavior to be built easily:
- Multi-version concurrency control (aka MVCC): after values are overwritten or deleted, they can still be accessed. Megalon makes use of this to allow one transaction to commit while other transactions are still reading the old values. We don't have to build gnarly concurrency control mechanisms to prevent readers and writers from accessing the same data.
- checkAndPut: clients can conditionally write data depending on the value of existing data in the database. Megalon uses this for operations like "update the proposed value in the write-ahead log only if no one else has changed it since my last read" (see the sketch after this list).
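
For example, a compare-and-set on a write-ahead log cell might look like the following sketch. This uses the stock HBase client API; the table, row, family, and qualifier names are invented for illustration and are not Megalon's actual schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table and column names, for illustration only.
        HTable table = new HTable(conf, "wal");

        byte[] row = Bytes.toBytes("entityGroup1/walEntry42");
        byte[] family = Bytes.toBytes("paxos");
        byte[] qualifier = Bytes.toBytes("proposedValue");
        byte[] expected = Bytes.toBytes("valueFromMyLastRead");

        Put put = new Put(row);
        put.add(family, qualifier, Bytes.toBytes("myNewProposedValue"));

        // The put is applied only if the cell still holds the expected value.
        // HBase performs this compare-and-set atomically on the region server.
        boolean applied = table.checkAndPut(row, family, qualifier, expected, put);
        if (!applied) {
            // Someone else changed the cell since our read; retry or abort.
        }
        table.close();
    }
}
```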
A single Megalon database is composed of several servers running in Java virtual machines. They can run in the same JVM or in different JVMs. When two servers communicate, they do so by normal Java function calls if they're in the same JVM, or by a custom RPC protocol if they're in different JVMs.
There are two main servers, the "replication server" and the "coordinator." In Megalon these servers perform exactly the same roles as in Megastore. See Figure 5 of the Megastore paper for a diagram of their interactions.
The Replication Server is used by Megalon clients. When a client does a write operation or a catchup operation, it will communicate with replication servers in the remote replicas. Clients never communicate directly with HBase clusters in remote replicas; all remote data access goes through a replication server for that replica. Clients never use the replication server in their local replica; they just communicate directly with the HBase cluster. There may be multiple replication servers in a single replica. The replication servers keep no internal state, since the only state they need is stored in the HBase instance for their replica.
The Coordinator server is used by clients within its replica. The coordinator exists solely to answer one question: "does this replica have the latest written data for entity group X?" If so, the client can read entity group X from the local replica, without communicating with remote replicas. If the coordinator replies that entity group X may not be up to date, the client must begin a catchup operation for that entity group. See the ReadWritePaths page for more information about what happens on the read and write paths.
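
A minimal sketch of that read-path decision follows. The Coordinator, Catchup, and LocalReplica types here are hypothetical stand-ins, not Megalon's actual classes; see the ReadWritePaths page for the real protocol:

```java
import java.io.IOException;

public class ReadPathSketch {
    // Hypothetical interfaces standing in for Megalon's real components.
    interface Coordinator { boolean isUpToDate(String entityGroup); }
    interface Catchup { void run(String entityGroup) throws IOException; }
    interface LocalReplica { byte[] get(String entityGroup, byte[] row) throws IOException; }

    Coordinator coordinator;
    Catchup catchup;
    LocalReplica local;

    byte[] readLatest(String entityGroup, byte[] rowKey) throws IOException {
        if (!coordinator.isUpToDate(entityGroup)) {
            // Local replica may have missed writes: catch up via the remote
            // replicas' replication servers before reading.
            catchup.run(entityGroup);
        }
        // Now the local HBase cluster has the latest committed data.
        return local.get(entityGroup, rowKey);
    }
}
```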
Still to be written: a new server, which I'll probably call a "proposer," that will accept read and write requests from client applications and execute them. In the current design of Megalon, the client code is simply a Java library that runs inside the client JVM. A Megalon client is quite complex since it must act as a Paxos proposer and must maintain state about the cluster configuration. A better design would be to have a server with a simple read/write/commit API that handles all the gnarly Paxos logic on behalf of the client. This would also allow clients to be written in a language other than Java without reimplementing the Megalon client code. This would be a minor change to the Megastore design.
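
Such a read/write/commit API might look something like the following sketch. Since this server doesn't exist yet, every name here is hypothetical:

```java
import java.io.IOException;

// A hypothetical sketch of the proposed "proposer" server's API.
public interface ProposerService {
    /** Opaque handle for an in-flight transaction. */
    interface TxHandle {}

    /** Start a transaction scoped to a single entity group. */
    TxHandle begin(String entityGroup) throws IOException;

    /** Read a value within the transaction's consistent snapshot. */
    byte[] read(TxHandle tx, byte[] rowKey) throws IOException;

    /** Buffer a write; nothing is visible to others until commit. */
    void write(TxHandle tx, byte[] rowKey, byte[] value) throws IOException;

    /** Run Paxos on the client's behalf; returns true if the commit won. */
    boolean commit(TxHandle tx) throws IOException;
}
```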
The internal structure of Megalon servers is a bit unusual. The servers use a generic non-blocking server framework, written especially for this project, in the org.megalon.multistageserver package. See MultiStageServer for details.
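
To give a flavor of the pattern, here is a generic illustration of a staged, non-blocking server (SEDA-style), where each stage handles a payload and hands it to the next stage. The real org.megalon.multistageserver interfaces may differ; see MultiStageServer for the actual design:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic staged-pipeline sketch; not Megalon's actual interfaces.
interface Stage<T> {
    /** Handle one payload, returning the next stage, or null when finished. */
    Stage<T> runStage(T payload) throws Exception;
}

class MultiStageRunner<T> {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    /** Push a payload through the pipeline without blocking the caller. */
    void submit(final Stage<T> first, final T payload) {
        pool.submit(new Runnable() {
            public void run() {
                Stage<T> stage = first;
                try {
                    while (stage != null) {
                        stage = stage.runStage(payload);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
}
```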
One of the main design issues for Megalon is concurrency control: each replica acts as a Paxos proposer and acceptor. A key issue is coordinating the actions of all the replication servers and clients within a single replica so that they act as a single Paxos proposer/acceptor. For example, the correctness of Paxos depends on a single proposer never proposing an N value that is less than or equal to an N value that it has proposed previously. See ReadWritePaths for details on how this is accomplished.
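
To illustrate the requirement, here is one way to keep proposal numbers strictly increasing across all proposers in a replica: atomically bump a counter cell in HBase with checkAndPut. This is a hedged sketch of the general idea, not Megalon's actual mechanism (see ReadWritePaths for that), and the column names are invented:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ProposalNumbers {
    static final byte[] FAMILY = Bytes.toBytes("paxos"); // hypothetical names
    static final byte[] QUAL = Bytes.toBytes("n");

    /** Atomically claim a proposal number higher than any claimed so far. */
    static long nextProposalNumber(HTable table, byte[] row) throws IOException {
        while (true) {
            Result r = table.get(new Get(row));
            byte[] cur = r.getValue(FAMILY, QUAL);
            long next = (cur == null ? 0 : Bytes.toLong(cur)) + 1;
            Put put = new Put(row);
            put.add(FAMILY, QUAL, Bytes.toBytes(next));
            // Succeeds only if no other proposer in this replica bumped the
            // counter since our read; otherwise loop and retry.
            if (table.checkAndPut(row, FAMILY, QUAL, cur, put)) {
                return next;
            }
        }
    }
}
```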
Megalon uses a custom RPC protocol for communication between servers. The design is actually very simple: an RPC request is just an Avro-serialized payload prefixed with message type, serial number and length. The RPC response is also an Avro-serialized payload, prefixed with a message type, the same serial number as the request, and the message length. AvroRpcDecode.java and RPCUtil.java are good places to start digging.
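
A sketch of that wire framing is below. The field widths (1-byte type, 8-byte serial, 4-byte length) are assumptions for illustration; check AvroRpcDecode.java and RPCUtil.java for the real layout:

```java
import java.nio.ByteBuffer;

public class RpcFraming {
    static ByteBuffer encodeFrame(byte msgType, long serial, byte[] avroPayload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 8 + 4 + avroPayload.length);
        buf.put(msgType);               // which message type the payload is
        buf.putLong(serial);            // echoed back in the matching response
        buf.putInt(avroPayload.length); // so the reader knows where the frame ends
        buf.put(avroPayload);           // the Avro-serialized request body
        buf.flip();
        return buf;
    }
}
```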