
fix some typos #8

Merged · 1 commit · Aug 15, 2017
README.md: 12 changes (6 additions, 6 deletions)
@@ -10,7 +10,7 @@ Prior to Palo, different tools were deployed to solve diverse requirements in ma

However, when a use case required the simultaneous availability of capabilities that no single tool could provide, users were forced to build hybrid architectures that stitched multiple tools together. Users would often ingest and update data in one storage system, but later reorganize the data to optimize for analytical reporting served from another. Our users had been successfully deploying and maintaining these hybrid architectures, but we believe they should not have to accept the inherent complexity. A storage system built to provide great performance across a broad range of workloads offers a more elegant solution to the problems that hybrid architectures aim to solve. Palo is that solution. Palo is designed as a simple, single, tightly coupled system that does not depend on other systems. It provides highly concurrent, low-latency point queries as well as high-throughput ad-hoc analytical queries; it supports bulk-batch data loading as well as near-real-time mini-batch loading; and it provides high availability, reliability, fault tolerance, and scalability.

Generally speaking, Palo is the technological combination of Google Mesa and Cloudera Impala. Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and system requirements, including near-real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data-processing environment. By virtue of its superior performance and rich functionality, Impala is comparable to many commercial MPP database query engines. Mesa satisfies many of our storage requirements, but it does not provide a SQL query engine; Impala is a very good MPP SQL query engine, but it lacks a mature distributed storage engine. So in the end we chose the combination of these two technologies.

Learning from Mesa’s data model, we developed a distributed storage engine. Unlike Mesa, this storage engine does not rely on any distributed file system. We then deeply integrated this storage engine with the Impala query engine: query compilation, query-execution coordination, and the storage engine’s catalog management are combined into the frontend daemon, while query execution and data storage are combined into the backend daemon. With this integration, we implemented a single, full-featured, high-performance, state-of-the-art MPP database while maintaining simplicity.

@@ -40,7 +40,7 @@ Traditional data warehouse always uses a RDBMS database to store their catalog m

![](./docs/resources/log_replication.jpg)

In-memory catalog storage has three functional modules: real-time in-memory data structures, memory checkpoints on local disk, and an operation relay log. When the catalog is modified, the mutation operation is first written to the log file; the mutation is then applied to the in-memory data structures. Periodically, a thread takes a checkpoint that dumps an image of the in-memory data structures to local disk. The checkpoint mechanism enables fast frontend startup and reduces disk storage occupancy. The in-memory catalog also simplifies the implementation of multiple frontends.

#### 3.2 Multiple Frontends

@@ -157,17 +157,17 @@ Palo applies updates in batches. Three types of data loading are supported: Hado

All loading work is handled asynchronously. When a load request is submitted, a label must be provided. Using the load label, users can submit a show-load request to get the loading status, or a cancel-load request to cancel the loading. If a loading task succeeded or is still in progress, its label cannot be reused; the label of a failed task can be reused.
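
As an illustration, here is a minimal sketch of how a labeled load could be driven through Palo's SQL interface. The database, table, and HDFS path are hypothetical, and the statement syntax is modeled on later open-source releases of this code base, so treat it as a sketch rather than the definitive dialect:

```sql
-- Submit an asynchronous load under the label 'batch_20170815'
-- (database, table, and file path are hypothetical).
LOAD LABEL example_db.batch_20170815
(
    DATA INFILE("hdfs://namenode:9000/palo/input/file.csv")
    INTO TABLE example_tbl
);

-- Poll the asynchronous task by its label.
SHOW LOAD WHERE LABEL = "batch_20170815";

-- Cancel the task while it is still pending or running;
-- after a failure, the same label could be submitted again.
CANCEL LOAD FROM example_db WHERE LABEL = "batch_20170815";
```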

#### 4.3 Resource Isolation

1. Multi-tenancy isolation: multiple virtual clusters can be created in one physical Palo cluster. Every backend node can host multiple backend processes, and every backend process belongs to exactly one virtual cluster. A virtual cluster is one tenant.

2. User isolation: there are many users in one virtual cluster. You can allocate resources among different users (see the sketch after this list) and ensure that all users’ tasks are executed under a limited resource quota.

3. Priority isolation: there are three priority isolation groups for each user. Users can control the resources allocated to the different tasks they submit; for example, a user’s query tasks and loading tasks typically require different resource quotas.
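
A hedged sketch of what per-user quota configuration could look like. The `SET PROPERTY` statement, the `resource.cpu_share` key, and the `quota.*` group weights are modeled on later open-source releases; the user name and values are hypothetical:

```sql
-- Raise the CPU share user 'jack' receives when tasks compete (illustrative value).
SET PROPERTY FOR 'jack' 'resource.cpu_share' = '1000';

-- Weight jack's priority groups so low-priority loading work yields to normal queries.
SET PROPERTY FOR 'jack' 'quota.normal' = '400';
SET PROPERTY FOR 'jack' 'quota.low'    = '100';
```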

#### 4.4 Multi-Medium Storage

Most machines in modern datacenters are equipped with both SSDs and HDDs. An SSD has good random-read capability, which makes it the ideal medium for queries that need a large number of random reads; however, SSD capacity is small and expensive, so we cannot deploy it at large scale. An HDD is cheap and has huge capacity, which makes it suitable for storing large-scale data, but it has high read latency. In OLAP scenarios, we find that users usually submit many queries against the latest data (hot data) and expect low latency, while only occasionally querying historical data (cold data); such queries usually scan large amounts of data and have high latency. Multi-medium storage lets users choose the storage medium of their data to match these query scenarios and reduce latency. For example, a user can put the latest data on SSD and infrequently used historical data on HDD: queries on the latest data are then fast, while queries on historical data are slower, which is expected because they scan much more data.

In the following figure, the user alters partition 'p201601', setting storage_medium to SSD and storage_cooldown_time to '2016-07-01 00:00:00'. This setting means data in the partition is placed on SSD and begins to migrate to HDD after storage_cooldown_time.
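
The statement the figure describes might look like the following sketch; the table name is hypothetical and the property syntax follows later open-source releases:

```sql
-- Keep the hot partition on SSD, then migrate it to HDD after the cooldown time.
ALTER TABLE example_tbl MODIFY PARTITION p201601
SET (
    "storage_medium"        = "SSD",
    "storage_cooldown_time" = "2016-07-01 00:00:00"
);
```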

@@ -188,7 +188,7 @@ The result of benchmark shows 2x~4x speedup in our typical queries.

## 5. Backup and Recovery

A data backup function is provided to enhance data security. The minimum granularity of backup and recovery is the partition. Users can develop plugins to back up data to any specified remote storage, and the backed-up data can be restored into Palo at any time, achieving the data-rollback purpose.
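
A hedged sketch of partition-level backup and rollback; the repository, snapshot label, and timestamp are hypothetical, and the statement syntax is modeled on later open-source releases:

```sql
-- Back up a single partition into a user-defined remote repository.
BACKUP SNAPSHOT example_db.snapshot_20170815
TO example_repo
ON (example_tbl PARTITION (p201601));

-- Roll the partition back later by restoring that snapshot
-- (the backup_timestamp value identifies the backup and is illustrative).
RESTORE SNAPSHOT example_db.snapshot_20170815
FROM example_repo
ON (example_tbl PARTITION (p201601))
PROPERTIES ("backup_timestamp" = "2017-08-15-00-00-00");
```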

Currently we only support full backups rather than incremental backups, for the following reasons:
