Skip to content

CarbonData is a new hadoop native file format for faster interactive query

License

Notifications You must be signed in to change notification settings

sujith71955/carbondata-1

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CarbonData

CarbonData is a fully indexed columnar and hadoop native datastore designed for fast analytics and multi-dimension query on big data.In customer benchmarks, CarbonData has been shown to manage Petabyte of data running on extraordinarily low-cost hardware and answers queries around 10 times faster than the current open source solutions (column-oriented SQL on Hadoop data-stores).

Why CarbonData

The CaronData file format provides a highly efficient way to store structured data,it was designed to overcome limitations of the other Hadoop file formats.

  • CarbonData stores data along with index,the index is not stored separately but the carbondata itself is the index.Indexing helps to accelerate query performance and reduces the I/O scans to the minimum and reduces the CPU required for computation on the data.
  • CarbonData is designed to support very efficient compression and global encoding schemes,CarbonData can support actionable compression across data files for query engines to perform all processing on compressed/encoded data without having to convert the data. This improves the processing speed, especially for analytical queries. The data can be converted back to the user readable format just before returning the results to the user.
  • CarbonData is designed with various types of usecases in mind and provides flexible storage options. Data can be stored in completely columnar or row based formats or even in a mix of columnar and row based hybrid format.
  • CarbonData is designed to integrate into big data ecosystem, leverages Apache Hadoop,Apache Spark etc. for distributed query processing.

Features

  • Fast analytic queries in seconds using built-in index, optimized for interactive OLAP-style query, high through put scan query, low latency point query.
  • Fast data loading speed and support incremental load in period of minutes. 5 mins near realtime loading should be supported
  • Support concurrent query.
  • Support time based data retention.
  • Supports SQL based query interface.

CarbonData File Structure and Format

The online document at CarbonData File Structure and Format

Building CarbonData

Prerequisites for building CarbonData:

  • Unix-like environment (Linux, Mac OS X)
  • git
  • Apache Maven (we recommend version 3.0.4)
  • Java 7 or 8
  • Scala 2.10

I. Clone and build CarbonData

$ git clone https://github.com/HuaweiBigData/carbondata.git

II. Go to the root of the source tree

$ cd carbondata

III. Build the project Build without testing:

$ mvn -DskipTests clean install 

Or, build with testing:

$ mvn clean install

Developing CarbonData

The CarbonData committers use IntelliJ IDEA and Eclipse IDE to develop.

IntelliJ IDEA

  • Download IntelliJ at https://www.jetbrains.com/idea/ and install the Scala plug-in for IntelliJ at http://plugins.jetbrains.com/plugin/?id=1347
  • Go to "File -> Import Project", locate the CarbonData source directory, and select "Maven Project".
  • In the Import Wizard, select "Import Maven projects automatically" and leave other settings at their default.
  • Leave other settings at their default and you should be able to start your development.
  • When you run the scala test, sometimes you will get out of memory exception. You can increase your VM memory usage by the following setting, for example:
-XX:MaxPermSize=512m -Xmx3072m

You can also make those setting to be the default by setting to the "Defaults -> ScalaTest".

Eclipse

  • Download the Scala IDE (preferred) or install the scala plugin to Eclipse.
  • Import the CarbonData Maven projects ("File" -> "Import" -> "Maven" -> "Existing Maven Projects" -> locate the CarbonData source directory).

Fork and Contribute

This is an open source project for everyone, and we are always open to people who want to use this system or contribute to it. This guide document introduce how to contribute to CarbonData.

About

CarbonData project original contributed from the Huawei, in progress of donating this open source project to Apache Software Foundation for leveraging big data ecosystem.

About

CarbonData is a new hadoop native file format for faster interactive query

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 88.4%
  • Scala 11.3%
  • Thrift 0.3%