Object storage is the recommended storage format in the cloud as it supports storing large data files. S3 APIs are widely used for accessing object stores and can be used to store or retrieve data on Amazon cloud, Huawei Cloud (OBS), or any other object store conforming to the S3 API. Storing data in the cloud is advantageous as there are no restrictions on the size of data, and the data can be accessed from anywhere at any time. CarbonData can support any object storage that conforms to the Amazon S3 API. CarbonData relies on the Hadoop-provided S3 filesystem APIs to access object stores.
To store carbondata files onto an object store, the `spark.sql.warehouse.dir` property has to be configured with the object store path in spark-defaults.conf.

For example:
```
spark.sql.warehouse.dir=s3a://mybucket/carbonstore
```
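The same settings can also be supplied programmatically when the application builds its own SparkSession. Below is a minimal sketch, assuming the bucket name and credential values are placeholders to be replaced by the user:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point the warehouse at an S3 bucket and pass the s3a credentials
// through spark.hadoop.* properties. All values are placeholders.
val spark = SparkSession.builder()
  .appName("CarbonOnS3")
  .config("spark.sql.warehouse.dir", "s3a://mybucket/carbonstore")
  .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
  .getOrCreate()
```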
If the existing store location cannot be changed, or only specific tables need to be stored on the cloud object store, this can be done by specifying the `LOCATION` option in the CREATE TABLE DDL command.

For example:
```
CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore'
```

For more details on create table, refer to the DDL of CarbonData.
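As an illustrative sketch (assuming a CarbonData-enabled SparkSession named `spark`, and reusing the placeholder database, table, and bucket names from the example above), the S3-backed table can then be written to and queried like any other CarbonData table:

```scala
// Sketch: create the table at an S3 location, insert a row, and read it back.
spark.sql(
  """CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int)
    |STORED AS carbondata
    |LOCATION 's3a://mybucket/carbonstore'""".stripMargin)

spark.sql("INSERT INTO db1.table1 SELECT 'a', 1")
spark.sql("SELECT * FROM db1.table1").show()
```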
Authentication properties have to be configured to store carbondata files to the S3 location.

Authentication properties can be set in any of the following ways:

- Set authentication properties in core-site.xml; refer to the Hadoop authentication document.
- Set authentication properties in spark-defaults.conf.

  Example:
  ```
  spark.hadoop.fs.s3a.secret.key=123
  spark.hadoop.fs.s3a.access.key=456
  ```
- Pass authentication properties with spark-submit as configuration.

  Example:
  ```
  ./bin/spark-submit \
  --master yarn \
  --conf spark.hadoop.fs.s3a.secret.key=123 \
  --conf spark.hadoop.fs.s3a.access.key=456 \
  --class=xxx
  ```
- Set authentication properties on the Hadoop configuration object in sparkContext.

  Example:
  ```
  sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "123")
  sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "456")
  ```
- Object storage like S3 does not support the file leasing mechanism (supported by HDFS) that is required to take locks that ensure consistency between concurrent operations. Therefore, it is recommended to set the configurable lock path property (`carbon.lock.path`) to an HDFS directory; see the sketch after this list.
- Concurrent data manipulation operations are not supported. Object stores follow eventual-consistency semantics, i.e., a PUT request might take some time before it is reflected in a subsequent listing. As a result, reads may not always see consistent or up-to-date data.
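As a minimal sketch (the HDFS path below is a placeholder; the same property can also be set in the carbon.properties file before the session starts), the lock path can be redirected to HDFS programmatically:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Sketch: keep CarbonData's lock files on HDFS instead of S3 so that
// locking does not depend on object-store semantics.
// The namenode address and directory are placeholders.
CarbonProperties.getInstance()
  .addProperty("carbon.lock.path", "hdfs://namenode:8020/user/carbon/locks")
```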