Register Dataset

Front-end developers need to tell Cloudberry which dataset to query and what the dataset looks like so that it can apply its optimization techniques.

To do this, send a DDL (Data Definition Language) JSON file to the Cloudberry /admin/register path using the HTTP POST method. This page explains how to write a DDL JSON file and how to send it to Cloudberry. The examples continue to use the Twitter dataset, whose schema is defined in Prepare Dataset.

DDL JSON

To declare a dataset schema to Cloudberry, write a JSON file including the following components:

dataset : the dataset (table) name in your database.
schema : the schema definition.
- typeName : (optional) type name for the dataset. (Only useful for AsterixDB)
- dimension : the columns to group by. They usually form the x-axis of a visualization.
- measurement : the columns to apply the aggregation functions on, such as count(), sum(), average(), min(), max(). They can also be used to filter the data but they should not be used as group by keys.
- primaryKey : the primary key column name.
- timeField : the time column name. Used for query slicing.

The following JSON request registers the Twitter dataset stored in AsterixDB with the middleware.

{
  "dataset":"twitter.ds_tweet",
  "schema":{
    "typeName":"twitter.typeTweet",
    "dimension":[
      {"name":"create_at","isOptional":false,"datatype":"Time"},
      {"name":"id","isOptional":false,"datatype":"Number"},
      {"name":"coordinate","isOptional":false,"datatype":"Point"},
      {"name":"lang","isOptional":false,"datatype":"String"},
      {"name":"is_retweet","isOptional":false,"datatype":"Boolean"},
      {"name":"hashtags","isOptional":true,"datatype":"Bag","innerType":"String"},
      {"name":"user_mentions","isOptional":true,"datatype":"Bag","innerType":"Number"},
      {"name":"user.id","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.stateID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.countyID","isOptional":false,"datatype":"Number"},
      {"name":"geo_tag.cityID","isOptional":false,"datatype":"Number"},
      {"name":"geo","isOptional":false,"datatype":"Hierarchy","innerType":"Number",
        "levels":[
          {"level":"state","field":"geo_tag.stateID"},
          {"level":"county","field":"geo_tag.countyID"},
          {"level":"city","field":"geo_tag.cityID"}]}
    ],
    "measurement":[
      {"name":"text","isOptional":false,"datatype":"Text"},
      {"name":"in_reply_to_status","isOptional":false,"datatype":"Number"},
      {"name":"in_reply_to_user","isOptional":false,"datatype":"Number"},
      {"name":"favorite_count","isOptional":false,"datatype":"Number"},
      {"name":"retweet_count","isOptional":false,"datatype":"Number"},
      {"name":"user.status_count","isOptional":false,"datatype":"Number"}
    ],
    "primaryKey":["id"],
    "timeField":"create_at"
  }
}
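A DDL like the one above can also be assembled in code rather than written by hand. The following is a minimal sketch, not part of Cloudberry: the helper names `column` and `make_ddl` are hypothetical, and the example declares only a subset of the Twitter columns, which the note below says is allowed.

```python
import json

def column(name, datatype, is_optional=False, inner_type=None):
    """Build one column entry in the shape used by the DDL above (hypothetical helper)."""
    entry = {"name": name, "isOptional": is_optional, "datatype": datatype}
    if inner_type is not None:
        entry["innerType"] = inner_type  # appears on Bag and Hierarchy columns in the example
    return entry

def make_ddl(dataset, dimension, measurement, primary_key, time_field, type_name=None):
    """Assemble the top-level DDL object (hypothetical helper)."""
    schema = {}
    if type_name is not None:
        schema["typeName"] = type_name  # optional; only useful for AsterixDB
    schema["dimension"] = dimension
    schema["measurement"] = measurement
    schema["primaryKey"] = primary_key
    schema["timeField"] = time_field
    return {"dataset": dataset, "schema": schema}

# A trimmed-down version of the Twitter DDL with only a few columns declared.
ddl = make_ddl(
    dataset="twitter.ds_tweet",
    type_name="twitter.typeTweet",
    dimension=[
        column("create_at", "Time"),
        column("id", "Number"),
        column("lang", "String"),
        column("hashtags", "Bag", is_optional=True, inner_type="String"),
    ],
    measurement=[
        column("text", "Text"),
        column("retweet_count", "Number"),
    ],
    primary_key=["id"],
    time_field="create_at",
)

if __name__ == "__main__":
    # Print the JSON document; redirect it to a file to use it with curl.
    print(json.dumps(ddl, indent=2))
```

Running the script prints a JSON document that can be saved to a file and sent to the register end point.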

Note:

Columns that are not interesting to visualization are not required to appear in the schema declaration.
isOptional: whether the column may be missing (in semi-structured databases) or null (in traditional relational databases).
datatype: the data type of the declared column. The available choices are described below.

Data Types

Cloudberry supports the following data types:

Boolean : boolean in databases.
Number : a superset including int8, int32, int64, float, double in databases.
Point : geo-location point composed of two Numbers, e.g. Point(80.00, -10.0).
Time : datetime in databases.
String : string in databases. It is usually used for dimension columns to do filtering and "group by".
Text : text in databases, or string for databases that do not support text. It applies only to measurement columns and is used for filtering by full-text search. It usually implies that an inverted index is built on the field.
Bag : set in databases (mainly AsterixDB; traditional relational databases usually do not support sets).
Hierarchy : A synthetic field that defines hierarchical relationships between the existing columns.
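Before sending a DDL, it can help to check it against the type list above. The sketch below is an illustration only: `check_ddl` is a hypothetical helper, and the rules it applies (known datatypes, Text only on measurement columns, Hierarchy levels and key fields pointing at declared columns) are consistency checks inferred from this page, not a description of what Cloudberry itself validates.

```python
SUPPORTED_TYPES = {
    "Boolean", "Number", "Point", "Time", "String", "Text", "Bag", "Hierarchy",
}

def check_ddl(ddl):
    """Return a list of problem descriptions; an empty list means none were found."""
    problems = []
    schema = ddl.get("schema", {})
    dimension = schema.get("dimension", [])
    measurement = schema.get("measurement", [])
    names = {col["name"] for col in dimension + measurement}

    for col in dimension + measurement:
        if col["datatype"] not in SUPPORTED_TYPES:
            problems.append(f"{col['name']}: unknown datatype {col['datatype']!r}")
        # Each Hierarchy level should refer to a column declared in the schema.
        for level in col.get("levels", []):
            if level["field"] not in names:
                problems.append(
                    f"{col['name']}: level {level['level']!r} refers to "
                    f"undeclared field {level['field']!r}"
                )

    # Text is described above as applicable to measurement columns only.
    for col in dimension:
        if col["datatype"] == "Text":
            problems.append(f"{col['name']}: Text column declared as a dimension")

    for key in schema.get("primaryKey", []):
        if key not in names:
            problems.append(f"primaryKey {key!r} is not a declared column")

    time_field = schema.get("timeField")
    if time_field not in names:
        problems.append(f"timeField {time_field!r} is not a declared column")

    return problems
```

An empty result only means that none of these assumed rules were violated; the server may still reject the request for other reasons.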

Pre-defined Functions

Cloudberry supports the following pre-defined functions for different data types:

Datatype	Filter	Groupby	Aggregation
Boolean	isTrue, isFalse	self	distinct-count
Number	<, >, ==, in, inRange	bin(scale)	count, sum, min, max, avg
Point	inRange	cell(scale)	count
Time	<, >, ==, inRange	interval(x hour)	count
String	contains, matchs, ~=	self	distinct-count, topK
Text	contains		distinct-count, topK (on word-token result)
Bag	contains		distinct-count, topK (on internal data)
Hierarchy		rollup
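The table above can be mirrored as a lookup structure, which is handy when a front end needs to decide which operations to offer for a column. This is a sketch under assumptions: the names are copied verbatim from the table with parameters such as scale omitted, empty cells are represented as empty lists, and `is_supported` is a hypothetical helper. The exact strings a Cloudberry query expects may differ.

```python
# One entry per row of the table; names copied as written there.
FUNCTIONS = {
    "Boolean": {"filter": ["isTrue", "isFalse"], "groupby": ["self"],
                "aggregation": ["distinct-count"]},
    "Number": {"filter": ["<", ">", "==", "in", "inRange"], "groupby": ["bin"],
               "aggregation": ["count", "sum", "min", "max", "avg"]},
    "Point": {"filter": ["inRange"], "groupby": ["cell"],
              "aggregation": ["count"]},
    "Time": {"filter": ["<", ">", "==", "inRange"], "groupby": ["interval"],
             "aggregation": ["count"]},
    "String": {"filter": ["contains", "matchs", "~="], "groupby": ["self"],
               "aggregation": ["distinct-count", "topK"]},
    "Text": {"filter": ["contains"], "groupby": [],
             "aggregation": ["distinct-count", "topK"]},
    "Bag": {"filter": ["contains"], "groupby": [],
            "aggregation": ["distinct-count", "topK"]},
    "Hierarchy": {"filter": [], "groupby": ["rollup"], "aggregation": []},
}

def is_supported(datatype, role, function):
    """Return True if the table lists `function` for `datatype` under `role`.

    `role` is one of "filter", "groupby", or "aggregation".
    """
    return function in FUNCTIONS.get(datatype, {}).get(role, [])
```

For example, `is_supported("Number", "aggregation", "sum")` is true, while `is_supported("Text", "groupby", "self")` is false because the table lists no group-by function for Text.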

`Register` End Point

The front-end application can send the DDL JSON file to the Cloudberry /admin/register path using the HTTP POST method. For example, the DDL above can be registered with the following command:

curl -X POST -H "Content-Type: application/json" -d @JSON_FILE_NAME http://localhost:9000/admin/register
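The same request can be issued from a script. The sketch below uses only the Python standard library and mirrors the curl command: a POST to /admin/register with a JSON body and a Content-Type of application/json. The function names `build_register_request` and `register` are hypothetical, and the shape of the response body is not documented on this page, so it is returned as plain text.

```python
import json
import urllib.request

def build_register_request(ddl, base_url="http://localhost:9000"):
    """Build (but do not send) the POST request that carries the DDL."""
    body = json.dumps(ddl).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def register(ddl, base_url="http://localhost:9000"):
    """Send the DDL and return (HTTP status code, response body as text).

    urllib raises urllib.error.HTTPError for non-success status codes and
    urllib.error.URLError if the server cannot be reached.
    """
    request = build_register_request(ddl, base_url)
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

To use it, load the DDL file with `json.load` and pass the resulting dictionary to `register`. Splitting request construction from sending keeps the first function easy to inspect without a running Cloudberry instance.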

To check the schemas of all datasets that have been successfully registered in Cloudberry, open the following URL:

http://localhost:9000/

Now that the dataset is registered with Cloudberry, you can move on to Query Cloudberry.
