Register Dataset
Front-end developers need to tell Cloudberry which dataset to query and what the dataset looks like so that Cloudberry can apply its optimization techniques.
To do this, send the DDL (Data Definition Language) JSON file to the Cloudberry `/admin/register` path using the HTTP `POST` method.
This page explains how to write a DDL JSON file and how to send it to Cloudberry. We again use the Twitter data as the example; its schema is defined in Prepare Dataset.
To declare a dataset schema to Cloudberry, write a JSON file including the following components:
- dataset : the dataset (table) name in your database.
- schema : the schema definition.
  - typeName : (optional) type name for the dataset. (Only useful for AsterixDB.)
  - dimension : the columns to `group by` on. They are usually the x-axis in a visualization figure.
  - measurement : the columns to apply aggregation functions on, such as `count()`, `sum()`, `average()`, `min()`, `max()`. They can also be used to filter the data, but they should not be used as `group by` keys.
  - primaryKey : the primary key column name.
  - timeField : the time column name. Used for query slicing.
The following JSON request can be used to register the Twitter dataset inside AsterixDB to the middleware.
{
"dataset":"twitter.ds_tweet",
"schema":{
"typeName":"twitter.typeTweet",
"dimension":[
{"name":"create_at","isOptional":false,"datatype":"Time"},
{"name":"id","isOptional":false,"datatype":"Number"},
{"name":"coordinate","isOptional":false,"datatype":"Point"},
{"name":"lang","isOptional":false,"datatype":"String"},
{"name":"is_retweet","isOptional":false,"datatype":"Boolean"},
{"name":"hashtags","isOptional":true,"datatype":"Bag","innerType":"String"},
{"name":"user_mentions","isOptional":true,"datatype":"Bag","innerType":"Number"},
{"name":"user.id","isOptional":false,"datatype":"Number"},
{"name":"geo_tag.stateID","isOptional":false,"datatype":"Number"},
{"name":"geo_tag.countyID","isOptional":false,"datatype":"Number"},
{"name":"geo_tag.cityID","isOptional":false,"datatype":"Number"},
{"name":"geo","isOptional":false,"datatype":"Hierarchy","innerType":"Number",
"levels":[
{"level":"state","field":"geo_tag.stateID"},
{"level":"county","field":"geo_tag.countyID"},
{"level":"city","field":"geo_tag.cityID"}]}
],
"measurement":[
{"name":"text","isOptional":false,"datatype":"Text"},
{"name":"in_reply_to_status","isOptional":false,"datatype":"Number"},
{"name":"in_reply_to_user","isOptional":false,"datatype":"Number"},
{"name":"favorite_count","isOptional":false,"datatype":"Number"},
{"name":"retweet_count","isOptional":false,"datatype":"Number"},
{"name":"user.status_count","isOptional":false,"datatype":"Number"}
],
"primaryKey":["id"],
"timeField":"create_at"
}
}
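If the DDL is generated by a script rather than written by hand, the same structure can be assembled programmatically. The following is a minimal Python sketch, not part of Cloudberry: the helper names `field` and `make_ddl` are hypothetical, and the schema is a trimmed subset of the Twitter example above.

```python
import json


def field(name, datatype, is_optional=False, inner_type=None):
    """Return one column declaration in the shape used by the DDL above."""
    column = {"name": name, "isOptional": is_optional, "datatype": datatype}
    if inner_type is not None:
        column["innerType"] = inner_type  # appears on Bag and Hierarchy columns
    return column


def make_ddl(dataset, dimension, measurement, primary_key, time_field,
             type_name=None):
    """Assemble the top-level DDL object: a dataset name plus its schema."""
    schema = {}
    if type_name is not None:
        schema["typeName"] = type_name  # optional, only useful for AsterixDB
    schema["dimension"] = dimension
    schema["measurement"] = measurement
    schema["primaryKey"] = primary_key
    schema["timeField"] = time_field
    return {"dataset": dataset, "schema": schema}


# A trimmed subset of the Twitter schema, for illustration only.
ddl = make_ddl(
    dataset="twitter.ds_tweet",
    type_name="twitter.typeTweet",
    dimension=[
        field("create_at", "Time"),
        field("id", "Number"),
        field("hashtags", "Bag", is_optional=True, inner_type="String"),
    ],
    measurement=[field("text", "Text")],
    primary_key=["id"],
    time_field="create_at",
)

# Serialize to the JSON text that would be sent to Cloudberry.
ddl_json = json.dumps(ddl, indent=2)
```

Writing `ddl_json` to a file gives a document with the same layout as the hand-written example.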
Note:
- Columns that are not of interest for visualization do not need to appear in the schema declaration.
- isOptional : whether the column can be missing (in semi-structured databases) or null (in traditional relational databases).
- datatype : the data type of the declared column; the choices are listed below.
Cloudberry supports the following data types:
- Boolean : `boolean` in databases.
- Number : a superset including `int8`, `int32`, `int64`, `float`, `double` in databases.
- Point : a geo-location point composed of two `Number`s, e.g. Point(80.00, -10.0).
- Time : `datetime` in databases.
- String : `string` in databases. It is usually used for dimension columns to do filtering and "group by".
- Text : `text` in databases, or `string` for databases that do not support `text`. It is only applicable to `measurement` columns, to filter by full-text search. Usually, it implies there is an inverted index built on the field.
- Bag : `set` in databases (mainly AsterixDB; traditional relational databases usually do not support `set`).
- Hierarchy : a synthetic field that defines hierarchical relationships between the existing columns.
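Some of these rules can be checked on the client side before a schema is sent. The sketch below is a hypothetical helper, not a Cloudberry API. It checks only three things stated on this page: the datatype is one of the supported names, `Text` is not used on a dimension column, and every `Hierarchy` level refers to a declared column. Whether Cloudberry itself enforces these rules is not assumed.

```python
SUPPORTED_DATATYPES = {
    "Boolean", "Number", "Point", "Time", "String", "Text", "Bag", "Hierarchy",
}


def check_schema(schema):
    """Return a list of human-readable problems found in a DDL schema dict."""
    problems = []
    dimension = schema.get("dimension", [])
    measurement = schema.get("measurement", [])
    columns = dimension + measurement
    declared = {col["name"] for col in columns}

    for col in columns:
        name, datatype = col["name"], col["datatype"]
        if datatype not in SUPPORTED_DATATYPES:
            problems.append(f"{name}: unsupported datatype {datatype!r}")
        if datatype == "Hierarchy":
            # Each level must point at a column declared in this schema.
            for level in col.get("levels", []):
                if level["field"] not in declared:
                    problems.append(
                        f"{name}: level {level['level']!r} refers to "
                        f"undeclared column {level['field']!r}"
                    )

    for col in dimension:
        if col["datatype"] == "Text":
            problems.append(
                f"{col['name']}: Text is only applicable to measurement columns"
            )

    return problems
```

An empty list means none of the three checks failed; it does not guarantee that Cloudberry will accept the schema.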
Cloudberry supports the following pre-defined functions for different data types:
Datatype | Filter | Groupby | Aggregation |
---|---|---|---|
Boolean | isTrue, isFalse | self | distinct-count |
Number | <, >, ==, in, inRange | bin(scale) | count, sum, min, max, avg |
Point | inRange | cell(scale) | count |
Time | <, >, ==, inRange | interval(x hour) | count |
String | contains, matches, ~= | self | distinct-count, topK |
Text | contains | | distinct-count, topK (on word-token result) |
Bag | contains | | distinct-count, topK (on internal data) |
Hierarchy | | rollup | |
The front-end application can send the DDL JSON file to the Cloudberry `/admin/register` path using the HTTP `POST` method.
For example, the DDL above can be registered with the following command:
curl -X POST -H "Content-Type: application/json" -d @JSON_FILE_NAME http://localhost:9000/admin/register
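The same request can be issued from a script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the address `http://localhost:9000` is taken from the curl example, and nothing is assumed about the format of the response body.

```python
import json
import urllib.request

# Local Cloudberry address, as used in the curl example.
CLOUDBERRY_URL = "http://localhost:9000"


def build_register_request(ddl, base_url=CLOUDBERRY_URL):
    """Build (but do not send) the POST request for /admin/register."""
    body = json.dumps(ddl).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/admin/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register_dataset(ddl, base_url=CLOUDBERRY_URL, timeout=10):
    """Send the DDL to Cloudberry and return (HTTP status, response text)."""
    request = build_register_request(ddl, base_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `register_dataset(ddl)` with the DDL loaded as a Python dict (for example via `json.load`) sends the same payload as the curl command. Building the request separately from sending it makes it possible to inspect the URL, headers, and body without a running server.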
You can visit the following URL to check the schemas of all datasets that have been successfully registered in Cloudberry.
http://localhost:9000/
Now that the dataset is registered with Cloudberry, you can move on to Query Cloudberry.