-
Notifications
You must be signed in to change notification settings - Fork 82
Query Cloudberry
After Register Dataset to Cloudberry, the front-end application or the user can send queries to Cloudberry through either of the following two end-points.
Example URL
http://localhost:9000/berry
Example Query
Suppose you create a file query.json
with the following content.
{
"dataset": "twitter.ds_tweet",
"select" : {
"order" : ["-create_at"],
"limit": 1,
"offset": 0,
"field": ["*"]
}
}
This is a simple query to select only one record from twitter.ds_tweet
dataset.
Example Command
curl -X POST -H "Content-Type: application/json" -d @query.json http://localhost:9000/berry
Example Response
[[
{
"place.bounding_box":[
[-71.625705, 42.310982],
[-71.475702, 42.380659]
],
"favorite_count":0,
"geo_tag.countyID":25017,
"user.id":111986159,
"geo_tag.cityID":2538715,
"is_retweet":false,
"text":"Brewing with the back to be or tomorrow at the best to offer they are @ Deep County, CA [Portland Familes, San Maraporonic Makers -",
"retweet_count":0,
"in_reply_to_user":-1,
"id":1129340818383343622,
"coordinate":[-71.50138889, 42.36916667],
"in_reply_to_status":-1,
"geo_tag.stateID":25,
"create_at":"2019-05-17T11:00:07.000Z",
"lang":"en",
"user.profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
"user.name":"Hazel Dashawn"
}
]]
To minimize the overhead of communication, Cloudberry supports the Websocket
connection when front-end application needs to fluently send queries to and receive responses from Cloudberry, or needs responses pushed by Cloudberry automatically (e.g. when query slicing is applied).
Example URL
ws://localhost:9000/ws
Example code
Open a Websocket
connection to Cloudberry in Javascript as following:
var ws = new WebSocket("ws://localhost:9000/ws"");
Query format of Websocket
end-point is the same as HTTP POST
end-point, and is described as following.
A query is composed of the following parameters:
- Dataset : the dataset to query on.
-
Unnest : to flatten a record based on the nested
Bag
attribute to generate multiple records. (Only useful for semi-structured databases, e.g. AsterixDB.) - Filter : a set of selection predicates.
-
Group :
-
by : to specify the
group by
fields. -
aggregate: to specify the aggregation functions to apply, including
count
,sum
,min
,max
andavg
.
-
by : to specify the
-
Select:
- field : a set of fields that should be returned.
-
order : to specify the
order by
fields. - limit : to specify the number of records that will be returned at most.
-
offset : to specify the pagination combined with
limit
value.
Here are some examples as following.
- Get 100 latest tweets (time and id) that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": [ "hurricane"]
}],
"select" : {
"order" : [ "-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
}
}
Expected results are as following:
[[
{"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
{"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
{"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
...
]]
- Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018.
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "create_at",
"relation": "inRange",
"values": [ "2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
},
{
"field": "text",
"relation": "contains",
"values": [ "hurricane", "florence" ]
}
],
"group": {
"by": [
{
"field": "geo_tag.stateID",
"as": "state"
},
{
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "day"
}
},
"as": "day"
}
],
"aggregate": [
{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}
]
}
}
Expected results are as following:
[[
{"state":6,"day":"2018-04-05T00:00:00.000Z","count":1},
{"state":6,"day":"2018-09-11T00:00:00.000Z","count":1},
{"state":8,"day":"2018-03-28T00:00:00.000Z","count":1},
{"state":12,"day":"2018-03-01T00:00:00.000Z","count":1}
...
]]
- Get top-10 related hashtags for tweets that mention "hurricane".
{
"dataset": "twitter.ds_tweet",
"filter": [
{
"field": "text",
"relation": "contains",
"values": [ "hurricane"]
}
],
"unnest" : [{ "hashtags": "tag"}],
"group": {
"by": [
{ "field": "tag" }
],
"aggregate": [
{
"field" : "*",
"apply" : {
"name": "count"
},
"as" : "count"
}
]
},
"select" : {
"order" : [ "-count"],
"limit": 10,
"offset" : 0
}
}
Expected results are as following:
[[
{"tag":"Hurricane","count":102},
{"tag":"Florida","count":92},
{"tag":"hurricane","count":92},
...
]]
Cloudberry supports automatic query-slicing on the timeField
. The front-end can specify a response time limit for each
"small query" to get the results progressively.
{
...
"option":{
"sliceMillis": 2000
}
}
For example, the following query asks the top-10 hashtags with an option to accept an updated results every 200ms.
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["zika"]
}],
"unnest": [{
"hashtags": "tag"
}],
"group": {
"by": [{
"field": "tag"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
},
"select": {
"order": ["-count"],
"limit": 10,
"offset": 0
},
"option": {
"sliceMillis": 200
}
}
There will be a stream of results return from Cloudberry as following:
[[{"tag":"Zika","count":3},{"tag":"ColdWater","count":1},{"tag":"Croatia","count":1}, ... ]]
[[{"tag":"Zika","count":4},{"tag":"Croatia","count":1},{"tag":"OperativoNU","count":1}, ... ]]
[[{"tag":"trndnl","count":6},{"tag":"Zika","count":4},{"tag":"ProjectHomeLouisDay","count":1}, ... ]]
...
Sometimes the front-end wants to slice multiple queries simultaneously so that it can show multiple consistent results.
In this case, it can wrap the queries inside the batch
field and specify only one option
field.
{
"batch" : [
{ request1 },
{ request2 }
],
"option" : {
"sliceMillis": 2000
}
}
E.g., the following query shows an batch
example that asks the by-state count and the top-10 hashtags and these two
queries should be sliced synchronized.
{
"batch": [{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "create_at",
"relation": "inRange",
"values": ["2016-01-01T00:00:00.000Z", "2016-12-31T00:00:00.000Z"]
}, {
"field": "text",
"relation": "contains",
"values": ["zika", "virus"]
}],
"group": {
"by": [{
"field": "geo_tag.stateID",
"as": "state"
}, {
"field": "create_at",
"apply": {
"name": "interval",
"args": {
"unit": "hour"
}
},
"as": "hour"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
}
}, {
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": ["zika"]
}],
"unnest": [{
"hashtags": "tag"
}],
"group": {
"by": [{
"field": "tag"
}],
"aggregate": [{
"field": "*",
"apply": {
"name": "count"
},
"as": "count"
}]
},
"select": {
"order": ["-count"],
"limit": 10,
"offset": 0
}
}],
"option": {
"sliceMillis": 200
}
}
The response is as following:
[
[ {"state":6,"hour":"2016-08-05T10:00:00.000Z","count":1}, {"state":12,"hour":"2016-07-26T10:00:00.000Z","count":1}, ...],
[ {"tag":"trndnl","count":6},{"tag":"Zika","count":5},{"tag":"ColdWater","count":1}, ...]
]
[
[ {"state":72,"hour":"2016-05-06T10:00:00.000Z","count":1},{"state":48,"hour":"2016-09-09T10:00:00.000Z","count":2}, ...],
[ {"tag":"trndnl","count":6},{"tag":"Zika","count":6},{"tag":"Croatia","count":1}, ...]
]
...
The front end can optionally add a "transform" operation in JSON request to define the post-processing operations.
For example, front-ends can define a wrap
operation to wrap the whole response in a key-value pair JSON object in which the key
is pre-defined. The following request asks the Cloudberry to wrap the result in the value with the key of sample
:
{
"dataset": "twitter.ds_tweet",
"filter": [{
"field": "text",
"relation": "contains",
"values": [ "zika"]
}],
"select" : {
"order" : [ "-create_at"],
"limit": 100,
"offset" : 0,
"field": ["create_at", "id"]
}
"transform" : {
"warp": {
"key": "sample"
}
}
}
The response is as below:
{
"key":"sample",
"value":[[
{"create_at":"2016-10-04T10:00:17.000Z","id":783351045829357568,"user.id":439304013},
{"create_at":"2016-09-09T10:00:28.000Z","id":774291393749643264,"user.id":2870762297},
{"create_at":"2016-09-09T10:00:08.000Z","id":774291307858722820,"user.id":2870783428},
{"create_at":"2016-09-07T10:00:15.000Z","id":773566563895042049,"user.id":2870762297},
{"create_at":"2016-09-06T10:39:19.000Z","id":773214008237318144,"user.id":3783815248},
{"create_at":"2016-08-31T10:00:24.000Z","id":771029887025090560,"user.id":2866011003},
{"create_at":"2016-08-25T10:00:07.000Z","id":768855489073455104,"user.id":115230811},
{"create_at":"2016-08-19T10:00:36.000Z","id":766681282453594112,"user.id":254059750},
{"create_at":"2016-08-09T10:00:35.000Z","id":763057397464043521,"user.id":383512034},
{"create_at":"2016-08-08T10:00:29.000Z","id":762694985355436032,"user.id":518129224}
]]
}
wrap
transformation is often preferable when the front-end send many different requests in the same WebSocket interface.