Skip to content
QIUSHI BAI edited this page Aug 11, 2019 · 6 revisions

After Register Dataset to Cloudberry, the front-end application or the user can send queries to Cloudberry through either of the following two end-points.

POST /berry

Example URL

http://localhost:9000/berry

Example Query

Suppose you create a file query.json with the following content.

{
  "dataset": "twitter.ds_tweet",
  "select" : {
    "order" : ["-create_at"],
    "limit": 1,
    "offset": 0,
    "field": ["*"]
  }
}

This is a simple query to select only one record from twitter.ds_tweet dataset.

Example Command

curl -X POST -H "Content-Type: application/json" -d @query.json http://localhost:9000/berry

Example Response

[[
  {
      "place.bounding_box":[
         [-71.625705, 42.310982],
         [-71.475702, 42.380659]
      ],
      "favorite_count":0,
      "geo_tag.countyID":25017,
      "user.id":111986159,
      "geo_tag.cityID":2538715,
      "is_retweet":false,
      "text":"Brewing with the back to be or tomorrow at the best to offer they are @ Deep County, CA [Portland Familes, San Maraporonic Makers -",
      "retweet_count":0,
      "in_reply_to_user":-1,
      "id":1129340818383343622,
      "coordinate":[-71.50138889, 42.36916667],
      "in_reply_to_status":-1,
      "geo_tag.stateID":25,
      "create_at":"2019-05-17T11:00:07.000Z",
      "lang":"en",
      "user.profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png",
      "user.name":"Hazel Dashawn"
   }
]]

Websocket /ws

To minimize the overhead of communication, Cloudberry supports the Websocket connection when front-end application needs to fluently send queries to and receive responses from Cloudberry, or needs responses pushed by Cloudberry automatically (e.g. when query slicing is applied).

Example URL

ws://localhost:9000/ws

Example code

Open a Websocket connection to Cloudberry in Javascript as following:

var ws = new WebSocket("ws://localhost:9000/ws"");

Query format of Websocket end-point is the same as HTTP POST end-point, and is described as following.

Query Format

A query is composed of the following parameters:

  • Dataset : the dataset to query on.
  • Unnest : to flatten a record based on the nested Bag attribute to generate multiple records. (Only useful for semi-structured databases, e.g. AsterixDB.)
  • Filter : a set of selection predicates.
  • Group :
    • by : to specify the group by fields.
    • aggregate: to specify the aggregation functions to apply, including count, sum, min, max and avg.
  • Select:
    • field : a set of fields that should be returned.
    • order : to specify the order by fields.
    • limit : to specify the number of records that will be returned at most.
    • offset : to specify the pagination combined with limit value.

Here are some examples as following.

Example 1 - Keyword Search

  • Get 100 latest tweets (time and id) that mention "hurricane".
{
  "dataset": "twitter.ds_tweet",
  "filter": [{
    "field": "text",
    "relation": "contains",
    "values": [ "hurricane"]
  }],
  "select" : {
    "order" : [ "-create_at"],
    "limit": 100,
    "offset" : 0,
    "field": ["create_at", "id"]
  }
}

Expected results are as following:

[[
 {"create_at":"2019-05-16T17:00:17.000Z","id":1129069066373799936},
 {"create_at":"2019-05-14T21:34:07.000Z","id":1128413203938127873},
 {"create_at":"2019-05-14T16:12:34.000Z","id":1128332285806354438},
 ...
]]

Example 2 - Group By + Count

  • Get the per-state and per-day count of tweets that contain "hurricane" and "florence" in 2018.
{
  "dataset": "twitter.ds_tweet",
  "filter": [
    {
      "field": "create_at",
      "relation": "inRange",
      "values": [ "2018-01-01T00:00:00.000Z", "2018-12-31T00:00:00.000Z"]
    },
    {
      "field": "text",
      "relation": "contains",
      "values": [ "hurricane", "florence" ]
    }
  ],
  "group": {
    "by": [
        {
          "field": "geo_tag.stateID",
          "as": "state"
        },
        {
          "field": "create_at",
          "apply": {
            "name": "interval",
            "args": {
              "unit": "day"
            }
          },
          "as": "day"
        }
      ],
    "aggregate": [
      {
        "field": "*",
        "apply": {
          "name": "count"
        },
        "as": "count"
      }
      ]
  }
}

Expected results are as following:

[[
    {"state":6,"day":"2018-04-05T00:00:00.000Z","count":1},
    {"state":6,"day":"2018-09-11T00:00:00.000Z","count":1},
    {"state":8,"day":"2018-03-28T00:00:00.000Z","count":1},
    {"state":12,"day":"2018-03-01T00:00:00.000Z","count":1}
    ...
]]

Example 3 - Unnest Bag

  • Get top-10 related hashtags for tweets that mention "hurricane".
{
  "dataset": "twitter.ds_tweet",
  "filter": [
  {
    "field": "text",
    "relation": "contains",
    "values": [ "hurricane"]
  }
  ],
  "unnest" : [{ "hashtags": "tag"}],
  "group": {
    "by": [
      { "field": "tag" }
    ],
    "aggregate": [
      {
        "field" : "*",
        "apply" : {
          "name": "count"
        },
        "as" : "count"
      }
    ]
  },
  "select" : {
    "order" : [ "-count"],
    "limit": 10,
    "offset" : 0
  }
}

Expected results are as following:

[[
  {"tag":"Hurricane","count":102},
  {"tag":"Florida","count":92},
  {"tag":"hurricane","count":92},
  ...
]]

Advanced Query Options

Cloudberry supports automatic query-slicing on the timeField. The front-end can specify a response time limit for each "small query" to get the results progressively.

{
 ...
 "option":{
   "sliceMillis": 2000  
 }
}

For example, the following query asks the top-10 hashtags with an option to accept an updated results every 200ms.

{
    "dataset": "twitter.ds_tweet",
    "filter": [{
        "field": "text",
        "relation": "contains",
        "values": ["zika"]
    }],
    "unnest": [{
        "hashtags": "tag"
    }],
    "group": {
        "by": [{
            "field": "tag"
        }],
        "aggregate": [{
            "field": "*",
            "apply": {
                "name": "count"
            },
            "as": "count"
        }]
    },
    "select": {
        "order": ["-count"],
        "limit": 10,
        "offset": 0
    },
    "option": {
        "sliceMillis": 200
    }
}

There will be a stream of results return from Cloudberry as following:

[[{"tag":"Zika","count":3},{"tag":"ColdWater","count":1},{"tag":"Croatia","count":1}, ... ]]
[[{"tag":"Zika","count":4},{"tag":"Croatia","count":1},{"tag":"OperativoNU","count":1}, ... ]]
[[{"tag":"trndnl","count":6},{"tag":"Zika","count":4},{"tag":"ProjectHomeLouisDay","count":1}, ... ]]
...

Format of multiple requests

Sometimes the front-end wants to slice multiple queries simultaneously so that it can show multiple consistent results. In this case, it can wrap the queries inside the batch field and specify only one option field.

{
  "batch" : [
    { request1 },
    { request2 }
  ],
  "option" : {
    "sliceMillis": 2000  
  }
}

E.g., the following query shows an batch example that asks the by-state count and the top-10 hashtags and these two queries should be sliced synchronized.

{
    "batch": [{
        "dataset": "twitter.ds_tweet",
        "filter": [{
            "field": "create_at",
            "relation": "inRange",
            "values": ["2016-01-01T00:00:00.000Z", "2016-12-31T00:00:00.000Z"]
        }, {
            "field": "text",
            "relation": "contains",
            "values": ["zika", "virus"]
        }],
        "group": {
            "by": [{
                "field": "geo_tag.stateID",
                "as": "state"
            }, {
                "field": "create_at",
                "apply": {
                    "name": "interval",
                    "args": {
                        "unit": "hour"
                    }
                },
                "as": "hour"
            }],
            "aggregate": [{
                "field": "*",
                "apply": {
                    "name": "count"
                },
                "as": "count"
            }]
        }
    }, {
        "dataset": "twitter.ds_tweet",
        "filter": [{
            "field": "text",
            "relation": "contains",
            "values": ["zika"]
        }],
        "unnest": [{
            "hashtags": "tag"
        }],
        "group": {
            "by": [{
                "field": "tag"
            }],
            "aggregate": [{
                "field": "*",
                "apply": {
                    "name": "count"
                },
                "as": "count"
            }]
        },
        "select": {
            "order": ["-count"],
            "limit": 10,
            "offset": 0
        }
    }],
    "option": {
        "sliceMillis": 200
    }
}

The response is as following:

[
  [ {"state":6,"hour":"2016-08-05T10:00:00.000Z","count":1}, {"state":12,"hour":"2016-07-26T10:00:00.000Z","count":1}, ...],
  [ {"tag":"trndnl","count":6},{"tag":"Zika","count":5},{"tag":"ColdWater","count":1}, ...]
]
[
  [ {"state":72,"hour":"2016-05-06T10:00:00.000Z","count":1},{"state":48,"hour":"2016-09-09T10:00:00.000Z","count":2}, ...],
  [ {"tag":"trndnl","count":6},{"tag":"Zika","count":6},{"tag":"Croatia","count":1}, ...]
]
...

Transform response format

The front end can optionally add a "transform" operation in JSON request to define the post-processing operations. For example, front-ends can define a wrap operation to wrap the whole response in a key-value pair JSON object in which the key is pre-defined. The following request asks the Cloudberry to wrap the result in the value with the key of sample:

{
  "dataset": "twitter.ds_tweet",
  "filter": [{
    "field": "text",
    "relation": "contains",
    "values": [ "zika"]
  }],
  "select" : {
    "order" : [ "-create_at"],
    "limit": 100,
    "offset" : 0,
    "field": ["create_at", "id"]
  }
  "transform" : {
    "warp": {
      "key": "sample"
    }
  }
}

The response is as below:

{
  "key":"sample",
  "value":[[
    {"create_at":"2016-10-04T10:00:17.000Z","id":783351045829357568,"user.id":439304013},
    {"create_at":"2016-09-09T10:00:28.000Z","id":774291393749643264,"user.id":2870762297},
    {"create_at":"2016-09-09T10:00:08.000Z","id":774291307858722820,"user.id":2870783428},
    {"create_at":"2016-09-07T10:00:15.000Z","id":773566563895042049,"user.id":2870762297},
    {"create_at":"2016-09-06T10:39:19.000Z","id":773214008237318144,"user.id":3783815248},
    {"create_at":"2016-08-31T10:00:24.000Z","id":771029887025090560,"user.id":2866011003},
    {"create_at":"2016-08-25T10:00:07.000Z","id":768855489073455104,"user.id":115230811},
    {"create_at":"2016-08-19T10:00:36.000Z","id":766681282453594112,"user.id":254059750},
    {"create_at":"2016-08-09T10:00:35.000Z","id":763057397464043521,"user.id":383512034},
    {"create_at":"2016-08-08T10:00:29.000Z","id":762694985355436032,"user.id":518129224}
  ]]
}

wrap transformation is often preferable when the front-end send many different requests in the same WebSocket interface.