---

copyright:
  years: 2017
lastupdated: "2017-08-18"

---
{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:tip: .tip}
{:pre: .pre}
{:codeblock: .codeblock}
{:screen: .screen}
{:javascript: .ph data-hd-programlang='javascript'}
{:java: .ph data-hd-programlang='java'}
{:python: .ph data-hd-programlang='python'}
{:swift: .ph data-hd-programlang='swift'}
The {{site.data.keyword.discoveryfull}} service offers powerful content search capabilities through queries. After your content is uploaded and enriched by the {{site.data.keyword.discoveryshort}} service, you can build queries, integrate {{site.data.keyword.discoveryshort}} into your own projects, or create a custom application by using the {{site.data.keyword.watson}} Explorer Application Builder. To get started with queries, see Building queries and delivering content.
{: shortdesc}

## Parameter descriptions
{: #parameter-descriptions}

Query parameters enable you to search your collection and customize the output of the data that is returned.
`filter`
: A cacheable query that excludes any documents that don't mention the query content. Filter search results are not returned in order of relevance. (These queries are written using the {{site.data.keyword.discoveryshort}} Query Language.)

`query`
: A query search returns all documents in your data set, with full enrichments and full text, in order of relevance. A query also excludes any documents that don't mention the query content. (These queries are written using the {{site.data.keyword.discoveryshort}} Query Language.)

`aggregation`
: Aggregation queries return a count of documents matching a set of data values; for example, top keywords, overall sentiment of entities, and more. For the full list of aggregation options, see the Aggregations table. (These queries are written using the {{site.data.keyword.discoveryshort}} Query Language.)

`natural_language_query`
: A natural language query enables you to perform queries expressed in natural language, as might be received from an end user in a conversational or free-text interface; for example: "IBM Watson in healthcare". The parameter uses the entire input as the query text. It does not recognize operators. The `natural_language_query` parameter enables capabilities such as passage search and relevancy training.
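For example, the following requests contrast a {{site.data.keyword.discoveryshort}} Query Language search with a natural language search. This is a minimal sketch: the credentials, environment ID, and collection ID are placeholders, and `enriched_text.concepts.text` is an example enriched field.

```bash
# Query Language search: operators such as ":" are recognized
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&query=enriched_text.concepts.text:IBM"

# Natural language search: the entire string is treated as query text, and operators are not parsed
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&natural_language_query=IBM%20Watson%20in%20healthcare"
```
{: pre}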
If you test the same search term on a small data set, you might find that the `filter` and `query` parameters return very similar (if not identical) results. However, there is a difference between the two parameters:

- Using a `filter` parameter alone returns search results in no specific order.
- Using a `query` parameter alone returns search results in order of relevance.

In large data sets, if you need results returned in order of relevance, combine the `filter` and `query` parameters, because using them together improves performance. The `filter` parameter runs first and caches results; the `query` parameter then ranks them. For an example of using filters and queries together, see Building combined queries. Filters can also be used in aggregations.
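A sketch of such a combined call (placeholder credentials and IDs; the field paths are examples, not required names):

```bash
# The filter runs first and caches the matching set; the query then ranks that set by relevance
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&filter=enriched_text.entities.text:IBM&query=enriched_text.concepts.text:cloud%20computing"
```
{: pre}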
When you write a query that includes a `filter` parameter plus an `aggregation`, `query`, or `natural_language_query` parameter, the `filter` parameter runs first; after that, any `aggregation`, `query`, or `natural_language_query` parameters run in parallel.
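For instance, the following sketch (placeholder credentials and IDs; the field paths are examples) applies a `filter` and an `aggregation` in the same request. The filter narrows the document set first, and the aggregation then runs over that set:

```bash
# Filter to documents that mention IBM, then count the top concepts in that set
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&filter=enriched_text.entities.text:IBM&aggregation=term(enriched_text.concepts.text,count:10)"
```
{: pre}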
With a simple query, especially on a small data set, `filter` and `query` often return the exact same (or similar) results. If a `filter` and a `query` call return similar results, and getting a response in order of relevance does not matter, it is better to use `filter`, because filter calls are faster and are cached. Caching means that the next time you make that call, you get a much quicker response, particularly in a big data set.
- `count`: The number of documents that you want returned in the response.
- `offset`: The number of query results to skip at the beginning. For example, if the total number of results that are returned is 10 and the offset is 8, the last two results are returned.
- `return`: A comma-separated list of the portions of the document hierarchy to return. Any part of the document hierarchy is a valid value.
- `sort`: A comma-separated list of fields in the document to sort on. You can optionally specify a sort direction by prefixing the field with `-` for descending order or `+` for ascending order. Ascending order is the default sort direction. The `sort` parameter is currently available for use only with the API; it is not available through the tooling. (A combined example that uses `count`, `offset`, `return`, and `sort` follows the `highlight` example at the end of this list.)
- `passages`: A Boolean that specifies whether the service returns a set of the most relevant passages from the documents returned by a query that uses the `natural_language_query` parameter. The passages are generated by sophisticated Watson algorithms to determine the best passages of text from all of the documents returned by the query. The default is `false`.

The `passages` parameter lists matching passages before it lists matching documents, as shown in the following example from a collection about dogs. The query is shown at the top of the example.

The `passages` parameter can be used only on private collections. It cannot be used on the {{site.data.keyword.discoverynewsfull}} collection.
{: tip}
```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&natural_language_query='collie'&passages=true"
```
{: pre}
The JSON that is returned will be of the following format:
```json
{
  "matching_results":2,
  "passages":[
    {
      "document_id":"ab7be56bcc9476493516b511169739f0",
      "passage_score":15.230205287402338,
      "passage_text":">\n\n\n<h1>Dogs</h1>\n<p>Dogs are popular pets. Some notable breeds include German shepherds, collies, sharpeis, and beagles. Shepherds and collies bark. Sharpeis sharp. Beagles howl.</p>\n<h2>Collie</h2"
    },
    {
      "document_id":"fbb5dcb4d8a6a29f572ebdeb6fbed20e",
      "passage_score":10.153470191601558,
      "passage_text":"> \n</tr>\n<tr>\n <td>Mandy</td>\n <td>German shepherd/collie mix</td>\n <td>7 years</td> \n <td>Rat bite</td> \n <td>N/A</td> \n</tr>\n<tr>\n <td>Mindy</td>\n <td>Border Collie/terrier mix</td>\n <td>17"
    },
    ...
```
{: codeblock}
- `passages.fields`: A comma-separated list of fields in the index that passages are drawn from. If this parameter is not specified, all top-level fields are included. Note: The `passages.fields` parameter is supported only on private collections. It is not supported in the {{site.data.keyword.discoverynewsfull}} collection.
- `passages.count`: The maximum number of passages to return. The search returns fewer passages if that is the total number found. The default is `10`. The maximum is `100`. Note: The `passages.count` parameter is supported only on private collections. It is not supported in the {{site.data.keyword.discoverynewsfull}} collection.
- `passages.characters`: The approximate number of characters that any one passage should have. The default is `400`. The minimum is `50`. The maximum is `2000`. Note: The `passages.characters` parameter is supported only on private collections. It is not supported in the {{site.data.keyword.discoverynewsfull}} collection.
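These parameters can be combined with the earlier passage example. The following sketch (placeholder credentials and IDs) requests at most five passages of roughly 200 characters each, drawn only from the `text` field:

```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&natural_language_query=collie&passages=true&passages.count=5&passages.fields=text&passages.characters=200"
```
{: pre}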
- `highlight`: A Boolean that specifies whether the returned output includes a `highlight` object in which the keys are field names and the values are arrays that contain segments of query-matching text highlighted by the HTML `<em>` tag.

The output lists the `highlight` object after the `enriched_text` object, as shown in the following example.
```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&natural_language_query=beagle&highlight=true"
```
{: pre}
The JSON that is returned will be of the following format:
```json
{
  "matching_results":4,
  "results":[
    {
      "id":"ce78aba1288151fe5dc391cb480e9f98",
      "score":1.212413,
      "extracted_metadata":{
        "title":"Liam and pets",
        "sha1":"9a617d49dde77019dc324666767bf08cfd7f30f8",
        "filename":"disco-Liam.html20170718-10-ki3l59.html",
        "file_type":"html"
      },
      ...
      "highlight":{
        "enriched_text.relations.sentence":[
          " He would very much like to get a <em>beagle</em>.",
          " Getting a <em>beagle</em> would make Liam happy."
        ],
        "enriched_text.relations.subject.keywords.text":[
          "<em>beagle</em>"
        ],
        "enriched_text.relations.subject.text":[
          "a <em>beagle</em>"
        ],
        "text":[
          "Liam and pets\n\nLiam\n\nLiam does not currently have a pet of his own. He would very much like to get a <em>beagle</em>. Getting a <em>beagle</em> would make Liam happy."
        ],
        "enriched_text.relations.object.keywords.text":[
          "<em>beagle</em>"
        ],
        "enriched_text.keywords.text":[
          "<em>beagle</em>"
        ],
        "enriched_text.relations.object.text":[
          "a <em>beagle</em>"
        ],
        "html":[
          "</h1>\n<p>Liam does not currently have a pet of his own. He would very much like to get a <em>beagle</em>",
          ". Getting a <em>beagle</em> would make Liam happy.\n\n</p></body></html>"
        ]
      }
    },
    ...
```
{: codeblock}
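The pagination and sorting parameters described earlier in this list can also be combined in a single request. The following is a sketch with placeholder credentials and IDs; `enriched_text.sentiment.document.score` is used here as an example sortable field and might not exist in your collection:

```bash
# Skip the first 10 results, return the next 10, limit the returned fields, and sort by descending score
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&query=enriched_text.concepts.text:IBM&count=10&offset=10&return=extracted_metadata.title,text&sort=-enriched_text.sentiment.document.score"
```
{: pre}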
## Operators
{: #operators}

Operators are the separators between different parts of a query. These are the available operators:
| Operator | Description | Example |
|----------|-------------|---------|
| `.` | Delimiter that separates the levels of hierarchy in the JSON schema. | `enriched_text.concepts.text` |
| `:` | Specifies a match for the query term. | `enriched_text.concepts.text:cloud computing` |
| `::` | Specifies an exact match for the query term. | `enriched_text.concepts.text::cloud computing` |
| `:!` | Specifies that the results do not contain a match for the query term. | `enriched_text.concepts.text:!cloud computing` |
| `::!` | Specifies that the results do not exactly match the query term. | `enriched_text.concepts.text::!cloud computing` |
| `\` | Escape character for queries that require the ability to query terms by using string literals that contain control characters. | `enriched_text.concepts.text:\!cloud computing` |
| `"IBM Watson"` | All contents of a phrase query are processed as escaped, so no special characters within a phrase query are parsed. For example, double quotation marks (`"`) inside a phrase query must be escaped (`\"`). Use phrase queries with full-text, rank-based queries, not with Boolean filter operations. Note: Single quotation marks (`'`) are not supported. | `enriched_text.concepts.text:\"IBM watson\"` |
| `~` + number | The number of single-character changes that need to be made to one string to make it the same as another string. For example, `car~1` matches `car`, `cap`, `cat`, `can`, and so on. | `query-enriched_text.concepts.text:Watson~3` |
| `()`, `[]` | Logical groupings can be formed to specify more specific information. | `filter-entities:(text:IBM,type:Company)` |
| `\|` | Boolean operator for "or". | `query-enriched.entities.text:Google\|IBM` |
| `,` | Boolean operator for "and". | `query-enriched.entities.text:Google,IBM` |
| `<=`, `>=`, `>`, `<` | Creates numerical comparisons of less than or equal to, greater than or equal to, greater than, and less than. | |
| `^multiplier` | Increases the score value of a search term. | `query-enriched_text.concepts.text:IBM^3` |
| `*` | Matches unknown characters in a search expression. | `query-enriched_text.concepts.text:IBM*` |
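Operators from this table can be combined in a single `query` or `filter` parameter. The following is a sketch with placeholder credentials and IDs; `enriched_text.entities` is an example field path, and special characters in the query string might need to be URL-encoded, depending on your client:

```bash
# Group entity conditions with parentheses and combine them with "," (and)
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&filter=enriched_text.entities:(text:IBM,type:Company)"
```
{: pre}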
## Aggregations
{: #aggregations}

Aggregation queries return a set of data values. These are the available aggregations:
| Aggregation | Description | Example |
|-------------|-------------|---------|
| `term` | Returns the top values (by score and by frequency) for the selected enrichments. All enrichments are valid values. You can optionally use `count` to specify the number of terms to return. This example returns the full text and enrichments of the top values with the concept enrichment, and specifies that 10 terms be returned. | `term(enriched_text.concepts.text,count:10)` |
| `filter` | A modifier that narrows down the document set of the aggregation query it precedes. This example filters down to the set of documents that include the concept Cloud computing. | `filter(enriched_text.concepts.text:cloud computing)` |
| `nested` | Applying `nested` before an aggregation query restricts the aggregation to the area of the results specified. For example, `nested(enriched_text.entities)` means that only the `enriched_text.entities` components of any result are used to aggregate against. | `nested(enriched_text.entities)` |
| `histogram` | Creates numeric interval segments to categorize documents. Uses field values to describe the category. Valid field values are what you specify in your metadata or what is enriched in your content. For example, you could have a field value called `price`. Use intervals to define the sections that the results are split into. Valid interval values are set to make sense with what you're using as your field value. For example, if your data set includes the price of several items, such as `"price": 1.30`, `"price": 1.99`, and `"price": 2.99`, you might use intervals of 1, so that you see everything grouped between 1 and 2, and 2 and 3. Histograms can process decimal values, but intervals have to be whole, non-negative numbers. Histogram fields also have to be number values, not strings. For example, `"price": 1.30` is a number value that works, and `"price": "1.30"` is a string, which does not work. The syntax is `histogram(<field>,<interval>)`, as shown in the following example. | `histogram(product.price,interval:1)` |
| `timeslice` | A specialized histogram that uses dates to create interval segments. Valid date interval values are `minute`, `hour`, `day`, `week`, `month`, and `year`. The syntax is `timeslice(<field>,<interval>,<time_zone>)`. To use `timeslice`, the time fields in your documents must be of the `date` data type and in UNIX time format. Unless both of these requirements are met, the `timeslice` parameter does not work correctly. You can create a timeslice if your documents contain date fields with values such as `1496228512`. The value must be in a numeric format (for example, `float` or `double`) and not enclosed in quotation marks. The service treats dates in text and dates in ISO 8601 format as data type `string`, not as data type `date`. You can detect anomalous points in timeslice aggregations. See Timeslice anomaly detection for additional information. This example returns values for sales (`product.sales`) at intervals of 2 days in the New York City time zone. | `timeslice(product.sales,2day,America/New_York)` |
| `top_hits` | Returns the documents ranked by the score of the query or enrichment. Can be used with any query parameter or aggregation. This example returns the 10 top hits for a term aggregation. | `term(enriched_text.concepts.text).top_hits(10)` |
| `max` | Returns the highest value in the specified field across all matching documents. | `max(product.price)` |
| `min` | Returns the lowest value in the specified field across all matching documents. | `min(product.price)` |
| `average` | Returns the mean of the values of the specified field across all matching documents. | `average(product.price)` |
| `sum` | Adds together the values of the specified field across all matching documents. | `sum(product.price)` |
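Aggregations are passed in the `aggregation` query parameter. The following sketch reuses two expressions from the table; the credentials, environment ID, and collection ID are placeholders:

```bash
# Top 10 concepts across all matching documents
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&aggregation=term(enriched_text.concepts.text,count:10)"

# The same term aggregation, returning its 10 top-ranked documents
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&aggregation=term(enriched_text.concepts.text).top_hits(10)"
```
{: pre}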
## Timeslice anomaly detection
{: #anomaly-detection}
You can optionally apply anomaly detection to the results of a `timeslice` aggregation. Anomaly detection is used to locate unusual data points within a time series and to flag them for further review. Example uses for anomaly detection include identifying spikes in credit-card usage and searching Watson Discovery News for clusters of articles regarding a particular topic.
To apply anomaly detection, use the following syntax in your aggregation:

```
timeslice(field:<date>,interval:<interval>,anomaly:true)
```
{: codeblock}
If you specify `anomaly:true` with the `timeslice` aggregation, the output includes the following additional information, which is shown in the example:

- `"anomaly": true` to indicate that anomaly detection was performed.
- An `anomaly` field in the points that are anomalous in the output's `results` array. The `anomaly` field has a value of the `float` data type that indicates the magnitude of the anomalous behavior. The closer the value of the `anomaly` field is to `1`, the more likely the result is anomalous.
- The `key` and `key_as_string` values in each of the objects in the `results` array correspond to a UNIX timestamp in seconds.
- The anomaly score is relative to a query, not across queries.
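The following is a request sketch with placeholder credentials and IDs; the date field name matches the sample output that follows and would differ for your own collection:

```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-06-25&aggregation=timeslice(field:blekko.chrondate,interval:1day,anomaly:true)"
```
{: pre}

The aggregation portion of the response then has the following format: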
"type": "timeslice",
"field": "blekko.chrondate",
"interval": "1d",
"anomaly": true,
"results": [
{
"matching_results": 2933,
"anomaly": 1,
"key_as_string": "1496880000",
"key": 1496880000000
},
{
"matching_results": 3435,
"anomaly": 1,
"key_as_string": "1496966400",
"key": 1496966400000
},
{
"matching_results": 3692,
"anomaly": 0.598226,
"key_as_string": "1496016000",
"key": 1496016000000
},
{
"matching_results": 4551,
"anomaly": 0.828498,
"key_as_string": "1495411200",
"key": 1495411200000
},
{
"matching_results": 947,
"key_as_string": "1489968000",
"key": 1489968000000
},
...
]
...
{: codeblock}
- Anomaly detection is currently available only on top-level `timeslice` aggregations. It is not available in lower-level (nested) aggregations.
- The maximum number of points that can be processed by anomaly detection in any given `timeslice` aggregation is `1500`.
- The maximum number of top-level `timeslice` aggregations that can be processed by anomaly detection is `20`.