Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PPL describe command #646

Merged
merged 20 commits into from
Jun 28, 2022
Merged

PPL describe command #646

merged 20 commits into from
Jun 28, 2022

Conversation

seankao-az
Copy link
Collaborator

@seankao-az seankao-az commented Jun 20, 2022

Description

Support describe command in PPL

Issues Resolved

#644

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
      • unit test
      • integration test
      • doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az
Copy link
Collaborator Author

This is an early draft exploring option 4 mentioned in #644

Here is the result of querying the sample flight data.

$ curl -XPOST https://localhost:9200/_plugins/_ppl -u 'admin:admin' -k -H 'Content-Type: application/json' -d '{"query": "describe opensearch_dashboards_sample_data_flights | where TYPE_NAME=\"timestamp\" | fields COLUMN_NAME"}'

{
  "schema": [
    {
      "name": "COLUMN_NAME",
      "type": "string"
    }
  ],
  "datarows": [
    [
      "timestamp"
    ]
  ],
  "total": 1,
  "size": 1
}

Currently the type names returned are wrong (string instead of keyword, integer, ..., and timestamp instead of date)

@seankao-az seankao-az marked this pull request as ready for review June 21, 2022 00:09
@seankao-az seankao-az requested a review from a team as a code owner June 21, 2022 00:09
Signed-off-by: Sean Kao <seankao@amazon.com>
@codecov-commenter
Copy link

codecov-commenter commented Jun 21, 2022

Codecov Report

Merging #646 (7e228ac) into main (d81ef73) will decrease coverage by 31.87%.
The diff coverage is n/a.

@@              Coverage Diff              @@
##               main     #646       +/-   ##
=============================================
- Coverage     94.63%   62.76%   -31.88%     
=============================================
  Files           279       10      -269     
  Lines          7517      658     -6859     
  Branches        556      119      -437     
=============================================
- Hits           7114      413     -6701     
+ Misses          349      192      -157     
+ Partials         54       53        -1     
Flag Coverage Δ
query-workbench 62.76% <ø> (ø)
sql-engine ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...java/org/opensearch/sql/ppl/parser/AstBuilder.java
...opensearch/planner/physical/MLCommonsOperator.java
...rc/main/java/org/opensearch/sql/storage/Table.java
...search/storage/script/filter/lucene/TermQuery.java
...org/opensearch/sql/expression/ParseExpression.java
...nsearch/sql/analysis/WindowExpressionAnalyzer.java
...cript/aggregation/ExpressionAggregationScript.java
...arch/storage/script/filter/lucene/LuceneQuery.java
...arch/sql/planner/physical/collector/Collector.java
...request/system/OpenSearchDescribeIndexRequest.java
... and 259 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d81ef73...7e228ac. Read the comment docs.

Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az seankao-az changed the title PPL describe command [WIP] PPL describe command Jun 21, 2022
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az
Copy link
Collaborator Author

Changed response formatter so that response format for ppl describe matches that of sql describe.
Potentially breaks integration testing and lots of stuff, but I haven't find a way to test it locally (macOS).

Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az
Copy link
Collaborator Author

seankao-az commented Jun 22, 2022

Reverted the protocol breaking change for now.

@seankao-az seankao-az linked an issue Jun 22, 2022 that may be closed by this pull request
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
invalid index name (start with '_')

Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az seankao-az changed the title [WIP] PPL describe command PPL describe command Jun 23, 2022
Signed-off-by: Sean Kao <seankao@amazon.com>

@Test
public void testDescribeFieldsCommandShouldPass() {
ParseTree tree = new PPLSyntaxParser().analyzeSyntax("describe t | fields a,b");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use case for this example? It seems little odd because if someone wants a,b as result he would specify a,b instead of describing table and filtering again.

Fields command is to project menitoned columns from the result set. A plausible usecase could be describe t | fields 2, 3 which implies give me second and third column names.

Also if someone appends other commands to describe, what is the expected behavior. I am assuming we will be calculating on the result set provided by prior describe command.

Copy link
Collaborator Author

@seankao-az seankao-az Jun 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming we will be calculating on the result set provided by prior describe command.

That's correct. When appending other commands to describe, the behavior is to query the metadata table, instead of the data table itself (as expected for the pipe syntax).

An example of the usage of fields can be seen here

The fields do not refer to the data table's fields, but the metadata table's fields, because the result set of the describe command is a metadata table. Here you can see the full list of such fields.

Copy link
Collaborator Author

@seankao-az seankao-az Jun 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A plausible usecase could be describe t | fields 2, 3 which implies give me second and third column names.

Interestingly, we could support that using the following syntax:

describe t | where ORDINAL_POSITION=2 or ORDINAL_POSITION=3 | fields COLUMN_NAME

However, it doesn't quite work yet at the moment, due to type mismatch snippet 1, snippet 2. Also, most of the metadata are meaningless right now, including the order of the columns.

$ curl .... '{"query": "describe opensearch_dashboards_sample_data_flights | where ORDINAL_POSITION=0"}'

{
  "error": {
    "reason": "Invalid Query",
    "details": "= function expected {[BYTE,BYTE],[SHORT,SHORT],[INTEGER,INTEGER],[LONG,LONG],[FLOAT,FLOAT],[DOUBLE,DOUBLE],[STRING,STRING],[BOOLEAN,BOOLEAN],[TIMESTAMP,TIMESTAMP],[DATE,DATE],[TIME,TIME],[DATETIME,DATETIME],[INTERVAL,INTERVAL],[STRUCT,STRUCT],[ARRAY,ARRAY]}, but get [STRING,INTEGER]",
    "type": "ExpressionEvaluationException"
  },
  "status": 400
}

$ curl .... '{"query": "describe opensearch_dashboards_sample_data_flights | where ORDINAL_POSITION=\"0\""}'

{
  "error": {
    "reason": "Invalid Query",
    "details": "invalid to get integerValue from value of type STRING",
    "type": "ExpressionEvaluationException"
  },
  "status": 400
}

Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az
Copy link
Collaborator Author

seankao-az commented Jun 24, 2022

I learned from https://github.com/opensearch-project/sql/blob/main/docs/user/ppl/general/identifiers.rst that we actually have to support multiple indices. Most of them worked fine out of the box before my latest commit

  • wildcard: describe index_prefix_* works
  • comma separated indices enclosed in back-ticks: describe `index1,index2` works
  • However, describe index1,index2 doesn't work

In my last commit, I implemented this by joining the index strings, so that the following explain results matches:

$ curl -XPOST https://localhost:9200/_plugins/_ppl/_explain -u 'admin:admin' -k -H 'Content-Type: application/json' -d '{"query": "describe `opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs`"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX, NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB, CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE, SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE, SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN]"
    },
    "children": [
      {
        "name": "OpenSearchSystemIndexScan",
        "description": {
          "request": "OpenSearchDescribeIndexRequest{indexName='opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs'}"
        },
        "children": []
      }
    ]
  }
}

$ curl -XPOST https://localhost:9200/_plugins/_ppl/_explain -u 'admin:admin' -k -H 'Content-Type: application/json' -d '{"query": "describe opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX, NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB, CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE, SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE, SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN]"
    },
    "children": [
      {
        "name": "OpenSearchSystemIndexScan",
        "description": {
          "request": "OpenSearchDescribeIndexRequest{indexName='opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs'}"
        },
        "children": []
      }
    ]
  }
}

I'm not sure whether this is the preferred way to do it.

I tried other ways to do this, but was faced with some issues

$ curl -XPOST https://localhost:9200/_plugins/_ppl -u 'admin:admin' -k -H 'Content-Type: application/json' -d '{"query": "describe opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs"}'
{
  "error": {
    "reason": "Error occurred in OpenSearch engine: Invalid index name [_ODFE_SYS_TABLE_MAPPINGS.opensearch_dashboards_sample_data_logs], must not start with '_'.",
    "details": "org.opensearch.indices.InvalidIndexNameException: Invalid index name [_ODFE_SYS_TABLE_MAPPINGS.opensearch_dashboards_sample_data_logs], must not start with '_'.\nFor more details, please send request for Json format to see the raw response from OpenSearch engine.",
    "type": "InvalidIndexNameException"
  },
  "status": 400
}

$ curl -XPOST https://localhost:9200/_plugins/_ppl/_explain -u 'admin:admin' -k -H 'Content-Type: application/json' -d '{"query": "describe opensearch_dashboards_sample_data_flights,opensearch_dashboards_sample_data_logs"}'
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[TABLE_CAT, TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, DATA_TYPE, TYPE_NAME, COLUMN_SIZE, BUFFER_LENGTH, DECIMAL_DIGITS, NUM_PREC_RADIX, NULLABLE, REMARKS, COLUMN_DEF, SQL_DATA_TYPE, SQL_DATETIME_SUB, CHAR_OCTET_LENGTH, ORDINAL_POSITION, IS_NULLABLE, SCOPE_CATALOG, SCOPE_SCHEMA, SCOPE_TABLE, SOURCE_DATA_TYPE, IS_AUTOINCREMENT, IS_GENERATEDCOLUMN]"
    },
    "children": [
      {
        "name": "OpenSearchSystemIndexScan",
        "description": {
          "request": "OpenSearchDescribeIndexRequest{indexName='opensearch_dashboards_sample_data_flights,_ODFE_SYS_TABLE_MAPPINGS.opensearch_dashboards_sample_data_logs'}"
        },
        "children": []
      }
    ]
  }
}

Only the second index name is prefixed by _ODFE_SYS_TABLE_MAPPINGS" and I don't know why

Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
Signed-off-by: Sean Kao <seankao@amazon.com>
@seankao-az seankao-az requested a review from penghuo June 24, 2022 22:47
Copy link
Collaborator

@penghuo penghuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change!

@seankao-az seankao-az merged commit 9e2a9ff into opensearch-project:main Jun 28, 2022
@seankao-az seankao-az deleted the ppl-describe branch June 28, 2022 17:21
seankao-az added a commit that referenced this pull request Jun 28, 2022
Signed-off-by: Sean Kao <seankao@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support describe command in PPL
6 participants