Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] JSON Search Support #2652

Open
brijos opened this issue May 3, 2024 · 9 comments
Open

[FEATURE] JSON Search Support #2652

brijos opened this issue May 3, 2024 · 9 comments
Labels
enhancement New feature or request PPL Piped processing language

Comments

@brijos
Copy link

brijos commented May 3, 2024

Is your feature request related to a problem?
Community members have asked for easier JSON parsing and analysis capabilities which allow them to not only search JSON logs and extract fields without writing complex parse expressions, but perform computations on JSON array values, such as finding the sum of all values in the array, where the number of elements in the array is not known.

What solution would you like?
Allow users to extract and transform data from JSON-formatted events and fields. Users should be able to extract all values in an array by specifying a wildcard for the individual element position and doing an aggregation operation on them. Users should be able to extract: 1/single or multiple top level fields 2/nested fields 3/keys in arrays and perform operations on the values.

** Examples ***

  • User wants to look at an array of elements, across log entries to see how many times each element has occurred in the array.
  • Match any item in a JSON array, regardless of the position in the array
  • Pull out fields within a nested JSON file

What alternatives have you considered?
No other solutions are available in PPL

Do you have any additional context?
No

@brijos brijos added enhancement New feature or request untriaged labels May 3, 2024
@kedbirhan
Copy link

kedbirhan commented May 8, 2024

SELECT
  json_extract(myblob, '$.name') AS name,
  json_extract(myblob, '$.projects') AS projects
FROM dataset

it would help a lot to support json_extract function to manipulate json string field as shown above.

@dblock
Copy link
Member

dblock commented Jun 24, 2024

Catch All Triage - 1 2 3 4 5 6

@dblock dblock removed the untriaged label Jun 24, 2024
@salyh
Copy link

salyh commented Jul 3, 2024

@anasalkouz @YANG-DB @brijos

Json Functions Proposal

I have created a working prototype for the json_extract function.

This is currently implemented in the sql sub-project to make the json functions available not only as PPL command. In other words: The function can be used (like any other built in function) in sql and ppl.

The proposed (and so far implemented syntax) is:

json_extract(<json>,<path>)

<json> Json as string. From an table cell or as literal (mandatory)

<path> a json path or a json pointer expression (mandatory).

The function returns the result as string (scalar value or full json)

An error is thrown when:

  • Json is invalid
  • Path is not a valid json path or json pointer expression

No error is thrown when:

  • Path can not be located (we return just NULL)

Examples:

#JSON Path
select json_extract('{\"name\":\"saly\"}', '$.name')
#JSON Pointer
select json_extract('{\"name\":\"saly\"}', '/name')

Open questions:

  1. Should it be implemented as a built in sql function?
  2. Is the string to string approach sufficient (json input as string, function output as string)?

Not yet covered

  • Optimizations
  • Integ Tests

@salyh
Copy link

salyh commented Jul 9, 2024

@anasalkouz @YANG-DB @rupal-bq any comments on the proposal so far?

@nateynateynate
Copy link
Member

Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.

@salyh
Copy link

salyh commented Jul 16, 2024

Can we perhaps specify a document ID to use as the json for the query? I've got a lot of json blobs that I'd love to search through instead of breaking them up before ingest.

can you post an example how this can look like?

@kedbirhan
Copy link

kedbirhan commented Jul 18, 2024

  1. Should it be implemented as a built in sql function?
    yes this would be the case as this is true for Athena's json_extract function.
  2. Is the string to string approach sufficient (json input as string, function output as string)?
    i would think this suffices for now
    @salyh also it would be nice if the implementation support targeting json array elements as well as shown in the example below

source input

{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}

query

SELECT ...
FROM ...
WHERE json_extract(p.policy_std, '$.Statement[*].Effect') = 'Allow'
  AND f.name = 'hellopython';

@salyh
Copy link

salyh commented Jul 23, 2024

Will add this ...

@YANG-DB
Copy link
Member

YANG-DB commented Sep 16, 2024

@YANG-DB YANG-DB added the PPL Piped processing language label Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request PPL Piped processing language
Projects
None yet
Development

No branches or pull requests

6 participants