OpenSearch Bulk API Source #248

Closed
Tracked by #4180
laneholloway opened this issue Sep 6, 2021 · 5 comments
Labels: backlog, enhancement, plugin - source, Roadmap:Observability/Log Analytics

Comments

@laneholloway
Contributor

laneholloway commented Sep 6, 2021

Summary

This creates a new Data Prepper source which accepts data in the form of the OpenSearch Bulk API.

Configuration

source:
  opensearch_api:
    port: 9200
    path_prefix: opensearch/

Operations

The _bulk API supports:

  • index
  • create
  • update
  • delete

This source can follow an approach similar to the dynamodb source. Specifically, each event should carry the bulk action in its opensearch_action metadata.

Sample

POST opensearch/_bulk
{ "index": { "_index": "movies", "_id": "tt1979320" } }
{ "title": "Rush", "year": 2013 }

The above request is the simplest case since it is an index request.

It creates an Event with data such as:

{ "_id": "tt1979320", "title": "Rush", "year": 2013 }

Additionally, the event will need metadata that we can use in the opensearch sink.

opensearch_action: "index"
opensearch_index: "movies"
opensearch_id: "tt1979320"
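As an illustration of the mapping above, here is a minimal Python sketch that turns a _bulk request body into events with data and metadata. It is not the Data Prepper implementation (which is Java); the function name `parse_bulk` and the event shape (a dict with `data` and `metadata` keys) are assumptions made for this example.

```python
import json

# In the _bulk format, these actions are followed by a document line;
# a delete action has no document line.
ACTIONS_WITH_BODY = {"index", "create", "update"}


def parse_bulk(body: str) -> list:
    """Convert an NDJSON _bulk body into a list of event dicts.

    The event shape ({"data": ..., "metadata": ...}) is hypothetical.
    """
    lines = [line for line in body.splitlines() if line.strip()]
    events = []
    i = 0
    while i < len(lines):
        # Each action line is a single-key object, e.g. {"index": {...}}.
        action, meta = next(iter(json.loads(lines[i]).items()))
        i += 1

        data = {}
        if action in ACTIONS_WITH_BODY:
            data = json.loads(lines[i])
            i += 1

        # Copy the document ID into the event data, as in the sample above.
        if "_id" in meta:
            data = {"_id": meta["_id"], **data}

        events.append({
            "data": data,
            "metadata": {
                "opensearch_action": action,
                "opensearch_index": meta.get("_index"),
                "opensearch_id": meta.get("_id"),
            },
        })
    return events
```

Calling `parse_bulk` on the sample request body yields one event whose data holds `_id`, `title`, and `year`, and whose metadata holds the three opensearch_* keys.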

Query parameters

The _bulk API supports a few query parameters. The source should also support most of these and provide some of them as metadata.

  • pipeline -> Sets metadata: opensearch_pipeline
  • routing -> Sets metadata: opensearch_routing
  • timeout -> Configures an alternate timeout for the request in the source. This probably doesn't need to be provided downstream.
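A rough sketch of how these three parameters could be handled, written in Python for brevity. The helper name `metadata_from_query` and the return shape are assumptions; the sketch only shows that pipeline and routing become metadata while timeout stays local to the source.

```python
from urllib.parse import parse_qs, urlsplit

# Query parameter -> metadata key, as listed above.
PARAM_TO_METADATA = {
    "pipeline": "opensearch_pipeline",
    "routing": "opensearch_routing",
}


def metadata_from_query(url: str):
    """Return (metadata, timeout) for a _bulk request URL.

    Hypothetical helper: the timeout is returned separately because it
    applies to the request in the source and is not passed downstream.
    """
    query = parse_qs(urlsplit(url).query)
    metadata = {
        key: query[param][0]
        for param, key in PARAM_TO_METADATA.items()
        if param in query
    }
    timeout = query.get("timeout", [None])[0]
    return metadata, timeout
```

For example, `metadata_from_query("/opensearch/_bulk?pipeline=my-pipeline&timeout=30s")` returns the opensearch_pipeline metadata together with the raw timeout string; parsing the timeout value is left out of this sketch.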

Some other parameters that we may wish to support:

  • refresh
  • require_alias
  • wait_for_active_shards

Finally, we should not support the following parameter, as it is being deprecated:

  • type

Response

Being able to provide the _bulk API response may be more challenging. There are a few reasons:

  1. Unless end-to-end acknowledgments are enabled, we won't have any knowledge of the writes.
  2. Even when acknowledgments are enabled, not all of the metadata needed in a typical response is available.

An initial version could provide responses that either have empty values (where appropriate) or use synthetic values.
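One way such a synthetic response might look, sketched in Python. The top-level `took`, `errors`, and `items` fields follow the shape of a _bulk response; everything else is an assumption for illustration: the function name, the input event shape (a dict with a `metadata` dict holding the opensearch_* keys), the placeholder status of 200, and the choice to omit fields such as `_version` and `_seq_no` that the source cannot know.

```python
def synthetic_bulk_response(events: list, took_ms: int = 0) -> dict:
    """Build a placeholder _bulk response for events that were accepted.

    Hypothetical sketch: values are synthetic because the source does not
    know the outcome of the downstream writes.
    """
    items = []
    for event in events:
        meta = event["metadata"]
        action = meta["opensearch_action"]
        items.append({
            action: {
                "_index": meta.get("opensearch_index"),
                "_id": meta.get("opensearch_id"),
                "status": 200,  # synthetic placeholder, not a real write status
            }
        })
    # "errors" is always False here because no write result is available.
    return {"took": took_ms, "errors": False, "items": items}
```

A client that only checks the `errors` flag and the per-item status would accept this response, but one that relies on version or sequence numbers would not; that trade-off is part of what makes the response challenging.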

@sb2k16
Member

sb2k16 commented Apr 19, 2024

I would like to work on this issue. Could you please assign this to me?

@sb2k16
Member

sb2k16 commented May 3, 2024

For the first milestone, we are going to support only the index action of the OpenSearch Bulk API. The other actions (create, update, and delete) will be available in later milestones.

@jzonthemtn
Contributor

Thanks @sb2k16 for picking this up. This is of interest to me because I am working on OpenSearch UBI.

I don't want to restrict where UBI events and queries are indexed, because there can be valid reasons for storing those items on a different OpenSearch instance (that is, an instance other than the one where the query ran). Allowing the user to specify an OpenSearch API-compatible endpoint to receive that data would allow UBI to store data in any instance of OpenSearch with minimal overhead.

The Bulk API will be helpful because the UBI OpenSearch module can use that endpoint directly to send data to another instance of OpenSearch via Data Prepper. Additionally, using Data Prepper is valuable because of the flexibility it gives the user.

I hope that gives some insight into one use-case for this feature request. If it would be helpful to chat more about it please let me know.

@dlvenable
Member

Completed in #5024.

@jzonthemtn
Contributor

This is awesome! Thanks everyone!
