offline batch ingestion API actions and data ingesters #2844
Conversation
Review comments (since resolved) were left on S3DataIngestion.java, openAIDataIngestion.java, MLBatchIngestionAction.java, and TransportBatchIngestionAction.java.
Signed-off-by: Xun Zhang <xunzh@amazon.com>
    exception = addValidationError("The input for ML batch ingestion cannot be null.", exception);
}
// The credential block is required as well; flag it when missing.
if (mlBatchIngestionInput != null && mlBatchIngestionInput.getCredential() == null) {
    exception = addValidationError("The credential for ML batch ingestion cannot be null", exception);
Credentials for ML batch ingestion are missing. Please provide the necessary credentials to continue with the ingestion process.
I had this comment, which was marked resolved but hasn't been applied. The same goes for the other validation errors.
This is a CX interface suggestion. I will explore the suggested approach for field_map in a separate PR, since this one is already long.
* batch ingest API rest and transport actions
* add openAI ingester
* update batch ingestion field mapping interface and address comments
* support multiple data sources as ingestion inputs
* use dedicated thread pool for ingestion

Signed-off-by: Xun Zhang <xunzh@amazon.com>
(cherry picked from commit 33a7c96)
Co-authored-by: Xun Zhang <xunzh@amazon.com>
After some thought, I think the index mapping can be updated to reflect the actual field mappings for the target index, as you suggested. However, I think we should not overcomplicate the problem by allowing more than one source file per embedding field, because that would cause ingestion confusion.

For example, take "product_name": "source[0,2].$.my_product_name". If people put 45% of the product_name data in file 0 and 55% in file 2, and 55% of the product_name_embedding data in file 1 and 45% in file 3, then, since the files have to be scanned one by one, we wouldn't know how to match them: some of the data in file 1 would need to be bulk indexed while the rest would need to be bulk updated.

I think we should keep the concept simple and easy to understand here. In a single request we accept multiple files, but each file contains all the data for a given set of fields. Basically, we only support vertical sharding, not horizontal sharding. With that said, in the field mapping each field comes from one single file, to avoid confusion and disorder.
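To illustrate (a hedged sketch only; the field names my_product_name and my_product_name_embedding are hypothetical, reusing the JSONPath style from the comment above): under this vertical-sharding convention, every field in the field_map points at exactly one source file, e.g. file 0 holds all product names and file 1 holds all of their embeddings:

```json
{
  "field_map": {
    "product_name": "source[0].$.my_product_name",
    "product_name_embedding": "source[1].$.my_product_name_embedding"
  }
}
```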
Description
Add a new API to ingest data offline, in batch mode, from different sources (starting with SageMaker). This works together with the offline batch inference released in 2.16.
Example of batch ingestion request from SageMaker:
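(The request body originally attached here isn't captured in this view. Below is a hedged sketch of what a request ingesting SageMaker batch-transform output from S3 might look like; the endpoint path, JSONPath expressions, bucket name, and field names are assumptions for illustration, not the exact payload from this PR.)

```
POST /_plugins/_ml/_batch_ingestion
{
  "index_name": "my-nlp-index",
  "field_map": {
    "chapter": "source[0].$.content[0]",
    "title": "source[0].$.content[1]",
    "chapter_embedding": "source[0].$.SageMakerOutput[0]",
    "title_embedding": "source[0].$.SageMakerOutput[1]"
  },
  "credential": {
    "region": "us-east-1",
    "access_key": "<AWS access key>",
    "secret_key": "<AWS secret key>",
    "session_token": "<AWS session token>"
  },
  "data_source": {
    "type": "s3",
    "source": ["s3://my-batch-bucket/output/sagemaker_batch.json.out"]
  }
}
```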
Example of batch ingestion request from OpenAI:
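(Likewise a hedged sketch for an OpenAI source: the source entries would be OpenAI file IDs, for example the input and output files of an offline batch embedding job, and the credential would carry the OpenAI API key. The exact key names and JSONPath expressions below are assumptions.)

```
POST /_plugins/_ml/_batch_ingestion
{
  "index_name": "my-nlp-index-openai",
  "field_map": {
    "question": "source[0].$.body.input[0]",
    "question_embedding": "source[1].$.response.body.data[0].embedding"
  },
  "credential": {
    "openAI_key": "<OpenAI API key>"
  },
  "data_source": {
    "type": "openAI",
    "source": ["file-<input file id>", "file-<output file id>"]
  }
}
```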
Related Issues
#2840
Check List
- Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.