Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial code for plugin based on opensearch-plugin-template-java #3

Merged
merged 32 commits into from
Nov 4, 2022

Conversation

kevinawskendra
Copy link
Contributor

Description

Initial code for plugin based on opensearch-plugin-template-java

Issues Resolved

N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@ps48
Copy link
Member

ps48 commented Jul 11, 2022

@kevinawskendra Thanks for the PR. Below are some recommended changes:

  1. Please use commit signoffs otherwise the DCO workflow would fail.
  2. Before adding the plugin template code. Let's add an RFC design document created for this new plugin. You may checkout an example here.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
@macohen
Copy link
Collaborator

macohen commented Oct 7, 2022

Mahita, we can review this request after you resolve the DCO issue in your commits.

Mahita Mahesh and others added 18 commits October 7, 2022 12:28
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…tep to split document source content into passages

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…stener.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…r hits based on response, add plugin security policy.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…t setting.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ield from the OS query.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…tId.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ex level setting

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Comment on lines 30 to 39
/**
* The minimum step size by which to move the sliding window after a split. Step size can be larger
* in practice because of respecting sentence boundaries.
*/
private Integer stepSize;

/**
* The maximum number of passages to be extracted from the input text
*/
private Integer maximumPassages;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these Integer rather than int?

The contract of setSlidingWindow implicitly says that they can't be null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to int

Comment on lines 68 to 75
System.out.println(rerankRequest.getRerankingEndpointId());
System.out.println(rerankRequest.getDocuments());
System.out.println(rerankRequest.getQueryText());
Request<Void> request = new DefaultRequest<>(aws4Signer.getServiceName());
request.setHttpMethod(HttpMethodName.POST);
request.setEndpoint(URI.create(serviceEndpoint));
request.setHeaders(Map.of("Content-Type", "application/x-amz-json-1.0", "X-Amz-Target", "AWSKendraRerankingFrontendService.Rerank"));
System.out.println("objectMapper: " + objectMapper.writeValueAsString(rerankRequest));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to remove these System.out.println before merging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left by accident. Removed now.

request.setEndpoint(URI.create(serviceEndpoint));
request.setHeaders(Map.of("Content-Type", "application/x-amz-json-1.0", "X-Amz-Target", "AWSKendraRerankingFrontendService.Rerank"));
System.out.println("objectMapper: " + objectMapper.writeValueAsString(rerankRequest));
request.setContent(new ByteArrayInputStream(objectMapper.writeValueAsString(rerankRequest).getBytes(StandardCharsets.UTF_8)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a dedicated Kendra reranking client in the AWS SDK, right?

I'm assuming this code (and the whole dto package) are temporary until you have added the client to the SDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the service is currently under development so we don't have an external SDK. We will update this code once it is ready.


final SearchHits newHits = doSemanticRerank(queryParserResult, hits, pluginEnabledFetchSource);

LOGGER.info("Extracting remaining response fields");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this section can (and should) be extracted to support any external reranker. Anyone who is trying to manipulate the SearchHits of a SearchResponse would need to jump through these same hoops.

More generally, how hard would it be to split out the generic reranking logic from the Kendra-specific implementation?

I think the generic logic is probably:

  1. Call an abstract shouldRerank method based on the search request, and if true, create a generic response listener with a Reranker.
  2. The generic response listener extracts the SearchHits from the response, passes them to the reranker, gets back new SearchHits and uses them (and the old SearchResponse) to create a new SearchResponse.

IMO, that would also help for breaking up this class (which is a little big right now).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I've refactored in the latest commit.

Comment on lines 306 to 309
if (pluginEnabledFetchSource) {
// User disabled fetching document source, so remove from response
searchHit.sourceRef(null);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand that pluginEnabledFetchSource meant "The user did not request the document source, the plugin did."

Maybe name this suppressSourceOnResponse or something and comment when you initialize it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the naming suggestion, I've updated in the latest commit and tried to re-word the comment to make the purpose clearer. Let me know what you think.

public class Document {

private String id;
private String title;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existence of the title field looks like it's just part of the Kendra reranking API, and this plugin always passes null.

The only fields that matter for this plugin are id, tokenizedBodyPassages, and originalScore, right?

(I'm also assuming that these classes will be replaced with a proper SDK client model.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still figuring out the user experience for configuring a mapping between fields in their document and the title and body expected by the ranker, so I will address this in a follow up.

}

public List<String> tokenize(String text) {
String[] tokens = text.split("[\\p{Punct}\\s]+");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Written like this, the tokenizer will compile the regex on every invocation.

You should use define a constant Pattern field, initialized with Pattern.compile to capture the regex, and use the Pattern's split method to split strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this. Updated.

@macohen
Copy link
Collaborator

macohen commented Oct 10, 2022

When this PR is complete, will we be able to close out #4?

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@mahitamahesh
Copy link
Contributor

@macohen Yes, I think so. We are looking to make some updates to the document pre-processing logic in a follow up PR, but that won't contradict anything mentioned in #4.

…nd remove unused KendraClient.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Mahita Mahesh added 2 commits October 11, 2022 23:43
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ot default to false

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Comment on lines 26 to 28
private static final ObjectParser<KendraSearchExtBuilder, Void> PARSER;
private static final ParseField RANKER_ENABLED = new ParseField("enabled");
private static final ParseField RANKER_ENABLED = new ParseField(Constants.ENABLED);
private static final ParseField BODY_FIELD = new ParseField(Constants.BODY_FIELD);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was chatting with some folks about the long-term strategy where we'll support multiple rerankers, and trying to make sure that users of this plugin will be able to send the same queries in the future.

The proposed spec is described in #12.

Could we change the ext parameters to something more like:

{
  "ext": {
    "rankers": [
      {
        "name" : "kendra_intelligent_ranking",
        "properties" : {
          "body_field": [
            "field1",
            "field2"
          ]
        }
      }
    ]
   }
 }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to propose a slightly modified structure

{
  "ext": {
    "relevancy_config": {
      "rankers": [
        {
          "name" : "kendra_intelligent_ranking",
          "enabled": true
          "properties" : {
            "body_field": [
              "field1",
              "field2"
            ]
          }
        }
      ]
    }
  }
}
  1. We need an ExtBuilder NAME to identify the ext config specified by this plugin, under which the inputs will be nested (I’m using relevancy_config used as a placeholder). See corresponding code and usage from LTR as an example.
  2. We need to have a flag to enable/disable rankers if we are to support index-level and request-level configuration. For example:
PUT my-index-000001/_settings
{
  "index" : {
    "kendra_intelligent_ranking" : {
        "enabled" : "true",
        "body_field": "text"
    }
  }
}

will enable the Kendra ranker for my-index-000001, but issuing a query without an ext entry like below, will override this and disable the Kendra ranker for this request, when in fact, it should be enabled.

GET my-index-000001/_search
{
  "query": {
    "match": {
      "text": {
        "query": "research"
      }
    }
  }
}

@mahitamahesh
Copy link
Contributor

On another note, we have implemented plugin controls via index settings, whereas #12 makes use of the _meta field under my-index-000001/_mapping

  1. Has anyone built a proof of concept with _meta to test that the integration works for this use case? If we need to make the change from settings to mapping, can we base our implementation off of this PoC?
  2. If we support purely UI-based configuration in the future, we can make the API change from settings to mapping under the hood and not break customer workflows. Else I think this would be a breaking change.

@mahitamahesh
Copy link
Contributor

@msfroh Summarizing our offline discussion here:

I. Transformer Interfaces

The plugin will invoke specific transformers at different points during the query flow (e.g. rewriters have to be applied upfront, whereas rankers will be invoked just before response, once all documents have been fetched). Further, the interface for each transformer type will differ based on the parameters it operates upon (e.g. rewriters operate on the query, rankers operate on SearchHits).

Launch proposal:

  • Support one transformer type: result_transformer (request_transformer, etc. to be added in the future)
  • Have a generic ResultTransformer interface with the methods below, and with a concrete implementation for KendraIntelligentRanking.
    • boolean shouldApplyTransformer (SearchRequest)
    • SearchHits apply (SearchHits)
  • Modify plugin logic to invoke a list of ResultTransformers (we will launch with a list of size 1), each of which operates on the SearchHits from the previous stage
  • We will not support passing context from one transformer to the next. This can be added in the future.

II. Search Configuration (for Index and Request level)

Support index-level configuration via index settings (not _mapping, as proposed in #12
Support request-level configuration via ext field in the request DSL.

Sample configuration based on #12 is at the end of this comment.

Launch proposal:

  • Maintain a list of supported transformer types (currently ["result_transformer"]) to perform validation
  • Have a generic transformer configuration with attributes below. Each implemented transformer will define its properties and logic for parsing + validating them
    • name
    • properties
  • Add unit tests with dummy operations to test that a chain of transformers works

Sample Configuration

"search_configuration": {
  "request_transformers": [
    {
      "name": "querqy",
      "properties": {
        "querqy": {
          "matching_query": {
            "must_match": {
              "query": "rambo"
            },
            "multi_match": {
              "query": "rambo",
              "fields": [
                "field1",
                "field2"
              ]
            }
          },
          "query_fields": [
            "title^3.0",
            "brand^2.1",
            "shortSummary"
          ]
        }
      }
    }
  ],
  "result_transformers": [
    {
      "name": "kendra_intelligent_ranking",
      "properties": {
        "title_fields": [
          "title"
        ],
        "body_fields": [
          "published",
          "description"
        ]
      }
    },
    {
      "name": "xyz_highlighting_service",
      "properties": {
        "prop1": "val1"
      }
    }
  ]
}

@msfroh
Copy link
Collaborator

msfroh commented Oct 18, 2022

Thanks @mahitamahesh ! I really like the API as described above.

I think it satisfies the needs of the Kendra-based reranker calls, but will also make it easy to integrate with other rerankers (and other result transformers) soon.

… bug.

Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Mahita Mahesh added 2 commits October 25, 2022 09:43
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@mahitamahesh
Copy link
Contributor

Minor tweaks compared to the proposal above:

I. Result Transformer Interface

List<Setting<?>> getTransformerSettings();

boolean shouldTransform(final SearchRequest request, final ResultTransformerConfiguration configuration);

SearchHits transform(final SearchHits hits, final SearchRequest request, final ResultTransformerConfiguration configuration);

II. Configuration

Index settings

  • Index settings does not support lists of objects Error: only value lists are allowed in serialized settings
  • Naming following OpenSearch plugin conventions, hence index.plugin.searchrelevance.*
PUT my-index/_settings

{
  "index": {
    "plugin": {
      "searchrelevance": {
        "result_transformers": {
          "kendra_intelligent_ranking": {
            "order": 2,
            "properties": {
              "title_fields": [
                "title"
              ],
              "body_fields": [
                "published",
                "description"
              ]
            }
          },
          "transformer2": {
            "order": 1,
            "properties": {
              "prop1": "val1"
            }
          }
        }
      }
    }
  }
}

Request-level configuration

  • Moved result_transformer directly under ext
  • We will add a new SearchExtBuilder class for each transformer type that is added
GET my-index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "research"
      }
    }
  },
  "ext": {
    "result_transformer" : {
      "kendra_intelligent_ranking": {
        "order": 2,
        "properties": {
          "title_fields": [
            "title"
          ],
          "body_fields": [
            "published",
            "description"
          ]
        }
      },
      "transformer2": {
        "order": 1,
        "properties": {
          "prop1": "val1"
        }
      }
    }
  }
}

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@msfroh
Copy link
Collaborator

msfroh commented Oct 25, 2022

Moved result_transformer directly under ext

What was the motivation for that change? (I don't think it's an irreversible change, but I'm curious about why it was needed.)

Copy link
Collaborator

@msfroh msfroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a few nitpicky comments

private final OpenSearchClient openSearchClient;
private final KendraHttpClient kendraClient;
private final TextTokenizer textTokenizer;
private final CloseableHttpClient httpClient;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is no longer used, and can be removed.

I though that we might also be able to remove the Apache HttpClient/HttpCore dependencies, but they need to be included so the AwsHttpClient still works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment on lines +68 to +71
/**
* The session token for connecting to Kendra.
*/
public static final Setting<SecureString> SESSION_TOKEN_SETTING = SecureSetting.secureString("kendra_intelligent_ranking.aws.session_token", null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it ever make sense to specify a session token in the settings?

AFAIK, session tokens are always short-lived, so it never makes sense to include one in the opensearch.yml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this here since using session key is completely optional. People could use it for testing if long lived credentials are not set up yet.


permission java.net.SocketPermission "*", "connect,resolve";
permission java.lang.RuntimePermission "getClassLoader";
permission java.util.PropertyPermission "*", "read,write";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this permission be scoped down?

In particular, I think we can limit it to kendra_intelligent_ranking.* properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed entirely

@@ -0,0 +1 @@
# possible location for external configs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we delete this file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

implementation 'commons-logging:commons-logging:1.2'
implementation 'com.amazonaws:aws-java-sdk-sts:1.12.300'
implementation 'com.amazonaws:aws-java-sdk-core:1.12.300'
implementation 'joda-time:joda-time:2.10.12'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can remove the dependency on joda-time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

dependencies {
implementation 'com.google.guava:guava:21.0'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uses of Guava in SlidingWindowTextSplitter and SearchRelevancePlugin seem trivial enough to implement with built-in Java classes, so I would drop the Guava dependency.

The replacements needed are:

Lists.newArrayList()     ->    new ArrayList<>()
ImmutableList.of(...)    ->    Arrays.asList(...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed Guava. Will likely remove lang3 in a follow up PR too.

…date service name.

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@msfroh
Copy link
Collaborator

msfroh commented Oct 27, 2022

@mahitamahesh -- how would you feel about us merging this PR, and doing cleanup in subsequent PRs?

IMO, we're functionally complete and the general code structure is good. We have the cleanup items I listed above (which I'm happy to do) and maybe some minor tweaks here and there, but I think piling more on to this PR will make it harder to follow. What do you think?

@mahitamahesh
Copy link
Contributor

@msfroh I have a commit ready addressing all of the cleanup items above, so I'll squeeze that in here if that's alright, and then we can call this PR done.

…bles, dependencies, permissions

Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@kevinawskendra
Copy link
Contributor Author

LGTM

@kevinawskendra kevinawskendra reopened this Nov 4, 2022
@msfroh msfroh merged commit 9c3994b into opensearch-project:main Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants