-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial code for plugin based on opensearch-plugin-template-java #3
Conversation
@kevinawskendra Thanks for the PR. Below are some recommended changes:
|
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Mahita, we can review this request after you resolve the DCO issue in your commits. |
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…tep to split document source content into passages Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…stener. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…r hits based on response, add plugin security policy. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…t setting. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ield from the OS query. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…tId. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com> Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ex level setting Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
/** | ||
* The minimum step size by which to move the sliding window after a split. Step size can be larger | ||
* in practice because of respecting sentence boundaries. | ||
*/ | ||
private Integer stepSize; | ||
|
||
/** | ||
* The maximum number of passages to be extracted from the input text | ||
*/ | ||
private Integer maximumPassages; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these Integer
rather than int
?
The contract of setSlidingWindow
implicitly says that they can't be null
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to int
System.out.println(rerankRequest.getRerankingEndpointId()); | ||
System.out.println(rerankRequest.getDocuments()); | ||
System.out.println(rerankRequest.getQueryText()); | ||
Request<Void> request = new DefaultRequest<>(aws4Signer.getServiceName()); | ||
request.setHttpMethod(HttpMethodName.POST); | ||
request.setEndpoint(URI.create(serviceEndpoint)); | ||
request.setHeaders(Map.of("Content-Type", "application/x-amz-json-1.0", "X-Amz-Target", "AWSKendraRerankingFrontendService.Rerank")); | ||
System.out.println("objectMapper: " + objectMapper.writeValueAsString(rerankRequest)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to remove these System.out.println
before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left by accident. Removed now.
request.setEndpoint(URI.create(serviceEndpoint)); | ||
request.setHeaders(Map.of("Content-Type", "application/x-amz-json-1.0", "X-Amz-Target", "AWSKendraRerankingFrontendService.Rerank")); | ||
System.out.println("objectMapper: " + objectMapper.writeValueAsString(rerankRequest)); | ||
request.setContent(new ByteArrayInputStream(objectMapper.writeValueAsString(rerankRequest).getBytes(StandardCharsets.UTF_8))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be a dedicated Kendra reranking client in the AWS SDK, right?
I'm assuming this code (and the whole dto
package) are temporary until you have added the client to the SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the service is currently under development so we don't have an external SDK. We will update this code once it is ready.
|
||
final SearchHits newHits = doSemanticRerank(queryParserResult, hits, pluginEnabledFetchSource); | ||
|
||
LOGGER.info("Extracting remaining response fields"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this section can (and should) be extracted to support any external reranker. Anyone who is trying to manipulate the SearchHits
of a SearchResponse
would need to jump through these same hoops.
More generally, how hard would it be to split out the generic reranking logic from the Kendra-specific implementation?
I think the generic logic is probably:
- Call an abstract
shouldRerank
method based on the search request, and if true, create a generic response listener with aReranker
. - The generic response listener extracts the
SearchHits
from the response, passes them to the reranker, gets back newSearchHits
and uses them (and the oldSearchResponse
) to create a newSearchResponse
.
IMO, that would also help for breaking up this class (which is a little big right now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, I've refactored in the latest commit.
if (pluginEnabledFetchSource) { | ||
// User disabled fetching document source, so remove from response | ||
searchHit.sourceRef(null); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to understand that pluginEnabledFetchSource
meant "The user did not request the document source, the plugin did."
Maybe name this suppressSourceOnResponse
or something and comment when you initialize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the naming suggestion, I've updated in the latest commit and tried to re-word the comment to make the purpose clearer. Let me know what you think.
public class Document { | ||
|
||
private String id; | ||
private String title; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existence of the title
field looks like it's just part of the Kendra reranking API, and this plugin always passes null
.
The only fields that matter for this plugin are id
, tokenizedBodyPassages
, and originalScore
, right?
(I'm also assuming that these classes will be replaced with a proper SDK client model.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are still figuring out the user experience for configuring a mapping between fields in their document and the title
and body
expected by the ranker, so I will address this in a follow up.
} | ||
|
||
public List<String> tokenize(String text) { | ||
String[] tokens = text.split("[\\p{Punct}\\s]+"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Written like this, the tokenizer will compile the regex on every invocation.
You should use define a constant Pattern
field, initialized with Pattern.compile
to capture the regex, and use the Pattern
's split
method to split strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for catching this. Updated.
When this PR is complete, will we be able to close out #4? |
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…nd remove unused KendraClient. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
…ot default to false Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
private static final ObjectParser<KendraSearchExtBuilder, Void> PARSER; | ||
private static final ParseField RANKER_ENABLED = new ParseField("enabled"); | ||
private static final ParseField RANKER_ENABLED = new ParseField(Constants.ENABLED); | ||
private static final ParseField BODY_FIELD = new ParseField(Constants.BODY_FIELD); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was chatting with some folks about the long-term strategy where we'll support multiple rerankers, and trying to make sure that users of this plugin will be able to send the same queries in the future.
The proposed spec is described in #12.
Could we change the ext parameters to something more like:
{
"ext": {
"rankers": [
{
"name" : "kendra_intelligent_ranking",
"properties" : {
"body_field": [
"field1",
"field2"
]
}
}
]
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to propose a slightly modified structure
{
"ext": {
"relevancy_config": {
"rankers": [
{
"name" : "kendra_intelligent_ranking",
"enabled": true
"properties" : {
"body_field": [
"field1",
"field2"
]
}
}
]
}
}
}
- We need an ExtBuilder NAME to identify the ext config specified by this plugin, under which the inputs will be nested (I’m using
relevancy_config
used as a placeholder). See corresponding code and usage from LTR as an example. - We need to have a flag to enable/disable rankers if we are to support index-level and request-level configuration. For example:
PUT my-index-000001/_settings
{
"index" : {
"kendra_intelligent_ranking" : {
"enabled" : "true",
"body_field": "text"
}
}
}
will enable the Kendra ranker for my-index-000001
, but issuing a query without an ext
entry like below, will override this and disable the Kendra ranker for this request, when in fact, it should be enabled.
GET my-index-000001/_search
{
"query": {
"match": {
"text": {
"query": "research"
}
}
}
}
On another note, we have implemented plugin controls via index settings, whereas #12 makes use of the
|
@msfroh Summarizing our offline discussion here: I. Transformer InterfacesThe plugin will invoke specific transformers at different points during the query flow (e.g. rewriters have to be applied upfront, whereas rankers will be invoked just before response, once all documents have been fetched). Further, the interface for each transformer type will differ based on the parameters it operates upon (e.g. rewriters operate on the query, rankers operate on Launch proposal:
II. Search Configuration (for Index and Request level)Support index-level configuration via index settings (not Sample configuration based on #12 is at the end of this comment. Launch proposal:
Sample Configuration
|
Thanks @mahitamahesh ! I really like the API as described above. I think it satisfies the needs of the Kendra-based reranker calls, but will also make it easy to integrate with other rerankers (and other result transformers) soon. |
… bug. Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: 🍕Kevin Li <kevlim@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
Minor tweaks compared to the proposal above: I. Result Transformer Interface
II. ConfigurationIndex settings
Request-level configuration
|
Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
What was the motivation for that change? (I don't think it's an irreversible change, but I'm curious about why it was needed.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a few nitpicky comments
private final OpenSearchClient openSearchClient; | ||
private final KendraHttpClient kendraClient; | ||
private final TextTokenizer textTokenizer; | ||
private final CloseableHttpClient httpClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field is no longer used, and can be removed.
I though that we might also be able to remove the Apache HttpClient/HttpCore dependencies, but they need to be included so the AwsHttpClient still works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
/** | ||
* The session token for connecting to Kendra. | ||
*/ | ||
public static final Setting<SecureString> SESSION_TOKEN_SETTING = SecureSetting.secureString("kendra_intelligent_ranking.aws.session_token", null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it ever make sense to specify a session token in the settings?
AFAIK, session tokens are always short-lived, so it never makes sense to include one in the opensearch.yml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this here since using session key is completely optional. People could use it for testing if long lived credentials are not set up yet.
|
||
permission java.net.SocketPermission "*", "connect,resolve"; | ||
permission java.lang.RuntimePermission "getClassLoader"; | ||
permission java.util.PropertyPermission "*", "read,write"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this permission be scoped down?
In particular, I think we can limit it to kendra_intelligent_ranking.*
properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed entirely
@@ -0,0 +1 @@ | |||
# possible location for external configs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delete this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
semantic-reranker/build.gradle
Outdated
implementation 'commons-logging:commons-logging:1.2' | ||
implementation 'com.amazonaws:aws-java-sdk-sts:1.12.300' | ||
implementation 'com.amazonaws:aws-java-sdk-core:1.12.300' | ||
implementation 'joda-time:joda-time:2.10.12' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can remove the dependency on joda-time
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
semantic-reranker/build.gradle
Outdated
} | ||
|
||
dependencies { | ||
implementation 'com.google.guava:guava:21.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uses of Guava in SlidingWindowTextSplitter and SearchRelevancePlugin seem trivial enough to implement with built-in Java classes, so I would drop the Guava dependency.
The replacements needed are:
Lists.newArrayList() -> new ArrayList<>()
ImmutableList.of(...) -> Arrays.asList(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed Guava. Will likely remove lang3 in a follow up PR too.
…date service name. Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
@mahitamahesh -- how would you feel about us merging this PR, and doing cleanup in subsequent PRs? IMO, we're functionally complete and the general code structure is good. We have the cleanup items I listed above (which I'm happy to do) and maybe some minor tweaks here and there, but I think piling more on to this PR will make it harder to follow. What do you think? |
@msfroh I have a commit ready addressing all of the cleanup items above, so I'll squeeze that in here if that's alright, and then we can call this PR done. |
…bles, dependencies, permissions Signed-off-by: Mahita Mahesh <mahitam@amazon.com>
LGTM |
Description
Initial code for plugin based on opensearch-plugin-template-java
Copied files from https://github.com/opensearch-project/opensearch-plugin-template-java (excluding
CONTRIBUTING.md
,LICENSE.txt
,NOTICE.txt
, andREADME.md
) then followed README.Renamed
LICENSE
andNOTICE
toLICENSE.txt
,NOTICE.txt
, respectively, in order for./gradlew check
to succeed.Issues Resolved
N/A
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.