-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Repository Search by Substring #8417
Conversation
♻️ PR Preview 4087e10 has been successfully destroyed since this PR has been closed. 🤖 By surge-preview |
1e982e4
to
1178906
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature worries me a lot.
## Cost & performance
Update: @guy-har points out that this entire calculation is wrong and that I was confugsed, please ignore and/or see below.
Suppose a user has 100 repositories (any less and they probably don't need it). They start typing a repository name on the webUI. Every character they type on the GUI narrows the search, at the cost of a KV listing. A reasonable typist can hit 10 characters per second; didn't we just hit 1000 RCUs/s?
Ideally we would open an immediate tech debt item to improve KV to allow faster scanning: we could add conditions on all supported KVs that would greatly reduce the required traffic.
Also, existing programs that perform list-repositories may now cost much more: every repository listing scans all repositories.
API
AFAICT the code overrides "prefix" to mean "substring". This is both confusing and a breaking change. At the very least, please keep "prefix", and add a new "substring" field.
Also, I believe that this change breaks pagination; please test and fix if needed.
pkg/catalog/catalog.go
Outdated
startPos = afterRepositoryID | ||
} | ||
if startPos != "" { | ||
it.SeekGE(startPos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes scanning very inefficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I fixed it- seek to start position is back in. This was a logical error before we even talk about performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now we are missing repositories that include the string and are before the prefix (e.g for prefix ple we will never get apple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should take this one step at a time:
- Separate "prefix" (which needs to be kept unchanged) from a new "substring" (actual name TBD) parameter.
- Implement "substring".
- Now support "after" and "prefix" - their implementations are probably unchanged.
We need to do all 3 in this PR, of course, but I think that I and @guy-har need to lay off supporting "after" until, well, after we agree about "substring".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I implemented the substring part. I thought to just call it search string, since that is what the client is trying to do and maybe we will improve the algorithm at some point- let me know what you think
pkg/catalog/catalog.go
Outdated
// collect limit +1 to return limit and has more | ||
if len(repos) >= limit+1 { | ||
break | ||
if strings.Contains(string(record.RepositoryID), prefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change: it changes the meaning of the "prefix" parameter! (It is also very confusing to users, who will probably not expect "prefix" to mean "substring").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I could change the name of the parameter- I don't see any other code that uses it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix is defined in our API and should still be supported, even if we wanted, we can't just change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's lots of code that can use prefix, we will just never know about it:
- Any third-party code might use it. It could even use our REST API directly, we would have a hard time detecting it.
- The high-level Python SDK method
repositories
supports it, so any code that uses that could rely on prefix.
Also,
- The S3 gateway method ListBuckets should support prefix; it seems that it doesn't, and that's now bug S3 gateway ListBuckets does not support "prefix" argument #8425 :-/
Not changing the behaviour of prefix may complicate this PR, but unfortunately we need to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have to support prefix and also a search string?
pkg/catalog/catalog.go
Outdated
prefixRepositoryID := graveler.RepositoryID(prefix) | ||
startPos := prefixRepositoryID | ||
if afterRepositoryID > startPos { | ||
startPos = afterRepositoryID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this break pagination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are correct thanks. Added some failing tests and fixed it
OK OK, let's not get over-excited. I deleted a few lines too much, but now I fixed them, so I think it should be fine. I can performance test it to make sure Oh as far as the API, I guess i could just rename it to substring right? No one else uses prefix, so no need for backwards compatiblity |
pkg/catalog/catalog.go
Outdated
startPos = afterRepositoryID | ||
} | ||
if startPos != "" { | ||
it.SeekGE(startPos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now we are missing repositories that include the string and are before the prefix (e.g for prefix ple we will never get apple)
pkg/catalog/catalog.go
Outdated
// collect limit +1 to return limit and has more | ||
if len(repos) >= limit+1 { | ||
break | ||
if strings.Contains(string(record.RepositoryID), prefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix is defined in our API and should still be supported, even if we wanted, we can't just change it.
@arielshaqed, can you please explain the calculation that brings us to 1000 RCUs/s? |
Sorry, you're right. I always make the same mistake: "kv".Store.Scan is actually DynamoDB Query, which indeed costs 1 RCU per 4 KiB of data. How large is a repository record?
Recalculating: I can certainly see >160 bytes per repository, which is indeed just under 25 repositories per RCU, so 4-5 RCUs per keystroke. Obviously not ideal but please strike my original numbers (and I'm editing that comment to clarify the error). |
@itaigilo pointed out to me that our frontend uses useDebouncedState with a wait time of 300ms, so even if the user types faster, it will generate at most one request per 300ms I could add some sort of substrings cache or index so that we aren't constantly iterating through the KV entries. Do you think it's neccessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the effort, this is a tricky full-stack feature.
Blocking for both the param naming,
And for at least considering a different design of the API.
api/swagger.yml
Outdated
@@ -70,6 +70,13 @@ components: | |||
description: delimiter used to group common prefixes by | |||
schema: | |||
type: string | |||
|
|||
PaginationSearchString: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why in some places it's SearchString
in in others it's search
?
Please be consistent, since it's making it much easier to track when dealing with params that are being passed from the FE to the BE services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, well a "search string" is a term: https://www.techtarget.com/whatis/definition/search-string
The typically query strings are shortened to one word. See for example:
927f8aa
to
8fbb77f
Compare
OK I addressed these issues (see the comments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, thanks! Please see comments about it being a search string, not a pagination search string. Now we'll see if anyone was depending on the old behaviour.1
Footnotes
-
See XKCD #1172. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice one.
Approving to unblock,
But please fix Ariel's and my style comments before merging (paginationSearch
, setsearch
, and console.log
).
pkg/catalog/catalog.go
Outdated
// In this case, pass the last repository name as 'after' on the next call to ListRepositories | ||
func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, after string) ([]*Repository, bool, error) { | ||
func (c *Catalog) ListRepositories(ctx context.Context, limit int, prefix, searchString, after string) ([]*Repository, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document searchString
Co-authored-by: Ariel Shaqed (Scolnicov) <ariels@treeverse.io>
128270c
to
4087e10
Compare
Closes #7596
Change Description
Background
Community feature request
New Feature
Repo search will seek substrings rather than prefixes, making it more likely to turn up what the user wants
Testing Details
Manual tests
Unit test Added