-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue #1455] Implement logic for query box in the search API #1514
Conversation
@@ -167,7 +167,7 @@ db-migrate-heads: ## Show migrations marked as a head | |||
$(alembic_cmd) heads $(args) | |||
|
|||
db-seed-local: | |||
$(PY_RUN_CMD) db-seed-local | |||
$(PY_RUN_CMD) db-seed-local $(args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(chore): update the docs to include iterations: https://github.com/HHS/simpler-grants-gov/blob/chouinar/1455-query-box/documentation/api/database/database-local-usage.md?plain=1#L28
@@ -932,6 +932,8 @@ components: | |||
properties: | |||
query: | |||
type: string | |||
minLength: 1 | |||
maxLength: 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious / to have context: Was 100
chosen for the limit by any specific criteria? Could we easily adjust it in the future if we discover folx are entering really long term strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem hard to update . If the frontend also has <input min={1} max={100}
- we have to keep them in sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there's no technical/performance/etc limit we need to be mindful of? 100 is an arbitrary choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to @chouinar comment here it appears arbitrary: https://betagrantsgov.slack.com/archives/C05TSL64VUH/p1711038834496529?thread_ts=1710957436.730929&cid=C05TSL64VUH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The limit for your search box should be as long as possible, and no longer. For example, if your search engine only supports 256 characters, then don't allow users to enter in 257 characters. You should not make your text limit any shorter than that supported by the engine, because there will always be a small, but possibly significant, number of users that need to search for really long phrases. You should not make it any longer than supported by the search engine, because it will result in unnecessary error messages or unexpected search results.
— https://ux.stackexchange.com/questions/112927/character-limit-for-the-search-box-best-pratices
Trying to image how this edge case might happen, I can think of ways:
- Someone downloaded an opportunity PDF but forgot the URL; they search for a long string copied from the PDF
- Someone enters a keyword, a title, an ID, a assistance listing number, etc; the whole lot adds up quickly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusting this value takes 2 seconds (change the number in the API schema + a test).
We can make it longer, but of the two scenarios you put there, I think only the first would be valid at the moment. Supplying a keyword AND title AND id will almost certainly never return results as implemented. If you entered HHS health 00.123
that exact order of words would need to appear in at least one field we check against to return results, we aren't tokenizing and searching each individually.
"type": "min_or_max_length", | ||
} | ||
], | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:thumbs-up:
ids=search_scenario_id_fnc, | ||
) | ||
def test_opportunity_query_and_filter_200( | ||
self, client, api_auth_token, search_request, expected_scenarios, setup_scenarios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the setup_scenarios
fixture being used? I think there's a few tests in this file that it may not be used in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is used by all the tests in this class. It only gets called once for the set of tests for the class, but is required by all of them as it is where we setup the data we actually query against.
|
||
self.validate_results(search_request, search_response, expected_scenarios) | ||
|
||
def validate_results(self, search_request, search_response, expected_scenarios): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This is a bit of a larger ask but would it make sense to organize things by class - as they can be run that way via console? - Or trying to keep files under 1000 lines if possible by splitting them up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are organized by class - what specifically would you suggest be changed?
I do think this set of tests should be tidied up a bit, and splitting the different routes into separate files probably would be the right way to handle that.
Unfortunately, search requires a lot of tests to validate as there are effectively an infinite number of scenarios due to how many different possible values / fields we can search by.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and splitting the different routes into separate files probably would be the right way to handle that.
Yea something like this - I see that some of the parameterized tests use both the singular GET opportunity and plural POST opportunities. It's just early in the project and I would try to avoid the test files ending up like our past project with 5k lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a ticket to do that cleanup: #1524
OpportunitySummary.summary_description.ilike(ilike_query), | ||
# assistance listing number matches exactly or program title partial match | ||
OpportunityAssistanceListing.assistance_listing_number == query, | ||
OpportunityAssistanceListing.program_title.ilike(ilike_query), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this works for now - are you planning to run the query text against opensearch in the future or something?
@andycochran or @crystabelrangel - you might take a peek at what we have here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the structure of this file looks modular and flexible with small functions like _add_query_filters
and _add_filters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the search logic is very very simplistic on purpose at the moment. I did spend some time looking into:
- Postgres text search / tsvectors: https://www.postgresql.org/docs/current/textsearch.html
- Postgres text similarity: https://www.postgresql.org/docs/current/pgtrgm.html
The textsearch in postgres seemed to not work much better to be honest, maybe I was doing something wrong but the scores it produced generally just amounted to a 0
or ~1
value indicating it was in the column. It also didn't handle the case where you search HHS health
intending that to be the agency + a random text search term.
Text similarity did seem quite a bit better as it allows for a bit of flexibility, but would need to investigate further.
All-in-all, the more time we spend on this, the longer it takes for us to build out an actual search index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rylew1, not sure what feedback you need from design here. If it's just about what data the terms search against (opportunity_title
, opportunity_number
, agency
, summary_description
, assistance_listing_number
, program_title
seems like more that we even anticipated 👍) I think that's fine. At least for now; we'll learn more as we test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few nits but looks good!
Summary
Fixes #1455
Time to review: 10 mins
Changes proposed
Implements logic for the query field in the search API
Updated the seed script to allow you to generate more opportunities locally
Context for reviewers
This logic is deliberately very, very simple at the moment. It basically just does a text contains check across several columns. This is the most naive approach, and it is expected that if you tried to search two separate words at the same time, you probably won't get any results.
A few options we may consider later (depending on the level of effort we want to put into a non-search-index approach):
Additional information
The
ids
function I added for the tests was because there are a lot of tests and when test16
fails, its really hard to figure out which one that is. Now it outputs the request+expected results: