Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spillover from search ticket #3106

Closed
6 of 7 tasks
MatMoore opened this issue Jan 26, 2024 · 7 comments
Closed
6 of 7 tasks

Spillover from search ticket #3106

MatMoore opened this issue Jan 26, 2024 · 7 comments
Assignees
Labels

Comments

@MatMoore
Copy link
Contributor

MatMoore commented Jan 26, 2024

Things that would be useful to add to the search method:

  • Return numAssets

  • Investigate why domain filters are returning things that don't have the domain set - can we control this? We can pass an addition filter for hasXXX

  • Add a method to get facets separately to the searchAcrossEntityCall. We need to know the domain URNs in order to pass the filters to the search call. If the URL query parameters contain only the labels then we will need to perform a lookup by label. This also gives us the option to show all possible options even when a filter is applied that excludes some of them.

  • The response object should make it easier to lookup facet labels by value or vice versa

  • There is no search highlighting yet - how do we pass this back? The default app does this client side, using react-highlight, so we would also need to this on the client side.

  • We need to be able to filter by ID. (Fixed by passing urn as the filter name)

  • When searching for a data product by name, all the datasets within it rank higher than the actual data product. We would expect this to be the top ranked result. My best guess is that datasets have more fields, and each match with the search terms increases the score of a result. We could avoid this by separating the data product and dataset queries into two queries, and mixing them on the frontend so that data products go at the top. Alternatively, we could try and customise the scoring to boost results based on _entityType = 'DataProduct'.

@MatMoore MatMoore converted this from a draft issue Jan 26, 2024
@MatMoore MatMoore self-assigned this Jan 26, 2024
@MatMoore MatMoore moved this from Todo to In Progress in Data Catalogue Jan 26, 2024
@MatMoore
Copy link
Contributor Author

Note: we are currently using the deprecated filters parameter rather than the more expressive orFilters parameter

We should fix this

@MatMoore
Copy link
Contributor Author

The domains filter does not return null domains

Example query on demo instance

{
  searchAcrossEntities(
    input: {
      types: [DATASET],
      query: "*",
      start: 0,
      count: 10,
			orFilters: [
        {and: [{field:"domains", values: ["urn:li:domain:7d64d0fa-66c3-445c-83db-3a324723daf8"]}]}
      ]
    }) {
    start
    count
    total
    searchResults {
      entity {
        urn
        ... on Dataset {
          domain {
            associatedUrn
          }
        }
      }
    }
  }
}

Will need to debug this further on our instance

@MatMoore
Copy link
Contributor Author

MatMoore commented Jan 29, 2024

How the backend of search works

What fields are indexed in elasticsearch?

public enum FieldType {
    KEYWORD,
    TEXT,
    TEXT_PARTIAL,
    BROWSE_PATH,
    URN,
    URN_PARTIAL,
    BOOLEAN,
    COUNT,
    DATETIME,
    OBJECT,
    BROWSE_PATH_V2,
    WORD_GRAM,
    DOUBLE
  }

How is the Elasticsearch query built?

  1. The graphql query is handled by the graphql resolver https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/resolvers/search/SearchAcrossEntitiesResolver.java
  2. The resolver calls an EntityClient
  3. EntityClient calls a EntitySearchService https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/client/JavaEntityClient.java#L321
  4. ElasticsearchService calls ESSearchDAO https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java#L132 https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/ESSearchDAO.java
  5. ESSearchDAO uses a SearchRequestHandler to build the query https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchRequestHandler.java
  6. SearchRequestHandler uses SearchQueryBuilder https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java#L60
  7. This uses the opensearch library directly (e.g. QueryBuilder, QueryBuilders) as well as https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/utils/ESUtils.java
  8. For each field in the SearchConfig, SearchQueryBuilder generates Elasticsearch should clauses
  9. SearchQueryBuilder Applies score functions (including any custom score config) https://github.com/datahub-project/datahub/blob/dc16c73937dcb4a287653090faf3c32807257872/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java#L106
  10. SearchRequestHandler builds the filter query https://github.com/datahub-project/datahub/blob/dc16c73937dcb4a287653090faf3c32807257872/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchRequestHandler.java#L188
  11. It doesn't seem like there are any restrictions over what fields can be filtered on at this stage - this gets converted straight to an elasticsearch query. So any field that is marked as searchable in some way should be filterable via the API.

Observations

  1. Soft deleted entities are excluded by default. This can be overriden by including filters for the "removed" field
  2. We get different behaviour if we opt into structured queries and prefix our query string with "\\/q "
  3. urn is always available as a filter
  4. Some fields are collections and are pluralised (e.g. domains, customProperties) others are singular
  5. Tags, terms, descriptions are expanded to multiple fields, to take into account editable/non editable versions
  6. I didn't spot any code for deliberately including nulls in filters (see below)
  7. There is some special filter logic for boolean fields
  private static QueryBuilder buildEqualsConditionFromCriterionWithValues(
      @Nonnull final String fieldName,
      @Nonnull final Criterion criterion,
      final boolean isTimeseries) {
    if (BOOLEAN_FIELDS.contains(fieldName) && criterion.getValues().size() == 1) {
      // Handle special-cased Boolean fields.
      // here we special case boolean fields we recognize the names of and hard-cast
      // the first provided value to a boolean to do the comparison.
      // Ideally, we should detect the type of the field from the entity-registry in order
      // to determine how to cast.
      return QueryBuilders.termQuery(fieldName, Boolean.parseBoolean(criterion.getValues().get(0)))
          .queryName(fieldName);
    }
    return QueryBuilders.termsQuery(
            toKeywordField(criterion.getField(), isTimeseries), criterion.getValues())
        .queryName(fieldName);
  }

@MatMoore
Copy link
Contributor Author

MatMoore commented Jan 29, 2024

If we need to tweak the ranking, we can customise the elasticsearch query like this
https://datahubproject.io/docs/how/search/#example-3-exclusion--bury

@MatMoore MatMoore moved this from In Progress to Review in Data Catalogue Jan 30, 2024
@MatMoore
Copy link
Contributor Author

MatMoore commented Jan 30, 2024

Search highlighting in the react app uses react-highlight. The server only indicates whether there is a matched field or not.

    const appConfig = useAppConfig();
    const enableNameHighlight = appConfig.config.visualConfig.searchResult?.enableNameHighlight;
    const matchedFields = useMatchedFieldsByGroup(field);
    const hasMatchedField = !!matchedFields?.length;
    const normalizedSearchQuery = useSearchQuery()?.trim().toLowerCase();
    const normalizedText = text.trim().toLowerCase();
    const hasSubstring = hasMatchedField && !!normalizedSearchQuery && normalizedText.includes(normalizedSearchQuery);
    const pattern = enableFullHighlight ? HIGHLIGHT_ALL_PATTERN : undefined;

    return (
        <>
            {enableNameHighlight && hasMatchedField ? (
                <StyledHighlight search={hasSubstring ? normalizedSearchQuery : pattern}>{text}</StyledHighlight>
            ) : (
                text
            )}
        </>
    );

@MatMoore MatMoore moved this from Review to Done in Data Catalogue Feb 1, 2024
Copy link
Contributor

github-actions bot commented Apr 1, 2024

This issue is being marked as stale because it has been open for 60 days with no activity. Remove stale label or comment to keep the issue open.

@github-actions github-actions bot added the stale label Apr 1, 2024
Copy link
Contributor

github-actions bot commented Apr 9, 2024

This issue is being closed because it has been open for a further 7 days with no activity. If this is still a valid issue, please reopen it, Thank you!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done ✅
Development

No branches or pull requests

2 participants