Spillover from search ticket #3106

MatMoore · 2024-01-26T16:46:13Z

Things that would be useful to add to the search method:

Return numAssets
Investigate why domain filters are returning things that don't have the domain set - can we control this? We can pass an additional filter for hasXXX
Add a method to get facets separately from the searchAcrossEntities call. We need to know the domain URNs in order to pass the filters to the search call. If the URL query parameters contain only the labels then we will need to perform a lookup by label. This also gives us the option to show all possible options even when a filter is applied that excludes some of them.
The response object should make it easier to look up facet labels by value or vice versa
There is no search highlighting yet - how do we pass this back? The default app does this client side, using react-highlight, so we would also need to do this on the client side.
We need to be able to filter by ID. (Fixed by passing urn as the filter name)
When searching for a data product by name, all the datasets within it rank higher than the actual data product. We would expect this to be the top ranked result. My best guess is that datasets have more fields, and each match with the search terms increases the score of a result. We could avoid this by separating the data product and dataset queries into two queries, and mixing them on the frontend so that data products go at the top. Alternatively, we could try and customise the scoring to boost results based on _entityType = 'DataProduct'.
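
A minimal sketch of the two-way facet lookup described in the list above, assuming facets come back as (value, label, count) triples. The names FacetOption, SearchFacets, value_for_label and label_for_value are hypothetical and are not part of any existing client.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FacetOption:
    value: str  # what the filter needs, e.g. a domain URN
    label: str  # what a URL query parameter might carry
    count: int = 0


@dataclass
class SearchFacets:
    """Hypothetical container for facets, keyed by facet (field) name."""

    options: Dict[str, List[FacetOption]] = field(default_factory=dict)

    def value_for_label(self, facet: str, label: str) -> Optional[str]:
        # Linear scan; facet lists are assumed to be short.
        for option in self.options.get(facet, []):
            if option.label == label:
                return option.value
        return None

    def label_for_value(self, facet: str, value: str) -> Optional[str]:
        for option in self.options.get(facet, []):
            if option.value == value:
                return option.label
        return None
```

With something like this, a label taken from the URL could be resolved to a URN before building the filter, and a URN in a result could be rendered with its label.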

MatMoore · 2024-01-29T09:43:24Z

Note: we are currently using the deprecated filters parameter rather than the more expressive orFilters parameter

We should fix this
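
As a rough sketch of what the switch could involve, assuming the client currently holds filters as a simple field-to-values mapping: orFilters takes a list of AND-groups, in the same shape as the example query further down this thread. The helper name to_or_filters is hypothetical.

```python
from typing import Dict, List


def to_or_filters(filters: Dict[str, List[str]]) -> List[dict]:
    """Wrap {field: values} in a single AND-group suitable for orFilters.

    Every field in the mapping must match; further groups could be
    appended to the returned list to express OR between groups.
    """
    if not filters:
        return []
    return [
        {
            "and": [
                {"field": name, "values": list(values)}
                for name, values in filters.items()
            ]
        }
    ]
```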

MatMoore · 2024-01-29T09:54:03Z

The domains filter does not return null domains

Example query on demo instance

{
  searchAcrossEntities(
    input: {
      types: [DATASET],
      query: "*",
      start: 0,
      count: 10,
      orFilters: [
        {and: [{field:"domains", values: ["urn:li:domain:7d64d0fa-66c3-445c-83db-3a324723daf8"]}]}
      ]
    }) {
    start
    count
    total
    searchResults {
      entity {
        urn
        ... on Dataset {
          domain {
            associatedUrn
          }
        }
      }
    }
  }
}

Will need to debug this further on our instance
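
One way to check this on our instance might be to run the query above and list any results whose domain comes back null. This is a sketch that assumes the response has already been parsed into a dict shaped like the query; the function name urns_without_domain is made up for illustration.

```python
from typing import List


def urns_without_domain(response: dict) -> List[str]:
    """Return URNs of search results whose `domain` field is null or absent."""
    search = (response.get("data") or {}).get("searchAcrossEntities") or {}
    missing = []
    for result in search.get("searchResults", []):
        entity = result.get("entity") or {}
        if not entity.get("domain"):
            missing.append(entity.get("urn"))
    return missing
```

If the domain filter behaves as it does on the demo instance, this should return an empty list for a filtered query.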

MatMoore · 2024-01-29T10:35:50Z

How the backend of search works

What fields are indexed in elasticsearch?

This is handled by https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/timeseries/elastic/indexbuilder/MappingsBuilder.java
Metadata is ultimately coming from https://github.com/datahub-project/datahub/blob/master/metadata-models/src/main/resources/entity-registry.yml
These reference aspect schemas (*.pdl files in metadata-models)
When the registry is loaded, a SearchScoreFieldSpecExtractor is instantiated
It looks for Searchable annotation in the pdl
This annotation is visible as a tag in the datahub demo instance https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)/Schema?is_lineage_mode=false&schemaFilter=
There is also a SearchScore annotation that can be used - such fields are not made searchable but are mapped in elasticsearch and can be used to influence the ranking
Fields may have a corresponding hasXXX and/or numXXX field for use in filters
The Searchable annotation may override the fieldName, so the elasticsearch field name does not necessarily match the GraphQL one
An entity's fields are flattened by aspect - e.g. name instead of properties.name
Searchable annotations also define a field type - KEYWORD is a straight match, whereas everything else will be analysed/tokenised in some way

public enum FieldType {
    KEYWORD,
    TEXT,
    TEXT_PARTIAL,
    BROWSE_PATH,
    URN,
    URN_PARTIAL,
    BOOLEAN,
    COUNT,
    DATETIME,
    OBJECT,
    BROWSE_PATH_V2,
    WORD_GRAM,
    DOUBLE
}
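
To make the KEYWORD versus analysed distinction concrete, here is a toy model in plain Python. It is not how Elasticsearch analyses text (the real analysers are more involved); the tokeniser below is an assumption chosen only to show why a partial term matches a TEXT-style field but not a KEYWORD one.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    # Crude stand-in for an analyser: lowercase, split on non-alphanumerics.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_match(field_value: str, query: str) -> bool:
    """KEYWORD-style: the whole value must match exactly."""
    return field_value == query


def text_match(field_value: str, query: str) -> bool:
    """TEXT-style: match if any token is shared between value and query."""
    return bool(_tokens(field_value) & _tokens(query))
```

So a filter on a KEYWORD field needs the full stored value (for example a complete URN), whereas a search over an analysed field can match on a single word.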

How is the Elasticsearch query built?

The graphql query is handled by the graphql resolver https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/resolvers/search/SearchAcrossEntitiesResolver.java
The resolver calls an EntityClient
EntityClient calls an EntitySearchService https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/client/JavaEntityClient.java#L321
ElasticSearchService calls ESSearchDAO https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/ElasticSearchService.java#L132 https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/ESSearchDAO.java
ESSearchDAO uses a SearchRequestHandler to build the query https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchRequestHandler.java
SearchRequestHandler uses SearchQueryBuilder https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java#L60
This uses the opensearch library directly (e.g. QueryBuilder, QueryBuilders) as well as https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/utils/ESUtils.java
For each field in the SearchConfig, SearchQueryBuilder generates Elasticsearch should clauses
SearchQueryBuilder applies score functions (including any custom score config) https://github.com/datahub-project/datahub/blob/dc16c73937dcb4a287653090faf3c32807257872/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchQueryBuilder.java#L106
SearchRequestHandler builds the filter query https://github.com/datahub-project/datahub/blob/dc16c73937dcb4a287653090faf3c32807257872/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/query/request/SearchRequestHandler.java#L188
It doesn't seem like there are any restrictions over what fields can be filtered on at this stage - this gets converted straight to an elasticsearch query. So any field that is marked as searchable in some way should be filterable via the API.
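
The "should clause per field" step can be pictured with a heavily simplified sketch of the resulting query body. The real SearchQueryBuilder emits more varied clauses and wraps them in score functions; the function name build_should_query and the field names and boosts in the usage note are assumptions for illustration only.

```python
from typing import Dict


def build_should_query(query: str, field_boosts: Dict[str, float]) -> dict:
    """Build a bool query with one boosted match clause per searchable field."""
    should = [
        {"match": {name: {"query": query, "boost": boost}}}
        for name, boost in field_boosts.items()
    ]
    # At least one field has to match for a document to be returned.
    return {"bool": {"should": should, "minimum_should_match": 1}}
```

For example, build_should_query("prisons", {"name": 10.0, "description": 1.0}) yields two should clauses. Because the scores of matching should clauses are added together, an entity that matches on more fields tends to score higher, which fits the guess about datasets outranking their data product.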

Observations

Soft deleted entities are excluded by default. This can be overridden by including filters for the "removed" field
We get different behaviour if we opt into structured queries and prefix our query string with "/q "
urn is always available as a filter
Some fields are collections and are pluralised (e.g. domains, customProperties), others are singular
Tags, terms, descriptions are expanded to multiple fields, to take into account editable/non editable versions
I didn't spot any code for deliberately including nulls in filters (see below)
There is some special filter logic for boolean fields

  private static QueryBuilder buildEqualsConditionFromCriterionWithValues(
      @Nonnull final String fieldName,
      @Nonnull final Criterion criterion,
      final boolean isTimeseries) {
    if (BOOLEAN_FIELDS.contains(fieldName) && criterion.getValues().size() == 1) {
      // Handle special-cased Boolean fields.
      // here we special case boolean fields we recognize the names of and hard-cast
      // the first provided value to a boolean to do the comparison.
      // Ideally, we should detect the type of the field from the entity-registry in order
      // to determine how to cast.
      return QueryBuilders.termQuery(fieldName, Boolean.parseBoolean(criterion.getValues().get(0)))
          .queryName(fieldName);
    }
    return QueryBuilders.termsQuery(
            toKeywordField(criterion.getField(), isTimeseries), criterion.getValues())
        .queryName(fieldName);
  }
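
For reference, roughly what that method produces, written as plain query dicts. This is a simplified mirror, not the real implementation: the contents of BOOLEAN_FIELDS are assumed to be just "removed", the queryName is omitted, and toKeywordField is approximated by always appending ".keyword" (the real helper decides per field).

```python
from typing import List

# Assumption: the set of special-cased boolean field names.
BOOLEAN_FIELDS = {"removed"}


def build_equals_condition(field_name: str, values: List[str]) -> dict:
    """Mirror of the Java logic: term query for booleans, terms otherwise."""
    if field_name in BOOLEAN_FIELDS and len(values) == 1:
        # Boolean.parseBoolean is true only for a case-insensitive "true".
        return {"term": {field_name: values[0].lower() == "true"}}
    # Simplification: always target the keyword sub-field.
    return {"terms": {field_name + ".keyword": list(values)}}
```

Under these assumptions, a "removed" filter with the single value "true" becomes a boolean term query, while any other filter becomes an exact match on one of the supplied values, with no clause that would match documents where the field is missing.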

MatMoore · 2024-01-29T11:22:12Z

If we need to tweak the ranking, we can customise the elasticsearch query like this
https://datahubproject.io/docs/how/search/#example-3-exclusion--bury
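
If customising the query turns out to be awkward, the other option raised in the opening comment (putting data products ahead of datasets on our side) could be as small as a stable sort. This sketch assumes each result exposes an entity type and that data products are reported as "DATA_PRODUCT"; both are assumptions to check against the actual response.

```python
from typing import List


def data_products_first(results: List[dict]) -> List[dict]:
    """Move data products to the top, keeping the server's order otherwise."""
    # sorted() is stable, and False sorts before True, so data products
    # come first and relative ranking inside each group is preserved.
    return sorted(
        results,
        key=lambda result: result["entity"]["type"] != "DATA_PRODUCT",
    )
```

This only reorders within a single page of results, so it would not surface a data product that ranked onto a later page.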

MatMoore · 2024-01-30T10:45:38Z

Search highlighting in the react app uses react-highlight. The server only indicates whether there is a matched field or not.

    const appConfig = useAppConfig();
    const enableNameHighlight = appConfig.config.visualConfig.searchResult?.enableNameHighlight;
    const matchedFields = useMatchedFieldsByGroup(field);
    const hasMatchedField = !!matchedFields?.length;
    const normalizedSearchQuery = useSearchQuery()?.trim().toLowerCase();
    const normalizedText = text.trim().toLowerCase();
    const hasSubstring = hasMatchedField && !!normalizedSearchQuery && normalizedText.includes(normalizedSearchQuery);
    const pattern = enableFullHighlight ? HIGHLIGHT_ALL_PATTERN : undefined;

    return (
        <>
            {enableNameHighlight && hasMatchedField ? (
                <StyledHighlight search={hasSubstring ? normalizedSearchQuery : pattern}>{text}</StyledHighlight>
            ) : (
                text
            )}
        </>
    );
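
If we end up doing the highlighting ourselves rather than with react-highlight, the substring case could be sketched like this. It is an illustration only: the mark element and the function name highlight are our own choices, and it covers just the simple "query appears as a substring" branch of the logic above.

```python
import html
import re


def highlight(text: str, query: str) -> str:
    """Wrap case-insensitive occurrences of query in <mark>, escaping HTML."""
    query = query.strip()
    if not query:
        return html.escape(text)
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches and the match itself separately,
        # so the inserted tags are the only markup in the output.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```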

github-actions · 2024-04-01T01:49:58Z

This issue is being marked as stale because it has been open for 60 days with no activity. Remove stale label or comment to keep the issue open.

github-actions · 2024-04-09T01:47:59Z

This issue is being closed because it has been open for a further 7 days with no activity. If this is still a valid issue, please reopen it, Thank you!

MatMoore added this to Data Catalogue Jan 26, 2024

MatMoore converted this from a draft issue Jan 26, 2024

moj-data-platform-robot added this to Analytical Platform Jan 26, 2024

MatMoore self-assigned this Jan 26, 2024

MatMoore moved this from Todo to In Progress in Data Catalogue Jan 26, 2024

tom-webber removed this from Analytical Platform Jan 29, 2024

MatMoore moved this from In Progress to Review in Data Catalogue Jan 30, 2024

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report #3148

Closed

MatMoore moved this from Review to Done in Data Catalogue Feb 1, 2024

github-actions bot added the stale label Apr 1, 2024

github-actions bot closed this as not planned Apr 9, 2024

jacobwoffenden added the data-platform-labs label Apr 23, 2024
