
feat(search): supporting Chinese glossaryterm full text retrieval (#3914) #3956

Merged: 3 commits from Huyueeer:chinese_support into datahub-project:master on Feb 25, 2022

Conversation

@Huyueeer (Contributor) commented on Jan 24, 2022

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

Change

Source of the problem: #3914.
This change makes the analyzer configurable, so that main_tokenizer can be replaced with a word-segmentation tokenizer for other languages. The tokenizers tested are smartcn_tokenizer and ik_smart, both provided by Elasticsearch analysis plugins.
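
To make the configuration concrete, here is a minimal sketch of Elasticsearch index settings in which the main tokenizer is swapped for a Chinese word-segmentation tokenizer. The index name, analyzer name, and mapped field are illustrative only, and it assumes the analysis-smartcn plugin (which provides smartcn_tokenizer) is installed; ik_smart from the analysis-ik plugin could be substituted the same way. DataHub's actual settings builder may differ.

```json
PUT /glossary_term_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "main_tokenizer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "main_tokenizer" }
    }
  }
}
```

With this in place, a glossary term name such as 数据质量 is segmented into Chinese words at index time instead of being treated as a single opaque token, which is what enables full-text retrieval on individual words.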

@jjoyce0510 (Collaborator) left a comment


Wow - awesome PR!

This looks great to me. Want another pair of eyes on it, then we can ship. (cc. @dexter-mh-lee)

Thank you @Huyueeer!

@github-actions bot commented on Feb 1, 2022

Unit Test Results (build & test)

70 files ±0, 70 suites ±0, 18m 58s ⏱️ (−47s)
611 tests ±0: 552 passed ✔️ ±0, 59 skipped 💤 ±0, 0 failed ±0

Results for commit 9b1f151. ± Comparison against base commit 0fd4cb5.

♻️ This comment has been updated with latest results.

@shirshanka (Contributor) left a comment


LGTM!

@shirshanka shirshanka merged commit 3a0fe44 into datahub-project:master Feb 25, 2022
@Huyueeer Huyueeer deleted the chinese_support branch March 4, 2022 02:42
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
feat(search): supporting chinese glossaryterm full text retrieval (datahub-project#3914) (datahub-project#3956)

* feat(search): supporting chinese glossaryterm full text retrieval(datahub-project#3914)

* refactor(search): modify mainTokenizer to appropriate position(datahub-project#3914)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
@xiangqiao123 commented on Mar 14, 2023

@Huyueeer In Chinese, two-character words are very common. Could you help make MIN_LENGTH configurable? It would be very helpful for Chinese word segmentation.
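
To illustrate the request (an assumption about where MIN_LENGTH takes effect; the actual DataHub constant may sit elsewhere in the query or analyzer code): if the minimum length is enforced as an n-gram lower bound of 3, a two-character Chinese word never produces a token and therefore can never match. A hedged sketch of such a filter, with the value that would need to become configurable:

```json
PUT /example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "partial_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "partial_analyzer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer",
          "filter": ["partial_filter"]
        }
      }
    }
  }
}
```

Lowering min_gram to 2, or exposing it as configuration, would let two-character terms such as 数据 be indexed and matched.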

@Huyueeer (Contributor, Author) replied:

@xiangqiao123 Sorry, this part has since been rebuilt. It seems you should reach out to whoever implements this part now.
