Text embeddings master sync #14159

Closed
wants to merge 14 commits into from

Conversation

ptjames
Contributor

@ptjames ptjames commented Jul 13, 2022

Resolves brave/brave-browser#23424

Submitter Checklist:

  • I confirm that no security/privacy review is needed, or that I have requested one
  • There is a ticket for my issue
  • Used GitHub auto-closing keywords in the PR description above
  • Wrote a good PR/commit description
  • Squashed any review feedback or "fixup" commits before merge, so that history is a record of what happened in the repo, not your PR
  • Added appropriate labels (QA/Yes or QA/No; release-notes/include or release-notes/exclude; OS/...) to the associated issue
  • Checked the PR locally: npm run test -- brave_browser_tests, npm run test -- brave_unit_tests, npm run lint, npm run gn_check, npm run tslint
  • Ran git rebase master (if needed)

Reviewer Checklist:

  • A security review is not needed, or a link to one is included in the PR description
  • New files have MPL-2.0 license header
  • Adequate test coverage exists to prevent regressions
  • Major classes, functions and non-trivial code blocks are well-commented
  • Changes in component dependencies are properly reflected in gn
  • Code follows the style guide
  • Test plan is specified in PR before merging

After-merge Checklist:

Test Plan:

@ptjames
Contributor Author

ptjames commented Jul 13, 2022

Previous PR with partial comments: #13749

@ptjames
Contributor Author

ptjames commented Jul 13, 2022

Why a new PR was needed:

The old text embedding branch stopped building for me after a recent npm run sync. In the Slack channel #browser-dev-guest, I saw that someone else hit the same error while trying to build an older version of Brave. Given this, plus a number of structural changes to the brave-core directory layout, I decided it would be best to re-apply my text embedding changes on a new branch created from current master.

@ptjames
Contributor Author

ptjames commented Jul 13, 2022

Referencing issue: brave/brave-browser#23424

#include <memory>
#include <string>

#include "bat/ads/internal/ml/data/vector_data.h"
Member

We can forward-declare VectorData and remove this include; the .cc file already includes the header.

Contributor Author

This is not working for me; perhaps I'm doing it incorrectly
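A possible explanation, offered as a sketch and not a diagnosis of this PR: a forward declaration is enough only when the header uses the type through a pointer or reference. If the header holds the type by value, the compiler needs the full definition. The class EmbeddingHolder and the simplified VectorData below are hypothetical stand-ins, not the actual brave-core classes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// --- What would live in the header (hypothetical names) ---
class VectorData;  // Forward declaration: VectorData is an incomplete type here.

class EmbeddingHolder {
 public:
  EmbeddingHolder();
  ~EmbeddingHolder();  // Defined below, where VectorData is complete.

  // References to an incomplete type are fine in declarations.
  void Set(const VectorData& data);
  std::size_t Dimension() const;

 private:
  std::unique_ptr<VectorData> data_;  // OK: pointer-like member.
  // VectorData by_value_;            // Would not compile: needs the definition.
};

// --- What would live in the .cc, after including the real header ---
// Simplified stand-in for the real VectorData class.
class VectorData {
 public:
  explicit VectorData(std::vector<float> values) : values_(std::move(values)) {}
  std::size_t size() const { return values_.size(); }

 private:
  std::vector<float> values_;
};

EmbeddingHolder::EmbeddingHolder() = default;
EmbeddingHolder::~EmbeddingHolder() = default;

void EmbeddingHolder::Set(const VectorData& data) {
  data_ = std::make_unique<VectorData>(data);  // Copies the argument.
}

std::size_t EmbeddingHolder::Dimension() const {
  return data_ ? data_->size() : 0;
}

// Small usage check for the sketch above.
void ForwardDeclarationDemo() {
  EmbeddingHolder holder;
  holder.Set(VectorData(std::vector<float>{0.5f, 0.25f}));
  assert(holder.Dimension() == 2);
}
```

If the header in question stores VectorData by value, keeping the include may be the simpler choice.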

std::string EmbeddingProcessing::CleanText(const std::string& text,
                                           bool is_html) {
  std::string cleaned_text = text;
  if (is_html) {
    cleaned_text = ParseTagAttribute(cleaned_text, "og:title", "content");
Member

We could extract "og:title" into a constant inside an anonymous namespace, for example constexpr char kOGTitleTag[] = "og:title";

Contributor Author

I agree, but I want to hold off on doing this. We may want to have some logic around possible valid options.
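For reference, the suggested constant could look roughly like the sketch below. The ParseTagAttribute shown here is a hypothetical stand-in written only so the example compiles; its signature and behavior are assumptions, not the PR's implementation. kContentAttribute and ExtractOGTitle are likewise invented for illustration.

```cpp
#include <cassert>
#include <string>

namespace {

// Constants with internal linkage, as suggested in the review.
constexpr char kOGTitleTag[] = "og:title";
constexpr char kContentAttribute[] = "content";  // Hypothetical name.

}  // namespace

// Hypothetical stand-in: finds `tag_substr` in `html`, then returns the value
// of the first `attribute="..."` that follows it, or "" if nothing matches.
std::string ParseTagAttribute(const std::string& html,
                              const std::string& tag_substr,
                              const std::string& attribute) {
  const auto tag_pos = html.find(tag_substr);
  if (tag_pos == std::string::npos) {
    return "";
  }
  const std::string needle = attribute + "=\"";
  const auto attr_pos = html.find(needle, tag_pos);
  if (attr_pos == std::string::npos) {
    return "";
  }
  const auto value_begin = attr_pos + needle.size();
  const auto value_end = html.find('"', value_begin);
  if (value_end == std::string::npos) {
    return "";
  }
  return html.substr(value_begin, value_end - value_begin);
}

// Call site using the named constants in place of string literals.
std::string ExtractOGTitle(const std::string& html) {
  return ParseTagAttribute(html, kOGTitleTag, kContentAttribute);
}

// Small usage check for the sketch above.
void OGTitleDemo() {
  const std::string html =
      "<meta property=\"og:title\" content=\"Hello World\">";
  assert(ExtractOGTitle(html) == "Hello World");
}
```

If several tags later become valid options, the single constant could be replaced by a small array of candidate tags in the same namespace.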

std::string timestamp_ = "";
std::string locale_ = "en";
int embeddings_dim_ = 0;
std::map<std::string, VectorData> embeddings_;
Member

We could call this vocabulary or embeddings_vocabulary.

Contributor Author

The map here does contain both the vocabulary and embeddings, so I wouldn't want it to sound like it only contains the vocabulary tokens
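To illustrate the point about the map holding both pieces, here is a minimal sketch in which the keys form the vocabulary and the values are the embedding vectors. Embedding (a plain std::vector<float>) stands in for VectorData, and EmbeddingResource, Vocabulary, and Lookup are hypothetical names, not part of the PR.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Embedding = std::vector<float>;  // Stand-in for VectorData.

// Hypothetical container mirroring the members shown in the diff.
struct EmbeddingResource {
  std::string locale = "en";
  int embeddings_dim = 0;
  std::map<std::string, Embedding> embeddings;  // token -> vector
};

// The vocabulary is just the set of keys.
std::vector<std::string> Vocabulary(const EmbeddingResource& resource) {
  std::vector<std::string> tokens;
  tokens.reserve(resource.embeddings.size());
  for (const auto& entry : resource.embeddings) {
    tokens.push_back(entry.first);
  }
  return tokens;
}

// Returns the embedding for `token`, or nullptr if the token is unknown.
const Embedding* Lookup(const EmbeddingResource& resource,
                        const std::string& token) {
  const auto iter = resource.embeddings.find(token);
  return iter == resource.embeddings.end() ? nullptr : &iter->second;
}

// Small usage check for the sketch above.
void EmbeddingsMapDemo() {
  EmbeddingResource resource;
  resource.embeddings_dim = 2;
  resource.embeddings["cat"] = {1.0f, 0.0f};
  resource.embeddings["hat"] = {0.0f, 1.0f};

  assert(Vocabulary(resource).size() == 2);
  assert(Lookup(resource, "cat") != nullptr);
  assert(Lookup(resource, "dog") == nullptr);
}
```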


void PurgeStaleTextEmbeddingHTMLEvents(TextEmbeddingHTMLEventCallback callback);

void GetTextEmbeddingEventsFromDatabase();
Member

Bringing over a comment from the other PR: we should return the text embeddings here (you may then need the vector include at the top).
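One way to read this suggestion, sketched under assumptions: the function hands the events back to the caller, here through a callback because the neighboring declaration also takes one. The review does not say whether a direct return value or a callback was intended. TextEmbeddingEvent, GetTextEmbeddingEventsCallback, and FakeEventStore are hypothetical names, and the in-memory store stands in for the real database.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical event record.
struct TextEmbeddingEvent {
  std::string locale;
  std::vector<float> embedding;
};

// Hypothetical callback type: reports success and hands back the events.
using GetTextEmbeddingEventsCallback =
    std::function<void(bool success,
                       const std::vector<TextEmbeddingEvent>& events)>;

// In-memory stand-in for the database-backed class.
class FakeEventStore {
 public:
  void Add(TextEmbeddingEvent event) { events_.push_back(std::move(event)); }

  // The caller receives the events instead of the function returning nothing.
  void GetTextEmbeddingEvents(
      const GetTextEmbeddingEventsCallback& callback) const {
    callback(/*success=*/true, events_);
  }

 private:
  std::vector<TextEmbeddingEvent> events_;
};

// Small usage check for the sketch above.
void GetEventsDemo() {
  FakeEventStore store;
  store.Add({"en", {0.1f, 0.2f}});

  std::size_t count = 0;
  store.GetTextEmbeddingEvents(
      [&count](bool success, const std::vector<TextEmbeddingEvent>& events) {
        assert(success);
        count = events.size();
      });
  assert(count == 1);
}
```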

@ptjames ptjames mentioned this pull request Jul 21, 2022
25 tasks
@ptjames ptjames closed this Jul 21, 2022
@github-actions github-actions bot added this to the 1.43.x - Nightly milestone Jul 21, 2022
@ptjames
Contributor Author

ptjames commented Jul 21, 2022

New PR to use for review: #14269

@ptjames ptjames deleted the text_embeddings_master_sync branch July 29, 2022 03:54
Development

Successfully merging this pull request may close these issues.

Implement text embedding processing for ad matching MVP
2 participants