-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text embeddings master sync #14159
Text embeddings master sync #14159
Conversation
…torage unresolved
previous PR with partial comments: #13749 |
Reason for why new PR needed: The old text embedding branch stopped building properly for me after a recent npm run sync. I saw in slack #browser-dev-guest that someone else encountered the same error while trying to build an older version of brave. Given this, along with a number of structural changes to brave-core directory organization, I decided it would just be best to re-add my text embedding changes on top of a new branch created from current master. |
Referencing issue: brave/brave-browser#23424 |
vendor/bat-native-ads/src/bat/ads/internal/base/strings/string_html_parse_util.cc
Show resolved
Hide resolved
vendor/bat-native-ads/src/bat/ads/internal/ml/data/vector_data.h
Outdated
Show resolved
Hide resolved
#include <memory> | ||
#include <string> | ||
|
||
#include "bat/ads/internal/ml/data/vector_data.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can forward declare VectorData
and remove the include, we already import it in the .cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not working for me; perhaps I'm doing it incorrectly
std::string EmbeddingProcessing::CleanText(const std::string& text, bool is_html) { | ||
std::string cleaned_text = text; | ||
if (is_html) { | ||
cleaned_text = ParseTagAttribute(cleaned_text, "og:title", "content"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could extract "og:title"
to a constant like constexpr char kOGTitleTag[] = "og:title"
inside an anonymous namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I want to hold off on doing this. We may want to have some logic around possible valid options.
std::string timestamp_ = ""; | ||
std::string locale_ = "en"; | ||
int embeddings_dim_ = 0; | ||
std::map<std::string, VectorData> embeddings_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could call this vocabulary
or embeddings_vocabulary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map here does contain both the vocabulary and embeddings, so I wouldn't want it to sound like it only contains the vocabulary tokens
vendor/bat-native-ads/src/bat/ads/internal/ml/pipeline/text_processing/embedding_processing.cc
Outdated
Show resolved
Hide resolved
|
||
void PurgeStaleTextEmbeddingHTMLEvents(TextEmbeddingHTMLEventCallback callback); | ||
|
||
void GetTextEmbeddingEventsFromDatabase(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bringing over comment from other PR: we should return the text embeddings here (then you may need the vector include at the top)
...-ads/src/bat/ads/internal/processors/contextual/text_embedding/text_embedding_html_events.cc
Outdated
Show resolved
Hide resolved
...s/internal/processors/contextual/text_embedding/text_embedding_html_events_database_table.cc
Outdated
Show resolved
Hide resolved
...ve-ads/src/bat/ads/internal/processors/contextual/text_embedding/text_embedding_processor.cc
Outdated
Show resolved
Hide resolved
New PR to use for review: #14269 |
Resolves brave/brave-browser#23424
Submitter Checklist:
QA/Yes
orQA/No
;release-notes/include
orrelease-notes/exclude
;OS/...
) to the associated issuenpm run test -- brave_browser_tests
,npm run test -- brave_unit_tests
,npm run lint
,npm run gn_check
,npm run tslint
git rebase master
(if needed)Reviewer Checklist:
gn
After-merge Checklist:
changes has landed on
Test Plan: