-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regular expression models for literals #139
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks pretty good to me :D Just small comments/nits
source/LiteralModel.h
Outdated
void join_with(const LiteralModel& other); | ||
|
||
/** | ||
* Indicats whether model contains any taints. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spelling "Indicates"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐱👓
source/LiteralModel.cpp
Outdated
static constexpr const char* PATTERN = "pattern"; | ||
|
||
static constexpr const char* SOURCES = "sources"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we generally don't define json keys as constants in our code base (reasoning being that it adds another level of indirection for something that is quite localized). We just use the string literal at the call-site directly
source/LiteralModel.cpp
Outdated
LiteralModel LiteralModel::from_json( | ||
const Json::Value& value, | ||
Context& context) { | ||
JsonValidation::validate_object(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think check_unexpected_members already calls validate_object
source/Registry.cpp
Outdated
for (const auto& literal_model : literal_models_) { | ||
if (literal_model.second.matches(literal)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: think we usually do:
for (const auto& [pattern, model] : literal_models_) {
if (pattern.matches(literal)) {
// ...
}
}
// ------------------------------------------------------- | ||
// Issues | ||
// ------------------------------------------------------- | ||
public static void testRegexSource() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we usually have the word "issue" somewhere in the method name if we're expecting an issue there... so maybe testRegexSourceIssue()
or something? (and testRegexSourceNoMatchNoIssue()
when there are no issues expected)
aed2108
to
5019e0a
Compare
I also realised since I updated I manually compiled |
@yuhshin-oss has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
source/ForwardTaintTransfer.cpp
Outdated
@@ -5,6 +5,8 @@ | |||
* LICENSE file in the root directory of this source tree. | |||
*/ | |||
|
|||
#include <regex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this before, but let's use <re2/re2.h> instead
source/ForwardTaintTransfer.cpp
Outdated
@@ -5,6 +5,8 @@ | |||
* LICENSE file in the root directory of this source tree. | |||
*/ | |||
|
|||
#include <regex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this before, but let's use <re2/re2.h> instead
source/ForwardTaintTransfer.cpp
Outdated
if (model.empty()) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: style:
if (model.empty()) {
return false;
}
source/LiteralModel.h
Outdated
|
||
#pragma once | ||
|
||
#include <regex> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, use <re2/re2.h> instead.
source/LiteralModel.h
Outdated
Taint sources_; | ||
}; | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
} // namespace marianatrench
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, couple of clangtidy/format issues showing up internally that's too tedious to type all of them here. If you can address the RE2 one, we should be able to auto-format after importing
5019e0a
to
90a463b
Compare
Thanks, all addressed. I wasn't sure how to run the linter myself, otherwise I would have checked for that! |
@yuhshin-oss has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for coming up late on this, I saw a few things after @yuhshin-oss that I think are worth improving. Please bear with me.
source/ForwardTaintTransfer.cpp
Outdated
environment->write( | ||
aliasing.result_memory_locations(), | ||
model.sources(), | ||
UpdateKind::Weak); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use a strong update here. This could matter if this is in a loop.
source/LiteralModel.cpp
Outdated
for (const auto& source : sources) { | ||
mt_assert(source.is_leaf()); | ||
check_taint_config_consistency(pattern, source); | ||
all_sources.join_with(Taint{std::move(source)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using std::move
on a const ref, that does nothing. might as well remove that std::move
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I changed this to a const ref when I migrated to re2 and I missed the move.
source/LiteralModel.cpp
Outdated
return all_sources; | ||
} | ||
|
||
LiteralModel::LiteralModel() : regex_(pattern_) { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we create a LiteralModel
with an empty string and a pattern for the empty string?
I would prefer if this was more clear in the internal state, for instance we could use a std::optional<re2::RE2>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now used an optional, and we do not match anything if default.
source/LiteralModel.cpp
Outdated
LiteralModel::LiteralModel() : regex_(pattern_) { } | ||
|
||
LiteralModel::LiteralModel( | ||
std::string&& pattern, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually take by copy in those cases.
See https://www.youtube.com/watch?v=PNRju6_yn3o
Also, as a rule of thumb, let's be consistent with the rest of the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced by a std::string_view
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well technically std::string_view
is worse in the case where we move a std::string
, it will make an extra copy.
but performance is not a problem here unless proven otherwise, so that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I am seeing an extra copy in that case? std::string_view
doesn't have a constructor for std::string &&
, so won't it just do the same as usual and just copy the pointer and length? And the constructor of RE2
will always copy exactly once, irrespective of whether we use const std::string &
or const StringPiece &
/std::string_view
, since StringPiece
has a string_view
constructor.
I thus defaulted to std::string_view
since it doesn't require the caller to use std::move
. Having said that, we invoke this constructor exactly once with an rvalue
anyway, so I just changed it to an std::string
to keep things simple and consistent. 🙂
source/LiteralModel.cpp
Outdated
if (this != &other) { | ||
sources_.join_with(other.sources_); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the patterns are different I think it would make sense to set it to null/empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now checking if they have the same pattern, resetting regex_
otherwise.
source/LiteralModel.h
Outdated
Json::Value to_json(Context& context) const; | ||
|
||
private: | ||
std::string pattern_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to store the pattern since RE2
already does it. We could get it with regex_.pattern()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this was still an artifact from using std::regex
before I migrated to RE2
.
source/Registry.cpp
Outdated
@@ -176,13 +200,25 @@ void Registry::join_with(const FieldModel& field_model) { | |||
} | |||
} | |||
|
|||
void Registry::join_with(const LiteralModel& literal_model) { | |||
auto iterator = literal_models_.find(literal_model.pattern()); | |||
if (iterator != end(literal_models_)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: we usually use literal_models_.end()
. Let's be consistent with the rest of the codebase.
for (std::size_t i = batch_size * batch; i < batch_size * (batch + 1) && | ||
i < models.size() + field_models.size(); | ||
i < models.size() + field_models.size() + literal_models.size(); | ||
i++) { | ||
if (i < models.size()) { | ||
writer->write(models[i].to_json(context_), &batch_stream); | ||
} else { | ||
} else if (i < models.size() + field_models.size()) { | ||
writer->write( | ||
field_models[i - models.size()].to_json(context_), | ||
&batch_stream); | ||
} else { | ||
writer->write( | ||
literal_models[i - models.size() - field_models.size()].to_json(context_), | ||
&batch_stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-existing but we should probably rewrite this with a proper iterator. These index computations are error prone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what refactoring you have in mind here. Presently these models are stored in three different vectors and accessed based on the current batch
number. Is there an iterator-based API in sparta that I should use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, I'm not expecting you to implement this here. This could be done in another PR, or we could do this internally.
Anyway, I changed my mind here. Yes we could probably make this work with iterators but the fact that iterators hide the indexing will make this code unreadable.
What might make it easier to read is abstracting away "models + field_models + literal_models". We could have:
class ModelVector {
public:
std::size_t size() const { return _models.size() + _field_models.size() + _literal_models.size(); }
// TODO: find a better name..
Json::Value json_for_index(std::size_t i) {
if (i < _models.size()) {
return _models[i].to_json();
}
i -= _models.size();
if (i < _field_models.size()) {
return _field_models[i].to_json();
}
i -= _field_models.size();
if (i < _literal_models.size()) {
return _literal_models[i].to_json();
}
mt_unreachable("invalid index");
}
private:
std::vector<Model> _models;
std::vector<FieldModel> _field_models;
std::vector<LiteralModel> _literal_models;
}
source/Registry.h
Outdated
@@ -63,6 +65,7 @@ class Registry final { | |||
/* These are thread-safe. */ | |||
Model get(const Method* method) const; | |||
FieldModel get(const Field* field) const; | |||
LiteralModel get(std::string_view literal) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we change the name of this method?
When I read get
I usually expect a O(1)
but here we loop over all literal models.
Also, we might want another API latter that is an actual get
for a given pattern.
I would call it match_literal
or something.
Also nitpick: the comment above says "These are thread-safe" but your implementation of get
is not thread-safe (iterating on a ConcurrentMap is not thread safe). I think it's fine though as long as there are no mutations in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not my comment, but definitely surprised to learn that concurrent traversal in redex' ConcurrentMap
is not safe. Not what I'm used to from concurrent_unordered_map
. 🙂
Since I renamed the method, I now moved it away from those two get
mehods and gave it a comment of its own.
public class StringBuilder { | ||
|
||
public StringBuilder append(String str) { | ||
return this; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could all changes to library_classes be in a different PR? This causes a lot of changes in integration tests that are not related to the PR, making it hard to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a problem, here is the extracted PR:
#140
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, could you rebase this on top of master now that #140 was merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Match static regex for new source in `analyze_const_string`.
90a463b
to
7ceaec5
Compare
} | ||
|
||
LiteralModel::LiteralModel(const LiteralModel& other) | ||
: regex_(other.regex_ ? std::optional(other.regex_->pattern()) : std::nullopt), sources_(other.sources_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is because re2::RE2
is not copyable? That's annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct:
// Not copyable.
// RE2 objects are expensive. You should probably use std::shared_ptr<RE2>
// instead. If you really must copy, RE2(first.pattern(), first.options())
// effectively does so: it produces a second object that mimics the first.
RE2(const RE2&) = delete;
One alternative would indeed be to use a shared_ptr
instead of an optional
, RE2
objects are safe for concurrent use after all. Happy to apply that change if preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine as is :)
source/LiteralModel.cpp
Outdated
std::string LiteralModel::pattern() const { | ||
return regex_ ? regex_->pattern() : "(no pattern)"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we return a std::optional
instead? Not a fan of implicitly returning "no pattern".
source/LiteralModel.cpp
Outdated
} | ||
|
||
void LiteralModel::join_with(const LiteralModel& other) { | ||
if (this != &other) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: We generally prefer the early-return pattern, instead of nesting if blocks.
if (this != &other) { | |
if (this == &other) { | |
return; | |
} |
source/LiteralModel.cpp
Outdated
Json::Value LiteralModel::to_json(const ExportOriginsMode export_origins_mode) const { | ||
auto value = Json::Value(Json::objectValue); | ||
|
||
value["pattern"] = pattern(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should check beforehand if the pattern is empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Omitted now if the pattern is empty.
|
||
Json::Value LiteralModel::to_json(Context& context) const { | ||
auto value = to_json(context.options->export_origins_mode()); | ||
value["position"] = context.positions->unknown()->to_json(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine, let's keep it for consistency.
source/Registry.cpp
Outdated
@@ -176,13 +200,25 @@ void Registry::join_with(const FieldModel& field_model) { | |||
} | |||
} | |||
|
|||
void Registry::join_with(const LiteralModel& literal_model) { | |||
auto iterator = literal_models_.find(literal_model.pattern()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe assert that the pattern is non-empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now using mt_assert
:
const std::optional<std::string> pattern{literal_model.pattern()};
mt_assert(pattern);
Allow configuring regex sources using JSON config.
Add Google API key test for literal model.
Do not log tainting a register if model is empty.
Allow unused `description` property in literal model.
Add literal models doc and examples to `models.md`.
Expose models, field models and literal models in options and shim.
44f5954
to
3939f0d
Compare
@arthaud has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
As discussed in the workplace group I added support for source models for literals matching configurable regular expressions. The
models.md
documentation outlines what that would look like:Example literal models:
Example code:
Reviewer note
In order to test these changes, I added new classes and methods to
source/tests/integration/end-to-end/library_classes
, which causes a significant amount of unrelated changes to other tests'expected_xxx.json
. I tried to constrain those changes to the commitUpdate library_classes
.