Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expression models for literals #139

Closed
wants to merge 7 commits into from

Conversation

pkesseli
Copy link
Contributor

@pkesseli pkesseli commented Aug 7, 2023

As discussed in the workplace group I added support for source models for literals matching configurable regular expressions. The models.md documentation outlines what that would look like:


Example literal models:

[
  {
    "pattern": "SELECT \\*.*",
    "description": "Potential SQL Query",
    "sources": [
      {
        "kind": "SqlQuery"
      }
    ]
  },
  {
    "pattern": "AI[0-9A-Z]{16}",
    "description": "Suspected Google API Key",
    "sources": [
      {
        "kind": "GoogleAPIKey"
      }
    ]
  }
]

Example code:

void testRegexSource() {
  String prefix = "SELECT * FROM USERS WHERE id = ";
  String aci = getAttackerControlledInput();
  String query = prefix + aci; // Sink
}

void testRegexSourceGoogleApiKey() {
  String secret = "AIABCD1234EFGH5678";
  sink(secret);
}

Reviewer note

In order to test these changes, I added new classes and methods to source/tests/integration/end-to-end/library_classes, which causes a significant amount of unrelated changes to other tests' expected_xxx.json. I tried to constrain those changes to the commit Update library_classes.

Copy link
Contributor

@yuhshin-oss yuhshin-oss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks pretty good to me :D Just small comments/nits

void join_with(const LiteralModel& other);

/**
* Indicats whether model contains any taints.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spelling "Indicates"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐱‍👓

Comment on lines 73 to 75
static constexpr const char* PATTERN = "pattern";

static constexpr const char* SOURCES = "sources";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we generally don't define json keys as constants in our code base (reasoning being that it adds another level of indirection for something that is quite localized). We just use the string literal at the call-site directly

LiteralModel LiteralModel::from_json(
const Json::Value& value,
Context& context) {
JsonValidation::validate_object(value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think check_unexpected_members already calls validate_object

source/LiteralModel.cpp Show resolved Hide resolved
source/LiteralModel.cpp Show resolved Hide resolved
Comment on lines 153 to 154
for (const auto& literal_model : literal_models_) {
if (literal_model.second.matches(literal)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: think we usually do:

for (const auto& [pattern, model] : literal_models_) {
  if (pattern.matches(literal)) {
    // ...
  }
}

source/Registry.h Show resolved Hide resolved
// -------------------------------------------------------
// Issues
// -------------------------------------------------------
public static void testRegexSource() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we usually have the word "issue" somewhere in the method name if we're expecting an issue there... so maybe testRegexSourceIssue() or something? (and testRegexSourceNoMatchNoIssue() when there are no issues expected)

@pkesseli pkesseli force-pushed the pkesseli/literal-sources branch 2 times, most recently from aed2108 to 5019e0a Compare August 9, 2023 21:17
@pkesseli
Copy link
Contributor Author

pkesseli commented Aug 9, 2023

I also realised since I updated library_classes I also affected source/tests/integration/end-to-end/code/check_cast_types, which doesn't get run by default in tests.yml. I updated those expected JSON files in my latest push as well now.

I manually compiled check_cast_types/KotlinCheckCast.kt and added it to the JAR - not sure whether that's intended way of running this test, but it produced the same JSON output files again except for my new models in library_classes. 🙂

@facebook-github-bot
Copy link
Contributor

@yuhshin-oss has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -5,6 +5,8 @@
* LICENSE file in the root directory of this source tree.
*/

#include <regex>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this before, but let's use <re2/re2.h> instead

@@ -5,6 +5,8 @@
* LICENSE file in the root directory of this source tree.
*/

#include <regex>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this before, but let's use <re2/re2.h> instead

Comment on lines 1212 to 1213
if (model.empty())
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: style:

if (model.empty()) {
  return false;
}


#pragma once

#include <regex>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, use <re2/re2.h> instead.

Taint sources_;
};

}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

} // namespace marianatrench

Copy link
Contributor

@yuhshin-oss yuhshin-oss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, couple of clangtidy/format issues showing up internally that's too tedious to type all of them here. If you can address the RE2 one, we should be able to auto-format after importing

@pkesseli
Copy link
Contributor Author

Ugh, couple of clangtidy/format issues showing up internally that's too tedious to type all of them here. If you can address the RE2 one, we should be able to auto-format after importing

Thanks, all addressed. I wasn't sure how to run the linter myself, otherwise I would have checked for that!

@facebook-github-bot
Copy link
Contributor

@yuhshin-oss has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@arthaud arthaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for coming up late on this, I saw a few things after @yuhshin-oss that I think are worth improving. Please bear with me.

environment->write(
aliasing.result_memory_locations(),
model.sources(),
UpdateKind::Weak);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use a strong update here. This could matter if this is in a loop.

for (const auto& source : sources) {
mt_assert(source.is_leaf());
check_taint_config_consistency(pattern, source);
all_sources.join_with(Taint{std::move(source)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using std::move on a const ref, that does nothing. might as well remove that std::move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I changed this to a const ref when I migrated to re2 and I missed the move.

return all_sources;
}

LiteralModel::LiteralModel() : regex_(pattern_) { }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we create a LiteralModel with an empty string and a pattern for the empty string?
I would prefer if this was more clear in the internal state, for instance we could use a std::optional<re2::RE2>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now used an optional, and we do not match anything if default.

LiteralModel::LiteralModel() : regex_(pattern_) { }

LiteralModel::LiteralModel(
std::string&& pattern,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually take by copy in those cases.
See https://www.youtube.com/watch?v=PNRju6_yn3o

Also, as a rule of thumb, let's be consistent with the rest of the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced by a std::string_view.

Copy link
Contributor

@arthaud arthaud Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well technically std::string_view is worse in the case where we move a std::string, it will make an extra copy.
but performance is not a problem here unless proven otherwise, so that's fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I am seeing an extra copy in that case? std::string_view doesn't have a constructor for std::string &&, so won't it just do the same as usual and just copy the pointer and length? And the constructor of RE2 will always copy exactly once, irrespective of whether we use const std::string & or const StringPiece &/std::string_view, since StringPiece has a string_view constructor.

I thus defaulted to std::string_view since it doesn't require the caller to use std::move. Having said that, we invoke this constructor exactly once with an rvalue anyway, so I just changed it to an std::string to keep things simple and consistent. 🙂

Comment on lines 68 to 74
if (this != &other) {
sources_.join_with(other.sources_);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the patterns are different I think it would make sense to set it to null/empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now checking if they have the same pattern, resetting regex_ otherwise.

Json::Value to_json(Context& context) const;

private:
std::string pattern_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to store the pattern since RE2 already does it. We could get it with regex_.pattern().

Copy link
Contributor Author

@pkesseli pkesseli Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this was still an artifact from using std::regex before I migrated to RE2.

@@ -176,13 +200,25 @@ void Registry::join_with(const FieldModel& field_model) {
}
}

void Registry::join_with(const LiteralModel& literal_model) {
auto iterator = literal_models_.find(literal_model.pattern());
if (iterator != end(literal_models_)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: we usually use literal_models_.end(). Let's be consistent with the rest of the codebase.

Comment on lines 349 to +363
for (std::size_t i = batch_size * batch; i < batch_size * (batch + 1) &&
i < models.size() + field_models.size();
i < models.size() + field_models.size() + literal_models.size();
i++) {
if (i < models.size()) {
writer->write(models[i].to_json(context_), &batch_stream);
} else {
} else if (i < models.size() + field_models.size()) {
writer->write(
field_models[i - models.size()].to_json(context_),
&batch_stream);
} else {
writer->write(
literal_models[i - models.size() - field_models.size()].to_json(context_),
&batch_stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-existing but we should probably rewrite this with a proper iterator. These index computations are error prone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what refactoring you have in mind here. Presently these models are stored in three different vectors and accessed based on the current batch number. Is there an iterator-based API in sparta that I should use?

Copy link
Contributor

@arthaud arthaud Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, I'm not expecting you to implement this here. This could be done in another PR, or we could do this internally.

Anyway, I changed my mind here. Yes we could probably make this work with iterators but the fact that iterators hide the indexing will make this code unreadable.

What might make it easier to read is abstracting away "models + field_models + literal_models". We could have:

class ModelVector {
  public:
   std::size_t size() const { return _models.size() + _field_models.size() + _literal_models.size(); }

  // TODO: find a better name..
  Json::Value json_for_index(std::size_t i) {
    if (i < _models.size()) {
      return _models[i].to_json();
    }
    i -= _models.size();
    if (i < _field_models.size()) {
      return _field_models[i].to_json();
    }
    i -= _field_models.size();
    if (i < _literal_models.size()) {
      return _literal_models[i].to_json();
    }
    mt_unreachable("invalid index");
  }

  private:
    std::vector<Model> _models;
    std::vector<FieldModel> _field_models;
    std::vector<LiteralModel> _literal_models;
}

@@ -63,6 +65,7 @@ class Registry final {
/* These are thread-safe. */
Model get(const Method* method) const;
FieldModel get(const Field* field) const;
LiteralModel get(std::string_view literal) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we change the name of this method?
When I read get I usually expect a O(1) but here we loop over all literal models.
Also, we might want another API latter that is an actual get for a given pattern.
I would call it match_literal or something.
Also nitpick: the comment above says "These are thread-safe" but your implementation of get is not thread-safe (iterating on a ConcurrentMap is not thread safe). I think it's fine though as long as there are no mutations in parallel.

Copy link
Contributor Author

@pkesseli pkesseli Aug 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not my comment, but definitely surprised to learn that concurrent traversal in redex' ConcurrentMap is not safe. Not what I'm used to from concurrent_unordered_map. 🙂

Since I renamed the method, I now moved it away from those two get mehods and gave it a comment of its own.

Comment on lines 10 to 15
public class StringBuilder {

public StringBuilder append(String str) {
return this;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could all changes to library_classes be in a different PR? This causes a lot of changes in integration tests that are not related to the PR, making it hard to review.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a problem, here is the extracted PR:
#140

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, could you rebase this on top of master now that #140 was merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@pkesseli pkesseli requested a review from arthaud August 21, 2023 15:26
facebook-github-bot pushed a commit that referenced this pull request Aug 21, 2023
Summary:
Add library_classes to later test additional literal sources.

Extracted from:
#139

Pull Request resolved: #140

Reviewed By: anwesht

Differential Revision: D48519654

Pulled By: arthaud

fbshipit-source-id: 6f1b4795ae932f257009978bebd8f8bb35b59441
Match static regex for new source in `analyze_const_string`.
}

LiteralModel::LiteralModel(const LiteralModel& other)
: regex_(other.regex_ ? std::optional(other.regex_->pattern()) : std::nullopt), sources_(other.sources_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is because re2::RE2 is not copyable? That's annoying.

Copy link
Contributor Author

@pkesseli pkesseli Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct:

  // Not copyable.
  // RE2 objects are expensive. You should probably use std::shared_ptr<RE2>
  // instead. If you really must copy, RE2(first.pattern(), first.options())
  // effectively does so: it produces a second object that mimics the first.
  RE2(const RE2&) = delete;

One alternative would indeed be to use a shared_ptr instead of an optional, RE2 objects are safe for concurrent use after all. Happy to apply that change if preferred.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine as is :)

Comment on lines 55 to 57
std::string LiteralModel::pattern() const {
return regex_ ? regex_->pattern() : "(no pattern)";
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return a std::optional instead? Not a fan of implicitly returning "no pattern".

}

void LiteralModel::join_with(const LiteralModel& other) {
if (this != &other) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: We generally prefer the early-return pattern, instead of nesting if blocks.

Suggested change
if (this != &other) {
if (this == &other) {
return;
}

Json::Value LiteralModel::to_json(const ExportOriginsMode export_origins_mode) const {
auto value = Json::Value(Json::objectValue);

value["pattern"] = pattern();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check beforehand if the pattern is empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Omitted now if the pattern is empty.


Json::Value LiteralModel::to_json(Context& context) const {
auto value = to_json(context.options->export_origins_mode());
value["position"] = context.positions->unknown()->to_json();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine, let's keep it for consistency.

@@ -176,13 +200,25 @@ void Registry::join_with(const FieldModel& field_model) {
}
}

void Registry::join_with(const LiteralModel& literal_model) {
auto iterator = literal_models_.find(literal_model.pattern());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe assert that the pattern is non-empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now using mt_assert:

const std::optional<std::string> pattern{literal_model.pattern()};
mt_assert(pattern);

Pascal Kesseli added 6 commits August 23, 2023 12:32
Allow configuring regex sources using JSON config.
Add Google API key test for literal model.
Do not log tainting a register if model is empty.
Allow unused `description` property in literal model.
Add literal models doc and examples to `models.md`.
Expose models, field models and literal models in options and shim.
@facebook-github-bot
Copy link
Contributor

@arthaud has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@arthaud merged this pull request in c266c18.

@pkesseli pkesseli deleted the pkesseli/literal-sources branch October 4, 2023 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants