Skip to content
This repository has been archived by the owner on Jul 26, 2024. It is now read-only.

feat: add rule parser for advertiser hosts #304

Merged
merged 13 commits into from
Oct 28, 2021
Merged

feat: add rule parser for advertiser hosts #304

merged 13 commits into from
Oct 28, 2021

Conversation

jrconlin
Copy link
Member

@jrconlin jrconlin commented Oct 15, 2021

This will allow for simple pattern matches against advertiser_hosts (See
#303). This does not use regular expressions, but rather a very
primative string match function that matches the final portion of the
host and (optionally) the leading portion of the path.
e.g.

    "advertiser_hosts": ["acme.biz/ca", "acme.biz/uk", "acme.tv"]

Closes #303

@jrconlin jrconlin requested a review from a team October 15, 2021 15:21
@mythmon
Copy link
Contributor

mythmon commented Oct 15, 2021

I think using regex here is the wrong tool. I've heard a lot of stories of bugs because URLs were parsed too permissively. The biggest issue is the special behavior of . in regex, which can cause bugs. I think that making it so these settings are sometimes regex and sometimes fixed strings complicates things. I see you've made some efforts to fix this, but they seem incomplete compared to a proper regex escaping scheme.

An exploitable version of it would be something like a rule r#foo.example.com. An attacker could register foo-example.com, and that would still pass the filter.

I think something like globs would be more fitting, or to define that any subdomain of a matched domain is valid. I would also like to see incoming URLs parsed as URLs instead of treated as unstructured strings, but that might be out of scope for this change.

If we do keep regex, I think it would be beneficial to consider using RegexSet from the same crate, which can more efficiently match against many regexes at once.

@jrconlin
Copy link
Member Author

Yeah, not really going to complain. Regex is really powerful, a resource hog, and made out of pure explodium. I think we're probably NOT going to land this PR. I created it mostly as a draft to see what all would be involved in actually creating it, and realized I had accidentally built a patch.

There's a working document that I'm using with our customer to create a simpler string based rule set that should solve their requirements without invoking the great satan that regex can be.

This will allow for simple pattern matches against `advertiser_hosts` (See
 #303). This does not use regular expressions, but rather a very
primative string match function that matches the final portion of the
host and (optionally) the leading portion of the path.
e.g.
```json
    "advertiser_hosts": ["acme.biz/ca", "acme.biz/uk", "acme.tv"]
```

Closes #303
Based on conversations and requirement clarifications
@jrconlin jrconlin marked this pull request as ready for review October 20, 2021 22:13
@jrconlin jrconlin changed the title feat: add regex rule parser for advertiser hosts feat: add rule parser for advertiser hosts Oct 20, 2021
src/adm/filter.rs Outdated Show resolved Hide resolved
src/adm/filter.rs Outdated Show resolved Hide resolved
filter.clone()
};
// make sure to do the same s/./*/ to the filter, because you never know...
let filter_parts = filter.splitn(2, '/').collect::<Vec<&str>>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a guard here to check the vector length? Otherwise, filter_parts[1] might be problematic on malformed filters.

Also, I don't quite follow the special handling on ".". Can you show me some examples?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w/ the guaranteed trailing '/' above splitn would always give us 2 elements. Maybe worth a comment though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're forcing the string to add a / at the end, so there should always be at least 2 parts, even if one of them is an empty string.

I have a test case that illustrates the reason I convert . to * in the path, but basically it's to avoid situations like

https://evil.com/good.com/etc from passing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. But this doesn't fully convince me though. As we hold control over the filters, we will make sure of their validity. So Contile will drop those bad hosts even if they manage to sneak into the partner's API.

If we still want to have this extra check on filters, I'd recommend doing this dot-escaping only once perhaps during the startup. Doing this over and over again for each request seems wasteful and unnecessary to me.

Does that make sense?

src/adm/filter.rs Outdated Show resolved Hide resolved
src/error.rs Outdated
@@ -198,6 +198,12 @@ impl From<HandlerError> for HttpResponse {
}
}

impl From<regex::Error> for HandlerError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks leftover from the last version, might as well kill it

src/adm/filter.rs Outdated Show resolved Hide resolved
@pjenvey
Copy link
Member

pjenvey commented Oct 26, 2021

Almost forgot a couple more things we're missing per the new spec:

  1. It would be an error to provide a filter that contains a bare host and one that contains a path (e.g. example.com example.com/ca)
  2. Contile would also enforce `https` as the only valid scheme.

I think we need #1 to go along w/ this PR. #2 should be a few lines of easy code

src/adm/filter.rs Outdated Show resolved Hide resolved
ncloudioj
ncloudioj previously approved these changes Oct 27, 2021
Copy link
Collaborator

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r+ with one more comment.

Thanks!

Co-authored-by: Nan Jiang <njiang028@gmail.com>
@ncloudioj ncloudioj self-requested a review October 28, 2021 18:10
Copy link
Collaborator

@ncloudioj ncloudioj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@jrconlin jrconlin merged commit 0a28722 into main Oct 28, 2021
@jrconlin jrconlin deleted the 303/prefix branch October 28, 2021 20:47
@hackebrot
Copy link
Member

Hi @jrconlin! 👋🏻 Would you say this new feature is sufficiently covered by unit tests? Do you think we can add at least one integration test that uses settings which rely on this feature?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add subdomain allowlist to Contile
5 participants