-
Notifications
You must be signed in to change notification settings - Fork 2
feat: add rule parser for advertiser hosts #304
Conversation
I think using regex here is the wrong tool. I've heard a lot of stories of bugs because URLs were parsed too permissively. The biggest issue is the special behavior of An exploitable version of it would be something like a rule I think something like globs would be more fitting, or to define that any subdomain of a matched domain is valid. I would also like to see incoming URLs parsed as URLs instead of treated as unstructured strings, but that might be out of scope for this change. If we do keep regex, I think it would be beneficial to consider using |
Yeah, not really going to complain. Regex is really powerful, a resource hog, and made out of pure explodium. I think we're probably NOT going to land this PR. I created it mostly as a draft to see what all would be involved in actually creating it, and realized I had accidentally built a patch. There's a working document that I'm using with our customer to create a simpler string based rule set that should solve their requirements without invoking the great satan that regex can be. |
This will allow for simple pattern matches against `advertiser_hosts` (See #303). This does not use regular expressions, but rather a very primative string match function that matches the final portion of the host and (optionally) the leading portion of the path. e.g. ```json "advertiser_hosts": ["acme.biz/ca", "acme.biz/uk", "acme.tv"] ``` Closes #303
Based on conversations and requirement clarifications
src/adm/filter.rs
Outdated
filter.clone() | ||
}; | ||
// make sure to do the same s/./*/ to the filter, because you never know... | ||
let filter_parts = filter.splitn(2, '/').collect::<Vec<&str>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps add a guard here to check the vector length? Otherwise, filter_parts[1]
might be problematic on malformed filters.
Also, I don't quite follow the special handling on ".". Can you show me some examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w/ the guaranteed trailing '/'
above splitn would always give us 2 elements. Maybe worth a comment though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're forcing the string to add a /
at the end, so there should always be at least 2 parts, even if one of them is an empty string.
I have a test case that illustrates the reason I convert .
to *
in the path, but basically it's to avoid situations like
https://evil.com/good.com/etc
from passing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. But this doesn't fully convince me though. As we hold control over the filters, we will make sure of their validity. So Contile will drop those bad hosts even if they manage to sneak into the partner's API.
If we still want to have this extra check on filters, I'd recommend doing this dot-escaping only once perhaps during the startup. Doing this over and over again for each request seems wasteful and unnecessary to me.
Does that make sense?
src/error.rs
Outdated
@@ -198,6 +198,12 @@ impl From<HandlerError> for HttpResponse { | |||
} | |||
} | |||
|
|||
impl From<regex::Error> for HandlerError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks leftover from the last version, might as well kill it
Almost forgot a couple more things we're missing per the new spec:
I think we need #1 to go along w/ this PR. #2 should be a few lines of easy code |
e.g allow `example.com/ca` & `example.co.uk`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ with one more comment.
Thanks!
Co-authored-by: Nan Jiang <njiang028@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
Hi @jrconlin! 👋🏻 Would you say this new feature is sufficiently covered by unit tests? Do you think we can add at least one integration test that uses settings which rely on this feature? |
This will allow for simple pattern matches against
advertiser_hosts
(See#303). This does not use regular expressions, but rather a very
primative string match function that matches the final portion of the
host and (optionally) the leading portion of the path.
e.g.
Closes #303