-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix url_extract_* functions for malformed URL's #7668
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Thank you for the fix. Would you update documentation to clarify semantics of these functions?
"([^?#]*)" // authority and path | ||
"(?:\\?([^#]*))?" // ?query | ||
"(?:#(.*))?"); // #fragment | ||
"^(([^:\\/?#]+):)?" // scheme: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These regexes are not very readable. Maybe add comments to explain each of them?
@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) { | |||
output.resize(outIndex); | |||
} | |||
|
|||
template <typename TInString> | |||
FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add a comment to explain what checks are performed here?
naming: start with a verb (checkUrlUnescape) or isValidUrl
@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) { | |||
output.resize(outIndex); | |||
} | |||
|
|||
template <typename TInString> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this template parameter needed? Is not not always a StringView?
@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) { | |||
output.resize(outIndex); | |||
} | |||
|
|||
template <typename TInString> | |||
FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove FOLLY_ALWAYS_INLINE. This function is too long.
@@ -143,11 +173,16 @@ struct UrlExtractProtocolFunction { | |||
FOLLY_ALWAYS_INLINE bool call( | |||
out_type<Varchar>& result, | |||
const arg_type<Varchar>& url) { | |||
auto validUrl = urlUnescapeCheck(url); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: validUrl variable is not needed
if(!urlUnescapeCheck(url))
@@ -201,7 +246,8 @@ struct UrlExtractHostFunction { | |||
if (matchAuthorityAndPath( | |||
match, authAndPathMatch, authorityMatch, hasAuthority) && | |||
hasAuthority) { | |||
result.setNoCopy(submatch(authorityMatch, 3)); | |||
auto host = submatch(authorityMatch, 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: host variable is not needed
} | ||
|
||
StringView escapedPath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
escapedPath variable is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kgpai Looks good, but let's update documentation in https://facebookincubator.github.io/velox/functions/presto/url.html to clarify what constitutes a valid URL and what happens if URL is not valid. Let's include some examples to helps readers understand.
@@ -29,11 +30,33 @@ FOLLY_ALWAYS_INLINE StringView submatch(const boost::cmatch& match, int idx) { | |||
|
|||
template <typename TInString> | |||
bool parse(const TInString& rawUrl, boost::cmatch& match) { | |||
/// This regex is taken from RFC - 3986. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comment. Thanks.
@@ -84,6 +107,37 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) { | |||
output.resize(outIndex); | |||
} | |||
|
|||
/// Performs basic initial validation of the URI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 'basic initial' -> pick one, either basic or initial, no need to have both
@@ -84,6 +107,37 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) { | |||
output.resize(outIndex); | |||
} | |||
|
|||
/// Performs basic initial validation of the URI. | |||
/// Checks if the URI contains ascii whitespaces or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this check is complete? Are all ASCII characters other than whitespace allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes ascii characters other than whitespaces are allowed.
@kgpai Would you update PR description with some examples of before and after to help readers understand the impact of this change? |
218ae6c
to
3c40c63
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for extending the docs. Some comments.
velox/docs/functions/presto/url.rst
Outdated
|
||
Invalid URI's | ||
------------- | ||
Well formed URI's should not contain ascii whitespace and percent-encoded URI's should be followed by two hexadecimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add empty line.
ascii -> ASCII
Consider splitting this sentence into two for readability.
Well formed URI's should not contain ascii whitespace. Percent-encoded URI's should be followed by two hexadecimal ...
Is "Percent-encoded URI's" a well-known term? Consider adding a link to a definition in Wikipedia or some other good source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. "Percent-encoded" is how its referred to in the URI RFC, I will reference that.
.. code-block:: | ||
|
||
# URI with whitespace | ||
SELECT url_extract_path('foo '); -- NULL (1 row) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you want to show with this example? Are you trying to say that all URL functions return NULL when input URL is invalid? If so, please, state this directly.
3c40c63
to
716455c
Compare
716455c
to
53a64fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Fixes #7038
Malformed URL's which contain invalid escape sequences (%xx) used to throw in Velox, but not in Presto.
Also, for absolute URI's, url_extract_path used to return NULL when it should return the path, e.g url_extract_path('foo') should return 'foo'. Fix this by making the scheme/authority/path regex to be compliant with the URI RFC (https://www.rfc-editor.org/rfc/rfc3986#appendix-A).
Some examples of the new changes: