Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix url_extract_* functions for malformed URL's #7668

Closed
wants to merge 5 commits into from
Closed

Conversation

kgpai
Copy link
Contributor

@kgpai kgpai commented Nov 21, 2023

Fixes #7038

Malformed URL's which contain invalid escape sequences (%xx) used to throw in Velox, but not in Presto.

Also, for absolute URI's, url_extract_path used to return NULL when it should return the path, e.g url_extract_path('foo') should return 'foo'. Fix this by making the scheme/authority/path regex to be compliant with the URI RFC (https://www.rfc-editor.org/rfc/rfc3986#appendix-A).

Some examples of the new changes:

> SELECT url_extract_path('https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%'); 

Before: throws exception.
After: returns NULL.

> SELECT url_extract_path('foo');

Before: returns NULL.
After: returns 'foo'.

Copy link

netlify bot commented Nov 21, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 53a64fc
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6564e17131e47f0008727d79

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 21, 2023
@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgpai Thank you for the fix. Would you update documentation to clarify semantics of these functions?

"([^?#]*)" // authority and path
"(?:\\?([^#]*))?" // ?query
"(?:#(.*))?"); // #fragment
"^(([^:\\/?#]+):)?" // scheme:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These regexes are not very readable. Maybe add comments to explain each of them?

@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

template <typename TInString>
FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a comment to explain what checks are performed here?

naming: start with a verb (checkUrlUnescape) or isValidUrl

@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

template <typename TInString>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this template parameter needed? Is not not always a StringView?

@@ -84,6 +85,35 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

template <typename TInString>
FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove FOLLY_ALWAYS_INLINE. This function is too long.

@@ -143,11 +173,16 @@ struct UrlExtractProtocolFunction {
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
auto validUrl = urlUnescapeCheck(url);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: validUrl variable is not needed

if(!urlUnescapeCheck(url))

@@ -201,7 +246,8 @@ struct UrlExtractHostFunction {
if (matchAuthorityAndPath(
match, authAndPathMatch, authorityMatch, hasAuthority) &&
hasAuthority) {
result.setNoCopy(submatch(authorityMatch, 3));
auto host = submatch(authorityMatch, 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: host variable is not needed

}

StringView escapedPath;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

escapedPath variable is not needed

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kgpai Looks good, but let's update documentation in https://facebookincubator.github.io/velox/functions/presto/url.html to clarify what constitutes a valid URL and what happens if URL is not valid. Let's include some examples to helps readers understand.

@@ -29,11 +30,33 @@ FOLLY_ALWAYS_INLINE StringView submatch(const boost::cmatch& match, int idx) {

template <typename TInString>
bool parse(const TInString& rawUrl, boost::cmatch& match) {
/// This regex is taken from RFC - 3986.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment. Thanks.

@@ -84,6 +107,37 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

/// Performs basic initial validation of the URI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'basic initial' -> pick one, either basic or initial, no need to have both

@@ -84,6 +107,37 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

/// Performs basic initial validation of the URI.
/// Checks if the URI contains ascii whitespaces or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this check is complete? Are all ASCII characters other than whitespace allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ascii characters other than whitespaces are allowed.

@mbasmanova
Copy link
Contributor

@kgpai Would you update PR description with some examples of before and after to help readers understand the impact of this change?

@kgpai kgpai force-pushed the fix_uri_malformed branch 2 times, most recently from 218ae6c to 3c40c63 Compare November 22, 2023 18:23
@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for extending the docs. Some comments.


Invalid URI's
-------------
Well formed URI's should not contain ascii whitespace and percent-encoded URI's should be followed by two hexadecimal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add empty line.

ascii -> ASCII

Consider splitting this sentence into two for readability.

Well formed URI's should not contain ascii whitespace. Percent-encoded URI's should be followed by two hexadecimal ...

Is "Percent-encoded URI's" a well-known term? Consider adding a link to a definition in Wikipedia or some other good source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. "Percent-encoded" is how its referred to in the URI RFC, I will reference that.

.. code-block::

# URI with whitespace
SELECT url_extract_path('foo '); -- NULL (1 row)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you want to show with this example? Are you trying to say that all URL functions return NULL when input URL is invalid? If so, please, state this directly.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kgpai merged this pull request in 70320dd.

Copy link

Conbench analyzed the 1 benchmark run on commit 70320ddb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

url_extract_path should return NULL if URI is malformed
3 participants