Fix url_extract_* functions for malformed URL's #7668

Closed

kgpai wants to merge 5 commits into main from fix_uri_malformed

Contributor

kgpai commented Nov 21, 2023

Fixes #7038

Malformed URLs that contain invalid escape sequences (%xx) used to throw in Velox, but not in Presto.

Also, for absolute URIs, url_extract_path used to return NULL when it should return the path, e.g., url_extract_path('foo') should return 'foo'. Fix this by making the scheme/authority/path regex compliant with the URI RFC (https://www.rfc-editor.org/rfc/rfc3986#appendix-A).

Some examples of the new changes:

> SELECT url_extract_path('https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%'); 

Before: throws exception.
After: returns NULL.

> SELECT url_extract_path('foo');

Before: returns NULL.
After: returns 'foo'.
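
For readers who want to see what kind of check separates the two cases above, here is a minimal sketch. It assumes, based on the description and the doc text in this PR, that a URL is rejected when it contains ASCII whitespace or a '%' that is not followed by two hexadecimal digits. The name isWellFormedUrl is hypothetical and this is not the code in URLFunctions.h.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the Velox implementation.
// Returns false if the input contains ASCII whitespace or a '%' that is
// not followed by two hexadecimal digits; returns true otherwise.
bool isWellFormedUrl(std::string_view url) {
  for (std::size_t i = 0; i < url.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(url[i]);

    // ASCII whitespace: space, \t, \n, \v, \f, \r.
    if (c == ' ' || (c >= '\t' && c <= '\r')) {
      return false;
    }

    if (c == '%') {
      // A valid escape needs two more characters, both hex digits.
      if (i + 2 >= url.size()) {
        return false;
      }
      const unsigned char hi = static_cast<unsigned char>(url[i + 1]);
      const unsigned char lo = static_cast<unsigned char>(url[i + 2]);
      if (!std::isxdigit(hi) || !std::isxdigit(lo)) {
        return false;
      }
      i += 2; // Skip the two digits that were just validated.
    }
  }
  return true;
}
```

Under this sketch, a caller would return NULL whenever isWellFormedUrl is false, which matches the "After" behavior shown for the '%%UCUrlCompartir%%' example.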

kgpai added 4 commits

November 7, 2023 07:02


          Save work.

330dd3c


          Merge branch 'main' of github.com:facebookincubator/velox into fix_ur…

f489e36

…i_malformed


          Save work.

736a1e3


          Fix url_extract_path bugs.

72e2867

netlify bot commented Nov 21, 2023

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`53a64fc`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6564e17131e47f0008727d79

facebook-github-bot added the CLA Signed label

kgpai requested review from mbasmanova and isadikov

November 21, 2023 02:15

Contributor

facebook-github-bot commented Nov 21, 2023

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mbasmanova reviewed

View reviewed changes

Contributor

mbasmanova left a comment

@kgpai Thank you for the fix. Would you update documentation to clarify semantics of these functions?

velox/functions/prestosql/URLFunctions.h

-                    "([^?#]*)" // authority and path
-                    "(?:\\?([^#]*))?" // ?query
-                    "(?:#(.*))?"); // #fragment
+                    "^(([^:\\/?#]+):)?" // scheme:

Contributor

mbasmanova Nov 21, 2023

These regexes are not very readable. Maybe add comments to explain each of them?

velox/functions/prestosql/URLFunctions.h Outdated

                 output.resize(outIndex);
               }
+              template <typename TInString>
+              FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) {

Contributor

mbasmanova Nov 21, 2023

Would you add a comment to explain what checks are performed here?

naming: start with a verb (checkUrlUnescape) or isValidUrl

velox/functions/prestosql/URLFunctions.h Outdated

                 output.resize(outIndex);
               }
+              template <typename TInString>

Contributor

mbasmanova Nov 21, 2023

Is this template parameter needed? Is it not always a StringView?

velox/functions/prestosql/URLFunctions.h Outdated

                 output.resize(outIndex);
               }
+              template <typename TInString>
+              FOLLY_ALWAYS_INLINE bool urlUnescapeCheck(const TInString& input) {

Contributor

mbasmanova Nov 21, 2023

Remove FOLLY_ALWAYS_INLINE. This function is too long.

velox/functions/prestosql/URLFunctions.h Outdated

@@ @@ -143,11 +173,16 @@ struct UrlExtractProtocolFunction { @@
                 FOLLY_ALWAYS_INLINE bool call(
                     out_type<Varchar>& result,
                     const arg_type<Varchar>& url) {
+                  auto validUrl = urlUnescapeCheck(url);

Contributor

mbasmanova Nov 21, 2023

nit: validUrl variable is not needed

if(!urlUnescapeCheck(url))

velox/functions/prestosql/URLFunctions.h Outdated

@@ @@ -201,7 +246,8 @@ struct UrlExtractHostFunction { @@
                   if (matchAuthorityAndPath(
                           match, authAndPathMatch, authorityMatch, hasAuthority) &&
                       hasAuthority) {
-                    result.setNoCopy(submatch(authorityMatch, 3));
+                    auto host = submatch(authorityMatch, 3);

Contributor

mbasmanova Nov 21, 2023

nit: host variable is not needed

velox/functions/prestosql/URLFunctions.h Outdated

		}

		StringView escapedPath;

Contributor

mbasmanova Nov 21, 2023

escapedPath variable is not needed

mbasmanova reviewed

View reviewed changes

Contributor

mbasmanova left a comment

@kgpai Looks good, but let's update documentation in https://facebookincubator.github.io/velox/functions/presto/url.html to clarify what constitutes a valid URL and what happens if the URL is not valid. Let's include some examples to help readers understand.

velox/functions/prestosql/URLFunctions.h

               template <typename TInString>
               bool parse(const TInString& rawUrl, boost::cmatch& match) {
+                /// This regex is taken from RFC - 3986.

Contributor

mbasmanova Nov 22, 2023

Nice comment. Thanks.
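
As an illustration of the regex discussed in this thread, the sketch below applies the generic URI regex from RFC 3986, Appendix B, with one comment per component. It is an assumption-laden example: it uses std::regex rather than the boost::regex used in URLFunctions.h, and UriParts and splitUri are hypothetical names that do not appear in the PR.

```cpp
#include <regex>
#include <string>

// Illustrative only: UriParts and splitUri are hypothetical names.
struct UriParts {
  std::string scheme;
  std::string authority;
  std::string path;
  std::string query;
  std::string fragment;
};

// Splits `input` into components using the regex from RFC 3986, Appendix B.
// Components that are absent from the input are left empty.
bool splitUri(const std::string& input, UriParts& out) {
  static const std::regex kUriRegex(
      R"(^(([^:/?#]+):)?)" // group 2: scheme
      R"((//([^/?#]*))?)"  // group 4: authority
      R"(([^?#]*))"        // group 5: path
      R"((\?([^#]*))?)"    // group 7: query
      R"((#(.*))?)");      // group 9: fragment

  std::smatch match;
  if (!std::regex_match(input, match, kUriRegex)) {
    return false;
  }
  out.scheme = match[2].str();
  out.authority = match[4].str();
  out.path = match[5].str();
  out.query = match[7].str();
  out.fragment = match[9].str();
  return true;
}
```

With this regex, an input such as 'foo' has no scheme or authority, so the whole string lands in the path group. That is the behavior the PR description expects from url_extract_path('foo').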

velox/functions/prestosql/URLFunctions.h Outdated

                 output.resize(outIndex);
               }
+              /// Performs basic initial validation of the URI.

Contributor

mbasmanova Nov 22, 2023

nit: 'basic initial' -> pick one, either basic or initial, no need to have both

velox/functions/prestosql/URLFunctions.h

                 output.resize(outIndex);
               }
+              /// Performs basic initial validation of the URI.
+              /// Checks if the URI contains ascii whitespaces or

Contributor

mbasmanova Nov 22, 2023

I wonder if this check is complete? Are all ASCII characters other than whitespace allowed?

Contributor Author

kgpai Nov 22, 2023

Yes, ASCII characters other than whitespace are allowed.

Contributor

mbasmanova commented Nov 22, 2023

@kgpai Would you update PR description with some examples of before and after to help readers understand the impact of this change?

kgpai force-pushed the fix_uri_malformed branch 2 times, most recently from 218ae6c to 3c40c63 Compare

November 22, 2023 18:23

Contributor

facebook-github-bot commented Nov 22, 2023

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mbasmanova reviewed

View reviewed changes

Contributor

mbasmanova left a comment

Thank you for extending the docs. Some comments.

velox/docs/functions/presto/url.rst Outdated

+              Invalid URI's
+              -------------
+              Well formed URI's should not contain ascii whitespace and percent-encoded URI's should be followed by two hexadecimal

Contributor

mbasmanova Nov 27, 2023

Add empty line.

ascii -> ASCII

Consider splitting this sentence into two for readability.

Well formed URI's should not contain ascii whitespace. Percent-encoded URI's should be followed by two hexadecimal ...

Is "Percent-encoded URI's" a well-known term? Consider adding a link to a definition in Wikipedia or some other good source.

Contributor Author

kgpai Nov 27, 2023

Will do. "Percent-encoded" is how it's referred to in the URI RFC; I will reference that.

velox/docs/functions/presto/url.rst

+              .. code-block::
+                  # URI with whitespace
+                  SELECT url_extract_path('foo '); -- NULL (1 row)

Contributor

mbasmanova Nov 27, 2023

What do you want to show with this example? Are you trying to say that all URL functions return NULL when input URL is invalid? If so, please, state this directly.

kgpai force-pushed the fix_uri_malformed branch from 3c40c63 to 716455c Compare

November 27, 2023 18:07


          Fix url_extract_path bugs.

53a64fc

kgpai force-pushed the fix_uri_malformed branch from 716455c to 53a64fc Compare

November 27, 2023 18:35

mbasmanova approved these changes

View reviewed changes

Contributor

mbasmanova left a comment

Thanks.

Contributor

facebook-github-bot commented Nov 27, 2023

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot closed this in

70320dd

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Nov 27, 2023

@kgpai merged this pull request in 70320dd.

conbench-facebook bot commented Nov 27, 2023

Conbench analyzed the 1 benchmark run on commit 70320ddb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

kgpai mentioned this pull request

Enhance Presto url_extract_* functions to use RE2 instead of boost::regexp #7885

Open

Labels

CLA Signed Merged