Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix url_extract_* functions for malformed URL's #7668

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 37 additions & 3 deletions velox/docs/functions/presto/url.rst
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,10 @@
URL Functions
=============

Extraction Functions
--------------------
Introduction
------------

The URL extraction functions extract components from HTTP URLs (or any valid URIs conforming to `RFC 2396 <https://tools.ietf.org/html/rfc2396.html>`_). The following syntax is supported:
The URL extraction functions extract components from HTTP URLs (or any valid URIs conforming to `RFC 3986 <https://tools.ietf.org/html/rfc3986.html>`_). The following syntax is supported:

.. code-block:: bash

Expand All @@ -14,6 +14,40 @@ The URL extraction functions extract components from HTTP URLs (or any valid URI

The extracted components do not contain URI syntax separators such as ``:`` , ``?`` and ``#``.

Consider for example the below URI:

.. code-block::

http://www.ics.uci.edu/pub/ietf/uri/?k1=v1#Related

scheme = http
authority = www.ics.uci.edu
path = /pub/ietf/uri/
query = k1=v1
fragment = Related


Invalid URI's
-------------

Well formed URI's should not contain ascii whitespace. `Percent-encoded URI's <https://www.rfc-editor.org/rfc/rfc3986#section-2.1>`_ should be followed by two hexadecimal
digits after the percent character "%". All the url extract functions will return null when passed an invalid uri.

.. code-block::

# Examples of url functions with Invalid URI's.

# Invalid URI due to whitespace
SELECT url_extract_path('foo '); -- NULL (1 row)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you want to show with this example? Are you trying to say that all URL functions return NULL when input URL is invalid? If so, please, state this directly.

SELECT url_extract_host('http://www.foo.com '); -- NULL (1 row)

# Invalid URI due to improper escaping of '%'
SELECT url_extract_path('https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%'); -- NULL (1 row)
SELECT url_extract_host('https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%'); -- NULL (1 row)

Extraction Functions
--------------------

.. function:: url_extract_fragment(url) -> varchar

Returns the fragment identifier from ``url``.
Expand Down
2 changes: 1 addition & 1 deletion velox/functions/prestosql/URLFunctions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ bool matchAuthorityAndPath(
boost::cmatch& authorityMatch,
bool& hasAuthority) {
static const boost::regex kAuthorityAndPathRegex("//([^/]*)(/.*)?");
auto authorityAndPath = submatch(urlMatch, 2);
auto authorityAndPath = submatch(urlMatch, 3);
if (!boost::regex_match(
authorityAndPath.begin(),
authorityAndPath.end(),
Expand Down
117 changes: 94 additions & 23 deletions velox/functions/prestosql/URLFunctions.h
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
#include <boost/regex.hpp>
#include <cctype>
#include "velox/functions/Macros.h"
#include "velox/functions/lib/string/StringImpl.h"

namespace facebook::velox::functions {

Expand All @@ -29,11 +30,33 @@ FOLLY_ALWAYS_INLINE StringView submatch(const boost::cmatch& match, int idx) {

template <typename TInString>
bool parse(const TInString& rawUrl, boost::cmatch& match) {
/// This regex is taken from RFC - 3986.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment. Thanks.

/// See: https://www.rfc-editor.org/rfc/rfc3986#appendix-B
/// The basic groups are:
/// scheme = $2
/// authority = $4
/// path = $5
/// query = $7
/// fragment = $9
/// For example a URI like below :
/// http://www.ics.uci.edu/pub/ietf/uri/#Related
///
/// results in the following subexpression matches:
///
/// $1 = http:
/// $2 = http
/// $3 = //www.ics.uci.edu
/// $4 = www.ics.uci.edu
/// $5 = /pub/ietf/uri/
/// $6 = <undefined>
/// $7 = <undefined>
/// $8 = #Related
/// $9 = Related
static const boost::regex kUriRegex(
"([a-zA-Z][a-zA-Z0-9+.-]*):" // scheme:
"([^?#]*)" // authority and path
"(?:\\?([^#]*))?" // ?query
"(?:#(.*))?"); // #fragment
"^(([^:\\/?#]+):)?" // scheme:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These regexes are not very readable. Maybe add comments to explain each of them?

"(\\/\\/([^\\/?#]*))?([^?#]*)" // authority and path
"(\\?([^#]*))?" // ?query
"(#(.*))?"); // #fragment

return boost::regex_match(
rawUrl.data(), rawUrl.data() + rawUrl.size(), match, kUriRegex);
Expand Down Expand Up @@ -84,6 +107,37 @@ FOLLY_ALWAYS_INLINE void urlEscape(TOutString& output, const TInString& input) {
output.resize(outIndex);
}

/// Performs initial validation of the URI.
/// Checks if the URI contains ascii whitespaces or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this check is complete? Are all ASCII characters other than whitespace allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes ascii characters other than whitespaces are allowed.

/// unescaped '%' chars.
bool isValidURI(const StringView& input) {
const char* p = input.data();
const char* end = p + input.size();
char buf[3];
buf[2] = '\0';
char* endptr;
for (; p < end; ++p) {
if (stringImpl::isAsciiWhiteSpace(*p)) {
return false;
}

if (*p == '%') {
if (p + 2 < end) {
buf[0] = p[1];
buf[1] = p[2];
strtol(buf, &endptr, 16);
p += 2;
if (endptr != buf + 2) {
return false;
}
} else {
return false;
}
}
}
return true;
}

template <typename TOutString, typename TInString>
FOLLY_ALWAYS_INLINE void urlUnescape(
TOutString& output,
Expand Down Expand Up @@ -143,11 +197,15 @@ struct UrlExtractProtocolFunction {
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
} else {
result.setNoCopy(submatch(match, 1));
result.setNoCopy(submatch(match, 2));
}
return true;
}
Expand All @@ -166,11 +224,15 @@ struct UrlExtractFragmentFunction {
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
} else {
result.setNoCopy(submatch(match, 4));
result.setNoCopy(submatch(match, 9));
}
return true;
}
Expand All @@ -189,6 +251,10 @@ struct UrlExtractHostFunction {
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
Expand All @@ -214,6 +280,9 @@ struct UrlExtractPortFunction {
VELOX_DEFINE_FUNCTION_TYPES(T);

FOLLY_ALWAYS_INLINE bool call(int64_t& result, const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
return false;
Expand Down Expand Up @@ -244,29 +313,23 @@ struct UrlExtractPathFunction {

// Input is always ASCII, but result may or may not be ASCII.

FOLLY_ALWAYS_INLINE void call(
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}

boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
return;
return false;
}

boost::cmatch authAndPathMatch;
boost::cmatch authorityMatch;
bool hasAuthority;
urlUnescape(result, submatch(match, 5));

if (matchAuthorityAndPath(
match, authAndPathMatch, authorityMatch, hasAuthority)) {
StringView escapedPath;
if (hasAuthority) {
escapedPath = submatch(authAndPathMatch, 2);
} else {
escapedPath = submatch(match, 2);
}
urlUnescape(result, escapedPath);
}
return true;
}
};

Expand All @@ -283,13 +346,17 @@ struct UrlExtractQueryFunction {
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<Varchar>& url) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
return true;
}

auto query = submatch(match, 3);
auto query = submatch(match, 7);
result.setNoCopy(query);
return true;
}
Expand All @@ -309,13 +376,17 @@ struct UrlExtractParameterFunction {
out_type<Varchar>& result,
const arg_type<Varchar>& url,
const arg_type<Varchar>& param) {
if (!isValidURI(url)) {
result.setEmpty();
return false;
}
boost::cmatch match;
if (!parse(url, match)) {
result.setEmpty();
return false;
}

auto query = submatch(match, 3);
auto query = submatch(match, 7);
if (!query.empty()) {
// Parse query string.
static const boost::regex kQueryParamRegex(
Expand Down
40 changes: 32 additions & 8 deletions velox/functions/prestosql/tests/URLFunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -23,11 +23,11 @@ class URLFunctionsTest : public functions::test::FunctionBaseTest {
protected:
void validate(
const std::string& url,
const std::string& expectedProtocol,
const std::string& expectedHost,
const std::string& expectedPath,
const std::string& expectedFragment,
const std::string& expectedQuery,
const std::optional<std::string>& expectedProtocol,
const std::optional<std::string>& expectedHost,
const std::optional<std::string>& expectedPath,
const std::optional<std::string>& expectedFragment,
const std::optional<std::string>& expectedQuery,
const std::optional<int32_t> expectedPort) {
const auto extractFn = [&](const std::string& fn,
const std::optional<std::string>& a) {
Expand All @@ -40,7 +40,7 @@ class URLFunctionsTest : public functions::test::FunctionBaseTest {
};

EXPECT_EQ(extractFn("protocol", url), expectedProtocol);
EXPECT_EQ(extractFn("host", url).value(), expectedHost);
EXPECT_EQ(extractFn("host", url), expectedHost);
EXPECT_EQ(extractFn("path", url), expectedPath);
EXPECT_EQ(extractFn("fragment", url), expectedFragment);
EXPECT_EQ(extractFn("query", url), expectedQuery);
Expand Down Expand Up @@ -97,18 +97,42 @@ TEST_F(URLFunctionsTest, validateURL) {
"",
"",
std::nullopt);
validate("foo", "", "", "", "", "", std::nullopt);
validate(
"https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%",
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt);
validate("foo", "", "", "foo", "", "", std::nullopt);
validate(
"foo ",
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt,
std::nullopt);
}

TEST_F(URLFunctionsTest, extractPath) {
const auto extractPath = [&](const std::optional<std::string>& url) {
return evaluateOnce<std::string>("url_extract_path(c0)", url);
};

ASSERT_EQ(
EXPECT_EQ(
"/media/set/Books and Magazines.php",
extractPath(
"https://www.cnn.com/media/set/Books%20and%20Magazines.php?foo=bar"));

EXPECT_EQ(
"java-net@java.sun.com", extractPath("mailto:java-net@java.sun.com"));
EXPECT_EQ(
std::nullopt,
extractPath("https://www.ucu.edu.uy/agenda/evento/%%UCUrlCompartir%%"));
EXPECT_EQ("foo", extractPath("foo"));
EXPECT_EQ(std::nullopt, extractPath("BAD URL!"));
}

TEST_F(URLFunctionsTest, extractParameter) {
Expand Down