Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAST(varchar as <integral type>) doesn't trim leading and trailing whitespaces, but fails #21315

Closed
gggrace14 opened this issue Nov 6, 2023 · 11 comments · Fixed by #22284
Closed
Labels

Comments

@gggrace14
Copy link
Contributor

gggrace14 commented Nov 6, 2023

Current behavior of casting varchar to integral types is to throw exception if the varchar is valid but contains leading or trailing whitespaces. For example,

SELECT CAST('100 ' AS BIGINT)

throws the following exception

com.facebook.presto.spi.PrestoException: Cannot cast '100 ' to BIGINT
	at com.facebook.presto.type.VarcharOperators.castToBigint(VarcharOperators.java:185)

This is different from the behavior described in ANSI SQL

... The declared type of the result of the <cast specification> is TD... let SD be the declared type of the <value expression>... let SV be its value...
If SD is character string, then SV is replaced by SV with any leading or trailing <space>s removed.
Case:
i) If SV does not comprise a <signed numeric literal>, then an exception condition is raised.
ii) Otherwise, let LT be that <signed numeric literal>. The <cast specification> is equivalent to CAST ( LT AS TD )

Accordingly, TRY(CAST(varchar AS ) and TRY_CAST(varchar AS ) return NULL rather than the actual valid integral number.

Should Presto change the above CAST behavior to accept and trim leading and trailing whitespaces and be compliant with ANSI SQL?

The exception is thrown because the implementation of casting varchar to integral types directly delegates to Java, ex.

@LiteralParameters("x")
@ScalarOperator(CAST)
@SqlType(StandardTypes.BIGINT)
public static long castToBigint(@SqlType("varchar(x)") Slice slice)
{
try {
return Long.parseLong(slice.toStringUtf8());
}
catch (Exception e) {
throw new PrestoException(INVALID_CAST_ARGUMENT, format("Cannot cast '%s' to BIGINT", slice.toStringUtf8()));
}
}

And Java does not accept whitespaces or any character other than decimal digits and sign character.
https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#parseLong-java.lang.String-

Your Environment

  • Presto version used:

Expected Behavior

Casting of varchar trims leading and trailing whitespaces.

Current Behavior

Casting of varchar to integral types throws exception if the varchar contains leading or trailing whitespaces

Possible Solution

Casting of varchar trims leading and trailing whitespaces before delegating to Java Long.parseLong().

Steps to Reproduce

Screenshots (if appropriate)

Context

Throwing exception makes Presto have a different semantics of above CAST from ANSI SQL and other mainstream engines or libraries that compliant with ANSI SQL, ex., MySQL, folly, etc.

@gggrace14 gggrace14 added the bug label Nov 6, 2023
@gggrace14
Copy link
Contributor Author

gggrace14 commented Nov 6, 2023

cc @mbasmanova @kaikalur

@mbasmanova
Copy link
Contributor

@mbasmanova
Copy link
Contributor

CC: @amitkdutta

@mbasmanova
Copy link
Contributor

Velox currently allows leading and trailing whitespace in cast(varchar as ). We are going to keep this behavior. It would be nice to change Presto to match as well.

@mbasmanova mbasmanova changed the title Casting varchar to integral types does not trim leading and trailing whitespaces but throws exception CAST(varchar as <integral type>) doesn't trim leading and trailing whitespaces, but fails Nov 6, 2023
@agrawaldevesh
Copy link
Contributor

Interesting! I am just curious what do other systems like MySQL, duckdb etc do here ? This should be easily fixable in presto Java without breaking any client assumptions.

@spershin
Copy link
Contributor

spershin commented Nov 6, 2023

I believe this case should be easily fixable by making Presto's behavior compiling with ANSI SQL.

I doubt any client/user would expect and depend on a throw for varchar -> number conversion because of some spaces (although this is not impossible, plenty of weird cases out there).

ANSI SQL says it very clear, thanks @gggrace14 for digging this out!

DubkDB should ignore whitespaces too - Velox uses it in the unit tests.

@tdcmeehan
Copy link
Contributor

tdcmeehan commented Nov 6, 2023

Related to #21001 (another blank padding bug)

This works fine in Postgres. My guess is this is because in Postgres, it's not a conversion from varchar to bigint, but char(3) to bigint. The extra space is treated as blank padding, Postgres is consistent in ignoring the blank padding for comparisons, and also apparently when converting between types.

Even this query succeeds, and I suspect it's because there's an implicit coercion from varchar to char for the sake of the cast: SELECT CAST (CAST('100 ' AS VARCHAR(4)) AS INTEGER);

In Presto, it's actually a conversion from varchar to bigint. This causes an issue here, but also in the issue above. For varchar, I don't think blank padding semantics apply.

@mbasmanova
Copy link
Contributor

FYI, there will be result differences between Presto and Velox for try_cast, e.g. NULL in Presto vs. 12 in Velox:

presto:di> select try_cast(' 12' as integer);
 _col0
-------
 NULL
(1 row)

@gggrace14
Copy link
Contributor Author

FYI, there will be result differences between Presto and Velox for try_cast, e.g. NULL in Presto vs. 12 in Velox:

presto:di> select try_cast(' 12' as integer);
 _col0
-------
 NULL
(1 row)

Yes, this is true. I should've also included in the description.

So a common case we can predict is the result will have more records if the user has filter like try_cast() IS NOT NULL.

If we want to proceed with the alignment, we should make it into a major Presto release.

@spershin
Copy link
Contributor

Note, that Presto's behavior is also inconsistent.
It ignores whitespaces for floating point numbers conversion:

SELECT CAST('  6.0 ' AS REAL)
 _col0
-------
   6.0

But throws an error for integers.
Only because Java itself has this behavior, rather than the conscious choice of a query engine.
Presto Java uses Long.parseLong, which behaves like this:

Parses the string argument as a signed decimal long. The characters in the string must all be decimal digits, except that the first character may be an ASCII minus sign '-' (\u002D') to indicate a negative value or an ASCII plus sign '+' ('\u002B') to indicate a positive value. The resulting long value is returned, exactly as if the argument and the radix 10 were given as arguments to the parseLong(java.lang.String, int) method.
Note that neither the character L ('\u004C') nor l ('\u006C') is permitted to appear at the end of the string as a type indicator, as would be permitted in Java programming language source code.

@spershin
Copy link
Contributor

Submitting #22284 to change the behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants