string to integer conversion with newline #697

SamChill · 2012-04-10T15:06:27Z

In Julia passing a string that contains an integer and ends in a newline causes int() to error(). In other languages if the last character is a newline int() doesn't complain. Example:

julia> int("1\n")
'\n' is not a valid digit (in "1\n")
 in parse_int at string.jl:886
 in int at string.jl:916

I think that "1\n" should be valid and "1\n2" should be invalid. If there is a newline and it isn't the last character in the string, then the operation is ill defined. This is how Python's int works by the way.

The thinking for this behavior comes from the following use case:

n = int(readline(stream))

It is pretty common, at least for me, to write parsers for simple file formats where there is a single number per line. Currently you would need to call chomp() before int().

The text was updated successfully, but these errors were encountered:

StefanKarpinski · 2012-04-10T18:17:11Z

This is probably a good thing from a pragmatic perspective. What about other kinds of trailing whitespace? Should that be an error? Then there's also various combinations of \n and \r...

SamChill · 2012-04-10T18:31:24Z

It appears that float and bool don't care about white space:

julia> float(" 2 \n ")
2.0
julia> bool(" 1 \r\n ")
true

In fact they might be much too carefree:

julia> float("10 \r 2 \n foobar")
10.0

I'm not sure what the desired behavior is, but I'm pretty sure it isn't that!

Regardless of the desired behavior there should be some consistent behavior for converting strings to a given type. I'd vote for stripping all whitespace as whitespace has no meaning for most type conversion and is consistent with some other languages.

StefanKarpinski · 2012-04-10T18:32:59Z

Completely agreed. Int is way too strict (written by use in Julia code) and float is way too liberal (I think we're just using a standard C parsing function there).

carlobaldassi · 2012-04-13T02:09:33Z

As you can see from the automatic message above, I fixed the behaviour for the int case (the easy one). I did not push it in julia master because I wanted to make sure that it makes sense to substitute error("...") with throw(ArgumentError("..."), which is what float(::String) does.
Also, note that I've been trying to reproduce the original error messages via strcat, but this exposed a weird behaviour: if I load the file from a running Julia session, everything works fine. If, however, I try to recompile sys.ji, my memory usage starts to grow steeply and eats up easily several gigabytes of RAM in a few seconds, leaving no other option than killing the compilation. This forced me to use less informative error messages.

About the floating point case, I see 2 ways of achieving the desired behaviour: reimplementing everything in Julia, or just do a preliminary sanity check and delegating the real work to strtod in case of success. The second strategy has the obvious advantage of being much easier (in particular, it deals with over/underflows), and in fact I already did some experiments with it. The cheaper way to do that would be using a regex, like this:

const _jl_float_regex = r"^\s*([+-]?(\d+\.?\d*|\.\d+)(e[+-]?\d+)?|[Nn][Aa][Nn]|[Ii][Nn][Ff])\s*$"
_jl_float_isvalid_regex(s::String) = matches(_jl_float_regex, s)

Now, apart from the fact that regexes cannot be defined within string.jl, which forces to postpone this definition (which is ugly), the main problem which emerges here is that this preparsing step alone is 4 times slower than the call to strtod, meaning that this strategy is going to reduce performance by 5 times. This seems too big a performance loss to be really acceptable.
On the other hand, I have also implemented some ad hoc preparser here (beware, it's ugly), which just goes through the string one character at a time and keeps track of what's allowed and what's not, and while it is faster than the regex it's still twice as slow as strtod, and it's basically doing nothing. Given these premises, the approach of rewriting the whole parser in Julia seems even less appealing.
Of course, you may see some obvious ways to speed up the code I wrote; at the moment, I certainly don't.
So to summarize, sanity in floating point parsing at the moment is going to cost a ~3-fold performance loss with that code.

Two more things I found out while experimenting:

strtod does not accept denormal numbers as inputs. This introduces a discrepancy with the Julia parser.
apparently, bool(::String) is not just very liberal: it alway returns true

JeffBezanson · 2012-04-13T02:53:13Z

strtod already does error checking, we just need to use its output better. For denormals it sets errno to ERANGE, so we could really just ignore errno. Unfortunately it doesn't distinguish denormals from completely out-of-range numbers like 1e-400 and just gives 0 in that case.

We can use the end pointer strtod returns to see if there's junk after the converted number.

We have bool(Any)=true, which should probably just be deleted.

carlobaldassi · 2012-04-13T03:54:43Z

I implemented the first part here and it seems to work fine. I'm assuming that _jl_strtod will only get well-behaved strings and using strlen.

About denormals: from what you write it seems we could check if errono == ERANGE and the output is non-zero and small. In the manpage I don't find any mention of denormals though, just ±HUGE_VAL being returned in case of overflow and 0 in case of underflow. Needs to be investigated.

JeffBezanson · 2012-04-13T04:06:43Z

Here's my implementation:

DLLEXPORT int jl_strtod(char *str, double *out)
{
    char *p;
    errno = 0;
    *out = strtod(str, &p);
    if (p == str ||
        (errno==ERANGE && (*out == 0 || *out == HUGE_VAL || *out == -HUGE_VAL)))
        return 1;
    while (*p != '\0') {
        if (!isspace(*p))
            return 1;
        p++;
    }
    return 0;
}

carlobaldassi · 2012-04-13T04:08:34Z

Yes, testing for non-zero/non-inf results gets us denormals as well.

carlobaldassi · 2012-04-13T04:37:53Z

Yes, that's better, strlen was just wasting time.

So, if it's ok to throw exceptions in parse_int I can push the other fix and we could close this issue.

JeffBezanson · 2012-04-13T06:25:14Z

I tried replacing some of the error with throw(ArgumentError(strcat(...))) and it seems to work for me. I didn't try your exact patch though.

JeffBezanson · 2012-04-13T06:34:01Z

Ok your version with strcat also works for me...I wonder what happened.

carlobaldassi · 2012-04-13T10:20:20Z

Now the version with the modified parse_int works for me too. But if I try to do the same with the exception message produced by float(::String):

if !float64_isvalid(s, tmp)
    throw(ArgumentError(strcat("float64(String): invalid number format (in ",show_to_string(s),")")))
end

then there goes my RAM again. Can you reproduce this?

JeffBezanson · 2012-04-13T19:03:29Z

OK, that one gives me a stack overflow. Not 100% sure why, but it has something to do with the circular dependencies involved in string interpolation. Man I hate string interpolation; its only purpose is to serve the (apparently) deep human need for horrible broken syntax.

ghost assigned StefanKarpinski Apr 10, 2012

carlobaldassi added a commit to carlobaldassi/julia that referenced this issue Apr 13, 2012

More liberal int(String) (addresses JuliaLang#697)

8d02fee

carlobaldassi added a commit to carlobaldassi/julia that referenced this issue Apr 13, 2012

More liberal int(String) (addresses JuliaLang#697)

6457c26

JeffBezanson closed this as completed in 9e7fbf5 Apr 14, 2012

tkelman mentioned this issue Nov 30, 2015

OpenBLAS 4 times slower than MKL on DDOT() OpenMathLib/OpenBLAS#530

Closed

martin-frbg mentioned this issue Nov 28, 2015

Calling zpotrs_ introduces nans in unrelated results of cpotrf_ with Nehalem kernels OpenMathLib/OpenBLAS#695

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

string to integer conversion with newline #697

string to integer conversion with newline #697

SamChill commented Apr 10, 2012

StefanKarpinski commented Apr 10, 2012

SamChill commented Apr 10, 2012

StefanKarpinski commented Apr 10, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

string to integer conversion with newline #697

string to integer conversion with newline #697

Comments

SamChill commented Apr 10, 2012

StefanKarpinski commented Apr 10, 2012

SamChill commented Apr 10, 2012

StefanKarpinski commented Apr 10, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012

carlobaldassi commented Apr 13, 2012

JeffBezanson commented Apr 13, 2012