Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

string to integer conversion with newline #697

Closed
SamChill opened this issue Apr 10, 2012 · 13 comments
Closed

string to integer conversion with newline #697

SamChill opened this issue Apr 10, 2012 · 13 comments
Assignees
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request

Comments

@SamChill
Copy link
Contributor

In Julia passing a string that contains an integer and ends in a newline causes int() to error(). In other languages if the last character is a newline int() doesn't complain. Example:

julia> int("1\n")
'\n' is not a valid digit (in "1\n")
 in parse_int at string.jl:886
 in int at string.jl:916

I think that "1\n" should be valid and "1\n2" should be invalid. If there is a newline and it isn't the last character in the string, then the operation is ill defined. This is how Python's int works by the way.

The thinking for this behavior comes from the following use case:

n = int(readline(stream))

It is pretty common, at least for me, to write parsers for simple file formats where there is a single number per line. Currently you would need to call chomp() before int().

@ghost ghost assigned StefanKarpinski Apr 10, 2012
@StefanKarpinski
Copy link
Member

This is probably a good thing from a pragmatic perspective. What about other kinds of trailing whitespace? Should that be an error? Then there's also various combinations of \n and \r...

@SamChill
Copy link
Contributor Author

It appears that float and bool don't care about white space:

julia> float(" 2 \n ")
2.0
julia> bool(" 1 \r\n ")
true

In fact they might be much too carefree:

julia> float("10 \r 2 \n foobar")
10.0

I'm not sure what the desired behavior is, but I'm pretty sure it isn't that!

Regardless of the desired behavior there should be some consistent behavior for converting strings to a given type. I'd vote for stripping all whitespace as whitespace has no meaning for most type conversion and is consistent with some other languages.

@StefanKarpinski
Copy link
Member

Completely agreed. Int is way too strict (written by use in Julia code) and float is way too liberal (I think we're just using a standard C parsing function there).

carlobaldassi added a commit to carlobaldassi/julia that referenced this issue Apr 13, 2012
@carlobaldassi
Copy link
Member

As you can see from the automatic message above, I fixed the behaviour for the int case (the easy one). I did not push it in julia master because I wanted to make sure that it makes sense to substitute error("...") with throw(ArgumentError("..."), which is what float(::String) does.
Also, note that I've been trying to reproduce the original error messages via strcat, but this exposed a weird behaviour: if I load the file from a running Julia session, everything works fine. If, however, I try to recompile sys.ji, my memory usage starts to grow steeply and eats up easily several gigabytes of RAM in a few seconds, leaving no other option than killing the compilation. This forced me to use less informative error messages.

About the floating point case, I see 2 ways of achieving the desired behaviour: reimplementing everything in Julia, or just do a preliminary sanity check and delegating the real work to strtod in case of success. The second strategy has the obvious advantage of being much easier (in particular, it deals with over/underflows), and in fact I already did some experiments with it. The cheaper way to do that would be using a regex, like this:

const _jl_float_regex = r"^\s*([+-]?(\d+\.?\d*|\.\d+)(e[+-]?\d+)?|[Nn][Aa][Nn]|[Ii][Nn][Ff])\s*$"
_jl_float_isvalid_regex(s::String) = matches(_jl_float_regex, s)

Now, apart from the fact that regexes cannot be defined within string.jl, which forces to postpone this definition (which is ugly), the main problem which emerges here is that this preparsing step alone is 4 times slower than the call to strtod, meaning that this strategy is going to reduce performance by 5 times. This seems too big a performance loss to be really acceptable.
On the other hand, I have also implemented some ad hoc preparser here (beware, it's ugly), which just goes through the string one character at a time and keeps track of what's allowed and what's not, and while it is faster than the regex it's still twice as slow as strtod, and it's basically doing nothing. Given these premises, the approach of rewriting the whole parser in Julia seems even less appealing.
Of course, you may see some obvious ways to speed up the code I wrote; at the moment, I certainly don't.
So to summarize, sanity in floating point parsing at the moment is going to cost a ~3-fold performance loss with that code.

Two more things I found out while experimenting:

  • strtod does not accept denormal numbers as inputs. This introduces a discrepancy with the Julia parser.
  • apparently, bool(::String) is not just very liberal: it alway returns true

@JeffBezanson
Copy link
Member

strtod already does error checking, we just need to use its output better. For denormals it sets errno to ERANGE, so we could really just ignore errno. Unfortunately it doesn't distinguish denormals from completely out-of-range numbers like 1e-400 and just gives 0 in that case.

We can use the end pointer strtod returns to see if there's junk after the converted number.

We have bool(Any)=true, which should probably just be deleted.

carlobaldassi added a commit to carlobaldassi/julia that referenced this issue Apr 13, 2012
@carlobaldassi
Copy link
Member

I implemented the first part here and it seems to work fine. I'm assuming that _jl_strtod will only get well-behaved strings and using strlen.

About denormals: from what you write it seems we could check if errono == ERANGE and the output is non-zero and small. In the manpage I don't find any mention of denormals though, just ±HUGE_VAL being returned in case of overflow and 0 in case of underflow. Needs to be investigated.

@JeffBezanson
Copy link
Member

Here's my implementation:

DLLEXPORT int jl_strtod(char *str, double *out)
{
    char *p;
    errno = 0;
    *out = strtod(str, &p);
    if (p == str ||
        (errno==ERANGE && (*out == 0 || *out == HUGE_VAL || *out == -HUGE_VAL)))
        return 1;
    while (*p != '\0') {
        if (!isspace(*p))
            return 1;
        p++;
    }
    return 0;
}

@carlobaldassi
Copy link
Member

Yes, testing for non-zero/non-inf results gets us denormals as well.

@carlobaldassi
Copy link
Member

Yes, that's better, strlen was just wasting time.

So, if it's ok to throw exceptions in parse_int I can push the other fix and we could close this issue.

@JeffBezanson
Copy link
Member

I tried replacing some of the error with throw(ArgumentError(strcat(...))) and it seems to work for me. I didn't try your exact patch though.

@JeffBezanson
Copy link
Member

Ok your version with strcat also works for me...I wonder what happened.

@carlobaldassi
Copy link
Member

Now the version with the modified parse_int works for me too. But if I try to do the same with the exception message produced by float(::String):

if !float64_isvalid(s, tmp)
    throw(ArgumentError(strcat("float64(String): invalid number format (in ",show_to_string(s),")")))
end

then there goes my RAM again. Can you reproduce this?

@JeffBezanson
Copy link
Member

OK, that one gives me a stack overflow. Not 100% sure why, but it has something to do with the circular dependencies involved in string interpolation. Man I hate string interpolation; its only purpose is to serve the (apparently) deep human need for horrible broken syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request
Projects
None yet
Development

No branches or pull requests

4 participants