Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make last(empty) empty #1869

Closed
wants to merge 5 commits into from
Closed

Make last(empty) empty #1869

wants to merge 5 commits into from

Conversation

muhmuhten
Copy link
Contributor

This is a kind of unsatisfying patchset, in which I set out to make last/1 faster, then discovered that it was buggy all along and fixing it made it even slower than it was in the first place.

I noticed reduce's update semantic s(it throws away all but the last value from the update), and noticed it basically made for a faster last/1. Passed tests. Discovered last(empty) == null. Realized that it was always like that. I don't know what the right thing to do even is and am leaving this here to foist this headache on someone else.

last/1 and first/1 are inconsistent, in that first(empty) is empty but currently last(empty) == null. The former is fairly intentional and gives rise to a useful pattern used by the current definition of isempty. The latter is an accident resulting from reduce semantics.

Unfortunately, it is surprisingly hard to implement a consistent last! The attempt given here unnecessarily wraps every single output of the generator in an array, then throws it out unless it's the last one. I think it's possible to bytecode without doing that nonsense, but would abusing variable guts and I don't want to write that and nobody else will understand it.

In any case, the existing documented semantics of nth is inconsistent with both and is also total nonsense, but I just complained about that elsewhere.

Probably the fundamental issue is the reduce shouldn't throw away all except the last update. On the other hand, its current behaviour is consistent with foreach, where it kind of does make sense, in that all the updates get extracted on but only the last one becomes the next state. Maybe that's nonsense too.

Also fixed a bug (mea culpa) and an unnecessary call in limit/2 (mea culpa).

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.01%) to 84.867% when pulling 964caad on muhmuhten:misc_fixes into 3ea0199 on stedolan:master.

@nicowilliams
Copy link
Contributor

nicowilliams commented Mar 22, 2019

Here's a reduce-based last/1 with correct semantics:

$ jq -n 'def last(e): reduce e as $e ({empty:true,last:null}; (.empty = false) | (.last=$e))|if .empty then empty else .last end; last(empty)
'
$ jq -n 'def last(e): reduce e as $e ({empty:true,last:null}; (.empty = false) | (.last=$e))|if .empty then empty else .last end; last(1,2)'
2
$ 

EDIT: Ah, similar to yours. Yours uses an array instead of an object, which is faster. There have been times when I've wondered about making a special jv encoding for arrays of one element which do not need a jvp_array to be allocated. I'm not sure the additional branches would hurt, but certainly fewer allocations would help.

@nicowilliams
Copy link
Contributor

Probably the fundamental issue is the reduce shouldn't throw away all except the last update. ...

Actually, that behavior is critical. It's very important that during the execution of the reduction update body, jq not hold on to any additional references to the reduction state, otherwise the memory consumption of things like assignments blow up.

@nicowilliams
Copy link
Contributor

I agree it should be possible to bytecode a last/1 that does not need to wrap outputs in arrays. I'm also not sure it's worth it.

@nicowilliams
Copy link
Contributor

nicowilliams commented Mar 22, 2019

Another one like this I noticed today is repeat(empty), which loops forever. That is very unsatisfying. I think repeat(empty) should be the same as empty, and I intend to make it so as it turns out to be very useful in dealing with co-routines.

@nicowilliams
Copy link
Contributor

nicowilliams commented Mar 22, 2019

Here's an efficient (I think) last/1:

def last(e):
    [0] as $path
  | reduce e as $e ([]; setpath($path;$e))
  | if length==0 then empty else .[0] end;

This is efficient because, though it allocates an array at run-time, it only allocates one array, and it mutates it over and over.

@muhmuhten
Copy link
Contributor Author

I agree it should be possible to bytecode a last/1 that does not need to wrap outputs in arrays. I'm also not sure it's worth it.

Probably any case where last/1 performance matter is also one that would be better served by just not computing out values just to throw them away. Unlike first, it doesn't even actually avoid the extra work...

Here's an efficient (I think) last/1:

A few variants I tried:

% time ./jq -cn 'def last(g): [0] as $p | reduce g as $_ ([]; setpath($p; $_)) | .[]; last(empty), last(range(2e6))'
1999999
./jq -cn   1.47s user 0.00s system 99% cpu 1.473 total
% time ./jq -cn 'def last(g): reduce g as $_ ([]; setpath([0]; $_)) | .[]; last(empty), last(range(2e6))'
1999999
./jq -cn   1.40s user 0.00s system 99% cpu 1.400 total
% time ./jq -cn 'def last(g): reduce g as $_ ([]; [$_]) | .[]; last(empty), last(range(2e6))'
1999999
./jq -cn   1.12s user 0.00s system 99% cpu 1.123 total
% time ./jq -cn 'def last(g): reduce . as $_ (.; g|[.]) | .[]?; last(empty), last(range(2e6))'
1999999
./jq -cn   0.89s user 0.00s system 99% cpu 0.890 total

Sometimes performance can be p unintuitive...

@muhmuhten
Copy link
Contributor Author

(I think what we're seeing is, in order: keeping [0] in a variable isn't actually an improvement over a constant array; array collection is still turns out to be quicker than three function calls; and skipping elements in the VM is quicker than doing the reduction rigmarole.)

There have been times when I've wondered about making a special jv encoding for arrays of one element which do not need a jvp_array to be allocated. I'm not sure the additional branches would hurt, but certainly fewer allocations would help.

If that does in fact help, it seems like it'd be helpful to foreach ((g|[.]), null) &c. code as well.

@fadado
Copy link

fadado commented Mar 22, 2019

Another one like this I noticed today is repeat(empty), which loops forever. That is very unsatisfying. I think repeat(empty) should be the same as empty, and I intend to make it so as it turns out to be very useful in dealing with co-routines.

def repeat(g): #:: a|(a->*b) => *b
    def r: g , r;
    select(isempty(g)|not) | r
;

@muhmuhten
Copy link
Contributor Author

Evaluates g too many times

@nicowilliams
Copy link
Contributor

Yeah, evaluating g only once is essential, otherwise repeat (or whatever function) can't be used with side-effect-having expressions. The I/O and co-routine stuff necessarily adds side-effects...

@nicowilliams
Copy link
Contributor

And input, inputs, debug, and stderr all already have side effects.

@nicowilliams
Copy link
Contributor

Here's an almost correct repeat/1:

  def repeat(exp):
    label $out |
    def _repeat:
      (exp // break $out), _repeat;
    _repeat;

I say "almost" because the // operator will execute the RHS when the LHS produces null.

@nicowilliams
Copy link
Contributor

nicowilliams commented Mar 22, 2019

@muhmuhten looking at the disassembly helps a bit:

$ ./jq --debug-dump-disasm -cn 'def last(g): [0] as $p | reduce g as $_ ([]; setpath($p; $_)) | .[]; last(empty), last(range(2e3))'
0000 TOP
0001 FORK 0011
0003 CALL_JQ last:3 @lambda:4
0009 JUMP 0017
0011 CALL_JQ last:3 @lambda:5
0017 RET
...
last:3:
  [params: g]
  0000 DUP
  0001 PUSHK_UNDER [0]
  0003 POP
  0004 STOREV $p:0
  0007 DUP
  0008 LOADK []
  0010 STOREV $reduce:1
  0013 FORK 0043
  0015 DUPN
  0016 CALL_JQ g:0
  0020 STOREV $_:2
  0023 LOADVN $reduce:1
  0026 SUBEXP_BEGIN
  0027 LOADV $_:2
  0030 SUBEXP_END
  0031 SUBEXP_BEGIN
  0032 LOADV $p:0
  0035 SUBEXP_END
  0036 CALL_BUILTIN setpath
  0039 STOREV $reduce:1
  0042 BACKTRACK
  0043 LOADVN $reduce:1
  0046 EACH
  0047 RET
...
$ 

vs

$ ./jq --debug-dump-disasm -cn 'def last(g): reduce . as $_ (.; g|[.]) | .[]?; last(empty), last(range(2e3))'
...
last:3:
  [params: g]
  0000 DUP
  0001 STOREV $reduce:0
  0004 FORK 0036
  0006 DUPN
  0007 STOREV $_:1
  0010 LOADVN $reduce:0
  0013 CALL_JQ g:0
  0017 DUP
  0018 LOADK []
  0020 STOREV $collect:2
  0023 FORK 0029
  0025 APPEND $collect:2
  0028 BACKTRACK
  0029 LOADVN $collect:2
  0032 STOREV $reduce:0
  0035 BACKTRACK
  0036 LOADVN $reduce:0
  0039 EACH_OPT
  0040 RET
...

Stripping out most of the instruction addresses and diff'ing:

 last:3:
   [params: g]
        DUP
-       PUSHK_UNDER [0]
-       POP
-       STOREV $p:0
-       DUP
-       LOADK []
-       STOREV $reduce:1
-       FORK 0043
+       STOREV $reduce:0
+       FORK 0036
        DUPN
+       STOREV $_:1
+       LOADVN $reduce:0
        CALL_JQ g:0
-       STOREV $_:2
-       LOADVN $reduce:1
-       SUBEXP_BEGIN
-       LOADV $_:2
-       SUBEXP_END
-       SUBEXP_BEGIN
-       LOADV $p:0
-       SUBEXP_END
-       CALL_BUILTIN setpath
-       STOREV $reduce:1
+       DUP
+       LOADK []
+       STOREV $collect:2
+       FORK 0029
+       APPEND $collect:2
+       BACKTRACK
+  0029 LOADVN $collect:2
+       STOREV $reduce:0
        BACKTRACK
-  0043 LOADVN $reduce:1
-       EACH
+  0036 LOADVN $reduce:0
+       EACH_OPT
        RET

@pkoppstein
Copy link
Contributor

pkoppstein commented Mar 22, 2019

@nicowilliams wrote:

Here's an almost correct repeat/1

For the record, the relevant tweak would be:

def repeat(exp):
    label $out
    | def r: ([exp] | if length == 0 then break $out else .[0] end), r;
    r;

Personally, I'm more concerned about repeat(input) than repeat(empty). One might even say that repeatedly asking for nothing should indeed take an indefinite amount of time :-)

@nicowilliams
Copy link
Contributor

@pkoppstein repeat/1 currently produces all the outputs of its argument expression...

@pkoppstein
Copy link
Contributor

pkoppstein commented Mar 22, 2019

@nicowilliams wrote:

repeat/1 currently produces all the outputs of its argument expression...

Consider:

echo $'1\n2\n3\n' | jq -n 'repeat(input)'
1
2
3
jq: error (at <stdin>:4): break

Notice that an error was raised. Similarly, repeat(input?) does not terminate. Yes, I know, given the current def, that is all as expected, and I'm not asking for any change. As I indicated, to me, the fact that repeat(empty) does not terminate is a feature, or if you like, the acceptable consequence of the simplicity of its current def.

@nicowilliams
Copy link
Contributor

@pkoppstein

$ jq -cn '[limit(10; repeat(1,2,3))]'
[1,2,3,1,2,3,1,2,3,1]
$ 

@nicowilliams
Copy link
Contributor

Taking a page from @muhmuhten's last/1:

$ ./jq -cn '
  def repeat(e):
    label $out |
    def _r:
      foreach . as $junk ([]; (e|[.]) // break $out; .[0]), _r;
    _r;
  [limit(10; repeat(empty))]'
[]
$ ./jq -cn '
  def repeat(e):
    label $out |
    def _r:
      foreach . as $junk ([]; (e|[.]) // break $out; .[0]), _r;
    _r;
  [limit(10; repeat(1,2,3))]'
[1,2,3,1,2,3,1,2,3,1]
$ ./jq -cn '
  def repeat(e):
    label $out |
    def _r:
      foreach . as $junk ([]; (e|[.]) // break $out; .[0]), _r;
    _r;
  [limit(10; repeat(1,null,3))]'
[1,null,3,1,null,3,1,null,3,1]
$ 

@muhmuhten
Copy link
Contributor Author

muhmuhten commented Mar 22, 2019 via email

@nicowilliams
Copy link
Contributor

Currently input outputs an error on EOF. I think instead it should empty on EOF, and repeat/1 should stop on empty. The only expressions that can empty and when re-evaluated not empty are I/O-like functions, and those should come with utilities for checking EOF, doing clearerr(3), and I think it's fair for repeat/1 to stop on empty because you probably can't read again unless you take some action, and looping forever is clearly wrong.

@nicowilliams
Copy link
Contributor

A slightly different way to think repeat/1:

  • repeat(pure_expression_that_is_not_empty) -> makes sense, like range(1e12), maybe this is what you want, an endless stream of something

  • repeat(IO_writing_expression) -> as long as the sink will take it, keep it up -- and if the sink goes away, well, we can raise an error and get out

  • repeat(empty) -> not useful at all, but what about:

  • repeat(some_pure_expression_that_is_same_as_empty) -> certainly not useful to loop tightly doing nothing forever

  • repeat(IO_reading_expression) -> read until EOF, then... then certainly not loop tightly doing nothing forever

jq -n 'repeat(input) | ...' -> when I hit ^D, I expect to get my shell prompt back.

Because to date we haven't had any impure expressions that end up being empty, and because repeat(empty) is decidedly non-useful, I think we are free to make this change repeat(empty) to backtrack.

BTW, we still have to do something about C-coded jq functions returning jv_invalid() to trigger backtracking. Right now my co-routine and I/O stuff depends on that behavior, and I think we should change input to do the same -- raising an error ("break") works, but it feels very unidiomatic. I don't think returning jv_invalid() to backtrack is a a problem, but I forgot why @muhmuhten thought it was, and I'm open to it being a problem -- if it is a problem we can just add a jv_backtrack() that returns a flavor of invalid jv that causes the VM to backtrack, and a jv_is_backtrack() to go with it.

@muhmuhten
Copy link
Contributor Author

The issue is with repeat(IO expression which can possibly be empty without being finished).

% seq 20 | jq -cn 'def repeat(exp): def _r: exp, _r; _r; [limit(10; repeat(input|select(. % 10 != 5)))]'
[1,2,3,4,6,7,8,9,10,11]
% seq 20 | jq -cn 'def repeat(e): label $out | def _r: foreach . as $junk ([]; (e|[.]) // break $out; .[0]), _r; _r; [limit(10; repeat(input|select(. % 10 != 5)))]'
[1,2,3,4]

I don't think jv_invalid() being a backtrack was a problem, just that null|error should probably be some kind of error rather than a backtrack.

Possibly it would be better to have this repeat as a builtin, since it has useful properties and the implementation is nonobvious, but those properties need to be documented since they are a bit surprising.

Also, it the init expression of foreach needs to be . so that e gets the right input:

% jq -cn 'def repeat(exp): def _r: exp, _r; _r; [3|limit(10; repeat(.))]'
[3,3,3,3,3,3,3,3,3,3]
% jq -cn 'def repeat(e): label $out | def _r: foreach . as $junk ([]; (e|[.]) // break $out; .[0]), _r; _r; [3|limit(10; repeat(.))]'
[[],[],[],[],[],[],[],[],[],[]]

@nicowilliams
Copy link
Contributor

I agree, but why not write that sort of expression like so: repeat(input_expression)|select(...) instead of having the select() inside the repeat expression?

Ah, but the point is this is a breaking change then. OK, fair enough :(

@nicowilliams
Copy link
Contributor

Another thing like this is that the // operator unfortunately alternates two conditions: the emptiness/non-emptiness of the LHS, and its truthiness:

$ jq './/"yo"'
true
true
false
"yo"
null
"yo"
1
1
"a"
"a"
$ 

But sometimes you want the RHS only if the LHS was empty. isempty/1 doesn't help if you want walt all the outputs of the LHS.

The way to jq-code that is remarkably like your last/1, @muhmuhten:

$ jq -n </dev/null 'foreach . as $i (null; [input?]; .) | if length==0 then "yo" else .[0] end'
"yo"
$ jq -n 'foreach . as $i (null; [input?]; .) | if length==0 then "yo" else .[0] end'
false
false
$ jq -n 'foreach . as $i (null; [input?]; .) | if length==0 then "yo" else .[0] end'
null
null
$ 

As a function:

$ jq  -n '
  def definedor(stream; alt):
      foreach . as $i (null; [stream]; .)
    | if length==0 then alt else .[0] end;
  definedor(input?; "yo")'
...

("definedor" is the internal name for the // operator in jq.)

The bytecode for this is nowhere near as tight as for // in spite of not caring what the values are that are produced by the LHS, so we might need a proper new operator. Right now the only ideas I've got as to what that should look like is ///, but that's not very satisfying.

@muhmuhten
Copy link
Contributor Author

It seems like "alternative for empty", "alternative for null", and "alternative for falsey things" should be distinct. It's just the first one that's particularly awkward/nonintuitive to implement in jq.

I think a bit more work is needed to get definedor right:

% jq -n '0|range(.) // -1'
-1
% jq -n 'def definedor(stream; alt): foreach . as $i (null; [stream]; .) | if length==0 then alt else .[0] end; 0|definedor(range(.); -1)'
jq: error (at <unknown>): Range bounds must be numeric
% time jq -n 'first(range(1e20)//-1)|empty'
jq -n 'first(range(1e20)//-1)'  0.01s user 0.01s system 92% cpu 0.014 total
% jq -n 'def definedor(stream; alt): foreach . as $i (null; [stream]; .) | if length==0 then alt else .[0] end; first(definedor(range(1e20); -1))|empty'
# this shouldn't succeed

@muhmuhten
Copy link
Contributor Author

Actually, wouldn't def definedor(s; a): (s|[.]) // (a|[.]) | .[]; do the job? Needless box-unboxing aside.

@nicowilliams
Copy link
Contributor

Boxing happens anyways in my version.

@nicowilliams
Copy link
Contributor

Oops, my definedor boxes the whole stream, not each of its outputs. Ay.

@nicowilliams
Copy link
Contributor

Using // for a this is better than using foreach because a) it's more idiomatic, b) it avoids the SUBEXP_BEGIN/SUBEXP_END brackets around conditionals, but in the end it's only three instructions fewer. Thanks for showing me that variant.

@itchyny itchyny added the bug label Jun 3, 2023
wader added a commit to wader/jqjq that referenced this pull request Sep 17, 2024
last(f) output null for empty (is a bug jqlang/jq#1869)
nth($n; f) error on negative $n and pick correct value for short f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants