Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream to Seq #131

Merged
merged 2 commits into from
Feb 18, 2022
Merged

Conversation

gasche
Copy link
Contributor

@gasche gasche commented Feb 9, 2022

Before:

┌─────────────────────────────────────┬──────────┬────────────┬────────────┬──────────┬────────────┐
│ Name                                │ Time/Run │    mWd/Run │   mjWd/Run │ Prom/Run │ Percentage │
├─────────────────────────────────────┼──────────┼────────────┼────────────┼──────────┼────────────┤
│ JSON stream roundtrip               │ 998.06us │ 70_051.93w │ 10_985.58w │   13.98w │    100.00% │
└─────────────────────────────────────┴──────────┴────────────┴────────────┴──────────┴────────────┘

After:

┌────────────────────┬──────────┬────────────┬────────────┬──────────┬────────────┐
│ Name               │ Time/Run │    mWd/Run │   mjWd/Run │ Prom/Run │ Percentage │
├────────────────────┼──────────┼────────────┼────────────┼──────────┼────────────┤
│ JSON seq roundtrip │ 907.16us │ 60_038.93w │ 10_980.64w │    8.75w │    100.00% │
└────────────────────┴──────────┴────────────┴────────────┴──────────┴────────────┘

@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

See #129.

@Leonidas-from-XIV
Copy link
Member

That's great.

The only problem I see is with stdcompat. After seing how it is built and maintained (autotools, no dune, which creates issues for us using it in the duniverse and is also kinda a large amount of code to depend on) I would much prefer a different solution. Given we infamously use CPPO we could have our own Seq module for older versions of OCaml or something.

@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

Actually I think I could use c-cube/seq instead, I can update the PR, but only later in the day (or feel free to do it if you want).

@Leonidas-from-XIV
Copy link
Member

That's a good point, @c-cube's seq is exactly the kind of library I hoped for.

CHANGES.md Outdated
@@ -25,6 +25,8 @@

- The function `to_file` now adds a newline at the end of the generated file. An
optional argument allows to return to the original behaviour (#124, @panglesd)
- The stream_from_* and stream_to_* functions now use a Seq.t instead of a Stream.t,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The stream_from_* and stream_to_* functions now use a Seq.t instead of a Stream.t,
- The `stream_from_*` and `stream_to_*` functions now use a Seq.t instead of a Stream.t,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, also quoted the other identifiers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, reopening this since the force-push undid my change.

@c-cube
Copy link
Member

c-cube commented Feb 9, 2022

Something I'm wondering about is whether it's worth it trying to make the seq persistent. It's not too hard to do, for a small cost compared to json decoding, and it can be convenient if one uses the Seq as a functional value. In pseudo code:

let rec stream_to_seq ic : t Seq.t =
  let r = ref None in
  fun () ->
    match !r with
    | Some node -> node
    | None ->
      let node =
        try
          let json = decode_json ic in
          Seq.Cons (j, stream_to_seq ic)
        with End_of_file -> Seq.Nil
      in
      r := Some node;
      node

Copy link
Member

@Leonidas-from-XIV Leonidas-from-XIV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things I have encountered while going through the code.

bench/bench.ml Show resolved Hide resolved
lib/read.mll Outdated Show resolved Hide resolved
@gasche gasche force-pushed the Stream-to-Seq branch 2 times, most recently from 5d7091c to fa8e5f2 Compare February 9, 2022 15:34
@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

@c-cube I tried to implement your suggestion, it's a bit more work because we need to handle exceptions as well. I got the following version which is 2x slower and allocates 3x more.

  type 'a cache =
    | Empty
    | Full of 'a
    | Fail of exn

  let seq_from_lexbuf v ?(fin = fun () -> ()) lexbuf =
    let stream = Some true in
    let rec f lexbuf =
      let cache = ref Empty in
      fun () ->
        match !cache with
        | Full node -> node
        | Fail e -> raise e
        | Empty ->
          let node =
            try Seq.Cons (from_lexbuf v ?stream lexbuf, f lexbuf)
            with
            | End_of_input ->
              fin ();
              Seq.Nil
            | e ->
              let exn =
                try ignore (fin ()); e
                with fin_e -> Finally (e, fin_e)
              in
              cache := Fail exn;
              raise exn
          in
          cache := Full node;
          node
    in
    f lexbuf

I think it's simpler to stay with the non-persistent version, and let users convert it explicitly to a persistent version if they need to. (I expect that they rarely do.)

Note: (even before @c-cube chimed in,) I was wondering whether one of containers' generator/iterator types (those with type unit -> 'a option, basically) could be an even better fit than Seq.t, because (they're close to how the code is written and) they are more obviously non-persistent. But I think that reusing a type from the standard library has value.

@c-cube
Copy link
Member

c-cube commented Feb 9, 2022

@gasche I'm surprised by the overhead, but so be it :-). I think it's indeed valuable to use Seq as it's the most common type now. Adding a gen function on the side is doable but probably won't benefit as many people.

@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

(Note: I'm not working from a flambda switch.)

@c-cube
Copy link
Member

c-cube commented Feb 9, 2022

(Note: I'm not working from a flambda switch.)

Most people aren't, so I think it's reasonable 😁

@tcoopman
Copy link

tcoopman commented Feb 9, 2022

By coincidence I stumbled on this PR, I was looking into what streaming functions yojson had to offer and because this PR was just opened, I wanted to try it.

Now I have an issue with Yojson.Basic.seq_to_file "path/to/file" seq. It runs fine, creates the file, but the file stays empty?

If I run the exact same code but with Yojson.Basic.seq_to_string it outputs the correct string

Any idea if I'm doing something wrong, or if something else is going on?
I'll see if I can find a minimal reproduction

@tcoopman
Copy link

tcoopman commented Feb 9, 2022

Here is a minimal (I'm using base) that fails:

let test () =
  let seq = Sequence.of_list [`String "a"; `String "b"; `String "c"] |> Sequence.to_seq in
  Yojson.Basic.seq_to_string seq |> Stdio.print_endline;
  Yojson.Basic.seq_to_file "test.json" seq

This prints "a""b""c" on terminal, but "test.json" is empty.
I'm on linux with ocaml 4.13.1

@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

This does look like a bug indeed, that (according to my test) was already present in the Stream functions this PR is replacing. (I guess seq_to_file isn't used much!)

lib/write.ml Outdated
let ob =
match buf with
None -> Buffer.create len
| Some ob -> ob
in
stream_to_buffer ?std ob st
seq_to_buffer ?std ob st
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is missing a line to actually write the buffer Buffer.output_buffer oc ob

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I just found the same fix. I think it should be fixed separately from this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is okay to add this to this PR, but it is up to you.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fix should be the suggestion made by @gasche in the comments, otherwise the seq has no real value as there is no memory improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to send a separate PR with the fix, and probably the streaming improvement as well, but probably not before tomorrow. I think the current PR should stay focused on just replacing Streaming with Seq, preserving the behavior exactly.

There are other things that look fishy in the buffer-handling logic of write.mll (I suspect that it could sometimes include the existing data of the buffer in the output, which is not intended), so it would be nice to review that as well, maybe I'll do it.

@tcoopman
Copy link

tcoopman commented Feb 9, 2022

@gasche just found it - see my review - it's missing a line to write the buffer. At least that is how the regular to_channel works

@tcoopman
Copy link

tcoopman commented Feb 9, 2022

I've noticed something else that's strange (so after I fixed writing the buffer as well). I'm not sure if it's an issue with this PR or something I'm not understanding correctly.

When I'm doing seq_to_file my memory usage of the program grows to 14GB (I'm generating a file of over 5GB.
But when I just use Seq.iter (fun _ -> ()) seq the memory usage never goes over 40MB. The execution time of the program is similar (1m9s for writing vs 52s without writing) so I'm pretty sure I'm doing the same work

Isn't the goal of the seq_to_file that you don't have to keep the whole file in memory?

(if you prefer this question somewhere else or as bug reports, let me know)

@gasche
Copy link
Contributor Author

gasche commented Feb 9, 2022

Could you try the following implementation?

let seq_to_channel ?buf ?(len=2096) ?std oc st =
  let ob =
    match buf with
        None -> Buffer.create len
      | Some ob -> ob
  in
  Seq.iter (fun v ->
    to_buffer ?std ob v;
    Buffer.output_buffer oc ob;
    Buffer.clear ob;
  ) st

@tcoopman
Copy link

tcoopman commented Feb 9, 2022

Awesome, that fixes it! ❤️

@gasche
Copy link
Contributor Author

gasche commented Feb 10, 2022

The CI failed on <4.07 versions, I think it's because I forgot to add the seq dependency to dune in addition to yojson.opam, I just pushed an updated version.

@gasche gasche mentioned this pull request Feb 10, 2022
@gasche
Copy link
Contributor Author

gasche commented Feb 10, 2022

See #133 (fixing the bugs reported by @tcoopman) and #132 (improving the usage of the ?buf parameter in other write functions). #133 conflicts with the current PR, I'll let maintainers decide which one they want to merge (if any) and rebase the other accordingly.

@Leonidas-from-XIV
Copy link
Member

I would suggest going with #133 first. I took a look at it and it is ready to be merged in my opinion (despite the failure on 4.14 in CI, but this is gonna be handled in the PR) then rebase this PR on it and then tackle #132 after this is merged. Thanks for your effort @gasche and @tcoopman, much appreciated! There's apparently big parts of Yojson that are under-utilized/under-tested so it is great that these things are getting reported and fixed.

@Leonidas-from-XIV Leonidas-from-XIV added this to the 2.0.0 milestone Feb 15, 2022
@Leonidas-from-XIV
Copy link
Member

@gasche #133 is merged, can you rebase please? That would fix our CI too since OCaml-CI added 4.14 already and we're failing on it.

@gasche
Copy link
Contributor Author

gasche commented Feb 16, 2022

Done!

@Leonidas-from-XIV
Copy link
Member

Looks like List.to_seq was only added in 4.07. I've opened an issue on seq to see what @c-cube thinks about adding it, but in the meantime we could just implement our own version.

@gasche
Copy link
Contributor Author

gasche commented Feb 17, 2022

I changed the code to avoid List.seq and rebased.

@Leonidas-from-XIV
Copy link
Member

Seq.cons was added in 4.11, but given it is just fun x xs -> Seq.Cons (x, xs) it is an easy fix to get it to work throughout all OCaml versions.

@gasche
Copy link
Contributor Author

gasche commented Feb 17, 2022

Damn! Sorry for the noise, I did another attempt with just Seq.Cons and Seq.Nil.

Copy link
Member

@Leonidas-from-XIV Leonidas-from-XIV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I think we got all concerns addressed. Thanks for your work @gasche!

bench/bench.ml Show resolved Hide resolved
@Leonidas-from-XIV Leonidas-from-XIV merged commit 2f8e7a6 into ocaml-community:master Feb 18, 2022
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 2, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
panglesd added a commit to panglesd/opam-repository that referenced this pull request Jun 3, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Leonidas-from-XIV pushed a commit to panglesd/opam-repository that referenced this pull request Jun 9, 2022
CHANGES:

### Removed

- Removed dependency on easy-format and removed `pretty_format` from
  `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90)
- Removed dependency on `biniou`, simplifying the chain of dependencies. This
  changes some APIs:
  * `Bi_outbuf.t` in signatures is replaced with `Buffer.t`
  * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes
    `stream_to_buffer`
  (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132)
- Removed `yojson-biniou` library
- Removed deprecated `json` type aliasing type `t` which has been available
  since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100).
- Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103)
- Removed constraint that the "root" value being rendered (via either
  `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121)
- Removed `validate_json` as it only made sense if the type was called `json`.
  (@Leonidas-from-XIV, ocaml-community/yojson#137)

### Add

- Add an opam package `yojson-bench` to deal with benchmarks dependency
  (@tmcgilchrist, ocaml-community/yojson#117)
- Add a benchmark to judge the respective performance of providing a buffer vs
  letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV)
- Add an optional `suf` keyword argument was added to functions that write
  serialized JSON, thus allowing NDJSON output. Most functions default to not
  adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions
  writing sequences of values where the default is `\n` (ocaml-community/yojson#135,
  @Leonidas-from-XIV)

### Change

- The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a
  `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131).

### Fix

- Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108,
  @Leonidas-from-XIV)
- Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants