-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove hard dependency on Biniou #74
Remove hard dependency on Biniou #74
Conversation
Again, one of the macOS builds failed for unrelated reasons, the rest is green. |
Some people may rely on the extra performance here [benchmarks needed], especially when used from atdgen (which is the reason why yojson was created in the first place). For context, the original application for which I developed biniou, yojson, and atdgen originally was a 24-hour hadoop pipeline running on 80 machines around 2011. The inputs and outputs of the pipeline were using json, while the intermediate steps were using biniou instead of json because it was significantly faster to process (4x if I remember correctly, don't quote me on this) and was otherwise equally extensible. There may be scenarios where json is imposed but the cost of parsing or printing json dominates. I don't know. |
I'm curious, what operations of |
@c-cube see the |
@mjambon that makes sense, thank you. It'd be interesting to see benchmarks, I think |
I'll look into providing a microbenchmark to check the difference in performance. Except for |
I think performances are indeed a concern for a lot of If switching to Benchmarks would be welcome, even more if they can be integrated to the repo so that we can easily evaluate the performances when changes are made. There's an existing |
I'll create another PR for it, since that'd have to be merged before this one to measure before/after. |
To get started here's the PR for a revived benchmark: #76 I should probably add a case where it writes JSON with a lot of integers to maximize the impact that this change might have on the performance. |
I have run the revised benchmark (where I added a test that writes a lot of integers to test how big), this is the current master branch, 32d17e9:
and here are the results from this branch, 6420a5e:
Both are running on an 2015 Macbook Pro (OCaml 4.07.1, without flambda) right after each other (admittedly, not the greatest machine to run benchmarks on, I invite you to run the benchmarks yourself). I can also try to rename this branch to Yojson2 to build it in the same binary if there is an interest in that. But from these numbers it looks like using |
I'm not super familiar with reading core_bench's output. Are the columns after |
I'm not super familiar with that output either, but yes, according to their Getting Started document it is about words, major words and promoted words. In that criterion it looks like this branch has a few words less, one major word more and promotes more into major heap. But I don't really know how to judge this. |
I'm not sure how to weigh it either, but the speed increase + smaller dependency footprint does seem like a nice benefit. |
Same as you I'm not sure exactly how to interpret all of the benchmark output but mostly it seems to have no significant impact and if anything it's faster than it use to so as far as I'm concerned it's 👍 I haven't reviewed it all yet but that should be fine. As you mentioned it, this will break the API so that requires a major version bump, especially since What do you think? |
I think yes. I was thinking about working on I don't see much need to do minor changes or patches in |
Wrt |
Sorry for the delay. I agree I'll review and (hopefully) merge today! |
let ob = | ||
match buf with | ||
None -> Bi_outbuf.create_channel_writer ?len oc | ||
None -> Buffer.create len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks to_channel
and to_file
since we just write to a Buffer
and nothing else ever happens.
We need to fix this obviously but my guess is that depending on how we do it this might have an impact on performances. Just calling Buffer.output_buffer
at the end will probably perform pretty bad comapred to the old implementation.
The fact that this can pass the CI is alarming, tomorrow I will:
- Add basic unit tests to make sure the entry points of the API work as expected
- Add IO versions to the benchmarks
let ob = | ||
match buf with | ||
None -> Bi_outbuf.create_output_writer ?len out | ||
None -> Buffer.create len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark as above!
I rebased on top of master which now has a bit more tests to showcase the bug it introduces! |
What is the decision on this PR? Seems quite some time passed. |
There isn't really a decision that needs to be made at the moment. The PR currently breaks existing functionality with no replacement (as the test introduced by @NathanReb shows), so it cannot be merged at the moment until me or someone else implements it in a way where you can actually output to a file. The |
So, my hope is that I'll get to work on this during this "between the years" period and can get it merged soon and then with #88 can remove also the transitive dependency on easy-format. As such I've implemented the missing functionality in the simplest possible way, as you can see in the latest commit. Thanks for the unit tests @NathanReb, they are passing now. I think I'll put adding more extensive tests for this on the TODO list. I would've expected you to be right, with the This branch (rebased on current
Current
I am a bit surprised by these results, and I suspect I might be missing something, so I would appreciate if people could take a look and run the benchmarks themselves. Edit: Okay, what I was missing is that the benchmarks don't stress IO and that would of course make them faster. I'll add this. |
Okay, after adding the relevant benchmarks:
This branch (plus the cherry-picked benchmark from
So indeed writing to a channel got slower, but less than I would've expected honestly. |
@NathanReb, @Leonidas-from-XIV What is the current status? Are there any outstanding questions or work that I can help with? I would really love to see these changes merged and have version 2.0.0 published though opam with reduced dependencies. Looking at the benchmarks, it seems that the changes results in reduced GC pressure. The timings differences are also very small (ranging from 87% to 113%), and can always be micro-optimized later. |
I'm looking into this right now! @Leonidas-from-XIV could you rebase this on top of master please? I think the benchmarks you added have now been merged! |
Thanks! I've just rebased it and since I did the same change as @bschommer in #89 I merged his change first so the PR got slightly smaller. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I have a couple minor comments but otherwise it's good to go!
The main loss here is that the buffer will grow until the final writing to channel/file/output. That means when writing a large JSON value using those function we'll alocate an accordingly large buffer when used to be emptied every now and then, is that correct?
If that becomes an issue I guess we could improve those functions so that we do that ourselves internally without leaking a custom buffer type in the API. For the to_string
and to_buffer
variants it's not an issue so we're good on that front.
Ouh! Could you also please add one or more changelog entries about the API changes we're introducing here? |
Yep, done all that! For some reason notifications are kinda flaky for me, so I didn't see your review before. Needs some investigating. |
The previous behaviour was a change from the old serialization behaviour, notably it omitted the 'u' from the escape sequence. That was incorrect.
The signature changes anyway so might as well change the name too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, looks good to me!
Thanks for working on this and for your patience!
CHANGES: ### Removed - Removed dependency on easy-format and removed `pretty_format` from `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90) - Removed dependency on `biniou`, simplifying the chain of dependencies. This changes some APIs: * `Bi_outbuf.t` in signatures is replaced with `Buffer.t` * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes `stream_to_buffer` (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132) - Removed `yojson-biniou` library - Removed deprecated `json` type aliasing type `t` which has been available since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100). - Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103) - Removed constraint that the "root" value being rendered (via either `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121) - Removed `validate_json` as it only made sense if the type was called `json`. (@Leonidas-from-XIV, ocaml-community/yojson#137) ### Add - Add an opam package `yojson-bench` to deal with benchmarks dependency (@tmcgilchrist, ocaml-community/yojson#117) - Add a benchmark to judge the respective performance of providing a buffer vs letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV) - Add an optional `suf` keyword argument was added to functions that write serialized JSON, thus allowing NDJSON output. Most functions default to not adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions writing sequences of values where the default is `\n` (ocaml-community/yojson#135, @Leonidas-from-XIV) ### Change - The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131). ### Fix - Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108, @Leonidas-from-XIV) - Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
CHANGES: ### Removed - Removed dependency on easy-format and removed `pretty_format` from `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90) - Removed dependency on `biniou`, simplifying the chain of dependencies. This changes some APIs: * `Bi_outbuf.t` in signatures is replaced with `Buffer.t` * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes `stream_to_buffer` (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132) - Removed `yojson-biniou` library - Removed deprecated `json` type aliasing type `t` which has been available since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100). - Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103) - Removed constraint that the "root" value being rendered (via either `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121) - Removed `validate_json` as it only made sense if the type was called `json`. (@Leonidas-from-XIV, ocaml-community/yojson#137) ### Add - Add an opam package `yojson-bench` to deal with benchmarks dependency (@tmcgilchrist, ocaml-community/yojson#117) - Add a benchmark to judge the respective performance of providing a buffer vs letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV) - Add an optional `suf` keyword argument was added to functions that write serialized JSON, thus allowing NDJSON output. Most functions default to not adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions writing sequences of values where the default is `\n` (ocaml-community/yojson#135, @Leonidas-from-XIV) ### Change - The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131). ### Fix - Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108, @Leonidas-from-XIV) - Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
CHANGES: ### Removed - Removed dependency on easy-format and removed `pretty_format` from `Yojson`, `Yojson.Basic`, `Yojson.Safe` and `Yojson.Raw`. (@c-cube, ocaml-community/yojson#90) - Removed dependency on `biniou`, simplifying the chain of dependencies. This changes some APIs: * `Bi_outbuf.t` in signatures is replaced with `Buffer.t` * `to_outbuf` becomes `to_buffer` and `stream_to_outbuf` becomes `stream_to_buffer` (@Leonidas-from-XIV, ocaml-community/yojson#74, and @gasche, ocaml-community/yojson#132) - Removed `yojson-biniou` library - Removed deprecated `json` type aliasing type `t` which has been available since 1.6.0 (@Leonidas-from-XIV, ocaml-community/yojson#100). - Removed `json_max` type (@Leonidas-from-XIV, ocaml-community/yojson#103) - Removed constraint that the "root" value being rendered (via either `pretty_print` or `to_string`) must be an object or array. (@cemerick, ocaml-community/yojson#121) - Removed `validate_json` as it only made sense if the type was called `json`. (@Leonidas-from-XIV, ocaml-community/yojson#137) ### Add - Add an opam package `yojson-bench` to deal with benchmarks dependency (@tmcgilchrist, ocaml-community/yojson#117) - Add a benchmark to judge the respective performance of providing a buffer vs letting Yojson create an internal (ocaml-community/yojson#134, @Leonidas-from-XIV) - Add an optional `suf` keyword argument was added to functions that write serialized JSON, thus allowing NDJSON output. Most functions default to not adding any suffix except for `to_file` (ocaml-community/yojson#124, @panglesd) and functions writing sequences of values where the default is `\n` (ocaml-community/yojson#135, @Leonidas-from-XIV) ### Change - The `stream_from_*` and `stream_to_*` functions now use a `Seq.t` instead of a `Stream.t`, and they are renamed into `seq_from_*` and `seq_to_*` (@gasche, ocaml-community/yojson#131). ### Fix - Avoid copying unnecessarily large amounts of strings when parsing (ocaml-community/yojson#85, ocaml-community/yojson#108, @Leonidas-from-XIV) - Fix `stream_to_file` (ocaml-community/yojson#133, @tcoopman and @gasche)
This should make Kappa compatible with the latest version of yojson. Since ocaml-community/yojson#74, yojson does not rely on biniou buffers. The main drawback seems to be that `JsonUtil.write_to_channel` now writes everything to the memory (in a `Buffer.t`) before dumping the buffer to the channel.
This should make Kappa compatible with the latest version of yojson. Since ocaml-community/yojson#74, yojson does not rely on biniou buffers. The main drawback seems to be that `JsonUtil.write_to_channel` now writes everything to the memory (in a `Buffer.t`) before dumping the buffer to the channel.
This should make Kappa compatible with the latest version of yojson. Since ocaml-community/yojson#74, yojson does not rely on biniou buffers. The main drawback seems to be that `JsonUtil.write_to_channel` now writes everything to the memory (in a `Buffer.t`) before dumping the buffer to the channel.
This should make Kappa compatible with the latest version of yojson. Since ocaml-community/yojson#74, yojson does not rely on biniou buffers. The main drawback seems to be that `JsonUtil.write_to_channel` now writes everything to the memory (in a `Buffer.t`) before dumping the buffer to the channel.
Looking at the dependency tree of Yojson, which is one of the more popular packages in the OCaml ecosystem I encountered Biniou, a serialization library. Biniou is not a particularly popular in OCaml, given it has (at time of writing) 11612 downloads per week in OPAM, which might seem a lot but my hypothesis is that this is mainly due to the dependency in yojson (1.5.0 accounts for 7469 downloads, 1.4.1 for 5120).
So therefore I was curious how tightly interconnected Yojson and Biniou are and it turns out… not much. I split the code that depends on Biniou into a new OPAM/findlib package, to preserve compatibility with code that requires it. The remaining code was using
Bi_outbuf
which can be replaced nearly 1:1 withBuffer
. This changes the signature of some functions, but the advantage is that it replaces the somewhatBi_outbuf.t
type withBuffer.t
which, due to the fact it is in the compiler standard library is somewhat more common in the OCaml community.I haven't run many tests except for very basic "does it even still work" but to my understanding
Bi_outbuf
andBuffer
should have similar asymptotic behaviour since they are designed to fit a similar purpose.Edit: It was noted on IRC that it is not entirely clear, but the changes in
write.mli
andread.mli
change the public API, thus requiring something bumping the major number. Judging by the fact that these changes only affect programs that use Biniou AND yojson and there are hardly any on OPAM, I believe this is a reasonable trade-off.