Add read-values and write-values #53

Closed
wants to merge 3 commits into from

Contributor

bsless commented May 21, 2021

read-values dispatches to the ReadValues protocol. It returns an
iterator via an ObjectReader derived from the supplied mapper.
The returned iterator is reified in a manner similar to Eduction to
support reduction and sequence construction over it.

write-values relies on two protocols: WriteValues for the output
destination, similar to WriteValue, and WriteAll for the type being
written, which can be an array or an Iterable.
It writes the array or Iterable to the destination via a SequenceWriter.
Importantly, write-values disables automatic flushing on serialization
to get good performance.

Contributor Author

bsless commented May 21, 2021

This adds support for reading and writing large sequences without materializing them in memory.
A few choices I made that I'm not sure about:

  • using .writeValuesAsArray rather than .writeValues; I wanted to ensure equivalence between input and output sources
  • how I implemented the iterator wrapper

Contributor Author

bsless commented May 23, 2021

@ikitommi do you know why the workflow failed when setting up the environment? The error is from tar, of all things.
Did I do something wrong or does it need to be rerun?

  [^Iterator iterator]
  (when iterator
    (reify
      Iterable
Contributor Author

One thing I'm unsure of is implementing Iterable here. Is it a good idea to return an object which is both an Iterable and Iterator?
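
To make the question concrete, here is a minimal sketch of the pattern being asked about, not the PR's actual code. `wrap-iterator` is a hypothetical name, and the sketch assumes the wrapper should behave like an Eduction over a one-shot source. Because `iterator` returns the object itself, every consumer shares the same underlying cursor, so the result can only be traversed once.

```clojure
(import '(java.util Iterator))

(defn wrap-iterator
  "Hypothetical helper: wraps a one-shot Iterator so that the result is
  an Iterable, an Iterator and directly reducible."
  [^Iterator it]
  (when it
    (reify
      Iterable
      ;; Returns itself, so all traversals share one underlying cursor.
      (iterator [this] this)
      Iterator
      (hasNext [_] (.hasNext it))
      (next [_] (.next it))
      clojure.lang.IReduceInit
      ;; Reduces straight off the iterator, honouring early termination.
      (reduce [_ f init]
        (loop [acc init]
          (if (.hasNext it)
            (let [acc (f acc (.next it))]
              (if (reduced? acc)
                @acc
                (recur acc)))
            acc))))))

(def wrapped (wrap-iterator (.iterator [1 2 3])))
(def total (reduce + 0 wrapped))   ; consumes the underlying iterator
(def leftover (seq wrapped))       ; nothing remains for a second pass
```

The trade-off is that callers who expect the usual `Iterable` contract (a fresh iterator on every call) will be surprised, which is the concern raised in this comment.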

Member

Deraen commented Jun 4, 2021

Is read-values just for reading, e.g., top-level array elements?

I recently implemented the following example, where I used readTree to get one property from the top-level object and then created a lazy seq from the items in that property. I'm not sure if this uses streaming, but I'm quite sure it prevented storing the whole array in a vector (https://github.com/metosin/jsonista/blob/master/src/java/jsonista/jackson/PersistentVectorDeserializer.java):

(defn read-geojson-features
  "Try reading geojson file without loading full features array to memory"
  [^Reader f]
  (let [^JsonNode tree (.readTree json/default-object-mapper f)
        ^JsonNode node (.get tree "features")]
    (->> (map (fn [node]
                (.treeToValue json/default-object-mapper node ^Class Object))
              node))))

Is it possible to use streaming in such cases? If not, maybe avoiding the creation of vectors for big arrays is a separate issue.

Contributor Author

bsless commented Jun 5, 2021

@Deraen, thanks for giving it a look.
read-values is just for reading top-level array elements, as that is the context in which I considered streaming. I don't know if maps are "streamable" in that sense. It looks like JsonNode does expose a method returning an Iterator of Map.Entry, so yes.
The main difference in your implementation is that .readTree seems to be eager: it reads the whole input, but keeps an internal node representation instead of mapping it into an external object. In that sense it is lazy. If you return a reducible/iterator instead of mapping over the node, it will be even lazier.
The use case I was trying to solve is one where you already know you're going to be reading a large array or dealing with some stream of data.
We can divide the possible solutions into three degrees of laziness:

  1. zero laziness: This is what current Jsonista supports
  2. partial laziness: This is your solution. It is slightly more general in that it allows querying the entire JSON structure and mapping over map entry pairs. Its downside is that it reads the entire data into memory and creates the JsonNode tree. That may not be desirable in some cases, where the tree can be very large. It still does not create the intermediary Clojure objects. Writing an EQL parser which compiles to it could be interesting.
  3. full top-level laziness: My implementation assumes the top-level node is an array and exposes an iterator/reducible over it. Its elements are fully deserialized.

I think these solutions are fundamentally different. I don't think lazy streaming could be generalized beyond 3, but partial laziness like your solution is an avenue to explore. These are separate issues, use cases and requirements, in my estimation.

Member

Deraen commented Jun 5, 2021

Yeah, that makes sense.

I'll try to look a bit more into case 2, if there is still something that will be shared with this case. Before introducing new API here, I want to understand if we could cover both cases with similar functions.

Maybe I'll need to read JsonNode impl, or profile memory use with readTree.

Member

Deraen commented Jun 5, 2021

It is possible to also use stream reading to read values from an array inside an object: https://github.com/metosin/jsonista/compare/stream-testing
https://cassiomolin.com/2019/08/19/combining-jackson-streaming-api-with-objectmapper-for-parsing-json/

One just needs to navigate the parser to the array start token first.

I guess lazy-seq is doing some caching, so the example is not optimal, but I didn't quickly find a better way to call .readValueAs until the END_ARRAY token is found.

I don't think we need to provide functions to move the parser, but maybe something to make it easier to efficiently read array values once the parser is in the correct position?

Member

Deraen commented Oct 1, 2021

wrap-values is currently private, but it would be useful if a user wants to call e.g. readValuesAs themselves. Is that fn needed because the Iterators from Jackson don't implement Iterable themselves?

What's the difference between wrap-values and clojure.core/iterator-seq? Chunking? Though an Iterable is turned into a seq with the same method.
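
One observable difference can be sketched with plain Clojure and no Jackson involved. This is an illustration under assumptions, not a description of the PR's wrap-values: `one-shot-iterable` is a hypothetical stand-in that only exposes an Iterator as an Iterable. A seq built with `iterator-seq` caches every element it realizes, so it can be traversed repeatedly, while reducing over the Iterable pulls elements straight from the iterator and retains nothing.

```clojure
(import '(java.util ArrayList Iterator))

(defn one-shot-iterable
  "Hypothetical stand-in for wrap-values: exposes an Iterator as a
  single-pass Iterable without caching anything."
  [^Iterator it]
  (reify Iterable
    (iterator [_] it)))

;; iterator-seq realizes elements into a persistent, cached seq.
(def cached (iterator-seq (.iterator (ArrayList. [1 2 3]))))

;; The Iterable is reduced directly off the underlying iterator.
(def one-shot (one-shot-iterable (.iterator (ArrayList. [1 2 3]))))

(def seq-passes  [(reduce + 0 cached)   (reduce + 0 cached)])
(def iter-passes [(reduce + 0 one-shot) (reduce + 0 one-shot)])
```

Here `seq-passes` is `[6 6]` because the seq keeps its elements, while `iter-passes` is `[6 0]` because the second reduction finds the iterator already exhausted.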

Contributor Author

bsless commented Oct 1, 2021

@Deraen not exposing a seq API over read-values was intentional. It returns something very similar to an Eduction. A user can always transform it into a lazy-seq and get everything associated with it, but the other way around? Not so much. Lazy seqs just create data buffers in memory. I want to be able to stream data from input to output directly. Imagine reading a byte stream with read-values and writing it out with write-values: no intermediary allocations or buffering, directly bytes to bytes (or stream to stream). This implementation returns an Iterable; you can wrap it with an Eduction, which is also an Iterable, then write it out with write-values.
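
The shape of that pipeline can be sketched with stand-ins, assuming only clojure.core and the JDK. The `ArrayList` plays the role of the Iterable returned by read-values and the `StringWriter` plays the role of the write-values destination; neither is the PR's API. The point is that `eduction` attaches the transformation without building an intermediate collection, and its result is itself an Iterable.

```clojure
(import '(java.io StringWriter) '(java.util ArrayList))

;; Stand-in for the Iterable that read-values would return.
(def source (ArrayList. [1 2 3 4 5 6]))

;; Stand-in for the destination that write-values would write to.
(def sink (StringWriter.))

;; No work happens here; the transformation runs when the eduction is consumed.
(def pipeline (eduction (filter even?) (map inc) source))

;; Each element flows from source to sink, one at a time.
(run! (fn [x] (.write ^StringWriter sink (str x "\n"))) pipeline)

(def output (str sink))
```

Since `pipeline` is an Iterable, it could be handed to anything that accepts one, which is what makes the read, transform, write chain described above possible.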
