Provide an API for performant bulk or sequential read/writes #135

Closed
fzhinkin opened this issue Jun 12, 2023 · 6 comments

Comments

@fzhinkin
Collaborator

As stated in #132, kotlinx-io doesn't include the Okio APIs aimed at performant sequential and bulk operations (like UnsafeCursor).

Such an API is required to implement previously removed functionality (like select) efficiently. It's also required for integration with platform-specific APIs (for example, implementing hash functions using Java's MessageDigest API).

This issue states the intent to provide such functionality; the actual API will be discussed later.

@fzhinkin fzhinkin self-assigned this Jun 12, 2023
@whyoleg
Contributor

whyoleg commented Jun 14, 2023

Just wanted to mention that, from my perspective, this should be implemented taking into account the future feature from #131: different memory types (in addition to byte arrays) that Buffers are built upon (like NIO ByteBuffers on the JVM).
Use-case: ktorio/ktor-io#7
Ideally it should be possible to use the JDK MessageDigest/Cipher APIs for a Buffer with multiple underlying segments of different types like ByteBuffer, ByteArray, MemorySegment, and even Netty ByteBuf (which allows getting a ByteBuffer from it).

@fzhinkin
Collaborator Author

An implementation will definitely take different memory types into account; otherwise, it'll be hard to make reads/writes performant.

@rnett

rnett commented Apr 20, 2024

I am trying to implement a custom HashingSink (using Guava's hash functions, but that doesn't really matter) and found that implementing a custom sink is rather difficult, because it's not easy to read (groups of) bytes from a Buffer yourself, especially without "consuming" them (so that I can forward them to the "real" sink). This makes it very hard to implement custom sinks or sources. It seems like this issue would be the feature responsible for solving this; is that correct? If not, I'll create a new one.

@fzhinkin
Collaborator Author

@rnett, that's correct. Once solved, this issue should make it easier to implement things like HashingSink. I'm currently finalizing the design and will post it soon.

@fzhinkin fzhinkin added this to the kotlinx-io stabilization milestone May 6, 2024
@fzhinkin
Collaborator Author

fzhinkin commented May 22, 2024

Here's an update with the details on an API we will implement.

Intro

In general, there are two groups of cases where the existing public API works poorly, both performance- and convenience-wise:

  • integration with other frameworks/libraries/platform APIs;
  • custom extensions.

The first group usually requires taking as much data as possible and passing it down to another API, or reserving some
empty space in advance and then supplying it to the underlying API to fill up.
For example, POSIX declares the vectored I/O functions readv and writev, which allow reading/writing multiple chunks of
data using a single system call (if you're familiar with the Java NIO API, GatheringByteChannel and ScatteringByteChannel provide the
same functionality).
If one wants to integrate kotlinx-io with such an API, there is no efficient way to do it, as data stored within
a buffer can only be accessed by copying.
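As a concrete illustration of the vectored-I/O pattern mentioned above, here is a small Java sketch (class and method names are mine, purely illustrative) using NIO's GatheringByteChannel, the JVM counterpart of writev: several buffers, such as one per segment, are written with a single call and no intermediate copy.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Gathering write: hand the channel an array of buffers (e.g. one per
// hypothetical buffer segment) and let it write them in one operation.
public final class GatheringWrite {
    public static long writeSegments(Path target, ByteBuffer[] segments) throws Exception {
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // write(ByteBuffer[]) is the gathering (writev-style) variant.
            return ch.write(segments);
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("gather", ".bin");
        ByteBuffer[] segs = {
            ByteBuffer.wrap("foo".getBytes()),
            ByteBuffer.wrap("bar".getBytes())
        };
        System.out.println(writeSegments(tmp, segs)); // total bytes written: 6
        Files.delete(tmp);
    }
}
```

Integrating kotlinx-io with such an API efficiently would require handing the channel the segments' storage directly, which is exactly what the copy-only access prevents today.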

The second group is about various extensions that need to access a variable number of bytes in an efficient manner.
For example, to read a LEB128-encoded value, one has to read data byte by byte until either the value is fully decoded
or an error condition is detected. The only way to implement such an extension right now is by calling
Source::readByte, which, when called sequentially multiple times, has significant overhead for such a scenario
compared to direct access to the buffer's underlying bytes.
Yet another example from this group is computing a checksum/digest over the buffer's content without consuming it:
one must somehow iterate over the buffer's content and pass it to some API, like Java's MessageDigest.
There's no way this can be implemented efficiently with the existing API.
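To make the LEB128 example concrete, here is the decoding loop in question, sketched in plain Java over a raw byte array (a stand-in for a segment's backing storage; the class and method names are illustrative, not kotlinx-io API). This tight loop is exactly the kind of code that suffers when every byte costs a readByte call.

```java
// Decoding an unsigned LEB128 value directly from a backing array:
// each byte contributes its low 7 bits; the high bit flags continuation.
public final class Leb128 {
    static long readUnsignedLeb128(byte[] data, int offset) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[offset++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result; // high bit clear: last byte
            shift += 7;
            if (shift >= 64) throw new IllegalArgumentException("LEB128 value too long");
        }
    }

    public static void main(String[] args) {
        // 624485 encodes as E5 8E 26 (the classic LEB128 example).
        byte[] encoded = { (byte) 0xE5, (byte) 0x8E, 0x26 };
        System.out.println(readUnsignedLeb128(encoded, 0)); // 624485
    }
}
```

With direct access to the segment's bytes, the whole value decodes with plain array reads; through Source::readByte, each iteration pays the per-call overhead instead.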

Proposed API

One of the goals we're trying to achieve with an unsafe/bulk API is to provide a way to write efficient code, meaning, among other things, that the amount of data copying should be minimal. That requires segment data to be directly accessible.
However, we also want an abstraction layer above the existing byte-array-backed segments so we can change the underlying container from a byte array to a Java ByteBuffer/native off-heap buffer/Wasm linear memory/etc. in the future without breaking clients' code.
There are not many ways to guarantee zero-copy access and, at the same time, fully abstract the underlying type away. Instead, we decided to split the API into three logical tiers, where only one of them provides direct access
to the segment's internal container, and it's zero-copy on a best-effort basis only.
With that weak guarantee, an API returning/consuming byte arrays continues to work even after segments change
their underlying container, just with reduced performance. And if the underlying container type changes
on one of the platforms, corresponding zero-copy accessors will be provided there.

The proposed API consists of the following three tiers:

  1. a bulk API for reading/writing data from/to buffers
  2. an API to iterate over buffer's segments
  3. an API to read segment data / append data into a buffer with byte granularity

These tiers are intertwined, and sometimes boundaries are fuzzy. I'm mentioning them mostly to structure the document and make it more approachable.

1. A Bulk API

The API is aimed at integration with other APIs and consists of functions for bulk reads and writes, such as readFromHead and writeToTail, covered in the tiers below.

On the JVM, extensions operating on platform types will additionally be provided.

There will be no writeBulk at the moment, but it might be added later.

Additionally, there's a SegmentReadContext.withData function that allows accessing segment data.
This function makes sense in the context of buffer iteration (the next section), but I'm mentioning it here as it provides bulk access.

Note that for Native, extensions similar to the JVM ones could also be provided.
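As an illustration of the kind of consumer a bulk read serves, here is a JVM sketch in plain Java. The (array, start, end) triples are hand-built stand-ins for what a bulk read callback would supply per segment; the class and method names are mine, not part of the proposed API. The key point is that MessageDigest consumes each readable range in place, with no copy.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Feed per-segment (array, start, end) ranges into a JDK digest.
public final class DigestOverSegments {
    public static String sha256Hex(byte[][] segments, int[] starts, int[] ends) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (int i = 0; i < segments.length; i++) {
                // update(array, offset, len) reads the range without copying it out.
                digest.update(segments[i], starts[i], ends[i] - starts[i]);
            }
            return HexFormat.of().formatHex(digest.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is a mandatory JDK algorithm", e);
        }
    }

    public static void main(String[] args) {
        byte[] backing = "hello world".getBytes();
        // Two "segments" covering "hello " and "world" within the same array.
        System.out.println(sha256Hex(
                new byte[][]{backing, backing}, new int[]{0, 6}, new int[]{6, 11}));
    }
}
```

This is also the shape a HashingSink implementation (as asked about above) would take: hash each segment range, then forward the data to the downstream sink untouched.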

2. Buffer iteration API

To iterate over buffer segments, we're going to make Segment and some of its properties public.
Yet another change, aimed at simplifying iteration and allowing it to be done without any allocations, is that segments will no longer be organized into a circular queue, as they were previously. Instead, there will be separate head/tail references inside the buffer; segments continue referencing their predecessors and successors, but now the head's previous segment will be null, as will the tail's next segment.

With all that said, one can start iteration from either the buffer's head or the segment covering a specific offset:

These functions supply a BufferIterationContext that can be used to get the next segment.
If there is no next segment, the next call returns null.

BufferIterationContext extends SegmentReadContext,
allowing both getting a whole segment's data using withData and reading bytes individually using getUnchecked.
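The layout change described above can be modeled with a toy doubly-linked list, sketched here in plain Java (a deliberately minimal model; the real kotlinx-io Segment also carries positions and more state). The point of dropping the circular queue is visible in the traversal: iteration walks until next is null, with no sentinel comparison against the head and no allocations.

```java
// Toy model of the new segment layout: head/tail references, neighbors
// linked in both directions, null terminating both ends (no circular list).
final class Segment {
    final byte[] data;
    Segment prev, next;
    Segment(byte[] data) { this.data = data; }
}

final class SegmentList {
    Segment head, tail;

    void append(Segment s) {
        if (tail == null) { head = tail = s; return; }
        tail.next = s;
        s.prev = tail;
        tail = s;
    }

    // Allocation-free iteration: just follow next until it's null.
    int totalSize() {
        int size = 0;
        for (Segment s = head; s != null; s = s.next) size += s.data.length;
        return size;
    }

    public static void main(String[] args) {
        SegmentList list = new SegmentList();
        list.append(new Segment(new byte[3]));
        list.append(new Segment(new byte[2]));
        System.out.println(list.totalSize()); // 5
    }
}
```

In the circular layout, the same loop would have to compare each node against the head to know when to stop, which is the complication the change removes.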

3. Byte-granular read/write API

Given a SegmentReadContext (which can be obtained by calling readFromHead or iterate),
one can access individual segment bytes using getUnchecked.

Given a SegmentWriteContext (which can be obtained by calling writeToTail),
one can write individual bytes into the uncommitted part of a tail segment using setUnchecked.

The total number of readable bytes always lies in the range 0..Segment.size.

The total number of writable bytes always lies in the range 0..Segment.remainingCapacity.
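The two bounds above follow directly from Okio-style segment bookkeeping, assuming a segment tracks a read position and a write limit over a fixed-capacity array (a toy model; field and class names here are illustrative, not the real Segment internals): size is limit - pos, and remainingCapacity is capacity - limit, each necessarily between 0 and the capacity.

```java
// Minimal model of a segment's read/write window over a fixed array.
final class SegmentWindow {
    static final int CAPACITY = 8192;
    final byte[] data = new byte[CAPACITY];
    int pos = 0;    // index of the first readable byte
    int limit = 0;  // index of the first writable byte; pos <= limit <= CAPACITY

    int size() { return limit - pos; }                   // readable bytes, 0..CAPACITY
    int remainingCapacity() { return CAPACITY - limit; } // writable bytes, 0..CAPACITY

    void write(byte b) { data[limit++] = b; } // fills the uncommitted tail part
    byte read() { return data[pos++]; }       // consumes from the head
}
```

getUnchecked/setUnchecked skip the bounds checks a safe accessor would perform, which is why callers must stay within these ranges themselves.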

Alternatives

Several alternatives were considered, namely:

Using Okio's UnsafeCursor

UnsafeCursor is a really nice API, but it has to expose a reference to the underlying data container, so it
does not solve one of the major challenges we have with this API.

Providing some Iterator over segments

Iterators work well on the JVM, thanks to advanced JIT compilers, but on ART they result in excessive
memory allocations, which significantly slow down existing code.

Exposing the underlying container as Any

... so that users could check whether the container has an appropriate type, cast it to that type, and then simply use it.

That approach falls short when it comes to testing: there should always be a code path that somehow
handles a segment when its data type is not what was expected (for instance, when it's not a ByteArray).
Until there are polymorphic segments (and we're not implementing them yet), these paths are effectively untestable.

The plan

Assuming there are no concerns with the proposed API, internal API changes will be rolled out first, and
then the public API will follow.

@lppedd
Contributor

lppedd commented May 22, 2024

If kotlinx-benchmarks is in a good state, it would also be nice to add benchmarks for JS and Native to see how they compare over time.

fzhinkin added a commit that referenced this issue May 28, 2024
Previously, Buffer's segments were organized into a circular list.
That allowed storing only a single reference to buffer's head,
and also facilitated insertion/removal of new list nodes.
The downside of a circular list is that one has
to always compare a current node with a head when
iterating over segments.
That complicates the implementation of a public API
for segments iterations.

See #135 (comment)
for details on segment iteration API.
fzhinkin added a commit that referenced this issue May 29, 2024
The API is aimed at facilitating integration
with other frameworks and libraries.

The implemented API was initially described in the "Bulk API" subsection of
#135 (comment)
fzhinkin added a commit that referenced this issue Jun 6, 2024
Previously, Buffer's segments were organized into a circular list.
That allowed storing only a single reference to buffer's head,
and also facilitated insertion/removal of new list nodes.
The downside of a circular list is that one has
to always compare a current node with a head when
iterating over segments.
That complicates the implementation of a public API
for segments iterations.

See #135 (comment)
for details on segment iteration API.
fzhinkin added a commit that referenced this issue Jun 6, 2024
The API aimed to facilitate integration
with other frameworks and libraries.

Implemented API was initially described in the "Bulk API" subsection of
#135 (comment)
fzhinkin added a commit that referenced this issue Jun 6, 2024
The API aimed to facilitate integration
with other frameworks and libraries.

Implemented API was initially described in the "Bulk API" subsection of
#135 (comment)
fzhinkin added a commit that referenced this issue Jun 21, 2024
Previously, Buffer's segments were organized into a circular list.
That allowed storing only a single reference to the buffer's head
and also facilitated the insertion/removal of new list nodes.
The downside of a circular list is that one has
to always compare a current node with a head when
iterating over segments.
That complicates the implementation of a public API
for segment iterations.

See #135 (comment)
for details on segment iteration API.
fzhinkin added a commit that referenced this issue Jun 21, 2024
The API aimed to facilitate integration
with other frameworks and libraries.

Implemented API was initially described in the "Bulk API" subsection of
#135 (comment)
fzhinkin added a commit that referenced this issue Jun 26, 2024
* Introduce unsafe API for bulk read/write ops.

(Same message as the May 29 commit above.)
fzhinkin added a commit that referenced this issue Jun 26, 2024
…cess its data (#336)

This is the second portion of the unsafe API described in #135.
This particular part is aimed at facilitating the implementation of various Buffer extensions.

The implemented API was initially described in the "2. Buffer iteration API" and "3. Byte-granular read/write API" subsections of #135 (comment)

This PR concludes the changes required to implement the API from #135, but our journey won't end here! The next patch in the queue is rewriting all existing extensions using the new API and then encapsulating all segment internals.