Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADR-007: Universal Share Prefix #660

Merged
merged 24 commits into from
Oct 6, 2022
Merged
Changes from 11 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
ad050e2
Draft universal share encoding ADR
rootulp Aug 24, 2022
fb338a1
address @evan-forbes feedback
rootulp Aug 25, 2022
bde5a90
Prefer reserved / unreserved namespace terminology
rootulp Aug 25, 2022
5b42edf
Merge branch 'main' into rp/adr-universal-share-encoding
rootulp Aug 26, 2022
3ddf2dd
Open question: continuation share indicator
rootulp Aug 26, 2022
2a9b792
Update docs/architecture/ADR-002-universal-share-encoding.md
rootulp Aug 29, 2022
40d195b
clarify: share -> message
rootulp Aug 29, 2022
ae2f06b
Merge branch 'main' into rp/adr-universal-share-encoding
rootulp Aug 30, 2022
c9a2fd5
docs: use compact vs sparse
rootulp Aug 31, 2022
b4b44d0
fix typo, message length -> data length
rootulp Aug 31, 2022
82191cf
Add example
rootulp Sep 2, 2022
af0d256
Update docs/architecture/ADR-002-universal-share-encoding.md
rootulp Sep 7, 2022
8cc6476
prefer namespace_id over nid
rootulp Sep 16, 2022
a0c7359
update impl details
rootulp Sep 16, 2022
2eb9e35
clarify existing schema
rootulp Sep 16, 2022
41fbf05
Clarify proposal
rootulp Sep 16, 2022
b6872a8
Merge branch 'main' into rp/adr-universal-share-encoding
rootulp Sep 26, 2022
21e7fe1
chore: rename encoding to prefix, fix adr number
rootulp Sep 26, 2022
2f26c38
rename ADR number, introduce share sequence, add new validity rules
rootulp Oct 3, 2022
07c0a9f
Update docs/architecture/adr-007-universal-share-prefix.md
rootulp Oct 6, 2022
c76c0fc
Update docs/architecture/adr-007-universal-share-prefix.md
rootulp Oct 6, 2022
17a0d05
Update docs/architecture/adr-007-universal-share-prefix.md
rootulp Oct 6, 2022
e680e8d
Update docs/architecture/adr-007-universal-share-prefix.md
rootulp Oct 6, 2022
8aa8a5c
docs: Current Sparse Share Schema
rootulp Oct 6, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 141 additions & 0 deletions docs/architecture/ADR-002-universal-share-encoding.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# ADR 009: Universal Share Encoding
rootulp marked this conversation as resolved.
Show resolved Hide resolved
<!-- disable markdownlint MD010 because Go code snippet uses tabs -->
<!-- markdownlint-disable MD010 -->

## Terminology

- **nid** (8 bytes): namespace id
- **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one
- **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes
rootulp marked this conversation as resolved.
Show resolved Hide resolved
- **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible.
- **sparse share**: a type of share that can accomodate zero or one unit. Currently, sparse shares are used for messages.

## Context

The current compact share format is:<br>`nid (8 bytes) | reserved (1 byte) | data`
rootulp marked this conversation as resolved.
Show resolved Hide resolved

The current sparse share format is:

- First share of message:<br>`nid (8 bytes) | message length (varint 1 to 10 bytes) | data`
- Contiguous share in message:<br>`nid (8 bytes) | data`

The current share format poses multiple challenges:

1. Clients must have two share parsing implementations (one for compact shares and one for sparse shares).
1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to.
1. It is not possible for a client that samples a random share to determine if the share is the first share of a message or a contiguous share in the message.

## Proposal

Introduce a universal share encoding that applies to both compact and sparse shares:

- First share of namespace for compact shares or message for sprase shares:<br>`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data`
- Contiguous share in namespace for compact shares or message for sparse shares:<br>`nid (8 bytes) | info (1 byte) | data`

Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:<br>`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data`
rootulp marked this conversation as resolved.
Show resolved Hide resolved

Where **info** (1 byte) is a byte with the following structure:

- the first 7 bits are reserved for the version information in big endian form (initially, this will be `0000000` for version 0);
- the last bit is a **message start indicator**, that is `1` if the share is at the start of a message or `0` if it is a contiguous share within a message.
rootulp marked this conversation as resolved.
Show resolved Hide resolved

Rationale:

1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (compact or sparse). Clients can therefore parse shares into data via one mechanism rather than two.
1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace.
1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block.

## Example

| share number | 10 | 11 | 12 | 13 |
| ----------------------- | -------------------------------- | -------------------------------- | -------------------------------- | -------------------------------- |
| namespace | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{2, 2, 2, 2, 2, 2, 2, 2}` |
| version | `0000000` | `0000000` | `0000000` | `0000000` |
| message start indicator | `1` | `1` | `0` | `1` |
| data | foo | bar | bar (continued) | buzz |

Without the universal share format: if a client is provided share 11, they have no way of knowing that a message length delimiter is encoded in this share. In order to parse the bar message, they must request and download all shares in this namespace (shares 10 and 12) and parse them in-order to determine the length of the bar message.

With the universal share format: if a client is provided share 11, they know from the prefix that share 11 is the start of a message and can therefore parse the message length delimiter in share 11. With the parsed message length, the client knows that the bar message will complete after reading N bytes (where N includes shares 11 and 12) and can therefore avoid requesting and downloading share 10.

## Questions

1. Does the info byte introduce any new attack vectors?
1. Should one bit in the info byte be used to signify that a continuation share is expected after this share?
- This **continuation share indicator** is inspired by [protocol buffer varints](https://developers.google.com/protocol-buffers/docs/encoding#varints) and [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
- The **continuation share indicator** is distinct from the **message start indicator**. Consider a message with 3 contiguous shares:

| share number | 1 | 2 | 3 |
| ---------------------------- | --- | --- | ------------------------------------------------------------------------ |
| message start indicator | `1` | `0` | `0` |
| continuation share indicator | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` |

- This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. This depends on how roll-ups encode the data in a `PayForData` transaction.
- Without the continuation share indicator, the client would have to request the first share of the message to parse the message length. If they don't request the first share, they can request contiguous shares until they reach the first share after their message ends to learn that they completed requesting the previous message.

1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions?
- This can be considered invalid via a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a known valid version.
1. What happens if a block producer publishes a message where the message start indicator isn't set correctly?
- Add a check similar to the one above.

## Alternative Approaches

We briefly considered adding the info byte to only sparse shares, see <https://github.com/celestiaorg/celestia-app/pull/651>. This approach was a miscommunication for an earlier proposal and was deprecated in favor of this ADR.

## Decision

// TODO

## Implementation Details

A share version is not set by a user who submits a `PayForData`. Instead, it is set by the block producer when data is split into shares.

### Constants

1. Define a new constant for `InfoReservedBytes = 1`.
1. Update [`MsgShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L26) to account for one less byte available
1. Update [`TxShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L24) to account for one less byte available

**NOTE**: Currently constants are defined in celestia-core's [consts.go](https://github.com/celestiaorg/celestia-core/blob/master/pkg/consts/consts.go) but some will be moved to celestia-app's [appconsts.go](https://github.com/celestiaorg/celestia-app/tree/evan/non-interactive-defaults-feature/pkg/appconsts). See [celestia-core#841](https://github.com/celestiaorg/celestia-core/issues/841).

### Types

1. Introduce a new type `InfoReservedByte` to encapsulate the logic around getting the `Version()` or `IsMessageStart()` from a share.

### Logic

#### celestia-core

1. Account for the new `InfoReservedByte` in `./types/share_splitting.go` and `./types/share_merging.go`.
- **NOTE**: These files are subject to be deleted soon. See <https://github.com/celestiaorg/celestia-core/issues/842>

#### celestia-app

1. Account for the new `InfoReservedByte` in all share splitting and merging code.

## Status

Proposed

## Consequences

### Positive

This proposal resolves challenges posed above.

### Negative

This proposal reduces the number of bytes a share can use for data by one byte.

### Neutral

If 127 versions is larger than required, the share format can be updated (in a subsequent version) to reserve fewer bits for the version in order to use some bits for other purposes.

If 127 versions is smaller than required, the share format can be updated (in a subsequent version) to occupy multiple bytes for the version. For example if the 7 bits are `1111111` then read an additional byte.

## References

- <https://github.com/celestiaorg/celestia-core/issues/839>
- <https://github.com/celestiaorg/celestia-core/issues/759>
- <https://github.com/celestiaorg/celestia-core/issues/757>
- <https://github.com/celestiaorg/celestia-app/issues/659>