From ad050e28678119adae02536db3ef5ce083ea1436 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Wed, 24 Aug 2022 14:21:00 -0400 Subject: [PATCH 01/21] Draft universal share encoding ADR --- .../ADR-002-universal-share-encoding.md | 152 ++++++++++++++++++ 1 file changed, 152 insertions(+) create mode 100644 docs/architecture/ADR-002-universal-share-encoding.md diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md new file mode 100644 index 0000000000..78baeb6ce3 --- /dev/null +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -0,0 +1,152 @@ +# ADR 009: Universal Share Encoding + + + +## Changelog + +- 2022/9/22: inital draft of InfoReservedByte +- 2022/9/24: update draft to Universal Share Encoding + +## Context + +The current contiguous (transaction, ISRs, evidence) share format is: + +- First share of namespace: `nid (8 bytes) | reserved byte | share data` +- Contiguous share in namespace: `nid (8 bytes) | share data` + +The current non-contigous (message) share format is: + +- First share of message: `nid (8 bytes) | message length (varint) | share data` +- Contiguous share in message: `nid (8 bytes) | share data` + +The current share format poses multiple challenges: + +1. Clients must have two share parsing implementations (one for contiguous shares and one for non-contiguous shares). +1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. +1. It is not possible for a client that samples a random share to determine if the share is the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces) or a contiguous share for a multi-share namespace / message. 
+ +## Proposal + +Introduce a universal share encoding that applies to both contiguous and non-contiguous share formats: + +- First share of namespace (for reserved namespaces) or message (for non-reserved namespaces): `nid (8 bytes) | info (1 byte)| message length (varint) | data` +- Contiguous shares in namespace / message: `nid (8 bytes) | info (1 byte)| data` + +The contiguous share format has the added constraint: + +- First share of namespace: the first byte of `data` is a reserved byte so the format is: `nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` +- Contiguous shares in namespace: no additional constraint + +Where info byte is a byte with the following structure: + +- the first 7 bits are reserved for the version information in big endian form (initially, this will just be 0000000 until further notice); +- the last bit is a *message start indicator*, that is 1 if the share is at the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces). + +Rationale: + +1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (contiguous or non-contiguous). Clients can therefore parse shares into data via one mechanism rather than two. +1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. +1. The version bits allow us to upgrade the share format in the future, if we need to do so in such a way that different share formats can be mixed within a block. + +## Questions + +1. Does the info byte introduce any new attack vectors? +1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions (initially only `0000000`)? + +## Alternative Approaches + +// TODO + +## Decision + +// TODO + +## Implementation Details + +### Protobuf + +1. 
(Potentially) add `Version` to [`MsgPayForData`](https://github.com/celestiaorg/celestia-app/blob/main/proto/payment/tx.proto#L44) + +**NOTE**: Protobuf does not support the byte type (see [Scalar Value Types](https://developers.google.com/protocol-buffers/docs/proto3#scalar)) so a `uint32` will be used for `Version`. Since `Version` is constrained to 2^7 bits (0 to 127 inclusive), a `Version` outside the supported range (i.e. 128) will seriealize / deserialize correctly but be considered invalid by the application. Adding this field increases the size of the message by one byte + protobuf overhead. + +### Constants + +1. Define a new constant for `InfoReservedBytes = 1`. +1. Update [`MsgShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L26) to account for one less byte available +1. Update [`TxShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L24) to account for one less byte available + +**NOTE**: Currently constants are defined in celestia-core's [consts.go](https://github.com/celestiaorg/celestia-core/blob/master/pkg/consts/consts.go) but some will be moved to celestia-app's [appconsts.go](https://github.com/celestiaorg/celestia-app/tree/evan/non-interactive-defaults-feature/pkg/appconsts). See [celestia-core#841](https://github.com/celestiaorg/celestia-core/issues/841). + +### Types + +1. Introduce a new type `InfoReservedByte` to encapsulate the logic around getting the `Version()` or `IsMessageStart()` from a share. + +```golang +// InfoReservedByte is a byte with the following structure: the first 7 bits are +// reserved for version information in big endian form (initially `0000000`). +// The last bit is a "message start indicator", that is `1` if the share is at +// the start of a message and `0` otherwise. 
+type InfoReservedByte byte + +func NewInfoReservedByte(version uint8, isMessageStart bool) (InfoReservedByte, error) { + if version > 127 { + return 0, fmt.Errorf("version %d must be less than or equal to 127", version) + } + + prefix := version << 1 + if isMessageStart { + return InfoReservedByte(prefix + 1), nil + } + return InfoReservedByte(prefix), nil +} + +// Version returns the version encoded in this InfoReservedByte. +// Version is expected to be between 0 and 127 (inclusive). +func (i InfoReservedByte) Version() uint8 { + version := uint8(i) >> 1 + return version +} + +// IsMessageStart returns whether this share is the start of a message. +func (i InfoReservedByte) IsMessageStart() bool { + return uint(i)%2 == 1 +} +``` + +### Logic + +#### celestia-core + +1. Account for the new `InfoReservedByte` in `./types/share_splitting.go` and `./types/share_merging.go`. + - **NOTE**: These files are subject to be deleted soon. See + +#### celestia-app + +1. Account for the new `InfoReservedByte` in all share splitting and merging code. There is an in-progress refactor of the relevant files. See + +## Status + +Proposed + +## Consequences + +### Positive + +This proposal resolves challenges posed above. + +### Negative + +This proposal reduces the number of bytes a message share can use for data by one byte. + +### Neutral + +If 127 versions is larger than required, the share format spec can be updated (in a subsequent version) to reserve fewer bits for the version in order to use some bits for other purposes. + +If 127 versions is smaller than required, the share format spec can be updated (in a subsequent version) to occupy multiple bytes for the version. For example if the 7 bits are `1111111` then read an additional byte. 
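To make the 7-bit version range discussed above concrete, here is a minimal sketch of the info byte layout: the version occupies the upper 7 bits and the message start indicator the lowest bit. The helper names `packInfoByte` and `unpackInfoByte` are hypothetical and are not part of celestia-app or celestia-core; they only illustrate the bit arithmetic.

```go
package main

import "fmt"

// maxVersion is the largest value that fits in the 7 version bits.
const maxVersion = 127

// packInfoByte is a hypothetical helper (an assumption for illustration):
// the version goes in the upper 7 bits, the message start indicator in the
// lowest bit.
func packInfoByte(version uint8, isMessageStart bool) (byte, error) {
	if version > maxVersion {
		return 0, fmt.Errorf("version %d does not fit in 7 bits", version)
	}
	b := version << 1
	if isMessageStart {
		b |= 1
	}
	return b, nil
}

// unpackInfoByte reverses packInfoByte.
func unpackInfoByte(b byte) (version uint8, isMessageStart bool) {
	return b >> 1, b&1 == 1
}

func main() {
	cases := []struct {
		version uint8
		start   bool
	}{{0, true}, {0, false}, {1, true}, {127, true}}

	for _, c := range cases {
		b, err := packInfoByte(c.version, c.start)
		if err != nil {
			panic(err)
		}
		v, s := unpackInfoByte(b)
		fmt.Printf("version=%d start=%t -> %08b (decoded: version=%d start=%t)\n",
			c.version, c.start, b, v, s)
	}

	// 128 needs an eighth bit, so it is rejected.
	_, err := packInfoByte(128, false)
	fmt.Println("version 128:", err)
}
```

Version 0 with the start bit set encodes to `00000001`, and version 127 with the start bit set uses every bit (`11111111`), which is why any larger version range would need a format change such as the extra byte mentioned above.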
+ +## References + +- +- +- +- From fb338a15aa26e502f10a5cbfd9167d8a5bbf8d65 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Thu, 25 Aug 2022 16:30:52 -0400 Subject: [PATCH 02/21] address @evan-forbes feedback --- .../ADR-002-universal-share-encoding.md | 33 ++++++++----------- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 78baeb6ce3..3b81402943 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -9,33 +9,31 @@ ## Context -The current contiguous (transaction, ISRs, evidence) share format is: +**nid**: namespace id +**reserved**: is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one +**message length**: is the length of the entire message in bytes -- First share of namespace: `nid (8 bytes) | reserved byte | share data` -- Contiguous share in namespace: `nid (8 bytes) | share data` +The current contiguous (transaction, ISRs, evidence) share format is:
`nid (8 bytes) | reserved (1 byte) | share data` The current non-contigous (message) share format is: -- First share of message: `nid (8 bytes) | message length (varint) | share data` -- Contiguous share in message: `nid (8 bytes) | share data` +- First share of message:
`nid (8 bytes) | message length (varint) | share data` +- Contiguous share in message:
`nid (8 bytes) | share data` The current share format poses multiple challenges: 1. Clients must have two share parsing implementations (one for contiguous shares and one for non-contiguous shares). 1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. -1. It is not possible for a client that samples a random share to determine if the share is the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces) or a contiguous share for a multi-share namespace / message. +1. It is not possible for a client that samples a random share to determine if the share is the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces) or a contiguous share. ## Proposal Introduce a universal share encoding that applies to both contiguous and non-contiguous share formats: -- First share of namespace (for reserved namespaces) or message (for non-reserved namespaces): `nid (8 bytes) | info (1 byte)| message length (varint) | data` -- Contiguous shares in namespace / message: `nid (8 bytes) | info (1 byte)| data` +- First share of namespace (for reserved namespaces) or message (for non-reserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint) | data` +- Contiguous shares in namespace / message:
`nid (8 bytes) | info (1 byte) | data` -The contiguous share format has the added constraint: - -- First share of namespace: the first byte of `data` is a reserved byte so the format is: `nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` -- Contiguous shares in namespace: no additional constraint +The contiguous share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` Where info byte is a byte with the following structure: @@ -46,16 +44,17 @@ Rationale: 1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (contiguous or non-contiguous). Clients can therefore parse shares into data via one mechanism rather than two. 1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. -1. The version bits allow us to upgrade the share format in the future, if we need to do so in such a way that different share formats can be mixed within a block. +1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. ## Questions 1. Does the info byte introduce any new attack vectors? 1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions (initially only `0000000`)? + 1. It seems like this could be a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a valid version. ## Alternative Approaches -// TODO +We briefly considered adding the info byte to only non-contiguous (message) shares, see . This approach was a miscommunication / earlier proposal and was deprecated in favor of this ADR. ## Decision @@ -63,11 +62,7 @@ Rationale: ## Implementation Details -### Protobuf - -1. 
(Potentially) add `Version` to [`MsgPayForData`](https://github.com/celestiaorg/celestia-app/blob/main/proto/payment/tx.proto#L44) - -**NOTE**: Protobuf does not support the byte type (see [Scalar Value Types](https://developers.google.com/protocol-buffers/docs/proto3#scalar)) so a `uint32` will be used for `Version`. Since `Version` is constrained to 2^7 bits (0 to 127 inclusive), a `Version` outside the supported range (i.e. 128) will seriealize / deserialize correctly but be considered invalid by the application. Adding this field increases the size of the message by one byte + protobuf overhead. +A share version is not set by a user who submits a `PayForData`. Instead, it is set by the block producer when data is split into shares. ### Constants From bde5a900db9f6fbc72a66daff398116eee790b36 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Thu, 25 Aug 2022 16:44:52 -0400 Subject: [PATCH 03/21] Prefer reserved / unreserved namespace terminology --- .../ADR-002-universal-share-encoding.md | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 3b81402943..e38194ea71 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -13,36 +13,36 @@ **reserved**: is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one **message length**: is the length of the entire message in bytes -The current contiguous (transaction, ISRs, evidence) share format is:
`nid (8 bytes) | reserved (1 byte) | share data` +The current reserved namespace (transaction, ISRs, evidence) share format is:
`nid (8 bytes) | reserved (1 byte) | share data` -The current non-contigous (message) share format is: +The current unreserved namespace (message) share format is: - First share of message:
`nid (8 bytes) | message length (varint) | share data` - Contiguous share in message:
`nid (8 bytes) | share data` The current share format poses multiple challenges: -1. Clients must have two share parsing implementations (one for contiguous shares and one for non-contiguous shares). +1. Clients must have two share parsing implementations (one for reserved namespace shares and one for unreserved namespace shares). 1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. -1. It is not possible for a client that samples a random share to determine if the share is the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces) or a contiguous share. +1. It is not possible for a client that samples a random share to determine if the share is the first share of that namespace or a contiguous share. ## Proposal -Introduce a universal share encoding that applies to both contiguous and non-contiguous share formats: +Introduce a universal share encoding that applies to both reserved and unreserved share formats: -- First share of namespace (for reserved namespaces) or message (for non-reserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint) | data` -- Contiguous shares in namespace / message:
`nid (8 bytes) | info (1 byte) | data` +- First share of namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint) | data` +- Contiguous shares in namespace:
`nid (8 bytes) | info (1 byte) | data` -The contiguous share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` +The reserved share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` Where info byte is a byte with the following structure: -- the first 7 bits are reserved for the version information in big endian form (initially, this will just be 0000000 until further notice); -- the last bit is a *message start indicator*, that is 1 if the share is at the start of a namespace (for reserved namespaces) / message (for non-reserved namespaces). +- the first 7 bits are reserved for the version information in big endian form (initially, this will just be `0000000` until further notice); +- the last bit is a *message start indicator*, that is `1` if the share is at the start of a namespace or `0` if it is a contiguous share within a namespace. Rationale: -1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (contiguous or non-contiguous). Clients can therefore parse shares into data via one mechanism rather than two. +1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (reserved or unreserved namespace). Clients can therefore parse shares into data via one mechanism rather than two. 1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. 1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. @@ -54,7 +54,7 @@ Rationale: ## Alternative Approaches -We briefly considered adding the info byte to only non-contiguous (message) shares, see . This approach was a miscommunication / earlier proposal and was deprecated in favor of this ADR. +We briefly considered adding the info byte to only unreserved namespace shares, see . This approach was a miscommunication or earlier proposal and was deprecated in favor of this ADR. 
## Decision @@ -131,13 +131,13 @@ This proposal resolves challenges posed above. ### Negative -This proposal reduces the number of bytes a message share can use for data by one byte. +This proposal reduces the number of bytes a share can use for data by one byte. ### Neutral -If 127 versions is larger than required, the share format spec can be updated (in a subsequent version) to reserve fewer bits for the version in order to use some bits for other purposes. +If 127 versions is larger than required, the share format can be updated (in a subsequent version) to reserve fewer bits for the version in order to use some bits for other purposes. -If 127 versions is smaller than required, the share format spec can be updated (in a subsequent version) to occupy multiple bytes for the version. For example if the 7 bits are `1111111` then read an additional byte. +If 127 versions is smaller than required, the share format can be updated (in a subsequent version) to occupy multiple bytes for the version. For example if the 7 bits are `1111111` then read an additional byte. 
## References From 3ddf2ddcc7c7061b84b2295e4a2e07f55472597c Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 26 Aug 2022 18:02:07 -0400 Subject: [PATCH 04/21] Open question: continuation share indicator --- .../ADR-002-universal-share-encoding.md | 36 +++++++++++++------ 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index e38194ea71..2fcf182275 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -9,15 +9,15 @@ ## Context -**nid**: namespace id -**reserved**: is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one -**message length**: is the length of the entire message in bytes +- **nid** (8 bytes): namespace id +- **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one +- **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes The current reserved namespace (transaction, ISRs, evidence) share format is:
`nid (8 bytes) | reserved (1 byte) | share data` The current unreserved namespace (message) share format is: -- First share of message:
`nid (8 bytes) | message length (varint) | share data` +- First share of message:
`nid (8 bytes) | message length (varint 1 to 10 bytes) | share data` - Contiguous share in message:
`nid (8 bytes) | share data` The current share format poses multiple challenges: @@ -30,27 +30,41 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both reserved and unreserved share formats: -- First share of namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint) | data` +- First share of namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | data` - Contiguous shares in namespace:
`nid (8 bytes) | info (1 byte) | data` -The reserved share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint) | reserved (1 byte) | data` +The reserved share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | reserved (1 byte) | data` -Where info byte is a byte with the following structure: +Where **info** (1 byte) is a byte with the following structure: -- the first 7 bits are reserved for the version information in big endian form (initially, this will just be `0000000` until further notice); +- the first 7 bits are reserved for the version information in big endian form (initially, this will be `0000000` for version 0); - the last bit is a *message start indicator*, that is `1` if the share is at the start of a namespace or `0` if it is a contiguous share within a namespace. Rationale: 1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (reserved or unreserved namespace). Clients can therefore parse shares into data via one mechanism rather than two. -1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. +1. The message start indicator allows clients to parse a whole share in the middle of a namespace, without needing to read the whole namespace. 1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. ## Questions 1. Does the info byte introduce any new attack vectors? -1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions (initially only `0000000`)? - 1. It seems like this could be a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a valid version. +1. Should one bit in the info byte be used to signify that a continuation share is expected after this share? 
+ - This **continuation share indicator** is inspired by [protocol buffer varints](https://developers.google.com/protocol-buffers/docs/encoding#varints) and [UTF-8](https://en.wikipedia.org/wiki/UTF-8). + - The **continuation share indicator** is distinct from the **message start indicator**. Consider a message with 3 contiguous shares: + + indicator | share 1 | share 2 | share 3 + --- | --- | --- | --- + message start | `1` | `0` | `0` + continuation share | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` + + - This would enable clients to begin parsing a message by sampling a share in the middle of a namespace and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. + - Without the continuation share indicator, the client would have to request the first share of the next message to learn that they had completed requesting the previous message. + +1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions? + - It seems like this could be a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a valid version. +1. What happens if a block producer publishes a message where the message start indicator isn't set correctly? + - Add a check similar to the one above. 
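The two validity checks suggested in the answers above could look roughly like the following sketch. It assumes shares are raw byte slices in the proposed layout (`nid (8 bytes) | info (1 byte) | ...`) and that the validator has already recomputed which shares should start a message. `validateInfoBytes`, `supportedVersions`, and `newShare` are hypothetical names for illustration, not existing `ProcessProposal` code.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	namespaceIDSize = 8               // nid (8 bytes)
	infoByteOffset  = namespaceIDSize // the info byte follows the nid
)

// supportedVersions is an assumed allow-list; the ADR starts with version 0.
var supportedVersions = map[uint8]bool{0: true}

// validateInfoBytes checks that every share carries a supported version and
// that its message start indicator matches wantStart[i], the value the
// validator expects after recomputing the shares itself.
func validateInfoBytes(shares [][]byte, wantStart []bool) error {
	if len(shares) != len(wantStart) {
		return errors.New("shares and wantStart must have the same length")
	}
	for i, share := range shares {
		if len(share) <= infoByteOffset {
			return fmt.Errorf("share %d: too short to contain an info byte", i)
		}
		info := share[infoByteOffset]
		version := info >> 1
		isStart := info&1 == 1
		if !supportedVersions[version] {
			return fmt.Errorf("share %d: unsupported share version %d", i, version)
		}
		if isStart != wantStart[i] {
			return fmt.Errorf("share %d: message start indicator is %t, want %t",
				i, isStart, wantStart[i])
		}
	}
	return nil
}

// newShare builds a toy share: zeroed 8-byte namespace id, info byte, data.
func newShare(info byte, data ...byte) []byte {
	share := make([]byte, namespaceIDSize, namespaceIDSize+1+len(data))
	share = append(share, info)
	return append(share, data...)
}

func main() {
	good := [][]byte{newShare(0x01, 0xAA), newShare(0x00, 0xBB)}
	fmt.Println("good:", validateInfoBytes(good, []bool{true, false}))

	badVersion := [][]byte{newShare(0x03, 0xAA)} // version 1, start bit set
	fmt.Println("bad version:", validateInfoBytes(badVersion, []bool{true}))

	badStart := [][]byte{newShare(0x01), newShare(0x01)} // second share wrongly marked as start
	fmt.Println("bad start:", validateInfoBytes(badStart, []bool{true, false}))
}
```

Rejecting the whole proposal on the first bad share keeps the check simple; whether a real implementation should do that is one of the open questions this section raises.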
## Alternative Approaches From 2a9b792413b790f453821f6046d5e16e883523fc Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Mon, 29 Aug 2022 11:45:11 -0400 Subject: [PATCH 05/21] Update docs/architecture/ADR-002-universal-share-encoding.md --- docs/architecture/ADR-002-universal-share-encoding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 2fcf182275..ed752af448 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -31,7 +31,7 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both reserved and unreserved share formats: - First share of namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | data` -- Contiguous shares in namespace:
`nid (8 bytes) | info (1 byte) | data` +- Contiguous shares in namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | data` The reserved share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | reserved (1 byte) | data` From 40d195b9ba89280bab61b119f9863b6b074d6d34 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Mon, 29 Aug 2022 11:46:49 -0400 Subject: [PATCH 06/21] clarify: share -> message --- docs/architecture/ADR-002-universal-share-encoding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index ed752af448..0a38e98317 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -43,7 +43,7 @@ Where **info** (1 byte) is a byte with the following structure: Rationale: 1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (reserved or unreserved namespace). Clients can therefore parse shares into data via one mechanism rather than two. -1. The message start indicator allows clients to parse a whole share in the middle of a namespace, without needing to read the whole namespace. +1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. 1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. 
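As an illustration of the first rationale point, a single parser can handle the common prefix of any share in the proposed encoding. The sketch below is an assumption-laden example rather than the celestia-app implementation: `sharePrefix` and `parseSharePrefix` are hypothetical names, and the message length is decoded with the standard library's unsigned varint reader.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const namespaceIDSize = 8 // nid (8 bytes)

// sharePrefix holds the fields common to every share in the proposed format.
type sharePrefix struct {
	NamespaceID    []byte
	Version        uint8
	IsMessageStart bool
	MessageLength  uint64 // only populated when IsMessageStart is true
	Data           []byte // bytes that follow the prefix
}

// parseSharePrefix is a hypothetical helper: it reads nid, the info byte and,
// for start shares only, the varint-encoded message length.
func parseSharePrefix(share []byte) (sharePrefix, error) {
	if len(share) < namespaceIDSize+1 {
		return sharePrefix{}, errors.New("share is shorter than nid + info byte")
	}
	info := share[namespaceIDSize]
	p := sharePrefix{
		NamespaceID:    share[:namespaceIDSize],
		Version:        info >> 1,
		IsMessageStart: info&1 == 1,
	}
	rest := share[namespaceIDSize+1:]
	if p.IsMessageStart {
		length, n := binary.Uvarint(rest)
		if n <= 0 {
			return sharePrefix{}, errors.New("invalid message length varint")
		}
		p.MessageLength = length
		rest = rest[n:]
	}
	p.Data = rest
	return p, nil
}

func main() {
	nid := []byte{0, 0, 0, 0, 0, 0, 0, 42}

	// First share: info byte 0x01 (version 0, start), length 300, then data.
	lenBuf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(lenBuf, 300)
	first := append([]byte{}, nid...)
	first = append(first, 0x01)
	first = append(first, lenBuf[:n]...)
	first = append(first, 0xDE, 0xAD)

	// Contiguous share: info byte 0x00 (version 0, not a start), then data.
	next := append([]byte{}, nid...)
	next = append(next, 0x00, 0xBE, 0xEF)

	for _, share := range [][]byte{first, next} {
		p, err := parseSharePrefix(share)
		if err != nil {
			panic(err)
		}
		fmt.Printf("version=%d start=%t length=%d data=%x\n",
			p.Version, p.IsMessageStart, p.MessageLength, p.Data)
	}
}
```

The sketch does not interpret the extra reserved byte that compact shares carry at the start of `data`; a caller that knows the namespace is reserved would read it from `Data[0]`.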
## Questions From c9a2fd52a04df1302a236c7efb613e2d9a6d28fc Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Wed, 31 Aug 2022 14:04:31 -0400 Subject: [PATCH 07/21] docs: use compact vs sparse --- .../ADR-002-universal-share-encoding.md | 73 ++++++------------- 1 file changed, 23 insertions(+), 50 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 0a38e98317..d51a355439 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -4,45 +4,50 @@ ## Changelog -- 2022/9/22: inital draft of InfoReservedByte -- 2022/9/24: update draft to Universal Share Encoding +- 2022/8/22: initial draft of InfoReservedByte +- 2022/8/24: update draft to Universal Share Encoding +- 2022/8/31: switch from "reserved vs unreserved" to "compact vs sparse" when describing share format -## Context +## Terminology - **nid** (8 bytes): namespace id - **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one - **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes +- **compact share**: a type of share that can accommodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible. +- **sparse share**: a type of share that can accommodate zero or one unit. Currently, sparse shares are used for messages. + +## Context -The current reserved namespace (transaction, ISRs, evidence) share format is:
`nid (8 bytes) | reserved (1 byte) | share data` +The current compact share format is:
`nid (8 bytes) | reserved (1 byte) | share data` -The current unreserved namespace (message) share format is: +The current sparse share format is: - First share of message:
`nid (8 bytes) | message length (varint 1 to 10 bytes) | share data` - Contiguous share in message:
`nid (8 bytes) | share data` The current share format poses multiple challenges: -1. Clients must have two share parsing implementations (one for reserved namespace shares and one for unreserved namespace shares). +1. Clients must have two share parsing implementations (one for compact shares and one for sparse shares). 1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. -1. It is not possible for a client that samples a random share to determine if the share is the first share of that namespace or a contiguous share. +1. It is not possible for a client that samples a random share to determine if the share is the first share of that namespace or a contiguous share in the message. ## Proposal -Introduce a universal share encoding that applies to both reserved and unreserved share formats: +Introduce a universal share encoding that applies to both compact and sparse shares: -- First share of namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | data` -- Contiguous shares in namespace (for reserved namespaces) or message (for unreserved namespaces):
`nid (8 bytes) | info (1 byte) | data` +- First share of namespace for compact shares or message for sparse shares:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | data` +- Contiguous share in namespace for compact shares or message for sparse shares:
`nid (8 bytes) | info (1 byte) | data` -The reserved share format has the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | reserved (1 byte) | data` +Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | reserved (1 byte) | data` Where **info** (1 byte) is a byte with the following structure: - the first 7 bits are reserved for the version information in big endian form (initially, this will be `0000000` for version 0); -- the last bit is a *message start indicator*, that is `1` if the share is at the start of a namespace or `0` if it is a contiguous share within a namespace. +- the last bit is a **message start indicator**, that is `1` if the share is at the start of a message or `0` if it is a contiguous share within a message. Rationale: -1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (reserved or unreserved namespace). Clients can therefore parse shares into data via one mechanism rather than two. +1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (compact or sparse). Clients can therefore parse shares into data via one mechanism rather than two. 1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. 1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. @@ -58,17 +63,17 @@ Rationale: message start | `1` | `0` | `0` continuation share | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` - - This would enable clients to begin parsing a message by sampling a share in the middle of a namespace and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. 
- - Without the continuation share indicator, the client would have to request the first share of the next message to learn that they had completed requesting the previous message. + - This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. + - Without the continuation share indicator, the client would have to request the first share of the message to parse the message length. If they don't request the first share, they can request contiguous shares until they reach the first share after their message ends to learn that they completed requesting the previous message. 1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions? - - It seems like this could be a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a valid version. + - This can be considered invalid via a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a known valid version. 1. What happens if a block producer publishes a message where the message start indicator isn't set correctly? - Add a check similar to the one above. ## Alternative Approaches -We briefly considered adding the info byte to only unreserved namespace shares, see . 
This approach was a miscommunication or earlier proposal and was deprecated in favor of this ADR. +We briefly considered adding the info byte to only sparse shares, see . This approach was a miscommunication for an earlier proposal and was deprecated in favor of this ADR. ## Decision @@ -90,38 +95,6 @@ A share version is not set by a user who submits a `PayForData`. Instead, it is 1. Introduce a new type `InfoReservedByte` to encapsulate the logic around getting the `Version()` or `IsMessageStart()` from a share. -```golang -// InfoReservedByte is a byte with the following structure: the first 7 bits are -// reserved for version information in big endian form (initially `0000000`). -// The last bit is a "message start indicator", that is `1` if the share is at -// the start of a message and `0` otherwise. -type InfoReservedByte byte - -func NewInfoReservedByte(version uint8, isMessageStart bool) (InfoReservedByte, error) { - if version > 127 { - return 0, fmt.Errorf("version %d must be less than or equal to 127", version) - } - - prefix := version << 1 - if isMessageStart { - return InfoReservedByte(prefix + 1), nil - } - return InfoReservedByte(prefix), nil -} - -// Version returns the version encoded in this InfoReservedByte. -// Version is expected to be between 0 and 127 (inclusive). -func (i InfoReservedByte) Version() uint8 { - version := uint8(i) >> 1 - return version -} - -// IsMessageStart returns whether this share is the start of a message. -func (i InfoReservedByte) IsMessageStart() bool { - return uint(i)%2 == 1 -} -``` - ### Logic #### celestia-core @@ -131,7 +104,7 @@ func (i InfoReservedByte) IsMessageStart() bool { #### celestia-app -1. Account for the new `InfoReservedByte` in all share splitting and merging code. There is an in-progress refactor of the relevant files. See +1. Account for the new `InfoReservedByte` in all share splitting and merging code. 
## Status From b4b44d0bcf09b697e125dfdd602f8a6603b47c76 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Wed, 31 Aug 2022 16:02:10 -0400 Subject: [PATCH 08/21] fix typo, message length -> data length --- .../ADR-002-universal-share-encoding.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index d51a355439..e9216d947b 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -18,16 +18,16 @@ ## Context -The current compact share format is:
`nid (8 bytes) | reserved (1 byte) | share data` +The current compact share format is:
`nid (8 bytes) | reserved (1 byte) | data` -The current spare share format is: +The current sparse share format is: -- First share of message:
`nid (8 bytes) | message length (varint 1 to 10 bytes) | share data` -- Contiguous share in message:
`nid (8 bytes) | share data` +- First share of message:
`nid (8 bytes) | message length (varint 1 to 10 bytes) | data` +- Contiguous share in message:
`nid (8 bytes) | data` The current share format poses multiple challenges: -1. Clients must have two share parsing implementations (one for compact shares and one for spares shares). +1. Clients must have two share parsing implementations (one for compact shares and one for sparse shares). 1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. 1. It is not possible for a client that samples a random share to determine if the share is the first share of that namespace or a contiguous share in the message. @@ -35,10 +35,10 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both compact and sparse shares: -- First share of namespace for compact shares or message for sprase shares:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | data` +- First share of namespace for compact shares or message for sprase shares:
`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` - Contiguous share in namespace for compact shares or message for sparse shares:
`nid (8 bytes) | info (1 byte) | data` -Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | message length (varint 1 to 10 bytes) | reserved (1 byte) | data` +Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` Where **info** (1 byte) is a byte with the following structure: From 82191cfa0e5fe5c02e49180aaea2f5726a3955bf Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 2 Sep 2022 14:01:30 -0400 Subject: [PATCH 09/21] Add example --- .../ADR-002-universal-share-encoding.md | 31 ++++++++++++------- 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index e9216d947b..c4aa5fc0bd 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -2,12 +2,6 @@ -## Changelog - -- 2022/8/22: inital draft of InfoReservedByte -- 2022/8/24: update draft to Universal Share Encoding -- 2022/8/31: switch from "reserved vs unreserved" to "compact vs sparse" when describing share format - ## Terminology - **nid** (8 bytes): namespace id @@ -29,7 +23,7 @@ The current share format poses multiple challenges: 1. Clients must have two share parsing implementations (one for compact shares and one for sparse shares). 1. It is difficult to make changes to the share format in a backwards compatible way because clients can't determine which version of the share format an individual share conforms to. -1. It is not possible for a client that samples a random share to determine if the share is the first share of that namespace or a contiguous share in the message. +1. It is not possible for a client that samples a random share to determine if the share is the first share of a message or a contiguous share in the message. ## Proposal @@ -51,6 +45,19 @@ Rationale: 1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. 1. 
The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. +## Example + +| share number | 10 | 11 | 12 | 13 | +| ----------------------- | -------------------------------- | -------------------------------- | -------------------------------- | -------------------------------- | +| namespace | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{2, 2, 2, 2, 2, 2, 2, 2}` | +| version | `0000000` | `0000000` | `0000000` | `0000000` | +| message start indicator | `1` | `1` | `0` | `1` | +| data | foo | bar | bar (continued) | buzz | + +Without the universal share format: if a client is provided share 11, they have no way of knowing that a message length delimiter is encoded in this share. In order to parse the bar message, they must request and download all shares in this namespace (shares 10 and 12) and parse them in-order to determine the length of the bar message. + +With the universal share format: if a client is provided share 11, they know from the prefix that share 11 is the start of a message and can therefore parse the message length delimiter in share 11. With the parsed message length, the client knows that the bar message will complete after reading N bytes (where N includes shares 11 and 12) and can therefore avoid requesting and downloading share 10. + ## Questions 1. Does the info byte introduce any new attack vectors? @@ -58,12 +65,12 @@ Rationale: - This **continuation share indicator** is inspired by [protocol buffer varints](https://developers.google.com/protocol-buffers/docs/encoding#varints) and [UTF-8](https://en.wikipedia.org/wiki/UTF-8). - The **continuation share indicator** is distinct from the **message start indicator**. 
Consider a message with 3 contiguous shares: - indicator | share 1 | share 2 | share 3 - --- | --- | --- | --- - message start | `1` | `0` | `0` - continuation share | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` + | share number | 1 | 2 | 3 | + | ---------------------------- | --- | --- | ------------------------------------------------------------------------ | + | message start indicator | `1` | `0` | `0` | + | continuation share indicator | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` | - - This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. + - This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. This depends on how roll-ups encode the data in a `PayForData` transaction. - Without the continuation share indicator, the client would have to request the first share of the message to parse the message length. If they don't request the first share, they can request contiguous shares until they reach the first share after their message ends to learn that they completed requesting the previous message. 1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions? 
From af0d256c3a29e89eecf7a5f8376118cef347f8cc Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Wed, 7 Sep 2022 14:40:53 -0400 Subject: [PATCH 10/21] Update docs/architecture/ADR-002-universal-share-encoding.md --- docs/architecture/ADR-002-universal-share-encoding.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index c4aa5fc0bd..1291cf5a12 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -12,7 +12,10 @@ ## Context -The current compact share format is:
`nid (8 bytes) | reserved (1 byte) | data` +The current compact share format is: + +- First share of reserved namespace:
`nid (8 bytes) | reserved (1 byte) | data length (varint 1 to 10 bytes) | data` +- Contiguous share in reserved namespace:
`nid (8 bytes) | reserved (1 byte) | data` The current sparse share format is: From 8cc6476ff8defc66a35e6a36d89f66e77028213d Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 16 Sep 2022 14:44:05 -0400 Subject: [PATCH 11/21] prefer namespace_id over nid --- .../ADR-002-universal-share-encoding.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 1291cf5a12..aba611c9ee 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -4,7 +4,6 @@ ## Terminology -- **nid** (8 bytes): namespace id - **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one - **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes - **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible. @@ -14,13 +13,13 @@ The current compact share format is: -- First share of reserved namespace:
`nid (8 bytes) | reserved (1 byte) | data length (varint 1 to 10 bytes) | data` -- Contiguous share in reserved namespace:
`nid (8 bytes) | reserved (1 byte) | data` +- First share of reserved namespace:
`namespace_id (8 bytes) | reserved (1 byte) | data length (varint 1 to 10 bytes) | data` +- Contiguous share in reserved namespace:
`namespace_id (8 bytes) | reserved (1 byte) | data` The current sparse share format is: -- First share of message:
`nid (8 bytes) | message length (varint 1 to 10 bytes) | data` -- Contiguous share in message:
`nid (8 bytes) | data` +- First share of message:
`namespace_id (8 bytes) | message length (varint 1 to 10 bytes) | data` +- Contiguous share in message:
`namespace_id (8 bytes) | data` The current share format poses multiple challenges: @@ -32,10 +31,10 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both compact and sparse shares: -- First share of namespace for compact shares or message for sprase shares:
`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` -- Contiguous share in namespace for compact shares or message for sparse shares:
`nid (8 bytes) | info (1 byte) | data` +- First share of namespace for compact shares or message for sprase shares:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` +- Contiguous share in namespace for compact shares or message for sparse shares:
`namespace_id (8 bytes) | info (1 byte) | data` -Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`nid (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` +Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` Where **info** (1 byte) is a byte with the following structure: From a0c73590e7b312007d3bbf8f5fb21ab40a5e9747 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 16 Sep 2022 14:49:32 -0400 Subject: [PATCH 12/21] update impl details --- .../ADR-002-universal-share-encoding.md | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index aba611c9ee..f4995ee0a2 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -4,7 +4,7 @@ ## Terminology -- **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in this share if there is one and `0` if there isn't one +- **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in the share if there is one and `0` if there isn't one - **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes - **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible. - **sparse share**: a type of share that can accomodate zero or one unit. Currently, sparse shares are used for messages. @@ -95,10 +95,8 @@ A share version is not set by a user who submits a `PayForData`. Instead, it is ### Constants 1. Define a new constant for `InfoReservedBytes = 1`. -1. Update [`MsgShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L26) to account for one less byte available -1. 
Update [`TxShareSize`](https://github.com/celestiaorg/celestia-core/blob/v0.34.x-celestia/pkg/consts/consts.go#L24) to account for one less byte available - -**NOTE**: Currently constants are defined in celestia-core's [consts.go](https://github.com/celestiaorg/celestia-core/blob/master/pkg/consts/consts.go) but some will be moved to celestia-app's [appconsts.go](https://github.com/celestiaorg/celestia-app/tree/evan/non-interactive-defaults-feature/pkg/appconsts). See [celestia-core#841](https://github.com/celestiaorg/celestia-core/issues/841). +1. Update [`CompactShareContentSize`](https://github.com/celestiaorg/celestia-app/blob/566b3d41d2bf097ac49f1a925cb56a3abeabadc8/pkg/appconsts/appconsts.go#L29) to account for one less byte available +1. Update [`SparseShareContentSize`](https://github.com/celestiaorg/celestia-app/blob/566b3d41d2bf097ac49f1a925cb56a3abeabadc8/pkg/appconsts/appconsts.go#L32) to account for one less byte available ### Types @@ -106,14 +104,8 @@ A share version is not set by a user who submits a `PayForData`. Instead, it is ### Logic -#### celestia-core - -1. Account for the new `InfoReservedByte` in `./types/share_splitting.go` and `./types/share_merging.go`. - - **NOTE**: These files are subject to be deleted soon. See - -#### celestia-app - 1. Account for the new `InfoReservedByte` in all share splitting and merging code. +1. Introduce a new `ParseShares` API that can accept any type of share (compact or sparse). 
## Status From 2eb9e35a382181b3dd2ffb941ee30c5324247f3a Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 16 Sep 2022 15:28:55 -0400 Subject: [PATCH 13/21] clarify existing schema --- .../ADR-002-universal-share-encoding.md | 22 ++++++++++++------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index f4995ee0a2..686318c542 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -4,23 +4,29 @@ ## Terminology -- **reserved** (1 byte): is the location of the first transaction, ISR, or evidence in the share if there is one and `0` if there isn't one -- **message length** (varint 1 to 10 bytes): is the length of the entire message in bytes - **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible. - **sparse share**: a type of share that can accomodate zero or one unit. Currently, sparse shares are used for messages. ## Context -The current compact share format is: +### Compact Share Schema -- First share of reserved namespace:
`namespace_id (8 bytes) | reserved (1 byte) | data length (varint 1 to 10 bytes) | data` -- Contiguous share in reserved namespace:
`namespace_id (8 bytes) | reserved (1 byte) | data` +`namespace_id (8 bytes) | reserved (1 byte) | data` -The current sparse share format is: +Where: + +- `reserved (1 byte)`: is the location of the first transaction, ISR, or evidence in the share if there is one and `0` if there isn't one. +- `data`: contains the raw bytes where each unit is prefixed with a varint 1 to 10 bytes that indicates how long the unit is in bytes. + +### Sparse Share Schema - First share of message:
`namespace_id (8 bytes) | message length (varint 1 to 10 bytes) | data` - Contiguous share in message:
`namespace_id (8 bytes) | data` +Where: + +- `message length** (varint 1 to 10 bytes)`: is the length of the entire message in bytes + The current share format poses multiple challenges: 1. Clients must have two share parsing implementations (one for compact shares and one for sparse shares). @@ -34,9 +40,9 @@ Introduce a universal share encoding that applies to both compact and sparse sha - First share of namespace for compact shares or message for sprase shares:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` - Contiguous share in namespace for compact shares or message for sparse shares:
`namespace_id (8 bytes) | info (1 byte) | data` -Shares in the reserved namespace have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` +Compact shares have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` -Where **info** (1 byte) is a byte with the following structure: +Where `info (1 byte)` is a byte with the following structure: - the first 7 bits are reserved for the version information in big endian form (initially, this will be `0000000` for version 0); - the last bit is a **message start indicator**, that is `1` if the share is at the start of a message or `0` if it is a contiguous share within a message. From 41fbf05f3dddc6d1813c7f5afec6ffdb1679b881 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Fri, 16 Sep 2022 16:39:56 -0400 Subject: [PATCH 14/21] Clarify proposal --- docs/architecture/ADR-002-universal-share-encoding.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/ADR-002-universal-share-encoding.md index 686318c542..d3002da23f 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/ADR-002-universal-share-encoding.md @@ -37,10 +37,12 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both compact and sparse shares: -- First share of namespace for compact shares or message for sprase shares:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` -- Contiguous share in namespace for compact shares or message for sparse shares:
`namespace_id (8 bytes) | info (1 byte) | data` +- First share of message:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` +- Contiguous share of message:
`namespace_id (8 bytes) | info (1 byte) | data` -Compact shares have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` +Note: conceptually we think of all the data in a reserved namespace as a single message + +Compact shares have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` and every unit in the compact share `data` is prefixed with a `unit length (varint 1 to 10 bytes)`. Where `info (1 byte)` is a byte with the following structure: From 21e7fe1e14609330707452b41f84289e00350a50 Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Mon, 26 Sep 2022 09:54:29 -0400 Subject: [PATCH 15/21] chore: rename encoding to prefix, fix adr number --- ...al-share-encoding.md => adr-006-universal-share-prefix.md} | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) rename docs/architecture/{ADR-002-universal-share-encoding.md => adr-006-universal-share-prefix.md} (98%) diff --git a/docs/architecture/ADR-002-universal-share-encoding.md b/docs/architecture/adr-006-universal-share-prefix.md similarity index 98% rename from docs/architecture/ADR-002-universal-share-encoding.md rename to docs/architecture/adr-006-universal-share-prefix.md index d3002da23f..8ed32aa73e 100644 --- a/docs/architecture/ADR-002-universal-share-encoding.md +++ b/docs/architecture/adr-006-universal-share-prefix.md @@ -1,6 +1,4 @@ -# ADR 009: Universal Share Encoding - - +# ADR 006: Universal Share Prefix ## Terminology From 2f26c385831048808d99f4626aa3569df18c187c Mon Sep 17 00:00:00 2001 From: Rootul Patel Date: Mon, 3 Oct 2022 16:47:48 -0400 Subject: [PATCH 16/21] rename ADR number, introduce share sequence, add new validity rules --- ...x.md => adr-007-universal-share-prefix.md} | 67 ++++++++++--------- 1 file changed, 37 insertions(+), 30 deletions(-) rename docs/architecture/{adr-006-universal-share-prefix.md => adr-007-universal-share-prefix.md} (60%) diff --git a/docs/architecture/adr-006-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md similarity index 60% rename from docs/architecture/adr-006-universal-share-prefix.md rename to docs/architecture/adr-007-universal-share-prefix.md index 8ed32aa73e..5818e9ef7a 100644 --- 
a/docs/architecture/adr-006-universal-share-prefix.md +++ b/docs/architecture/adr-007-universal-share-prefix.md @@ -1,9 +1,10 @@ -# ADR 006: Universal Share Prefix +# ADR 007: Universal Share Prefix ## Terminology - **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible. - **sparse share**: a type of share that can accomodate zero or one unit. Currently, sparse shares are used for messages. +- **share sequence**: an ordered list of shares ## Context @@ -23,7 +24,7 @@ Where: Where: -- `message length** (varint 1 to 10 bytes)`: is the length of the entire message in bytes +- `message length (varint 1 to 10 bytes)`: is the length of the entire message in bytes The current share format poses multiple challenges: @@ -35,55 +36,55 @@ The current share format poses multiple challenges: Introduce a universal share encoding that applies to both compact and sparse shares: -- First share of message:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` -- Contiguous share of message:
`namespace_id (8 bytes) | info (1 byte) | data` +- First share of sequence:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | data` +- Contiguous share of sequence:
`namespace_id (8 bytes) | info (1 byte) | data` -Note: conceptually we think of all the data in a reserved namespace as a single message - -Compact shares have the added constraint: the first byte of `data` is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` and every unit in the compact share `data` is prefixed with a `unit length (varint 1 to 10 bytes)`. +Compact shares have the added constraint: the first byte of `data` in each share is a reserved byte so the format is:
`namespace_id (8 bytes) | info (1 byte) | data length (varint 1 to 10 bytes) | reserved (1 byte) | data` and every unit in the compact share `data` is prefixed with a `unit length (varint 1 to 10 bytes)`. Where `info (1 byte)` is a byte with the following structure: - the first 7 bits are reserved for the version information in big endian form (initially, this will be `0000000` for version 0); -- the last bit is a **message start indicator**, that is `1` if the share is at the start of a message or `0` if it is a contiguous share within a message. +- the last bit is a **sequence start indicator**, that is `1` if the share is at the start of a sequence or `0` if it is a continuation share. + +Note: all compact shares in a reserved namespace are grouped into one sequence. Rationale: 1. The first 9 bytes of a share are formatted in a consistent way regardless of the type of share (compact or sparse). Clients can therefore parse shares into data via one mechanism rather than two. -1. The message start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. +1. The sequence start indicator allows clients to parse a whole message in the middle of a namespace, without needing to read the whole namespace. 1. The version bits allow us to upgrade the share format in the future, if we need to do so in a way that different share formats can be mixed within a block. 
 ## Example

-| share number            | 10                               | 11                               | 12                               | 13                               |
-| ----------------------- | -------------------------------- | -------------------------------- | -------------------------------- | -------------------------------- |
-| namespace               | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{2, 2, 2, 2, 2, 2, 2, 2}` |
-| version                 | `0000000`                        | `0000000`                        | `0000000`                        | `0000000`                        |
-| message start indicator | `1`                              | `1`                              | `0`                              | `1`                              |
-| data                    | foo                              | bar                              | bar (continued)                  | buzz                             |
+| share number             | 10                               | 11                               | 12                               | 13                               |
+| ------------------------ | -------------------------------- | -------------------------------- | -------------------------------- | -------------------------------- |
+| namespace                | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{1, 1, 1, 1, 1, 1, 1, 1}` | `[]byte{2, 2, 2, 2, 2, 2, 2, 2}` |
+| version                  | `0000000`                        | `0000000`                        | `0000000`                        | `0000000`                        |
+| sequence start indicator | `1`                              | `1`                              | `0`                              | `1`                              |
+| data                     | foo                              | bar                              | bar (continued)                  | buzz                             |

-Without the universal share format: if a client is provided share 11, they have no way of knowing that a message length delimiter is encoded in this share. In order to parse the bar message, they must request and download all shares in this namespace (shares 10 and 12) and parse them in-order to determine the length of the bar message.
+Without the universal share prefix: if a client is provided share 11, they have no way of knowing that a message length delimiter is encoded in this share. In order to parse the bar message, they must request and download all shares in this namespace (shares 10 and 12) and parse them in-order to determine the length of the bar message.

-With the universal share format: if a client is provided share 11, they know from the prefix that share 11 is the start of a message and can therefore parse the message length delimiter in share 11. With the parsed message length, the client knows that the bar message will complete after reading N bytes (where N includes shares 11 and 12) and can therefore avoid requesting and downloading share 10.
+With the universal share prefix: if a client is provided share 11, they know from the prefix that share 11 is the start of a sequence and can therefore parse the data length delimiter in share 11. With the parsed data length, the client knows that the bar message will complete after reading N bytes (where N includes shares 11 and 12) and can therefore avoid requesting and downloading share 10.

 ## Questions

 1. Does the info byte introduce any new attack vectors?
 1. Should one bit in the info byte be used to signify that a continuation share is expected after this share?
     - This **continuation share indicator** is inspired by [protocol buffer varints](https://developers.google.com/protocol-buffers/docs/encoding#varints) and [UTF-8](https://en.wikipedia.org/wiki/UTF-8).
-    - The **continuation share indicator** is distinct from the **message start indicator**. Consider a message with 3 contiguous shares:
+    - The **continuation share indicator** is distinct from the **sequence start indicator**. Consider a message with 3 contiguous shares:

       | share number                 | 1   | 2   | 3                                                                        |
       | ---------------------------- | --- | --- | ------------------------------------------------------------------------ |
-      | message start indicator      | `1` | `0` | `0`                                                                      |
+      | sequence start indicator     | `1` | `0` | `0`                                                                      |
       | continuation share indicator | `1` | `1` | `0` <- client stops requesting contiguous shares when they encounter `0` |

-    - This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the message length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. This depends on how roll-ups encode the data in a `PayForData` transaction.
-    - Without the continuation share indicator, the client would have to request the first share of the message to parse the message length. If they don't request the first share, they can request contiguous shares until they reach the first share after their message ends to learn that they completed requesting the previous message.
+    - This would enable clients to begin parsing a message by sampling a share in the middle of a message and proceed to parsing contiguous shares until the end without ever encountering the first share of the message which contains the data length. However, this use case seems contrived because a subset of the message shares may not be meaningful to the client. This depends on how roll-ups encode the data in a `PayForData` transaction.
+    - Without the continuation share indicator, the client would have to request the first share of the message to parse the data length. If they don't request the first share, they can request contiguous shares until they reach the first share after their message ends to learn that they completed requesting the previous message.
 1. What happens if a block producer publishes a message with a version that isn't in the list of supported versions?
     - This can be considered invalid via a `ProcessProposal` validity check. Validators already compute the shares in `ProcessProposal` [here](https://github.com/rootulp/celestia-app/blob/ad050e28678119adae02536db3ef5ce083ea1436/app/process_proposal.go#L104-L110) so we can add a check to verify that every share has a known valid version.
-1. What happens if a block producer publishes a message where the message start indicator isn't set correctly?
+1. What happens if a block producer publishes a message where the sequence start indicator isn't set correctly?
     - Add a check similar to the one above.
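
The version check suggested in the answers above might look roughly like the Go sketch below. It is an illustration under stated assumptions, not the actual `ProcessProposal` code: `validateShareVersions` and the `supported` set are hypothetical, shares are modeled as raw byte slices, the info byte is assumed to sit immediately after the 8-byte namespace ID, and the version is assumed to occupy the seven most significant bits of that byte.

```go
package main

import "fmt"

const (
	// namespaceIDSize follows the schema: namespace_id (8 bytes).
	namespaceIDSize = 8
	// infoByteIndex is the assumed offset of the info byte within a share.
	infoByteIndex = namespaceIDSize
)

// validateShareVersions is a hypothetical helper that returns an error
// if any share is too short to hold an info byte or carries a version
// that is not in the supported set.
func validateShareVersions(shares [][]byte, supported map[uint8]bool) error {
	for i, share := range shares {
		if len(share) <= infoByteIndex {
			return fmt.Errorf("share %d: too short (%d bytes) to hold an info byte", i, len(share))
		}
		// Drop the sequence start indicator (lowest bit) to get the version.
		version := share[infoByteIndex] >> 1
		if !supported[version] {
			return fmt.Errorf("share %d: unsupported share version %d", i, version)
		}
	}
	return nil
}
```

A caller would pass the shares computed for the proposed block together with a set such as `map[uint8]bool{0: true}` and reject the proposal on a non-nil error. Checking that the sequence start indicator is set correctly needs knowledge of where each sequence begins, which this sketch does not attempt.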
 ## Alternative Approaches
@@ -92,30 +93,36 @@ We briefly considered adding the info byte to only sparse shares, see
Date: Thu, 6 Oct 2022 10:54:00 -0400
Subject: [PATCH 17/21] Update docs/architecture/adr-007-universal-share-prefix.md

Co-authored-by: Ismail Khoffi
---
 docs/architecture/adr-007-universal-share-prefix.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/adr-007-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md
index 5818e9ef7a..d5ee1b3360 100644
--- a/docs/architecture/adr-007-universal-share-prefix.md
+++ b/docs/architecture/adr-007-universal-share-prefix.md
@@ -3,7 +3,7 @@
 ## Terminology

 - **compact share**: a type of share that can accomodate multiple units. Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible.
-- **sparse share**: a type of share that can accomodate zero or one unit. Currently, sparse shares are used for messages.
+- **sparse share**: a type of share that can accommodate zero or one unit. Currently, sparse shares are used for messages.
 - **share sequence**: an ordered list of shares

 ## Context

From c76c0fc0b5fcb05001f8f4d998369a7fa8d88922 Mon Sep 17 00:00:00 2001
From: Rootul P
Date: Thu, 6 Oct 2022 10:54:43 -0400
Subject: [PATCH 18/21] Update docs/architecture/adr-007-universal-share-prefix.md

Co-authored-by: Ismail Khoffi
---
 docs/architecture/adr-007-universal-share-prefix.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/adr-007-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md
index d5ee1b3360..d5c29a34a6 100644
--- a/docs/architecture/adr-007-universal-share-prefix.md
+++ b/docs/architecture/adr-007-universal-share-prefix.md
@@ -2,7 +2,7 @@

 ## Terminology

-- **compact share**: a type of share that can accomodate multiple units.
Currently, compact shares are used for transactions, ISRs, and evidence to efficiently pack this information into as few shares as possible.
+- **compact share**: a type of share that can accommodate multiple units. Currently, compact shares are used for transactions, and evidence to efficiently pack this information into as few shares as possible.
 - **sparse share**: a type of share that can accommodate zero or one unit. Currently, sparse shares are used for messages.
 - **share sequence**: an ordered list of shares

From 17a0d05270997778bec9db0ce25c3ae77ff7d490 Mon Sep 17 00:00:00 2001
From: Rootul P
Date: Thu, 6 Oct 2022 10:54:51 -0400
Subject: [PATCH 19/21] Update docs/architecture/adr-007-universal-share-prefix.md

Co-authored-by: Ismail Khoffi
---
 docs/architecture/adr-007-universal-share-prefix.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/adr-007-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md
index d5c29a34a6..19681be1fa 100644
--- a/docs/architecture/adr-007-universal-share-prefix.md
+++ b/docs/architecture/adr-007-universal-share-prefix.md
@@ -14,7 +14,7 @@

 Where:

-- `reserved (1 byte)`: is the location of the first transaction, ISR, or evidence in the share if there is one and `0` if there isn't one.
+- `reserved (1 byte)`: is the location of the first transaction or evidence in the share if there is one and `0` if there isn't one.
 - `data`: contains the raw bytes where each unit is prefixed with a varint 1 to 10 bytes that indicates how long the unit is in bytes.

 ### Sparse Share Schema

From e680e8db1ea0ba8c9c5c7e088a8e6b50a7b55abc Mon Sep 17 00:00:00 2001
From: Rootul P
Date: Thu, 6 Oct 2022 10:54:59 -0400
Subject: [PATCH 20/21] Update docs/architecture/adr-007-universal-share-prefix.md

Co-authored-by: Ismail Khoffi
---
 docs/architecture/adr-007-universal-share-prefix.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/adr-007-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md
index 19681be1fa..b103308f69 100644
--- a/docs/architecture/adr-007-universal-share-prefix.md
+++ b/docs/architecture/adr-007-universal-share-prefix.md
@@ -8,7 +8,7 @@

 ## Context

-### Compact Share Schema
+### Current Compact Share Schema

 `namespace_id (8 bytes) | reserved (1 byte) | data`

From 8aa8a5cd02d57881574bfbd3f5806207604d61b0 Mon Sep 17 00:00:00 2001
From: Rootul Patel
Date: Thu, 6 Oct 2022 10:55:56 -0400
Subject: [PATCH 21/21] docs: Current Sparse Share Schema

---
 docs/architecture/adr-007-universal-share-prefix.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/architecture/adr-007-universal-share-prefix.md b/docs/architecture/adr-007-universal-share-prefix.md
index b103308f69..c31b6f2750 100644
--- a/docs/architecture/adr-007-universal-share-prefix.md
+++ b/docs/architecture/adr-007-universal-share-prefix.md
@@ -17,7 +17,7 @@ Where:
 - `reserved (1 byte)`: is the location of the first transaction or evidence in the share if there is one and `0` if there isn't one.
 - `data`: contains the raw bytes where each unit is prefixed with a varint 1 to 10 bytes that indicates how long the unit is in bytes.

-### Sparse Share Schema
+### Current Sparse Share Schema

 - First share of message: `namespace_id (8 bytes) | message length (varint 1 to 10 bytes) | data`
 - Contiguous share in message: `namespace_id (8 bytes) | data`