Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaner bitfields #1019

Closed
wants to merge 7 commits into from
Closed

Cleaner bitfields #1019

wants to merge 7 commits into from

Conversation

JustinDrake
Copy link
Contributor

@JustinDrake JustinDrake commented May 1, 2019

This draft PR tries to fix a longstanding wart (see #371, #667) regarding bitfields. We define bitfields as lists or vectors of bits (as opposed to bytes and uint64 respectively). Benefits:

  • Remove ugly-as-hell helper functions verify_bitfield and get_bitfield_bit (save 28 lines)
  • No more bit manipulation with justification_bitfield—more readable
  • Merkleisation-friendly justification_bitfield (i.e. justification_bitfield does not change unless a new epoch is justified)
  • Increased number of bits in justification_bitfield in case dApps and light clients find it useful (set to EPOCHS_PER_HISTORICAL_VECTOR == 8192)
  • Semantically natural typing (bytes and uint64 were hacks)
  • Use bit type for custody_bit

This requires two minor SSZ changes:

  1. Add bit as an alias to bool (cosmetic)
  2. Pack lists and vectors of bits similar to packing of uintN (substantive)

This draft PR try to fix a longstanding wart (see #196, #371, #667) regarding bitfields. We define bitfields as lists or vectors of `bool`s, as opposed to `bytes`. Benefits:

* Remove ugly-as-hell helper function `verify_bitfield`
* No more bit manipulation with `justification_bitfield`
* Merkleisation-friendly `justification_bitfield` (i.e. `justification_bitfield` does not change unless a new epoch is justified).
* Easy parametrisation of the number of bits in `justification_bitfield` (can be more or fewer than 64—more may be useful for dApps and light clients)
* Semantically cleaner typing

This requires a minor SSZ change (Merkleisation not affected) where lists and vectors of `bool`s should be packed, similar to packing of `uintN`. We could alias `bool` to `uint1` so that the packing logic only needs to be defined once for `uintN`.
@dankrad
Copy link
Contributor

dankrad commented May 1, 2019

I'm in favour of this because it adds clarity (bit manipulations should ideally account for as little as possible of the spec, as they are very hard to read).

@dankrad
Copy link
Contributor

dankrad commented May 1, 2019

Rename bool to bit (alternatively, add bit as an alias to bool.

I still prefer the "bool" name as it is a functional description (and "bit" an implementation detail) but on the other hand, "bitfield" feels more natural than "boolfield".

@JustinDrake
Copy link
Contributor Author

Screenshot 2019-05-01 at 16 57 39

@vbuterin
Copy link
Contributor

vbuterin commented May 2, 2019

Wouldn't this also require a special case in SSZ? It makes the phase 0 spec cleaner but the SSZ serialization more complicated by about the same amount...

Though you could argue that adding the bigint type to SSZ has a similar effect, and I've advocated for that :)

@JustinDrake
Copy link
Contributor Author

JustinDrake commented May 3, 2019

Wouldn't this also require a special case in SSZ?

Yes, the substantive change involves packing lists and vectors of bits similarly to packing lists and vectors of uintN (yay for consistency!). My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function. I'll wait until after the SSZ rewrite is done so we can make a more informed decision on the SSZ complexity increase (cc @protolambda).

@dankrad
Copy link
Contributor

dankrad commented May 3, 2019

My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function. I'll wait until after the SSZ rewrite is done so we can make a more informed decision on the SSZ complexity increase (cc @protolambda).

I think I would argue that even if it is not a complexity reduction, we should just do it because SSZ is the right place to deal with data types, and this is clearly a question of data types. Not having to worry about bitfields at higher levels of the spec makes it more readable; basically it should be possible to read the Eth2.0 spec without reading SSZ and just assuming that SSZ will do the right thing when it comes to bools/bits.

@vbuterin
Copy link
Contributor

vbuterin commented May 4, 2019

Yes, the substantive change involves packing lists and vectors of bits similarly to packing lists and vectors of uintN (yay for consistency!) My best guess is that it is a 2-liner in SSZ, versus a 32-liner in the state transition function.

Probably a bit more than 2 lines. Here's my attempt:

def serialize_bits(bits):
    o = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        o[i // 8] |= bit << (i % 8)
    return bytes(o)

Note however that this is insufficient because unlike all other SSZ data structures, where we can deduce the number of items from the length, here we cannot do so, because an N byte list could correspond to a [8N .... 8N+7] size bitfield. So we'd need a custom length prefix (we don't need that today because we're feeding in the length from the committee size, so in some sense we have a "static" data type that's not really "static"). So perhaps we'd have to do something like return bytes([len(bits) % 8]) + o

My instinct is still that bitfields in SSZ are on net a good thing, but there definitely are costs to doing it.

basically it should be possible to read the Eth2.0 spec without reading SSZ and just assuming that SSZ will do the right thing when it comes to bools/bits.

Fully agree with this!

@vbuterin
Copy link
Contributor

vbuterin commented May 4, 2019

Ooh, I have another idea for how to serialize variable-sized bitfields: use the generalized index scheme from https://github.com/ethereum/eth2.0-specs/blob/dev/specs/light_client/merkle_proofs.md#generalized-merkle-tree-index. As I describe here, there is a natural correspondence between generalized indices and bitfields, in that a generalized index is a bitfield, with 0s representing when to go left and 1s representing when to go right, but with an extra 1 bit to signify where the bitfield starts from. For example, 9 is the generalized index for "the right child of the left child of the left child of the root", 9 in binary is 1001, where the first 1 is the start marker, and then the 001 is the path from root to leaf. One could take this exact scheme and use it to re-interpret any bigint as a bitfield for any other purpose, and then we'd just need SSZ to support bigints.

@JustinDrake
Copy link
Contributor Author

Probably a bit more than 2 lines. Here's my attempt:

Serialisation is not part of consensus :) Serialisation-specific implementation details (such as a custom length prefix—which is a good point) do not contribute towards the "strict" line count. This PR is a natural continuation to #924.

In the SSZ rewrite in progress (where the short Merkleisation-specific code is to be incorporated in 0_beacon_chain.md) the main thing to do is generalise the pack function from bytes to bits. I wouldn't be surprised if it ends up being 1-liner (or even a 0-liner) 😂Proof is in the pudding though.

I have another idea for how to serialize variable-sized bitfields: use the generalized index scheme

Interesting!

and then we'd just need SSZ to support bigints

Can we simply alias bigint and bitfield to list[bit] and call it a day?

@vbuterin
Copy link
Contributor

vbuterin commented May 4, 2019

Can we simply alias bigint and bitfield to list[bit] and call it a day?

We could, though I would argue that bigint is the more natural "base-ish" type. The fact is that the way this serialization would work can't avoid being fundamentally different to how every other List[x] works, and we don't want to confuse people by making non-similar things look similar. That said, naming is secondary, if we're ok with (i) the encoding algorithm for bitfields defined by def bitfield_to_int(bits): return (1 << len(bits)) + sum([(1 << i) * bit for i, bit in enumerate(bits)]) or its inverse def int_to_bitfield(n): return [(n >> i) % 2 for i in range(log2(n))] and (ii) adding an SSZ type that's identical to bytes except it enforces that leading zero bytes are illegal, then I'd be ok describing it either way, as they're equivalent under the hood.

the main thing to do is generalise the pack function from bytes to bits

The big problem here is that in pretty much every computer language, and the underlying architectures, things can't be N bits long for N % 8 != 0. The very output of serialize, and the expected input to hash, are both bytes.

Serialisation is not part of consensus :)

Everything I said above applies to tree-hashing too :)

@JustinDrake
Copy link
Contributor Author

def bitfield_to_int(bits): return (1 << len(bits)) + sum([(1 << i) * bit for i, bit in enumerate(bits)])
def int_to_bitfield(n): return [(n >> i) % 2 for i in range(log2(n))]

This is neat :) The main question is whether we want 0-based or 1-based indexing of generalised indices (#1008). An argument in favour of 0-based indexing is that bigints have support for 0, so with 0-based indexing the concepts of "bigint" and "generalised index" become equivalent.

adding an SSZ type that's identical to bytes except it enforces that leading zero bytes are illegal

It's still not clear to me why adding an SSZ type is necessary given we can use existing types.

the expected input to hash is bytes

Right, the current chunkify function should be generalised to right-pad the last chunk with zero bits (as opposed to zero bytes). Maybe replace

def chunkify(byte_string):
    byte_string += b'\x00' * (-len(byte_string) % 32)
    return [byte_string[i:i + 32] for i in range(0, len(byte_string), 32)]

with something like

def chunkify(bits):
    bits += [0b0] * (-len(bits) % 256)
    byte_string = bytes([sum([byte[b] << b for b in range(0, 8)]) for byte in zip(*(iter(bits),) * 8)])
    return [byte_string[i:i + 32] for i in range(0, len(byte_string), 32)]

@vbuterin
Copy link
Contributor

vbuterin commented May 5, 2019

I'm really inclined to say that making bits rather than bytes be the base unit of SSZ just for this single use case is overkill.

It's still not clear to me why adding an SSZ type is necessary given we can use existing types.

To preserve the norm that there's only one valid way to serialize something. If we have an injective map from bitfields to bigints, then we want there to be only one valid way to serialize a bigint, which means that deserializing eg. \x05\x00 should return an error (it should be \x05).

An argument in favour of 0-based indexing is that bigints have support for 0, so with 0-based indexing the concepts of "bigint" and "generalised index" become equivalent.

That is interesting. So if we offset everything by 1 then we get a fully bijective map. Though that's only a really strong desideratum if the goal is to make List[bit] be the base type, and I still feel like that would confuse things, and we should stick to SSZ objects all being byte-based, in which case bigint would be the base type, and we'd only need an injective map (ie. each bitfield has a unique corresponding bigint, not necessarily vice versa). In that case, 1-based is better because the bit representation of the bigint would just be the bitfield with an extra 1 on the end, as opposed to do an awkward subtract-and-carry operation on a bitfield to deserialize it.

@djrtwo
Copy link
Contributor

djrtwo commented May 5, 2019

I'm really inclined to say that making bits rather than bytes be the base unit of SSZ just for this single use case is overkill.

Late to the party but very much agree with this.

@djrtwo
Copy link
Contributor

djrtwo commented May 27, 2019

Are we closing this in favor of the bigint PR?

@JustinDrake
Copy link
Contributor Author

Are we closing this in favor of the bigint PR?

I guess we want to merge the two. The bigint PR current only touches simple-serialize.md. This PR cleans up 0_beacon-chain.md and can be adapted to however bigints get implemented.

@JustinDrake JustinDrake added the milestone:June 30 freeze 🥶 Phase 0 spec freeze for long-lived cross-client testnet label Jun 15, 2019
@djrtwo
Copy link
Contributor

djrtwo commented Jun 25, 2019

@dankrad Can we close this in favor of your coming PR?

@JustinDrake
Copy link
Contributor Author

Closing in favour of #1224

@djrtwo djrtwo deleted the JustinDrake-patch-18 branch August 14, 2019 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
milestone:June 30 freeze 🥶 Phase 0 spec freeze for long-lived cross-client testnet scope:SSZ Simple Serialize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants