Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Little-endian vs big-endian (take 2) #1046

Closed
JustinDrake opened this issue May 5, 2019 · 11 comments
Closed

Little-endian vs big-endian (take 2) #1046

JustinDrake opened this issue May 5, 2019 · 11 comments
Labels

Comments

@JustinDrake
Copy link
Contributor

In the last endianness debate we decided to favour little-endianness. Since then there have been various developments (see points 1-4 in favour of big-endianness) that warrant revisiting that decision.

Pros of little-endian

  1. Consistent with WASM
  2. Friendly to commodity hardware
  3. Consistent with parity-codec

Pros of big-endian

  1. Friendly to mathematical purity and spec readability/reasonability (we get nice things such as the ith bit corresponding to 2**i)
  2. Friendly to generalised indices and bigints (see here)
  3. Consistent with SHA256 (both external and internal serialisation), and potentially friendlier to MPCs and SNARKs/STARKs that involve hashed data.
  4. Consistent with BLS12-381 (Zcash's big-endian implementation is likely to get standardised)
  5. Consistent with big number implementations
  6. Consistent with RLP

All things considered, big-endianness may now to be favoured option. Brief counter-arguments to the little-endianness arguments:

  1. "Consistent with WASM" is the big one that made the scales favour of little-endianness last time. Would be good to get thoughts from WASM people. cc @wanderer, @axic, @jakelang, @lrettig, @cdetrio
  2. As I understand "Friendly to commodity hardware" is a non-issue for dual-endian hardware, and a minor performance penalty for little-endian hardware.
  3. As I understand "Consistent with parity-codec" is an annoyance for the Parity team and can be worked around relatively easily.
@djrtwo
Copy link
Contributor

djrtwo commented May 5, 2019

big-endian

Consistent with SHA256 (both external and internal serialisation)

This is crypto black-box that shouldn't affect us.

potentially friendlier to MPCs and SNARKs/STARKs that involve hashed data

I understand this is your hunch, but is there any actual evidence of this? source or something?

Consistent with BLS12-381

crypto black-box. Not sure it's relevant

Consistent with big number implementations

In general? Across languages? What big int implementations are you talking about?

Consistent with RLP

Consistency with VM serialization (single-endianness from top to bottom) trumps backwards compatibility imo

@JustinDrake
Copy link
Contributor Author

JustinDrake commented May 5, 2019

This is crypto black-box that shouldn't affect us.

crypto black-box. Not sure it's relevant

Serialisation of crypto black boxes may matter when those black boxes are implemented as MPC/SNARK/STARK circuits. One generally wants to reduce the depth and complexity of these circuits so avoiding unnecessary "flip" circuitry may help. Imagine an MPC circuit that does bigint arithmetic (naturally done using big-endian) and then does some hashing with the same bigint interpreted as data and SSZ little-endianness forces a bunch of unnecessary flipping. Imperceptible performance degradations in silicon circuits can turn out to be hugely significant in MPC or SNARK circuits.

this is your hunch, but is there any actual evidence of this? source or something?

Yes, just a hunch. The thought is literally from a few hours ago—arguably FUD at this point 😂 To be confirmed/denied by an MPC expert. (I will be meeting Muthu from Ligero at MIT mid-May and will discuss with him.)

In general? Across languages? What big int implementations are you talking about?

My source is @mkalinin (see here). He writes: "all big number implementations in Java that I've seen uses big-endian to encode/decode numbers to/from byte arrays. So does Milagro, even in C implementation". I'm not qualified to comment beyond that.

Consistency with VM serialization trumps backwards compatibility imo

Agreed. The good news that we may not have to enshrine VM serialisation inside consensus.

@mratsim
Copy link
Contributor

mratsim commented May 5, 2019

Contrary to other debates like cough signed vs unsigned, I don't have any preference of big vs little endian but it is not true that big number implementations are big-endian:

The serialization of their internal representation is big-endian however (but even if it was little-endian you would need a specific serialization pass for most bigint/crypto libraries)

@lightclient
Copy link
Member

From a performance point of view, most architectures support some instruction which efficiently reverses the byte order of a register value (see BSWAP on x86 or REV on ARM). This is specifically to support fast decoding of network packets which are received in big-endian, but could be applicable to passing and retrieving data from our WASM VM.

@arnetheduck
Copy link
Contributor

dual-endian hardware

very rare, in particular, x86 and wasm are not on the list - x86 has instructions to help with bigendian (movbe, bswap), but they all work by converting back and forth, instead of operating on the data "natively" (there's no addbe).

wasm

wasm in particular does not have a native byte-swapping instruction, meaning that whenever you have to "read" big-endian data and convert to an integer, you have to do so byte-by-byte which takes (a lot of) space - here's a trivial example of implementing a simple add function operating on data serialized as BE and LE: https://gcc.godbolt.org/z/Dlcd46

compare how, if the data arrives in bigendian format, we have to use a lot of (contract) space to convert to little-endian and back, to perform an the operation.

Consistent with big number implementations

not sure what this means - big number implementations will typically use native endian internally for operations on the limbs (machine-word-size chunks) and compose these operations however is convenient - endianess doesn't come into play. for interoperability, big-endian is sometimes/often used during serialization, but that also depends. Here's GMP, popular bigint library supporting both endians in serialization: https://gmplib.org/manual/Integer-Import-and-Export.html - it supports both, and "native" for maximum efficiency when on a single platform. Most importantly, the serialization is completely arbitrary / disconnected from what goes on inside the library - it's up to the protocol to choose a serialization and be consistent.

if/when we implement a bigint library for wasm, it will first convert everything to little-endian words (like the add example above), perform the operation and convert back to store it. less conversion = smaller code size.

@lightclient
Copy link
Member

compare how, if the data arrives in bigendian format, we have to use a lot of (contract) space to convert to little-endian and back, to perform an the operation.

Would this conversion not happen on the native machine before handing off the data to the WASM runtime via the linear memory buffer?

@djrtwo djrtwo closed this as completed May 5, 2019
@djrtwo djrtwo reopened this May 5, 2019
@djrtwo
Copy link
Contributor

djrtwo commented May 5, 2019

whoops, accidentally closed. Reopened!

@arnetheduck
Copy link
Contributor

Would this conversion not happen on the native machine before handing off the data to the WASM runtime via the linear memory buffer?

I guess that depends on where the data comes from and who does the interpreting of that data - a conversion de facto has to happen somewhere unless there's mechanical sympathy between the parts that make up the whole, something to consider when choosing the parts. I'd think we want as much as possible to be possible at the execution layer, where "possible" includes the notion of efficient - for example so we don't have to add to many black boxes and precompiles.

@lightclient
Copy link
Member

lightclient commented May 6, 2019

I guess that depends on where the data comes from and who does the interpreting of that data

Since the data will flow into and out of the the WASM runtime via the EEI, I figured a reverse operation could occur at that boundary in the native client.

@mkalinin
Copy link
Contributor

mkalinin commented May 8, 2019

not sure what this means - big number implementations will typically use native endian internally for operations on the limbs (machine-word-size chunks) and compose these operations however is convenient - endianess doesn't come into play

Speaking of big numbers endianness may be used to specify an order of limbs in its representation. That what I meant in the post about Java big number impl; it uses an order with most significant byte (limb = byte) first ~ big-endian with no other options.

Why big numbers may matter? Cause, starting from UIntN, N > 64 type any language will have to use big number impl to carry values of these type as they are bigger than a word that processor's register can handle. But it's hardly the case that such types will come into play.

For UInt64 types little-endian is obviously a win for languages that able to work with memory pointers (C, C++, Go, Rust, etc.).

With regard to crypto black boxes. The best way for SSZ would be to serialise them as pure byte string which represents serialisation format of particular crypto black box, just in the same way we do it for BLS points.

I guess that depends on where the data comes from and who does the interpreting of that data

The only way for WASM to benefit from keeping SSZ little-endian that I can see is avoiding flipping in SSZ implementation written in WASM.

big-endian pros:

  • conformance with some big number implementations (maybe most significant part of them), a case that has low probability due to our needs

little-endian pros:

  • often win for a number of languages
  • win for WASM implementation of SSZ

Comparing both pros makes me think of keeping little-endian for SSZ as a rational decision.

P.S. A discussion on endianness in different systems that I've found an interesting one https://news.ycombinator.com/item?id=9451284

@vbuterin
Copy link
Contributor

vbuterin commented May 11, 2019

A couple more pros for big endian:

  • Big-endianness feels more natural to humans that are accustomed with decimal place value
  • The standard way to serialize big endian into hex doesn't have the awkward mixed endian effect that little endian does (eg. the number 2 * 2**12 + 3 * 2**8 + 4 * 2**4 + 5 in big endian is 0x2345 but in little endian is 0x4523 which is just so hard to read)
  • With big endians, we have the invariant (a > b) <-> (int_to_bytesn(a) > int_to_bytesn(b))

Since the data will flow into and out of the the WASM runtime via the EEI, I figured a reverse operation could occur at that boundary in the native client.

It's worth noting that most of the serialization is going to happen on layers higher than the base protocol; eg. in the phase 2 proposal the "transaction" that gets passed into the top-level execution is just a bytearray. So if WASM is little endian and SSZ is big endian there is going to be a lot of byte flipping code at many levels of the stack.

And this is honestly a big pro for little endian especially given our goals with abstraction. So it does seem like WASM is forcing our hand....

Agree that BLS-12-381 and SHA256 are black boxes.

@djrtwo
Copy link
Contributor

djrtwo commented Jul 30, 2019

Closed as phase 0 spec is frozen

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants