CIP-0041? | UPLC Serialization Optimizations #314
This proposal suggests ways to reduce serialized script size.

The changes were designed to keep the bit-oriented style of flat, minimizing the number of required bits for values where possible.
The rationale section is a bit thin, though there's a bit of rationale sprinkled in various places of the specification. It'd be nice to have a summary of what motivates each design choice. This can most likely be done by reworking a bit how the specification is written (that is, write the specification without justifications, and move the justifications to the rationale section).
For example, most of the "data serialization" section is about explaining what is currently wrong with the Data CBOR serialization. Instead, keep the specification focused on what the proposal is about, and move the rationale to this section.
It'd be nice to consider some benchmarks: taking some reference UPLC programs and serializing them side-by-side with the two methods to see how much gain one can expect.
Thank you @KtorZ; will refactor as suggested (and correct the various typos 😄)
## Abstract

This document describes the parts of the current serialization algorithm that can be improved, and provides the specification and documentation needed to implement an optimized version of it.
You probably want to clean up the formatting, e.g. use capital letters at the beginning of sentences.
```
    case 6: return "000001";
    case 7: return "0000001";
  }
}
```
What language is this? Why not describe this in Haskell?
TypeScript, just because I feel the description is more intuitive using it. It translates to Haskell as:
```haskell
pad :: Int -> String
pad 0 = "00000001"
pad 1 = "1"
pad 2 = "01"
pad 3 = "001"
pad 4 = "0001"
pad 5 = "00001"
pad 6 = "000001"
pad 7 = "0000001"
pad _ = undefined
```
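A quick sanity check of the intended semantics (my reading, not spelled out in the thread): `pad n` fills the `n` missing bits with zeros terminated by a final `1`, so five data bits followed by `pad 3` land exactly on a byte boundary. Pasting this after the `pad` definition above:

```haskell
-- hypothetical usage: five data bits plus pad 3 make one full byte
main :: IO ()
main = putStrLn ("10110" ++ pad 3)   -- prints "10110001"
```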
---
CIP: "???"
Title: optimized UPLC serialization
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Suggested change:
```
- Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
+ Authors: Michele Nuzzi <michele.nuzzi.2014@gmail.com>
```
the proposed changes to the algorithm will cause the same UPLC Abstract Syntax Tree to serialize in a different way based on the algorithm used;

In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
Suggested change:
```
- In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
+ In order to allow the deserialization process to handle the old serialization algorithm the version of the program should be checked first.
```
cc @michaelpj

This could probably have started life as a …

@kwxm can you take a look?

Apart from anything else, this could do with an impact assessment that assesses how much of an improvement this is: actual numbers are very relevant.
```

the case in which the ```missingBits``` is 0 implies that the current serialized program is already byte-aligned

since this padding carries no useful information, the current ```pad( 0 )``` adds a useless byte each time a padding is needed and the number of used bits is a multiple of 8.
I think that the reason for this is so that the decoder can easily discard padding at the end without having to check whether it's at a byte boundary, and this makes it faster. The proposed change would on average save one bit per bytestring, so I'm not sure if it's worth it.
If the proposed changes are accepted, the only place where padding will be used is at the end of the script, since bytestrings will no longer be required to be byte-aligned; the check for pad( 0 ) then becomes just ``lengthInBits `mod` 8 == 0``.
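A minimal sketch of that check, reusing the `pad` definition from the Haskell translation above (the name `finalPad` is mine, not the CIP's):

```haskell
-- emit padding once, at the end of the script, and only when needed
finalPad :: Int -> String    -- argument: total bits written so far
finalPad lengthInBits
  | lengthInBits `mod` 8 == 0 = ""    -- pad(0) case: already byte-aligned, emit nothing
  | otherwise                 = pad (8 - lengthInBits `mod` 8)
```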
tags from ```integer``` to ```bool```, and the ```data``` one, are directly followed by the respective value encoding;

tags ```list``` and ```pair``` are the only tags that require some other tag in order to form a defined type; since the two always require some other type in order to be valid, the type application is implicit and it should be removed.
Yes, that's a bit annoying: it's like that because it reflects how the internal representations of types work. The extra generality might permit us to do some more complex things in future, but it's not clear if we'll ever actually need that. Simplifying this might be a good thing to do, since it complicates the encoding/decoding process as well as taking up extra room.
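To make the contrast concrete, here is a toy Haskell sketch of the two shapes of encoding. The tag numbers are assumptions in the style of the Plutus Core flat appendix, not quoted from it: today `list t` is announced by an explicit application tag, while under the proposal `list` and `pair` consume their argument types directly.

```haskell
-- constant types, restricted to two base types for brevity
data ConstTy = TyInteger | TyByteString | TyList ConstTy | TyPair ConstTy ConstTy

-- current style: a hypothetical `apply` tag (7 here) says "the next tag takes an argument"
encodeCurrent :: ConstTy -> [Int]
encodeCurrent TyInteger    = [0]
encodeCurrent TyByteString = [1]
encodeCurrent (TyList t)   = [7, 5] ++ encodeCurrent t
encodeCurrent (TyPair a b) = [7, 7, 6] ++ encodeCurrent a ++ encodeCurrent b

-- proposed style: `list`/`pair` always take arguments, so `apply` is implicit
encodeProposed :: ConstTy -> [Int]
encodeProposed TyInteger    = [0]
encodeProposed TyByteString = [1]
encodeProposed (TyList t)   = 5 : encodeProposed t
encodeProposed (TyPair a b) = 6 : encodeProposed a ++ encodeProposed b
```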
### ```data``` serialization

All the effort of minimizing the size of on-chain scripts by preferring ```flat``` over ```CBOR``` serialization is ignored when it comes to ```data``` serialization.
Yes, it would be good to have a more efficient encoding of Data rather than just wrapping the CBOR representation. However, I think the reason for this is that Data is used elsewhere on the chain (including things that are passed to validator scripts as parameters), and CBOR is the preferred format there. It might be difficult to switch because of this.
Conversion from CBOR to a specific data encoding can be done by the node prior to being passed as an argument; CBOR encoding is definitely too expensive to be included in a script.
@michaelpj I will work on a CIP implementation these days to have a comparison in terms of size.
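For illustration only, a toy sketch of what a flat-style `data` encoding could look like. The 3-bit tags, the bit-string representation, and all helper names are my assumptions, not part of the CIP text; the point is that each constructor gets a small tag and the payload reuses flat-style primitives instead of nesting a complete CBOR document.

```haskell
import           Data.Bits       (testBit)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS

-- the shape of PlutusCore.Data, re-declared to keep the sketch self-contained
data Data
  = Constr Integer [Data]
  | Map [(Data, Data)]
  | List [Data]
  | I Integer
  | B ByteString

type Bits = String   -- '0'/'1' characters, for readability only

-- hypothetical 3-bit constructor tag
tag :: Int -> Bits
tag n = [ if testBit n i then '1' else '0' | i <- [2, 1, 0] ]

-- flat-style list: a 1-bit continuation flag before each element
encList :: (a -> Bits) -> [a] -> Bits
encList enc xs = concatMap (\x -> '1' : enc x) xs ++ "0"

-- flat-style variable-length natural: 7 payload bits per group,
-- preceded by a bit saying whether another group follows
encNat :: Integer -> Bits
encNat n
  | n < 0x80  = '0' : group n
  | otherwise = '1' : group (n `mod` 0x80) ++ encNat (n `div` 0x80)
  where group m = [ if testBit m i then '1' else '0' | i <- [6, 5 .. 0] ]

encBytes :: ByteString -> Bits
encBytes bs =
  encNat (fromIntegral (BS.length bs))
    ++ concatMap (\w -> [ if testBit w i then '1' else '0' | i <- [7, 6 .. 0] ]) (BS.unpack bs)

encodeData :: Data -> Bits
encodeData (Constr i fs) = tag 0 ++ encNat i ++ encList encodeData fs
encodeData (Map kvs)     = tag 1 ++ encList (\(k, v) -> encodeData k ++ encodeData v) kvs
encodeData (List xs)     = tag 2 ++ encList encodeData xs
encodeData (I i)         = tag 3 ++ encNat i   -- a real encoder needs a signed (zigzag) variant here
encodeData (B bs)        = tag 4 ++ encBytes bs
```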
this implies ```(~1) + #chunks + 1``` meaningless bytes are added per ```ByteString```

in the description above
- step ```1``` allows for easy serialization and deserialization but doesn't carry any meaningful information; given the importance of ByteStrings it should be removed, at the cost of an added shift while serializing/deserializing the value
I think that might slow things down significantly, since I think we'd need to shift every byte in a bytestring if it didn't start at a byte boundary (or have I got that wrong?). It's important that on-chain deserialisation be as fast as possible, so the extra expense might be problematic. I'd like to see some benchmark results for this.
It might be possible to implement deserialization without shifting bytes, for efficiency. I'll leave here the plu-ts implementation of the readNBits function on which the deserialization process is based. I believe a very similar result can be achieved in Haskell using the Data.Binary.Get monad.
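The plu-ts readNBits code itself isn't reproduced in this thread, but the idea can be sketched in Haskell (representation and names are mine; a real decoder would add bounds checks and reads wider than 8 bits): a read of n bits touches at most the two bytes that overlap the requested span, so the rest of the buffer is never shifted.

```haskell
import           Data.Bits       (shiftL, shiftR, (.&.), (.|.))
import qualified Data.ByteString as BS
import           Data.Word       (Word8)

-- | A read-only cursor: the buffer plus a bit offset from its start.
data BitReader = BitReader { buf :: BS.ByteString, off :: Int }

-- | Read @n@ bits (0 < n <= 8) MSB-first, returning them in the low
-- bits of a 'Word8'. Bounds checks omitted for brevity.
readNBits :: Int -> BitReader -> (Word8, BitReader)
readNBits n (BitReader bs o) = (bits, BitReader bs (o + n))
  where
    byteIx = o `div` 8            -- byte holding the first wanted bit
    bitIx  = o `mod` 8            -- bits already consumed in that byte
    avail  = 8 - bitIx            -- bits left in the current byte
    hi     = BS.index bs byteIx
    bits
      | n <= avail =              -- all wanted bits sit in one byte
          (hi `shiftL` bitIx) `shiftR` (8 - n)
      | otherwise  =              -- the read straddles two bytes
          let lo     = BS.index bs (byteIx + 1)
              hiBits = hi .&. ((1 `shiftL` avail) - 1)       -- low 'avail' bits
              loBits = lo `shiftR` (8 - (n - avail))         -- top 'n - avail' bits
          in (hiBits `shiftL` (n - avail)) .|. loBits
```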
2) as many bytes as specified in the Unsigned Integer at step 1
```

using the new algorithm, the ```ByteString``` serialization overhead goes from ```O(n)``` to ```O(log n)```, where ```n``` is the number of bytes in the ```ByteString```
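To make the claim concrete, under my reading of the two schemes (one length byte per ≤255-byte chunk plus a terminating zero byte today, versus a single base-128 variable-length length prefix): a 10,000-byte ByteString currently costs about 41 framing bytes, versus 2 under the proposal.

```haskell
-- framing bytes only, excluding the payload and any alignment padding
currentOverhead :: Integer -> Integer
currentOverhead n = (n + 254) `div` 255 + 1   -- #chunks + terminating 0x00

proposedOverhead :: Integer -> Integer
proposedOverhead 0 = 1
proposedOverhead n = ceiling (logBase 2 (fromInteger n + 1) / 7 :: Double)

-- currentOverhead 10000 == 41, proposedOverhead 10000 == 2
```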
Hmmm. I need to think more carefully about this. It's a while since I've thought about bytestring encodings and I'll need to remind myself of the details of how we currently do it.
The ideas here aren't nonsensical, but I'd like to see some benchmark results to see how they affect (a) the space taken up by serialised scripts, and (b) deserialisation times. I think that the benefits would have to be quite big in order to accept this.
I'll add that we're aware that there's room for improving things, but the reason that things are the way they currently are is that we're using a pre-existing library and that's the way it does things. We're definitely not averse to changing things, but then we'd need to maintain our own version of the flat library.
Apologies for a slightly hurried review: this had slipped off the bottom of my list. I'll think a bit more about this, in particular the bytestring encoding.
Wondering if this is still an issue and needs to remain open given the advancements of Plutus v2/v3 and Plu.TS, Aiken, et al in the ensuing years since this CIP was opened. @michele-nuzzi what do you think?
Theoretically it could still be something worth exploring. Even with Aiken and plu-ts reducing size as much as possible, devs still tend to max out script size in an effort to reduce recursion and handle many inputs. However, I or my team are not able to focus on this in the short term. If someone else wants to take over the CIP I'm ok with that.
thanks for the update @michele-nuzzi ... @Ryun1 @Crypto2099 I'll remove this from the …
Improvements to the cost of deserialization are an extremely relevant topic now, with the introduction of the per-reference-script-byte fee, which has heavily reduced the revenue of most major dApp protocols on Cardano. It would be good if either this CIP could be revisited (and reworked to optimize for deserialization speed instead of serialization size), or a more general CPS could be made about improving the serialization format with less emphasis on script size and more emphasis on deserialization speed. Developers would be happy to pay a much higher one-time min-ada fee for deploying large reference scripts if it meant a lower feePerRefScriptByte, because the latter is a recurring fee and eats into protocol revenue.
@colll78 if you want to take it over I'm ok with it
Isn't the serialization already quite efficient? Merkelization could be useful though. I even made a CIP for that before.
For cross-reference:
The merklization CIP is not relevant to this discussion. The goal of that CIP is to reduce the size of scripts. In this case, we don't actually care about the size of scripts; what we are concerned about is the preparation time of scripts (the fee associated with deserialization, etc.). The issue is the cost of deserialization introduced with the per-reference-script-byte fee, so we need to make deserialization more efficient and thus reduce that cost.

The reason deserialization is costly is that the serialization format was optimized for script size: at the time we did not have reference scripts, so script size was much more important because scripts needed to be redundantly included in every single tx. Now that we have reference scripts, and the ability to create scripts of nearly unbounded size (e.g. 200kb+ with merklized scripts and the withdraw-zero design pattern), script size is significantly less valuable than script efficiency (ex-units). The current meta is to sacrifice script size for efficiency (inline everything, lookup tables everywhere, unrolled recursion, etc.), so the serialization format should prioritize deserialization speed over compression.

Developers will happily pay huge one-time fees to deploy reference scripts (even thousands of Ada) if it results in lower costs in the long term (reduced deserialization cost), since the long-term cost savings eventually outweigh the one-time fee of storing the reference script.
The point is you wouldn't need to deserialize the whole thing with merkelization.
Anyway, intuitively and naively, I feel like you could probably make the situation 10-100x better without changing the serialization format by just improving the deserialization code. JSON isn't particularly optimized either, but parsers for JSON are incredibly fast. Really, 95% of the performance problems can probably be fixed by just improving the implementation rather than making architectural changes.
This CIP proposes some changes to the UPLC AST serialization in order to reduce the size of the serialized output.