CIP-0041? | UPLC Serialization Optimizations #314
This proposal suggests ways to reduce serialized script size.

The changes were designed to keep the bit-oriented style of flat, minimizing the number of required bits for values where possible.
The rationale section is a bit thin, though there's a bit of rationale sprinkled in various places of the specification. It'd be nice to have a summary of what motivates each design choice. This can most likely be done by reworking a bit how the specification is written (that is, write the specification without justifications, and move the justifications to the rationale section).
For example, most of the "data serialization" section is about explaining what is currently wrong with the Data CBOR serialization. Instead, keep the specification focused on what the proposal is about, and move the rationale to this section.
It'd be nice to consider some benchmarks: taking some reference UPLC programs and serializing them side-by-side with the two methods to see how much gain one can expect.
Thank you @KtorZ; will refactor as suggested (and correct the various typos 😄)
## Abstract

This document describes the parts of the current serialization algorithm that can be improved, and provides the specification and documentation needed to implement an optimized version of it.
You probably want to clean up the formatting, e.g. use capital letters at the beginning of sentences.
```
    case 6: return "000001";
    case 7: return "0000001";
  }
}
```
What language is this? Why not describe this in Haskell?
TypeScript, just because I feel the description is more intuitive using it. It translates to Haskell as:
```haskell
pad :: Int -> String
pad 0 = "00000001"
pad 1 = "1"
pad 2 = "01"
pad 3 = "001"
pad 4 = "0001"
pad 5 = "00001"
pad 6 = "000001"
pad 7 = "0000001"
pad _ = undefined
```
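A quick sanity check of the intended semantics (my reading, not spelled out in the thread): `pad n` fills the `n` missing bits with zeros terminated by a final `1`, so five data bits followed by `pad 3` land exactly on a byte boundary. Pasting this after the `pad` definition above:

```haskell
-- hypothetical usage: five data bits plus pad 3 make one full byte
main :: IO ()
main = putStrLn ("10110" ++ pad 3)   -- prints "10110001"
```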
---
CIP: "???"
Title: optimized UPLC serialization
Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
Suggested change:
```
- Authors: Michele Nuzzi <michele.nuzzi.2014@gamil.com>
+ Authors: Michele Nuzzi <michele.nuzzi.2014@gmail.com>
```
the proposed changes to the algorithm will cause the same UPLC Abstract Syntax Tree to serialize in a different way based on the algorithm used;

In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
Suggested change:
```
- In order to allow the deserializaton process to handle the old serialization algorithm the version of the program sholud be chcked first.
+ In order to allow the deserialization process to handle the old serialization algorithm the version of the program should be checked first.
```
cc @michaelpj

This could probably have started life as a …

@kwxm can you take a look?

Apart from anything else, this could do with an impact assessment that assesses how much of an improvement this is: actual numbers are very relevant.
```

the case in which the ```missingBits``` is 0 implies that the current serialized program is already byte-aligned

since this padding carries no useful information, the current ```pad( 0 )``` adds a useless byte each time a padding is needed and the number of used bits is a multiple of 8.
I think that the reason for this is so that the decoder can easily discard padding at the end without having to check whether it's at a byte boundary, and this makes it faster. The proposed change would on average save one bit per bytestring, so I'm not sure if it's worth it.
If the proposed changes are accepted, the only place where padding will be used is at the end of the script, since bytestrings will no longer be required to be byte-aligned; the check for pad( 0 ) then becomes just ``lengthInBits `mod` 8 == 0``.
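A minimal sketch of that check, reusing the `pad` definition from the Haskell translation above (the name `finalPad` is mine, not the CIP's):

```haskell
-- emit padding once, at the end of the script, and only when needed
finalPad :: Int -> String    -- argument: total bits written so far
finalPad lengthInBits
  | lengthInBits `mod` 8 == 0 = ""    -- pad(0) case: already byte-aligned, emit nothing
  | otherwise                 = pad (8 - lengthInBits `mod` 8)
```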
tags from ```integer``` to ```bool```, and the ```data``` one, are directly followed by the respective value encoding;

tags ```list``` and ```pair``` are the only tags that require some other tag in order to form a defined type; since the two always require some other type in order to be valid, the type application is implicit and it should be removed.
Yes, that's a bit annoying: it's like that because it reflects how the internal representations of types work. The extra generality might permit us to do some more complex things in future, but it's not clear if we'll ever actually need that. Simplifying this might be a good thing to do, since it complicates the encoding/decoding process as well as taking up extra room.
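To make the contrast concrete, here is a toy Haskell sketch of the two shapes of encoding. The tag numbers are assumptions in the style of the Plutus Core flat appendix, not quoted from it: today `list t` is announced by an explicit application tag, while under the proposal `list` and `pair` consume their argument types directly.

```haskell
-- constant types, restricted to two base types for brevity
data ConstTy = TyInteger | TyByteString | TyList ConstTy | TyPair ConstTy ConstTy

-- current style: a hypothetical `apply` tag (7 here) says "the next tag takes an argument"
encodeCurrent :: ConstTy -> [Int]
encodeCurrent TyInteger    = [0]
encodeCurrent TyByteString = [1]
encodeCurrent (TyList t)   = [7, 5] ++ encodeCurrent t
encodeCurrent (TyPair a b) = [7, 7, 6] ++ encodeCurrent a ++ encodeCurrent b

-- proposed style: `list`/`pair` always take arguments, so `apply` is implicit
encodeProposed :: ConstTy -> [Int]
encodeProposed TyInteger    = [0]
encodeProposed TyByteString = [1]
encodeProposed (TyList t)   = 5 : encodeProposed t
encodeProposed (TyPair a b) = 6 : encodeProposed a ++ encodeProposed b
```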
### ```data``` serialization

All the effort of minimizing the size of on-chain scripts by preferring ```flat``` over ```CBOR``` serialization is ignored when it comes to ```data``` serialization.
Yes, it would be good to have a more efficient encoding of Data rather than just wrapping the CBOR representation. However, I think the reason for this is that Data is used elsewhere on the chain (including things that are passed to validator scripts as parameters), and CBOR is the preferred format there. It might be difficult to switch because of this.
Conversion from CBOR to a specific data encoding can be done by the node prior to being passed as an argument; CBOR encoding is definitely too expensive to be included in a script.
@michaelpj I will work on a CIP implementation these days to have a comparison in terms of size.
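For illustration only, a toy sketch of what a flat-style `data` encoding could look like. The 3-bit tags, the bit-string representation, and all helper names are my assumptions, not part of the CIP text; the point is that each constructor gets a small tag and the payload reuses flat-style primitives instead of nesting a complete CBOR document.

```haskell
import           Data.Bits       (testBit)
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS

-- the shape of PlutusCore.Data, re-declared to keep the sketch self-contained
data Data
  = Constr Integer [Data]
  | Map [(Data, Data)]
  | List [Data]
  | I Integer
  | B ByteString

type Bits = String   -- '0'/'1' characters, for readability only

-- hypothetical 3-bit constructor tag
tag :: Int -> Bits
tag n = [ if testBit n i then '1' else '0' | i <- [2, 1, 0] ]

-- flat-style list: a 1-bit continuation flag before each element
encList :: (a -> Bits) -> [a] -> Bits
encList enc xs = concatMap (\x -> '1' : enc x) xs ++ "0"

-- flat-style variable-length natural: 7 payload bits per group,
-- preceded by a bit saying whether another group follows
encNat :: Integer -> Bits
encNat n
  | n < 0x80  = '0' : group n
  | otherwise = '1' : group (n `mod` 0x80) ++ encNat (n `div` 0x80)
  where group m = [ if testBit m i then '1' else '0' | i <- [6, 5 .. 0] ]

encBytes :: ByteString -> Bits
encBytes bs =
  encNat (fromIntegral (BS.length bs))
    ++ concatMap (\w -> [ if testBit w i then '1' else '0' | i <- [7, 6 .. 0] ]) (BS.unpack bs)

encodeData :: Data -> Bits
encodeData (Constr i fs) = tag 0 ++ encNat i ++ encList encodeData fs
encodeData (Map kvs)     = tag 1 ++ encList (\(k, v) -> encodeData k ++ encodeData v) kvs
encodeData (List xs)     = tag 2 ++ encList encodeData xs
encodeData (I i)         = tag 3 ++ encNat i   -- a real encoder needs a signed (zigzag) variant here
encodeData (B bs)        = tag 4 ++ encBytes bs
```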
this implies ```(~1) + #chunks + 1``` meaningless bytes are added per ```ByteString```

in the description above
- step ```1``` allows for easy serialization and deserialization but doesn't carry any meaningful information; given the importance of ByteStrings it should be removed, at the cost of an added shift while serializing/deserializing the value
I think that might slow things down significantly, since I think we'd need to shift every byte in a bytestring if it didn't start at a byte boundary (or have I got that wrong?). It's important that on-chain deserialisation be as fast as possible, so the extra expense might be problematic. I'd like to see some benchmark results for this.
It might be possible to implement deserialization without shifting bytes, for efficiency. I'll leave here the plu-ts implementation of the readNBits function on which the deserialization process is based. I believe a very similar result can be achieved in Haskell using the Data.Binary.Get monad.
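The plu-ts readNBits code itself isn't reproduced in this thread, but the idea can be sketched in Haskell (representation and names are mine; a real decoder would add bounds checks and reads wider than 8 bits): a read of n bits touches at most the two bytes that overlap the requested span, so the rest of the buffer is never shifted.

```haskell
import           Data.Bits       (shiftL, shiftR, (.&.), (.|.))
import qualified Data.ByteString as BS
import           Data.Word       (Word8)

-- | A read-only cursor: the buffer plus a bit offset from its start.
data BitReader = BitReader { buf :: BS.ByteString, off :: Int }

-- | Read @n@ bits (0 < n <= 8) MSB-first, returning them in the low
-- bits of a 'Word8'. Bounds checks omitted for brevity.
readNBits :: Int -> BitReader -> (Word8, BitReader)
readNBits n (BitReader bs o) = (bits, BitReader bs (o + n))
  where
    byteIx = o `div` 8            -- byte holding the first wanted bit
    bitIx  = o `mod` 8            -- bits already consumed in that byte
    avail  = 8 - bitIx            -- bits left in the current byte
    hi     = BS.index bs byteIx
    bits
      | n <= avail =              -- all wanted bits sit in one byte
          (hi `shiftL` bitIx) `shiftR` (8 - n)
      | otherwise  =              -- the read straddles two bytes
          let lo     = BS.index bs (byteIx + 1)
              hiBits = hi .&. ((1 `shiftL` avail) - 1)       -- low 'avail' bits
              loBits = lo `shiftR` (8 - (n - avail))         -- top 'n - avail' bits
          in (hiBits `shiftL` (n - avail)) .|. loBits
```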
2) as many bytes as specified in the Unsigned Integer at step 1
```

using the new algorithm, the ```ByteString``` serialization overhead goes from ```O(n)``` to ```O(log n)```, where ```n``` is the number of bytes in the ```ByteString```
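To make the claim concrete, under my reading of the two schemes (one length byte per ≤255-byte chunk plus a terminating zero byte today, versus a single base-128 variable-length length prefix): a 10,000-byte ByteString currently costs about 41 framing bytes, versus 2 under the proposal.

```haskell
-- framing bytes only, excluding the payload and any alignment padding
currentOverhead :: Integer -> Integer
currentOverhead n = (n + 254) `div` 255 + 1   -- #chunks + terminating 0x00

proposedOverhead :: Integer -> Integer
proposedOverhead 0 = 1
proposedOverhead n = ceiling (logBase 2 (fromInteger n + 1) / 7 :: Double)

-- currentOverhead 10000 == 41, proposedOverhead 10000 == 2
```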
Hmmm. I need to think more carefully about this. It's a while since I've thought about bytestring encodings and I'll need to remind myself of the details of how we currently do it.
The ideas here aren't nonsensical, but I'd like to see some benchmark results to see how they affect (a) the space taken up by serialised scripts, and (b) deserialisation times. I think that the benefits would have to be quite big in order to accept this.
I'll add that we're aware that there's room for improving things, but the reason that things are the way they currently are is that we're using a pre-existing library and that's the way it does things. We're definitely not averse to changing things, but then we'd need to maintain our own version of the flat library.
Apologies for a slightly hurried review: this had slipped off the bottom of my list. I'll think a bit more about this, in particular the bytestring encoding.
Wondering if this is still an issue and needs to remain open given the advancements of Plutus v2/v3 and Plu.TS, Aiken, et al in the ensuing years since this CIP was opened. @michele-nuzzi what do you think?
Theoretically it could still be something worth exploring. Even with Aiken and plu-ts reducing size as much as possible, devs still tend to max out script size in an effort to reduce recursion and handle many inputs. However, I or my team are not able to focus on this in the short term. If someone else wants to take over the CIP I'm ok with that.
thanks for the update @michele-nuzzi ... @Ryun1 @Crypto2099 I'll remove this from the …
Improvements to the cost of deserialization are an extremely relevant topic now, with the introduction of the per-reference-script-byte fee, which has heavily reduced the revenue of most major dApp protocols on Cardano. It would be good if either this CIP could be revisited (and reworked to optimize for deserialization speed instead of serialization size), or a more general CPS could be made about improving the serialization format with less emphasis on script size and more emphasis on deserialization speed. Developers would be happy to pay a much higher one-time min-ada fee for deploying large reference scripts if it meant a lower feePerRefScriptByte, because the latter is a recurring fee and eats into protocol revenue.
@colll78 if you want to take it over I'm ok with it
Isn't the serialization already quite efficient? Merkelization could be useful though. I even made a CIP for that before.
For cross-reference:
The merklization CIP is not relevant to this discussion. The goal of that CIP is to reduce the size of scripts. In this case, we don't actually care about the size of scripts; what we are concerned about is the preparation time of scripts (the fee associated with deserialization, etc.). The issue is the cost of deserialization introduced with the per-reference-script-byte fee, so we need to make deserialization more efficient and thus reduce that cost.

The reason deserialization is costly is that the serialization format was optimized for script size: at the time we did not have reference scripts, so script size was much more important because scripts needed to be redundantly included in every single tx. Now that we have reference scripts, and the ability to create scripts of nearly unbounded size (e.g. 200kb+ with merklized scripts and the withdraw-zero design pattern), script size is significantly less valuable than script efficiency (ex-units). The current meta is to sacrifice script size for efficiency (inline everything, lookup tables everywhere, unrolled recursion, etc.), so the serialization format should prioritize deserialization speed over compression.

Developers will happily pay huge one-time fees to deploy reference scripts (even thousands of Ada) if it results in lower costs in the long term (reduced deserialization cost), since the long-term cost savings eventually outweigh the one-time fee of storing the reference script.
The point is you wouldn't need to deserialize the whole thing with merkelization.
Anyway, intuitively and naively, I feel like you could probably make the situation 10-100x better without changing the serialization format by just improving the deserialization code. JSON isn't particularly optimized either, but parsers for JSON are incredibly fast. Really, 95% of the performance problems can probably be fixed by just improving the implementation rather than making architectural changes.
This CIP proposes some changes to the UPLC AST serialization in order to reduce the size of the serialized output.