Flexible serialization with a mild focus on JSON #247
Over the past months, we at Status have been developing a set of general-purpose serialization APIs that offer some significant advantages over this design. The most significant design principle is that a serialization framework should not produce dynamically allocated intermediate data structures; instead, decoding targets the user's concrete types directly:

```nim
type
  Person = object
    name: string
    age: int

const
  jsonString = """{"name": "Sean Connery", "age": 90}"""

let sean = Json.decode(jsonString, Person)
```

To implement this efficiently, the type-specific code that the compiler generates automatically is not concerned with converting intermediate representations. Please note that this model doesn't preclude support for `JsonNode`:

```nim
let sean = Json.decode(jsonString, JsonNode)
echo sean["name"].kind # prints JString
```

Also, the first argument here (`Json`) selects the serialization format. Once you learn the Json API, it's quite easy to switch to another format:

```nim
let protobufBytes = Protobuf.encode(sean)
let decoded = Protobuf.decode(protobufBytes, Person)
```

In our real-world use cases for the library, the extensibility API has faced significantly more difficult challenges than the ones that triggered the current discussion in nim-lang/Nim#15133. nim-serialization is able to define custom serialization for any type, even when the type comes from a third-party package you cannot modify, or even if it's an external C/C++ type that Nim cannot analyze. You can even define custom serialization that's context-specific (some types may be encoded differently in JSON in different interchange standards, for example).
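As a self-contained illustration of that last point (using std/json rather than nim-serialization, with hypothetical `toJson`/`fromJson` hook names), custom serialization for a type you don't own, such as `DateTime` from std/times, can be attached via plain proc overloads:

```nim
import std/[json, times]

# DateTime comes from std/times: we cannot modify it, but we can still
# attach (de)serialization to it with ordinary proc overloads.
proc toJson(dt: DateTime): JsonNode =
  %dt.format("yyyy-MM-dd")

proc fromJson(n: JsonNode, T: typedesc[DateTime]): DateTime =
  parse(n.getStr, "yyyy-MM-dd")

let day = parse("2021-01-15", "yyyy-MM-dd")
doAssert fromJson(toJson(day), DateTime) == day
```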
@zah I considered this too, but the RFC deals with how further additions to json and jsonutils should be handled. Otherwise we might as well use Status's library/design. And maybe we should. :-)
I have recently implemented a Thrift library in Nim, and one crucial aspect of it was the concept of a "Transport". JSON isn't a transport protocol, so it doesn't matter as much there, but for protocols like Thrift/Protobuf I think you need something more flexible than this proposal. At the end of the day, though, I don't think you should worry about that. Let the community come up with a good serialization library. We just need the basics in the stdlib. From that perspective this seems fine to me.
I'm not sure what you meant by basics. What we need in the standard library for protocols (considering serialization/deserialization a "codec" protocol) is concepts/interfaces/APIs that the implementations must adhere to. This way "protocol" users can switch to new backends depending on the backend implementation: performance, thread safety, pure Nim, suitability for embedded/real-time use, etc. Unfortunately, this often requires at least one implementation to explore the field and make mistakes. In the future I would like more of those mistakes to be corrected, or at least a process to correct them, leaving more of the implementations to the community or interested parties. Examples:
I believe deserialization should be into the target object type, not into the intermediate `JsonNode`:

```nim
proc fromJson[T](x: JsonNode, t: typedesc[T]): T
proc toJson[T](a: T): JsonNode
```
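A minimal sketch of these two signatures on top of std/json, assuming flat objects with only `int` and `string` fields (everything here is illustrative, not part of the proposal):

```nim
import std/json

type Person = object
  name: string
  age: int

# Sketch: serialize by walking the fields at compile time.
proc toJson[T: object](a: T): JsonNode =
  result = newJObject()
  for key, value in a.fieldPairs:
    result[key] = %value

# Sketch: deserialize straight into the target type; no intermediate
# tree is exposed to the caller.
proc fromJson[T: object](x: JsonNode, t: typedesc[T]): T =
  for key, value in result.fieldPairs:
    when value is int:
      value = x[key].getInt
    elif value is string:
      value = x[key].getStr

let p = Person(name: "Sean", age: 90)
doAssert fromJson(toJson(p), Person) == p
```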
These can be achieved without implementing a serialization library/framework in the stdlib. During development of nim-serialization and friends (e.g. nim-json-serialization, nim-protobuf-serialization, etc.), we collected several issues that should be addressed by the Nim compiler. Problems such as #176 and #44 are examples that hindered us from producing a better serialization lib. Another problem:

```nim
type
  Something = ref object # bad for serialization libs because of the anonymous object insertion
    ...
  SomeSeq = seq object # we never write seq like this, why write `ref object`?
    ...
  SomeRoot = ref object of RootRef
    ...
  WeirdObject = object of SomeRoot # valid, but is it a ref or not? A nightmare for serialization lib authors.
    ...
```

The above, combined with case objects and custom pragmas, means compile-time code in a serialization lib/framework cannot reach its full potential. And I believe it does not stop there. Better to solve those already-known issues first and then improve the existing serialization libs/frameworks. Writing another serialization lib in the stdlib will only hit the same wall again.
@jangko That's all true, but contributors keep submitting patches and features to the existing JSON stuff; at least now there is an RFC describing where the journey should go.
This is more of a "what interface does this data type implement" scenario, right? For example, if I have an `IntRange` type that is represented as a single integer,
I believe it would be better if the pragmas had some sort of grouping, indicating that they are used for JSON serialization. How would one handle the case where a concrete reference type isn't known at compile time (for example, types that use inheritance and methods)? Regarding performance: while I do think performance is important, a standard library should also be flexible and usable for newcomers and rapid prototyping.
With regard to the RFC, I think it's a bad idea to serialize enums as strings. Enum strings can change, and for reasons unrelated to serialization, but what's unlikely to change is the ordinal value of the enum. It seems senseless to make an author reconsider the length of a symbol name, or its supplied enum string, simply out of fear that it may affect serialization performance should anyone ever wish to serialize the type.
Nobody knows whether the enum name changes more frequently than its ordinal. And JSON is a format readable by human beings; magic numbers are less readable than names.
I have written two very fast serializers/deserializers, for JSON (https://github.com/treeform/jsony) and a custom binary format (https://github.com/treeform/flatty). I think they are the fastest libs of their kind (please send me timings if they are not). Jsony has the ability to add user-defined serializing and deserializing hooks. I don't think this is required; it's easy to import tables or other modules to add support for types in your serializer. I don't think these concepts would have helped me. Nim already has everything you need to implement simple and very fast serializers/deserializers.
It's nice that we have many direct to/from JSON serialization libraries, BUT they are ALL flawed.
It's good to have a fast library, but if it's not correct, well ...
AFAIK we have several serializers that support case objects... Which don't work, in your experience? It's a boring problem if it doesn't work for deserialization, but not one we can't fix. Where doesn't it work?
@disruptek Documented at https://github.com/planetis-m/eminim#limitations You can see the example code to understand why. Afaik other libs might be more flexible, but not without hacks.
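For readers following along, here is a minimal self-contained illustration of the limitation under discussion (the type and field names are invented; this uses std/json, which sidesteps the problem by building a tree before interpreting any field):

```nim
import std/[json, strutils]

type
  Kind = enum circle, square
  Shape = object
    case kind: Kind
    of circle:
      radius: float
    of square:
      side: float

# The discriminator arrives after the payload field, so a one-pass
# streaming parser cannot pick a branch when it first sees "radius".
let data = """{"radius": 2.0, "kind": "circle"}"""

# A tree-building parser (std/json) copes with any field order:
let node = parseJson(data)
var s: Shape
case parseEnum[Kind](node["kind"].getStr)
of circle:
  s = Shape(kind: circle, radius: node["radius"].getFloat)
of square:
  s = Shape(kind: square, side: node["side"].getFloat)

doAssert s.kind == circle and s.radius == 2.0
```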
Well, jason serializes correctly. If I write a deserializer, it will also be correct. But you already have a deserializer -- why don't you make yours work correctly?
@planetis-m I agree with @disruptek. I just implemented case objects, and my discriminator field does not need to be first. Yes, it's complex, but I would not call it "hacks." It's just code. https://github.com/treeform/jsony#full-support-for-case-variant-objects My point still stands that we don't need extra Iterable, Pairable, Optional concepts, as what's in Nim now works well. They would not help with complex-to-handle case-variant objects. Fixing issues with case-variant objects so that they are easier to use would be great, though.
It seems to me you haven't properly tested your stuff; try the stdlib tests, and if you don't yet understand the second problem, read https://github.com/planetis-m/eminim/blob/master/examples/ex_matrix.nim#L43 . It's easy to be dismissive, but my point still stands. What I'm saying is: we still need std/json and this proposal implemented, if we care about correctness and don't want to depend on a single developer's idea of what a serialization library should do. Also, false and misleading claims won't help the situation.
Let me recap my opinion. Direct JSON serializers don't offer a complete replacement for std/json. The JSON spec makes no guarantees about the order of an object's fields. The aforementioned libraries all have limitations due to that, and do NOT cover all the use-cases of std/json, in particular object variants and creating custom (de)serializers. A workaround could be to use JSON arrays of different types; it isn't sufficient if you don't control the writer (the side that sends the JSON data). Another is to use some complex lookahead-and-backtrack combination, which makes the performance unpredictable. If the objective is to improve
Well, it's easy to wave hands, but without specific criticisms there's not much to talk about. I don't think the stdlib is really a good source of tests, especially since the code is so awful; what are we really testing, and what makes us think the tests are any better than the lousy implementation? If you can share what in particular doesn't work for you, perhaps we can improve the code. Otherwise... I just added another variant object test to jason which demonstrates out-of-order fields and whatnot, and I just can't seem to provoke misbehavior. 😕
What are you talking about? You said yourself you don't provide a deserializer; the trouble comes when deserializing. I am referring to this specific test: https://github.com/nim-lang/Nim/blob/devel/tests/stdlib/tjson_unmarshall.nim
My mistake, I misunderstood this series of sentences:
It's ok @disruptek, I'm happy to give you a better example, so that everyone understands better. Our JSON data:

```json
{"m": 2, "data": [1, 2], "n": 1}
```

In std/json you can write:

```nim
proc initFromJson(dst: var Matrix; jsonNode: JsonNode) =
  dst.m = jsonNode["m"].getInt
  dst.n = jsonNode["n"].getInt
  assert dst.m != 0 and dst.n != 0
  let size = dst.m * dst.n
  dst.data = cast[ptr UncheckedArray[float]](alloc0(size * sizeof(float)))
  for (i, x) in enumerate(jsonNode["data"].items):
    assert i < size
    dst.data[i] = x.getFloat
```

With direct serializers you have many options:
That's not the same as the example I posted. I see you used a sequence to store the data. I mentioned temporary seqs because they have an `add` method; it avoids the real issue. Try again with `ptr UncheckedArray`. You can still prove that your library is correct by posting the correct example this time, but as we talked about it in the IRC
You know very well I checked; I must have forgotten. But engineering is not about hurting people's feelings, and I couldn't care less that your auditor's feelings are hypothetically hurt. It's about upholding ethics as to what you claim your library does and what it doesn't, and about writing correct code that works every time.
To quote you previously:
I showed that we can handle arbitrary order; now you're just moving the goalposts by adding `ptr UncheckedArray`. However, if you can handle sequence fields in arbitrary order, you can handle `ptr UncheckedArray` easily. Also, this part is not relevant within the Nim ecosystem.
We are free to impose stricter requirements for our own serialization and deserialization in our own ecosystem. JSON also makes no guarantees about deserializing an int64 properly; how we serialize and deserialize this is up to us who write the spec. To avoid confusion, just as Python called its own serialization pickle, we can call ours nickle.
No, I asked here https://discord.com/channels/371759389889003530/706542664643772436/797834372786946079 whether you found a bug with misordered fields, and you replied that you were sure our auditors could find one. Following that, you never showed me that bug, and I just proved that our code handles arbitrary order fine.
Engineering is about building things, and to build complex things you need a variety of skills; you cannot build alone. An engineer should be able to work with people from various trades and understand their concerns, skills and expectations, or building will get stuck for reasons of design, logistics, hiring, economics, marketing, politics, sociology or technique. Disruptek, treeform and my colleagues at Status are spending valuable personal or company time and sharing it with the Nim community. Furthermore, those libraries are not low-quality ones, as you suggest when you say

First, everyone is using the MIT license, so I don't get that liability snide remark. Second, all those libraries come with tests and continuous integration and, in the case of Status, auditing, 24/7 fuzzing and being one of the 4 targets for exploitation on a $1.95 billion network. This is not a left-pad situation or reinventing the wheel just because; everyone tries hard to guarantee a smooth experience for users, and it is not OK to dismiss their hard work like that, especially because you are an influential member of the community and, in that remark's case, newcomers in the #nim-gamedev channel will defer to your experience. This is the second time you are condescending to fellow Nim developers because the libraries they develop don't fit your use-case or your preferred API. I will call out this attitude every time it's needed.

Regarding std/json, its speed is a showstopper: it allocates a

There is a parallel in the standard library: the implementations are fine for fast prototyping, but when you need to build a lasting product, the design decisions for fast prototyping do not hold. Key components like

I don't think we should even try to address serialization or streams before we have VTables or traits, which would give us dynamic dispatch without forcing us to use garbage-collected types.
That's great, congratulations, you implemented the temporary-seq solution. I never moved the goalposts; the example I posted never used a sequence. My criticism is about hijacking a Nim-lang/stdlib proposal with a solution I know is unfit for a general-purpose JSON library. I made sure that the decision makers knew every aspect, including its limitations, before making the wrong decision. Not out of altruism, but because I use the stdlib and want it to be useful for everybody. You continue to pretend there is nothing wrong with it, even though I gave you as much detail as I can.

Except "my" preferred API is what everyone knows and uses.
Time passed, and we have yet to see a proposal for how to fix the performance issue, ranging between 10% and 20%, of not using dynamic dispatch in Streams. One of the "thick skins" got himself banned, and I'm "vile" for caring about correctness, while several articles were floating online arguing about the pitfalls of parsing JSON. IMO, choosing the way of personal attacks, without much technical knowledge of the underlying issue, and "I make billions so my use-cases are more important than yours", is a great way to demotivate new contributors.
Well, time passed and we now have the new styled concepts. Time to look at the issue again?
Sorry, I don't understand what this means. Is our way of doing Streams slow?
@Araq There is this benchmark: https://gist.github.com/treeform/bb7378f3904df29bf834d52236025181 but it wasn't reported.
Why not just make the growth rate configurable? I'd be in favor of making the faster growth rate the default.
Yea that makes more sense, just an optional flag of |
Probably caused by our new creative "of" operator implementation... We need a better implementation.
Had to investigate further, and it seems it's just staying in
As I said,

```nim
proc isObj(obj: PNimTypeV2, subclass: cstring): bool {.compilerRtl, inl.} =
  proc strstr(s, sub: cstring): cstring {.header: "<string.h>", importc.}
  result = strstr(obj.name, subclass) != nil
```

is the problem. :-)
Or perhaps just an integer? Is it even possible to express the growth rate as an integer?
I did implement it as an integer, but yeah, forgot
Flexible Serialization
Abstract
We want to have a flexible, convenient, sound, efficient way of serializing arbitrary Nim data types to and from an external format (JSON, for example). Nim's collections should be serializable without explicit import code; in other words, tables.nim should not import json.nim. JSON is one format out of many, and a basic hash table implementation should not know about these formats.
References: nim-lang/Nim#15133
Description
We distinguish between 3 different classes of data types; the serialization of every class is handled differently:

1. Atoms: `int`, `string`, `bool`.
2. Collections: `seq`, `Option`, `Table` or `CritBitTree`.
3. Custom objects: `Person`, `Customer` or `House`.

Note: I'll use JSON throughout in the examples, but most concepts apply equally well to other formats such as Google's protobuffers.
Atoms

Every atom `A` supports two operations, `fromJson` and `toJson`.

Note that `Time` and `DateTime` should not be supported out of the box, as JSON does not have these types; a `DateTime` in JSON is a string. We'll see how to handle `DateTime` later.

Collections
There are 3 different kinds of collections:

1. The `Iterable` concept (seq, queue etc).
2. The `Pairable` concept (any kind of `key->value` mapping).
3. The `Optional` concept.

The concept declarations are roughly:
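The declarations themselves were lost from this copy of the RFC; a rough, hypothetical sketch in the style of the Nim manual's `Container` concept example might be:

```nim
import std/[options, tables]

type
  # Hypothetical requirement lists; the RFC's actual declarations
  # may have differed.
  Iterable[T] = concept c
    items(c) is T
  Pairable[K, V] = concept c
    pairs(c) is (K, V)
  Optional[T] = concept o
    o.isSome is bool
    o.get() is T
```

Under these definitions, `seq[int]` would match `Iterable[int]`, `Table[string, int]` would match `Pairable[string, int]`, and `Option[int]` would match `Optional[int]`.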
Because JSON has a special value `null`, I decided to map an `Option[T]` into its own concept rather than treating it just as a more limited `Iterable` type.

Given these concepts, we can define `fromJson` and `toJson` operations easily:

Custom objects
Example:
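The example snippet is missing from this copy; a hypothetical reconstruction, declaring two of the pragmas described below as no-op custom pragmas so that it compiles standalone (`Customer` and its fields are invented):

```nim
import std/macros

# Hypothetical stand-ins: in the proposal these pragmas would be
# provided by the serialization module itself.
template hidden() {.pragma.}
template rename(externalName: string) {.pragma.}

type
  Customer = object
    name {.rename: "customer_name".}: string
    password {.hidden.}: string
    age: int

# The annotations are visible to compile-time inspection:
var c = Customer(name: "Ann", password: "secret", age: 30)
doAssert c.password.hasCustomPragma(hidden)
doAssert c.name.getCustomPragmaVal(rename) == "customer_name"
```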
For an `object` type the serializer supports custom pragma annotations. These are:

- `hidden`: The field is hidden and not subject to serialization.
- `default: value`: If the field is missing, a default value will be used instead.
- `rename: "external name"`: The field will be serialized as "external name".
- `customFrom: <fromProc>, customTo: <toProc>`: For this field custom serialization procs should be used instead. The procs are attached to the field and not to the field's type. This is most useful for `DateTime` so that two fields can have the same Nim type and yet are converted differently to JSON.
The operations that do the object declaration inspection are:
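The signatures were lost from this copy as well; as a sketch of the underlying mechanism, Nim's built-in `fieldPairs` already walks an object declaration at compile time (the names below are illustrative only):

```nim
type
  House = object
    address: string
    rooms: int

# Illustrative: enumerate an object's field names and their types.
proc describeFields[T: object](t: typedesc[T]): seq[string] =
  var tmp: T
  for name, value in tmp.fieldPairs:
    result.add name & ": " & $typeof(value)

doAssert describeFields(House) == @["address: string", "rooms: int"]
```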
Ref objects
While you can write custom `toJson` and `fromJson` procs for your `ref object` types, it's currently unclear how to solve this problem in a generic fashion. Refs can be used to construct arbitrary cyclic graphs, and JSON does not support cycles.

Enums
Enum values should be mapped to JSON strings for better readability. The fact that this is less efficient than integer values should not matter too much; JSON simply is not an efficient data format anyhow.
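A small sketch of the enum-as-string mapping on top of std/json (`parseEnum` comes from std/strutils; the `toJson`/`fromJson` names follow the proposal, but the bodies are illustrative):

```nim
import std/[json, strutils]

type Color = enum
  red, green, blue

proc toJson(c: Color): JsonNode =
  %($c)  # serialize the symbolic name, not the ordinal

proc fromJson(n: JsonNode, t: typedesc[Color]): Color =
  parseEnum[Color](n.getStr)

doAssert toJson(green).getStr == "green"
doAssert fromJson(toJson(blue), Color) == blue
```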
Backward incompatibility
I don't see incompatibilities, but the code inside jsonutils.nim needs to be changed.