Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible serialization with a mild focus on JSON #247

Open
Araq opened this issue Aug 26, 2020 · 38 comments
Open

Flexible serialization with a mild focus on JSON #247

Araq opened this issue Aug 26, 2020 · 38 comments

Comments

@Araq
Copy link
Member

Araq commented Aug 26, 2020

Flexible Serialization

Abstract

We want to have a flexible, convenient, sound, efficient way of serializing arbitrary Nim data types to and from an external format (JSON for example). Nim's collections should be serializable without explicit import code, in other words tables.nim should not import json.nim. JSON is one format out of many and a basic hash table implementation should not know about these formats.

References: nim-lang/Nim#15133

Description

We distinguish between 3 different classes of data types, the serialization of every class is handled differently.

  1. Atoms like int, string, bool.
  2. Collections like seq, Option, Table or CritBitTree.
  3. Custom objects like Person, Customer or House.

Note: I'll use JSON throughout in the examples, but most concepts apply equally well to other formats such as Google's protobuffers.

Atoms

Every atom A supports two operations:

proc fromJson(a: var A, b: JsonNode)
proc toJson(a: A): JsonNode

Note that Time and DateTime should not be supported out of the box as JSON does not have this type, a DateTime in JSON is a string. We'll see how to handle DateTime later.

Collections

There are 3 different kinds of collections:

  • A collection that fulfills the Iterable concept (seq, queue etc).
  • A collection that fulfills the Pairable concept (any kind of key->value mapping).
  • A collection that fulfills the Optional concept.

The concept declarations are roughly:

type
  Iterable[T] = concept
    iterator items(c: self): T
    proc add(c: var self; value: T)

  Pairable[K, V] = concept
    iterator pairs(c: self): (K, V)
    proc `[]=`(c: var self; k: K; v: V)

  Optional[T] = concept
    proc isSome(c: self): bool
    proc get(c: self): T

    proc some(val: T): self
    proc none[T]: self

Because JSON has a special value null I decided to map an Option[T] into its own concept rather than treating it just as more limited Iterable type.

Given these concepts, we can define fromJson and toJson operations easily:

proc toJson(c: Iterable[T]): JsonNode =
  result = newJArray()
  for x in items(c):
    result.add toJson(x)

proc toJson[K, V](c: Pairable[K, V]): JsonNode =
  static:
    assert K is string
  result = newJObject()
  for key, val in pairs(c):
    result[toJson(key)] = toJson(val)

proc toJson[T](c: Optional[T]): JsonNode =
  if c.isSome:
    result = toJson(c.get)
  else:
    result = newJNull()

proc fromJson[T](dest: var Iterable[T]; j: JsonNode) =
  for a in items(j):
    var tmp: T
    fromJson(tmp, a)
    dest.add tmp

proc fromJson[K, V](dest: var Pairable[K, V]; j: JsonNode) =
  for key, val in pairs(j):
    var tmpKey: K
    fromJson(tmpKey, key)
    var tmpVal: V
    fromJson(tmpVal, val)
    dest[tmpKey] = tmpVal

proc fromJson[T](dest: var Optional[T]; j: JsonNode) =
  if j.kind != JNull:
    var tmp: T
    fromJson(tmp, j)
    dest = some(tmp)
  else:
    dest = none[T]()

Custom objects

Example:

proc fromJsonToBirth(dest: var DateTime; j: JsonNode) = discard "..."
proc birthToJson(d: DateTime): JsonNode = discard "..."

type
  Person = object
    s {.hidden.}: Socket # do not serialize
    age {.default: 20, rename: "Age".}: int # if not provided, assume 20 for the age
    birth {.customFrom: fromJsonToBirth, customTo: birthToJson.}: DateTime

For an object type the serializer supports custom pragma annotations. These are:

  • hidden: The field is hidden and not subject to serialization.
  • default: value: If the field is missing, a default value will be
    used instead.
  • rename: "external name": The field will be serialized as "external name".
  • customFrom: <fromProc>, customTo: <toProc>: For this field custom serialization
    procs should be used instead. The procs are attached to the field and not to the
    field's type. This is most useful for DateTime so that two fields can have the
    same Nim type and yet are converted differently to JSON.

The operations that do the object declaration inspection are:

proc fromJson[T: object](x: var T, j: JsonNode) =
  # Ignoring pragma annotations the basic algorithm is:
  assert j.kind == JObject
  for name, value in fieldPairs(x):
    fromJson(value, j[name])

proc toJson[T: object](x: T): JsonNode =
  # Ignoring pragma annotations the basic algorithm is:
  result = newJObject()
  for name, value in fieldPairs(x):
    result[toJson(name)] = toJson(value)

Ref objects

While you can write custom toJson and fromJson procs for your ref object types, it's currently unclear how to solve this problem in a generic fashion. Refs can be used to construct arbitrary cyclic graphs and JSON does not support cycles.

Enums

Enum values should be mapped to JSON strings for better readability. The fact this is less efficient than integer values should not matter too much, JSON simply is not an efficient data format anyhow.

Backward incompatibility

I don't see incompatibilities, but the code inside jsonutils.nim needs to be changed.

@zah
Copy link
Member

zah commented Aug 26, 2020

Over the past months, we at Status, have been developing a set of general purpose serialization APIs that offer some significant advantages over this design.

The most significant design principle is that a serialization framework should not produce dynamically allocated data structures, such as JsonNode. It should be able to go straight from text to strongly typed Nim objects. Here is how the API looks like:

type
  Person = object
    name: string
    age: int

const
  jsonString = """{"name": "Sean Connery", age: 90}"""

let sean = Json.decode(jsonString, Person)

To implement this efficiently, the type-specific code that the compiler generates automatically is not concerned with converting JsonNodes to fields, but it rather directly sits on top of the Json lexer and the underlying input text stream. Many data types can be deserialized without any memory allocations.

Please note that this model doesn't preclude the support for JsonNode. JsonNode is merely one of the many supported types:

let sean = Json.decode(jsonString, JsonNode)
echo sean["name"].kind # prints JString

Also, the first argument here (Json) is just one of the many formats supported by nim-serialization (others include ProtoBuf, TOML, SSZ, Logfmt, etc). What the API allows you is to treat the format as a parameter in your code, which makes it possible to develop libraries such as Confutils that can work with any configuration file format (e.g. YAML, TOML, JSON, etc) or Chronicles that can produce structured logs in Json, Logfmt, Xml or any other custom text format.

Once you learn the Json API, it's quite familiar to switch to another format:

let protobufBytes = Protobuf.encode(sean)
let decoded = Protobuf.decode(protobufBytes, Person)

In our real-world uses cases for the library, the extensibility API has faced significantly more difficult challenges than the ones that triggered the current discussion in nim-lang/Nim#15133. nim-serialization is able to define custom serialization for any type, even when the type is coming from a third party package you cannot modify, or even if it's an external C/C++ type that Nim cannot analyze. You can even define custom serialization that's context-specific (some types may be encoded in different ways in JSON for example in different interchange standards).

@Araq
Copy link
Member Author

Araq commented Aug 26, 2020

@zah I considered this too but the RFC deals with how further additions to json and jsonutils should be handled. Otherwise we might as well use Status's library/design. And maybe we should. :-)

@dom96
Copy link
Contributor

dom96 commented Aug 26, 2020

I have recently implemented a thrift library in Nim and one aspect of it that was crucial was the concept of a "Transport". JSON isn't a transport protocol, so it doesn't matter as much for it, but for protocols like thrift/protobufs I think you need something more flexible than this proposal.

At the end of the day though, I don't think you should worry about that. Let the community come up with a good serialization library. We just need the basics in the stdlib. From that perspective this seems fine to me.

@mratsim
Copy link
Collaborator

mratsim commented Aug 26, 2020

We just need the basics in the stdlib. From that perspective this seems fine to me.

I'm not sure what you meant by basics.

What we need in the standard library for protocols (considering serialization/deserialization as a "codec" protocol) is concepts/interfaces/API that the implementations must adhere to.

This way "protocol" users can switch to new backends depending on backend implementation: performance, thread safety, pure Nim, suitability for embedded/real-time/...

Unfortunately, this often requires at least one implementation to explore the field and make mistakes.

In the future I would like more of those mistakes be corrected or at least that we have a process to correct them and leave more of the implementations to the community or interested parties.

Examples:

  • Marshal/serialization/deserialization
  • Crypto API (initContext, hash, sign, verify)
  • Streams
  • Async
  • Threads/threadpools/executors
  • Coroutines
  • Unittests
  • Logging
  • BigInt
  • Database

@cooldome
Copy link
Member

cooldome commented Aug 28, 2020

I believe deserialization should be into target object type not into the intermediate JsonNode:
For example:

proc fromJson[T](x: JsonNode, t: typedesc[T]): T
proc toJson[T](a: T): JsonNode

@jangko
Copy link

jangko commented Aug 28, 2020

We want to have a flexible, convenient, sound, efficient way of serializing arbitrary Nim data types to and from an external format (JSON for example). Nim's collections should be serializable without explicit import code, in other words tables.nim should not import json.nim. JSON is one format out of many and a basic hash table implementation should not know about these formats.

These can be achieved without implementing a serialization library/framework in stdlib.
What we need is better support from Nim compiler. Nim compiler should provide solid infrastructure to facilitate writing a better serialization lib/framework.

During development of nim-serialization and friends e.g. nim-json-serialization, nim-protobuf-serialization, etc., we have collected several issues that should be addressed by Nim compiler.

problems such as #176 and #44 are example that hindered us to produce better serialization lib.

another problem ref object shortcut have bad internal design:

type
   Something = ref object  # bad for serialization lib because of anonymous object insertion
      ...

   SomeSeq = seq object  # we never write seq like this, why write `ref object`?
      ...

   SomeRoot = ref object of RootRef
      ...

   WeirdObject = object of SomeRoot  # valid but hey, it's a ref or not? nightmare for serialization lib author.
      ...

The above example combined with case object and custom pragma makes compile time code in serialization lib/framework cannot reach it's maximum potential.

I believe it does not stop there. Better we solve those already known issues and then improve existing serialization lib/framework.

Writing another serialization lib in stdlib will only again hitting the same wall.

@Araq
Copy link
Member Author

Araq commented Aug 28, 2020

@jangko That's all true but contributors keep submitting patches and features to the existing JSON stuff, at least now there is an RFC describing where the journey should go.

@Varriount
Copy link

We distinguish between 3 different classes of data types, the serialization of every class is handled differently.

This is more of a "what interface does this data type implement" scenario, right? For example, if I have a "IntRange" type that is represented as a single integer, type IntRange = int (or maybe distinct int), is it possible to serialize/deserialize it into an "array" of integers (using JSON terminology)?

For an object type the serializer supports custom pragma annotations.

I believe it would be better if the pragmas had some sort of grouping, indicating that they are used for JSON serialization.
For example, Go's struct tags use the format json:"<option>,<option...>". This grants increased flexibility in the case that a type is serialized to multiple for formats (for instance, JSON and SQL).


How would one handle the case where a concrete reference type isn't known at compile-time (for example, types that use inheritance and methods)?

Regarding performance, while I do think performance should be important, a standard library should also be flexible, and usable for newcomers and rapid prototyping.

@disruptek
Copy link

With regard to the RFC, I think it's a bad idea to serialize enums as strings. Enum strings can change, and for reasons unrelated to serialization, but what's unlikely to change is the ordinal value of the enum. It seems senseless to make an author reconsider the length of a symbol name, or its supplied enum string, simply because of a fear that it may affect serialization performance should anyone ever wish to serialize the type.

@Araq
Copy link
Member Author

Araq commented Sep 22, 2020

Nobody knows if the enum name changes more frequently than its ordinal and JSON is a format readable by human beings, magic numbers are less readable than names.

@treeform
Copy link

treeform commented Jan 7, 2021

I have written two very fast serializers/deserializers for json (https://github.com/treeform/jsony) and custom binary (https://github.com/treeform/flatty). I think they are the fastest libs of their kind (please send me timings if they are not). Jsony has ability to add user defined serializing and deserializing hooks.

I don't think this is required. It's easy to import tables or other modules to add support for types in your serializer. I don't think these concept would have helped me. Nim already has everything you need to implement simple and very fast serializers/deserializers.

@planetis-m
Copy link

planetis-m commented Jan 10, 2021

It's nice that we have many direct to /from JSON serializations libraries, BUT they are ALL flawed.

  1. Limited support for case objects. It's a known issue, so I wont expand on this, but unless there is a language change, we cant do much (without hacks).
  2. Order of object fields matters (even for most non case-objects). This is overlooked but JSON makes no guarantees on field order. Direct deserializers work in the order the JSON data are read, unlike the example code posted in this rfc. Thus it may break in unpredictable ways.

Its good to have a fast library, but if its not correct, well ...

@disruptek
Copy link

AFAIK we have several serializers that support case objects... Which don't work, in your experience?

It's a boring problem if it doesn't work for deserialization, but not one we can't fix. Where doesn't it work?

@planetis-m
Copy link

@disruptek Documented at https://github.com/planetis-m/eminim#limitations You can see the example code to understand why. Afaik other libs might be more flexible but not without hacks.

@disruptek
Copy link

Well, jason serializes correctly. If I write a deserializer, it will also be correct. But you already have a deserializer -- why don't you make yours work correctly?

@treeform
Copy link

@planetis-m I agree with @disruptek. I just implemented case objects and my discriminator field does not need to be first. Yes its complex but I would not call it "hacks." Its just code. https://github.com/treeform/jsony#full-support-for-case-variant-objects

My point still sands that we don't need extra Iterable, Pairable, Optional concepts as what's in Nim now works well. They would not help with complex to handle case variant objects. Fixing issues with case variant objects so that they are easier to use would be great though.

@planetis-m
Copy link

planetis-m commented Jan 10, 2021

It seems to me you haven't properly tested your stuff, try the stdlib tests, and if you don't yet understand the second problem, read https://github.com/planetis-m/eminim/blob/master/examples/ex_matrix.nim#L43 . It's easy to be dismissive but my points still stands.

What I'm saying is, we still need std/json and this proposal implemented, if we care about correctness, don't want to depend in a single developer's idea of what a serialization library should do. Also false and misleading claims won't help the situation.

@planetis-m
Copy link

planetis-m commented Jan 10, 2021

Let me recap my opinion. Direct JSON serializers don't offer a complete replacement to std/json. The JSON spec doesn't make guarantees on the order of the objects fields. The aforementioned libraries all have limitation due to that, and do NOT cover the all the use-cases of std/json. In particular object variants and when creating custom (de)serializers. A workaround could be to use JSON arrays of different types. It isn't sufficient, if you don't control the writer (that sends the JSON data). Another is to use some complex lookahead and backtrack combination. This makes the performance unpredictable.

If the objective is to improve std/json, I propose for sam to be used instead. Otherwise I support this rfc, and the improvements it brings.

@disruptek
Copy link

disruptek commented Jan 10, 2021

It seems to me you haven't properly tested your stuff, try the stdlib tests, and if you don't yet understand the second problem, read https://github.com/planetis-m/eminim/blob/master/examples/ex_matrix.nim#L43 . It's easy to be dismissive but my points still stands.

Well, it's easy to wave hands but without specific criticisms, there's not much to talk about.

I don't think the stdlib is really a good source of tests, especially since the code is so awful; what are we really testing and what makes us think the tests are any better than the lousy implementation? If you can share what in particular doesn't work for you, perhaps we can improve the code. Otherwise...

I just added another variant object test to jason which demonstrates out-of-order fields and whatnot and I just can't seem to provoke misbehavior. 😕

@planetis-m
Copy link

planetis-m commented Jan 10, 2021

What are you talking about, you said for yourself you don't provide a deserializer, the trouble comes when deserializing. I am referring to this specific test https://github.com/nim-lang/Nim/blob/devel/tests/stdlib/tjson_unmarshall.nim

@disruptek
Copy link

My mistake, I misunderstood this series of sentences:

Direct JSON serializers don't offer a complete replacement to std/json. The JSON spec doesn't make guarantees on the order of the objects fields. The aforementioned libraries all have limitation due to that, and do NOT cover the all the use-cases of std/json.

@planetis-m
Copy link

planetis-m commented Jan 11, 2021

It's ok @disruptek, I'm happy to give you a better example, so that everyone understands better.

Our JSON data:

{"m": 2, "data": [1, 2], "n": 1}

In std/json you can write:

proc initFromJson(dst: var Matrix; jsonNode: JsonNode) =
  dst.m = jsonNode["m"].getInt
  dst.n = jsonNode["n"].getInt
  assert dst.m != 0 and dst.n == 0
  let size = dst.m * dst.n
  dst.data = cast[ptr UncheckedArray[float]](alloc0(size * sizeof(float)))
  for (i, x) in enumerate(jsonNode["data"].items):
    assert i < size
    dst.data[i] = x.getFloat

With direct serializers you have many options:

  1. Default, ignore incorrect order, crash:

    Same as the link posted eariler: ex_matrix.nim

  2. Require strict order, ignore JSON spec.

  3. Use a temporary seq, doubles the storage space.

  4. Find the "m", "n" keys then backtrack to "data".

@mratsim
Copy link
Collaborator

mratsim commented Jan 11, 2021

As mentioned multiple times on IRC, nim-json-serialization properly deals with random order in Json
image

It has been heavily tested and audited. Assuming that our code is faulty and passive-aggressively suggesting that our auditors didn't do their job without even checking is not OK.

@planetis-m
Copy link

planetis-m commented Jan 11, 2021

That's not the same as the example I posted.

I see you used a sequence to store the data. I mentioned temporary seqs because they have an add method. It avoids the real issue. Try again with data: ptr UncheckedArray[float]. Either way my problem is not about making this specific example work, it's the limitations of this library and that it can't replace std/json, as I mentioned countless times.

You can still prove that your library is correct by posting the correct example this time, but as we talked about it in the IRC nim-json-serialization deals with random order, well in random order, not "properly"./

@planetis-m
Copy link

planetis-m commented Jan 11, 2021

Assuming that our code is faulty and passive-aggressively suggesting that our auditors didn't do their job without even checking is not OK.

You know very well I checked, must have forgotten. But engineering is not about hurting people's feelings, I can't care less that your auditor's feelings are hypothetically hurt. It's about upholding ethics as to what you claim your library does, and what it doesn't. And also writing correct code that works every time.

@mratsim
Copy link
Collaborator

mratsim commented Jan 12, 2021

That's not the same as the example I posted.

I see you used a sequence to store the data. I mentioned temporary seqs because they have an add method. It avoids the real issue. Try again with data: ptr UncheckedArray[float]. Either way my problem is not about making this specific example work, it's the limitations of this library and that it can't replace std/json, as I mentioned countless times.

You can still prove that your library is correct by posting the correct example this time, but as we talked about it in the IRC nim-json-serialization deals with random order, well in random order, not "properly"./

To quote you previously:

The JSON spec doesn't make guarantees on the order of the objects fields. The aforementioned libraries all have limitation due to that, and do NOT cover the all the use-cases of std/json.

I showed that we can handle arbitrary order, now you're just moving goal post by adding ptr UncheckedArray. However if you can handle sequence fields in arbitrary order, you can handle ptr UncheckedArray easily.
image

Also this part is not relevant within the Nim ecosystem.

Order of object fields matters (even for most non case-objects). This is overlooked but JSON makes no guarantees on field order.

We are free to impose stricter requirements for our own serialization and deserialization in our own ecosystem. Json also makes no guarantees of deserializing an int64 properly, how we serialize and deserialize this is up to us who write the spec. To avoid confusion, just like Python called their own serialization pickle, we can call ours nickle.

Assuming that our code is faulty and passive-aggressively suggesting that our auditors didn't do their job without even checking is not OK.

You know very well I checked, must have forgotten. But engineering is not about hurting people's feelings, I can't care less that your auditor's feelings are hypothetically hurt. It's about upholding ethics as to what you claim your library does, and what it doesn't. And also writing correct code that works every time.

No, I asked here https://discord.com/channels/371759389889003530/706542664643772436/797834372786946079 if you found a bug with misordered field and you replied that you were sure that our auditors could find one. Following that you never showed me that bug and I just proved that our code handles arbitrary order fine.

But engineering is not about hurting people's feelings, I can't care less that your auditor's feelings are hypothetically hurt. It's about upholding ethics as to what you claim your library does, and what it doesn't. And also writing correct code that works every time.

Engineering is about building things, and to build complex things you need a variety of skills, you cannot build alone, an engineer should be able to work with people from various trades, understand their concerns, skills and expectations or building will get stuck for reason of design, logistic, hiring, economics, marketing, politics, sociology or technique.
Understanding others start by being respectful. Then instead of disparaging others libraries and talking them down maybe you could share how people can improve and prop them up.

Disruptek, treeform and my colleagues at Status are spending valuable personal or company time and sharing it with the Nim community. Furthermore those libraries are not low quality ones like you suggest when you say

i hope people don't do anything important with these libraries that involves money or worse
at least my license covers me from liability

First, everyone is using the MIT license so I don't get that liability snide remark, second all those libraries comes with tests and continuous integration and in the case of Status, auditing, 24/7 fuzzing and being one of the 4 targets for exploitation on a $1.950 billions network. This is not a left-pad situation or reinventing the wheel just because, everyone tries hard to guarantee a smooth experience for users and it is not OK to dismiss their hard work like that especially because you are an influential member of the community and in that remark case, newcomers in the #nim-gamedev channel will defer to your experience.

This is the second-time you are condescending to fellow Nim developers because the libraries they develop doesn't fit your use-case or your preferred API. I will call out this attitude every time it's needed.
Don't be wrong, I appreciate your contributions and I've said it, for example for your ECS post https://github.com/planetis-m/goodluck/blob/main/docs/intro.rst, but it is way more important for me that developers who don't have a skin as tough as disruptek, treeform or my colleagues don't shy away from contributing to the Nim ecosystem because of misplaced or inaccurate criticism.

Regarding std/json, its speed is a showstopper, it allocates a Table per node which is incredibly inefficient. And before you lash out at people for whom performance is important (whether treeform who parses TB of data or me that deals with GB of data for data science), you need to accept that there is no one size fits all and that extensibility, ergonomics, robustness, security, error model but also performance matter it all depends on the problem at hand.

There is a parallel in the standard library, the implementations are fine for fast prototyping but when you need to build a lasting product the design decisions for fast prototyping do not hold.
Web-based businesses are embracing API-first designs, this allows both the producer and consumer (frontend <-> backend) to evolve in a decoupled manner or multiply the backend and endpoints.

Key components like streams and serialization are used in a variety of contexts, but design decisions in the implementation can make them a showstopper in some. Case in point streams uses a ref object. This is an immediate showstopper in embedded or kernel driver, and a tough requirement in cryptographic protocol which need to avoid having sensitive data in memory.

I don't think we should even try to address serialization or streams before we have VTable or traits which would give us dynamic dispatch without forcing us to use garbage-collected types.

@planetis-m
Copy link

planetis-m commented Jan 12, 2021

That's great, congratulations. You implemented the temporary seq solution. I never moved goal post, the example I posted, never used a sequence.

My criticism is about hijacking a Nim-lang/stdlib proposal with a solution, I know is unfit to be used for a general purpose JSON library.

I made sure that the decision makers know every aspect, including it's limitation, before making the wrong decision. Not out of altruism but since I use the stdlib I want it to be useful for everybody. You continue to pretend there is nothing wrong with it, in spite I gave you as much details as I can.

This is the second-time you are condescending to fellow Nim developers because the libraries they develop doesn't fit your use-case or your preferred API. I will call out this attitude every time it's needed.

Except "my" preferred API is what everyone knows and uses.

@planetis-m
Copy link

Time passed, we have yet to see a proposal of how to fix the performance issue ranging between 10% to 20% of not using dynamic dispatch in Streams. One of the "thick skins" got himself banned, I'm "vile" for caring about correctness, while there were several articles floating online arguing about the pitfalls of parsing JSON. IMO if you choose the way of personal attacks, without much technical knowledge on the underlying issue, and "I make billions so my use-cases are more important than yours", is a great way to demotivate new contributors.

@Araq
Copy link
Member Author

Araq commented May 6, 2021

Well, time passed and we now have the new styled concepts. Time to look at the issue again?

Time passed, we have yet to see a proposal of how to fix the performance issue ranging between 10% to 20% of not using dynamic dispatch in Streams.

Sorry, I don't understand what this means? Is our way of doing Streams slow?

@planetis-m
Copy link

@Araq There is this benchmark: https://gist.github.com/treeform/bb7378f3904df29bf834d52236025181 but it wasn't reported.

@beef331
Copy link

beef331 commented May 9, 2021

Sorry, I don't understand what this means? Is our way of doing Streams slow?

Doing some brief testing, it's mostly just due to the growth of the StringStream.data, for some reason in the streams module it's vastly slower, changing it to grow by a factor of 2 instead of the size of the data made it faster than binny. Oddly discovered that using arc slows down streams tremendously.

# Compiled with -d:lto -d:release
name ............................... min time      avg time    std dv   runs
streams ............................ 0.976 ms      1.053 ms    ±0.092   x100
fast streams ....................... 0.542 ms      0.588 ms    ±0.053   x100
binny .............................. 0.633 ms      0.650 ms    ±0.029   x100

The default string stream impl spent a lot of time in setLengthStr
image

with the changed growth rate and starting the data as a string with 1024 capacity, it was mostly in memCopy
image

So seems we could have a fastStringStream which uses more memory due to the different growth rate, but is like twice the efficiency.

@Varriount
Copy link

So seems we could have a fastStringStream which uses more memory due to the different growth rate, but is like twice the efficiency.

Why not just make the growth rate configurable? I'd be in favor of making the faster growth rate the default.

@beef331
Copy link

beef331 commented May 9, 2021

Yea that makes more sense, just an optional flag of sizeOptimized which is defaulted on(for legacy support), using the default streams logic when enabled, fast logic when disabled.

@Araq
Copy link
Member Author

Araq commented May 9, 2021

Oddly discovered that using arc slows down streams tremendously.

Probably caused by our new creative "of" operator implementation... We need a better implementation.

@beef331
Copy link

beef331 commented May 9, 2021

Had to investigate further and seems it's just staying in __strstr_sse2_unaligned for a while with --newRuntime. Atleast that's what callgrind is reporting.

@Araq
Copy link
Member Author

Araq commented May 9, 2021

As I said,

proc isObj(obj: PNimTypeV2, subclass: cstring): bool {.compilerRtl, inl.} =
  proc strstr(s, sub: cstring): cstring {.header: "<string.h>", importc.}

  result = strstr(obj.name, subclass) != nil

is the problem. :-)

@Varriount
Copy link

Yea that makes more sense, just an optional flag of sizeOptimized which is defaulted on(for legacy support), using the default streams logic when enabled, fast logic when disabled.

Or perhaps just an integer? Is it even possible to express the growth rate as an integer?

@beef331
Copy link

beef331 commented May 9, 2021

I did implement it as an integer but yea forgot 1 exists, integers work fine as long as you're good with growth rates of whole numbers, floats are fine if you're ok with a bit mor math and ensuring you get the size you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests