
Consider combining "foo" "bar" into single *StringNode. #1479

Closed
enebo opened this issue Sep 13, 2023 · 8 comments · Fixed by #1799
Labels
enhancement New feature or request
Comments

@enebo
Collaborator

enebo commented Sep 13, 2023

The build logic for this is pretty ugly:

"a" "b"
"a #{oh}" "b"
"a" "b #{oh}"
"a #{oh}" "b #{oh}"

Legacy parsers do this combination during parse. The only reason I can think of not to add this enhancement is that syntax tooling probably needs to know about it. With that said, the current approach:

  1. uses extra post-processing (wastes time)
  2. is bigger (wastes space)

I keep thinking we need two profiles (semantics, syntax) during parse, each doing the thing that type of tool wants.

@kddnewton
Collaborator

Right now, if you have "foo" "bar", then each unescaped field on the two strings is 1 byte for the offset and 1 byte for the length. But if we concat them together, then we're forced to allocate our own strings, since we can't reference the source anymore. So now we'll have 1 byte for the length and then 6 bytes for the content. It will very quickly end up using more memory unless the strings are extremely small.
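To make the tradeoff concrete, here is a rough back-of-the-envelope sketch. The per-field byte counts follow the comment above and are purely illustrative; this is not Prism's actual wire format:

```ruby
# Illustrative only: compares the serialized footprint of referencing the
# source (offset + length per string) versus embedding concatenated content
# (length prefix + raw bytes), using the per-field sizes from the comment above.

def referenced_size(strings)
  # one offset byte + one length byte per unescaped field
  strings.length * 2
end

def embedded_size(strings)
  combined = strings.join
  # one length byte plus the content itself
  1 + combined.bytesize
end

parts = ["foo", "bar"]
referenced_size(parts)  # => 4, referencing the source
embedded_size(parts)    # => 7, embedding "foobar"
```

The crossover point depends entirely on string length, which is why embedding loses for anything but very short literals.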

Of course that doesn't address the issue of post-processing. But I'm guessing it wouldn't be worth increasing the size of the serialized buffer?

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton are you saying the long-term goal will be to keep the source around in order to build strings? If so, I believe this will force us to abandon the idea of lazy method evaluation (i.e. we would be keeping both the serialized form and the source in memory).

If it ends up in a constant pool, then it will become less memory. The interpolated case will use less memory (although admittedly it is much less common) either way.

This could be a case of something serialization wants but the main parser doesn't. Has compile.c tackled these mixed interpolated cases yet? I am curious whether they prefer this as-is.

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton I may be describing this wrong. I see that loadString will populate from the source when making the tree, so we can do lazy method building, but it means that we need to fully generate its AST. I am OK with this, and you can ignore my previous comment beyond it being aspirational.

By aspirational I mean the dream would be to save offsets to a method and not generate a tree for that section. In that case we would need strings to live in a pool and not in the source.

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton OK, another idea: StringConcat always contains plain Strings, and when mixed with an InterpolatedString it is combined into the InterpolatedString during parse? That would make the complexity much lower. It should reduce the size of interpolated cases (and just be something we can already parse).

@kddnewton
Collaborator

I think there are a couple of things we can take away from this that can move us toward the aspirational goal:

  1. We should eagerly load the constant pool. In Ruby and C, we do this when deserializing by looping through each constant and getting the value out. In Java, it lazily loads the constant based on when one of the nodes needs it. Right now, since we're deserializing the entire tree, the lazy loading doesn't actually help us because every constant is going to be referenced. However, it stops us from moving toward lazy methods because you need to keep the source around to get the constant pool loaded.
  2. We should move strings into a separate pool that is eagerly loaded when deserializing. I don't think this would be particularly difficult to do - and we could do it without impacting the way we parse by only building the string pool at the time of serialization. I think there's not really too much benefit in sharing in the same way we do with constants since strings are so much less likely to be shared.
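The eager-loading idea in point 1 could be sketched roughly as follows. The `ConstantPool` class and its (offset, length) layout are invented here for illustration; the point is only that once every entry is materialized up front, the source buffer can be dropped:

```ruby
# Hypothetical deserializer sketch: eagerly materialize every constant pool
# entry at load time so no node needs to reach back into the source later.

class ConstantPool
  def initialize(source, entries)
    # entries are (offset, length) pairs into the source buffer
    @values = entries.map { |offset, length| source.byteslice(offset, length) }
  end

  def [](index)
    @values[index]  # no reference back to the source needed
  end
end

source = "foo = bar"
pool = ConstantPool.new(source, [[0, 3], [6, 3]])
pool[0]  # => "foo"
```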

As for combining them all, the parser gem does that and I think it would be possible. What they do is have a parent dstr (our InterpolatedStringNode) that combines all of them together. For example:

"a #{b} c" "d" "e #{f} g"

would get parsed as:

(dstr
  (dstr
    (str "a ")
    (begin
      (send nil :b))
    (str " c"))
  (str "d")
  (dstr
    (str "e ")
    (begin
      (send nil :f))
    (str " g")))

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?
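The nesting described above could be modeled like this. The node classes are stand-ins invented for this sketch, not Prism's real API; the shape mirrors the S-expression in the previous comment:

```ruby
# Sketch of the proposed flattening: a concatenation of string literals
# becomes one parent interpolated-string node whose children are the
# original parts, each keeping its own opening/closing information.

StrNode  = Struct.new(:value)
DstrNode = Struct.new(:parts)

# "a #{b} c" "d" "e #{f} g"  =>  one DstrNode with three children
combined = DstrNode.new([
  DstrNode.new([StrNode.new("a "), :interpolation_b, StrNode.new(" c")]),
  StrNode.new("d"),
  DstrNode.new([StrNode.new("e "), :interpolation_f, StrNode.new(" g")]),
])

combined.parts.length  # => 3, and no concat node is needed
```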

@eregon
Member

eregon commented Sep 14, 2023

Re lazy parsing, my intention is to not deserialize lazy methods at all (i.e. not read their serialized bytes) until the method is executed.
The code in Loader doesn't really handle that yet, so there are still some things to figure out.

That can mean either keeping the source around, rereading it from disk, or indeed populating the constant pool eagerly, which is probably the best/simplest option.
Regarding non-constant-pool strings, right, we need to care about strings referring back to the source, and serializing them like the constant pool but without interning sounds like a good way to do it.

An alternative would be, when serializing, to copy & embed any string/constant pool entry into the serialized output, and therefore not need the source at all when deserializing.
That would mean a larger serialized size but no need to read the source at all, which could be a nice speedup when the serialized form is saved on disk and is at least as new as the source file.
I think the serialized size wouldn't be worse than before plus eagerly loading the constant pool and strings; it would be about the same, while gaining not having to read the source.
That might actually be fairly easy, and it would use less temporary memory than having the serialized form and the source in memory at the same time (but that's not a big deal).

What do you think of this alternative? I think it might be the best and also be fairly straightforward to implement.
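The embed-everything alternative amounts to a self-contained round trip. This sketch invents a trivial format (4-byte little-endian length prefix plus raw bytes) purely to show that deserialization then never touches the source:

```ruby
# Sketch of the "embed everything" alternative: write each string's bytes
# into the serialized buffer so deserializing needs no source file at all.
# The length-prefixed format here is invented for illustration.

def serialize_strings(strings)
  strings.map { |s| [s.bytesize].pack("L<") + s }.join
end

def deserialize_strings(buffer)
  strings = []
  offset = 0
  while offset < buffer.bytesize
    length = buffer.byteslice(offset, 4).unpack1("L<")
    strings << buffer.byteslice(offset + 4, length)
    offset += 4 + length
  end
  strings
end

round_trip = deserialize_strings(serialize_strings(["foo", "bar"]))
# round_trip == ["foo", "bar"], reconstructed without the source
```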

@eregon
Member

eregon commented Sep 14, 2023

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?

So that means replacing StringConcatNode with InterpolatedStringNode, right?
That sounds good.
It would be convenient to have a flag on that node when all children are static, i.e. there is no interpolation; maybe we can reuse the YP_NODE_FLAG_STATIC_LITERAL flag for that from #1451?
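The static-literal check being suggested could be as simple as the following. Node classes are again stand-ins for this sketch, not Prism's real types:

```ruby
# Sketch of the suggested flag: the combined node counts as a static
# literal when every child is a plain string with no interpolation.

StrNode           = Struct.new(:value)
InterpolationNode = Struct.new(:expression)

def static_literal?(parts)
  parts.all? { |part| part.is_a?(StrNode) }
end

static_literal?([StrNode.new("a"), StrNode.new("b")])          # => true
static_literal?([StrNode.new("a"), InterpolationNode.new(:b)]) # => false
```

In the real implementation this would presumably be computed once at parse time and stored as a node flag rather than recomputed per query.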

@enebo
Collaborator Author

enebo commented Sep 14, 2023

Re lazy parsing my intention is to not deserialize at all (i.e. not read the serialized bytes of) lazy methods (until that method is executed).

This is my aspirational goal as well. It may be that we eagerly put strings into the node when serializing in the first place (and still need to retain the constant pool).

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?

As @eregon just said, I believe you mean that this means using InterpolatedStringNode (legacy parsers call this DStr). I think that is a reasonable solution.
