
Consider combining "foo" "bar" into single *StringNode. #1479

Closed
enebo opened this issue Sep 13, 2023 · 8 comments · Fixed by #1799
Labels
enhancement New feature or request
Comments

@enebo
Collaborator

enebo commented Sep 13, 2023

The build logic for this is pretty ugly:

"a" "b"
"a #{oh}" "b"
"a" "b #{oh}"
"a #{oh}" "b #{oh}"

Legacy parsers do this combination during parse. The only reason I can think of not to add this enhancement is that syntax tooling probably needs to know about it. With that said, the current approach:

  1. uses extra post-processing (wastes time)
  2. is bigger (wastes space)

I keep thinking we need two profiles (semantics, syntax) during parse, each doing the thing that type of tool wants.

@kddnewton
Collaborator

Right now, if you have "foo" "bar", then each unescaped field on the two strings is 1 byte for the offset and 1 byte for the length. But if we concat them together, then we're forced to allocate our own strings, since we can't reference the source anymore. So now we'll have 1 byte for the length and then 6 bytes for the content. It will very quickly end up using more memory unless the strings are extremely small.
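To make the tradeoff concrete, here is a rough back-of-the-envelope sketch. The per-field byte counts follow the comment above and are purely illustrative; this is not Prism's actual wire format:

```ruby
# Illustrative only: compares the serialized footprint of referencing the
# source (offset + length per string) versus embedding concatenated content
# (length prefix + raw bytes), using the per-field sizes from the comment above.

def referenced_size(strings)
  # one offset byte + one length byte per unescaped field
  strings.length * 2
end

def embedded_size(strings)
  combined = strings.join
  # one length byte plus the content itself
  1 + combined.bytesize
end

parts = ["foo", "bar"]
referenced_size(parts)  # => 4, referencing the source
embedded_size(parts)    # => 7, embedding "foobar"
```

The crossover point depends entirely on string length, which is why embedding loses for anything but very short literals.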

Of course that doesn't address the issue of post-processing. But I'm guessing it wouldn't be worth increasing the size of the serialized buffer?

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton are you saying the long-term goal will be to keep the source around in order to build strings? If so, I believe this will force us to abandon the idea of lazy method evaluation (i.e. we would be keeping both the serialized form and the source in memory).

If it ends up in a constant pool, then it will become less memory. The interpolated case will use less memory (although admittedly it is much less common) either way.

This could be a case of something serialization wants but the main parser doesn't. Has compile.c tackled these mixed interpolated cases yet? I am curious whether they prefer this as-is.

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton I may be describing this wrong. I see that loadString will populate from the source when making the tree, so we can do lazy method building, but it means that we need to fully generate its AST. I am OK with this, and you can ignore my previous comment beyond it being aspirational.

By aspirational I mean the dream would be to save offsets to a method and not generate a tree for that section. In that case we would need strings to live in a pool and not in the source.

@enebo
Collaborator Author

enebo commented Sep 13, 2023

@kddnewton OK, another idea: StringConcat always contains plain Strings, and when mixed with an InterpolatedString it is combined into the InterpolatedString during parse? That would make the complexity much lower. It should reduce the size of interpolated cases (and just be something we can already parse).

@kddnewton
Collaborator

I think there are a couple of things we can take away from this that can move us toward the aspirational goal:

  1. We should eagerly load the constant pool. In Ruby and C, we do this when deserializing by looping through each constant and getting the value out. In Java, it lazily loads the constant based on when one of the nodes needs it. Right now, since we're deserializing the entire tree, the lazy loading doesn't actually help us because every constant is going to be referenced. However, it stops us from moving toward lazy methods because you need to keep the source around to get the constant pool loaded.
  2. We should move strings into a separate pool that is eagerly loaded when deserializing. I don't think this would be particularly difficult to do - and we could do it without impacting the way we parse by only building the string pool at the time of serialization. I think there's not really too much benefit in sharing in the same way we do with constants since strings are so much less likely to be shared.
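The eager-loading idea in point 1 could be sketched roughly as follows. The `ConstantPool` class and its (offset, length) layout are invented here for illustration; the point is only that once every entry is materialized up front, the source buffer can be dropped:

```ruby
# Hypothetical deserializer sketch: eagerly materialize every constant pool
# entry at load time so no node needs to reach back into the source later.

class ConstantPool
  def initialize(source, entries)
    # entries are (offset, length) pairs into the source buffer
    @values = entries.map { |offset, length| source.byteslice(offset, length) }
  end

  def [](index)
    @values[index]  # no reference back to the source needed
  end
end

source = "foo = bar"
pool = ConstantPool.new(source, [[0, 3], [6, 3]])
pool[0]  # => "foo"
```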

As for combining them all, the parser gem does that and I think it would be possible. What they do is have a parent dstr (our InterpolatedStringNode) that combines all of them together. For example:

"a #{b} c" "d" "e #{f} g"

would get parsed as:

(dstr
  (dstr
    (str "a ")
    (begin
      (send nil :b))
    (str " c"))
  (str "d")
  (dstr
    (str "e ")
    (begin
      (send nil :f))
    (str " g")))

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?
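The nesting described above could be modeled like this. The node classes are stand-ins invented for this sketch, not Prism's real API; the shape mirrors the S-expression in the previous comment:

```ruby
# Sketch of the proposed flattening: a concatenation of string literals
# becomes one parent interpolated-string node whose children are the
# original parts, each keeping its own opening/closing information.

StrNode  = Struct.new(:value)
DstrNode = Struct.new(:parts)

# "a #{b} c" "d" "e #{f} g"  =>  one DstrNode with three children
combined = DstrNode.new([
  DstrNode.new([StrNode.new("a "), :interpolation_b, StrNode.new(" c")]),
  StrNode.new("d"),
  DstrNode.new([StrNode.new("e "), :interpolation_f, StrNode.new(" g")]),
])

combined.parts.length  # => 3, and no concat node is needed
```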

@eregon
Member

eregon commented Sep 14, 2023

Re lazy parsing, my intention is to not deserialize lazy methods at all (i.e. not read their serialized bytes) until the method is executed.
The code in Loader doesn't really handle that yet, so there are still some things to figure out.

That can mean either keeping the source around, rereading it from disk, or indeed populating the constant pool eagerly, which is probably the best/simplest option.
Regarding non-constant-pool strings, right, we need to care about strings referring back to the source, and serializing them like the constant pool but without interning sounds like a good way to do it.

An alternative would be, when serializing, to copy & embed any string/constant pool entry into the serialized output, and therefore not need the source at all when deserializing.
That would mean a larger serialized size but no need to read the source at all, which could be a nice speedup when the serialized form is saved on disk and is at least as new as the source file.
I think the serialized size wouldn't be worse than before plus eagerly loading the constant pool and strings; it would be about the same, while gaining not having to read the source.
That might actually be fairly easy, and it would use less temporary memory than having the serialized form and the source in memory at the same time (but that's not a big deal).

What do you think of this alternative? I think it might be the best and also be fairly straightforward to implement.
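The embed-everything alternative amounts to a self-contained round trip. This sketch invents a trivial format (4-byte little-endian length prefix plus raw bytes) purely to show that deserialization then never touches the source:

```ruby
# Sketch of the "embed everything" alternative: write each string's bytes
# into the serialized buffer so deserializing needs no source file at all.
# The length-prefixed format here is invented for illustration.

def serialize_strings(strings)
  strings.map { |s| [s.bytesize].pack("L<") + s }.join
end

def deserialize_strings(buffer)
  strings = []
  offset = 0
  while offset < buffer.bytesize
    length = buffer.byteslice(offset, 4).unpack1("L<")
    strings << buffer.byteslice(offset + 4, length)
    offset += 4 + length
  end
  strings
end

round_trip = deserialize_strings(serialize_strings(["foo", "bar"]))
# round_trip == ["foo", "bar"], reconstructed without the source
```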

@eregon
Member

eregon commented Sep 14, 2023

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?

So that means replacing StringConcatNode with InterpolatedStringNode, right?
That sounds good.
It would be convenient to have a flag on that node when all children are static, i.e. there is no interpolation; maybe we can reuse the YP_NODE_FLAG_STATIC_LITERAL flag for that from #1451?
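The static-literal check being suggested could be as simple as the following. Node classes are again stand-ins for this sketch, not Prism's real types:

```ruby
# Sketch of the suggested flag: the combined node counts as a static
# literal when every child is a plain string with no interpolation.

StrNode           = Struct.new(:value)
InterpolationNode = Struct.new(:expression)

def static_literal?(parts)
  parts.all? { |part| part.is_a?(StrNode) }
end

static_literal?([StrNode.new("a"), StrNode.new("b")])          # => true
static_literal?([StrNode.new("a"), InterpolationNode.new(:b)]) # => false
```

In the real implementation this would presumably be computed once at parse time and stored as a node flag rather than recomputed per query.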

@enebo
Collaborator Author

enebo commented Sep 14, 2023

Re lazy parsing my intention is to not deserialize at all (i.e. not read the serialized bytes of) lazy methods (until that method is executed).

This is my aspirational goal as well. It may be that we eagerly put strings into the node when serializing in the first place (and still need to retain the constant pool).

This works because we still get to keep all of the opening/closing/location information for the child strings, and it removes the need to have a string concat node at all so at least it can be consistently compiled. What do you think of that approach?

As @eregon just said, I believe you mean that this means using InterpolatedStringNode (legacy parsers call this DStr). I think that is a reasonable solution.
