Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce proliferation of cstrings #4481

Open
asl opened this issue Feb 28, 2024 · 11 comments
Open

Reduce proliferation of cstrings #4481

asl opened this issue Feb 28, 2024 · 11 comments
Labels
enhancement This topic discusses an improvement to existing compiler code.

Comments

@asl
Copy link
Contributor

asl commented Feb 28, 2024

cstrings are commonly used as "the string type" in p4c. However, they should be used with more care. The reason is that they are internalized, so each new string is stored forever in some global map.

This makes them great for various containers, both keys and values, for IR strings, etc. However, in many cases cstrings are used just for random transient strings. As a result:

  • Copies are created every time cstring is constructed
  • These copies are stuck forever

As a result, we might observe some elevated memory usage especially for long-lasting processes. I definitely saw that ~1% of runtime of benchmark from #4475 is spent on string hash used for cstring internalization. I have not measured how many strings are in cstring hash, but I'd assume a lot.

Instead, one should resort to std::string and std::string_view for transient values that are not required to be saved forever. Also, we might consider adding some kind of StringMap with string keys that would own keys and not pollute the common cstring cache.

@asl asl added the enhancement This topic discusses an improvement to existing compiler code. label Feb 28, 2024
@asl
Copy link
Contributor Author

asl commented Feb 28, 2024

Tagging @fruffy

@vlstill
Copy link
Contributor

vlstill commented Feb 28, 2024

Instead, one should resort to std::string and std::string_view for transient values that are not required to be saved forever.

I agree there 100 %. We might need to provide some utility functions to make the migration easier. Some of the uses might be there just because some things are easier with cstring then with std::string (or because we relatively recently switched do C++17).

Also, we might consider adding some kind of StringMap with string keys that would own keys and not pollute the common cstring cache.

Could you please give a use-case for that?

@asl
Copy link
Contributor Author

asl commented Feb 28, 2024

Could you please give a use-case for that?

Just loud thinking – if one would need to have internalized string keys (e.g. for faster lookup), but definitely would worth starting from the proper usecase in the code

@fruffy
Copy link
Collaborator

fruffy commented Feb 28, 2024

I agree, I sometimes wonder whether the interning does more harm than good. It might make sense to rename cstring to something like CachedString. Simply to make it explicit to users that this string will stay around. Or use an explicit StringMap, as you suggest.

In the case of P4Testgen, cstrings can work well, because the interpreter ends up allocating and reusing the same kinds of strings a lot. We fixed the majority of cstring leaks there and performed benchmarks running the interpreter for days without running out of memory, but that does not mean there aren't any cstring leaks left.

@asl
Copy link
Contributor Author

asl commented Mar 8, 2024

I collected some statistics on downstream project. And the results are overwhelming. Essentially cstring cache works as sink, collecting everyone and their brother: all format strings, pieces of input, json values, ...

Essentially after frontend the cstring cache contained 300k strings (!), midend raised this up to 311k (though we do not run much midend passes). Number of map lookups was great as well totalling 86M after frontend (each cstring ctor involves at least 1 map lookup, or 2 if the value should be stored).

So, lots of hash table lookups, lots of string hashes, etc.

I will try to see if there is some low-hanging fruit here... Likely cstring should be "storage-only" type for node members. But not a replacement for generic string type...

@fruffy
Copy link
Collaborator

fruffy commented Mar 8, 2024

I will try to see if there is some low-hanging fruit here..

I would guess the newName() method of the ReferenceMap has a lot of weight there. There are a lot of temporary names generated with it.

@vlstill
Copy link
Contributor

vlstill commented May 29, 2024

I will try to see if there is some low-hanging fruit here..

I would guess the newName() method of the ReferenceMap has a lot of weight there. There are a lot of temporary names generated with it.

I would expect these to mostly end in the IR nodes, which is OK-ish. The problem is that cstring is convenient, in some ways more convenient than std::string so programmers just tend to use it as there is no big hurtle in using it that could prompt them to ask if it is a good idea :-). Error-handling-related code definitely should not use cstring.

One interesting thing could be to (temporarily?) disable1 the operator+/operator+= for cstring... this is almost never a good idea to use as there is a high chance that an intermediate value will get cached.

Similarly with to_cstring, join... maybe we could replace them with freestanding functions on string for the stuff that is not already covered over std::string. Stuff like substr, replace, etc. should return std::string (or std::string_view in cases like substr). I'm not sure how much would this break for backends and that will be always hard to know when we don't even know all the 3rd party backends that are based on P4C...

Footnotes

  1. Or we could start by making a lot of these functions [[deprecated]] and, clean up our code (possibly with some targetted -Werror in CI), and let the backends decide for themselves. But this cannot be done in cases of changing types or conversions of course.

@asl
Copy link
Contributor Author

asl commented May 29, 2024

One interesting thing could be to (temporarily?) disable1 the operator+/operator+= for cstring... this is almost never a good idea to use as there is a high chance that an intermediate value will get cached.

Yes, exactly. I already started to look into various cstring-related things. Good news is that currently it is almost always done via std::string. At least in toString methods of various nodes. There are few notable cases, where I fixed this.

There are extreme cases in backends. bpf / ebpf here are the most notable examples where cstring's are created for almost everything for no particular reasons :) Overall I'd expect that this would be the typical case for downstream backends as far as I can see from few (limited) cases ;)

@fruffy
Copy link
Collaborator

fruffy commented Jun 7, 2024

#4694 is a painful commit, it breaks a lot of back ends. But it looks like it is necessary. It shows that there are a lot of unneeded cstring instantiations. For example, why do we need to cache all the options here:
https://github.com/p4lang/p4c/blob/main/frontends/common/parser_options.h#L60

@ChrisDodd
Copy link
Contributor

What is the cost of all these "unnecessary" cstrings? Yes an extra copy occurs when creating one, but that would happen with a std::string too. Yes, they are (currently) not garbage collected, but the total size of all cstrings is trivial compared to the total heap size -- generally less than 0.01% I'd be concerned with doing a lot of work that would have no (or even negative) overall effect.

@asl
Copy link
Contributor Author

asl commented Jun 8, 2024

Yes, they are (currently) not garbage collected, but the total size of all cstrings is trivial compared to the total heap size -- generally less than 0.01% I'd be concerned with doing a lot of work that would have no (or even negative) overall effect.

It is not the total size that matters. Every cstring construction involves map lookup. Sometimes two lookups if the string is new and should be copied. Each lookup is hash calculation plus string equality check. The "cstring proliferation problem" is not that visible inside frontend / midend. Though we store bunch of stuff there: we do store the whole input several times (whole input lines, all parsed tokens including all comments and finally parsed identifiers). So, essentially we do use cstrings for something transient, we do use cstrings for string literals a lot in many places, where just string_view will be enough, etc., etc.

I do see cstring::save_to_cache in the profiles and sometimes it is more than just 1%. Also, the less we store, less the GC overhead. As #4694 (comment) shows, just allocating cstring cache in non GC-ed arena yields 3% improvement due to less GC work here. Overall, all these strings is something for GC to scan for pointers and liveness.

The real problem is backends. Backends tend to use cstring as "the P4C string type". And this is just plain wrong. Besides the statistics presented in #4481 (comment) (300k cstrings created in frontend with 86M cstring cache map lookups there), the downstream backend increased the number of strings saved to 2M.

So, the intention is to ensure that it would be hard to create cstring implicitly. Also, these implicit conversions make string_view argument overloads impossible due to ambiguity. And we already obtained the speedup :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement This topic discusses an improvement to existing compiler code.
Projects
None yet
Development

No branches or pull requests

4 participants