-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce proliferation of cstrings #4481
Comments
Tagging @fruffy |
I agree there 100 %. We might need to provide some utility functions to make the migration easier. Some of the uses might be there just because some things are easier with cstring then with std::string (or because we relatively recently switched do C++17).
Could you please give a use-case for that? |
Just loud thinking – if one would need to have internalized string keys (e.g. for faster lookup), but definitely would worth starting from the proper usecase in the code |
I agree, I sometimes wonder whether the interning does more harm than good. It might make sense to rename In the case of P4Testgen, cstrings can work well, because the interpreter ends up allocating and reusing the same kinds of strings a lot. We fixed the majority of cstring leaks there and performed benchmarks running the interpreter for days without running out of memory, but that does not mean there aren't any cstring leaks left. |
I collected some statistics on downstream project. And the results are overwhelming. Essentially cstring cache works as sink, collecting everyone and their brother: all format strings, pieces of input, json values, ... Essentially after frontend the cstring cache contained 300k strings (!), midend raised this up to 311k (though we do not run much midend passes). Number of map lookups was great as well totalling 86M after frontend (each cstring ctor involves at least 1 map lookup, or 2 if the value should be stored). So, lots of hash table lookups, lots of string hashes, etc. I will try to see if there is some low-hanging fruit here... Likely cstring should be "storage-only" type for node members. But not a replacement for generic string type... |
I would guess the |
I would expect these to mostly end in the IR nodes, which is OK-ish. The problem is that One interesting thing could be to (temporarily?) disable1 the Similarly with Footnotes
|
Yes, exactly. I already started to look into various cstring-related things. Good news is that currently it is almost always done via There are extreme cases in backends. bpf / ebpf here are the most notable examples where |
#4694 is a painful commit, it breaks a lot of back ends. But it looks like it is necessary. It shows that there are a lot of unneeded cstring instantiations. For example, why do we need to cache all the options here: |
What is the cost of all these "unnecessary" cstrings? Yes an extra copy occurs when creating one, but that would happen with a |
It is not the total size that matters. Every cstring construction involves map lookup. Sometimes two lookups if the string is new and should be copied. Each lookup is hash calculation plus string equality check. The "cstring proliferation problem" is not that visible inside frontend / midend. Though we store bunch of stuff there: we do store the whole input several times (whole input lines, all parsed tokens including all comments and finally parsed identifiers). So, essentially we do use cstrings for something transient, we do use cstrings for string literals a lot in many places, where just I do see The real problem is backends. Backends tend to use So, the intention is to ensure that it would be hard to create |
cstrings are commonly used as "the string type" in p4c. However, they should be used with more care. The reason is that they are internalized, so each new string is stored forever in some global map.
This makes them great for various containers, both keys and values, for IR strings, etc. However, in many cases
cstring
s are used just for random transient strings. As a result:cstring
is constructedAs a result, we might observe some elevated memory usage especially for long-lasting processes. I definitely saw that ~1% of runtime of benchmark from #4475 is spent on string hash used for
cstring
internalization. I have not measured how many strings are incstring
hash, but I'd assume a lot.Instead, one should resort to
std::string
andstd::string_view
for transient values that are not required to be saved forever. Also, we might consider adding some kind ofStringMap
with string keys that would own keys and not pollute the common cstring cache.The text was updated successfully, but these errors were encountered: