Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API request: string.Intern(ReadOnlySpan<char> ...) #28368

Closed
ladeak opened this issue Jan 10, 2019 · 19 comments
Closed

API request: string.Intern(ReadOnlySpan<char> ...) #28368

ladeak opened this issue Jan 10, 2019 · 19 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime
Milestone

Comments

@ladeak
Copy link
Contributor

ladeak commented Jan 10, 2019

String.Intern currently accepts a string parameter. When processing a ReadOnlySpan<char> it would be nice to be able to intern a string without a necessary string allocation upfront. This would be especially useful for strings that would be used frequently.

For this reason I would request the following overloads added to string:

public class String
{
    // Existing APIs
    public static string Intern(string str);
    public static string? IsInterned(string str);

    // New APIs
    public static string Intern(ReadOnlySpan<char> str);
    public static bool IsInterned(ReadOnlySpan<char> str);
}
@khellang
Copy link
Member

public static string string.Intern(ReadOnlySpan<char> str)
public static string string.Intern(Span<char> str)

How are these supposed to work without allocating a string? You mean if a matching string is already in the intern pool?

@AlgorithmsAreCool
Copy link
Contributor

@khellang
Here is an example I think could use this API. I want to parse some token that we know is very common. So when we read the data we want to pass it to string.Intern(string) to ensure we are not creating duplicate strings.

But today we have to do this:

Token ProcessToken(Span<char> token)
{
    var temp = new String(token);

    return new Token {
        Value = string.Intern(temp);
    }
}

temp gets thrown away as garbage 99.9% of the time.

This would be nicer:

Token ProcessToken(Span<char> token)
{
    return new Token {
        Value = string.Intern(token);
    }
}

@jkotas
Copy link
Member

jkotas commented Jan 10, 2019

Note that string.Intern uses a global singleton. It is required to intern the string 100% of the time, across all threads. It means that it needs to take locks, some of the time at least. And the current implementation never releases the string once it has been interned.

You typically get much better performance results by using a local interning table that does imperfect interning (i.e. it is ok for it to create duplicate strings in rare cases - hash collisions, multiple threads competing to create the same string, etc.)

We can still add this API, but it is unlikely to be very useful to actually make things faster.

@danmoseley
Copy link
Member

Wouldn't we always recommend against string.Intern anyway for reasons @jkotas states? If so, I think we should close this rather than extend its API.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 5.0 milestone Feb 1, 2020
@mjsabby
Copy link
Contributor

mjsabby commented Mar 31, 2020

@jkotas @danmosemsft now that the C# compiler will be getting module initializers, and it already has support for embedding byte arrays in PE file can we please revisit this? With module initializers there becomes a more defined place (even if you consider it as a convention) to do this work.

This would be very useful to the Bing team.

@jkotas
Copy link
Member

jkotas commented Mar 31, 2020

I do not see how module initializers change the equation on this.

Could you please provide details for how you would like to use this API, and what do you expect to save by doing that?

@mjsabby
Copy link
Contributor

mjsabby commented Mar 31, 2020

We have a ton of string data (not strings, but UTF-16 bytes in our PE files). We would use module initializers to access this data in the file and call string.Intern on them.

@jkotas
Copy link
Member

jkotas commented Apr 2, 2020

Ok, I have been thinking about this and I see situations where this may be useful.

@jkotas jkotas reopened this Apr 2, 2020
@jkotas jkotas added api-ready-for-review and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 2, 2020
@GrabYourPitchforks
Copy link
Member

A big concern of mine is that folks might see a spanified version of string.Intern and be tempted to pass untrusted data through it in an attempt to be more allocation-efficient. For instance:

ReadOnlySpan<char> requestLine = GetRequestLine(); // = "GET /path HTTP/1.1" (as a span)
ReadOnlySpan<char> httpVerb = requestLine[..requestLine.IndexOf(' ')]; // = "GET" (as a span)
string verbAsInternedString = string.Intern(httpVerb); // = "GET" (as a string)

Basically, the sample above shows an "allocation-free" way of extracting the literal strings "GET" / "POST" / etc. from the request line, but it introduces a security flaw in the application. This allows the request payload to send arbitrary nonsensical verbs to the application, and those verbs will get added to a static singleton table which is never cleared. Eventually this will result in OOMing the web server process: a denial of service attack.

To discourage such use, I would instead recommend the following API:

public sealed class string
{
    public static bool TryGetInternedString(ReadOnlySpan<char> str, [NotNullWhen(true)] out string? internedString);
}

This allows you to query whether the provided buffer exists in the intern table using a familiar try pattern. If the buffer does not exist in the intern table, it outs null. If you want to add the buffer to the intern table, you're forced to bounce through ROS<char>.ToString first. The call site clearly performs an allocation to perform the interning operation, which reduces the odds that somebody sees this as convenient "allocation-free" API that should be called on a per-request basis or within a hot path.

@jkotas
Copy link
Member

jkotas commented May 19, 2020

@GrabYourPitchforks Good catch as always.

TryGetInternedString

The pit of failure with API like this is that it depends on the global intern table being populated with the string you care about. The population of the global intern table is not very predictable. For example, the behavior of this method would depend on how the JIT inliner works (inlining happens to populate the global intern table as side-effect). The existing IsInterned method has this problem too, but TryGetInternedString would increase the exposure.

This suggests that we should go back to the drawing board with this API.

@jkotas jkotas added api-needs-work API needs work before it is approved, it is NOT ready for implementation and removed api-ready-for-review labels May 19, 2020
@jkotas jkotas modified the milestones: 5.0, Future May 19, 2020
@GrabYourPitchforks
Copy link
Member

inlining happens to populate the global intern table as side-effect

This isn't quite my area of expertise. But doesn't this mean that if somebody calls string.Intern("Foo") at app start, then when a method containing ldstr "Foo" is compiled it'll end up pointing to the previously-interned string instance?

@stephentoub
Copy link
Member

Rather than integrating this with string.Intern, at least the scenarios I'm aware of here would be addressed if we had a Dictionary<string, ...>.TryGetValue(ReadOnlySpan<char>, ...) extension method, ala #27229. The sticking point I've seen there ends up being the comparer; we can do the "right thing" for comparers we know about (e.g. any built-in StringComparer), but for arbitrary comparers, we'd need to ToString the span to be able to invoke the comparer, which is far from ideal. I wonder whether we could stomach throwing from such a TryGetValue if the dictionary was using a comparer we didn't understand; in the future if we introduced an ISpanEqualityComparer or some such thing, we could make it work in more cases. But I imagine the 99% case for such an API will use one of the built-in StringComparers, probably Ordinal or OrdinalIgnoreCase.

@danmoseley
Copy link
Member

I'm not clear why we would want to double down on String.Intern, which was sort of recommended against as long as I can remember for reasons mentioned above and one of those is the one Levi points to. What might be worth doing is offering a first class interning interface that is pluggable. This was discussed here already:
#20073 (comment)

@jkotas
Copy link
Member

jkotas commented May 19, 2020

The scenario where @mjsabby 's wanted to use this API was a static initialization of large configuration data. It does not have the problem with denial-of-service and there are benefits from interning the program string constants and the configuration data keys against the same pool. It may be too niche case to warrant adding the API.

@jeffhandley
Copy link
Member

It may be too niche case to warrant adding the API.

I am not convinced we would want to create a new API for this case. The scenarios do see to be in niche cases, and there are risks of misuse that have been called out.

@mjsabby, @ladeak, and @AlgorithmsAreCool -- I am inclined to close this issue. Please let me know if you have strong objections to closing it.

@AlgorithmsAreCool
Copy link
Contributor

@jeffhandley

I say close it. I've stopped using trying to use string.intern. I almost think it should be marked Obsolete.

I'm thinking about opening a separate issue for a better string caching API in general.

@benaadams
Copy link
Member

Since this was closed; I wrote a package called Ben.StringIntern for an instance based intern pool and put on NuGet with api as follows:

public class InternPool
{
    [return: NotNullIfNotNull("value")]
    public string? Intern(string? value);
    public string InternAscii(ReadOnlySpan<byte> asciiValue);
    public string InternUtf8(ReadOnlySpan<byte> utf8Value);
    public string Intern(ReadOnlySpan<char> value);
}

Isn't thread safe so you'll have to handle your own locking; and haven't written tests yet so don't know if it works 😅; so YMMV

@GrabYourPitchforks
Copy link
Member

Isn't thread safe so you'll have to handle your own locking

Isn't that what [MethodImpl(MethodImplOptions.Synchronized)] is for?

...
...

I'll show myself out...

@benaadams
Copy link
Member

Added a test, fixed some bugs, bumped to 0.0.2

@ghost ghost locked as resolved and limited conversation to collaborators Jan 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Runtime
Projects
None yet
Development

No branches or pull requests