
Allocate string literals on frozen segments #49576

Merged
merged 69 commits on Sep 13, 2022

Conversation

EgorBo
Member

@EgorBo EgorBo commented Mar 13, 2021

UPD: This PR allocates all string literals (from non-collectible assemblies and dynamic contexts) on the Frozen Object Heap, which allows us to optimize away a redundant indirection:

static string Test() => "test";

Was:

       48B810B70098A3020000 mov      rax, 0x2A39800B710
       488B00               mov      rax, gword ptr [rax]
       C3                   ret      
; Total bytes of code 14

Now:

       48B818D1466998020000 mov      rax, 0x2986946D118
       C3                   ret      
; Total bytes of code 11

Inspired by @jkotas's comment #49429 (comment)

We can do the same trick for other things: typeof(), static fields, etc
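The codegen win above can be sketched in C terms (an illustrative sketch only, not CoreCLR code; all names here are hypothetical):

```c
/* With a movable string object, the JIT must publish the literal through a
 * pinned handle-table slot and emit an extra load through that slot. A
 * frozen object never moves, so its address can be baked into the
 * generated code as an immediate, eliminating the load. */
static const char frozen_literal[] = "test";      /* object on a frozen segment */
static const char *pinned_slot = frozen_literal;  /* handle-table slot */

/* Was: load the slot, then the slot's contents (two instructions). */
static const char *test_was(void) { return pinned_slot; }

/* Now: the object address itself is the immediate (one instruction). */
static const char *test_now(void) { return frozen_literal; }
```

Both functions return the same object; only the number of loads differs, which is exactly the `mov`/`ret` difference in the disassembly above.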

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@jkotas
Member

jkotas commented Mar 13, 2021

Do we need to save string objects in that "pinned table" for such cases?

Yes. From ECMA spec: the CLI guarantees that the result of two ldstr instructions referring to two metadata tokens that have the same sequence of characters, return precisely the same string object (a process known as “string interning”).
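The interning guarantee jkotas cites can be sketched with a toy intern table in C (hypothetical and far simpler than the runtime's GlobalStringLiteralMap, but it shows why the table is needed):

```c
#include <stdlib.h>
#include <string.h>

/* Toy intern table: the first lookup of a character sequence allocates the
 * string object; every later lookup of the same sequence must return the
 * very same pointer, mirroring what ECMA requires of ldstr. */
#define MAX_LITERALS 64
static const char *interned[MAX_LITERALS];
static size_t interned_count;

static const char *intern_literal(const char *chars)
{
    for (size_t i = 0; i < interned_count; i++)
        if (strcmp(interned[i], chars) == 0)
            return interned[i];            /* same sequence -> same object */
    char *obj = malloc(strlen(chars) + 1); /* first use: allocate once */
    strcpy(obj, chars);
    return interned[interned_count++] = obj;
}
```

Without such a table, two `ldstr` sites with identical characters could each allocate their own object, breaking reference-equality expectations.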

@jkotas
Member

jkotas commented Mar 13, 2021

Should we prefer LOH over POH for really large strings

It is a GC internal implementation detail. We are asking the GC to allocate a pinned object. It is up to the GC to do it as best as it can.

@@ -11798,7 +11798,17 @@ InfoAccessType CEEJitInfo::constructStringLiteral(CORINFO_MODULE_HANDLE scopeHnd
     }
     else
     {
         *ppValue = (LPVOID)ConstructStringLiteral(scopeHnd, metaTok); // throws
+        if (!GetModule(scopeHnd)->IsCollectible() && !IsCompilingForNGen() && !IsReadyToRunCompilation())
Member
Suggested change
if (!GetModule(scopeHnd)->IsCollectible() && !IsCompilingForNGen() && !IsReadyToRunCompilation())
if (!GetModule(scopeHnd)->IsCollectible())

This method is never called for those cases.

Member

It may be a good idea to do a similar change for R2R. The change for that would be in crossgen2.

Member

I am wondering if it makes sense for R2R; since R2R code cannot access the string literal directly anyway, we probably cannot save in terms of code size.

Member

We can save one indirection (ie one instruction) for R2R.

@jkotas
Member

jkotas commented Mar 13, 2021

cc @Maoni0 @cshung More uses of pinned heap

@@ -489,7 +489,7 @@ StringLiteralEntry *GlobalStringLiteralMap::AddStringLiteral(EEStringData *pStri
     {
         PinnedHeapHandleBlockHolder pStrObj(&m_PinnedHeapHandleTable,1);
         // Create the COM+ string object.
-        STRINGREF strObj = AllocateStringObject(pStringData);
+        STRINGREF strObj = AllocateStringObject(pStringData, true); // TODO: don't use POH for collectible assemblies?
Member

I do not have a good sense of how much we should try to avoid fragmentation of the POH. It may be ok to allocate these on the POH even for the collectible assemblies, at least to start with. @Maoni0 @cshung Any thoughts?

Member

Is it possible that by frequently loading and unloading collectible assemblies, these string literals turn into lots of small holes in the POH that we effectively cannot reuse? If so, I think it is better for us not to allocate them on the POH.

@EgorBo EgorBo marked this pull request as ready for review March 15, 2021 12:06
@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 15, 2021
@cshung
Member

cshung commented Mar 15, 2021

Is there a scenario where we can demonstrate how well this optimization works compared to what it was?

@EgorBo
Member Author

EgorBo commented Mar 15, 2021

Is there a scenario where we can demonstrate how well this optimization works compared to what it was?

Will post a jit-diff. Initially I wanted to optimize typeof(x), where currently we have to emit a helper call every time, but if we allocate all type objects on the POH we can do exactly the same thing. I decided to start with strings as a test.

@EgorBo
Member Author

EgorBo commented Mar 16, 2021

jit-diff is quite promising (--pmi -f):

PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 53043631
Total bytes of diff: 52685864
Total bytes of delta: -357767 (-0.67% of base)
    diff is an improvement.


Top file regressions (bytes):
           2 : System.Diagnostics.FileVersionInfo.dasm (0.05% of base)

Top file improvements (bytes):
      -84762 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-2.64% of base)
      -35210 : System.Private.Xml.dasm (-1.00% of base)
      -17647 : System.Private.CoreLib.dasm (-0.38% of base)
      -17574 : System.Data.Common.dasm (-1.15% of base)
      -17420 : System.Management.dasm (-4.54% of base)
      -16231 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.29% of base)
      -11930 : System.Collections.Immutable.dasm (-0.92% of base)
      -11832 : Microsoft.CodeAnalysis.CSharp.dasm (-0.27% of base)
       -7995 : Microsoft.CodeAnalysis.dasm (-0.44% of base)
       -6880 : System.Private.DataContractSerialization.dasm (-0.91% of base)
       -6716 : FSharp.Core.dasm (-0.18% of base)
       -5743 : System.Net.Http.dasm (-0.85% of base)
       -5532 : System.Linq.Expressions.dasm (-0.71% of base)
       -5221 : System.DirectoryServices.AccountManagement.dasm (-1.42% of base)
       -4045 : System.Data.OleDb.dasm (-1.38% of base)
       -3635 : dotnet-Microsoft.XmlSerializer.Generator.dasm (-10.56% of base)
       -3603 : System.Net.HttpListener.dasm (-1.72% of base)
       -3585 : System.Speech.dasm (-0.85% of base)
       -3390 : System.CodeDom.dasm (-1.92% of base)
       -3243 : Newtonsoft.Json.dasm (-0.38% of base)

189 total files with Code Size differences (188 improved, 1 regressed), 82 unchanged.

Top method regressions (bytes):
        1797 (20.49% of base) : System.Private.Xml.dasm - SchemaNames:.ctor(XmlNameTable):this
         247 ( 1.67% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - AspNetTraceEventParser:EnumerateTemplates(Func`3,Action`1):this
         219 (10.39% of base) : System.Management.dasm - WqlEventQuery:ParseQuery(String):this
         183 ( 6.77% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BlockContext:CreateMissingEnd(ushort,byref):StatementSyntax:this
         177 (590.00% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(Stream,XmlNameTable):this
         167 (66.27% of base) : System.Private.Xml.dasm - XmlDocument:Load(Stream):this
         157 (152.43% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(Stream,XmlNameTable):this
         152 (194.87% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(Stream):this
         150 (112.78% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(Stream):this
         143 ( 6.17% of base) : System.Management.dasm - SelectQuery:ParseQuery(String):this
         141 (47.32% of base) : System.Private.Xml.dasm - XPathDocument:.ctor(Stream):this
         131 ( 0.84% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - MicrosoftAntimalwareEngineTraceEventParser:EnumerateTemplates(Func`3,Action`1):this
         113 ( 1.88% of base) : System.Reflection.Metadata.dasm - MetadataReader:InitializeProjectedTypes()
         108 ( 3.71% of base) : System.Private.Xml.dasm - XmlSerializationReaderILGen:WriteMemberBegin(ref):this
         102 (340.00% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(TextReader,XmlNameTable):this
          94 (23.86% of base) : System.Configuration.ConfigurationManager.dasm - ConfigXmlDocument:LoadSingleElement(String,XmlTextReader):this
          86 (34.13% of base) : System.Private.Xml.dasm - XmlDocument:Load(TextReader):this
          86 (30.60% of base) : System.Private.Xml.dasm - XmlDocument:LoadXml(String):this
          82 (79.61% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(TextReader,XmlNameTable):this
          77 (98.72% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(TextReader):this

Top method improvements (bytes):
      -12403 (-19.35% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - CtfTraceEventSource:InitEventMap():Dictionary`2
      -10796 (-9.90% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ApplicationServerTraceEventParser:EnumerateTemplates(Func`3,Action`1):this
       -3794 (-8.59% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - KernelTraceEventParser:EnumerateTemplates(Func`3,Action`1):this
       -3279 (-20.45% of base) : System.Management.dasm - ManagementClassGenerator:GenerateTypeConverterClass():CodeTypeDeclaration:this
       -2803 (-18.20% of base) : System.Management.dasm - ManagementClassGenerator:AddToDateTimeFunction():this
       -2467 (-19.64% of base) : System.Management.dasm - ManagementClassGenerator:AddToDMTFDateTimeFunction():this
       -2175 (-19.55% of base) : System.Management.dasm - ManagementClassGenerator:AddToDMTFTimeIntervalFunction():this
       -2141 (-21.68% of base) : System.Management.dasm - ManagementClassGenerator:AddToTimeSpanFunction():this
       -1644 (-14.93% of base) : System.ComponentModel.TypeConverter.dasm - CultureInfoMapper:CreateMap():Dictionary`2
       -1530 (-14.86% of base) : System.Private.CoreLib.dasm - CultureData:get_RegionNames():Dictionary`2
       -1272 (-19.78% of base) : System.Private.Xml.dasm - XsltLoader:.ctor():this
       -1261 (-3.89% of base) : Microsoft.CodeAnalysis.dasm - DesktopAssemblyIdentityComparer:.cctor()
       -1053 (-7.04% of base) : System.Data.Common.dasm - XmlTreeGen:HandleTable(DataTable,XmlDocument,XmlElement,bool):XmlElement:this
        -900 (-3.87% of base) : System.Text.RegularExpressions.dasm - RegexCharClass:.cctor()
        -882 (-10.82% of base) : Microsoft.CodeAnalysis.dasm - WellKnownMembers:.cctor()
        -771 (-16.74% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SyntaxFacts:GetText(ushort):String
        -760 (-17.16% of base) : System.Private.Xml.dasm - TypeScope:AddSoapEncodedTypes(String)
        -702 (-2.38% of base) : System.Private.Xml.dasm - XmlILMethods:.cctor()
        -697 (-12.30% of base) : System.Data.Common.dasm - XSDSchema:.cctor()
        -656 (-14.12% of base) : System.DirectoryServices.AccountManagement.dasm - FilterFactory:.cctor()

Top method regressions (percentages):
         177 (590.00% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(Stream,XmlNameTable):this
          68 (377.78% of base) : System.Private.Xml.dasm - XmlAttributeOverrides:get_Item(Type):XmlAttributes:this
         102 (340.00% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(TextReader,XmlNameTable):this
          40 (222.22% of base) : System.Private.Xml.dasm - XmlQualifiedName:.ctor(String):this
         152 (194.87% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(Stream):this
         157 (152.43% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(Stream,XmlNameTable):this
          24 (114.29% of base) : System.Private.Xml.dasm - XmlQualifiedName:.ctor():this
         150 (112.78% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(Stream):this
          77 (98.72% of base) : System.Private.Xml.dasm - XmlTextReaderImpl:.ctor(TextReader):this
          82 (79.61% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(TextReader,XmlNameTable):this
         167 (66.27% of base) : System.Private.Xml.dasm - XmlDocument:Load(Stream):this
          75 (56.39% of base) : System.Private.Xml.dasm - XmlTextReader:.ctor(TextReader):this
          64 (51.61% of base) : System.Private.Xml.dasm - XmlReflectionImporter:GetAttributes(Type,bool):XmlAttributes:this
         141 (47.32% of base) : System.Private.Xml.dasm - XPathDocument:.ctor(Stream):this
          46 (45.54% of base) : System.Private.DataContractSerialization.dasm - NamespaceManager:.ctor():this
          38 (40.00% of base) : System.Net.Primitives.dasm - SystemNetworkCredential:.ctor():this
          38 (40.00% of base) : System.Net.Primitives.dasm - NetworkCredential:.ctor():this
          48 (35.56% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - FilterParams:.ctor():this
          86 (34.13% of base) : System.Private.Xml.dasm - XmlDocument:Load(TextReader):this
          86 (30.60% of base) : System.Private.Xml.dasm - XmlDocument:LoadXml(String):this

Top method improvements (percentages):
         -74 (-70.48% of base) : System.Composition.AttributedModel.dasm - SharedAttribute:.ctor():this
         -74 (-63.25% of base) : System.Composition.AttributedModel.dasm - SharedAttribute:.ctor(String):this
         -31 (-55.36% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Scanner:MakeEndOfInterpolatedStringToken():SyntaxToken:this
         -13 (-38.24% of base) : Microsoft.Extensions.Hosting.Systemd.dasm - <>c:<UseSystemd>b__0_1(ConsoleLoggerOptions):this
         -11 (-37.93% of base) : System.Text.Json.dasm - ThrowHelper:GetInvalidOperationException_ExpectedNumber(ubyte):InvalidOperationException
         -11 (-37.93% of base) : System.Text.Json.dasm - ThrowHelper:GetInvalidOperationException_ExpectedBoolean(ubyte):InvalidOperationException
         -11 (-37.93% of base) : System.Text.Json.dasm - ThrowHelper:GetInvalidOperationException_ExpectedString(ubyte):InvalidOperationException
         -11 (-37.93% of base) : System.Text.Json.dasm - ThrowHelper:GetInvalidOperationException_ExpectedComment(ubyte):InvalidOperationException
         -11 (-37.93% of base) : System.Text.Json.dasm - ThrowHelper:GetInvalidOperationException_ExpectedChar(ubyte):InvalidOperationException
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlChars:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlDateTime:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlDecimal:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlDouble:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlGuid:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlInt16:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlInt32:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlInt64:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlMoney:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlSingle:GetXsdType(XmlSchemaSet):XmlQualifiedName
         -42 (-37.50% of base) : System.Data.Common.dasm - SqlString:GetXsdType(XmlSchemaSet):XmlQualifiedName

36750 total methods with Code Size differences (36250 improved, 500 regressed), 231451 unchanged.

It not only removes indirect loads but also folds some null checks we don't fold today (example: sharplab.io). /cc @dotnet/jit-contrib

There are regressions because indirect loads become constants, and the JIT intentionally prefers not to do CSE for constants (see enableConstCSE or JitConstCSE).

PS: We might want to lower the weight of the ldstr token to let the inliner inline more.

@Maoni0
Member

Maoni0 commented Mar 17, 2021

I'm wondering how the improvement shown by jit-diff translates into actual perf wins. This change means we could potentially be pinning far more objects than we do now. Even though putting these on their own heap groups them together, that group still cannot move, which prevents the GC from compacting them away to form larger free spaces around them (and this will affect regions more than segments). The question is whether this is worth the speed increase; of course, that is scenario dependent. Do we know of some customer scenarios where we could try this change and demonstrate its end result, i.e., including the GC perf difference?

@jkotas
Member

jkotas commented Mar 17, 2021

The string literals live till the end of the process, so there won't be any fragmentation caused by these string literals alone. There is going to be fragmentation only when these very long-lived string literals are mixed with short-lived objects.

The string literals have the same characteristics as the statics arrays that Andrew has moved to the pinned heap as well. Are we worried about POH fragmentation from the statics arrays?

I think we need to have a simple rule about whether it is ok to put objects that live till the end of the process on the POH. If it is ok, we should take advantage of it anywhere we see an opportunity to gain something from it, no matter how small the gain is.

@jkotas
Member

jkotas commented Mar 17, 2021

Would it help if we could tell the GC that the object being allocated is going to live till the end of the process?

@Maoni0
Member

Maoni0 commented Mar 17, 2021

The object Andrew moved to the POH is just the spine. This is pinning both the spine and the objects the spine points to, no?

The concern I have is whether we need to allocate more of these as the process runs; if that's never the case (and we don't allow collectible assemblies), then it's fine. Is this purely during startup?

@redknightlois

redknightlois commented Mar 17, 2021

the question is if this is worth the speed increase. of course this is scenario dependent. do we know of some customer scenarios where we could try this change with that demonstrates the end result of this change, ie, including the GC perf difference?

@Maoni0 Recently I needed to optimize some pretty heavy type conversion routines. If you can improve Type as mentioned by @EgorBo, we have a very performance-sensitive scenario to try it in.

@jkotas
Member

jkotas commented Mar 17, 2021

this is pinning both the spine and the objects the spine points to, no?

Yes, this is pinning objects that the spine points to. The string literals are typically going to be allocated at the same time we are allocating new spines.

is this purely during startup?

Yes, the string literals are allocated during warm up. Once the process is warmed up and there is no new code being JITed anymore, no new string literals are going to be allocated. The full warm up can take a very long time in large apps, even hours, until all code paths through the app get exercised.

@Maoni0
Member

Maoni0 commented Mar 17, 2021

Once the process is warmed up and there is no new code being JITed anymore,

@jkotas no new jitting can occur as the process runs? even after that potentially long warmup stage?

@jkotas
Member

jkotas commented Mar 17, 2021

@jkotas no new jitting can occur as the process runs? even after that potentially long warmup stage?

Right, JITing can still happen even after warm up:

  • If it is for tiered compilation, the string literals produced by the tier-0 code earlier will get reused. There is no fragmentation concern here.
  • If it is for dynamically generated code, we can disable this optimization for this case as discussed above.

My point is that there are real world scenarios that warm up for a very long time, or that never fully warm up in practice. For example, some of the Bing services with a lot of code. There is a new version getting deployed before the process gets fully warmed up.

@davidfowl
Member

Yes, this is pinning objects that the spine points to. The string literals are typically going to be allocated at the same time we are allocating new spines.

For my edification, what is the spine?

@jkotas
Member

jkotas commented Mar 17, 2021

For my edification, what is the spine?

Pinned object[] that stores pointers to the string literals. The same array is also used for statics. For example, the code for static string f() => "Hello"; is:

mov eax, [0xxxxxx] ; this is address of element in the spine `object[]`
ret
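jkotas's spine description can be sketched in C (hypothetical names; the real spine is a pinned managed object[] managed by the runtime):

```c
/* One pinned array holds references to string literals (and statics).
 * Generated code embeds the address of a spine element and loads through
 * it; that one load is the indirection this PR removes for frozen
 * literals. */
#define SPINE_SIZE 16
static const void *spine[SPINE_SIZE];  /* the pinned "spine" object[] */

/* Runtime side: store a newly created literal into its slot. */
static void publish(int slot, const void *obj) { spine[slot] = obj; }

/* ldstr codegen today: mov rax, [&spine[slot]] -- one load. */
static const void *load_literal(int slot) { return spine[slot]; }
```

Pinning the spine keeps each slot's address stable, which is what lets the JIT burn `&spine[slot]` into the code; the objects it points to must stay pinned too, as discussed above.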

@davidfowl
Member

So this change generally avoids storing string literals in the pinned object[] and avoids the indirection, unless you're in one of the edge cases:

  • Dynamically generated IL
  • Non default ALC?

@jkotas
Member

jkotas commented Mar 17, 2021

As implemented right now, it is unconditionally enabled everywhere. We can change it to only kick in for non-unloadable code (that includes non-default non-collectible ALCs).

I would like to understand better how much we should be worried about POH fragmentation. The POH allocations will typically have long or infinite lifetime. That's fine on its own. But we are likely going to have a problem if there are short-lived allocations mixed in. That makes me think - should our guidance be to avoid short-lived allocations on POH?

@cshung
Member

cshung commented Sep 19, 2022

@ww898, if I am reading the code correctly, GetGenerationBounds should already have the frozen segments included. They are not labeled as such though. Are you seeing failures in your profiler, or are you just asking?

@noahfalk
Member

noahfalk commented Sep 19, 2022

@noahfalk Do you think there is anything to add in diagnostics for this perf optimizations?

I'm not aware of any specific tools that rely on strings being/not-being in the frozen object heap, but agreed with @ww898 that we need to preserve the invariants that already exist for other allocations.
You already captured the need for the profiler object allocation callback and GetGenerationBounds in dotnet/diagnostics#4156. I also added GetObjectGeneration as a 3rd profiler API that needs to be updated. Thanks for the heads up!

@davmason - we need to get this on the TODO list for .NET 8.

EDIT: Another one we need to consider is how this interacts with EventSource AllocationTick events for sampling. We probably want to be consistent that if we consider this to be an allocation for ICorProfiler then we also need to consider it an allocation for EventSource.

@noahfalk
Member

I was thinking on this further and there are more diagnostics we need to go through on this...

Anywhere that we enumerate contents of the managed heap or sizes of the heap we need to include the Frozen Object Heap. We've got a few possible conceptual designs:
a. Treat the Frozen Object Heap as a private implementation detail of some other part of the GC heap, for example considering it a subset of SOH gen2, or a subset of the pinned object heap.
b. Disclose the Frozen Object Heap as a new top level category within the GC. We would now have 4 GC regions - SOH, POH, LOH, and Frozen Object Heap (FOH).
c. Disclose the Frozen Object Heap as a new memory region that holds managed objects, but is distinct from the GC entirely.

Whatever choice we make here could impact a variety of APIs, tools, and docs. For comparison, look at the diagnostics work involved when adding the Pinned Object Heap: dotnet/diagnostics#1408
The option with the least amount of work is option (a) because it doesn't expose the change to any tooling, but at minimum we still need to review how we are accounting for it internally. Right now our accounting is not consistent - in some places we include FOH as part of gen2 SOH but in other places we do not include it. For example !eeheap enumerates the frozen object segments but the System.Runtime GC counter for gen2 size does not include frozen object segments.

I was chatting with @Maoni0 about whether diagnostic tools need to have visibility into the Frozen Object Heap as a distinct region, and it was unclear how big we expect this heap to be. If the heap is relatively small it is easier to sweep it under the rug, but as it grows larger, more people will care about the visibility.

@jkotas
Member

jkotas commented Sep 20, 2022

Option (b) makes the most sense to me. FOH is very similar to POH, except that it cannot ever be collected (once an object is allocated, it stays there until the process exits). I think it is important for the diagnostic tools to be able to tell the difference between frozen objects and other objects, so they can exclude them from analysis. Otherwise, you may see inquiries like "I have a ton of string objects in my heap that are not referenced from anywhere. Why is the GC not collecting them?".

Frozen Object Heap as a distinct region and it was unclear how big we expect this heap to be?

The size of the frozen object heap is roughly proportional to the app code size in the current usage pattern. If you have an app with a lot of code and very little actual data, FOH can be a large fraction of managed memory; and vice versa.

I would expect the FOH to be low single digit percent of the total managed memory in typical case.

@noahfalk
Member

noahfalk commented Sep 21, 2022

Thanks @jkotas! From #75738 you were mentioning:

We may want to have a flag on frozen segments that says whether or not it counts towards GC hard limit. Frozen segments created that are backed by read-only file-mappings should not count towards GC hard limit.

My take on option (b) is that all segments in the Frozen Object Heap, including those from read-only file-mappings, would count towards the GCHeapHardLimit. Is that what you are thinking now, or did you have another way of framing the story that lets people reason about this proposed distinction? I want to ensure that if we decide to follow option (b), we all agree on what it implies.

If we all agree that FOH requires explicit visibility to avoid confusion (no pushback from me) the other question is "does the perf benefit of FOH justify the full costs of supporting it?"
Reading above, the discussion on POH vs FOH appeared to mostly focus on which option would give maximal perf, and was concluded in an offline discussion between @jkotas and @Maoni0. However, it is not clear how that discussion weighed implementation costs. It appears that if we chose to use POH there would be very little further cost, but if we choose FOH then multiple teams must collectively pay months of extra dev work to build out all the support we are envisioning. Do we have some idea what amount of perf improvement we are getting from FOH, beyond what we could get from POH, that justifies all the additional work?

My guess is that fully supporting option (b) entails:

  • 2-5 weeks of work adding tests and/or making updates to ICorProfiler, DAC, DBI, SOS, CLRMD, runtime events, TraceEvent, and runtime counters
  • ? amount of work on the GC team ensuring that GC BCL APIs, config switches, internal statistics and quota enforcement all account for FOH segments
  • updates by others for PerfView, Visual Studio, OpenTelemetry, and a variety of 3rd party .NET profiler vendors.

@Maoni0
Member

Maoni0 commented Sep 21, 2022

I feel like I should have paid more attention when this started, because the diagnostics part of the estimate was clearly missing; that's my bad.

I can see that FOH can be beneficial for other things long term, enough to justify the fairly large amount of diag work needed, but that requires more support beyond diag, for example to construct such a heap. Does the string literal part alone justify the cost of the diag work? That's hard to say. I have not seen much data (if I have seen any, I have since forgotten what it looked like), so it seems like a good step to show some perf data.

@EgorBo
Member Author

EgorBo commented Sep 21, 2022

String literals themselves probably don't justify the effort, but I planned to move other types of objects there, e.g. RuntimeType objects (almost ready to merge: #75573). In their case we avoid a call, not just a single indirect load; as you can see from the diff there are a lot of improvements (-14241 : System.Text.Json.dasm (-1.22% of base)), but it needs a detailed perf analysis. I am not ready to say why FOH would be better here than POH: initially in this PR I decided to use POH, but it raised concerns around mixing long-lived (never collected) objects with short-lived ones, since POH has a public API and users might use it for short-lived objects, creating fragmentation.

Other candidates for FOH(or, again, POH):

  • static readonly fields without mutable GC fields - e.g. we have plenty of string, Type, etc. objects saved to such fields

PS: NativeAOT uses FOH too; don't we want to have first-class diagnostics for it as well?

@jkotas
Member

jkotas commented Sep 21, 2022

I got convinced earlier that the FOH is the right solution for frozen string literals and the other types of objects that we plan to use this for. I expect that using POH for this would conflict with other uses of POH. POH fragmentation issues are hard to reason about, and using POH for frozen string literals would make it even harder.

I think the code quality improvements that we are getting from this change have favorable ROI compared to our other investments, even when you include the extra diagnostic costs.

My guess is that fully supporting option (b) entails:

For the most part, this is just stamping out what was required for POH. Do we expect that we will create more types of specialized heaps like this in the future? If yes, can we do something to make it cheaper?

@noahfalk
Member

Thanks all!

I think the code quality improvements that we are getting from this change have favorable ROI compared to our other investments, even when you include the extra diagnostic costs.

OK, that works fine for me. I just wanted to be sure the new information around costs didn't change the decision making.

For the most part, this is just stamping what was required for POH

Yeah, I think of it as POH work + these two additional areas:

  • FOH uses a different allocation call path that doesn't include the ETW/EventPipe/ICorProfiler instrumentation so we need to add that.
  • FOH isn't currently integrated into GC accounting mechanisms, quotas, BCL APIs, and docs. I don't know how much work will be required.

Do we expect that we create more types of specialized heaps like this in future? If yes, can we do something to make it cheaper?

I'm hoping we can stop adding more specialized heaps (at least ones that need public visibility) to prevent the conceptual overhead for .NET devs doing memory analysis from growing ever larger. Many of our users appear to struggle to understand the GC space and I assume every additional specialization raises the difficulty bar further. Does anyone anticipate needing more of them?

PS: NativeAOT uses FOH too, don't we want to have a first-class diag for it too?

It might change in time, but at the moment NativeAOT diagnostics is low priority.

@jkotas - did you have any thoughts on the GCHeapHardLimit part? You were proposing earlier that file-mapped segments wouldn't count towards the limit, and I am not sure if you are still proposing that.

@EgorBo - I'm guessing more work has emerged here than you were initially planning on? I'm not sure how much you are prepared to take on yourself vs. getting more devs involved. We may want to pull in @davmason and chat about what makes sense here.

@Maoni0
Copy link
Member

Maoni0 commented Sep 21, 2022

Does anyone anticipate needing more of them?

I don't have anything in mind for the foreseeable future. note that FOH is not a new feature - it existed before I even started working on the GC. because it was used so rarely the support wasn't there (including diag, creation, etc). it's just that now we are expecting to use it much more prevalently so we are paying the (long overdue) debt.

@jkotas
Copy link
Member

jkotas commented Sep 22, 2022

I'm hoping we can stop adding more specialized heaps (at least ones that need public visibility) to prevent the conceptual overhead for .NET devs doing memory analysis from growing ever larger.

I think we need to distinguish between what the .NET memory analysis tool writers need vs. most .NET devs need.

.NET memory analysis tool writers want to understand and have access to every relevant runtime detail. It is the reason for exposing details like POH or FOH in the profiler APIs. The people writing these memory analysis tools are experts and eager to take advantage of these details to make their tools shine.

I agree it is important to keep the concept count low for .NET devs, ideally to just core ones like GC generations. I do not believe that .NET devs need to know about POH or FOH. It is good-enough approximation to think about these as part of Gen2.

The corollary of this is that it makes sense to me to expose FOH in profiler APIs that are targeted at .NET memory analysis tool writers, but I would not expose them in the (default set of) profiler counters that are targeted at .NET devs.

You were proposing that file-mapped segments wouldn't count towards the limit earlier and I am not sure if you are still proposing that?

I think about the GC hard limit as private working set limit. Do you agree?

If you agree, the file-mapped segments are not private working set so they should not count against the limit. I do not have a strong opinion about this detail and I would be ok with counting all frozen heap segments against the hard limit without exceptions. If somebody has a scenario with file-mapped heap segments, they can work around this by artificially bumping the hard limit up.
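For reference, the "bump the hard limit up" workaround can be done with the `GCHeapHardLimit` configuration knob; a minimal sketch (the limit value and the app name are illustrative, not from this thread):

```shell
# Raise the GC hard limit to make room for file-mapped frozen segments.
# The value is a byte count expressed in hex (here 0x10000000 = 256 MB).
export DOTNET_GCHeapHardLimit=0x10000000
echo "hard limit set to $DOTNET_GCHeapHardLimit"
# then launch the app, e.g.: dotnet MyApp.dll
```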

@VSadov
Copy link
Member

VSadov commented Sep 22, 2022

Are the frozen segments read-only or copy-on-write?
Would they be added to the process' commit charge in the next GC that will do marking? If that is the case, perhaps they should be counted?

@jkotas
Copy link
Member

jkotas commented Sep 22, 2022

Good point. The frozen segments are not read-only in the general case, even when file mapped, so that things like lock(frozenObject) work. It means that they are always potential private working set in the general case and should be counted against the hard limit. (The workaround with artificially bumping the hard limit up still works if you know what you are doing.)
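For concreteness, a minimal sketch of the lock scenario described above; with this PR the literal below would be allocated on a frozen segment:

```csharp
class FrozenHeader
{
    // With this PR, the string literal lives on the Frozen Object Heap.
    static string Greeting() => "test";

    static void Main()
    {
        // Monitor.Enter may write a sync-block index into the object's
        // header word, so the page backing the frozen object must stay
        // writable even when the segment is file-mapped.
        lock (Greeting())
        {
        }
    }
}
```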

@Maoni0
Copy link
Member

Maoni0 commented Sep 22, 2022

Would they be added to the process' commit charge in the next GC that will do marking?

they won't be for regions. GC will not touch these segments as they are completely out of the range that GC handles (there's actually not much of a point to even register these with the GC for regions anymore). if we disallowed things from user code that would modify these objects, like lock, we could really make these read-only.

@noahfalk
Copy link
Member

I think we need to distinguish between what the .NET memory analysis tool writers need vs. most .NET devs need.

I would tweak this into three groups:

  1. .NET memory analysis tool writers
  2. .NET developers diagnosing .NET memory performance issues (potential memory analysis tool users)
  3. Other .NET developers that are not diagnosing memory perf issues

I expect groups (1) and (2) will often want to gather this information (even though it won't be relevant to all investigations). Group (3) won't care, but might get inadvertently exposed to it via docs, blogs, APIs or performance counters.

I would not expose them in the (default set of) profiler counters that are targeted at .NET devs

Agreed in principle at least :) We already did expose LOH and POH in the System.Runtime counter set so it would feel odd to leave out FOH only. If I got a do-over I wish I could put less GC info there and instead put it in some advanced GC area. I'm trying to work with the OpenTelemetry team to create some default dashboards that are simpler and don't expose every counter in System.Runtime by default.

I think about the GC hard limit as private working set limit. Do you agree?

I think of the limit as being measured the same way as GC heap size or GC committed bytes. If the GC includes private+shared in those calculations then I'd expect the limit to be private+shared as well.

This is how I was thinking of the options:

  1. All GC statistics exclude shared pages. Any tool that enumerates all the objects, such as SOS DumpHeap, will see confusing discrepancies such as sum_of_all_object_sizes > reported_gc_heap_size.
  2. Some GC statistics are implicitly private pages only, others include both private+shared. This seems likely to cause lots of confusion. For example someone could set HardHeapLimit=50MB and then see GC heap size = 60MB because one of them includes shared pages and the other doesn't.
  3. We introduce new statistics that explicitly delineate shared, private, and shared+private. This avoids inconsistencies in (1) and (2), but we've again increased the concept count for users doing GC diagnostics.
  4. We treat all GC statistics and quotas as being implicitly private+shared pages. This one is both consistent and doesn't increase concept count, at the cost of not giving users as fine-grained info and control on private bytes. It also handles COW pages with no additional complexity.

Of those options I lean towards (4) at the moment, but if there are other pros/cons I am leaving out I'm all ears. It sounded like the COW issue may have driven everyone to option (4) as well?
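As an illustration of the consistency question behind these options, this sketch sums the per-generation sizes exposed by `GC.GetGCMemoryInfo` and compares the sum against the reported heap size; under option (4) both numbers would count private+shared pages the same way (actual values vary by runtime and workload, so none are shown here):

```csharp
using System;

class HeapAccounting
{
    static void Main()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        // GenerationInfo indices: 0-2 = gen0..gen2 SOH, 3 = LOH, 4 = POH.
        long sum = 0;
        foreach (GCGenerationInfo gen in info.GenerationInfo)
            sum += gen.SizeAfterBytes;

        // If all statistics use one accounting rule, tools that sum objects
        // should be able to reconcile against the reported heap size.
        Console.WriteLine($"sum of generations: {sum}");
        Console.WriteLine($"reported heap size: {info.HeapSizeBytes}");
    }
}
```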

@jkotas
Copy link
Member

jkotas commented Sep 22, 2022

(2) will often want to gather this information (even though it won't be relevant to all investigations)

I think that POH (and FOH) details are only relevant in a small fraction of the investigations. I would argue that the majority of people doing (2) do not need to know what they are and can just treat them as Gen2. And if these details are relevant for an advanced investigation, you are likely going to need much more than that.

I agree with you that we need to be thinking more about basic and advanced categories. Do you expect that we will remove some perf counters from the default set (e.g. what dotnet-counters displays) as part of that?

We already did expose LOH and POH in the System.Runtime counter set so it would feel odd to leave out FOH only

I do not think that it is odd. We allocate the string literals and runtime types in regular heap today and nobody asked for a counter that shows how much these items take. They just get included in Gen2 size (once the app runs for a while). So we can just continue doing that and avoid teaching everybody what this new number means. We moved around some internal implementation details, but it does not mean that this move needs to leak into what everybody sees.

Of those options I lean towards (4) at the moment,

Sounds good.

@noahfalk
Copy link
Member

I agree with you that we need to be thinking more about basic and advanced categories. Do you expect that we will remove some perf counters from the default set (e.g. what dotnet-counters displays) as part of that?

I am avoiding removing anything from System.Runtime, but I do think we have flexibility in the dotnet-counters UI not to show everything in System.Runtime by default. My plan was to first focus on the default dashboards with OpenTelemetry, then see what we can bring back to dotnet-counters from that exercise.

We allocate the string literals and runtime types in regular heap today and nobody asked for a counter that shows how much these items take. They just get included in Gen2 size ...

I am more worried that if we start disclosing the existence of FOH elsewhere (in VS, in SOS, in docs/blogs, in memory profilers) then people will become aware of it, look at the list of counters and ask why is every heap except FOH here? How is it being accounted for? They may also do things like compare the output of GC.GetMemoryInfo().GenerationInfo[2] or SOS !DumpHeap -stat to the reported value of the GC gen2 counter and become mistrustful of the data if it doesn't match. We have some accounting mechanisms that treat FOH, POH, and LOH as part of gen2 and others that treat none of them as gen2. This would create a 3rd way of accounting, used only for performance counters, where POH and LOH are distinguished but FOH has been folded into gen2. I worry that more accounting variations == more confusion.

@jkotas
Copy link
Member

jkotas commented Sep 23, 2022

We have some accounting mechanisms that treat FOH, POH, and LOH as part of gen2 and others that treat none of them as gen2.

Should we make a breaking change so that FOH, POH, and LOH are treated as part of gen2 everywhere? It would eliminate the current confusion about the accounting, and create a path forward for FOH without raising the concept count for everybody.

@davidfowl
Copy link
Member

Should we make a breaking change so that FOH, POH, and LOH are treated as part of gen2 everywhere? It would eliminate the current confusion about the accounting, and create a path forward for FOH without raising the concept count for everybody.

Only if we have counters that also have the breakdown. @noahfalk and I have been discussing another specific GC category for "advanced" counters.

@Maoni0
Copy link
Member

Maoni0 commented Sep 23, 2022

Should we make a breaking change so that FOH, POH, and LOH are treated as part of gen2 everywhere?

POH and LOH are logically part of gen2 (meaning that they are only collected during gen2 GCs) and that has always been the case. that doesn't mean we will not distinguish them - they still have different perf characteristics so for more advanced diagnosis scenarios they should be displayed/analyzed separately.

@jkotas
Copy link
Member

jkotas commented Sep 23, 2022

I agree that the breakdowns are useful for advanced analysis. I am not suggesting to remove them.

My suggestion has been just to fix the confusion about what "Gen2" means, and make it mean the same thing everywhere. Ie make this formula to be universally true: Gen2 = Gen2SOH + LOH + POH + FOH.
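A sketch of what that formula would mean in terms of today's `GC.GetGCMemoryInfo` API, where index 2 currently reports only the gen2 SOH portion and FOH has no index at all (the variable names are illustrative):

```csharp
using System;

class Gen2Meaning
{
    static void Main()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        long gen2Soh = info.GenerationInfo[2].SizeAfterBytes; // gen2 SOH only today
        long loh     = info.GenerationInfo[3].SizeAfterBytes;
        long poh     = info.GenerationInfo[4].SizeAfterBytes;

        // Under the proposal, "Gen2" would universally mean:
        //   Gen2 = Gen2SOH + LOH + POH + FOH
        // (FOH is not surfaced by this API today.)
        Console.WriteLine($"Gen2 (proposed, minus FOH): {gen2Soh + loh + poh}");
    }
}
```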

@noahfalk
Copy link
Member

My suggestion has been just to fix the confusion about what "Gen2" means, and make it mean the same thing everywhere

I am skeptical that we might make some breaking changes and still not have reduced confusion that much. Even if we are very diligent to use terminology only one way going forward, we can't prevent other people and tools from referring to Gen2SOH using the term "Gen2". Also we've got APIs that may not use the term exactly, but they use things like "Generation" or "GenerationInfo" which are close enough for people to make assumptions. For example would you favor redefining the behavior of GC.GetGCMemoryInfo().GenerationInfo[2] and eliminating indices 3 and 4?

I was expecting that the cat is out of the bag on this one and the best we can practically do is acknowledge that "gen2" has different contextual interpretations and make the context as clear as possible when we use it.

@EgorBo
Copy link
Member Author

EgorBo commented Sep 29, 2022

@EgorBo - I'm guessing more work has emerged here than you were initially planning on?

@noahfalk sorry, didn't notice your comment, I'd love to help with the diagnostics part of this

@noahfalk
Copy link
Member

@noahfalk sorry, didn't notice your comment, I'd love to help with the diagnostics part of this

Np, and cool 👍 I need to wrap up a few pressing things but I'll try to chat with you and @davmason probably mid-to-late next week to figure out the next steps on this. Thanks!

@jkotas jkotas mentioned this pull request Oct 3, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 30, 2022
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI