You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The is a lot of memory traffic in jitted code. Despite the JIT removing the dispatch overhead, there is still a lot of memory traffic from:
Pushing to, and popping from, the evaluation stack
Reference counting operations
We can reduce these with top-of-stack (TOS) caching and deferred reference counting, respectively.
TOS Caching
Top of stack caching promises significant performance improvements as we can considerably reduce
memory traffic. Ignoring refcounting, LOAD_FAST goes from a dependent load and store to just the load. COPY and SWAP need no memory accesses.
(Partially) Deferred Reference Counting
For practical reasons, mainly immediate reclamation, we cannot do full deferred reference counting.
We can however reduce the cost of reference counting by tagging references.
With tagged references and TOS caching, a LOAD_FAST is just one load and an ALU operation. COPY is just a single ALU operation.
Although the tagged version is not as effective as full deferred reference counting, it does mean we can relax
our guarantees about the state of the VM when GC occurs.
With full deferred reference counting it is vital that the GC be able to see the exact stack.
With tagged deferred reference counting, we only need to guarantee that it cannot
appear to the GC that an object not on the stack is on the stack.
We still want to keep the stack pointer up to date though, as it is important for GC performance.
The more stack references that the GC can see, the more it can cheaply mark as reachable, reducing the overhead of cycle collection.
Overall, this means that TOS caching is correct, should be a large positive effect on jitted code performance and a negligible negative impact on GC performance.
Escaping calls.
Escaping calls are important because:
GC can only occur during an escaping call.
Almost anything can change in an escaping call, which can invalidate JIT optimizations
For this reason the interpreter code generators understand escaping calls, and save the stack around them.
Since the C compiler would have had to spill registers around the call anyway, this is effectively free.
Py_DECREF
Py_DECREF(obj) does this:
if (is_mortal(obj)) {
obj->ob_refcnt--;
if (obj->ob_refcnt==0) {
Py_Dealloc(obj);
}
}
Py_Dealloc escapes, which means that Py_DECREF can escape. This is a problem as Py_DECREF is ubiquitous
and escaping calls defeat many optimization and cause code bulk due the spilling around the call.
Deferred reference counting doesn't help, as it only makes the tests around Py_Dealloc faster. The escaping call to Py_Dealloc is still present.
So we have to treat Py_DECREF specially.
Although Py_Dealloc escapes in general, it doesn't escape for many common objects, like int, float, str. For specialized bytecodes we can mark the Py_DECREF as not escaping.
Do not save the stack pointer across Py_DECREF. This will enable TOS caching across Py_DECREFs. In theory this might reduce the performance of the GC a bit, it practice it should not make any difference.
Py_DECREF can still escape though, so eliminating as many Py_DECREFs as possible is worthwhile. We can elimiate many decrefs in the JIT optimizer.
Correct collection of objects
Avoiding premature collection of objects
The reference counting mechanism will not collect an object if there is at least one counted reference to an object.
The cycle GC will not collect an object if there is at least one reference to it that the GC can detect and is not part of a cycle.
Noting that references on the stack are not part of cycles, that means that if there is at least one counted reference to an object on the stack, then that object will not be collected. Therefore it is safe for any other references to that object to not be visible to the GC or to be counted.
Therefore we can defer (tag) a reference on the stack, provided that:
That reference is on the stack, and there is at least one counted reference to the same object that will outlive this reference.
Deferred collection objects
Some objects are prone to contention. For the free-threaded (racy) build we want to tag references to these objects to avoid contention on the reference count, to do this we are willing to tolerate deferred collection of these objects.
These objects should be created with a reference count of one higher than the true value, and have a bit set in their flags to mark them
as deferred. These objects can then be only be collected by the cyclic GC.
We can defer (tag) a reference on the stack deferred even if there is no other reference to that object, provided that we can be guaranteed that the GC will be able to see this reference. This will be a problem for TOS caching which can hide the TOS from the GC.
This means that TOS caching will not be possible for the free-threaded build.
When can we tag references?
We can defer (tag) a reference on the stack, provided when that reference is on the stack, and there is at least one counted reference to the same object that will outlive this reference.
Note that this is transitive, we can tag a reference if there is another tagged reference is on the stack that will outlive this reference, or a transitive reference to the same object outlives this reference.
Any reference that is passed to a callee can remain tagged, as the caller will outlive the callee. However any reference passed up the stack by return or yield must be converted to a counted reference.
A reference can be deferred (tagged) only if it is:
Guaranteed to be visible to the GC, or
outlived by another reference that is visible to the GC and is not part of a cycle.
No object that is not on the stack can be seen as being on the stack by the GC (or any other code).
The above invariants ensure that the GC will not be able to collect non-garbage objects.
However, we cannot know, in general, whether a reference is part of a cycle. But we do know
whether a reference is on the stack and can choose a different set of invariants that can be enforced
and imply the more general invariants:
A reference can be deferred (tagged) only if it is:
On the stack, and
Guaranteed to be outlived by another reference, also on the stack, that transitively refers to the same object.
The stack pointer saved to memory must not point beyond the portion of the stack that is valid and written to memory
Weakening the stack invariant
We previously had of goal of implementing the invariant that the entire evaluation stack of all frames would
be in memory and precisely visible to the GC. This would have allowed full deferred reference counting,
but since we have discarded full deferred reference counting for other reasons we no longer need the stronger invariant.
Implementation
We should be able to implement TOS caching and deferred references independently, as long we can rely on the invariants above. The current implementation does support the invariants apart form marking Py_DECREF as escaping. Without that information we cannot be sure that the stack is spilled correctly.
For example, the generated code for POP_TOP (ignoring stats and such):
_PyStackRef value;
value = stack_pointer[-1];
PyStackRef_CLOSE(value);
stack_pointer += -1;
When PyStackRef_CLOSE(value) is called, value is visible on the stack.
The generated code would need to be:
_PyStackRef value;
value = stack_pointer[-1];
stack_pointer += -1;
PyStackRef_CLOSE(value);
Once that is done, we can implement TOS caching and deferred reference counting independently.
The text was updated successfully, but these errors were encountered:
The is a lot of memory traffic in jitted code. Despite the JIT removing the dispatch overhead, there is still a lot of memory traffic from:
We can reduce these with top-of-stack (TOS) caching and deferred reference counting, respectively.
TOS Caching
Top of stack caching promises significant performance improvements as we can considerably reduce
memory traffic. Ignoring refcounting,
LOAD_FAST
goes from a dependent load and store to just the load.COPY
andSWAP
need no memory accesses.(Partially) Deferred Reference Counting
For practical reasons, mainly immediate reclamation, we cannot do full deferred reference counting.
We can however reduce the cost of reference counting by tagging references.
With tagged references and TOS caching, a
LOAD_FAST
is just one load and an ALU operation.COPY
is just a single ALU operation.Although the tagged version is not as effective as full deferred reference counting, it does mean we can relax
our guarantees about the state of the VM when GC occurs.
With full deferred reference counting it is vital that the GC be able to see the exact stack.
With tagged deferred reference counting, we only need to guarantee that it cannot
appear to the GC that an object not on the stack is on the stack.
We still want to keep the stack pointer up to date though, as it is important for GC performance.
The more stack references that the GC can see, the more it can cheaply mark as reachable, reducing the overhead of cycle collection.
Overall, this means that TOS caching is correct, should be a large positive effect on jitted code performance and a negligible negative impact on GC performance.
Escaping calls.
Escaping calls are important because:
For this reason the interpreter code generators understand escaping calls, and save the stack around them.
Since the C compiler would have had to spill registers around the call anyway, this is effectively free.
Py_DECREF
Py_DECREF(obj)
does this:Py_Dealloc
escapes, which means thatPy_DECREF
can escape. This is a problem asPy_DECREF
is ubiquitousand escaping calls defeat many optimization and cause code bulk due the spilling around the call.
Deferred reference counting doesn't help, as it only makes the tests around
Py_Dealloc
faster. The escaping call toPy_Dealloc
is still present.So we have to treat
Py_DECREF
specially.Py_Dealloc
escapes in general, it doesn't escape for many common objects, likeint
,float
,str
. For specialized bytecodes we can mark thePy_DECREF
as not escaping.Py_DECREF
. This will enable TOS caching acrossPy_DECREF
s. In theory this might reduce the performance of the GC a bit, it practice it should not make any difference.Py_DECREF
can still escape though, so eliminating as manyPy_DECREF
s as possible is worthwhile. We can elimiate many decrefs in the JIT optimizer.Correct collection of objects
Avoiding premature collection of objects
Noting that references on the stack are not part of cycles, that means that if there is at least one counted reference to an object on the stack, then that object will not be collected. Therefore it is safe for any other references to that object to not be visible to the GC or to be counted.
Therefore we can defer (tag) a reference on the stack, provided that:
Deferred collection objects
Some objects are prone to contention. For the free-threaded (racy) build we want to tag references to these objects to avoid contention on the reference count, to do this we are willing to tolerate deferred collection of these objects.
These objects should be created with a reference count of one higher than the true value, and have a bit set in their flags to mark them
as deferred. These objects can then be only be collected by the cyclic GC.
We can defer (tag) a reference on the stack deferred even if there is no other reference to that object, provided that we can be guaranteed that the GC will be able to see this reference. This will be a problem for TOS caching which can hide the TOS from the GC.
This means that TOS caching will not be possible for the free-threaded build.
When can we tag references?
We can defer (tag) a reference on the stack, provided when that reference is on the stack, and there is at least one counted reference to the same object that will outlive this reference.
Note that this is transitive, we can tag a reference if there is another tagged reference is on the stack that will outlive this reference, or a transitive reference to the same object outlives this reference.
Any reference that is passed to a callee can remain tagged, as the caller will outlive the callee. However any reference passed up the stack by
return
oryield
must be converted to a counted reference.The above invariants ensure that the GC will not be able to collect non-garbage objects.
However, we cannot know, in general, whether a reference is part of a cycle. But we do know
whether a reference is on the stack and can choose a different set of invariants that can be enforced
and imply the more general invariants:
Weakening the stack invariant
We previously had of goal of implementing the invariant that the entire evaluation stack of all frames would
be in memory and precisely visible to the GC. This would have allowed full deferred reference counting,
but since we have discarded full deferred reference counting for other reasons we no longer need the stronger invariant.
Implementation
We should be able to implement TOS caching and deferred references independently, as long we can rely on the invariants above. The current implementation does support the invariants apart form marking
Py_DECREF
as escaping. Without that information we cannot be sure that the stack is spilled correctly.For example, the generated code for
POP_TOP
(ignoring stats and such):When
PyStackRef_CLOSE(value)
is called,value
is visible on the stack.The generated code would need to be:
Once that is done, we can implement TOS caching and deferred reference counting independently.
The text was updated successfully, but these errors were encountered: