Pointer tagging aka tagged numbers #7

Closed · markshannon opened this issue Mar 5, 2021 · 23 comments

@markshannon (Member) commented Mar 5, 2021

https://en.wikipedia.org/wiki/Tagged_pointer

Currently, a reference to a Python object is a PyObject *. This means that we need to do a lot of boxing, unboxing and pointer chasing when handling ints and floats.
Because ints and floats are pure value types, we can freely replace references to them with their values, and vice-versa.

For 32-bit machines, using a single bit to tag small (up to 31-bit) integers is the most obvious scheme.

| Tag | Meaning | Encoding |
|-----|---------|----------|
| 0 | int (31 bit) | val<<1 |
| 1 | PyObject * (including large ints and all floats) | ((intptr_t)val)-1 |
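
A minimal sketch of this 32-bit scheme (the helper names are illustrative, not proposed API):

    #include <stdint.h>

    /* Tag 0: a 31-bit int, stored as val << 1.
       Tag 1: an object pointer, stored as ptr - 1; pointers are at least
       4-byte aligned, so ptr - 1 always has its low bit set. */

    static inline int
    is_tagged_int(uintptr_t bits) { return (bits & 1) == 0; }

    static inline uintptr_t
    tag_int31(int32_t i) { return (uintptr_t)(uint32_t)i << 1; }

    static inline int32_t
    untag_int31(uintptr_t bits) {
        return (int32_t)bits >> 1;  /* assumes arithmetic right shift; see the
                                       sign-extension discussion further down */
    }

    static inline uintptr_t
    tag_ptr(void *p) { return (uintptr_t)p - 1; }

    static inline void *
    untag_ptr(uintptr_t bits) { return (void *)(bits + 1); }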

For 64-bit machines, there are at least two schemes that we could use.

  1. NaN-tagging works for machines with up to about 52 bits of virtual address space, and allows tagging of all floats.
  2. A tagging scheme that allows 64-bit (aligned) pointers and avoids boxing for almost all floats works as follows:

| Tag | Meaning | Encoding |
|-----|---------|----------|
| 0 | int (61 bit) | val<<3 |
| 1-6 | float, abs(val) < 2**512 | rotate_bits(val, 4) |
| 7 | PyObject * (including large ints and large floats) | ((intptr_t)val)-1 |
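
A hedged sketch of the int and pointer cases of scheme 2 (the float rotation is omitted; making PyValue a struct anticipates the discussion further down):

    #include <stdint.h>

    typedef struct { uint64_t bits; } PyValue;

    static inline PyValue
    tag_int61(int64_t i) {              /* caller guarantees -2**60 <= i < 2**60 */
        return (PyValue){ (uint64_t)i << 3 };             /* tag 0 */
    }

    static inline int64_t
    untag_int61(PyValue v) {
        return (int64_t)v.bits >> 3;    /* assumes arithmetic right shift */
    }

    static inline PyValue
    tag_object(void *p) {               /* p must be 8-byte aligned */
        return (PyValue){ (uint64_t)(uintptr_t)p - 1 };   /* low 3 bits -> 111 */
    }

    static inline void *
    untag_object(PyValue v) {
        return (void *)(uintptr_t)(v.bits + 1);
    }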

Any C extensions that use macros to access the internals of tuples, lists, etc. would need recompiling, but they need recompiling for every release anyway.

@gvanrossum changed the title from "Pointer tagging." to "Pointer tagging aka tagged numbers" on Mar 5, 2021
@gvanrossum (Collaborator)

So are you proposing to use the lowest N bits for the tag? (N being 1 or 3.)

Why not represent pointers as themselves (tag == 0)? So that forgetting to box is more obviously always an error?

If tagged pointers can be seen by 3rd-party code, that would break the ABI (alas). I think that (currently) the macros that look inside tuples are part of the ABI.

I had considered a scheme where tagged pointers can only occur in local variables and on the evaluation stack, but are boxed when they escape to a place where 3rd party code can see them. (With PEP 651 we could even return them from functions and use them as args.) I assume that they would never be placed in tuples or lists, though.

@markshannon (Member, Author)

Why not represent pointers as themselves (tag == 0)?

Catching errors quickly is one reason. The other is performance; operations on integers are simpler and faster.

If tagged pointers can be seen by 3rd part code that would break the ABI (alas). I think that (currently) the macros that look inside tuple are part of the ABI.

The macros that look inside sequence objects are part of the API, not the ABI. Thankfully.

Tagged pointers should not be visible to the ABI or current API.
We would probably need to add new C-API functions.

E.g. to get an item from a tuple we might add a new function:

PyValue Python_TupleGetItem(PyObject *tuple, intptr_t index);

The old API would wrap that:

PyObject *
PyTuple_GetItem(PyObject *tuple, intptr_t index) {
    PyValue val = Python_TupleGetItem(tuple, index);
    if (val == PyValue_ERROR) {
        return NULL;
    }
    return PyValue_AsNewReference(val);
}

@gvanrossum (Collaborator)

What's the scope of your proposed change then? I can't imagine every field that's currently a PyObject * becoming a PyValue. My own proposal would be to change the evaluation stack and function locals into PyValue arrays, and that sounds complex enough.

If we're going as far as having tuples and lists of PyValues I propose to create new datatypes for those. Perhaps we should start with tuples only -- the list implementation is actually rather complex (over 3400 lines, compared to tuple's 1200). If we do it right we could convert a value-tuple into a classic tuple in-place (if the refcount is 1). Tuples are already the type that the compiler knows most about.

Thinking some more about this, even without special sequence types, we'd need:

  • macros to test what kind of PyValue we have, i.e. pointer, number, float, int
  • macros to convert between C values (PyObject *, double, int) and PyValue encodings, and back
  • macros to incref and decref a PyValue (i.e. skipping if it's not a pointer)
  • functions (or macros) to convert between a PyValue and a PyObject * (boxing or decoding) and back (unboxing or encoding)

There could be two types of unboxing/encoding operations (PyObject * -> PyValue):

  • Lazy unboxing (pointer encoding) keeps number objects as pointers to PyLongObject and PyFloatObject
  • Aggressive unboxing (full encoding) converts number objects that are within the range for PyValue numbers to encoded numbers

Not sure that the distinction is important, but I figure the lazy version is slower, and everything needs to be prepared to handle boxed numbers anyway since the value may not fit in the 31 or 61 bits we have available.
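
For concreteness, a sketch of a few of the helpers in that list, assuming Mark's 64-bit layout above (tag 0 for ints, tag 7 for object pointers), the struct PyValue from the earlier sketch, and Python.h for Py_INCREF/Py_DECREF; none of these names are settled:

    #define PyValue_TAG_MASK 7

    static inline int
    PyValue_IsPtr(PyValue v) { return (v.bits & PyValue_TAG_MASK) == 7; }

    static inline int
    PyValue_IsInt(PyValue v) { return (v.bits & PyValue_TAG_MASK) == 0; }

    /* Reference counting applies only to the pointer case. */
    static inline void
    PyValue_INCREF(PyValue v) {
        if (PyValue_IsPtr(v)) {
            Py_INCREF((PyObject *)(uintptr_t)(v.bits + 1));
        }
    }

    static inline void
    PyValue_DECREF(PyValue v) {
        if (PyValue_IsPtr(v)) {
            Py_DECREF((PyObject *)(uintptr_t)(v.bits + 1));
        }
    }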

PS. Why do you use all 6 available values for unboxed floats? Isn't two bits, i.e. 4 values, sufficient? Then we'd have some extra values if we ever want to encode other types.

@markshannon (Member, Author)

My own proposal would be to change the evaluation stack and function locals into PyValue arrays, and that sounds complex enough.

It's a good place to start. It should reduce the amount of unboxing, if not as much as using tagged values everywhere.

I think adding new datatypes for tuples and lists might well cause more problems than it solves.

Some degree of automation will be necessary, maybe quite a lot; I think the changes can be largely automated.
A lot of code takes a PyObject * and casts it to a more specific type. That code would remain correct (except when casting to an int or float). There are almost no dereferences of plain PyObject * in the code base; most code uses macros like Py_INCREF, Py_DECREF, and Py_TYPE.

Initial PyValue variants of the C-API can be autogenerated to wrap the PyObject * versions, until the proper version is written.

It's definitely not a small task though.

@gvanrossum (Collaborator)

Using tagged values everywhere would be a "Python 4.0" type event, because so much code exists that does not limit itself to the ABI. But I think we can get away with using them only for values stored in the frame object's f_localsplus. (This includes cell objects -- presumably the pointer to the cell is encoded, and the pointer in the cell is also encoded.)

For automation, perhaps @ericsnowcurrently has tools that can form a starting point. (His tools do something else but IIRC they can scan through the entire CPython code base intelligently looking for certain patterns -- I believe more complex than a regex can do.)

A lot of code takes PyObject * and casts it to a more specific type. That code would remain correct [...]

Do you mean we could automatically translate it to something that's still correct? Because if I understand you correctly, if a PyObject * field is really a PyValue then such a cast is not correct (the resulting pointer would be off by one byte). So I presume I am misunderstanding what you are saying here.

There's a concrete experiment we could do: implement pointer encoding for f_localsplus (including cell objects) but without introducing unboxed numbers. This would give us an estimate of how much performance we'd lose due to encoding overhead, which would set a baseline for how much performance we'd have to gain by using unboxed numbers.

@gvanrossum (Collaborator)

See also Neil Schemenauer's tagged_int branch.

@markshannon (Member, Author)

Do you mean we could automatically translate it to something that's still correct?

Yes, but as you point out, we would need to convert the value to a pointer. So (PyFooObject *)obj would become
(PyFooObject *)PyValue_AsObjectPtr(val) where

inline PyObject *
PyValue_AsObjectPtr(PyValue v) {
    return (PyObject *)(v.bits + 1);
}

Because the cast to PyFooObject * implies that the reference cannot be an int or float (so must be an object pointer), there is no need for any additional check.

@gvanrossum (Collaborator)

I see. I am starting a small experiment where only f_localsplus contains such values. We'll see if I can get it working, then we'll run the benchmark on it.

@ericsnowcurrently (Collaborator) commented Mar 10, 2021

See also Neil Schemenauer's tagged_int branch.

Also see Victor's branch: vstinner/cpython#6 (read the TAGGED_POINTERS.rst file first)

@gvanrossum (Collaborator)

I've decided to abandon my experiment for now. The immediate reason is the vectorcall API. Let me explain.

My plan was to store tagged pointers in f_localsplus and regular PyObject * everywhere else, converting between the two e.g. when we make a call to a runtime function, so that (for the time being) functions like PyObject_GetItem() don't trip over the tagged values. Looking at the current code for the BINARY_SUBSCR opcode:

        case TARGET(BINARY_SUBSCR): {
            PyObject *sub = POP();
            PyObject *container = TOP();
            PyObject *res = PyObject_GetItem(container, sub);
            Py_DECREF(container);
            Py_DECREF(sub);
            SET_TOP(res);
            if (res == NULL)
                goto error;
            DISPATCH();
        }

Note that the POP macro effectively expands to *--stack_pointer where stack_pointer is a local variable that points directly into the evaluation stack (i.e., somewhere in f_localsplus). In my version of the code, f_localsplus is no longer an array of PyObject * but it's instead an array of PyValue, and stack_pointer has type PyValue *.

There seem to be two ways we can slice this. We could change the POP() and SET_TOP() macros to do the conversion between PyValue and PyObject *. The POP() macro transfers ownership, so this could probably be made to work.

The other way would be to let POP() and friends return PyValue and do the conversion explicitly in the caller. The latter would end up with code like this:

        case TARGET(BINARY_SUBSCR): {
            PyValue subv = POP();
            PyObject *sub = PyValue_Box(subv);
            PyValue_DECREF(subv);
            PyValue containerv = TOP();
            PyObject *container = PyValue_Box(containerv);
            PyValue_DECREF(containerv);
            PyObject *res = PyObject_GetItem(container, sub);
            Py_DECREF(container);
            Py_DECREF(sub);
            if (res == NULL)
                goto error;
            PyValue resv = PyValue_Unbox(res);
            Py_DECREF(res);
            SET_TOP(resv);
            DISPATCH();
        }

So far so good (hiding this complexity in POP() etc. seems a good idea :-).

But now we get to the CALL_FUNCTION instruction. Here we end up using PyObject_Vectorcall(), passing it a pointer to an array of PyObject * values. This saves a series of POP() calls, and is a significant performance optimization for all kinds of calls. This is where my tagged pointer idea gets stuck: the stack is an array of PyValue values, while the vectorcall API wants an array of PyObject * values. I'd have to allocate a temporary array, convert the PyValue values to PyObject * (by calling PyValue_Box()), pass the new array to PyObject_Vectorcall(), and after it returns, decref the boxed values and deallocate the temporary array. That would kill all the performance gain from using unboxed values.

It may be possible to rescue the experiment by making the boxed and unboxed PyObject * values have the same bit pattern. Then I could convert the argument array in-place by only boxing the tagged numbers in there, and no cleanup would be required after the vector call returns.
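
A sketch of that in-place conversion, assuming pointers use tag 0 here (so boxed and unboxed pointers are bit-identical, unlike the scheme at the top) and a hypothetical PyValue_Box() that returns a new object reference for a tagged number:

    static int
    box_args_in_place(PyValue *args, Py_ssize_t nargs)
    {
        for (Py_ssize_t i = 0; i < nargs; i++) {
            if (!PyValue_IsPtr(args[i])) {
                PyObject *boxed = PyValue_Box(args[i]);   /* new reference */
                if (boxed == NULL) {
                    return -1;
                }
                /* With tag 0 for pointers, the object's address is the value. */
                args[i].bits = (uint64_t)(uintptr_t)boxed;
            }
        }
        /* args can now be passed straight to PyObject_Vectorcall(); the boxed
           objects are owned by their stack slots and are released by the
           normal per-slot DECREFs afterwards. */
        return 0;
    }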

@gvanrossum (Collaborator)

Some comments on Victor's attempt:

  • Tagged values may appear anywhere.
  • He uses tag 0 for pointers, i.e. they have the same bit pattern tagged and untagged. This would solve my vectorcall problem above.
  • He somehow chose to tag only objects that are already singleton values: None, True, False, and integers from the small integer cache (-5..256).
  • Because of that, he has to change every place that tests for None. I don't understand why he makes this problem for himself. (And it breaks the C API. Of course the C API is broken anyway because tagged values may appear anywhere.)
  • I don't think he shows any perf improvements.

@gvanrossum (Collaborator)

FWIW there have been other related attempts to just optimize int operations on single-digit ints (without tagging). See https://bugs.python.org/issue10044 and https://bugs.python.org/issue21955.

@gvanrossum (Collaborator)

So I lied, I kept playing with this. I haven't made that much progress, but I changed things around so that the box/unbox operations change inline integers and leave object pointers alone, rather than the other way around as in Mark's proposed tagging scheme. (Note, I haven't actually proved that this is any better yet.)

One observation: it is kind of painful that the cells used for closures are part of the f_localsplus array. It is statically (i.e. from the bytecode) fixed which array elements are cells, and that's a contiguous chunk of items (which are never NULL), but in most cases it is dynamically determined whether a local is a cell or not (e.g. in dict_to_map() in frameobject.c). Maybe we can work on separating the two arrays. (I found this tidbit in Skip Montanaro's proto-PEP for his register VM, see #22.)

@gvanrossum (Collaborator) commented Mar 31, 2021

One nice thing about tagged ints everywhere is that you could use these directly as the size of strings, lists, etc.

Another idea: Using HPy we could perhaps hide tagged numbers from 3rd parties (or internal modules that aren't yet aware of tagged numbers).

@gvanrossum (Collaborator)

I've got a mostly working (in the sense of not-crashing) version of tagged integers for everything in f_localsplus only. The code is at https://github.com/gvanrossum/cpython/tree/tagnum.

It's about 10-15% slower than master (on a micro-benchmark). That's not surprising -- I'm doing all the work to check for tags, but I'm not yet doing any operations that preserve tags (not even LOAD_FAST or ROT_TWO). To check (some) correctness I've implemented a specialization in FOR_ITER where, if the iterator is a range iter object, it returns an unboxed integer, and this seems to work.

Next up (next week) I hope to work on making all operations that just move data around between locals, cells and the value stack skip most of the tagging overhead (there's still one tag check because all reference count operations need to check for tags). After that I'm planning to work on basic arithmetic. Don't expect speedups, just an evaluation of the feasibility and thoughts about the right API (I've gone over half a dozen different approaches to boxing so far :-). The jury is still out on whether it's better to tag integers or pointers, and I haven't touched floats.

@markshannon (Member, Author) commented Apr 8, 2021

I had a look at your implementation.

I would strongly recommend making PyValue a struct. We'll need all the help that the compiler can offer.
E.g. (64 bit version)

typedef struct _pyv {
    uint64_t bits;
} PyValue;

To help with conversions I would also recommend a helper union:

typedef union convert {
     uint64_t bits;
     int64_t i;
     double d;
     PyObject *p;
} _PyConvert;

For example:

static inline int
PyValue_IsTaggedInt(PyValue v) {
    return (v.bits & 7) == 0;
}

static inline double
PyValue_UntagFloat(PyValue v) {
    assert(PyValue_IsTaggedFloat(v));
    _PyConvert c;
    c.bits = rotate4(v.bits);
    return c.d;
}
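
rotate4() is not defined above; a plausible reading, assuming tagging rotates the double's bits left by 4 and untagging rotates them back:

    static inline uint64_t
    rotate4(uint64_t bits) {
        /* 64-bit rotate right by 4, undoing a rotate left by 4. */
        return (bits >> 4) | (bits << 60);
    }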

One other thing. Returning borrowed references is asking for trouble.

@gvanrossum (Collaborator)

Yeah, I already discovered that my current macros don't sufficiently type-check. :-( I will incorporate the struct and helper union.

Regarding borrowed references, I originally was not returning them, but almost all of the code where I had to insert boxing operations was effectively returning borrowed references -- for example, the TOP() macro is really just a fancy way of saying *stack_pointer, which is effectively a borrowed reference; ditto for e.g. GETLOCAL(i). All the code became simpler when I switched to using PyValue_BoxInPlace(&v).

@gvanrossum (Collaborator)

Hm, but if we're using inline functions instead of macros the type-safety would not be a problem, right?

static inline PyValue PyValue_FromObject(PyObject *o) { return (PyValue)o; }

Then a call to PyValue_FromObject(42) would be a type error. (What went wrong with my code is that currently it's a macro,

#define PyValue_FromObject(o) ((PyValue)(o))

and now PyValue_FromObject(42) expands to (PyValue)(42), which the compiler accepts.)

OTOH, if we're going to make PyValue a struct, why not make the type itself a union?

typedef union _py_value{
     uint64_t bits;
     int64_t i;
     PyObject *p;
} PyValue;

(I'm leaving out double because I don't think it works this way.)

@markshannon (Member, Author) commented Apr 9, 2021

Inline functions will help a bit, but the fundamental issue is that you don't want to allow casts at all, because there is very often a mask, shift or rotate required and it is all too easy to forget them.

((PyObject *)v) where v is a PyValue is never safe. Making PyValue a struct will ensure that it is always a compile time error.

I don't remember which casts are legal for unions.
When I was playing around, I also tried a union in a struct, but I found making PyValue a struct and using a helper union for conversions resulted in the cleanest code.

What doesn't work for doubles?

I would recommend you tag pointers in some way, so that any casts that do slip through explode at runtime.

@gvanrossum (Collaborator) commented Apr 12, 2021

Well, using a struct found several spots that I had missed, so I'm convinced of that.

I did find that I had to introduce a macro to check for NULL, because this is an error:

void
some_func(PyValue v)
{
    if (v != PyValue_NULL)  // Error
        ...
}

I had to rewrite all those tests as

    if (!PyValue_IsNULL(v))

There were occasional comparisons of one value to another, e.g.

    if (v != w)  // Error

which I rewrote as

    if (v.bits != w.bits)  // No error
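
For reference, minimal definitions of the two helpers used above, with PyValue as a struct (a sketch; the real branch may differ):

    static const PyValue PyValue_NULL = { 0 };

    static inline int
    PyValue_IsNULL(PyValue v) {
        return v.bits == 0;
    }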

I think it would work with a union too (the casting rules for unions are pretty similar to those for structs), but keeping the union private affords a little bit extra safety.

Regarding double, I have no idea what your function rotate_bits(val, 4) would do. I'm not actually sure how you would represent doubles. My understanding was that you reduce the number of bits used for the exponent in IEEE 754 double format by one or two, and adjust things so that only the very largest exponents (> 10e150 or so) and the very smallest (> 0 but < 10e-150, say) would not fit in the tagged format. I honestly have no idea how the bit patterns would work out, but I somehow doubt that it would make sense to do the transformation using double precision float operations -- I'd expect that you'd just extract the exponent bits, shove two extra sign bits in there, and move the mantissa down by two bits.

I haven't decided to tag pointers (instead of numbers) yet, but I think that the amount of code written and the speed at which it executes are exactly the same, so your argument (that it catches anything that slips through the net) makes sense now.

@gvanrossum (Collaborator)

So there's an issue (theoretical or not) with boxing negative values. The C standard leaves it implementation-defined what happens to sign bits on a right shift of a signed value. Since the boxing operation would use bits >> 3 as the mathematical value to be boxed, that's not portable if the bits represent a negative value.

Can you think of a closed expression to solve that? The best I can think of would be to OR in three copies of the original sign bit, which is something like (((uint64_t)i >> 63) * 7) << 61.

Would it be worth testing whether the compiler preserves sign bits so we can avoid this? (MSVC docs claim that for signed operands, copies of the sign bit are shifted in.)
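
A sketch of that sign-copying untag in 64-bit form, assuming value << 3 tagging. (Alternatively, since tagged ints are exact multiples of 8, signed division by 8 is fully defined by the C standard and gives the same result.)

    static inline int64_t
    untag_int_portable(uint64_t bits)
    {
        uint64_t shifted = bits >> 3;               /* logical shift: top 3 bits 0 */
        uint64_t sign = (uint64_t)0 - (bits >> 63); /* all ones iff sign bit set */
        return (int64_t)(shifted | (sign << 61));   /* OR in 3 copies of the sign */
    }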

@gvanrossum (Collaborator)

Okay, first positive result: after returning tagged ints from range(n) and implementing a special code path for tagged ints in BINARY_ADD, a micro-benchmark nearly doubled in speed:

# In a function, so i is a local
for i in range(1000): i + 1

See https://github.com/gvanrossum/cpython/tree/tagnum
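
The branch above has the real code; the rough shape of such a BINARY_ADD fast path, assuming tag 0 ints (value << 3) and GCC/Clang's __builtin_add_overflow, might be:

    case TARGET(BINARY_ADD): {
        PyValue right = POP();
        PyValue left = TOP();
        if (PyValue_IsInt(left) && PyValue_IsInt(right)) {
            /* (a << 3) + (b << 3) == (a + b) << 3, and int64 overflow here
               coincides exactly with a + b leaving the 61-bit range. */
            int64_t sum;
            if (!__builtin_add_overflow((int64_t)left.bits,
                                        (int64_t)right.bits, &sum)) {
                SET_TOP((PyValue){ (uint64_t)sum });
                DISPATCH();
            }
        }
        /* ...otherwise fall back to the generic PyNumber_Add() path... */
    }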

@gvanrossum (Collaborator)

I propose to stop here, since this is just a proof of concept. We'd have to spend a lot more time on this to defeat the 10-20% slowdown I see on code that doesn't benefit from the fast path.

Some things left to do as part of productionizing:

  • Handle 32-bit architectures
  • Solve sign extension (see comment above)
  • Make object tag nonzero

And more ideas:

  • Convert co_consts to tagged values (need to cache on code object)
  • Add fast paths to some more operations (+=, -, *, //, what else?)
  • Add tagged floats?

@faster-cpython locked and limited conversation to collaborators on Dec 2, 2021

This issue was moved to a discussion.
