
list.insert is slow, likely due to manual memmove #83982

Closed

pochmann mannequin opened this issue Feb 29, 2020 · 12 comments
Labels
3.9 (only security fixes) · interpreter-core (Objects, Python, Grammar, and Parser dirs) · performance (Performance or resource usage)

Comments

@pochmann (mannequin) commented Feb 29, 2020

BPO 39801
Nosy @tim-one, @rhettinger, @pochmann

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-02-29.16:11:28.232>
labels = ['interpreter-core', '3.9', 'performance']
title = 'list.insert is slow, likely due to manual memmove'
updated_at = <Date 2020-05-18.01:31:11.080>
user = 'https://github.com/pochmann'

bugs.python.org fields:

activity = <Date 2020-05-18.01:31:11.080>
actor = 'rhettinger'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2020-02-29.16:11:28.232>
creator = 'Stefan Pochmann'
dependencies = []
files = []
hgrepos = []
issue_num = 39801
keywords = []
message_count = 10.0
messages = ['362986', '362993', '362996', '362998', '363000', '363001', '363004', '363034', '363035', '363036']
nosy_count = 3.0
nosy_names = ['tim.peters', 'rhettinger', 'Stefan Pochmann']
pr_nums = []
priority = 'low'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue39801'
versions = ['Python 3.9']

@pochmann (mannequin, author) commented Feb 29, 2020

Using a list's insert function is much slower than using slice assignment:

python -m timeit -n 100000 -s "a=[]" "a.insert(0,0)"
100000 loops, best of 5: 19.2 usec per loop

python -m timeit -n 100000 -s "a=[]" "a[0:0]=[0]"
100000 loops, best of 5: 6.78 usec per loop

(Note that the list starts empty but grows to 100,000 elements.)

At first I thought it might be the attribute lookup or the function call overhead, but inserting near the end shows that that's negligible:

python -m timeit -n 100000 -s "a=[]" "a.insert(-1,0)"
100000 loops, best of 5: 79.1 nsec per loop

I asked at StackOverflow and someone pointed out that list.insert uses a manual loop instead of memmove: https://stackoverflow.com/a/60466572/12671057

@pochmann pochmann mannequin added the 3.8 (EOL) end of life, interpreter-core (Objects, Python, Grammar, and Parser dirs), and performance (Performance or resource usage) labels Feb 29, 2020
@pochmann pochmann mannequin changed the title from "list.insert is slow due to manual memmove" to "list.insert is slow, likely due to manual memmove" Feb 29, 2020
@pochmann (mannequin, author) commented Feb 29, 2020

I believe it also affects bisect.insort, which I occasionally use when I need a "self-sorting list" (I can't easily test it, as I'm not set up to modify the C version of bisect.insort).

And also the popular sortedcontainers package, which relies on such list operations being fast: http://www.grantjenks.com/docs/sortedcontainers/
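
For context, a minimal usage sketch of sortedcontainers (a third-party package, pip install sortedcontainers); as I understand it, SortedList.add bisects and then does exactly this kind of insertion into one of its short internal lists:

from sortedcontainers import SortedList  # third-party package

sl = SortedList([10, 30, 20])  # kept sorted: SortedList([10, 20, 30])
sl.add(25)                     # bisect, then an insert into a short internal list
print(sl)                      # SortedList([10, 20, 25, 30])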

@pochmann (mannequin, author) commented Feb 29, 2020

Benchmarking with two *Python* versions of bisect.insort, the "insert" version takes 1.08 seconds to insort the shuffled range(10**5) while the slice assignment version only takes 0.46 seconds:

from timeit import timeit
import random
from bisect import bisect_right

def insort1(a, x):
    lo = bisect_right(a, x)
    a.insert(lo, x)

def insort2(a, x):
    lo = bisect_right(a, x)
    a[lo:lo] = [x]

for _ in range(3):
    a = list(range(10**5))
    random.shuffle(a)
    for insort in insort1, insort2:
        it = iter(a)
        s = []
        print(timeit(lambda: insort(s, next(it)), number=len(a)))
    print()

@tim-one (Member) commented Feb 29, 2020

Be cautious about this. Many years ago I looked into this, mostly in the context of the list sort's binary insertion sort, and results were all over the map, depending on the compiler in use, the optimization level, and the range at which (if any!) memmove() was actually faster than a simple loop. A fancy memmove() will (at least) arrange to copy (say) 8 bytes at a time, but the overhead of setting that up (+ the function call) made it a loser when the number of bytes to be moved was "too small".

Don't know whether the comment in listobject.c's binarysort() still applies:

       Caution: using memmove is much slower under MSVC 5;
       we're not usually moving many slots. */

Optimizing to move 100,000 elements isn't really sane - cutting the constant factor on an inherently quadratic-time too-simplistic basic approach isn't really doing you a favor ;-)

Also note that sortedcontainers does not use unbounded-length Python lists under the covers. It puts an upper bound on the length of the Python lists it uses, precisely to avoid visibly quadratic-time behavior.
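
(For illustration only, a toy sketch of that bounded-sublist idea, with a hypothetical load factor LOAD; sortedcontainers' real implementation is far more refined:)

from bisect import bisect_left, insort

LOAD = 1000  # hypothetical load factor for this sketch

class ToySortedList:
    def __init__(self):
        self._lists = []   # sorted sublists, globally ordered
        self._maxes = []   # max element of each sublist

    def add(self, x):
        if not self._lists:
            self._lists.append([x])
            self._maxes.append(x)
            return
        # First sublist whose max is >= x (or the last sublist).
        i = min(bisect_left(self._maxes, x), len(self._lists) - 1)
        sub = self._lists[i]
        insort(sub, x)              # the list.insert under discussion
        self._maxes[i] = sub[-1]
        if len(sub) > 2 * LOAD:     # split, so sublists stay short
            half = len(sub) // 2
            self._lists[i:i+1] = [sub[:half], sub[half:]]
            self._maxes[i:i+1] = [sub[half-1], sub[-1]]

Because LOAD bounds each sublist's length, each add shifts at most about 2*LOAD slots no matter how large the whole collection grows.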

@rhettinger (Contributor) commented:

Here's a diff you can try out:

diff --git a/Objects/listobject.c b/Objects/listobject.c
index 3c39c6444b..ac6c9cc07a 100644
--- a/Objects/listobject.c
+++ b/Objects/listobject.c
@@ -272,8 +272,7 @@ ins1(PyListObject *self, Py_ssize_t where, PyObject *v)
     if (where > n)
         where = n;
     items = self->ob_item;
-    for (i = n; --i >= where; )
-        items[i+1] = items[i];
+    memmove(items+where+1, items+where+0, (n - where) * sizeof(PyObject *));
     Py_INCREF(v);
     items[where] = v;
     return 0;

For me, that patch makes little difference in the timings.

Another thing that could be tried is to have ins1() call list_ass_slice() to do the work. That way, both paths will use the same underlying insertion code. It would be interesting to see if this makes any difference.
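
(At the Python level the two operations are already interchangeable, which is what makes the comparison fair; a quick sanity check:)

a = [1, 2, 4]
a.insert(2, 3)     # insert 3 at index 2
b = [1, 2, 4]
b[2:2] = [3]       # slice assignment at the same position
assert a == b == [1, 2, 3, 4]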

A few other thoughts:

  • We usually prefer deques to lists when prepending because the former are O(1) and the latter are O(n).

  • Modern compilers are very good at optimizing data movements in a simple for-loop.

  • It is very difficult to get good timings for these kinds of operations. Internally, both make a call to list_resize(), which in turn uses realloc(). The latter can be very fast or very slow depending solely on whether it can resize in place or has to move the data. Accordingly, timings are easily confounded by minor changes to memory usage. To get a clear-cut timing, you would need to isolate just the for-loop or memmove operation and not include Python calling overhead, realloc operations, and everything else going on in the given timings.

  • Another source of confounding is the compiler itself. Changes that make small improvements on Clang may hurt the GCC build, or vice versa. The comparison can also be thrown off by whether you're timing a PGO build (which tends to do an amazing job of optimizing for-loops). The results may also differ between 32-bit and 64-bit builds.

  • The timing isn't representative of how people use insert(0, x). It would be a mistake to optimize an uncommon case if it comes at the expense of the common case (insertions into small lists). To guard against this, you would need to measure inserts into a small list to find out whether memmove() has more setup time than the for-loop (see the timing sketch after the disassembly below). When we've done such investigations in the past, for-loops were found to beat memmove() for sizes under 16. Almost certainly this has changed as compilers and processors have evolved over the years.

  • For smaller list sizes, the calling overhead dominates the insertion time. Disassembling the two different calls shows a good deal of per-call overhead:

>>> from dis import dis
>>> dis('a.insert(0,0)')
  1           0 LOAD_NAME                0 (a)
              2 LOAD_METHOD              1 (insert)
              4 LOAD_CONST               0 (0)
              6 LOAD_CONST               0 (0)
              8 CALL_METHOD              2
             10 RETURN_VALUE
>>> dis('a[0:0]=[0]')
  1           0 LOAD_CONST               0 (0)
              2 BUILD_LIST               1
              4 LOAD_NAME                0 (a)
              6 LOAD_CONST               0 (0)
              8 LOAD_CONST               0 (0)
             10 BUILD_SLICE              2
             12 STORE_SUBSCR
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE
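
One way to probe that small-list common case from the command line (a sketch: the paired del keeps the list from growing, though it adds its own cost to both variants):

python -m timeit -s "a = list(range(8))" "a.insert(4, 0)" "del a[4]"
python -m timeit -s "a = list(range(8))" "a[4:4] = [0]" "del a[4]"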

@pochmann (mannequin, author) commented Feb 29, 2020

Good point, Tim. Although binarysort really moves very few slots (I think at most 62, and on average around 13). That might not be representative of how people use list.insert, either. I think I mainly use it for the mentioned case of a "self-sorting list", either naively via bisect.insort with a single list, or via sortedcontainers' SortedList using multiple lists (I've used it only once so far). If I'm not mistaken, SortedList has a default load factor of 1000 and splits at double that size.

I might do more reasonable benchmarks later (haven't read Raymond's reply yet).

In any case, binarysort does its own inserting, so at least it wouldn't be affected if list.insert were changed.

@rhettinger (Contributor) commented:

Marking this as "pending". It is more of a personal research project than a bug report, feature request, or known optimization.

The underlying hypothesis is that compilers aren't smart enough to generate good code for a simple for-loop that moves data.

The evidence is weak, as it involves timing unusual cases with significantly different calling patterns that make repeated calls to realloc(), which can throw the results off wildly.

@rhettinger rhettinger added the 3.9 (only security fixes) label and removed the 3.8 (EOL) end of life label Feb 29, 2020
@pochmann (mannequin, author) commented Feb 29, 2020

I have better benchmarks now and am trying some more. I guess we roughly agree about when memmove is faster and when it's slower; the real question is how large the common case is.

Do you know where I can find those past investigations? Was memmove *faster* for sizes *over* 16? What is the common case, i.e., how do people use list.insert?

@pochmann (mannequin, author) commented Mar 1, 2020

I think I have a decent way to isolate the memmove vs for-loop times, in Python code. Here are example results, five rounds of inserting with list.insert or with slice assignment. The times for each of the two ways are pretty stable:

insert slice
0.024221 0.006754
0.024157 0.006710
0.024015 0.006734
0.024206 0.006731
0.024123 0.006769

For each run, I build a list of length 2**20 and append one value. Then I can insert 2**17 more values without the list internally resizing.

Then I measure inserting at index -1 and consider that the baseline, as it includes all the overhead costs like function calls and other differences between the two insertion methods.

Then I measure inserting at index -500 and subtract the baseline time. This difference should reflect how long the memmove or the for-loop took, as that's the difference between inserting at -1 and at -500. It's what I showed in those results above.

I chose -500 here because the use case I have in mind is sortedcontainers.SortedList, which I think mostly involves lists of length 1000 or more (up to 2000), and insertions somewhere in the middle.

Results for index -50 instead:

insert slice
0.002479 0.002217
0.002546 0.002113
0.002566 0.002007
0.002425 0.002081
0.002555 0.002261

Btw, I'm running Python 3.8.1 (32-bit) on 64-bit Windows 10, on an Intel i5-7200U.

Rest of this message is the benchmark code:

from timeit import repeat

statements = (
  'a.insert(-1,0)',
  'a.insert(-50,0)',  # use -500 here and in the slice variant below to reproduce the first table
  'a[-1:0] = 0,',
  'a[-50:0] = 0,',
)

# Verify that the statements work and don't cause internal list resizes.
for stmt in statements:
    a = [0] * 2**20
    a.append(0)
    size_before = a.__sizeof__()
    stmt = compile(stmt, '', 'single')
    for _ in range(2**17):
        exec(stmt)
    assert len(a) == 2**20 + 1 + 2**17
    assert a.__sizeof__() == size_before

# Run the benchmark.
print('  insert    slice')
for _ in range(5):
    times = []
    for stmt in statements:
        t = min(repeat(stmt, 'a=[0]*2**20; a.append(0)', number=2**17, repeat=20))
        times.append(t)
    print('%10.6f' * 2 % (times[1] - times[0], times[3] - times[2]))

@pochmann (mannequin, author) commented Mar 1, 2020

I misspoke. I do the insertions at -1 and -500 not on the same list but each on a fresh list (of length 2**20+1).

@iritkatriel (Member) commented:

See also #94508.

@gvanrossum (Member) commented:

I suspect there's very little appetite to try this, based on Tim's and Raymond's caution, so I'm just going to close this, as well as the other one. (@pochmann, if you want to blow new life into this, just add a comment and I'll reopen.)

@gvanrossum gvanrossum closed this as not planned Aug 31, 2022