list.insert is slow, likely due to manual memmove #83982
Using a list's insert function is much slower than using slice assignment:
(Note that the list starts empty but grows to 100,000 elements.) At first I thought maybe it's the attribute lookup or function call overhead or so, but inserting near the end shows that that's negligible:
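A minimal sketch of the comparison described (illustrative only; the original snippets weren't preserved, and absolute numbers vary by machine and build):

```python
# Sketch: grow a list to 100,000 elements by inserting at the front,
# via list.insert versus the equivalent slice assignment.
from timeit import timeit

print(timeit('a.insert(0, None)', 'a = []', number=10**5))  # reported slow
print(timeit('a[0:0] = [None]',   'a = []', number=10**5))  # reported fast

# Inserting near the end instead, where almost nothing needs to move,
# shows that the call/attribute overhead itself is negligible:
print(timeit('a.insert(-1, None)', 'a = [None]', number=10**5))
print(timeit('a[-1:-1] = [None]',  'a = [None]', number=10**5))
```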
I asked on Stack Overflow, and someone pointed out that list.insert uses a manual loop instead of memmove: https://stackoverflow.com/a/60466572/12671057
I believe it also affects bisect.insort, which I occasionally use when I need a "self-sorting list" (I can't easily test this, as I'm not set up to modify the C version of bisect.insort), and also the popular sortedcontainers package, which relies on such list operations being fast: http://www.grantjenks.com/docs/sortedcontainers/
Benchmarking with two *Python* versions of bisect.insort, the "insert" version takes 1.08 seconds to insort the shuffled range(10**5), while the slice-assignment version takes only 0.46 seconds:

```python
from timeit import timeit
import random
from bisect import bisect_right

def insort1(a, x):
    lo = bisect_right(a, x)
    a.insert(lo, x)

def insort2(a, x):
    lo = bisect_right(a, x)
    a[lo:lo] = [x]

for _ in range(3):
    a = list(range(10**5))
    random.shuffle(a)
    for insort in insort1, insort2:
        it = iter(a)
        s = []
        print(timeit(lambda: insort(s, next(it)), number=len(a)))
    print()
```
Be cautious about this. Many years ago I looked into this, mostly in the context of the list sort's binary insertion sort, and results were all over the map, depending on the compiler in use, the optimization level, and the range at which (if any!) memmove() was actually faster than a simple loop. A fancy memmove() will (at least) arrange to copy (say) 8 bytes at a time, but the overhead of setting that up (+ the function call) made it a loser when the number of bytes to be moved was "too small". Don't know whether the comment in listobject.c's binarysort() still applies.

Optimizing to move 100,000 elements isn't really sane - cutting the constant factor on an inherently quadratic-time, too-simplistic basic approach isn't really doing you a favor ;-) Also note that sortedcontainers does not use unbounded-length Python lists under the covers. It puts an upper bound on the length of the Python lists it uses, precisely to avoid visibly quadratic-time behavior.
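To make the quadratic point concrete, a small sketch (illustrative; exact times vary): doubling the number of front-insertions roughly quadruples the total time, no matter how the per-insert constant factor is tuned.

```python
# Sketch: total cost of building an n-element list by front-insertion
# grows ~quadratically with n, so a faster element move can only shave
# the constant factor, not the asymptotic behavior.
from timeit import timeit

for n in (10**4, 2 * 10**4, 4 * 10**4):
    print(n, timeit('a.insert(0, None)', 'a = []', number=n))
```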
Here's a diff you can try out:

```diff
diff --git a/Objects/listobject.c b/Objects/listobject.c
index 3c39c6444b..ac6c9cc07a 100644
--- a/Objects/listobject.c
+++ b/Objects/listobject.c
@@ -272,8 +272,7 @@ ins1(PyListObject *self, Py_ssize_t where, PyObject *v)
     if (where > n)
         where = n;
     items = self->ob_item;
-    for (i = n; --i >= where; )
-        items[i+1] = items[i];
+    memmove(items+where+1, items+where+0, (n - where) * sizeof(PyObject *));
     Py_INCREF(v);
     items[where] = v;
     return 0;
```

For me, that patch makes little difference in the timings. Another thing that could be tried is to have ins1() call list_ass_slice() to do the work. That way, both paths will use the same underlying insertion code. It would be interesting to see if this makes any difference. A few other thoughts:
```python
>>> from dis import dis
>>> dis('a.insert(0,0)')
  1           0 LOAD_NAME                0 (a)
              2 LOAD_METHOD              1 (insert)
              4 LOAD_CONST               0 (0)
              6 LOAD_CONST               0 (0)
              8 CALL_METHOD              2
             10 RETURN_VALUE
>>> dis('a[0:0]=[0]')
  1           0 LOAD_CONST               0 (0)
              2 BUILD_LIST               1
              4 LOAD_NAME                0 (a)
              6 LOAD_CONST               0 (0)
              8 LOAD_CONST               0 (0)
             10 BUILD_SLICE              2
             12 STORE_SUBSCR
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE
```
Good point, Tim. Although binarysort really moves very few slots (I think at most 62, and on average more like 13). That might not be representative of how people use list.insert, either. I think I mainly use it for the mentioned case of a "self-sorting list", either naively via bisect.insort with a single list or via sortedcontainers' SortedList using multiple lists (I've used it only once so far). If I'm not mistaken, SortedList has a default load factor of 1000 and splits at double that size. I might do more reasonable benchmarks later (haven't read Raymond's reply yet). In any case, binarysort does its own inserting, so at least it wouldn't be affected if list.insert were changed.
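For illustration, a rough sketch of that bounded-sublist idea (a hypothetical `BoundedSortedList`; sortedcontainers' real implementation differs, and the load factor of 1000 is just its documented default mentioned above):

```python
# Hypothetical sketch of a "self-sorting list" built from bounded sublists,
# in the spirit of sortedcontainers.SortedList (not its actual code).
from bisect import bisect_left, insort

class BoundedSortedList:
    def __init__(self, load=1000):
        self._load = load
        self._lists = []   # sorted sublists, each of bounded length
        self._maxes = []   # largest element of each sublist

    def add(self, x):
        if not self._lists:
            self._lists.append([x])
            self._maxes.append(x)
            return
        # First sublist whose current maximum is >= x (or the last one).
        i = min(bisect_left(self._maxes, x), len(self._lists) - 1)
        sub = self._lists[i]
        insort(sub, x)                  # moves at most ~2*load slots
        self._maxes[i] = sub[-1]
        if len(sub) > 2 * self._load:   # split once a sublist doubles the load
            half = len(sub) // 2
            self._lists[i:i+1] = [sub[:half], sub[half:]]
            self._maxes[i:i+1] = [sub[half - 1], sub[-1]]

    def __iter__(self):
        # Sublists are disjoint and ordered, so chaining yields sorted order.
        for sub in self._lists:
            yield from sub
```

With the sublist length bounded, each insort moves at most a couple thousand pointers, so the per-insert cost stays roughly constant as the collection grows.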
Marking this as "pending". It is more of a personal research project than a bug report, feature request, or known optimization. The underlying hypothesis is that compilers aren't smart enough to generate good code for a simple for-loop that moves data. The evidence is weak, as it involves timing unusual cases with significantly different calling patterns that make repeated calls to realloc(), which can throw the results off wildly.
I have better benchmarks now and am trying some more. I guess we roughly agree about when memmove is faster and when it's slower, but the real question is how large the common case is. Do you know where I can find those past investigations? Was memmove *faster* for sizes *over* 16? And what is the common case, i.e., how do people use list.insert?
I think I have a decent way to isolate the memmove-vs-for-loop times, in Python code. Here are example results, five rounds of inserting with list.insert or with slice assignment. The times for each of the two ways are pretty stable:

```
    insert     slice
```

For each run, I build a list of length 2**20 and append one value. Then I can insert 2**17 more values without the list internally resizing. Then I measure inserting at index -1 and consider that the baseline, as it includes all the overhead costs like function calls and other differences between the two insertion methods. Then I measure inserting at index -500 and subtract the baseline time. This difference should reflect how long the memmove or the for-loop took, as that's the difference between inserting at -1 and at -500. It's what I showed in those results above.

I chose -500 here because the use case I have in mind is sortedcontainers.SortedList, which I think mostly involves lists of length 1000 or more (up to 2000) and insertions somewhere in the middle. Results for index -50 instead:

```
    insert     slice
```

I'm btw running Python 3.8.1 32-bit on Windows 10 64-bit on an Intel i5-7200U. The rest of this message is the benchmark code:

```python
from timeit import repeat

statements = (
    'a.insert(-1,0)',
    'a.insert(-50,0)',
    'a[-1:0] = 0,',
    'a[-50:0] = 0,',
)

# Verify that the statements work and don't cause internal list resizes.
for stmt in statements:
    a = [0] * 2**20
    a.append(0)
    size_before = a.__sizeof__()
    stmt = compile(stmt, '', 'single')
    for _ in range(2**17):
        exec(stmt)
    assert len(a) == 2**20 + 1 + 2**17
    assert a.__sizeof__() == size_before

# Run the benchmark.
print('    insert     slice')
for _ in range(5):
    times = []
    for stmt in statements:
        t = min(repeat(stmt, 'a=[0]*2**20; a.append(0)', number=2**17, repeat=20))
        times.append(t)
    print('%10.6f' * 2 % (times[1] - times[0], times[3] - times[2]))
```
I misspoke. I do the insertions at -1 and at -500 not on the same list but each on a fresh list (of length 2**20+1).
See also #94508.
I suspect there's very little appetite to try this, based on Tim's and Raymond's caution, so I'm just going to close this, as well as the other one. (@pochmann, if you want to blow new life into this, just add a comment and I'll reopen.) |