Micro-optimization: Use memmove instead of for loop in list.insert() #94508
Do you have any benchmarks for small lists?
Although I did not expect a difference, the memmove variant seems to be slightly faster for smaller lists too.

Benchmark code:

```python
import random
import timeit

runs = 100000
for element_count in [1, 10, 100, 1000, 10000]:
    setup = f'import random;lst=[random.randrange(100000) for _ in range({element_count})]; x = random.randrange(100000)'
    slice_time = timeit.timeit(stmt='lst[:0] = [x]', setup=setup, number=runs)
    insert_time = timeit.timeit(stmt='lst.insert(0, x)', setup=setup, number=runs)
    print(f"For list of length {element_count}:")
    print(f"  Slice assignment: {slice_time}")
    print(f"  list.insert:      {insert_time}")
    print(f"  Difference:       {insert_time - slice_time:+}")
```
This should be checked for other compilers as well, so that an improvement on one build doesn't negatively impact another. Also, be sure to run timings using the same full optimization settings that we use for the production builds.

Also, it is a little surprising that memmove would beat a for-loop. Modern compilers are great at optimizing for-loops like this. Conceptually, memmove() is at a disadvantage because its byte-oriented API can't know that we're moving an exact multiple of sizeof(PyObject *), that the data is pointer aligned, and that a direction test doesn't have to be performed. That said, perhaps the for-loop optimizations are being hobbled by …
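For context, the shift under discussion looks roughly like this: a simplified sketch of the pattern in `ins1()` in `Objects/listobject.c` (resizing, bounds clipping, and error handling omitted), with the `memmove()` form the PR proposes shown alongside:

```c
/* Simplified sketch of the element shift in ins1(); resizing, bounds
   clipping, and error handling are omitted. */
PyObject **items = self->ob_item;

/* Current implementation: walk backwards so no slot is overwritten
   before it has been copied. */
for (Py_ssize_t i = n; --i >= where; )
    items[i + 1] = items[i];

/* Proposed replacement: move the same block in one call. memmove()
   (not memcpy) is required because source and destination overlap. */
memmove(&items[where + 1], &items[where],
        (n - where) * sizeof(PyObject *));

/* In both variants the new element then fills the gap: */
Py_INCREF(v);
items[where] = v;
```

Note that the backwards walk in the loop hard-codes the copy direction at compile time, which is exactly the direction test `memmove()` has to perform at run time.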
Update: the results from this comment are not representative; see the comments below.

@radiantly @rhettinger I also find it very surprising, but I can confirm the results on my system, also for small lists. Note: also the …

Benchmark: compare list.insert to slice assignment

Benchmark: (collapsed)
Results: (collapsed)
System: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
Previous issue about it: …
How many repetitions did …
@pochmann The exact benchmark code is in the details section of my previous comment. The size of the list was picked from 1, 2, 10, 100, 1000, 5000.

I also created a plain C program to test the for loop vs. the memmove. This showed no difference at all between the two options. I will look into the exact compiler settings of cpython to see if I can find something that explains the differences within cpython.

C test program: (collapsed)
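The actual program is in the collapsed section above. A minimal standalone comparison along these lines (a hypothetical reconstruction, not the exact code) shows the shape of such a test; the final printf sinks a value derived from the data so the compiler cannot discard the shifts as dead code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    5000      /* pointers to shift per repetition */
#define REPS 200000    /* repetitions per variant */

int main(void)
{
    void **items = malloc((N + 1) * sizeof *items);
    if (items == NULL)
        return 1;
    for (long i = 0; i <= N; i++)
        items[i] = &items[i];          /* defined values before reading */

    /* Variant 1: backwards for-loop, the pattern used by list.insert(). */
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (long i = N; --i >= 0; )
            items[i + 1] = items[i];
    double loop_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Variant 2: a single overlapping block move. */
    t0 = clock();
    for (int r = 0; r < REPS; r++)
        memmove(&items[1], &items[0], N * sizeof *items);
    double move_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("for-loop: %.3fs  memmove: %.3fs  (sink: %p)\n",
           loop_s, move_s, items[N]);
    free(items);
    return 0;
}
```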
@eendebakpt Those are the initial sizes, before all the insertions. It doesn't show how many insertions were done, and thus not how large the lists grew. For example, if pyperf decided to do 10000 repetitions, then with an initial size of 1, the insertions were on sizes 1 to 10000, i.e., with an average size of 5000. Not on size 1 (except the very first insertion).
@eendebakpt I'm not even convinced the results show that the PR made it faster. Sure, "1.38x faster" sounds like it was faster, but... What if it actually got slower, enough that pyperf decided to only do half as many repetitions as before the PR, and thus it only did half as many insertions, meaning on average half the list size and thus on average half the work? Then it was 1.38x faster at doing half as much work, meaning it actually was 2/1.38 = 1.45x slower.

At least with my admittedly small knowledge of pyperf, I can't rule that out. Or is that not how pyperf works?
I think you have a point; I will check later today.
@radiantly Btw, your benchmark with lengths 1, 10, 100, etc. also has that issue that it doesn't really measure insertions at those sizes. I just didn't ask about it since we can see your fixed …

As I showed in the mentioned previous issue, memmove seemed over 3 times faster at inserting at index -500, which I chose due to …
@pochmann The results from my earlier comment are indeed not representative. With a fixed test I see no performance difference for small lists, and a small improvement with memmove for the larger lists.

Test: (collapsed)

Result: (collapsed)

So for some reason the optimization of the for loop is not as good as the memmove.
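One way to chase down that "for some reason" is to look at the generated code directly. The pair below (hypothetical free-standing helpers, not CPython functions) can be compiled with `gcc -O2 -S` and the two assembly listings compared; whether the backwards loop gets vectorized is exactly the kind of thing that varies across compilers and flag sets:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for comparing codegen; these mirror the shift in
   list.insert() without any CPython dependencies. */
void shift_loop(void **items, ptrdiff_t n, ptrdiff_t where)
{
    for (ptrdiff_t i = n; --i >= where; )
        items[i + 1] = items[i];
}

void shift_memmove(void **items, ptrdiff_t n, ptrdiff_t where)
{
    memmove(&items[where + 1], &items[where],
            (size_t)(n - where) * sizeof(void *));
}
```

Comparing the listings under the exact optimization settings CPython uses for production builds (including PGO and LTO) would show whether the loop is vectorized or left as a scalar pointer copy.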
Also, make sure this gets tested across different compilers, builds, or machines. The margin is so tight that it is easy to fool yourself, making Python slightly faster on one build while negatively impacting others. At one time, the current code was shown to be fastest. Note that the quality of memmove implementations varies across compilers, so it takes effort to draw conclusions.

The pyperf timings are also challenging to interpret: these tight loops tend to benefit from perfect branch prediction and all memory accesses fitting into L1 cache, so they aren't at all representative. That is why it was a good suggestion to look at the impacts of the changes on Grant's extensive benchmarks for SortedContainers.
@rhettinger wrote: …

Out of curiosity, do you have a link to those? It may be worth rolling those into …
I'm honestly just going to close this, just like I did with #83982. But @eendebakpt or @radiantly, if you disagree, just post a comment here and I'll gladly reopen! (I'm not promising the fix will be merged, though. Somebody would have to do more benchmarking, on more compilers, with PGO and LTO.)
Background
I noticed a difference in speed when inserting an element at the beginning of a list using list.insert vs list slice assignment. Upon inspecting the code, the usage of memmove in the list slice assignment code seems to be providing a slight speed boost over the for loop used in the list.insert implementation.

The following code was used to benchmark the difference: (collapsed)

Original: (results collapsed)

Using memmove in list.insert(): (results collapsed)

(Compiled with optimizations and tested on Arch Linux, kernel 5.18.)