
Micro-optimization: Use memmove instead of for loop in list.insert() #94508

Closed
radiantly opened this issue Jul 2, 2022 · 15 comments
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

@radiantly

radiantly commented Jul 2, 2022

Background

I noticed a difference in speed when inserting an element at the beginning of a list using list.insert vs. list slice assignment. Upon inspecting the code, the use of memmove in the list slice assignment code seems to be providing a slight speed boost over the for loop used in the list.insert implementation.

The following code was used to benchmark the difference:

import random
import timeit

runs = 100000
setup = 'import random;lst=[random.randrange(100000) for _ in range(100000)]; x = random.randrange(100000)'

print(timeit.timeit(stmt='lst[:0] = [x]', setup=setup, number=runs))
print(timeit.timeit(stmt='lst.insert(0, x)', setup=setup, number=runs))

Original:

Slice assignment: 7.394628149999335s
list.insert:      8.910868348000804s

Using memmove in list.insert():

Slice assignment: 7.252985826999065s
list.insert:      7.215608489997976s

(Compiled with optimizations and tested on Arch Linux, kernel 5.18)

@radiantly radiantly added the type-feature A feature request or enhancement label Jul 2, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Jul 3, 2022
@eendebakpt
Contributor

Do you have any benchmarks for small lists?

@radiantly
Author

Although I did not expect a difference, the memmove variant seems to be slightly faster for smaller lists too.

main branch:
For list of length 1:
 Slice assignment: 0.9193216409912566
 list.insert:      1.1760083219996886
 Difference:      +0.25668668100843206
For list of length 10:
 Slice assignment: 0.9002103189995978
 list.insert:      1.175773538008798
 Difference:      +0.2755632190092001
For list of length 100:
 Slice assignment: 0.9053751649917103
 list.insert:      1.176510691002477
 Difference:      +0.2711355260107666
For list of length 1000:
 Slice assignment: 0.924577939993469
 list.insert:      1.1964725340076257
 Difference:      +0.2718945940141566
For list of length 10000:
 Slice assignment: 1.0966810410027392
 list.insert:      1.4154912599915406
 Difference:      +0.31881021898880135
patched:
For list of length 1:
 Slice assignment: 0.9132311169960303
 list.insert:      0.9028571050002938
 Difference:      -0.01037401199573651
For list of length 10:
 Slice assignment: 0.9108372859918745
 list.insert:      0.9029034480045084
 Difference:      -0.007933837987366132
For list of length 100:
 Slice assignment: 0.908425076995627
 list.insert:      0.9027242980082519
 Difference:      -0.005700778987375088
For list of length 1000:
 Slice assignment: 0.9295631379936822
 list.insert:      0.9190660210006172
 Difference:      -0.010497116993064992
For list of length 10000:
 Slice assignment: 1.0984339760034345
 list.insert:      1.095934324010159
 Difference:      -0.00249965199327562
Benchmark code
import random
import timeit

runs = 100000

for element_count in [1, 10, 100, 1000, 10000]:
    setup = f'import random;lst=[random.randrange(100000) for _ in range({element_count})]; x = random.randrange(100000)'

    slice_time = timeit.timeit(stmt='lst[:0] = [x]', setup=setup, number=runs)
    insert_time = timeit.timeit(stmt='lst.insert(0, x)', setup=setup, number=runs)
    
    print(f"For list of length {element_count}:")
    print(f" Slice assignment: {slice_time}")
    print(f" list.insert:      {insert_time}")
    print(f" Difference:      {insert_time - slice_time:+}")

@rhettinger
Contributor

rhettinger commented Jul 3, 2022

This should be checked for other compilers as well, so that an improvement on one build doesn't negatively impact another. Also, be sure to run timings using the same full optimization settings that we use for the production builds.

Also, it is a little surprising that memmove would beat a for-loop. Modern compilers are great at optimizing for-loops like this. Conceptually, memmove() is at a disadvantage because its byte-oriented API can't know that we're moving an exact multiple of sizeof(PyObject *), that the data is pointer-aligned, and that no overlap-direction test needs to be performed. That said, perhaps the for-loop optimizations are being hobbled by -fno-strict-aliasing.

@eendebakpt
Contributor

eendebakpt commented Jul 3, 2022

Update: the results in this comment are not representative; see the comments below.

@radiantly @rhettinger I also find it very surprising, but I can confirm the results on my system. For small lists, too, memmove seems to be faster.

Note: bytearray also uses memmove in its insert implementation.

Benchmark: compare list.insert to slice assignment


import pyperf

runner = pyperf.Runner()

for n in [1, 2, 10, 100, 1000, 50_000]:
    runner.timeit(name=f"list({n}).insert(0, None)",
                  stmt="lst.insert(0, None)",
                  setup=f"import random; lst=[random.random() for ii in range({n})]")
    runner.timeit(name=f"list({n}).slice",
                  stmt="lst[:0] = [None]",
                  setup=f"import random; lst=[random.random() for ii in range({n})]")

Results

list(1).insert(0, None): Mean +- std dev: [main] 5.66 us +- 0.12 us -> [pr] 4.11 us +- 0.06 us: 1.38x faster
list(1).slice: Mean +- std dev: [main] 4.15 us +- 0.06 us -> [pr] 4.19 us +- 0.08 us: 1.01x slower
list(2).insert(0, None): Mean +- std dev: [main] 5.65 us +- 0.10 us -> [pr] 4.12 us +- 0.05 us: 1.37x faster
list(2).slice: Mean +- std dev: [main] 4.19 us +- 0.08 us -> [pr] 4.15 us +- 0.07 us: 1.01x faster
list(10).insert(0, None): Mean +- std dev: [main] 5.63 us +- 0.09 us -> [pr] 4.12 us +- 0.06 us: 1.37x faster
list(100).insert(0, None): Mean +- std dev: [main] 5.62 us +- 0.08 us -> [pr] 4.14 us +- 0.11 us: 1.36x faster
list(100).slice: Mean +- std dev: [main] 4.22 us +- 0.07 us -> [pr] 4.19 us +- 0.05 us: 1.01x faster
list(1000).insert(0, None): Mean +- std dev: [main] 5.95 us +- 0.11 us -> [pr] 4.38 us +- 0.08 us: 1.36x faster
list(50000).insert(0, None): Mean +- std dev: [main] 20.9 us +- 0.2 us -> [pr] 16.6 us +- 0.2 us: 1.26x faster

Benchmark hidden because not significant (3): list(10).slice, list(1000).slice, list(50000).slice

Geometric mean: 1.16x faster

System

gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

@pochmann
Contributor

pochmann commented Jul 5, 2022

Previous issue about it:
#83982

@pochmann
Contributor

pochmann commented Jul 5, 2022

How many repetitions did pyperf do? I.e., how "small" were the lists on average (for each initial size n)?

@eendebakpt
Contributor

eendebakpt commented Jul 5, 2022

@pochmann The exact benchmark code is in the details section of my previous comment. The initial size of the list was picked from 1, 2, 10, 100, 1000, 50000.

I also created a plain C program to compare the for loop with memmove. It showed no difference at all between the two options. I will look into CPython's exact compiler settings to see if I can find something that explains the difference within CPython.

C test program
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/timeb.h>

/* Returns the elapsed time in seconds since t0 (despite the "ms" in the name). */
double get_time_ms (double t0)
{
    struct timespec ts;

    if (clock_gettime (CLOCK_MONOTONIC, &ts) == 0)
    {
        return ((double) ts.tv_sec +
                (double) ts.tv_nsec / 1000000000.0) - t0;
    }
    }
    else
        return 0;
}

/* Mirrors the backwards copy loop used by CPython's list insert (ins1). */
void
plain_loop (int **items, int N, int where)
{
    for (int i = N - 1; --i >= where;)
        items[i + 1] = items[i];
}

void
forward_loop (int **items, int N, int where)
{
    // not equivalent to plain_loop, included for reference;
    // the bound is N - 1 so that items[i + 1] stays inside the array
    for (int i = where; i < N - 1; i++)
        items[i] = items[i + 1];
}

void
memmove_impl (int **items, int N, int where)
{
    // move the same N - 1 - where elements that plain_loop moves,
    // keeping the write within the bounds of items
    memmove (&items[where + 1], &items[where], (N - 1 - where) * sizeof (int *));
}

#define N 400000
int
main (int argc, char **argv)
{
    int i;
    double t0, dt;

    int *items[N];
    for (long ii = 0; ii < N; ii++)
        items[ii] = (int *) (ii % 20);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            plain_loop (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            memmove_impl (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            forward_loop (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);
}

// gcc -O3 -fno-strict-aliasing test.c  && ./a.out
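
One way to look up the settings an existing CPython build actually used, so the standalone test can be compiled with matching flags (a minimal sketch; sysconfig exposes these variables on POSIX builds):

import sysconfig

# Print the compiler and flags recorded for this CPython build, so the
# standalone C test above can be rebuilt with matching settings.
for var in ("CC", "CFLAGS", "OPT"):
    print(var, "=", sysconfig.get_config_var(var))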

@pochmann
Contributor

pochmann commented Jul 5, 2022

@eendebakpt Those are the initial sizes, before all the insertions. It doesn't show how many insertions were done, and thus how large the lists grew.

For example if pyperf decided to do 10000 repetitions, then with an initial size of 1, the insertions were on sizes 1 to 10000, i.e., with an average size of 5000. Not on size 1 (except the very first insertion).
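
A minimal sketch of that effect using plain timeit (the loop count here is an arbitrary choice, not taken from the benchmarks above):

import timeit

# The same list is reused for every timed repetition, so each insert
# runs on a list one element longer than the last.
lst = [0]  # "initial size 1"
timeit.timeit(stmt="lst.insert(0, None)", globals=globals(), number=100_000)
print(len(lst))  # 100001: the average insert saw a list of about 50000 elements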

@pochmann
Contributor

pochmann commented Jul 5, 2022

@eendebakpt I'm not even convinced the results show that the PR made it faster. Sure, "1.38x faster" sounds like it was faster, but...

What if it actually got slower, enough that pyperf decided to only do half as many repetitions as before the PR, and thus it only did half as many insertions, meaning on average half the list size and thus on average half the work? Then it was 1.38x faster at doing half as much work, meaning it actually was 2/1.38 = 1.45x slower.

At least with my admittedly small knowledge of pyperf, I can't rule that out. Or is that not how pyperf works?
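
One way to rule that out (a sketch, assuming pyperf's standard command-line options) is to pin the loop count so that both builds perform exactly the same number of insertions:

import pyperf

# bench.py: run as `python bench.py --loops 1000` on both builds so that
# pyperf cannot choose a different repetition count for each.
runner = pyperf.Runner()
runner.timeit(name="list(1).insert(0, None)",
              stmt="lst.insert(0, None)",
              setup="lst = [None]")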

@eendebakpt
Contributor

> @eendebakpt I'm not even convinced the results show that the PR made it faster. Sure, "1.38x faster" sounds like it was faster, but...
>
> What if it actually got slower, enough that pyperf decided to only do half as many repetitions as before the PR, and thus it only did half as many insertions, meaning on average half the list size and thus on average half the work? Then it was 1.38x faster at doing half as much work, meaning it actually was 2/1.38 = 1.45x slower.
>
> At least with my admittedly small knowledge of pyperf, I can't rule that out. Or is that not how pyperf works?

I think you have a point; I will check later today.

@pochmann
Contributor

pochmann commented Jul 6, 2022

@radiantly Btw, your benchmark with lengths 1, 10, 100, etc. also has the issue that it doesn't really measure insertions at those sizes. I just didn't ask about it since we can see your fixed runs = 100000, meaning that for initial length 1, for example, you measured insertion into a list of average size 50000 (which I wouldn't call small).

As I showed in the mentioned previous issue, memmove seemed over 3 times faster at inserting at index -500, which I chose due to sortedcontainers.SortedList. Inserting a few million random numbers into a SortedList might be a good "real life" test (something reasonable/realistic to do, unlike our inserting 100000 values at the start of a list, which Tim would probably still say "isn't really sane"). I'm curious how that's affected by the PR.
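
A sketch of that SortedList test (it assumes the third-party sortedcontainers package, and the element count is an arbitrary choice):

import random
import time
from sortedcontainers import SortedList

# SortedList.add bisects into bounded-size internal Python lists and then
# calls list.insert on one of them, so insert performance matters here.
values = [random.random() for _ in range(2_000_000)]
sl = SortedList()
t0 = time.perf_counter()
for v in values:
    sl.add(v)
print(f"{time.perf_counter() - t0:.2f}s for {len(sl):,} adds")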

@eendebakpt
Contributor

@pochmann The results from my earlier comment are indeed not representative. With a fixed test I see no performance difference for small lists, and a small improvement with memmove for the larger lists.

Test:

import pyperf

runner = pyperf.Runner()
runner.timeit(name="x.insert(2, 1); x.pop(1)",
              stmt="x.insert(2, 1); x.pop(1)",
              setup="x = [None] * 100_000")

Result:

Mean +- std dev: [m] 65.5 us +- 0.5 us -> [pr] 56.9 us +- 0.7 us: 1.15x faster

So for some reason the optimization of the for loop is not as good as the memmove. The difference is relatively small though, and only present for large lists.
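
For reference, the same size-stable insert-then-pop pattern scaled down to a small list (a sketch; only the 100,000-element version was run above):

import pyperf

# Insert-then-pop keeps the list at a fixed length of 10, so every timed
# iteration measures an insert into a genuinely small list.
runner = pyperf.Runner()
runner.timeit(name="x.insert(2, 1); x.pop(1) (len 10)",
              stmt="x.insert(2, 1); x.pop(1)",
              setup="x = [None] * 10")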

@rhettinger
Contributor

Also, make sure this gets tested across different compilers, builds, and machines. The margin is so tight that it is easy to fool yourself, making Python slightly faster on one build while negatively impacting others. At one time, the current code was shown to be fastest. Note that the quality of memmove implementations varies across compilers, so it takes effort to draw conclusions. The pyperf timings are also challenging to interpret: these tight loops tend to benefit from perfect branch prediction and all memory accesses fitting into L1 cache, so they aren't at all representative. That is why it was a good suggestion to look at the impacts of the changes on Grant's extensive benchmarks for SortedContainers.

@mdboom
Contributor

mdboom commented Aug 1, 2022

@rhettinger wrote:

> Grant's extensive benchmarks for SortedContainers.

Out of curiosity, do you have a link to those? It may be worth rolling those into pyperformance...

@iritkatriel iritkatriel added the pending The issue will be closed if no feedback is provided label Aug 17, 2022
@gvanrossum
Member

I'm honestly going to close this, just like I did with #83982. But @eendebakpt or @radiantly, if you disagree, just post a comment here and I'll gladly reopen! (I'm not promising the fix will be merged, though. Somebody would have to do more benchmarking, on more compilers, with PGO and LTO.)

@gvanrossum gvanrossum removed the pending The issue will be closed if no feedback is provided label Aug 31, 2022