
Micro-optimization: Use memmove instead of for loop in list.insert() #94508

Closed
radiantly opened this issue Jul 2, 2022 · 15 comments
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)

Comments

@radiantly

radiantly commented Jul 2, 2022

Background

I noticed a difference in speed when inserting an element at the beginning of a list using list.insert vs. list slice assignment. Upon inspecting the code, the use of memmove in the list slice assignment code seems to be providing a slight speed boost over the for loop used in the list.insert implementation.

The following code was used to benchmark the difference:

import random
import timeit

runs = 100000
setup = 'import random;lst=[random.randrange(100000) for _ in range(100000)]; x = random.randrange(100000)'

print(timeit.timeit(stmt='lst[:0] = [x]', setup=setup, number=runs))
print(timeit.timeit(stmt='lst.insert(0, x)', setup=setup, number=runs))

Original:

Slice assignment: 7.394628149999335s
list.insert:      8.910868348000804s

Using memmove in list.insert():

Slice assignment: 7.252985826999065s
list.insert:      7.215608489997976s

(Compiled with optimizations and tested on Arch Linux, kernel 5.18)

@radiantly radiantly added the type-feature A feature request or enhancement label Jul 2, 2022
@AlexWaygood AlexWaygood added the performance Performance or resource usage label Jul 3, 2022
@eendebakpt
Contributor

Do you have any benchmarks for small lists?

@radiantly
Author

Although I did not expect a difference, the memmove variant seems to be slightly faster for smaller lists too.

main branch:
For list of length 1:
 Slice assignment: 0.9193216409912566
 list.insert:      1.1760083219996886
 Difference:      +0.25668668100843206
For list of length 10:
 Slice assignment: 0.9002103189995978
 list.insert:      1.175773538008798
 Difference:      +0.2755632190092001
For list of length 100:
 Slice assignment: 0.9053751649917103
 list.insert:      1.176510691002477
 Difference:      +0.2711355260107666
For list of length 1000:
 Slice assignment: 0.924577939993469
 list.insert:      1.1964725340076257
 Difference:      +0.2718945940141566
For list of length 10000:
 Slice assignment: 1.0966810410027392
 list.insert:      1.4154912599915406
 Difference:      +0.31881021898880135
patched:
For list of length 1:
 Slice assignment: 0.9132311169960303
 list.insert:      0.9028571050002938
 Difference:      -0.01037401199573651
For list of length 10:
 Slice assignment: 0.9108372859918745
 list.insert:      0.9029034480045084
 Difference:      -0.007933837987366132
For list of length 100:
 Slice assignment: 0.908425076995627
 list.insert:      0.9027242980082519
 Difference:      -0.005700778987375088
For list of length 1000:
 Slice assignment: 0.9295631379936822
 list.insert:      0.9190660210006172
 Difference:      -0.010497116993064992
For list of length 10000:
 Slice assignment: 1.0984339760034345
 list.insert:      1.095934324010159
 Difference:      -0.00249965199327562
Benchmark code
import random
import timeit

runs = 100000

for element_count in [1, 10, 100, 1000, 10000]:
    setup = f'import random;lst=[random.randrange(100000) for _ in range({element_count})]; x = random.randrange(100000)'

    slice_time = timeit.timeit(stmt='lst[:0] = [x]', setup=setup, number=runs)
    insert_time = timeit.timeit(stmt='lst.insert(0, x)', setup=setup, number=runs)
    
    print(f"For list of length {element_count}:")
    print(f" Slice assignment: {slice_time}")
    print(f" list.insert:      {insert_time}")
    print(f" Difference:      {insert_time - slice_time:+}")

@rhettinger
Contributor

rhettinger commented Jul 3, 2022

This should be checked for other compilers as well, so that an improvement on one build doesn't negatively impact another. Also, be sure to run timings using the same full optimization settings that we use for the production builds.

Also, it is a little surprising that memmove would beat a for-loop. Modern compilers are great at optimizing for-loops like this. Conceptually, memmove() is at a disadvantage because its byte-oriented API can't know that we're moving an exact multiple of sizeof(PyObject *), that the data is pointer-aligned, and that no overlap-direction test needs to be performed. That said, perhaps the for-loop optimizations are being hobbled by -fno-strict-aliasing.

@eendebakpt
Contributor

eendebakpt commented Jul 3, 2022

Update: the results in this comment are not representative; see the comments below.

@radiantly @rhettinger I also find it very surprising, but I can confirm the results on my system. For small lists, too, memmove seems to be faster.

Note: bytearray also uses memmove in its insert implementation.

Benchmark: compare list.insert to slice assignment


import pyperf

runner = pyperf.Runner()

for n in [1, 2, 10, 100, 1000, 50_000]:
    runner.timeit(name=f"list({n}).insert(0, None)",
                  stmt="lst.insert(0, None)",
                  setup=f"import random; lst=[random.random() for ii in range({n})]")
    runner.timeit(name=f"list({n}).slice",
                  stmt="lst[:0] = [None]",
                  setup=f"import random; lst=[random.random() for ii in range({n})]")

Results

list(1).insert(0, None): Mean +- std dev: [main] 5.66 us +- 0.12 us -> [pr] 4.11 us +- 0.06 us: 1.38x faster
list(1).slice: Mean +- std dev: [main] 4.15 us +- 0.06 us -> [pr] 4.19 us +- 0.08 us: 1.01x slower
list(2).insert(0, None): Mean +- std dev: [main] 5.65 us +- 0.10 us -> [pr] 4.12 us +- 0.05 us: 1.37x faster
list(2).slice: Mean +- std dev: [main] 4.19 us +- 0.08 us -> [pr] 4.15 us +- 0.07 us: 1.01x faster
list(10).insert(0, None): Mean +- std dev: [main] 5.63 us +- 0.09 us -> [pr] 4.12 us +- 0.06 us: 1.37x faster
list(100).insert(0, None): Mean +- std dev: [main] 5.62 us +- 0.08 us -> [pr] 4.14 us +- 0.11 us: 1.36x faster
list(100).slice: Mean +- std dev: [main] 4.22 us +- 0.07 us -> [pr] 4.19 us +- 0.05 us: 1.01x faster
list(1000).insert(0, None): Mean +- std dev: [main] 5.95 us +- 0.11 us -> [pr] 4.38 us +- 0.08 us: 1.36x faster
list(50000).insert(0, None): Mean +- std dev: [main] 20.9 us +- 0.2 us -> [pr] 16.6 us +- 0.2 us: 1.26x faster

Benchmark hidden because not significant (3): list(10).slice, list(1000).slice, list(50000).slice

Geometric mean: 1.16x faster

System

gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

@pochmann
Contributor

pochmann commented Jul 5, 2022

Previous issue about it:
#83982

@pochmann
Contributor

pochmann commented Jul 5, 2022

How many repetitions did pyperf do? I.e., how "small" were the lists on average (for each initial size n)?

@eendebakpt
Contributor

eendebakpt commented Jul 5, 2022

@pochmann The exact benchmark code is in the details section of my previous comment. The initial size of the list was picked from 1, 2, 10, 100, 1000, 50000.

I also created a plain C program to compare the for loop with memmove. It showed no difference at all between the two options. I will look into CPython's exact compiler settings to see if I can find something that explains the difference within CPython.

C test program
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/timeb.h>

/* Returns the elapsed time in seconds since t0 (despite the "ms" in the name). */
double get_time_ms (double t0)
{
    struct timespec ts;

    if (clock_gettime (CLOCK_MONOTONIC, &ts) == 0)
    {
        return ((double) ts.tv_sec +
                (double) ts.tv_nsec / 1000000000.0) - t0;
    }
    }
    else
        return 0;
}

/* Mirrors the backwards copy loop used by CPython's list insert (ins1). */
void
plain_loop (int **items, int N, int where)
{
    for (int i = N - 1; --i >= where;)
        items[i + 1] = items[i];
}

void
forward_loop (int **items, int N, int where)
{
    // not equivalent to plain_loop, included for reference;
    // the bound is N - 1 so that items[i + 1] stays inside the array
    for (int i = where; i < N - 1; i++)
        items[i] = items[i + 1];
}

void
memmove_impl (int **items, int N, int where)
{
    // move the same N - 1 - where elements that plain_loop moves,
    // keeping the write within the bounds of items
    memmove (&items[where + 1], &items[where], (N - 1 - where) * sizeof (int *));
}

#define N 400000
int
main (int argc, char **argv)
{
    int i;
    double t0, dt;

    int *items[N];
    for (long ii = 0; ii < N; ii++)
        items[ii] = (int *) (ii % 20);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            plain_loop (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            memmove_impl (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);

    t0 = get_time_ms (0);

    for (int k = 0; k < 10; k++)
    {
        for (int where = 0; where < 200; where++)
        {
            forward_loop (items, N, where);
        }
    }

    dt = get_time_ms (t0);
    printf ("dt: %f\n", dt);
}

// gcc -O3 -fno-strict-aliasing test.c  && ./a.out
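
One way to look up the settings an existing CPython build actually used, so the standalone test can be compiled with matching flags (a minimal sketch; sysconfig exposes these variables on POSIX builds):

import sysconfig

# Print the compiler and flags recorded for this CPython build, so the
# standalone C test above can be rebuilt with matching settings.
for var in ("CC", "CFLAGS", "OPT"):
    print(var, "=", sysconfig.get_config_var(var))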

@pochmann
Contributor

pochmann commented Jul 5, 2022

@eendebakpt Those are the initial sizes, before all the insertions. It doesn't show how many insertions were done, and thus how large the lists grew.

For example if pyperf decided to do 10000 repetitions, then with an initial size of 1, the insertions were on sizes 1 to 10000, i.e., with an average size of 5000. Not on size 1 (except the very first insertion).
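
A minimal sketch of that effect using plain timeit (the loop count here is an arbitrary choice, not taken from the benchmarks above):

import timeit

# The same list is reused for every timed repetition, so each insert
# runs on a list one element longer than the last.
lst = [0]  # "initial size 1"
timeit.timeit(stmt="lst.insert(0, None)", globals=globals(), number=100_000)
print(len(lst))  # 100001: the average insert saw a list of about 50000 elements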

@pochmann
Contributor

pochmann commented Jul 5, 2022

@eendebakpt I'm not even convinced the results show that the PR made it faster. Sure, "1.38x faster" sounds like it was faster, but...

What if it actually got slower, enough that pyperf decided to only do half as many repetitions as before the PR, and thus it only did half as many insertions, meaning on average half the list size and thus on average half the work? Then it was 1.38x faster at doing half as much work, meaning it actually was 2/1.38 = 1.45x slower.

At least with my admittedly small knowledge of pyperf, I can't rule that out. Or is that not how pyperf works?
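
One way to rule that out (a sketch, assuming pyperf's standard command-line options) is to pin the loop count so that both builds perform exactly the same number of insertions:

import pyperf

# bench.py: run as `python bench.py --loops 1000` on both builds so that
# pyperf cannot choose a different repetition count for each.
runner = pyperf.Runner()
runner.timeit(name="list(1).insert(0, None)",
              stmt="lst.insert(0, None)",
              setup="lst = [None]")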

@eendebakpt
Contributor

> @eendebakpt I'm not even convinced the results show that the PR made it faster. Sure, "1.38x faster" sounds like it was faster, but...
>
> What if it actually got slower, enough that pyperf decided to only do half as many repetitions as before the PR, and thus it only did half as many insertions, meaning on average half the list size and thus on average half the work? Then it was 1.38x faster at doing half as much work, meaning it actually was 2/1.38 = 1.45x slower.
>
> At least with my admittedly small knowledge of pyperf, I can't rule that out. Or is that not how pyperf works?

I think you have a point; I will check later today.

@pochmann
Contributor

pochmann commented Jul 6, 2022

@radiantly Btw, your benchmark with lengths 1, 10, 100, etc. also has the issue that it doesn't really measure insertions at those sizes. I just didn't ask about it since we can see your fixed runs = 100000, meaning that for initial length 1, for example, you measured insertion into a list of average size 50000 (which I wouldn't call small).

As I showed in the mentioned previous issue, memmove seemed over 3 times faster at inserting at index -500, which I chose due to sortedcontainers.SortedList. Inserting a few million random numbers into a SortedList might be a good "real life" test (something reasonable/realistic to do, unlike our inserting 100000 values at the start of a list, which Tim would probably still say "isn't really sane"). I'm curious how that's affected by the PR.
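
A sketch of that SortedList test (it assumes the third-party sortedcontainers package, and the element count is an arbitrary choice):

import random
import time
from sortedcontainers import SortedList

# SortedList.add bisects into bounded-size internal Python lists and then
# calls list.insert on one of them, so insert performance matters here.
values = [random.random() for _ in range(2_000_000)]
sl = SortedList()
t0 = time.perf_counter()
for v in values:
    sl.add(v)
print(f"{time.perf_counter() - t0:.2f}s for {len(sl):,} adds")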

@eendebakpt
Contributor

@pochmann The results from my earlier comment are indeed not representative. With a fixed test I see no performance difference for small lists, and a small improvement with memmove for the larger lists.

Test:

import pyperf

runner = pyperf.Runner()
runner.timeit(name="x.insert(2, 1); x.pop(1)",
              stmt="x.insert(2, 1); x.pop(1)",
              setup="x = [None] * 100_000")

Result:

Mean +- std dev: [m] 65.5 us +- 0.5 us -> [pr] 56.9 us +- 0.7 us: 1.15x faster

So for some reason the optimization of the for loop is not as good as the memmove. The difference is relatively small though, and only present for large lists.
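
For reference, the same size-stable insert-then-pop pattern scaled down to a small list (a sketch; only the 100,000-element version was run above):

import pyperf

# Insert-then-pop keeps the list at a fixed length of 10, so every timed
# iteration measures an insert into a genuinely small list.
runner = pyperf.Runner()
runner.timeit(name="x.insert(2, 1); x.pop(1) (len 10)",
              stmt="x.insert(2, 1); x.pop(1)",
              setup="x = [None] * 10")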

@rhettinger
Contributor

Also, make sure this gets tested across different compilers, builds, and machines. The margin is so tight that it is easy to fool yourself, making Python slightly faster on one build while negatively impacting others. At one time, the current code was shown to be fastest. Note that the quality of memmove implementations varies across compilers, so it takes effort to draw conclusions. The pyperf timings are also challenging to interpret: these tight loops tend to benefit from perfect branch prediction and all memory accesses fitting into L1 cache, so they aren't at all representative. That is why it was a good suggestion to look at the impacts of the changes on Grant's extensive benchmarks for SortedContainers.

@mdboom
Contributor

mdboom commented Aug 1, 2022

@rhettinger wrote:

> Grant's extensive benchmarks for SortedContainers.

Out of curiosity, do you have a link to those? It may be worth rolling those into pyperformance...

@iritkatriel iritkatriel added the pending The issue will be closed if no feedback is provided label Aug 17, 2022
@gvanrossum
Member

I'm honestly going to close this, just like I did with #83982. But @eendebakpt or @radiantly, if you disagree, just post a comment here and I'll gladly reopen! (I'm not promising the fix will be merged, though. Somebody would have to do more benchmarking, on more compilers, with PGO and LTO.)

@gvanrossum gvanrossum removed the pending The issue will be closed if no feedback is provided label Aug 31, 2022