
Convoy effect with I/O bound threads and New GIL #52194

Closed
dabeaz mannequin opened this issue Feb 16, 2010 · 100 comments
Labels
3.10 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

Comments

@dabeaz
Mannequin

dabeaz mannequin commented Feb 16, 2010

BPO 7946
Nosy @gvanrossum, @loewis, @arigo, @gpshead, @jcea, @mdickinson, @ncoghlan, @pitrou, @scoder, @larryhastings, @ericvsmith, @tarekziade, @carljm, @coderanger, @phsilva, @merwok, @alex, @jab, @briancurtin, @florentx, @cool-RR, @dimaqq, @thedrow, @corona10, @Sophist-UK
Files
  • udp-ioclient.py
  • udp-iotest.py
  • gilprio2.patch
  • prioritygil.tar.gz: Priority GIL implementation from PyCON
  • linux-7946.patch: Linux (libc) patch for py32
  • gilinter2.patch
  • writes.py
  • schedtest.py: Scheduling test
  • dabeaz_gil.patch: Dave's GIL patch
  • nir-ccbench-xp32.log
  • nir-ccbench-linux.log
  • ccbench-osx.log
  • bfs.patch: Updated implementation of the BFS scheduler.
    Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-01-02.02:43:00.577>
    created_at = <Date 2010-02-16.20:48:43.897>
    labels = ['interpreter-core', '3.10', 'performance']
    title = 'Convoy effect with I/O bound threads and New GIL'
    updated_at = <Date 2022-03-21.22:05:04.974>
    user = 'https://bugs.python.org/dabeaz'

    bugs.python.org fields:

    activity = <Date 2022-03-21.22:05:04.974>
    actor = 'Sophist'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-01-02.02:43:00.577>
    closer = 'dabeaz'
    components = ['Interpreter Core']
    creation = <Date 2010-02-16.20:48:43.897>
    creator = 'dabeaz'
    dependencies = []
    files = ['16315', '16316', '16323', '16324', '16570', '16676', '16968', '17095', '17106', '17194', '17370', '17393', '17504']
    hgrepos = []
    issue_num = 7946
    keywords = ['patch']
    message_count = 100.0
    messages = ['99438', '99439', '99459', '99461', '99477', '99814', '99815', '99838', '99840', '99846', '99875', '100094', '101116', '101118', '101120', '101125', '101127', '101141', '101142', '101144', '101146', '101148', '101192', '101197', '101612', '101613', '101697', '101700', '101707', '101711', '101714', '101715', '101724', '101737', '101772', '101773', '101821', '101825', '101854', '102043', '102649', '102890', '103154', '103186', '103320', '103414', '103460', '104194', '104195', '104197', '104198', '104203', '104219', '104233', '104234', '104288', '104293', '104303', '104309', '104312', '104317', '104319', '104320', '104324', '104325', '104330', '104335', '104369', '104452', '104453', '104473', '104478', '104613', '104899', '105687', '105835', '105874', '106030', '106780', '106782', '156883', '223109', '223110', '346399', '346495', '347621', '347640', '347641', '377825', '377865', '378187', '378200', '378236', '385088', '385109', '415693', '415698', '415703', '415708', '415715']
    nosy_count = 49.0
    nosy_names = ['gvanrossum', 'loewis', 'jhylton', 'arigo', 'gregory.p.smith', 'jcea', 'mark.dickinson', 'ncoghlan', 'pitrou', 'scoder', 'movement', 'larry', 'eric.smith', 'kevinwatters', 'tarek', 'karld', 'carljm', 'coderanger', 'phsilva', 'durin42', 'eric.araujo', 'nirai', 'alex', 'stuaxo', 'andrix', 'konryd', 'jab', 'brian.curtin', 'hozn', 'victorpoluceno', 'flox', 'DazWorrall', 'cool-RR', 'rh0dium', 'rcohen', 'dabeaz', 'mahmoudimus', 'portante', 'aconrad', 'ysj.ray', 'thouis', 'donaldjeo', 'Michele', 'jmehnle', 'Dima.Tisnek', 'Omer.Katz', 'corona10', 'Sophist', 'maartenbreddels']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue7946'
    versions = ['Python 3.10']

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Feb 16, 2010

    Background
    -----------
    In order to multitask with threads, a critical part of the Python
    interpreter implementation concerns the behavior of I/O operations
    such as read, write, send, and receive. Specifically, whenever an I/O
    operation is carried out, the interpreter releases the GIL so that
    other threads can run while the program waits for the I/O operation to
    complete.

    Observed Behavior of I/O Operations
    ------------------------------------
    The release of the GIL for I/O is a critical aspect of making sure the
    interpreter doesn't make all threads hang while waiting. However, the
    release of the GIL also assumes a worst-case scenario. In practice,
    a large number of I/O requests actually complete immediately with no
    actual blocking. For example, if a program is sending on a socket,
    send() operations will typically complete immediately if buffer space
    is available to accept the data. Likewise, read() and recv()
    operations may return immediately if data is already available in the
    operating system.

    For system calls that complete immediately, a thread quickly releases
    and then almost immediately reacquires the GIL to continue running.
    However, given that the I/O operation didn't block, the release of the
    GIL was technically unnecessary in this case.

    Behavior of the new GIL
    -----------------------
    A feature of the new Python GIL implementation is that the interpreter
    no longer periodically signals waiting threads (e.g., the check
    interval). Instead, thread switching is based on a timed wait on a
    condition variable. Should a timeout occur, a thread will indicate
    that it wants the GIL and the currently running thread will be forced
    to give it up.
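
    To make the description above concrete, here is a highly simplified,
    Python-level sketch of the hand-off logic (the real implementation is C
    code inside the interpreter's GIL machinery; the class, attribute names,
    and the 5 ms value below are purely illustrative):

    import threading

    SWITCH_INTERVAL = 0.005        # illustrative: the default switch interval (5 ms)

    class SimplifiedGIL:
        def __init__(self):
            self.cond = threading.Condition()
            self.locked = False
            self.drop_request = False

        def acquire(self):
            with self.cond:
                while self.locked:
                    timed_out = not self.cond.wait(timeout=SWITCH_INTERVAL)
                    if timed_out and self.locked:
                        # Timed out: ask the running thread to yield.  The
                        # running thread checks this flag between bytecodes.
                        self.drop_request = True
                self.locked = True
                self.drop_request = False

        def release(self):
            with self.cond:
                self.locked = False
                self.cond.notify()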

    Although this scheme solves the problem of CPU-bound threads
    thrashing, it introduces a highly pronounced "convoy effect" when
    CPU-bound threads and I/O bound threads get mixed together. A major
    part of the problem is caused by the behavior of I/O as described
    above. Specifically, when an I/O bound thread executes an I/O call,
    it always releases the GIL. Since the GIL is released, a CPU bound
    thread is now free to acquire the GIL and run. However, if the I/O
    call completes immediately (which is common), the I/O bound thread
    immediately stalls upon return from the system call. To get the GIL
    back, it now has to go through the timeout process to force the
    CPU-bound thread to release the GIL again.

    It should be noted that the behavior described also occurs in Python
    2, but due to the small check interval, an I/O bound thread that wants
    the GIL back will tend to get it without having to wait very long.
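
    For reference, the knob that controls thread switching differs between the
    two interpreters.  A quick way to inspect the defaults (the values in the
    comments are the documented defaults, not measurements):

    import sys

    # Python 2.x: a bytecode-count based "check interval" (default 100)
    if hasattr(sys, "getcheckinterval"):
        print("check interval:", sys.getcheckinterval())

    # Python 3.2+: a time-based "switch interval" (default 0.005 seconds)
    if hasattr(sys, "getswitchinterval"):
        print("switch interval:", sys.getswitchinterval())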

    Example
    -------
    Here is a very simple example that illustrates the problem. In this
    example, there is one CPU-bound thread that hammers the CPU and there
    is an I/O bound thread that handles some network communication (an
    echo server):

    # iotest.py
    import time
    import threading
    from socket import *
    
    # CPU-bound thread (just hammers the CPU)
    def spin():
        while True:
            pass
    
    # I/O-bound thread (an echo TCP server)
    def echo_server():
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR,1)
        s.bind(("",15000))
        s.listen(1)
        while True:
            c,a = s.accept()
            while True:
                data = c.recv(8192)
                if not data:
                    break
                c.sendall(data)
            c.close()
        s.close()
    
    # Launch the CPU-bound thread
    t1 = threading.Thread(target=spin)
    t1.daemon=True
    t1.start()
    
    # Run the I/O server
    echo_server()

    Here is a benchmark program that runs as a client for the echo_server()
    thread defined above. It sends a sequence of messages and reads the
    response back. It then reports some timings at the end.

    # echoclient.py
    from socket import *
    import time
    
    CHUNKSIZE = 16384
    NUMMESSAGES = 640     # Total of 10MB
    
    # Dummy message
    msg = b"x"*CHUNKSIZE
    
    # Connect and send messages
    s = socket(AF_INET,SOCK_STREAM)
    s.connect(("",15000))
    start = time.time()
    for n in range(NUMMESSAGES):
        s.sendall(msg)
        bytes_recv = len(msg)
        # Get the response back
        while bytes_recv > 0:
            data = s.recv(bytes_recv)
            bytes_recv -= len(data)
    s.close()
    end = time.time()
    print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))

    Performance Results
    -------------------
    These results are from running the above programs on a dual-core
    MacBook running OS-X Snow Leopard. I also get similar behavior on a
    quad-core desktop machine.

    If you run the iotest.py program using Python 2.6.4 and execute
    the client, you get this result:

    bash % python echoclient.py
    1.064 seconds (9854148.739 bytes/sec)

    If you switch the iotest.py to Python 3.2 and rerun, you get this
    result:

    bash % python echoclient.py
    12.340 seconds (849726.150 bytes/sec)

    Notice that there is a factor 12 performance difference.

    Modify the iotest.py program so that there are 2 CPU-bound
    threads spinning. Just add this extra code:

        t2 = threading.Thread(target=spin)
        t2.daemon = True
        t2.start()

    Now, repeat the above tests. For Python 2.6.4, you get this:

    bash-3.2$ python echoclient.py
    0.358 seconds (29319821.410 bytes/sec)
    

    (Yes the performance actually improves! That's left as an exercise
    for the reader to figure out why)

    Now, switch the iotest.py server to Python 3.2 and retry:

    bash-3.2$ python echoclient.py
    59.545 seconds (176098.609 bytes/sec)    
    

    Notice how the addition of one CPU-bound thread made the time go up by
    more than a factor 4!

    Now, disable all but one of the CPU cores and try the test again in
    Python 3.2:

    bash-3.2$ python echoclient.py
    0.120 seconds (87246036.201 bytes/sec)
    

    Here, you see that it runs about 500 times faster than with two cores
    (yikes!)

    What's causing this behavior?
    -----------------------------
    In the iotest.py program, there is an inner loop that
    looks like this:

            while True:
                data = c.recv(8192)
                if not data:
                    break
                c.sendall(data)

    The I/O operations recv() and sendall() always release the GIL when
    they execute. However, when this happens, CPU bound threads jump in
    and start running again. The problem gets worse as the number of
    CPU-bound threads increases--CPU bound threads might cycle round-robin
    before the I/O bound thread runs again. The problem is more
    pronounced on multiple CPU cores because when the GIL is released, one
    of the cores will typically go handle the system call while the other
    core wakes up the waiting CPU-bound thread (which then almost
    immediately steals the GIL).

    Is it worth fixing?
    -------------------
    I claim yes. There are many applications, such as those carried
    out with the multiprocessing library, that will operate by trying
    to overlap computation and I/O in some manner (for example, receiving
    the next chunk of data to work on while carrying out calculations
    on the currently received data).

    In heavily loaded I/O bound applications such as servers with
    hundreds of simultaneously connected clients, the release of the GIL
    on short I/O operations may cause a lot of unintended thrashing
    as threads cycle amongst themselves. This would most likely
    manifest itself as an increased turnaround time for requests.

    How to fix?
    -----------
    Modify all I/O operations in the interpreter to not release the
    GIL if they won't block. Either that or maybe there's some sort of
    really sneaky easy solution (unknown).

    The effect can be minimized by setting the switch interval to a really
    small value using sys.setswitchinterval(). However, doing this
    greatly increases the amount of thread context-switching--something
    that's also undesirable.
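
    A minimal sketch of that mitigation (not a fix; the 0.0001 value is purely
    illustrative):

    import sys

    print(sys.getswitchinterval())    # 0.005 seconds by default
    sys.setswitchinterval(0.0001)     # I/O threads get the GIL back sooner,
                                      # at the cost of many more context switches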

    @dabeaz dabeaz mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and type-bug (An unexpected behavior, bug, or error) labels Feb 16, 2010
    @florentx florentx mannequin assigned pitrou Feb 16, 2010
    @florentx florentx mannequin added the performance (Performance or resource usage) label and removed the type-bug (An unexpected behavior, bug, or error) label Feb 16, 2010
    @alex
    Member

    alex commented Feb 16, 2010

    Would the idea of priority-GIL requests that Antoine had in his original patch solve this issue?

    @pitrou
    Member

    pitrou commented Feb 17, 2010

    Just a quick test under Linux (on a dual quad core machine):

    • with iotest.py and echo_client.py both running Python 2.7: 25.562 seconds (410212.450 bytes/sec)
    • with iotest.py and echo_client.py both running Python 3.2: 28.160 seconds (372362.459 bytes/sec)

    As already said, the "spinning endlessly" loop is a best case for thread switching latency in 2.x, because the opcodes are very short. If each opcode in the loop has an average duration of 20 ns, and with the default check interval of 100, the GIL gets speculatively released every 2 us (yes, microseconds). That's why I suggested trying more "realistic" workloads, as in ccbench.
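
    A back-of-the-envelope check of that 2 us figure (both numbers are the
    stated assumptions, not measurements):

    opcode_ns = 20          # assumed average opcode duration in the spin loop
    check_interval = 100    # Python 2.x default check interval
    print(opcode_ns * check_interval, "ns")   # 2000 ns, i.e. 2 microseconds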

    Also, as I told you, there might be interactions with the various timing heuristics the TCP stack of the kernel applies. It would be nice to test with UDP.

    That said, the observations are interesting.

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Feb 17, 2010

    The comment on the CPU-bound workload is valid--it is definitely true that Python 2.6 results will degrade as the workload of each tick is increased. Maybe a better way to interpret those results is as a baseline of what kind of I/O performance is possible if there is a quick I/O response time.

    However, ignoring that and the comparison between Python 2.6 and 3.2, there is still a serious performance issue with I/O in 3.2: for example, I/O performance decreases dramatically as more CPU-bound threads compete, and there is a huge performance gain when all but one CPU core is disabled.

    I tried the test using UDP packets and get virtually the exact same behavior described. For instance, echoing 10MB (sent in 8k UDP packets) takes about 0.6s in Python 2.6 and 12.0s in Python-3.2. The time shoots up to more than 40s if there are two CPU-bound threads.

    The problem being described really doesn't have anything to do with TCP vs. UDP or any part of the network stack. It has everything to do with how the operating system buffers I/O requests and how I/O operations such as sends and receives complete immediately without blocking depending on system buffer characteristics (e.g., if there is space in the send buffer, a send will return immediately without blocking). The fact that the GIL is released when it's not necessary in these cases is really the source of the problem.
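
    For reference, here is a minimal UDP echo pair in the spirit of the
    attached udp-iotest.py / udp-ioclient.py (the attached files are the
    authoritative versions; this sketch only shows the shape of the test, and
    the sizes are illustrative):

    from socket import socket, AF_INET, SOCK_DGRAM

    # UDP echo server (run alongside a spin() thread, as in iotest.py)
    def udp_echo_server(port=15000):
        s = socket(AF_INET, SOCK_DGRAM)
        s.bind(("", port))
        while True:
            data, addr = s.recvfrom(8192)
            s.sendto(data, addr)

    # UDP client: send 8KB packets and wait for each echo (1280 * 8KB = 10MB)
    def udp_client(nummessages=1280, port=15000):
        msg = b"x" * 8192
        c = socket(AF_INET, SOCK_DGRAM)
        for _ in range(nummessages):
            c.sendto(msg, ("127.0.0.1", port))
            c.recvfrom(8192)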

    @pitrou
    Member

    pitrou commented Feb 17, 2010

    We could try not to release the GIL when socket methods are called on a non-blocking socket.

    Regardless, I've re-run the tests under the Linux machine, with two spinning threads:

    • python 2.7:
      25.580 seconds (409914.612 bytes/sec)
    • python 3.2:
      32.216 seconds (325485.029 bytes/sec)

    (and as someone mentioned, the "priority requests" mechanism which was in the original new GIL patch might improve things. It's not an ideal time for me to test, right now :-))

    @pitrou
    Member

    pitrou commented Feb 22, 2010

    I'm attaching Dave's new UDP-based benchmarks, which eliminate the dependency on the TCP stack's behaviour.

    @pitrou
    Member

    pitrou commented Feb 22, 2010

    And here is an experimental patch which enables the "priority requests" mechanism which was in the original new GIL patch. "Experimental" because it only enables them on a couple of socket methods (enough to affect the benchmarks).

    Here are the UDP benchmark results (with 2 background threads):

    • 2.x trunk (old GIL):
      11.379 seconds (921515.168 bytes/sec)
    • vanilla py3k (new GIL):
      27.400 seconds (382689.529 bytes/sec)
    • patched py3k (new GIL + priority requests):
      1.828 seconds (5737130.676 bytes/sec)

    And here's patched py3k with 8 background threads:
    3.058 seconds (3429136.193 bytes/sec)

    (benchmarks run on a 8-core Linux machine)

    @pitrou
    Member

    pitrou commented Feb 22, 2010

    See also bpo-7993 for a patch adding a similar bandwidth benchmark to ccbench.

    @pitrou
    Member

    pitrou commented Feb 22, 2010

    Here is an improved version of the priority requests patch.
    With this patch I get the following result (2 background threads):
    0.255 seconds (41109347.194 bytes/sec)

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Feb 22, 2010

    I posted some details about the priority GIL modifications I showed during my PyCON open-space session here:

    http://www.dabeaz.com/blog/2010/02/revisiting-thread-priorities-and-new.html

    I am attaching the .tar.gz file with modifications if anyone wants to look at them. Note: I would not consider this to be solid enough to be any kind of official patch. People should only look at it for the purposes of experimentation and for coming up with something better.

    @pitrou
    Member

    pitrou commented Feb 22, 2010

    Here is another patch based on a slightly different approach. Instead of being explicitly triggered in I/O methods, priority requests are decided based on the computed "interactiveness" of a thread. Interactiveness itself is a simple saturated counter (incremented when the GIL is dropped without request, decremented when the GIL is dropped after a request).

    Benchmark numbers are basically the same as with gilprio2.patch.
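
    A minimal Python sketch of that saturated-counter idea (the actual patch is
    C code inside the GIL implementation; the bound and threshold below are
    invented for illustration):

    MAX_INTERACTIVENESS = 5      # saturation bound (assumed)
    PRIORITY_THRESHOLD = 3       # "interactive enough" cutoff (assumed)

    def update(counter, drop_was_requested):
        """Called each time a thread drops the GIL."""
        if drop_was_requested:
            return max(counter - 1, 0)                # forced drop: less interactive
        return min(counter + 1, MAX_INTERACTIVENESS)  # voluntary drop (e.g. around I/O)

    def is_interactive(counter):
        # Threads above the threshold would be granted priority requests.
        return counter >= PRIORITY_THRESHOLD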

    @briancurtin
    Member

    The UDP benchmarks look even better on Windows.
    These are results of the attached scripts when running on an Intel Xeon (quad-core) 2.33GHz running Windows Server 2003 x64.

    vanilla py3k
    32-bit: 26.323 || 64-bit: 25.29

    gilprio2 patched py3k
    32-bit: 0.932 || 64-bit: 0.682

    gilinter patched py3k
    32-bit: 0.958 || 64-bit: 0.703

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 15, 2010

    Here's a short benchmark for everyone who thinks that my original benchmark was somehow related to TCP behavior. This one doesn't even involve sockets:

    from threading import Thread
    import time
    
    def writenums(f,n):
        start = time.time()
        for x in range(n):
            f.write("%d\n" % x)
        end = time.time()
        print(end-start)
    
    def spin():
        while True:
            pass

    # Uncomment to add a thread
    #t1 = Thread(target=spin)
    #t1.daemon=True
    #t1.start()

    writenums(open("/tmp/nums","w"),1000000)

    If I run this on my Macbook with no threads, it takes about 1.05 seconds. If the one spinning thread is turned on, the time jumps to about 4.5 seconds. What you're seeing is that the spinning thread unfairly hogs the CPU.

    If I use my own patched version (new GIL with priorities), the threaded version drops back down to about 1.10 seconds. I have not tried it with Antoine's latest patch, but would expect similar results as he is also using priorities.

    Just to be clear, this issue is not specific to sockets or TCP.

    @florentx
    Mannequin

    florentx mannequin commented Mar 15, 2010

    On some platforms the difference is not so important.
    I ran it in Debian Lenny AMD64 "Core2 Duo P9600 @2.66GHz".

    # Python 3.2a0 (py3k:78982M, Mar 15 2010, 15:40:42)
    # [GCC 4.3.4] on linux2

    0.67s without thread
    0.84s with spinning thread

    @florentx
    Mannequin

    florentx mannequin commented Mar 15, 2010

    With line buffering, I see the issue.

    • 6 s without thread
    • 115 s with the spinning thread (varying: 60 s, 98 s)
    • 16 s with the spinning thread and the last "gilinter.patch"

    # Modified version of the test case, with bufsize=1

    from threading import Thread
    import time
    
    def writenums(f, n):
        start = time.time()
        for x in range(n):
            f.write("%d\n" % x)
        end = time.time()
        print(end-start)
    
    def spin():
        while True:
            pass
    
    t1 = Thread(target=spin)
    t1.daemon=True
    # Uncomment to add a thread
    #t1.start()

    # With line buffering
    writenums(open("./nums", "w", 1), 1000000)

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 15, 2010

    Whoa, that's pretty diabolically evil with bufsize=1. On my machine, doing that just absolutely kills the performance (13 seconds without the spinning thread versus 557 seconds with the thread!). Or, put another way, the writing performance drops from about 0.5 Mbyte/sec down to 12 Kbytes/sec with the thread. With my priority GIL, the time is about 29 seconds with the thread (consistent with your experiment using the gilinter patch).

    One thing to think about with this example is the proper priority of I/O handling generally. What if, instead of a file, this example code was writing on a pipe to another process? For that, you would probably want that I/O thread to be able to blast its data to the receiver as fast as it reasonably can so that it can be done with it and get back to other work.

    In terms of practical relevance, this test again represents a simple situation where computation is overlapped with I/O processing. Perhaps the program has just computed a big result which is now being streamed somewhere else by a background thread. In the meantime, the program is now working on computing the next result (the spinning thread). Think queues, actors, or any number of similar things---there are programs that try to operate like this.
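
    A minimal sketch of that overlapped pattern: a background thread streams
    results out of a queue while the main thread keeps producing them (the file
    name and sizes are illustrative):

    import queue
    import threading

    q = queue.Queue()

    def writer():
        with open("/tmp/results.txt", "w") as f:
            while True:
                item = q.get()
                if item is None:          # sentinel: producer is done
                    break
                f.write(item)

    t = threading.Thread(target=writer, daemon=True)
    t.start()

    for n in range(1000000):              # "compute" the next result
        q.put("%d\n" % n)
    q.put(None)
    t.join()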

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 15, 2010

    Almost forgot--if I turn off one of the CPU cores, the time drops from 557 seconds to 32 seconds. Gotta love it!

    @pitrou
    Member

    pitrou commented Mar 16, 2010

    One thing to think about with
    this example is the proper priority of I/O handling generally. What
    if, instead of a file, this example code was writing on a pipe to
    another process? For that, you would probably want that I/O thread
    to be able to blast its data to the receiver as fast as it reasonably
    can so that it can be done with it and get back to other work.

    We should be careful with statements such as "you would probably want
    [...]". There may be situations where you want such a thing; others
    where you don't really want (or care about) it.

    While IO responsiveness and throughput can be an important measure of
    performance, it is not the only one and depending on the situation it
    actually may not matter at all.

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 16, 2010

    Oh the situation definitely matters. Although, in the big picture, most programmers would probably prefer to have fast I/O performance over slow I/O performance :-).

    @pitrou
    Member

    pitrou commented Mar 16, 2010

    Oh the situation definitely matters. Although, in the big picture,
    most programmers would probably prefer to have fast I/O performance
    over slow I/O performance :-).

    Yes, of course. But that's not the point. We could try to improve GIL
    behaviour in every thinkable situation, at the expense of code
    complexity. But we don't want to make the code too complex and
    delicate to maintain, and that's why I've asked for advice on
    python-dev.

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 16, 2010

    I absolutely agree 100% that it is not worth trying to fix the GIL for every conceivable situation (although if you could, I wouldn't complain).

    To me, there are really only two scenarios worth worrying about:

    1. Get rid of all of that multicore lock thrashing that happens if there are multiple CPU-bound threads competing. Python programmers should know that intentionally using threads like this won't offer a performance boost. However, it would still be nice to avoid all of that needless overhead when it happens by design or by accident (the new GIL already addresses this).

    2. Make sure that you can get reasonable I/O performance when there is *one* CPU-bound thread competing with some background I/O processing. This covers the use case of overlapped I/O and computation as well as almost every practical problem related to communicating processes, multiprocessing, queuing, etc.

    As for everything else, it's probably not worth worrying about so much. If someone is only doing I/O (e.g., a web server), their code is going to behave about the same as before (although maybe slightly better under heavy load if there's less GIL contention). Situations where someone intentionally tries to set up multiple long-running CPU-bound threads seem pretty unlikely---the very presence of the GIL wouldn't make that kind of programming model attractive in the first place, so why would they do it?

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Mar 16, 2010

    You know, I almost wonder whether this whole issue could be fixed by just adding a user-callable function to optionally set a thread priority number. For example:

        sys.setpriority(n)

    Modify the new GIL code so that it checks the priority of the currently running thread against the priority of the thread that wants the GIL. If the running thread has lower priority, it immediately drops the GIL.

    Other than having this added preemption, do nothing else---just throw it all back to the user to come up with the proper "priorities."

    If there was something like this, it would completely fix the overlapped compute and I/O problem I mentioned. I'd just set a higher priority on the background I/O threads and be done with it. Problem solved.

    Ok, it's only a thought.
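
    If such an API existed, usage might look like the sketch below (purely
    hypothetical: sys.setpriority() does not exist in any CPython release, and
    the numbers are arbitrary):

    import sys
    import threading

    def io_worker():
        sys.setpriority(10)    # hypothetical: background I/O thread gets higher priority
        ...                    # stream data to a socket/pipe/file here

    def cpu_worker():
        sys.setpriority(0)     # hypothetical: CPU-bound work keeps the default priority
        ...                    # crunch numbers here

    threading.Thread(target=io_worker, daemon=True).start()
    cpu_worker()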

    @nirai
    Mannequin

    nirai mannequin commented Mar 16, 2010

    I tried Florent's modification to the write test and did not see the effect on my machine with an updated revision of Python32.

    I am running Ubuntu Karmic 64 bit.
    7s - no background threads.
    20s - one background thread.

    According to the following documentation, the libc condition variable wakes threads according to scheduling policy, not in FIFO order (i.e., ordering on Linux is not FIFO):
    http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html#tag_03_518_08_06

    I am uploading a quick and dirty patch (linux-7946.patch) to the new GIL, just to reflect this, by avoiding the timed waits.

    On my machine it behaves reasonably both with the TCP server and with the write test, but so does unpatched Python 3.2.

    I noticed high context switching rate with dave's priority GIL - with both tests it goes above 40K/s context switches.

    @nirai
    Mannequin

    nirai mannequin commented Mar 16, 2010

    I updated the patch with a small fix and increased the ticks countdown-to-release considerably. This seems to help the OS classify CPU bound threads as such and actually improves IO performance.

    @nirai
    Mannequin

    nirai mannequin commented Mar 24, 2010

    I am uploading bfs.patch.

    To apply the patch use the following commands on updated python 3.2:
    $ patch -fp1 < bfs.patch
    $ ./configure

    The patch replaces the GIL with a scheduler. The scheduler is a simplified implementation of the recent kernel Brain F**k Scheduler by the Linux hacker Con Kolivas:

    http://ck.kolivas.org/patches/bfs/sched-BFS.txt
    "The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
    completely do away with the complex designs of the past for the cpu process
    scheduler and instead implement one that is very simple in basic design.
    The main focus of BFS is to achieve excellent desktop interactivity and
    responsiveness without heuristics and tuning knobs that are difficult to
    understand, impossible to model and predict the effect of, and when tuned to
    one workload cause massive detriment to another."

    Con Kolivas is the hacker whose work inspired the current CFS scheduler of the Linux Kernel.

    On my core 2 duo laptop it performs as follows compared to the other patches:

    1. Florent's writenums() test: ~same
    2. UDP test: x6 faster
    3. cpued test: works as expected, while the other patches starve the pure python threads.

    cpued test spins 3 threads, 2 of them pure python and the 3rd does time.sleep(0) every ~1ms:

    import threading
    import time
    
    def foo(n):
        while n > 0:
            'y' in 'x' * n
            n -= 1
    
    def bar(sleep, name):
        for i in range(100):
            print (name, i, sleep)
            for j in range(300):
                foo(1500)
                if sleep:
                    time.sleep(0)
    
    t0 = threading.Thread(target=bar, args=(False, 't0'))
    t1 = threading.Thread(target=bar, args=(False, 't1'))
    t2 = threading.Thread(target=bar, args=(True, 't2-interactive'))
    
    list(map(threading.Thread.start, [t0, t1, t2]))
    list(map(threading.Thread.join, [t0, t1, t2]))

    The patch is still work in progress. In particular:

    1. I still need to add support for Windows.
    2. It currently requires Posix clock_gettime() and assumes good timer resolution.
    3. I only verified it builds on Ubuntu Karmic 64bit.
    4. I still need to optimize it and address cleanup.

    The scheduler is very simple, straightforward and flexible, and it addresses the tuning problems discussed recently.

    I think it can be a good replacement to the GIL, since Python really needs a scheduler, not a lock.

    @Michele
    Mannequin

    Michele mannequin commented May 19, 2010

    Attached ccbench-osx.log made today on OSX on latest svn checkout. Hope it helps

    @nirai
    Mannequin

    nirai mannequin commented May 30, 2010

    Updated bfs.patch with BSD license and copyright notice.

    ! Current version patches cleanly and builds with Python revision svn r81201.

    bpo-7946 and proposed patches were put on hold indefinitely following this python-dev discussion: http://mail.python.org/pipermail/python-dev/2010-May/100115.html

    I would like to thank the Python developer community and in particular David and Antoine for a most interesting ride.

    Any party interested in sponsoring further development or porting patch to Python 2.x is welcome to contact me directly at nir@winpdb.org

    Nir

    @gpshead
    Member

    gpshead commented May 30, 2010

    Thanks for all your work Nir! I personally think the BFS approach is the best we've seen yet for this problem!

    Having read the thread you linked to in full (ignoring the tangents, bikeshedding, and mudslinging that went on there), it sounds like the general consensus is that we should take thread scheduling changes slowly and let the existing new implementation bake in the 3.2 release. That puts this issue as a possibility for 3.3 if users demonstrate real world application problems in 3.2.

    (Personally I'd say it is already obvious that there are problems and we should go ahead with your BFS-based approach, but realistically we're still better off in 3.2 than we were in 3.1 and 2.x as is.)

    @vstinner
    Member

    gettimeofday returns you wall clock time: if a process
    that modifies time is running, e.g. ntpd, you'll likely
    to run into trouble. the value returned is _not_ monotonic,
    ...

    The issue bpo-12822 asks to use monotonic clocks when available.

    @dimaqq
    Mannequin

    dimaqq mannequin commented Jul 15, 2014

    What happened to this bug and patch?

    @pitrou
    Member

    pitrou commented Jul 15, 2014

    Not much :) The patch is complex and the issue hasn't proved to be
    significant in production code.
    Do you have a (real-world) workload where this shows up?


    @thedrow
    Mannequin

    thedrow mannequin commented Jun 24, 2019

    Celery 5 is going async, and in order to isolate the main event loop from task execution, the tasks are going to be executed in a different thread with its own event loop.

    This thread may or may not be CPU bound.
    The main thread is I/O bound.

    This patch should help a lot.

    I like Nir's approach a lot (although I haven't looked into the patch itself yet). It's pretty novel.
    David's patch is also very interesting.

    I'm willing to help.

    @arigo
    Mannequin

    arigo mannequin commented Jun 25, 2019

    Note that PyPy has implemented a GIL which does not suffer from this problem, possibly using a simpler approach than the patches here do. The idea is described and implemented here:

    https://bitbucket.org/pypy/pypy/src/default/rpython/translator/c/src/thread_gil.c

    @thedrow
    Mannequin

    thedrow mannequin commented Jul 10, 2019

    FYI I can verify that the original benchmark is still valid on Python 3.7.3.
    I'm running the client on an 8 core CPU.
    The result is 30.702 seconds (341534.322 bytes/sec).

    I'll need somebody to decide how we're going to fix this problem.
    I can do the legwork.

    @gpshead gpshead added the 3.9 (only security fixes) label Jul 10, 2019
    @gpshead
    Member

    gpshead commented Jul 10, 2019

    I suggest:
    (1) turning one of the patches (probably the last BFS one?) into a PR against the github master branch (3.9) and,
    (2) if none of the existing pyperformance workloads already demonstrates the problems with the existing GIL implementation, adopt one of the benchmarks here into an additional pyperformance workload.
    (3) demonstrate pyperformance results (overall and targeted tests) before and after the change.

    Simultaneously, it'd also be interesting to see someone create an alternate PR using a PyPy inspired GIL implementation as that could prove to be a lot easier to maintain.

    Let's make a data-driven decision here.

    People lost interest in actually landing a fix to this issue in the past because it wasn't impacting their daily lives or applications (or if it was, they already adopted a workaround). Someone being interested enough to do the work to justify it going in is all it should take to move forward.
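
    A rough sketch of what point (2) above might look like using pyperf (not an
    actual pyperformance workload; the benchmark name and helper are invented,
    and the echo-server setup from iotest.py is omitted):

    import pyperf

    def echo_roundtrip():
        # Connect to a running echo server (as in iotest.py) and bounce one
        # message off it; left as a stub here.
        pass

    if __name__ == "__main__":
        runner = pyperf.Runner()
        runner.bench_func("gil_convoy_echo", echo_roundtrip)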

    @gpshead
    Member

    gpshead commented Jul 10, 2019

    (unassigning as it doesn't make sense to assign to anyone unless they're actually working on it)

    @larryhastings
    Contributor

    FWIW: I think David's cited behavior proves that the GIL is de facto a scheduler. And, in case you missed it, scheduling is a hard problem, and not a solved problem. There are increasingly complicated schedulers with new approaches and heuristics. They're getting better and better... as well as more and more complex. BFS is an example of that trend from ten years ago. But the Linux kernel has been shy about merging it, not sure why (technical deficiency? licensing problem? personality conflict? the name?).

    I think Python's current thread scheduling approach is almost certainly too simple. My suspicion is that we should have a much more elaborate scheduler--which hopefully would fix most (but not all!) of these sorts of pathological performance regressions. But that's going to be a big patch, and it's going to need a champion, and that champion would have to be more educated about it than I am, so I don't think it's gonna be me.

    @dabeaz
    Mannequin Author

    dabeaz mannequin commented Oct 3, 2020

    About nine years ago, I stood in front of a room of Python developers, including many core developers, and gave a talk about the problem described in this issue. It included some live demos and discussion of a possible fix.

    https://www.youtube.com/watch?v=fwzPF2JLoeU

    Based on subsequent interest, I think it's safe to say that this issue will never be fixed. Probably best to close this issue.

    @gpshead
    Member

    gpshead commented Oct 7, 2020

    It's a known issue that has been outlined very well and still comes up from time to time in real-world applications; teams that hit it tend to find this issue and Dave's presentation, work around it in whatever way is possible for their system, and move on with life.

    Keeping it open even if nobody is actively working on it makes sense to me as it is still a known issue that could be resolved should someone have the motivation to complete the work.

    @gpshead gpshead added the 3.10 (only security fixes) label and removed the 3.9 (only security fixes) label Oct 7, 2020
    @dimaqq
    Mannequin

    dimaqq mannequin commented Oct 8, 2020

    My 2c as Python user:

    Back in 2010, I used multithreading extensively, both for concurrency and performance. Others used multiprocessing or just shelled out. People talked about using **the other** core, or sometimes the other socket on a server.

    Now in 2020, I'm using asyncio exclusively. Some colleagues occasionally still shell out 🙈. No one talks about using all cores on a single machine; rather, we spin up dozens of identical containers, randomly distributed across N machines, with synchronisation offloaded to some database (e.g. atomic ops in redis; transactions in sql).

    In my imagination, I see future Python as single-threaded (from the user's point of view, that is, without a multithreading API), featuring speculative out-of-order async task execution (using hardware threads, maybe pinned) that's invisible to the user.

    @vstinner
    Member

    vstinner commented Oct 8, 2020

    If someone wants to close this issue, I suggest writing a short section in the Python documentation that highlights the available options and strategies to maximize performance and lists the drawbacks of each method. Examples:

    • Multiple threads (threading): limited by the GIL
    • Multiple processes (concurrent.futures, multiprocessing, distributed application): limited by shared data
    • Concurrent programming (asyncio): limited to 1 thread

    These architectures are not exclusive. asyncio can use multiple threads and be distributed in multiple processes.

    It would be bad to go too deep into the technical details, but I think that we can describe some advantages and drawbacks which are common to all platforms.
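
    For instance, a minimal sketch of the "asyncio can use multiple threads"
    point: the event loop stays single-threaded while blocking work is pushed
    to a pool via run_in_executor() (the blocking function here is just a
    placeholder):

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    def blocking_io():
        with open("/dev/urandom", "rb") as f:    # stand-in for any blocking call
            return len(f.read(1024))

    async def main():
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor() as pool:
            result = await loop.run_in_executor(pool, blocking_io)
            print(result)

    asyncio.run(main())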

    @dabeaz dabeaz mannequin closed this as completed Jan 2, 2021
    @stuaxo
    Mannequin

    stuaxo mannequin commented Jan 14, 2021

    Catching up on the comments on this, it seems like nobody has enough certainty to say it will work well enough.

    In Linux, the scheduler is pluggable, which lets other non-default schedulers be shipped and tried in the real world.

    • See schedutil: introduced in Linux 4.7 in 2016, made the default for some architectures in 2020, and in 2021 still not the default everywhere.

    Similarly, I think this needs more testing than it will get living here as a bug. If, as in Linux, the scheduler were pluggable, it could be shipped and enabled by real users who were brave enough, and data could be collected.

    @maartenbreddels
    Mannequin

    maartenbreddels mannequin commented Jan 15, 2021

    In case someone finds it useful, I've written a blog post on how to visualize the GIL:
    https://www.maartenbreddels.com/perf/jupyter/python/tracing/gil/2021/01/14/Tracing-the-Python-GIL.html

    In the comments (or at maartenbreddels/fastblog#3 (comment) )
    I looked specifically at the iotest.py example (no solutions though).

    @Sophist-UK
    Mannequin

    Sophist-UK mannequin commented Mar 21, 2022

    Please see also faster-cpython/ideas#328 for a proposal for a simple (much simpler than BFS) GIL scheduler only allocating the GIL between runable O/S threads waiting to have ownership of the GIL, and using the O/S scheduler for scheduling the threads.

    @thedrow
    Mannequin

    thedrow mannequin commented Mar 21, 2022

    I think that we should focus our efforts on removing the GIL, now that we
    have a feasible solution for doing so without breaking anything (hopefully).
    However, the removal of the GIL is still far from complete and will need to
    be rebased onto the latest Python version to be merged.

    This issue would probably hurt Celery, since some users use it with a
    thread pool and it uses a few threads itself, but it seems like fixing it
    is too much effort; so if we were to invest a lot of effort, I'd rather
    focus on removing the problem entirely.

    Instead, I suggest we document this with a warning in the relevant place so
    that people would know to avoid or work around the problem entirely.


    @Sophist-UK
    Mannequin

    Sophist-UK mannequin commented Mar 21, 2022

    I think that we should focus our efforts on removing the GIL, now that we have a feasible solution for doing so without breaking anything

    Is this really a thing? Something that is definitely happening in a reasonable timescale?

    Or are there some big compatibility issues likely to rear up and at best create delays, and at worst derail it completely?

    Can someone give me some links about this please?

    @gvanrossum
    Member

    Start here: https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

    AFAICT the SC hasn't made up their minds about this.

    @Sophist-UK
    Mannequin

    Sophist-UK mannequin commented Mar 21, 2022

    https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

    1. The steering committee hasn't given the go-ahead for this yet, and we have no idea when such a decision will be made nor whether the decision will be yes or no.

    2. Even after the decision is made "Removing the GIL will be a large, multi-year undertaking with increased risk for introducing bugs and regressions."

    3. The promised performance gains are actually small compared with what the existing project that Guido is running hopes to achieve. It is unclear whether the no-gil implementation would affect those gains, and if so, whether a small reduction in the large planned performance gains would more than wipe out the modest gains promised by the no-gil project.

    4. Given that this "Removing the GIL will require making trade-offs and taking on substantial risk", it is easy to see that this project could come across an unacceptable risk or insurmountable technical problem at any point and thus fall by the wayside.

    Taking all of the above points together, I think that there is still merit in considering the pros and cons of a GIL scheduler.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022