
Re-evaluation of pystack #7396

Closed
wants to merge 17 commits into from

Conversation

bill88t

@bill88t bill88t commented Dec 28, 2022

Currently all boards, excluding espressif/TG-Watch, have the exact same pystack size, defined by:
py/circuitpy_mpconfig.h:453: #define CIRCUITPY_PYSTACK_SIZE 1536

However, at least for me, this is a bottleneck.
My project has, at this point, been optimized as much as it can be,
but I am still hitting the pystack limit with some of the latest changes.
(RuntimeError: pystack exhausted)

This PR intends to find a better value than the global 1536 bytes that is currently applied.
As an initial test, the value is increased by 512 bytes.
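A minimal way to see where the limit bites is to recurse until the interpreter gives up (a sketch, not part of this PR; on CircuitPython the failure is a RuntimeError such as "pystack exhausted", while CPython raises RecursionError, which is a RuntimeError subclass):

```python
def probe_depth(depth=0):
    # Recurse until the interpreter raises, then report how deep we got.
    try:
        return probe_depth(depth + 1)
    except RuntimeError:  # covers CircuitPython's pystack error and CPython's RecursionError
        return depth

print(probe_depth())  # prints the maximum reachable call depth
```

Running this before and after a pystack change gives a quick, board-independent measure of the headroom gained.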

@anecdata
Member

anecdata commented Dec 28, 2022

Are you testing different values on different ports / boards? I suspect this should be by port, if not by board (existence or lack of PSRAM could be a complicating factor). 512 is a lot out of a SAMD21, but likely to be inconsequential on an espressif board.

@bill88t
Author

bill88t commented Dec 28, 2022

Initially I added 512 bytes globally to test which boards will outright fail or end up with critically low RAM.
After evaluating the Actions results, I will do per-port/per-board modifications.
I plan to massively increase the value on PSRAM boards.

@anecdata
Member

anecdata commented Dec 28, 2022

Sounds good. It is a balancing act even with PSRAM, since it trades off with ESP-IDF memory (sockets and WiFi, AP connections, WiFi monitor, etc.). tannewt or others could speak better about leaving some headroom for future BLE features. There is the option in settings.toml to allocate some PSRAM to ESP-IDF memory, so that helps some (at the cost of speed, as you noted on Discord).

@RetiredWizard

For whatever it's worth.... My usage is probably a bit atypical and quickly stresses the stack storage on all the board builds. I've found it easiest to simply disable stack usage in circuitpython by setting MICROPY_ENABLE_PYSTACK to (0) and MICROPY_STACKLESS to (1). There's probably a downside to running this way but at least for my usage I haven't found any issues yet.
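For reference, the two switches RetiredWizard mentions are standard MicroPython build options that a port can override, typically in its mpconfigport.h. A sketch of the override (not a tested configuration):

```c
/* Sketch only -- place in the port's mpconfigport.h before building. */
#define MICROPY_ENABLE_PYSTACK (0)  /* drop the fixed-size pystack entirely */
#define MICROPY_STACKLESS      (1)  /* allocate Python call frames on the GC heap */
```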

@bill88t
Author

bill88t commented Dec 28, 2022

This is quite significant news.
This could perhaps be the way to go for a lot of ports if there are no side effects.

@anecdata
Member

Stackless issue:
#3362

@RetiredWizard

For anyone who wants to optimize memory/stack size, I stole this bit of code from a MicroPython (I think) issue posted a year or so ago:

def calcRecursionDepth(wldCLen, recursiveFail):
    # Recurse until the VM raises (e.g. "pystack exhausted" or
    # "maximum recursion depth exceeded"), counting how deep we got.
    wldCLen += 1
    if not recursiveFail and wldCLen < 90:
        try:
            (wldCLen, recursiveFail) = calcRecursionDepth(wldCLen, recursiveFail)
        except RuntimeError:  # a bare except would also swallow KeyboardInterrupt
            recursiveFail = True

    return (wldCLen, recursiveFail)

w = 0
rf = False
print(calcRecursionDepth(w,rf))

Interestingly, if you define this function within another function, the amount of recursion you can achieve is reduced by much more than I would have expected. I suppose that indicates that function definitions consume a fair amount of stack space.

@bill88t
Author

bill88t commented Dec 28, 2022

Tomorrow, along with the resizing tests, I will create stackless builds and evaluate their performance.
If anything, my project is the perfect test for this:
it uses large amounts of both RAM and stack.

@bill88t
Author

bill88t commented Dec 29, 2022

Now all the boards with MBs of RAM are set to 5 KB of pystack. There is no reason not to.
nrf got a boost of 512 B, and rp2 got a boost of 1 KB.

The rp2 board with the least available (usable) RAM is the Pico W, and it still has 10 KB more than the nRF52840.
nrf, on the other hand, has very little to begin with, so I only increased it by 512 B.

I will now build a stackless build for picow and test my ljinux + webserver + ssh to see how it performs.

@bill88t
Author

bill88t commented Dec 29, 2022

After further testing, I have determined stackless is the way to go.
There is no reason to further limit usable heap when stackless is an option.

@bill88t bill88t changed the title Re-evaluation of pystack size Re-evaluation of pystack Dec 29, 2022
@dhalbert dhalbert marked this pull request as draft December 29, 2022 17:19
@anecdata
Member

After further testing, I have determined stackless is the way to go.
There is no reason to further limit usable heap when stackless is an option.

Could you please elaborate: what are the pros and cons of pystack vs. stackless for different application types? Defining stackless would help too, along with how it's affected by other defines in the build.

@bill88t
Author

bill88t commented Dec 29, 2022

I just commented all my findings in #3362.

@bill88t
Author

bill88t commented Jan 3, 2023

The tested* stackless ports are now in this PR.

*For esp I only tested the S2; I will also test the C3, but I cannot test the rest.

@dhalbert dhalbert modified the milestones: 8.x.x, 9.0.0 Jan 3, 2023
@RetiredWizard

RetiredWizard commented Jan 28, 2023

I have also been running stackless on the same chips bill88t listed for more than a year. In addition, I've been using ESP32, ESP32-Pico, ESP32-C3, ESP32-S3, MIMXRT10XX, SAMD51, BROADCOM and STM32L4 boards without any issues, although I have not been running any tests designed to look specifically for memory/stack errors. My use typically stresses RAM usage and flash I/O. The peripherals I tend to use are I2C, SPI, simple PWM GPIO, the RTC library, and WiFi if available.

@bill88t
Author

bill88t commented Jan 29, 2023

I too have now tested the C3 with stackless; all good.
I would have liked to also test the old ESP32, but I don't have any, so I will take RetiredWizard's word that it works.
I will publish my memory-testing utils later.

@bill88t
Author

bill88t commented Feb 7, 2023

I have taken a quick read through some source files, and py/scope.h seems a good candidate for the recursion limit.
However, I still have no idea how to implement it.

@tannewt tannewt requested a review from jepler February 7, 2023 17:06
@gneverov

gneverov commented Feb 7, 2023

I think that moving from "pystack" to "stackless" would be a step backwards. "Stackless" allocates a stack frame on the heap at every function call, which is not the most efficient way to implement a call stack. Besides the overhead introduced on every function call, another cost is the emergent impact it has on the heap: it will cause the GC to run more often and produce a more fragmented heap.

I think a better solution would be to make the size of the "pystack" configurable at boot time. This way the few people who need a larger stack can get it without impacting the performance of most people who don't.
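For what it's worth, CircuitPython 8 already exposes settings.toml values to Python via os.getenv, so a boot-time pystack size could follow the same pattern. A sketch of the lookup logic only (the key name CIRCUITPY_PYSTACK_SIZE is an assumption for illustration, not something this PR defines, and the real sizing would have to happen in C before the heap is carved up):

```python
import os

# Hypothetical settings.toml key -- the name is an assumption, not part of this PR.
raw = os.getenv("CIRCUITPY_PYSTACK_SIZE")
pystack_size = int(raw) if raw else 1536  # fall back to the current default

print(pystack_size)
```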

@bill88t
Author

bill88t commented Feb 7, 2023

Configuring the pystack size with something like settings.toml would share most of the benefits of stackless.
Users could lower it for more RAM, or raise it for deep recursion and heavily nested functions.

However, I have to say I have not noticed any slowdowns due to stackless (and my code is absolute bloat).
Thinking about it: with an original pystack of 1.5 KB, one frame should be a couple dozen bytes, so what are the odds of that frame failing to find space and forcing a GC? You are basically OOM at that point anyway.
Please correct me if I am wrong about this.

It would be more of an issue on boards that are barely-able-to-walk-let-alone-run, but this change is currently proposed for boards that really run.
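A rough back-of-envelope check of the frame-size claim above (the 64-byte frame size is purely an assumption for illustration; real frame sizes vary with argument count and locals):

```python
PYSTACK_SIZE = 1536     # the current global default, in bytes
FRAME_SIZE_GUESS = 64   # assumption: rough size of one small call frame

# Rough maximum nesting depth the default pystack would allow:
print(PYSTACK_SIZE // FRAME_SIZE_GUESS)  # → 24
```

Even under this crude estimate, a couple dozen nested calls is easy to exceed with recursion or deeply layered libraries.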

I would like to hear a decision from the Adafruit devs on the matter, because once I finish my exams I want to implement sys.setrecursionlimit to limit the chaos. (If anyone wants it done and merged in the meantime, feel free to PR against this PR or push here; perms granted.)
If we are going to do a configurable pystack, we should close this PR and move on.
I will gladly help either way.

EDIT: Thought about it some more.

@anecdata
Member

anecdata commented Feb 7, 2023

GC may be less of an issue than fragmentation; consider nested functions that do a lot of heavy lifting, so the small stack frames get dispersed through the heap.

I'm a big fan of runtime configuration of things like memory, not sure if stackless could be a runtime thing like PYSTACK = 0 or STACKLESS = True overriding the PYSTACK value.

@dhalbert
Collaborator

dhalbert commented Feb 7, 2023

The standard heap block size is 16 bytes.

Some discussion of performance hit of STACKLESS and PYSTACK here (and above and below in that issue): micropython#6087 (comment)

@bill88t
Author

bill88t commented Feb 10, 2023

Warning: my knowledge of C and board internals is severely limited, so take the following data with a large chunk of salt.

It seems pystack cannot be configured at runtime, in its current form at least.

Starting from main.c:125:

#if MICROPY_ENABLE_PYSTACK
static size_t PLACE_IN_DTCM_BSS(_pystack[CIRCUITPY_PYSTACK_SIZE / sizeof(size_t)]);
#endif

This is the first reference to _pystack; this symbol is not present in any other files (ctags were used).
Wikipedia says bss is 'block starting symbol'.
We will come back to this.

Next we go to main.c:162:

#if MICROPY_ENABLE_PYSTACK
mp_pystack_init(_pystack, _pystack + (sizeof(_pystack) / sizeof(size_t)));
#endif

So it seems to init the stack here. So this means the allocation is here, right? No.

Using ctags we see that mp_pystack_init is in py/pystack.c:33:

void mp_pystack_init(void *start, void *end) {
    MP_STATE_THREAD(pystack_start) = start;
    MP_STATE_THREAD(pystack_end) = end;
    MP_STATE_THREAD(pystack_cur) = start;
}

This is just defining boundaries, as if this is already allocated.

So this isn't our func for allocation. Let's keep looking.

mp_pystack_alloc on the same file, oh this seems good, let's look where it is use- NO RESULTS? What?
Hmm, let's keep reading main.c..
Hm hmm hmm, init function ends??
Let's keep going with main():
There is only a single ref of stack_init in main() and that is cstack.

So we have to have passed it.
Let's go back to main.c:162:
What is this bss file definition thingy?
Ctags sends us to supervisor/linker.h:34:

#define PLACE_IN_DTCM_BSS(name) name __attribute__((section(".dtcm_bss." #name)))

Is this defining an attribute of sorts?
It would make sense, since _pystack is somehow defined later on, but there is no clear definition..
If it is allocated explicitly through here, we somehow have to get this out of a file-scope definition and into a normal function..

Wait, isn't there an mp_pystack_alloc? Shouldn't we be using this?
And is it safe to take it out of bss and into a global pointer?
Too many questions, too little time.
Without a little help or info, I can't go any further.

My ctags were produced by the following command:

ctags -R --excmd=pattern --exclude=Makefile --exclude="*.js" --exclude="*.py"

@tannewt
Member

tannewt commented Feb 10, 2023

Next we go to main.c:162:

#if MICROPY_ENABLE_PYSTACK
mp_pystack_init(_pystack, _pystack + (sizeof(_pystack) / sizeof(size_t)));
#endif

So it seems to init the stack here. So this means the allocation is here, right? No.

This is what we'd have to change. We'd need to make it use a supervisor allocation instead because that'd be dynamic. The heap is already done this way (except it takes all remaining space.)

@bill88t
Author

bill88t commented Feb 10, 2023

So I have to subtract CIRCUITPY_PYSTACK_SIZE from the heap allocation, make the alloc in this function, and remove all the bss stuff, right?

@gneverov

@bill88t Currently _pystack is a statically allocated array stored as a global. ("bss" is the name of the section of memory that contains zero-initialized global variables. (idk why it is called bss))

You need to change _pystack to be just a pointer and then dynamically allocate the array before mp_pystack_init is called. As @tannewt suggests, it would make the most sense to allocate this memory from the supervisor heap. That will involve a bit of refactoring moving things between main and start_mp and dealing with the #define.

mp_pystack_init initializes already allocated memory for the pystack. mp_pystack_alloc allocates a chunk of memory from the pystack itself. It is used by consumers of the stack. It is not part of the setup of _pystack.

__attribute__ is a compiler directive that is used to give hints to the C compiler. It is not really relevant here.

@gneverov

You will use allocate_memory(uint32_t length, bool high, bool movable) to allocate the pystack. Probably as allocate_memory(pystack_size, false, false), but I haven't double-checked those parameters.

You need to do this before calling allocate_remaining_memory() which allocates the pyheap.

@bill88t
Author

bill88t commented Feb 10, 2023

I have started working on it on another branch: https://github.com/bill88t/circuitpython/tree/settings-toml-pystack
Currently it does not boot at all, and I have no idea why. I have plugged in the debug pins, and with openocd & gdb I see it's stuck in enable_interrupts.

I will switch the PR branch to that once it's somewhat working..

Progress:

  • Allocate dynamically
  • Test if alloc is good.
  • Fetch value from settings.toml
  • Convert value to a multiple of stack frames.

@bill88t
Author

bill88t commented Feb 11, 2023

(gdb) watch _pystack
Hardware watchpoint 6: _pystack
(gdb) r
target halted due to debug-request, current mode: Thread 
xPSR: 0x01000000 pc: 0x00000178 msp: 0x20041f00
[New Thread 2]

Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
bss_fill_test () at sdk/src/rp2_common/pico_standard_link/crt0.S:252
252         bne bss_fill_loop
(gdb) s
bss_fill_loop () at sdk/src/rp2_common/pico_standard_link/crt0.S:249
249         stm r1!, {r0}

It seems the code still expects the alloc to be in bss..

The current crash is in:
sdk/src/rp2_common/pico_runtime/runtime.c:283 cause of panic (fmt=0x100a80a0 "ep %d %s was already available")

@bill88t
Author

bill88t commented Feb 11, 2023

Yeah, this stuff is pure assembly. I am bailing out. The branch is staying on stackless, since that actually works..
My minimal progress on dynamically allocating the pystack is available here once again.
If someone wants to continue it, be my guest.
I will create a fragmentation test to see whether stackless is actually problematic or not.

@bill88t
Author

bill88t commented Feb 11, 2023

Made a script to check how well stackless performs with random allocations and massive recursion combined.
The intention is to make it fail to allocate and hard-crash.

from os import urandom
import board
import digitalio
led = digitalio.DigitalInOut(board.LED)
led.direction = digitalio.Direction.OUTPUT
led.value = 1

# Edit
cap = 1000 # Max possible target level


# Do not edit
limit = 100000 # Stop after this many total level-ups
allocs = 0 # Current allocs
abort = False
failed = False
failed_on = None

def gen_bool() -> bool:
    return int.from_bytes(urandom(1), 'big') > 127

def gen_target() -> int:
    res = gen_int()
    if res < 0:
        res = -res
    while res > cap:
        res -= int.from_bytes(urandom(1), 'big')
    if res < 0:
        res = -res
    return res

def gen_int() -> int:
    res = 0
    for i in range(4):
        d = int.from_bytes(urandom(1), 'big')
        if gen_bool():
            res += d
        else:
            res -= d
        del d
    return res

def gen_small_int() -> int:
    res = int.from_bytes(urandom(1), 'big')
    res //= 16
    return res

def level(mine=0, target=0, once=False) -> int:
    global levelups, allocs, failed, failed_on, abort
    mystack = list()
    for i in range(gen_small_int()):
        mystack.append(gen_int())
    allocs += len(mystack)
    while True:
        if target == mine or levelups == limit:
            if levelups < limit and not (once or failed):
                target = gen_target()
            else:
                target = 0
        #print(f"Level: {mine} | Target: {target} | Total levelups: {levelups} | Current allocations: {allocs}")
        if target > mine:
            led.value = 1
            levelups += 1
            try:
                target = level(mine+1, target, once)
            except KeyboardInterrupt:
                abort = True
                failed = True
                target = 0
            except:
                failed = True
                failed_on = mine+1
                target = 0
        elif target < mine or target == 0:
            led.value = 0
            break
    allocs -= len(mystack)
    del mystack
    return target

for i in [10, 100, 500, 1000, 2000]: # configure numbers to test here for
    if not (abort or failed):
        print(f"Running test: {i}")
        failed = False
        failed_on = None
        levelups = 0
        level(target=i, once=True)
        if not failed:
            print(f"PASS - {i}")
        else:
            print(f"FAIL - {i}", end="")
            if failed_on is not None:
                print(" - On: " + str(failed_on), end="")
            print()
    else:
        break


if not abort: # this runs until alloc limit reached
    limit = 10000 # Edit
    print(f"Running test: limit, for limit {limit}")
    failed = False
    failed_on = None
    levelups = 0
    level()
    if not failed:
        print("PASS - limit")
    else:
        print("FAIL - limit", end="")
        if failed_on is not None:
            print(" - On: " + str(failed_on), end="")
        print()

if abort:
    print("ABORTED")

led.value = 0

Feel free to try it out.

A Pico with this script can climb to about 1.2k levels before failing with RuntimeError: maximum recursion depth exceeded.
Unlike the previous attempts at stressing it, this one includes allocations.
As a result, we can see stack-frame allocation struggling when little RAM is available.
The GC slowdown right before failing is pretty evident; make sure to uncomment the print to see it.
Note that f-strings with variables take up some memory, so expect a reduced maximum.

Example output:

Running test: 10
PASS - 10
Running test: 100
PASS - 100
Running test: 500
PASS - 500
Running test: 1000
PASS - 1000
Running test: 2000
FAIL - 2000 - On: 1385
Running test: limit, for limit 10000
PASS - limit

@tannewt
Member

tannewt commented Feb 13, 2023

_pystack is the statically allocated version. When switching to the supervisor allocation, you'll want to delete the _pystack variable. It shouldn't be used then.

@bill88t
Author

bill88t commented Feb 13, 2023

Well, I renamed the variable on that branch; it's not like a name change is going to affect the fact it won't bo- HOLDUP.

[screenshot omitted]

Why did it work now?????
It also erased the internal filesystem for some reason..

The raw error text is here in case anyone wants to take a look at it:

assertion "MP_STATE_MEM(gc_pool_start) >= MP_STATE_MEM(gc_finaliser_table_start) + gc_finaliser_table_byte_len" failed: file "../../py/gc.c", line 151, function: gc_init

@tannewt
Member

tannewt commented Feb 14, 2023

allocate_memory doesn't return a pointer to the memory directly. It returns a pointer to a struct that includes info about it. https://github.com/adafruit/circuitpython/blob/main/supervisor/memory.h#L65

I'd probably make it work more like the heap and have it passed around instead of a global.

@bill88t
Author

bill88t commented Feb 15, 2023

I got it.

Adafruit CircuitPython 8.0.0-4-g1f1a495e26-dirty on 2023-02-15; Raspberry Pi Pico with rp2040
>>> import os
>>> a = "exec(a)"
>>> exec(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
  ... (line repeated 46 more times) ...
RuntimeError: pystack exhausted
>>>

@bill88t bill88t mentioned this pull request Feb 15, 2023
@bill88t
Author

bill88t commented Feb 15, 2023

Closing in favor of #7585.

@bill88t bill88t closed this Feb 15, 2023
@bill88t bill88t deleted the pystack branch February 25, 2023 19:16