Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Stack] Increase stack to 5k and reduce stack allocations in rules #1858

Merged
merged 3 commits into from
Oct 5, 2018

Conversation

TD-er
Copy link
Member

@TD-er TD-er commented Oct 4, 2018

Removed some unneeded stack allocations of 240 bytes in the rules processing.

Also gave the process_system_event_queue() a bit more priority to make sure it will be looked at at least once every 500 msec.

Removed some unneeded stack allocations of 240 bytes in the rules processing.

Also gave the `process_system_event_queue()` a bit more priority to make sure it will be looked at at least once every 500 msec.
@TD-er TD-er added the Category: Stabiliy Things that work, but not as long as desired label Oct 4, 2018
@uzi18
Copy link
Contributor

uzi18 commented Oct 5, 2018

why not just drop -DNO_EXTRA_4K_HEAP?

@TD-er
Copy link
Member Author

TD-er commented Oct 5, 2018

For core 2.4.1 it doesn't do anything.
For 2.4.2 it may prevent other issues.
But since it is also documented in comments, it could go.

@clumsy-stefan
Copy link
Contributor

clumsy-stefan commented Oct 5, 2018

just a question for my understanding: if I add -DNO_EXTRA_4K_HEAP as compile-flags in Arduino Framework, my nodes go into boot-loops. does this change something in the settings structure/place or similar? Do I need to start with a "clean" node if I change this option (eg. zero all setings/memory before flashing)?

after removing this option and compiling/flashing again, the nodes worked as before... btw: I use latest esp git core normally....

@TD-er
Copy link
Member Author

TD-er commented Oct 5, 2018

@clumsy-stefan
If you're using the latest Arduino core libs (based on core 2.4.2), then this define does move the position of the Arduino stack.
They gained 4k of RAM by letting the Arduino stack overlap the system stack a little.
Since the cause of the reboots/crashes is still unknown, it was one thing to try at the expense of 4k of free RAM.

You should no go into boot loops. But you may need to recompile everything (make clean)

@clumsy-stefan
Copy link
Contributor

ok, thanks.

I did recompile (incl. the lwip2 libs) and did get these loops... I'll investigate a bit further at the weekend why this happens... it seems to happen just when it tries to connect to the WiFi as the "startup" rule gets processed (plays a sound on the speaker)

@TD-er
Copy link
Member Author

TD-er commented Oct 5, 2018

Hmm, reboots like that (wifi related) also happened a while ago when the 'nightly builds' resulted in faulty compiles.
Remember those WiFi struggles in April/May?
I heard from the Tasmota guys that they also experience issues between Windows/Linux builds.
If you had run nightly builds before, you may want to reset to factory defaults just to make sure the stored settings do not differ between builds.

@clumsy-stefan
Copy link
Contributor

I always ran my own builds, as I need some plugins that are not in the standard build. I compile it on a Mac in the Arduino framework. but as said, it just happened when I addeded -DNO_EXTRA_4K_HEAP -DCONT_STACKSIZE=5120
Actually I just realize I can't say if it was the first or the second define which caused it as I added them at the same time and as it failed immediatley removed both again... so I'll try again with them individually... and have it on the serial console to see what actually happens...

Core is now set to 2.4.1 again, so this flag is not needed anymore.
Also not sure if it did fix something in 2.4.2 core.
@TD-er TD-er merged commit d9e6826 into letscontrolit:mega Oct 5, 2018
@TD-er TD-er deleted the bugfix/increase_stack_5k branch October 5, 2018 22:57
@clumsy-stefan
Copy link
Contributor

@TD-er: if I add -DCONT_STACKSIZE=5120 to the Arduino build environment and compile it I run into bootloops (serial log attached, the wdt reset happens all at the end, before it connects to the WiFi...)

Not sure what's different between the Arduino Framework and the plattformio build.... but for sure this define kills my builds....

any idea where I could/should look?

ets Jan 8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v14808c9a
@cp:0
ld
⸮Urules1.txt 1
rules2.txt 0
rules3.txt 0
rules4.txt 0

RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 101: timerset,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
mode : sta(60:01:94:51:ef:e0)
add if0
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 110: system#boot do
12071 : WIFI : Set WiFi to STA
12103 : WIFI : Connecting clumsy_ap2 attRuleDebug: 111: rtttl,15:d=8,o=5,b=150:8f
empt #0
12114 : EVENT: System#Boot
12122 : ACT : rtttl,15:d=8,o=5,b=150:8f
12334 : SW : rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 111: timerset,2,60
12338 : ACT : timerSet,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 110: system#boot do
RuleDebug: 111: rtttl,15:d=8,o=5,b=150:8f
12362 : ACT : rtttl,15:d=8,o=5,b=150:8f
12571 : SW : rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
12588 : Command: timerset
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
12592 : SYS : 31.00
12595 : EVENT: sysinfoRuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
#RuleDebug: 101: timerset,2,60
rRuleDebug: 000: endon
sRuleDebug: 001: endon
sRuleDebug: 100: system#sleep do
iRuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
=RuleDebug: 000: endon
3RuleDebug: 100: system#boot do
1.RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
0RuleDebug: 000: endon
0RuleDebug: 001: endon

RuleDebug: 100: system#sleep do

RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
12753 : SYS : 19512.00
12755 : EVENT: sysinfo#mem=19512.00
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 101: timerset,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
12911 : SYS : 100.00
12914 : EVENT: sysinfo#load=100.00
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 101: timerset,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
13070 : SYS : 0.00
13073 : EVENT: sysinfo#uptime=0.00
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 101: timerset,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
13324 : CCS811 : No values found.
RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
13347 : ADC : Analog value: 73 = 7.136
13350 : EVENT: air#Air=7.14
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 101: timerset,2,60
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
RuleDebug: 100: system#boot do
RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
RuleDebug: 000: endon
RuleDebug: 001: endon
RuleDebug: 100: system#sleep do
RuleDebug: 101: rtttl,15:d=4,o=5,b=125:c.,c,8c,c.,d#,8d,d,8c,c,8c,2c.
RuleDebug: 000: endon
13539 : BMx280 : Detected BME280
13921 : WD : Uptime 0 ConnectFailures 0 FreeMem 18600
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 17
cnt

ets Jan 8 2013,rst cause:4, boot mode:(3,7)

wdt reset
load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v9bc8ea1b
~ld
⸮Urules1.txt 1
rules2.txt 0
rules3.txt 0
rules4.txt 0

RuleDebug Processing:rules1.txt
flags CMI parse output:
RuleDebug: 100: system#boot do
89 :

INIT : Booting version: SMY_2.07 Oct 6 2018 08:25:31 (ESP82xx Core 9bc8ea1b, NONOS SDK 2.RuleDebug: 101: rtttl,15:d=8,o=5,b=150:8f
2RuleDebug: 101: timerset,2,60
.RuleDebug: 000: endon
1RuleDebug: 001: endon
(cfd48f3), LWIRuleDebug: 100: system#sleep do
P: 2.0.3)

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

tested with my oscillating rules:

Free Mem: | 13168 (11824 - sendContentBlocking) <<- draining at 1k/s
Free Stack: | 2496 (1268 - sendContentBlocking) <<- stable

If you disable the rules processing immediately before it runs out of memory, it will fully recover.

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

@TD-er I do also have trouble with a larger stack and the extra4kHeap. Could it be that the larger cont stack does overlaps too much with the sys stack when you define extra4kheap ?

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

@TD-er sorry to spam you, but I just got an idea. The last time I have seen a memory drain that recovers (see 2 posts above) was when the ESP got wifi packets that no one read. The packets are stored by the core/lwip until it crashes. The core guys consider this a feature, not a bug. Its a bug on our side.
Only question is: are these MQTT, udp or webserver packets. My guess: its not webserver as the web interface was accessilble during the memory drain.

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

Just to make sure we're talking about the same thing here.
The define is NO_EXTRA_4K_HEAP and what this flag does (on core 2.4.2 and up) is make sure the location of the Arduino stack does not overlap with the system stack.
So it is the same behavior as before and switching this define does not do anything on older core libraries.

For core 2.4.2 and up, not adding this define does yield 4k extra RAM, but it lowers the location of the Arduino stack and overlaps the system stack.
This may be dangerous when lots of things are allocated on the system stack. We actually allocate some things on that 'sys' stack. Everything in ESPEasy-globals.h that will end up on the stack, is allocated on the 'sys' stack.
So that's why I did add it to the PlatformIO.ini config.
Since that config no longer uses 2.4.2, but now 2.4.1, it is no longer needed and thus I removed it from the ini file.

The 'cont' stack does grow when defining CONT_STACKSIZE=..., but that will not overlap (extra) with the 'sys' stack, since that overlap (if present) is fixed. It will use memory that is no longer available for the heap.

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

@s0170071 Do you have some links on how to detect if there is data in the LWIP buffers and how to empty them?

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

I think you just have to call the various handle functions often enough....
Maybe rules should be scheduled e.g only once every msec and terminate after 500us latest.

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

And what if there are packages sent to the node for which no handle function is set?

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

Then it would not recover if I stop the rules ;-)

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

And what calls do you have in the rules that read from the LWIP buffers?

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

None.

@uzi18
Copy link
Contributor

uzi18 commented Oct 6, 2018

@TD-er NO_EXTRA_4K_HEAP also enables WPS support

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

@uzi18 That's good to know, and also a very big "why would anyone ever do that???"

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

@TD-er

found a bug: static byte nestingLevel = 0; If you initialize it again and again each time you come by, what good is a check of that byte two lines later... ?

String rulesProcessingFile(String fileName, String& event)
{
  checkRAM(F("rulesProcessingFile"));
  if(ESP.getFreeHeap() <10000) return String();
  if (Settings.SerialLogLevel == LOG_LEVEL_DEBUG_DEV){
    Serial.print(F("RuleDebug Processing:"));
    Serial.println(fileName);
    Serial.println(F("     flags CMI  parse output:"));
     }

  static byte nestingLevel = 0;
  int data = 0;
  String log = "";

  nestingLevel++;
  if (nestingLevel > RULES_MAX_NESTING_LEVEL)

edit: it is in misc.ino

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

A static variable will only be initialized once.
Also it will be the same variable for all instances. (e.g. when it is being used in a plugin)

So this is not a bug.

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

@TD-er sorry, my bad. You're right.
Found some small change that stops the memory drain for the oscillating rules. If I just process one rules file, the heap memory remains stable at around 13k. Please note the last line of the code:

void rulesProcessing(String& event)
{
  checkRAM(F("rulesProcessing"));
  unsigned long timer = millis();
  if (loglevelActiveFor(LOG_LEVEL_INFO)) {
    String log = F("EVENT: ");
    log += event;
    addLog(LOG_LEVEL_INFO, log);
  }

  for (byte x = 0; x < 1; x++)

To save you some searching, the rules I use are:

on System#Boot do
 gpio,12,0
 event,toggle
endon

on RelayOn do
 gpio,12,1
 event,toggle
endon

 

on RelayOff do
 gpio,12,0
 event,toggle
// gpio,13,1
endon

on toggle do
if [RelayState#State]=1
 event,RelayOff
else
 event,RelayOn
endif
endon

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

There is an f.close() missing at the end of function rulesProcessingFile(String fileName, String& event)

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

@s0170071 That's silly, I looked at the code of FS.h and it seems there is no destructor of the object.
It would make sense to call the close() function on destruction.

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

It is not resolving the issue but f.close seems to delay the inevitable crash a bit.
I am now trying to figure out why just processing one rules file seems to fix it.

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

I am already adding a f.close() to all instances of fs::File
I think it will keep some file open and then the SPIFFS code will keep some buffers active.
So this is for sure a memory leak.

And it may also cause strange corruptions I guess when there remains a lock to a file.

@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

ok, I got a theory, just right now not the time to prove it:
depending on where you look about file.read(), it may return -1. The offending rules files are all empty, so there is a fair chance read actually returns -1 on those files.

Looking at function rulesProcessingFile(String fileName, String& event), int len may become -1 and that may be when things go south...
It may also be that we read more bytes than there are in the file. Depends on what f.read((byte*)buf, RULES_BUFFER_SIZE); really does.

 while (f.available())
  {
    len = f.read((byte*)buf, RULES_BUFFER_SIZE);
    for (int x = 0; x < len; x++) {
      data = buf[x];

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

It looks like this byte buf[RULES_BUFFER_SIZE] is actually what the variable name suggests, a buffer for the file it is reading.
So it reads a block from the file of at most RULES_BUFFER_SIZE and creates a String from it with until it detects a newline.
len is a signed integer, so that's also covered.

Only thing is that this buf is allocated at the stack and remains allocated as long as parseCompleteNonCommentLine is running.
So we may have more than one of these buffers (64 bytes) on the stack, depending on the nesting level.

TD-er added a commit to TD-er/ESPEasy that referenced this pull request Oct 6, 2018
As suggested [here](letscontrolit#1858 (comment)) the previous used flag may also enable WPS.
So I used [this suggestion](esp8266/Arduino#5148 (comment)) to only re-allocate the stack at link time.
@s0170071
Copy link
Contributor

s0170071 commented Oct 6, 2018

@TD-er some simplification in misc.ino, rulesProcessingFile()

    line= f.readStringUntil('\n');

is pretty much the same as:

    len = f.read((byte*)buf, RULES_BUFFER_SIZE);
    for (int x = 0; x < len; x++) {
      data = buf[x];

      SPIFFS_CHECK(data >= 0, fileName.c_str());

      if (data != 10) {
        line += char(data);
      }
      else
      {      // if line complete, parse this rule

and you won't need the buf

furthermore in rulesProcessing, if I set event to "" for the empty rules files, the memory drain is gone.
(see middle line)

    if(activeRuleSets[x]){
      //if (x >0) event="";
      rulesProcessingFile(fileName, event);
    }

so runnig an event to an empty file via rulesProcessingFile(fileName, event); drains memory.
Can't figure out why though. Maybe you see it.

@TD-er
Copy link
Member Author

TD-er commented Oct 6, 2018

Maybe because there isn't a newline at the end of an empty file?
Also I guess it will be a problem when the last line doesn't have a newline at the end.

@s0170071
Copy link
Contributor

s0170071 commented Oct 7, 2018

Just a quick heads-up: The unit from tonight that only processes one rules file is approaching

Uptime: | 0 days 9 hours 12 minutes
Load: | 74.60% (LC=3)
Free Mem: | 11048 (7856 - ruleMatch2)
Free Stack: | 2848 (1104 - rulesProcessingFile2)

rules were running all night.
Core 2.4.2, no extra build flags. (extra heap and standard stack)

disabling rules, no reboot shows:

Uptime: | 0 days 9 hours 18 minutes
Load: | 59.70% (LC=8776)
Free Mem: | 12736 (6704 - sendContentBlocking)
Free Stack: | 3520 (1104 - rulesProcessingFile2)

with reboot:

Uptime: | 0 days 0 hours 0 minutes
Load: | 77.00% (LC=517)
Free Mem: | 14192 (13200 - sendContentBlocking)
Free Stack: | 3424 (1648 - LoadCustomTaskSettings)

@s0170071
Copy link
Contributor

s0170071 commented Oct 7, 2018

Arggghh !, I know why it worked better when only processing rules1.txt.
I somehow ended up with the same rules in all four rules files!

Still, the basic question remains, where did the memory go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Stabiliy Things that work, but not as long as desired
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants