Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

currentMillis and targetMills may not be addressing rollover #16

Closed
abqmichael opened this issue Jul 1, 2019 · 6 comments
Closed

currentMillis and targetMills may not be addressing rollover #16

abqmichael opened this issue Jul 1, 2019 · 6 comments

Comments

@abqmichael
Copy link

I have been having a problem with events either getting dropped or my code not handling them properly. (I am using a NodeMCU ESP32 board.) In digging through EspMQTTClient, I identified one corner-case problem in how currentMillis and targetMills are compared to determine whether the event should fire. (This is within EspMQTTClient::loop()).

Currently, these are compared using
unsigned long currentMillis = millis();
...
if (mDelayedExecutionList[i].targetMillis <= currentMillis)

My thought is to imagine that the time returned by millis as it nears 2^32 or 4294967296 (49 days). First I imagined that the time is 2^32 - 1000 and a new delay got set 10000 ms (10 sec) into the future. At this point, the target would be computed as 9000 ms. As targetMillis <= currentMillis, the event would fire right away.

Now imagine the reverse, an event gets set with targetMillis to 2^32 - 100. Then imagine that the user's work took 200 milliseconds to complete so that currentMillis rolled around to a small number. At this point, targetMillis would be a big number while currentMillis would be small and we would be unlikely to see that event get fired for another 49 days.

Sadly, I can't figure my way out of this conundrum. Comparing the numbers isn't right. Subtracting the two times will always return a positive number (unsigned numbers are like that). It might be useful to not generate events that land within some small delta of the epoch. Or it might be just good to document the limiting condition and let the user decide what to do.

@githubbuh
Copy link

abqmichael,

If you are not already, it will be helpful for debugging to log all incoming and outgoing MQTT messages. Be sure to capture the message published by the node during onConnectionEstablished(). It could well be that the MQTT message going to the node is published to the topic, but the node is unable to respond because it is reconnecting. If the onConnectionEstablished MQTT message is logged, it will be possible to determine if that is the case.

Hope this helps.

@abqmichael
Copy link
Author

Thanks for the attention. I did some further testing that is pointing the finger that something is causing millis() to roll over at 2^32/1000 instead of the 2^32. That is, the rollover happens about every two hours. It might be that something inside millis is counting nano-seconds. However, this first test used your code, the libraries it depends upon and some other libraries like time.h and U8g2lib. So it might be that something inside one of these codes is responsible for the resetting.

To that end, I am working on a second test that only uses raw Arduino to minimally test mills. I would be happy to send you either of these codes. Here's how my first test worked:

My first test involved writing a program that would create/schedule three delays (quick, medium, slow) and print a status on the slow event... e.g.,;

void slow_event() {
slow_event_counter++;
print_status();
client.executeDelayed(SLOW_DELAY, slow_event);
}

I also went into EspMQTT and created a method that would let me query the DelayedExecutionList and hooked that up to a method subscribed to an MQTT message (e.g., with client.subscribe("m/twiddle", twiddle);) That method looks like this:

void EspMQTTClient::printDelayedExecutionList() {
if (mDelayedExecutionListSize > 0)
{
unsigned long currentMillis = millis();
Serial.print("EspMQTTClient::printDelayedExecutionList at T = ");
Serial.print(currentMillis);
Serial.print(" (");
Serial.print(currentMillis, HEX);
Serial.print(") callbacks:\n");
for(byte i = 0 ; i < mDelayedExecutionListSize ; i++)
{
unsigned long delay_remaining = mDelayedExecutionList[i].targetMillis - currentMillis;
{
Serial.print("Callback "); Serial.print(i);
Serial.print("\t target = "); Serial.print(mDelayedExecutionList[i].targetMillis);
Serial.print("\t delta = "); Serial.print(delay_remaining);
Serial.print("\t f = "); Serial.print((unsigned long)mDelayedExecutionList[i].callback, HEX);
Serial.print("\n");
// note, function pointer is at mDelayedExecutionList[i].callback;
}
}
} else {
Serial.print("EspMQTTClient::printDelayedExecutionList, no pending callbacks\n");
}
}

My twiddle method looked like this:

void twiddle(const String & payload) {
Serial.print("<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\n");
// log_msg(true, "twiddle '" + payload + " ");
if (payload.equalsIgnoreCase("quick")) {
log_msg(true, "Adding quick event\n");
client.executeDelayed(QUICK_DELAY, quick_event);
} else if (payload.equalsIgnoreCase("medium")) {
log_msg(true, "Adding medium event\n");
client.executeDelayed(MEDIUM_DELAY, medium_event);
} else if (payload.equalsIgnoreCase("slow")) {
log_msg(true, "Adding slow event\n");
client.executeDelayed(SLOW_DELAY + 1, slow_event);
} else if (payload.equalsIgnoreCase("stack")) {
client.printDelayedExecutionList();
} else {
log_msg(true, "Adding slow, medium and quick events\n");
client.executeDelayed(QUICK_DELAY, quick_event);
client.executeDelayed(MEDIUM_DELAY, medium_event);
client.executeDelayed(SLOW_DELAY + 1, slow_event);
}
print_status();
Serial.print("<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\n");
}

When I ran the test, my events played up until time got just short of 4294967. EspMQTT was running fine at that point so I went in with twiddle and saw that my three events were sitting there waiting for time to get someplace a little short of 4294967.

As an aside, it would be nice if EspMQTT had some utilities for managing and querying the callbacks. For example, it would be nice if you could create the callbacks with a name and then be able to query EspMQTT with the horizon time and name of each event in the system.

In developing my code and tests, I found that my code seemingly randomly creates multiple copies of some of my events which EspMQTT politely executes. I only found that by running the printDelayedExecutionList. My thought is that it would be easier to track down the bugs that led to this if I had some way of querying and managing the events.

@abqmichael
Copy link
Author

abqmichael commented Jul 2, 2019

My next test proves that something is wrong with the underlying system. The code below shows that time gets reset to zero some time around 4294961.

I am using the Arduino development environment on a NodeMCU ESP32 board. Per install instructions, I selected the Heltech_WIFI_KIT_32, which I think uses the Espressif IoT Development Framework (I will need to double check that). After about 1 hour 11 minutes, the millis time nears 4294961 and a few seconds after that, it drops to zero.

As this may be a problem on other boards (for at least a while), it would be nice if EspMQTT could give a work-around in the event system. I will try a few things and report back.

#define CHECK_IN_DELAY 100000
#define LOOP_DELAY 20
unsigned long last_loop_time = 0;
unsigned long next_checkin = 10000;
unsigned long greatest_time_seen = 0;

void setup() {
// put your setup code here, to run once:
Serial.begin(115200); // intentionally slower than upload speed
delay(1000);
Serial.print("\nProgram to detect millis rollover started, millis = ");
last_loop_time = millis(); // Initialize at setup time
Serial.println(last_loop_time);
Serial.print("<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>\n\n");
}

void print_status(unsigned long this_loop_time) {
Serial.print("T[0] = ");
Serial.print(this_loop_time);
Serial.print(",\t T[-1] = ");
Serial.print(last_loop_time);
Serial.print(",\t T[0] - T[-1] = ");
Serial.print(this_loop_time - last_loop_time);
Serial.print(", \t T[max] = ");
Serial.println(greatest_time_seen);
}

// to 14324 Next check in 24324
// Resetting greatest time seen from 14335
// T = 14338, Last loop =14335, greatest time seen = 14335

void loop() {
// put your main code here, to run repeatedly:
unsigned long this_loop_time = millis();
int time_step_size = this_loop_time - last_loop_time;
if (this_loop_time >= greatest_time_seen) {
greatest_time_seen = this_loop_time;
} else {
// Time just got less than it was on the last loop.
Serial.print("Resetting T[max] from ");
Serial.print(greatest_time_seen);
greatest_time_seen = this_loop_time;
Serial.print(" to ");
Serial.print(greatest_time_seen);
next_checkin = this_loop_time + CHECK_IN_DELAY;
Serial.print(". Next check in ");
Serial.println(next_checkin);
print_status(this_loop_time);
}
if (last_loop_time > this_loop_time) {
Serial.print("Time warp warning: time went backwards from ");
Serial.print(last_loop_time); Serial.print(" to "); Serial.println(this_loop_time);
next_checkin = this_loop_time + CHECK_IN_DELAY;
} else if (time_step_size > 200) {
Serial.print("Slow processor warning: last loop took ");
Serial.print(time_step_size); Serial.print(" millis time now "); Serial.println(this_loop_time);
}
if (this_loop_time >= next_checkin) {
print_status(this_loop_time);
delay(CHECK_IN_DELAY / 2000); // giving the processor a break
next_checkin = this_loop_time + CHECK_IN_DELAY;
}
delay(LOOP_DELAY);
last_loop_time = this_loop_time; // could use millis() here but then I'll have a bigger drift on events
}

@abqmichael
Copy link
Author

abqmichael commented Jul 3, 2019

I found a bug report from February, 2019, that identifies the problem as being inside espressif. The bug report is titled:

unsigned long millis() timer does not wrap / ignores top 10 bits / causes regression #2455
and
Correct millis() error causes by micros() overflow every about 72 minutes #2438

The bug is in cores/esp32/esp32-hal-misc.c

unsigned long IRAM_ATTR millis()
{
return (unsigned long) (micros() / 1000);
}

One commenter recommends it should return 1000ULL and the author says he implemented that. However, it doesn't appear that the fix has made it into a release.

REF: espressif/arduino-esp32#2455

@abqmichael
Copy link
Author

Final note. I did a fresh install of the ESP32 board library and it looks like the fix is in version 1.02.

My problem, apparently, was that though Arduino thought it had the newest library, my install was pointing to an old one. So I uninstalled the ESP32 board library and deleted a subdirectory with the driver code from my Arduino directory. Then did a fresh install and re-ran the count up test.

@plapointe6
Copy link
Owner

Thanks for the workaround @abqmichael

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants