Wishlist: add a configurable timeout between seeing "OB LB" and actual shutdown #321

Closed
dark-penguin opened this issue Sep 16, 2016 · 7 comments · Fixed by #2406
Labels
enhancement Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved upsmon

Comments

@dark-penguin

In many cases, it would be very convenient to have the system not react immediately to an "OB LB" status, but wait a little to make sure it doesn't go away in a couple of seconds. For example:

  • When performing a manual periodic discharge of the battery in order to prevent sulfation and correct the estimated on-battery time (if we had a few seconds to turn the power back on after it goes into the alarm state, then we wouldn't have to go through all the trouble of disabling the whole NUT first);
  • When the estimated on-battery time is less than "alert" time, the whole system shuts down during an automatic periodical UPS self-test. There is a workaround for that, but what if you did not expect the battery to be that bad, or if your UPS miscalculated something? Not to mention specifying a "panic timeout" is simply easier (and more reliable) than doing the procedure of overriding the UPS values described in the manuals.

See this thread in the mailing lists for more info and extremely complicated workarounds that could be solved easily:
http://lists.alioth.debian.org/pipermail/nut-upsuser/2016-September/010260.html

@clepple
Member

clepple commented Sep 21, 2016

For the first case, the ignorelb option can be added to ups.conf: http://networkupstools.org/docs/man/ups.conf.html#_ups_fields (only the driver would need to be restarted - if this is complicated, please request that your distribution simplify that procedure).

I'm not sure I follow the second part. If you can't trust the UPS to get through a self-test without signaling low battery, how can you be sure that there is enough power left to shut down properly in a real power failure? I don't think many UPSes reliably report that a test is in progress, so I doubt we could add generic logic to "ignore LB during a test".

In the current driver architecture, the only state that is carried over between polling cycles is the connection information (to reconnect to USB devices) and whether or not the previous poll worked (for data-stale notification). Adding a timeout, while it sounds simple, would require adding an extra history layer to drivers to keep track of when the LB flag was last seen, and to handle all of the possible transitions. As someone who expects the UPS to provide a working LB signal, I would prefer that any such changes happen outside of the driver and upsmon.

There has been talk of integrating a Lua interpreter into drivers - maybe there is room for another upsmon/upssched hybrid which uses a scripting language to capture the intricacies of situations like these.

@dark-penguin
Author

OK, I understand that it would be hard to implement due to the current driver architecture. More thoughts about this later; first, let me explain other things so that everything else is clear.

For the first case of "periodical training": yes, adding "ignorelb" manually every time would help, but it would be even easier to just stop the driver for the training time or disconnect the interface cable. That's what I'm trying to avoid; if I could just add a 10-second timeout, I wouldn't have to do anything at all - 10 seconds is enough for me to flip the power switch back on.

I don't suggest that we go to extreme complications to implement this, but it would be a very useful feature to have in general; I'll explain with more examples. And I'm not talking about trying to find out whether it's "just a self-test" or something like that. What I'm talking about is:

  1. It is not uncommon at all to have a power loss (OB-state) for only a moment. The possible reasons are:
  • A periodical 5-second long self test
  • A 0.1-second power loss - the UPS goes on battery for a few seconds (those happen very often where I live)
  • A 0.5-second voltage dip - when your refrigerator powers on (those happen very often, too)
  • A kid flipping a power switch or pulling the plug
  • Someone stepping on a switch on a power extender, or accidentally hitting the switch, or accidentally pulling the plug out...
  • You want to plug your computer into another power outlet
  • You want to discharge your UPS to "train" it
    Those are only what came to my mind immediately.
  2. It is also not uncommon at all to have your UPS in a LB-state, sometimes for prolonged periods of time, for no good reason. The possible reasons are:
  • You just bought a new battery, your UPS does not believe it's new yet
  • You have a battery that can hold for only 5 minutes, which is more than enough to do a shutdown, but your UPS does not agree
  • You turn on another computer, load increases for a moment, with this load the remaining time goes down for a moment
  • Poor UPS firmware which does that for no reason (I saw that one on the mailing list recently)
  • Half-discharged UPS which can hold for only 4 minutes (which is still more than enough)
    Again, those are only what came to my mind immediately.
  3. When those two happen at the same time, which is really not as uncommon as we hope, everything shuts down immediately.
  • And I don't mind everything shutting down right after a power loss - I have no reason to maximize my on-battery runtime; but I would like to be sure it's a real power loss, not just something that happened for a moment.
  • And if the power was gone "for a moment", this means it will be back a second later, and my UPS will ignore the "Kill power now!" command.
  • And please remember that the "CS hack" does not work at the moment.
  • Neither does POWEROFF_WAIT, which means I'm really out of options.
  • Adding a simple panic timeout would solve all those problems and potentially other ones.
    When using the UPS, we expect the power to come back shortly after it's gone, and if it's not back after a certain timeout, then we do a shutdown. But there are a lot of cases when power comes back immediately after going into the LB state - sometimes going into this state is even the reason for power coming back. So we do need another timeout to cover that.

So, consider these examples:

  • I've just replaced my batteries. They can hold for 10 minutes.
  • I know that my shutdown only takes one minute. My UPS shuts down when there's only 5 minutes left.
  • After a few months, my batteries can only hold for 5 minutes. I know this is more than enough to do a shutdown, and I don't mind immediate shutdown upon power loss, even if I'm aware of this.
  • But I'm still not aware. While I'm away, my UPS decides to do a routine battery check, or power went out for a moment, or something else from that list happened... And suddenly everything is powered down immediately.
  • Maybe later I will replace the batteries or change the UPS settings to set LB when there's only two minutes left, but that's not really necessary - again, I don't mind immediate shutdown, and 5 minutes is more than enough.
  • I've just replaced my batteries. My UPS still doesn't believe they can hold for more than three minutes.
  • Upon a periodical self-test, momentary power loss, momentary voltage down... everything will shut down immediately.
  • Of course, I can spend a few days training my UPS, or calibrate the batteries (doing which actually killed my last battery pack for some reason), or change the low-battery warning to 1 minute left, and later remember to change it, but it would be so very much easier to simply add a 10-second panic timeout just to see if the power returns before panicking!..

For some cases, it's indeed possible to add the "ignorelb" option, and configure other options and overrides to have NUT set the LB itself. But that's more complex, and in some cases, that wouldn't help. And anyway, shutting down without waiting even for a moment does seem like a hasty decision to me.

So, the question is, how to implement it. I'm not very familiar with the inner details of NUT, but based on what I see... We have FINALDELAY and HOSTSYNC. What I expected when I read the manual was:

  • FINALDELAY: wait for this long after sending NOTIFY_SHUTDOWN to warn the users, then execute SHUTDOWNCMD. Of course, if the power is back, send the users an apology for the trouble ("Shutdown canceled - power is back") and cancel the shutdown.
  • HOSTSYNC (on slaves): after seeing "OB LB", don't do anything until the master sends you the "Okay, now start the shutdown" command. If he doesn't after this long - and if the power is not back, of course - then shut down anyway.

So, when the UPS goes into "OB LB" state, slaves see it, but don't react without a command from the master (unless the command never arrives). The master tells everyone to get ready for shutdown, then waits a little to see if the power comes back a few seconds later, and only then sends the "OK, shutdown now!" command. Then the master waits for the slaves to shut down, and starts its own shutdown procedure.

This way, after the shutdown command has been sent, there is no way back; but before the command is sent, after waiting for FINALDELAY - it's still not too late to cancel the shutdown!

Would this be possible to implement somehow? If it changes too much in the established shutdown order, this may very well be optional, toggled by a special parameter in nut.conf or something.

(POWEROFF_WAIT: In case you've missed the Debian bug report, here it is: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=835634 )
(By the way, "/sbin/upsmon -K" doesn't work either - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=835555 , but the Debian bugs policy is to not post bugs upstream if they are already posted in Debian, so I didn't copy it...)

@jimklimov jimklimov added this to the 2.8.2 milestone Jun 13, 2023
@jimklimov jimklimov added the Shutdowns and overrides and battery level triggers Issues and PRs about system shutdown, especially if battery charge/runtime remaining is involved label Aug 29, 2023
@jimklimov jimklimov modified the milestones: 2.8.2, 2.8.3 Apr 5, 2024
@jimklimov
Member

jimklimov commented Apr 5, 2024

I've finally got to reading through this thread and referenced Debian bugs. Some of this seems still relevant, but not all (after the years of changes).

Regarding POWEROFF_WAIT and systemd, this should have got fixed by nutshutdown scripting changes included in NUT v2.8.1 and later releases. Generally note that it is systems-dependent; for example Solaris 10+/illumos SMF core imposes hard timeouts to kill everything and halt the system when told to (maybe making this power-race-avoidance logic into a kernel driver that would block and reboot could be a solution).

Checking the concerns about upsmon -K not working, I found that the POWERDOWNFLAG value must be set in upsmon.conf: there is no compiled-in default, but this bit of info is not really exposed (PR pending now). With the file existing and containing the magic string, upsmon -K currently (checked with NUT master after v2.8.2) does return exit code 0, so shell chaining with && works.

A timer for "OB LB" delay might indeed be an option - I suppose for short-lived glitches we could use a similar mechanism to what was recently introduced to avoid shutdowns during calibrations etc. when the UPS reports cycling different states - sometimes hovering in bogus limbo for a few seconds. This would probably also be tied to the number of POLLFREQ(ALERT) cycles - e.g. "ignore the state for X cycles in a row, issue FSD if not cleared by then". Now there's precedent for something like this in NUT v2.8.2 (maybe 2.8.1 already); need to check if this particular use-case was not actually addressed by now, e.g.:

CC @desertwitch for a "second opinion" :)

jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…OWERDOWNFLAG must be configured (no compiled-in default in upsmon) [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…LAG setting [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…lue or absence of explicit POWERDOWNFLAG setting [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…imal functional content of upsmon.conf [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…t lookups into POWERDOWNFLAG file; note recommended locations [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
…ols#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
jimklimov added a commit to jimklimov/nut that referenced this issue Apr 5, 2024
Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
@jimklimov
Member

For a practical pointer, in the is_ups_critical() method the spot to extend would be just after the ST_CAL check, with throttling logic and data/state storage similar to pollfail_log_throttle_max and ups->pollfail_log_throttle_count implementation (peppered around the source):

nut/clients/upsmon.c

Lines 1111 to 1136 in 60f76bf

```c
	/* not OB or not LB = not critical yet */
	if ((!flag_isset(ups->status, ST_ONBATT))
	 || (!flag_isset(ups->status, ST_LOWBATT))
	)
		return 0;

	/* must be OB+LB now */

	/* if UPS is calibrating, don't declare it critical */
	/* FIXME: Consider UPSes where we can know if they have other power
	 * circuits (bypass, etc.) and whether those do currently provide
	 * wall power to the host - and that we do not have both calibration
	 * and a real outage, when we still should shut down right now.
	 */
	if (flag_isset(ups->status, ST_CAL)) {
		upslogx(LOG_WARNING, "%s: seems that UPS [%s] is OB+LB now, but "
			"it is also calibrating - not declaring a critical state",
			__func__, ups->upsname);
		return 0;
	}

	/* if we're a primary, declare it critical so we set FSD on it */
	if (flag_isset(ups->status, ST_PRIMARY))
		return 1;

	/* must be a secondary now */
```

@desertwitch
Contributor

Thanks for the CC, commits look good and reasonable. I will read more and report back tomorrow, my (mental) bandwidth is a bit limited at the weekends. 😉

@desertwitch
Contributor

desertwitch commented Apr 8, 2024

So I've just given this some thought and I think the only thing that hasn't been addressed yet is the OB LB switch to ignore the condition for a configurable time and only trigger FSD when that time has elapsed. I do for the most part agree with Charles about OB LB being a bit of a critical status to ignore, but given the recent anomalies with some UPSes reporting it as part of calibration cycles I think it would be fair to offer it as a non-default option for cases where "ignorelb" would be too much. I'd keep default behavior to instant FSD upon encountering the status OB LB just to be safe, but allow users to modify this behavior accordingly with a similar setting as we've implemented for the intermittent OFF states we've seen (OFFDURATION). Either a configurable time or a number of encounters of the status, although I think a configurable time would be easier to approximate when factoring in the individual battery condition, seeing as most people probably know about how much load their batteries can still hold in minutes. So probably easier to think in time here rather than cycles - would also match better with OFFDURATION. Perhaps LBDURATION? Anyhow, the place recommended by Jim seems perfect for this.
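
A time-based variant along these lines might look like the following sketch. Everything in it is an assumption made for illustration: the `X_*` bits stand in for the real `ST_*` flags, `x_ups_t` and `x_is_critical` are hypothetical, and `lb_duration` merely mirrors the "LBDURATION" name floated above, not a confirmed setting. A non-positive duration keeps the default of declaring the UPS critical immediately.

```c
#include <assert.h>
#include <time.h>

/* Stand-in status bits; the real ST_* flags are defined by upsmon. */
#define X_ONBATT  (1 << 0)
#define X_LOWBATT (1 << 1)
#define X_CAL     (1 << 2)

/* Hypothetical, trimmed-down per-UPS record. */
typedef struct {
	int    status;
	time_t oblb_since;   /* start of the current OB+LB episode, 0 if none */
} x_ups_t;

/* Returns 1 if the UPS should be treated as critical, 0 otherwise.
 * lb_duration <= 0: default behaviour, critical as soon as OB+LB is seen.
 * lb_duration  > 0: critical only after OB+LB persisted that many seconds.
 */
static int x_is_critical(x_ups_t *ups, time_t now, int lb_duration)
{
	assert(ups != NULL);

	/* not OB or not LB = not critical; forget any running episode */
	if (!(ups->status & X_ONBATT) || !(ups->status & X_LOWBATT)) {
		ups->oblb_since = 0;
		return 0;
	}

	/* calibrating UPS is not declared critical */
	if (ups->status & X_CAL)
		return 0;

	/* delay disabled: keep the instant reaction */
	if (lb_duration <= 0)
		return 1;

	/* first poll of this OB+LB episode: remember when it started */
	if (ups->oblb_since == 0)
		ups->oblb_since = now;

	return difftime(now, ups->oblb_since) >= lb_duration;
}
```

Resetting `oblb_since` whenever OB+LB clears means a brief glitch never accumulates toward the limit; only an uninterrupted episode can reach it.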

jimklimov added a commit to jimklimov/nut that referenced this issue Apr 16, 2024
@jimklimov
Member

Posted the remaining PR for the "payload" of this wish - testing/review would be welcome :)

jimklimov added a commit to jimklimov/nut that referenced this issue Apr 17, 2024
…DOWNFLAG setting [networkupstools#321]

Signed-off-by: Jim Klimov <jimklimov+nut@gmail.com>
4 participants