Disk capacity watchdog sometimes ignores config values #747

vstax · 2017-05-31T13:45:00Z

The default threshold for the disk capacity watchdog is 85 (watchdog.disk.threshold_disk_use = 85). I have system A with 25% free space (75% used):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       246G  173G   61G  75% /mnt/avs

and system B with 83% of its space used:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       246G  193G   41G  83% /mnt/avs

With the default values, the watchdog doesn't trigger for system A (expected) but does trigger for system B (unexpected: the threshold is 85%, but disk use is 83%!):

[W]	storage_1@192.168.3.54	2017-05-31 16:34:45.595124 +0300	1496237685	leo_watchdog_disk:check/4	307[{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202329436},{available,42444940},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]

This is the first problem. The second problem is that the config value is ignored. I set watchdog.disk.threshold_disk_use = 95 for system B but I still see the same watchdog message, which shouldn't happen. I set watchdog.disk.threshold_disk_use = 70 for system A and I do get a watchdog message. [EDIT: I got wrong results here originally, fixed]

In other words, reducing the value works as expected, but increasing it doesn't. Also, there is something fishy about triggering at 83% disk usage with the default set to 85.
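The log line reports use_percentage 84 while df shows 83%. Both figures can be derived from the same block counts, depending on the formula. The Python sketch below works through the arithmetic with the numbers from the log line above. The first formula (used / (used + available), rounded up) is how GNU df is generally understood to compute Use%. The second is only one formula that happens to reproduce 84; whether leo_watchdog computes it this way is an assumption, and both function names are hypothetical.

```python
import math

# Values copied from the watchdog log line for /dev/sdb1 (1K blocks).
BLOCKS = 257897904
USED = 202329436
AVAILABLE = 42444940


def df_style_percent(used: int, available: int) -> int:
    """used / (used + available), rounded up; reserved blocks are excluded."""
    return math.ceil(100 * used / (used + available))


def non_available_percent(blocks: int, available: int) -> int:
    """(blocks - available) / blocks, rounded; reserved blocks count as used."""
    return round(100 * (blocks - available) / blocks)


if __name__ == "__main__":
    print(df_style_percent(USED, AVAILABLE))         # 83, matches the df output
    print(non_available_percent(BLOCKS, AVAILABLE))  # 84, matches use_percentage
```

The gap between the two results comes from the roughly 13 million blocks that are neither used nor available (BLOCKS - USED - AVAILABLE), which is typical of filesystem-reserved space.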


mocchira · 2017-06-01T06:02:35Z

@vstax

The warning-level messages are caused by the hard-coded value (80%) defined at https://github.com/leo-project/leo_watchdog/blob/develop/include/leo_watchdog.hrl#L92, so it seems to work as expected. (Error-level messages should be logged when the actual usage exceeds watchdog.disk.threshold_disk_use.)

However, this spec can confuse many users, as it did you, so we'd like to take another look at whether we really need a hard-coded soft limit (IMHO, there is no need).
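The behavior described here can be modeled as a two-level check. Below is a minimal Python sketch, not the actual leo_watchdog Erlang code: the function name `classify_disk_use`, the constant name `SOFT_LIMIT_PERCENT`, and the exact comparison operators are assumptions for illustration. Only the 80% soft limit and the role of watchdog.disk.threshold_disk_use come from the explanation above.

```python
from typing import Optional

# Hard-coded soft limit (80%) mentioned above; the constant name is hypothetical.
SOFT_LIMIT_PERCENT = 80


def classify_disk_use(use_percentage: int, threshold_disk_use: int = 85) -> Optional[str]:
    """Return "error", "warning", or None for a given disk use percentage."""
    # Error level: usage exceeds the configurable threshold.
    if use_percentage > threshold_disk_use:
        return "error"
    # Warning level: usage is at or above the fixed soft limit (assumed inclusive).
    if use_percentage >= SOFT_LIMIT_PERCENT:
        return "warning"
    return None


if __name__ == "__main__":
    # System B (use_percentage 84): warns with the threshold at 85 and at 95.
    print(classify_disk_use(84, 85), classify_disk_use(84, 95))
    # System A (75% used): silent at 85, flagged once the threshold drops to 70.
    print(classify_disk_use(75, 85), classify_disk_use(75, 70))
```

Under this model, raising the threshold from 85 to 95 cannot silence system B, because the warning comes from the soft limit rather than from the configured value.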

vstax · 2017-09-28T15:04:49Z

@mocchira I'm trying to run gateways on servers that have some other load, and I get this in the logs:

[I]	gateway_k02@k02.selectel.cloud.lan	2017-09-28 15:38:29.690264 +0300	1506602309	null:null	0	["alarm_handler",58,32,"{set,{{disk_almost_full,\"/var/elasticsearch\"},[]}}"]

Sure enough, there is such a filesystem. It's 80% full, which is totally fine since it's set up to stay around 75-82% full all the time:

[root@k02 ~]# LANG=C df /var/elasticsearch/
Filesystem                  1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_def-elastic 1056758196 801589612 201465112  80% /var/elasticsearch

But 1) the gateway shouldn't look there at all! None of the paths in the config files point to that filesystem, and 2) there should be a way to disable it, but there are no disk watchdog-related options in the config file (and according to the schema, the disk watchdog defaults to "disabled" for the gateway).
Is it possible to avoid these false alarms somehow?

mocchira · 2017-09-29T05:15:29Z

WIP

mocchira · 2017-09-29T07:12:52Z

@vstax
It turned out that disksup (http://erlang.org/doc/man/disksup.html), the Erlang built-in module used to retrieve disk usage, dumps those logs once disk usage goes higher than 80% (the default), regardless of our watchdog settings. To disable it, run leo_gateway remote-console and then follow the instructions below:

Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V9.0  (abort with ^G)
(gateway_0@127.0.0.1)1> disksup:get_almost_full_threshold().
80
(gateway_0@127.0.0.1)4> disksup:set_almost_full_threshold(1.0).
(gateway_0@127.0.0.1)5> disksup:get_almost_full_threshold().
100
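The session above suggests that disksup takes the threshold as a fraction and reports it back as an integer percentage. Here is a minimal Python model of that, for illustration only: the function names are hypothetical, and the inclusive comparison is an assumption (the alarm in the gateway log was raised while df reported 80% with the default threshold of 80).

```python
def threshold_to_percent(fraction: float) -> int:
    """Convert a fractional threshold (0.0 to 1.0) to an integer percentage."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return round(fraction * 100)


def is_almost_full(use_percent: int, threshold_percent: int) -> bool:
    """Assumed alarm rule: usage at or above the threshold sets the alarm."""
    return use_percent >= threshold_percent


if __name__ == "__main__":
    default = threshold_to_percent(0.8)  # 80, as get_almost_full_threshold() returned
    raised = threshold_to_percent(1.0)   # 100, after set_almost_full_threshold(1.0)
    print(is_almost_full(80, default))   # True: /var/elasticsearch at 80% sets the alarm
    print(is_almost_full(80, raised))    # False: the same filesystem no longer does
```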

To make the change permanent, add the lines below to leo_gateway.schema, as https://github.com/leo-project/leofs/pull/819/files does:

%% @doc Disable checking the disk usage with disksup by setting its threshold to 1.0(100%)
{mapping,
 "os_mon.disk_almost_full_threshold",
 "os_mon.disk_almost_full_threshold",
 [
  {datatype, integer},
  {default, 1}
 ]}.

I will send the PR to add the above patch later.

vstax · 2017-10-02T13:24:25Z

@mocchira Thank you, the schema fix works.

vstax changed the title ~~Disk capacity watchdog ignores config values~~ Disk capacity watchdog sometimes ignores config values May 31, 2017

vstax mentioned this issue May 31, 2017

Deleting bucket eventually fails and makes delete queues stuck #725

Open

mocchira assigned yosukehara and mocchira Jun 1, 2017

mocchira added Document survey labels Jun 1, 2017

yosukehara mentioned this issue Aug 22, 2017

[leo_storage] How to restrict user's hard disk space #806

Closed

mocchira mentioned this issue Sep 29, 2017

gateway: Disable checking the disk usage with disksup #856

Merged

mocchira added this to the 1.4.3 milestone Mar 30, 2018

mocchira modified the milestones: 1.4.3, 1.5.0 Jul 25, 2018

yosukehara modified the milestones: 1.5.0, v1 docs Feb 25, 2019
