Disk capacity watchdog sometimes ignores config values #747

Open
vstax opened this issue May 31, 2017 · 5 comments

@vstax
Contributor

vstax commented May 31, 2017

The default value for the disk capacity watchdog is 85 (watchdog.disk.threshold_disk_use = 85). I have system A with 25% free space (75% used):

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       246G  173G   61G  75% /mnt/avs

and system B with 83% of its disk space used:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       246G  193G   41G  83% /mnt/avs

With the default values, the watchdog doesn't trigger for system A (expected) but does trigger for system B (unexpected: the threshold is 85%, but disk use is only 83%!):

[W]	storage_1@192.168.3.54	2017-05-31 16:34:45.595124 +0300	1496237685	leo_watchdog_disk:check/4	307[{triggered_watchdog,disk_usage},{disk_use_per,[{filesystem,"/dev/sdb1"},{blocks,257897904},{used,202329436},{available,42444940},{use_percentage,84},{use_percentage_str,"83%"},{mounted_on,"/mnt/avs"}]},{mounted_on,"/mnt/avs"}]

This is the first problem. The second problem is that the config value is ignored. I set watchdog.disk.threshold_disk_use = 95 for system B but I still see the same watchdog message, which shouldn't happen. I set watchdog.disk.threshold_disk_use = 70 for system A and, as expected, I do get a watchdog message. [EDIT: I got wrong results here originally, fixed]

In other words, lowering the value works as expected, but raising it doesn't. Also, there is something fishy about triggering at 83% disk usage with the default set to 85.
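
For reference, both percentages in the log line above (use_percentage 84 versus use_percentage_str "83%") can be reproduced from the logged block counts. The sketch below is only arithmetic on those numbers. The function names are made up for illustration, and whether leo_watchdog actually derives 84 this way is an assumption that this thread does not confirm.

```python
import math

# Values copied from the watchdog log line for /dev/sdb1 (1K blocks).
BLOCKS = 257897904
USED = 202329436
AVAILABLE = 42444940


def df_style_percent(used: int, available: int) -> int:
    """Use% in the style of df: used / (used + available), rounded up."""
    return math.ceil(used * 100 / (used + available))


def blocks_minus_available_percent(blocks: int, available: int) -> int:
    """Share of all blocks not available to users (this counts reserved
    blocks as used), rounded to the nearest integer."""
    return round((blocks - available) * 100 / blocks)


print(df_style_percent(USED, AVAILABLE))                  # 83
print(blocks_minus_available_percent(BLOCKS, AVAILABLE))  # 84
```

Under either formula the usage is above 80% and below the configured 85%.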

@vstax vstax changed the title Disk capacity watchdog ignores config values Disk capacity watchdog sometimes ignores config values May 31, 2017
@mocchira
Member

mocchira commented Jun 1, 2017

@vstax

The warning-level messages are caused by the hard-coded value (80%) defined at https://github.com/leo-project/leo_watchdog/blob/develop/include/leo_watchdog.hrl#L92, so it seems to work as expected. (Error-level messages should be logged when the actual usage exceeds watchdog.disk.threshold_disk_use.)

However, this behavior can confuse many users, as it did you, so we'd like to take another look at whether we really need a hard-coded soft limit (IMHO, we don't).
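
The behavior described above amounts to a two-level check, which might be sketched as follows. This is a simplified illustration, not leo_watchdog's actual code: the function name, the return values, and the strict `>` comparisons are assumptions. Only the 80% soft limit and the watchdog.disk.threshold_disk_use setting come from this thread.

```python
from typing import Optional

# Hard-coded soft limit (percent) mentioned above; not configurable.
SOFT_LIMIT = 80


def classify_disk_use(use_percentage: int,
                      threshold_disk_use: int = 85) -> Optional[str]:
    """Return the (hypothetical) log level for a given disk usage.

    "error"   : usage exceeds the configured threshold_disk_use
    "warning" : usage exceeds only the hard-coded soft limit
    None      : nothing is logged
    """
    if use_percentage > threshold_disk_use:
        return "error"
    if use_percentage > SOFT_LIMIT:
        return "warning"
    return None


# System B (reported use_percentage 84): raising the threshold changes nothing,
# because the warning comes from the soft limit.
print(classify_disk_use(84, 85))  # warning
print(classify_disk_use(84, 95))  # warning

# System A (75% used): silent by default, error once the threshold is lowered.
print(classify_disk_use(75, 85))  # None
print(classify_disk_use(75, 70))  # error
```

Under these assumptions, lowering the threshold takes effect while raising it above 80 does not suppress the message, which matches the observations reported for systems A and B.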

@vstax
Contributor Author

vstax commented Sep 28, 2017

@mocchira I'm trying to run gateways on servers that carry some other load, and I get this in the logs:

[I]	gateway_k02@k02.selectel.cloud.lan	2017-09-28 15:38:29.690264 +0300	1506602309	null:null	0	["alarm_handler",58,32,"{set,{{disk_almost_full,\"/var/elasticsearch\"},[]}}"]

Sure enough, there is such a filesystem. It is 80% full, which is totally fine, since it is set up to stay around 75-82% full all the time:

[root@k02 ~]# LANG=C df /var/elasticsearch/
Filesystem                  1K-blocks      Used Available Use% Mounted on
/dev/mapper/vg_def-elastic 1056758196 801589612 201465112  80% /var/elasticsearch

But 1) the gateway shouldn't look there at all! None of the paths in the config files point to that filesystem, and 2) there should be a way to disable it, but there are no disk watchdog-related options in the config file (and according to the schema, the disk watchdog defaults to "disabled" for the gateway).
Is it possible to avoid these false alarms somehow?

@mocchira
Member

WIP

@mocchira
Member

@vstax
It turned out that disksup (http://erlang.org/doc/man/disksup.html), the Erlang built-in module for retrieving disk usage, emits those logs once disk usage goes above 80% (the default), regardless of our watchdog settings. To disable it, run leo_gateway remote-console and then follow the instructions below:

Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V9.0  (abort with ^G)
(gateway_0@127.0.0.1)1> disksup:get_almost_full_threshold().
80
(gateway_0@127.0.0.1)4> disksup:set_almost_full_threshold(1.0).
(gateway_0@127.0.0.1)5> disksup:get_almost_full_threshold().
100

To make the change permanent, add the lines below to leo_gateway.schema, as https://github.com/leo-project/leofs/pull/819/files does:

%% @doc Disable checking the disk usage with disksup by setting its threshold to 1.0(100%)
{mapping,
 "os_mon.disk_almost_full_threshold",
 "os_mon.disk_almost_full_threshold",
 [
  {datatype, integer},
  {default, 1}
 ]}.

I will send the PR to add the above patch later.

@vstax
Contributor Author

vstax commented Oct 2, 2017

@mocchira Thank you, the schema fix works.

@mocchira mocchira added this to the 1.4.3 milestone Mar 30, 2018
@mocchira mocchira modified the milestones: 1.4.3, 1.5.0 Jul 25, 2018
@yosukehara yosukehara modified the milestones: 1.5.0, v1 docs Feb 25, 2019