Disable failing source collectors (#59)
pederhan authored Jul 31, 2023
1 parent 8b70704 commit 31539c9
Showing 14 changed files with 626 additions and 33 deletions.
54 changes: 48 additions & 6 deletions README.md
@@ -133,11 +133,11 @@ WantedBy=multi-user.target

## Source collectors

As outlined in the [Application](#application) section, source collectors are Python modules (files) that are placed in a directory defined by the option `source_collector_dir` in the `[zac]` table of the config file. Zabbix-auto-config will attempt to load all modules in the directory that are referenced in the configuration file by name. Modules that are referenced in the config but not found in the directory will be ignored.
Source collectors are Python modules placed in a directory specified by the `source_collector_dir` option in the `[zac]` table of the configuration file. Zabbix-auto-config attempts to load all modules referenced by name in the configuration file from this directory. If any referenced modules cannot be found in the directory, they will be ignored.

A source collector is a module that contains a function named `collect` that returns a list of `Host` objects. Zabbix-auto-config uses these host objects to create/update hosts in Zabbix.
A source collector module contains a function named `collect` that returns a list of `Host` objects. These host objects are used by Zabbix-auto-config to create or update hosts in Zabbix.

A module that collects hosts from a file could look like this:
Here's an example of a source collector module that reads hosts from a file:

```python
# path/to/source_collector_dir/load_from_json.py
@@ -153,18 +153,60 @@ def collect(*args: Any, **kwargs: Any) -> List[Host]:
return [Host(**host) for host in f.read()]
```
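
Note that iterating over `f.read()` yields single characters rather than host dictionaries, so the snippet above would need to parse the file before building `Host` objects. The following is a minimal sketch of a working variant, assuming the file holds a JSON array of objects. The `Host` dataclass here is a hypothetical stand-in for `zabbix_auto_config.models.Host`, reduced to two assumed fields so the example runs on its own:

```python
import json
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Host:
    """Hypothetical stand-in for zabbix_auto_config.models.Host."""

    hostname: str  # assumed field; the real model defines more
    enabled: bool = True  # assumed field


DEFAULT_FILE = "hosts.json"


def collect(*args: Any, **kwargs: Any) -> List[Host]:
    # `filename` arrives as a keyword argument from the config entry
    filename = kwargs.get("filename", DEFAULT_FILE)
    with open(filename, "r", encoding="utf-8") as f:
        # json.load parses the whole file into a list of dicts
        return [Host(**host) for host in json.load(f)]
```

In a real collector, the stand-in class would be replaced by an import of the project's own `Host` model.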

Any module that contains a function named `collect` which takes a an arbitrary number of arguments and keyword arguments and returns a list of `Host` objects is recognized as a source collector module. Type annotations are optional, but recommended.
A module is recognized as a source collector if it contains a `collect` function that accepts an arbitrary number of arguments and keyword arguments and returns a list of `Host` objects. Type annotations are optional but recommended.

The corresponding config entry to load the `load_from_json.py` module above could look like this:
The configuration entry for loading a source collector module, like the `load_from_json.py` module above, includes both mandatory and optional fields. Here's how it can be configured:

```toml
[source_collectors.load_from_json]
module_name = "load_from_json"
update_interval = 60
error_tolerance = 5
error_duration = 360
exit_on_error = false
disable_duration = 3600
filename = "hosts.json"
```

The `module_name` and `update_interval` options are required for all source collector modules. Any other options are passed as keyword arguments to the `collect` function.
The following configuration options are available:

### Mandatory configuration

#### module_name
`module_name` is the name of the module to load. This is the name that will be used in the configuration file to reference the module. It must correspond with the name of the module file, without the `.py` extension.

#### update_interval
`update_interval` is the number of seconds between updates. This is the interval at which the `collect` function will be called.

### Optional configuration (error handling)

If more than `error_tolerance` errors occur within `error_duration` seconds, the collector's error tolerance is exceeded. Source collectors do not tolerate errors by default and must opt in to this behavior by setting `error_tolerance` and `error_duration` to non-zero values. When the tolerance is exceeded and `exit_on_error` is set to `true`, the application exits. Otherwise, the collector is disabled for `disable_duration` seconds.


#### error_tolerance

`error_tolerance` (default: 0) is the maximum number of errors tolerated within `error_duration` seconds.

#### error_duration

`error_duration` (default: 0) specifies the duration in seconds to track and log errors. This value should be at least equal to `error_tolerance * update_interval` to ensure correct error detection.

For instance, with an `error_tolerance` of 5 and an `update_interval` of 60, `error_duration` should be no less than 300 (5 * 60). However, it is advisable to choose a higher value to compensate for processing intervals between error occurrences and the subsequent error count checks, as well as any potential delays from the source collectors.

A useful guide is to set `error_duration` as `(error_tolerance + 1) * update_interval`, providing an additional buffer equivalent to one update interval.
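
The sizing rule above can be expressed as a small helper. This is only an illustrative sketch: the function names `minimum_error_duration` and `recommended_error_duration` are hypothetical and are not part of Zabbix-auto-config.

```python
def minimum_error_duration(error_tolerance: int, update_interval: int) -> int:
    """Smallest error_duration the guideline allows: tolerance * interval."""
    return error_tolerance * update_interval


def recommended_error_duration(error_tolerance: int, update_interval: int) -> int:
    """Guideline value with a buffer of one extra update interval."""
    return (error_tolerance + 1) * update_interval


# Values from the example configuration: 5 errors tolerated, 60 s interval
print(minimum_error_duration(5, 60))      # 300
print(recommended_error_duration(5, 60))  # 360
```

The recommended value for these inputs, 360, matches the `error_duration` used in the `load_from_json` example entry.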

#### exit_on_error

`exit_on_error` (default: true) determines whether the application should terminate or disable the failing collector when the number of errors exceeds the tolerance. If set to `true`, the application will exit. Otherwise, the collector will be disabled for `disable_duration` seconds. For backwards compatibility with previous versions of Zabbix-auto-config, this option defaults to `true`. In a future major version, the default will be changed to `false`.

#### disable_duration

`disable_duration` (default: 3600) is the duration in seconds to disable the collector for. If set to 0, the collector is disabled indefinitely, requiring a restart of the application to re-enable it.

### Keyword arguments

Any extra config options specified in the configuration file will be passed to the `collect` function as keyword arguments. In the example above, the `filename` option is passed to the `collect` function, and then accessed via `kwargs["filename"]`.
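
To illustrate how extra options could reach `collect`, here is a rough sketch that splits a config entry into the documented settings and the remaining keyword arguments. The helper `split_options` and the `RESERVED_OPTIONS` set are hypothetical names, and the sketch assumes that the documented options are not themselves forwarded to `collect`; the application's actual loading code may differ.

```python
from typing import Any, Dict, List, Tuple

# Options documented above as collector settings; everything else is "extra".
RESERVED_OPTIONS = {
    "module_name",
    "update_interval",
    "error_tolerance",
    "error_duration",
    "exit_on_error",
    "disable_duration",
}


def split_options(entry: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a config entry into (settings, extra keyword arguments)."""
    settings = {k: v for k, v in entry.items() if k in RESERVED_OPTIONS}
    extra = {k: v for k, v in entry.items() if k not in RESERVED_OPTIONS}
    return settings, extra


def collect(*args: Any, **kwargs: Any) -> List[str]:
    # Stand-in collector: just reports which file it would read
    return [kwargs["filename"]]


entry = {"module_name": "load_from_json", "update_interval": 60, "filename": "hosts.json"}
settings, extra = split_options(entry)
print(collect(**extra))  # ['hosts.json']
```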


## Host modifiers

11 changes: 10 additions & 1 deletion config.sample.toml
@@ -25,8 +25,17 @@ managed_inventory = ["location"]
[source_collectors.mysource]
module_name = "mysource"
update_interval = 60
error_tolerance = 5 # Tolerate 5 errors within `error_duration` seconds
error_duration = 360 # should be greater than update_interval
exit_on_error = false # Disable source if it fails
disable_duration = 3600 # Time in seconds to wait before reactivating a disabled source
kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
another_kwarg = "value2" # We can pass an arbitrary number of kwargs to the source module


[source_collectors.othersource]
module_name = "mysource"
update_interval = 60
source = "other"
error_tolerance = 0 # no tolerance for errors (default)
exit_on_error = true # exit application if source fails
source = "other" # extra kwarg used in mysource module
13 changes: 11 additions & 2 deletions tests/conftest.py
@@ -1,3 +1,4 @@
import multiprocessing
import os
from pathlib import Path
from typing import Iterable
@@ -101,7 +102,7 @@ def sample_config():

@pytest.fixture
def hostgroup_map_file(tmp_path: Path) -> Iterable[Path]:
contents = hostgroup_map = """
contents = """
# This file defines association between siteadm fetched from Nivlheim and host groups in Zabbix.
# A siteadm can be associated with only one hostgroup or usergroup.
# Example: <siteadm>:<host/user groupname>
@@ -120,4 +121,12 @@ def hostgroup_map_file(tmp_path: Path) -> Iterable[Path]:
"""
    map_file_path = tmp_path / "siteadmin_hostgroup_map.txt"
    map_file_path.write_text(contents)
    yield map_file_path


@pytest.fixture(autouse=True, scope="session")
def setup_multiprocessing_start_method() -> None:
    # On macOS we have to set the start method to fork
    # when using multiprocessing-logging
    # os.uname is a function: call it and compare its sysname field
    if os.uname().sysname == "Darwin":
        multiprocessing.set_start_method("fork", force=True)
90 changes: 89 additions & 1 deletion tests/test_config.py
@@ -2,7 +2,7 @@
import tomli

import pytest
from pydantic import Extra
from pydantic import Extra, ValidationError
import zabbix_auto_config.models as models


Expand Down Expand Up @@ -35,3 +35,91 @@ def test_config_extra_field_allowed(
assert len(caplog.records) == 0
finally:
models.Settings.__config__.extra = original_extra


def test_sourcecollectorsettings_defaults():
    # Default setting should be valid
    settings = models.SourceCollectorSettings(
        module_name="foo",
        update_interval=60,
    )
    assert settings.module_name == "foo"
    assert settings.update_interval == 60



def test_sourcecollectorsettings_no_tolerance() -> None:
    """Setting no error tolerance will cause the error_duration to be set
    to a non-zero value.
    Per note in the docstring of SourceCollectorSettings.error_duration,
    the value of error_duration is set to a non-zero value to ensure that
    the error is not discarded when calling RollingErrorCounter.check().
    """
    settings = models.SourceCollectorSettings(
        module_name="foo",
        update_interval=60,
        error_tolerance=0,
        error_duration=0,
    )
    assert settings.error_tolerance == 0
    # In case the actual implementation changes in the future, we don't
    # want to test the _exact_ value, but we know it will not be 0
    assert settings.error_duration > 0


def test_sourcecollectorsettings_no_error_duration():
    # TODO: check if we can just remove this test
    # In order to not have an error_duration, error_tolerance must be 0 too
    settings = models.SourceCollectorSettings(
        module_name="foo",
        update_interval=60,
        error_duration=0,
        error_tolerance=0,
    )
    # See docstring in test_sourcecollectorsettings_no_tolerance
    assert settings.error_duration > 0

    # With tolerance raises an error
    # NOTE: we test the error message in depth in test_sourcecollectorsettings_duration_too_short
    with pytest.raises(ValidationError):
        models.SourceCollectorSettings(
            module_name="foo",
            update_interval=60,
            error_duration=0,
            error_tolerance=5,
        )


def test_sourcecollectorsettings_duration_too_short():
    # error_duration should be greater than or equal to the product of
    # error_tolerance and update_interval
    with pytest.raises(ValidationError) as exc_info:
        models.SourceCollectorSettings(
            module_name="foo",
            update_interval=60,
            error_tolerance=5,
            error_duration=180,
        )
    errors = exc_info.value.errors()
    assert len(errors) == 1
    error = errors[0]
    assert error["loc"] == ("error_duration",)
    assert "greater than 300" in error["msg"]
    assert error["type"] == "value_error"


def test_sourcecollectorsettings_duration_negative():
    # We should not be able to pass in negative values to error_duration
    with pytest.raises(ValidationError) as exc_info:
        models.SourceCollectorSettings(
            module_name="foo",
            update_interval=60,
            error_tolerance=5,
            error_duration=-1,
        )
    errors = exc_info.value.errors()
    assert len(errors) == 1
    error = errors[0]
    assert error["loc"] == ("error_duration",)
    assert error["type"] == "value_error.number.not_ge"
139 changes: 139 additions & 0 deletions tests/test_errcount.py
@@ -0,0 +1,139 @@
import datetime
import operator
import time
from typing import Callable

import pytest

from zabbix_auto_config.errcount import Error, RollingErrorCounter, get_td


def test_get_td():
    """Sanity test that the get_td() helper function works as we expect."""
    td = get_td(60)
    assert td.total_seconds() == 60


def test_rolling_error_counter_init():
    """Test that we can create a RollingErrorCounter object."""
    rec = RollingErrorCounter(60, 5)
    assert rec.duration == 60
    assert rec.tolerance == 5
    assert rec.errors == []


def test_rolling_error_counter_init_negative_duration():
    """Test that we can't create a RollingErrorCounter object with a negative duration."""
    with pytest.raises(ValueError) as exc_info:
        RollingErrorCounter(-60, 5)
    assert "duration" in str(exc_info.value)


def test_rolling_error_counter_init_negative_tolerance():
    """Test that we can't create a RollingErrorCounter object with a negative tolerance."""
    with pytest.raises(ValueError) as exc_info:
        RollingErrorCounter(60, -5)
    assert "tolerance" in str(exc_info.value)


def test_rolling_error_counter_add():
    """Test that we can add errors to the RollingErrorCounter object."""
    rec = RollingErrorCounter(60, 5)
    rec.add()
    assert len(rec.errors) == 1
    time.sleep(0.01)  # ensure that the timestamp is always different
    rec.add()
    assert len(rec.errors) == 2
    assert rec.errors[0] < rec.errors[1]


def test_rolling_error_counter_count():
    """Test that we can count errors in the RollingErrorCounter object."""
    rec = RollingErrorCounter(0.03, 5)
    # This test is a bit timing sensitive, but we should be able to
    assert rec.count() == 0
    rec.add()
    assert rec.count() == 1
    rec.add()
    assert rec.count() == 2
    rec.add()
    assert rec.count() == 3
    rec.add()
    assert rec.count() == 4
    time.sleep(0.03)  # enough to reset the counter
    assert rec.count() == 0


def test_rolling_error_counter_count_is_rolling():
    """Check that the error counter is actually rolling by incrementally adding
    and sleeping. At some point we should see the counter decrease because
    an entry has expired."""
    rec = RollingErrorCounter(0.03, 5)
    rec.add()
    assert rec.count() == 1
    time.sleep(0.01)
    rec.add()
    assert rec.count() == 2
    time.sleep(0.01)
    rec.add()
    assert rec.count() == 3
    time.sleep(0.011)  # just to be sure the first one expired
    rec.add()
    assert rec.count() == 3


def test_rolling_error_counter_tolerance_exceeded():
    """Check that tolerance_exceeded() returns True when the tolerance is exceeded."""
    rec = RollingErrorCounter(60, 5)
    assert not rec.tolerance_exceeded()
    for _ in range(6):  # tolerance + 1
        rec.add()
    assert rec.count() == 6
    assert rec.tolerance_exceeded()

    # Resetting the counter should make the check pass
    rec.reset()
    assert not rec.tolerance_exceeded()


def test_rolling_error_counter_tolerance_exceeded_0():
    """Test tolerance_exceeded with a 0 tolerance."""
    rec = RollingErrorCounter(60, 0)
    assert rec.count() == 0
    assert not rec.tolerance_exceeded()
    rec.add()
    assert rec.count() == 1
    assert rec.tolerance_exceeded()


def test_error_comparison():
    err1 = Error(timestamp=datetime.datetime(2020, 1, 1, 0, 0, 0))
    err2 = Error(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 0))

    assert err1 < err2
    assert err2 > err1
    assert err1 <= err2
    assert err2 >= err1
    assert err1 != err2
    assert err1 == err1
    assert err2 == err2

    def test_type_error(
        op: Callable[[object, object], bool], obj1: object, obj2: object
    ):
        # Test inside function so we get better introspection on failure
        with pytest.raises(TypeError) as exc_info:
            op(obj1, obj2)
        assert "Can't compare Error" in str(exc_info.value)

    operators = [
        operator.lt,
        operator.le,
        operator.eq,
        operator.ne,
        operator.ge,
        operator.gt,
    ]
    # Comparison of Error with non-Error
    for op in operators:
        test_type_error(op, err1, "foo")
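
The behavior these tests exercise (a rolling window of error timestamps, a tolerance check, and a reset) can be sketched as follows. This is not the project's `errcount` module, which is not shown here. `RollingCounterSketch` is a hypothetical class that stores plain `datetime` timestamps instead of `Error` objects, and it assumes the tolerance counts as exceeded only when the number of recent errors is strictly greater than `tolerance`, as the tests above suggest.

```python
import datetime
from typing import List


class RollingCounterSketch:
    """Hypothetical rolling error counter; not the project's implementation."""

    def __init__(self, duration: float, tolerance: int) -> None:
        if duration < 0:
            raise ValueError("duration must be non-negative")
        if tolerance < 0:
            raise ValueError("tolerance must be non-negative")
        self.duration = duration
        self.tolerance = tolerance
        self.errors: List[datetime.datetime] = []

    def add(self) -> None:
        # Record the time at which an error occurred
        self.errors.append(datetime.datetime.now())

    def count(self) -> int:
        # Drop timestamps older than `duration` seconds, then count the rest
        cutoff = datetime.datetime.now() - datetime.timedelta(seconds=self.duration)
        self.errors = [ts for ts in self.errors if ts >= cutoff]
        return len(self.errors)

    def tolerance_exceeded(self) -> bool:
        # Exceeded only when there are more recent errors than tolerated
        return self.count() > self.tolerance

    def reset(self) -> None:
        self.errors.clear()
```

Pruning expired timestamps inside `count()` keeps the list bounded without a background task, at the cost of a linear scan on each call.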
Empty file.