Disable failing source collectors (#59)

unioslo · Jul 31, 2023 · 31539c9 · 31539c9
1 parent 8b70704
commit 31539c9
Showing 14 changed files with 626 additions and 33 deletions.
diff --git a/README.md b/README.md
@@ -133,11 +133,11 @@ WantedBy=multi-user.target
 
 ## Source collectors
 
-As outlined in the [Application](#application) section, source collectors are Python modules (files) that are placed in a directory defined by the option `source_collector_dir` in the `[zac]` table of the config file. Zabbix-auto-config will attempt to load all modules in the directory that are referenced in the configuration file by name. Modules that are referenced in the config but not found in the directory will be ignored.
+Source collectors are Python modules placed in a directory specified by the `source_collector_dir` option in the `[zac]` table of the configuration file. Zabbix-auto-config attempts to load all modules referenced by name in the configuration file from this directory. If any referenced modules cannot be found in the directory, they will be ignored.
 
-A source collector is a module that contains a function named `collect` that returns a list of `Host` objects. Zabbix-auto-config uses these host objects to create/update hosts in Zabbix.
+A source collector module contains a function named `collect` that returns a list of `Host` objects. These host objects are used by Zabbix-auto-config to create or update hosts in Zabbix.
 
-A module that collects hosts from a file could look like this:
+Here's an example of a source collector module that reads hosts from a file:
 
 ```python
 # path/to/source_collector_dir/load_from_json.py
@@ -153,18 +153,60 @@ def collect(*args: Any, **kwargs: Any) -> List[Host]:
         return [Host(**host) for host in f.read()]
 ```
 
-Any module that contains a function named `collect` which takes a an arbitrary number of arguments and keyword arguments and returns a list of `Host` objects is recognized as a source collector module. Type annotations are optional, but recommended.
+A module is recognized as a source collector if it contains a `collect` function that accepts an arbitrary number of arguments and keyword arguments and returns a list of `Host` objects. Type annotations are optional but recommended.
 
-The corresponding config entry to load the `load_from_json.py` module above could look like this:
+The configuration entry for loading a source collector module, like the `load_from_json.py` module above, includes both mandatory and optional fields. Here's how it can be configured:
 
 ```toml
 [source_collectors.load_from_json]
 module_name = "load_from_json"
 update_interval = 60
+error_tolerance = 5
+error_duration = 360
+exit_on_error = false
+disable_duration = 3600
 filename = "hosts.json"
 ```
 
-The `module_name` and `update_interval` options are required for all source collector modules. Any other options are passed as keyword arguments to the `collect` function.
+The following configuration options are available:
+
+### Mandatory configuration
+
+#### module_name
+`module_name` is the name of the module to load. This is the name that will be used in the configuration file to reference the module. It must correspond with the name of the module file, without the `.py` extension.
+
+#### update_interval
+`update_interval` is the number of seconds between updates. This is the interval at which the `collect` function will be called.
+
+### Optional configuration (error handling)
+
+If more than `error_tolerance` errors occur within `error_duration` seconds, the collector is considered failing. Source collectors do not tolerate errors by default and must opt in to this behavior by setting `error_tolerance` and `error_duration` to non-zero values. If `exit_on_error` is set to `true`, the application will exit. Otherwise, the collector will be disabled for `disable_duration` seconds.
+
+
+#### error_tolerance
+
+`error_tolerance` (default: 0) is the maximum number of errors tolerated within `error_duration` seconds. 
+
+#### error_duration
+
+`error_duration` (default: 0) specifies the duration in seconds to track and log errors. This value should be at least equal to `error_tolerance * update_interval` to ensure correct error detection. 
+
+For instance, with an `error_tolerance` of 5 and an `update_interval` of 60, `error_duration` should be no less than 300 (5 * 60). However, it is advisable to choose a higher value to compensate for processing intervals between error occurrences and the subsequent error count checks, as well as any potential delays from the source collectors. 
+
+A useful guide is to set `error_duration` as `(error_tolerance + 1) * update_interval`, providing an additional buffer equivalent to one update interval.
+
+#### exit_on_error
+
+`exit_on_error` (default: true) determines whether the application should terminate or disable the failing collector when the number of errors exceeds the tolerance. If set to `true`, the application will exit. Otherwise, the collector will be disabled for `disable_duration` seconds. For backwards compatibility with previous versions of Zabbix-auto-config, this option defaults to `true`. In a future major version, the default will be changed to `false`.
+
+#### disable_duration
+
+`disable_duration` (default: 3600) is the duration in seconds for which the collector is disabled. If set to 0, the collector is disabled indefinitely, requiring a restart of the application to re-enable it.
+
+### Keyword arguments
+
+Any extra config options specified in the configuration file will be passed to the `collect` function as keyword arguments. In the example above, the `filename` option is passed to the `collect` function, and then accessed via `kwargs["filename"]`.
+
 
 ## Host modifiers
 

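As a minimal sketch of the keyword-argument passing described in the README changes above, the following shows how the `filename` option could reach `collect`. The `Host` dataclass here is a hypothetical stand-in for the project's real `Host` model, and the JSON layout (a list of objects) is an assumption made so the example runs on its own.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, List


@dataclass
class Host:
    """Hypothetical stand-in for the real Host model (assumption)."""

    hostname: str
    enabled: bool = True


def collect(*args: Any, **kwargs: Any) -> List[Host]:
    # Extra options from the config entry arrive as keyword arguments.
    filename = kwargs.get("filename", "hosts.json")
    with open(filename, "r") as f:
        # The file is assumed to contain a JSON list of host objects.
        return [Host(**host) for host in json.load(f)]


def demo() -> List[Host]:
    # Write a small sample file, then collect hosts from it.
    sample = [
        {"hostname": "foo.example.com"},
        {"hostname": "bar.example.com", "enabled": False},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "hosts.json"
        path.write_text(json.dumps(sample))
        return collect(filename=str(path))
```

Calling `demo()` returns two `Host` objects built from the sample file; in a real deployment the `filename` value would come from the `[source_collectors.load_from_json]` table instead.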
diff --git a/config.sample.toml b/config.sample.toml
@@ -25,8 +25,17 @@ managed_inventory = ["location"]
 [source_collectors.mysource]
 module_name = "mysource"
 update_interval = 60
+error_tolerance = 5              # Tolerate 5 errors within `error_duration` seconds
+error_duration = 360             # should be greater than update_interval
+exit_on_error = false            # Disable source if it fails
+disable_duration = 3600          # Time in seconds to wait before reactivating a disabled source
+kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
+another_kwarg = "value2"         # We can pass an arbitrary number of kwargs to the source module
+
 
 [source_collectors.othersource]
 module_name = "mysource"
 update_interval = 60
-source = "other"
+error_tolerance = 0      # no tolerance for errors (default)
+exit_on_error = true     # exit application if source fails
+source = "other"         # extra kwarg used in mysource module
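The two sample entries above lead to different outcomes once the error tolerance is exceeded. The sketch below restates the documented rules for `exit_on_error` and `disable_duration` as a small function; `on_tolerance_exceeded` and its return shape are hypothetical names chosen for illustration, not part of Zabbix-auto-config.

```python
from typing import Optional, Tuple


def on_tolerance_exceeded(
    exit_on_error: bool = True, disable_duration: int = 3600
) -> Tuple[str, Optional[int]]:
    """Return the action to take and, if relevant, its duration in seconds."""
    if exit_on_error:
        # Documented default: the whole application exits.
        return ("exit", None)
    if disable_duration == 0:
        # 0 means the collector stays disabled until the application restarts.
        return ("disable", None)
    return ("disable", disable_duration)
```

With the `mysource` settings (`exit_on_error = false`, `disable_duration = 3600`) this yields a one-hour disable, while `othersource` (`exit_on_error = true`) yields an exit.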
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -1,3 +1,4 @@
+import multiprocessing
 import os
 from pathlib import Path
 from typing import Iterable
@@ -101,7 +102,7 @@ def sample_config():
 
 @pytest.fixture
 def hostgroup_map_file(tmp_path: Path) -> Iterable[Path]:
-    contents = hostgroup_map = """
+    contents = """
 # This file defines assosiation between siteadm fetched from Nivlheim and hostsgroups in Zabbix.
 # A siteadm can be assosiated only with one hostgroup or usergroup.
 # Example: <siteadm>:<host/user groupname>
@@ -120,4 +121,12 @@ def hostgroup_map_file(tmp_path: Path) -> Iterable[Path]:
 """
     map_file_path = tmp_path / "siteadmin_hostgroup_map.txt"
     map_file_path.write_text(contents)
-    yield map_file_path
+    yield map_file_path
+
+
+@pytest.fixture(autouse=True, scope="session")
+def setup_multiprocessing_start_method() -> None:
+    # On macOS we have to set the start method to fork
+    # when using multiprocessing-logging
+    if os.uname().sysname == "Darwin":
+        multiprocessing.set_start_method("fork", force=True)
diff --git a/tests/test_config.py b/tests/test_config.py
@@ -2,7 +2,7 @@
 import tomli
 
 import pytest
-from pydantic import Extra
+from pydantic import Extra, ValidationError
 import zabbix_auto_config.models as models
 
 
@@ -35,3 +35,91 @@ def test_config_extra_field_allowed(
         assert len(caplog.records) == 0
     finally:
         models.Settings.__config__.extra = original_extra
+
+
+def test_sourcecollectorsettings_defaults():
+    # Default setting should be valid
+    settings = models.SourceCollectorSettings(
+        module_name="foo",
+        update_interval=60,
+    )
+    assert settings.module_name == "foo"
+    assert settings.update_interval == 60
+
+
+
+def test_sourcecollectorsettings_no_tolerance() -> None:
+    """Setting no error tolerance will cause the error_duration to be set
+    to a non-zero value.
+
+    Per note in the docstring of SourceCollectorSettings.error_duration,
+    the value of error_duration is set to a non-zero value to ensure that
+    the error is not discarded when calling RollingErrorCounter.check().
+    """
+    settings = models.SourceCollectorSettings(
+        module_name="foo",
+        update_interval=60,
+        error_tolerance=0,
+        error_duration=0,
+    )
+    assert settings.error_tolerance == 0
+    # In case the actual implementation changes in the future, we don't
+    # want to test the _exact_ value, but we know it will not be 0
+    assert settings.error_duration > 0
+
+
+def test_sourcecollectorsettings_no_error_duration():
+    # TODO: check if we can just remove this test
+    # In order to not have an error_duration, error_tolerance must be 0 too
+    settings = models.SourceCollectorSettings(
+        module_name="foo",
+        update_interval=60,
+        error_duration=0,
+        error_tolerance=0,
+    )
+    # See docstring in test_sourcecollectorsettings_no_tolerance
+    assert settings.error_duration > 0
+
+    # With tolerance raises an error
+    # NOTE: we test the error message in depth in test_sourcecollectorsettings_duration_too_short
+    with pytest.raises(ValidationError):
+        models.SourceCollectorSettings(
+            module_name="foo",
+            update_interval=60,
+            error_duration=0,
+            error_tolerance=5,
+        )
+
+
+def test_sourcecollectorsettings_duration_too_short():
+    # error_duration should be greater than or equal to the product of
+    # error_tolerance and update_interval
+    with pytest.raises(ValidationError) as exc_info:
+        models.SourceCollectorSettings(
+            module_name="foo",
+            update_interval=60,
+            error_tolerance=5,
+            error_duration=180,
+        )
+    errors = exc_info.value.errors()
+    assert len(errors) == 1
+    error = errors[0]
+    assert error["loc"] == ("error_duration",)
+    assert "greater than 300" in error["msg"]
+    assert error["type"] == "value_error"
+
+
+def test_sourcecollectorsettings_duration_negative():
+    # We should not be able to pass in negative values to error_duration
+    with pytest.raises(ValidationError) as exc_info:
+        models.SourceCollectorSettings(
+            module_name="foo",
+            update_interval=60,
+            error_tolerance=5,
+            error_duration=-1,
+        )
+    errors = exc_info.value.errors()
+    assert len(errors) == 1
+    error = errors[0]
+    assert error["loc"] == ("error_duration",)
+    assert error["type"] == "value_error.number.not_ge"
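The rule these tests exercise can be sketched without pydantic. This is a plain-Python illustration under stated assumptions, not the validator in `models.SourceCollectorSettings`: the function names are hypothetical, and the special case where a zero tolerance causes `error_duration` to be replaced by a non-zero value is deliberately left out because the diff does not show which value is used.

```python
def recommended_error_duration(error_tolerance: int, update_interval: int) -> int:
    # README guideline: one extra update interval as a buffer.
    return (error_tolerance + 1) * update_interval


def check_error_duration(
    update_interval: int, error_tolerance: int, error_duration: int
) -> int:
    """Return error_duration if it satisfies the documented constraints."""
    if error_duration < 0:
        raise ValueError("error_duration must not be negative")
    minimum = error_tolerance * update_interval
    if error_duration < minimum:
        raise ValueError(
            f"error_duration must be greater than or equal to {minimum}"
        )
    return error_duration
```

For `error_tolerance = 5` and `update_interval = 60`, the minimum is 300 and the recommended value is 360, which matches the value used in `config.sample.toml`.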
diff --git a/tests/test_errcount.py b/tests/test_errcount.py
@@ -0,0 +1,139 @@
+import datetime
+import operator
+import time
+from typing import Callable
+
+import pytest
+
+from zabbix_auto_config.errcount import Error, RollingErrorCounter, get_td
+
+
+def test_get_td():
+    """Sanity test that the get_td() helper function works as we expect."""
+    td = get_td(60)
+    assert td.total_seconds() == 60
+
+
+def test_rolling_error_counter_init():
+    """Test that we can create a RollingErrorCounter object."""
+    rec = RollingErrorCounter(60, 5)
+    assert rec.duration == 60
+    assert rec.tolerance == 5
+    assert rec.errors == []
+
+
+def test_rolling_error_counter_init_negative_duration():
+    """Test that we can't create a RollingErrorCounter object with a negative duration."""
+    with pytest.raises(ValueError) as exc_info:
+        RollingErrorCounter(-60, 5)
+    assert "duration" in str(exc_info.value)
+
+
+def test_rolling_error_counter_init_negative_tolerance():
+    """Test that we can't create a RollingErrorCounter object with a negative tolerance."""
+    with pytest.raises(ValueError) as exc_info:
+        RollingErrorCounter(60, -5)
+    assert "tolerance" in str(exc_info.value)
+
+
+def test_rolling_error_counter_add():
+    """Test that we can add errors to the RollingErrorCounter object."""
+    rec = RollingErrorCounter(60, 5)
+    rec.add()
+    assert len(rec.errors) == 1
+    time.sleep(0.01)  # ensure that the timestamp is always different
+    rec.add()
+    assert len(rec.errors) == 2
+    assert rec.errors[0] < rec.errors[1]
+
+
+def test_rolling_error_counter_count():
+    """Test that we can count errors in the RollingErrorCounter object."""
+    rec = RollingErrorCounter(0.03, 5)
+    # This test is a bit timing sensitive, but the four add() calls should
+    # complete well within the 0.03 second window
+    assert rec.count() == 0
+    rec.add()
+    assert rec.count() == 1
+    rec.add()
+    assert rec.count() == 2
+    rec.add()
+    assert rec.count() == 3
+    rec.add()
+    assert rec.count() == 4
+    time.sleep(0.03)  # enough to reset the counter
+    assert rec.count() == 0
+
+
+def test_rolling_error_counter_count_is_rolling():
+    """Check that the error counter is actually rolling by incrementally adding
+    and sleeping. At some point we should see the counter decrease because
+    an entry has expired."""
+    rec = RollingErrorCounter(0.03, 5)
+    rec.add()
+    assert rec.count() == 1
+    time.sleep(0.01)
+    rec.add()
+    assert rec.count() == 2
+    time.sleep(0.01)
+    rec.add()
+    assert rec.count() == 3
+    time.sleep(0.011)  # just to be sure the first one expired
+    rec.add()
+    assert rec.count() == 3
+
+
+def test_rolling_error_counter_tolerance_exceeded():
+    """Check that tolerance_exceeded() returns True when the tolerance is exceeded."""
+    rec = RollingErrorCounter(60, 5)
+    assert not rec.tolerance_exceeded()
+    for _ in range(6):  # tolerance + 1
+        rec.add()
+    assert rec.count() == 6
+    assert rec.tolerance_exceeded()
+
+    # Resetting the counter should make the check pass
+    rec.reset()
+    assert not rec.tolerance_exceeded()
+
+
+def test_rolling_error_counter_tolerance_exceeded_0():
+    """Test tolerance_exceeded with a 0 tolerance."""
+    rec = RollingErrorCounter(60, 0)
+    assert rec.count() == 0
+    assert not rec.tolerance_exceeded()
+    rec.add()
+    assert rec.count() == 1
+    assert rec.tolerance_exceeded()
+
+
+def test_error_comparison():
+    err1 = Error(timestamp=datetime.datetime(2020, 1, 1, 0, 0, 0))
+    err2 = Error(timestamp=datetime.datetime(2021, 1, 1, 0, 0, 0))
+
+    assert err1 < err2
+    assert err2 > err1
+    assert err1 <= err2
+    assert err2 >= err1
+    assert err1 != err2
+    assert err1 == err1
+    assert err2 == err2
+
+    def test_type_error(
+        op: Callable[[object, object], bool], obj1: object, obj2: object
+    ):
+        # Test inside function so we get better introspection on failure
+        with pytest.raises(TypeError) as exc_info:
+            op(obj1, obj2)
+        assert "Can't compare Error" in str(exc_info.value)
+
+    operators = [
+        operator.lt,
+        operator.le,
+        operator.eq,
+        operator.ne,
+        operator.ge,
+        operator.gt,
+    ]
+    # Comparison of Error with non-Error
+    for op in operators:
+        test_type_error(op, err1, "foo")
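The behaviour these tests describe (errors expire after `duration` seconds, and the tolerance is exceeded only when the count is strictly greater than `tolerance`) can be sketched as follows. This is a hypothetical re-implementation for illustration, not the code in `zabbix_auto_config.errcount`: it stores plain float timestamps instead of `Error` objects, and the `clock` parameter is an added assumption that makes the rolling window testable without sleeping.

```python
import time
from typing import Callable, List


class RollingErrorCounter:
    """Hypothetical sketch of a rolling-window error counter."""

    def __init__(
        self,
        duration: float,
        tolerance: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if duration < 0:
            raise ValueError("duration must not be negative")
        if tolerance < 0:
            raise ValueError("tolerance must not be negative")
        self.duration = duration
        self.tolerance = tolerance
        self.errors: List[float] = []
        self._clock = clock  # injectable time source (assumption)

    def add(self) -> None:
        # Record the time at which an error occurred.
        self.errors.append(self._clock())

    def count(self) -> int:
        # Drop errors older than the window, then count what remains.
        cutoff = self._clock() - self.duration
        self.errors = [ts for ts in self.errors if ts >= cutoff]
        return len(self.errors)

    def tolerance_exceeded(self) -> bool:
        # Strictly greater: a tolerance of 0 is exceeded by the first error.
        return self.count() > self.tolerance

    def reset(self) -> None:
        self.errors.clear()
```

Passing a fake clock, for example `clock=lambda: now[0]` with a mutable `now` list, lets a test advance time explicitly rather than relying on `time.sleep`.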
diff --git a/tests/test_processing/__init__.py b/tests/test_processing/__init__.py