InfluxDB send retry after IOError #7247

Closed
wants to merge 6 commits into from

Conversation

janLo
Contributor

@janLo janLo commented Apr 23, 2017

Description:

This adds an optional max_retries parameter to the InfluxDB component
to specify if, and how often, the component should retry sending the data
if the connection fails due to an IOError.

The sending will be scheduled for a retry in 20 seconds, as often as the
user specified. This can be handy for flaky network connections between
the DB and Home Assistant, or for outages like daily DSL reconnects.

Pull request in home-assistant.github.io with documentation (if applicable): home-assistant/home-assistant.io#2485

Example entry for configuration.yaml (if applicable):

influxdb:
  host: 192.168.1.190
  port: 20000
  database: DB_TO_STORE_EVENTS
  username: MY_USERNAME
  password: MY_PASSWORD
  ssl: true
  verify_ssl: true
  max_retries: 3
  default_measurement: state
  blacklist:
     - entity.id1
     - entity.id2
  whitelist:
     - entity.id3
     - entity.id4
  tags:
    instance: prod
    source: hass

Checklist:

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • Local tests with tox run successfully. Your PR cannot be merged unless tests pass
  • Tests have been added to verify that the new code works.

@mention-bot

@janLo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @lwis, @fabaff and @kk7ds to be potential reviewers.

@pvizeli
Member

pvizeli commented Apr 25, 2017

Please rebase it

@janLo
Contributor Author

janLo commented Apr 25, 2017

Rebased. :)

Member

@pvizeli pvizeli left a comment

I don't like that. I think this functionality needs to be handled on the API side and not inside Home Assistant; I think that would be the wrong place for it. The code looks okay, but it makes things more complex than they should be.

@janLo
Contributor Author

janLo commented Apr 26, 2017

@pvizeli So do you suggest maintaining a fork of the official InfluxDB API? I don't think that this will go upstream, especially as it would mean that the API has to be aware of Home Assistant's scheduling system.

@MartinHjelmare
Member

MartinHjelmare commented Apr 26, 2017

I don't think this belongs in the influxdb component either. It's a matter of connection reliability, not something specific to InfluxDB.

Would it be possible to abstract it into a helper decorator, similar to how we have a throttle decorator, but for retries? That would then be reusable all over Home Assistant and have the same potential configuration.

@janLo
Contributor Author

janLo commented Apr 26, 2017

It would be easy. It just takes away the possibility of having specific log messages - and I'm not sure if it's really worth having a decorator for that. Anyway, it could look like this:

In the influx component:

    @RetryOnError(hass, retry_limit=max_tries, retry_delay=20)
    def _write_data(json_body):
        try:
            influx.write_points(json_body)
        except exceptions.InfluxDBClientError:
            _LOGGER.exception("Error saving event %s to InfluxDB", json_body)

In util:

import datetime
import functools

import homeassistant.util.dt as dt_util
from homeassistant.helpers.event import track_point_in_utc_time


class RetryOnError(object):
    """Retry the decorated method after a delay when it raises."""

    def __init__(self, hass, retry_limit=0, retry_delay=20):
        self.hass = hass
        self.retry_limit = retry_limit
        self.retry_delay = datetime.timedelta(seconds=retry_delay)

    def __call__(self, method):

        @functools.wraps(method)
        def wrapper(*args, **kwargs):

            def scheduled(retry=0, event=None):
                try:
                    method(*args, **kwargs)
                except Exception:
                    target = dt_util.utcnow() + self.retry_delay
                    if retry == self.retry_limit:
                        raise
                    track_point_in_utc_time(self.hass,
                                            functools.partial(scheduled,
                                                              retry + 1),
                                            target)

            scheduled()

        return wrapper

It still doesn't look very nice, it's more code, and it lacks specific log messages.
Shall I go for that or leave it?

@MartinHjelmare
Member

I like the helper decorator approach better. But I think @balloob should have a say.

@pvizeli
Member

pvizeli commented Apr 27, 2017

My point is that we have roughly 300-400 components/platforms that could have the same problem. If we need ~30 lines per component/platform for this, that gives about 10,500 lines that we would need to maintain, and it fills our bus and job handler with data, which can create new problems. In the end we cannot handle that; an API can handle it better, and if the API does not support it, there may be a reason for that...

self.assertEqual(mock_method.call_count, 1)


mock_method.side_effect = Exception()

too many blank lines (2)

@@ -97,7 +103,7 @@ def influx_event_listener(event):
state = event.data.get('new_state')
if state is None or state.state in (
STATE_UNKNOWN, '', STATE_UNAVAILABLE) or \
state.entity_id in blacklist:
state.entity_id in blacklist:

continuation line over-indented for visual indent


import functools

'functools' imported but unused

@@ -5,13 +5,16 @@
https://home-assistant.io/components/influxdb/
"""
import logging
import datetime

'datetime' imported but unused

@janLo
Contributor Author

janLo commented Apr 27, 2017

Rebased and refactored the retry logic into a RetryOnError decorator in homeassistant.util (including unit tests).

janLo added 3 commits April 27, 2017 20:30
This adds an optional max_retries parameter to the InfluxDB component
to specify if and how often the component should try to send the data
if the connection failed due to an IOError.

The sending will be scheduled for a retry in 20 seconds as often as the
user specified. This can be handy for flaky network connections between
the DB and Home Assistant or outages like daily DSL reconnects.

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
This replaces the scheduling logic in the InfluxDB component with the
RetryOnError decorator from homeassistant.util

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
@janLo
Contributor Author

janLo commented Apr 27, 2017

... and fixed another rebase error.

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
@pvizeli
Member

pvizeli commented Apr 28, 2017

Sorry, but that is too weird. I.e. you should only schedule new stuff if no other schedule is pending, otherwise you go back to the old problem. If you schedule stuff endlessly, we run into a memory problem in some cases. Let's make a PR here: https://github.com/influxdata/influxdb-python and fix that at the base. But for now I'll stay silent and wait for what Paulus thinks.

@janLo
Contributor Author

janLo commented Apr 28, 2017

@pvizeli No, I only schedule "max retries" times. I don't see this going upstream in any form, because they don't have any scheduling logic - so blocking would be the only thing they could do, which is not an option I think.

Of course, we can add fancy logic to limit the number of "pending" calls - but I think that would add unnecessary overhead. I have been using this solution for a few weeks without any problems. As the calls are always dropped after some retries, it's also not a problem that it somehow grows in memory.

Member

@balloob balloob left a comment

So I don't think that it should per se have to go upstream into influx Python. It's up to the caller of the API to implement retry logic, as the caller knows what the current conditions are.

I do, however, share Pascal's concern that these retry chains are getting orphaned - we do not track them. I think that a generic decorator is not the right solution, because we would be promoting the wrong approach of just fire-and-forget with a retry. This will hurt us in the long term.

For now I would say that the retry logic should stay in the InfluxDB component. Probably also keep track of how many retries are going on and make sure to start dropping events when the list of pending delayed events gets too big.
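
For illustration, a minimal sketch of what that bookkeeping could look like inside the InfluxDB component. The helper name make_writer, the MAX_PENDING cap, and the pending counter are assumptions for this sketch, not code from this PR; track_point_in_utc_time and dt_util are the same Home Assistant helpers the PR already uses:

    # Illustrative sketch only: bounded, tracked retries for InfluxDB writes.
    from datetime import timedelta
    from functools import partial

    from homeassistant.helpers.event import track_point_in_utc_time
    import homeassistant.util.dt as dt_util

    MAX_PENDING = 1000   # assumed cap on events waiting for a retry
    RETRY_DELAY = 20     # seconds between attempts, as proposed in this PR


    def make_writer(hass, influx, logger, max_retries):
        """Return a write function that retries on IOError and drops on overload."""
        pending = {'count': 0}   # events currently scheduled for a retry

        def write(json_body, retry=0, fire_time=None):
            try:
                influx.write_points(json_body)
                if retry:
                    pending['count'] -= 1   # a retried event finally succeeded
            except IOError:
                if retry >= max_retries or pending['count'] >= MAX_PENDING:
                    if retry:
                        pending['count'] -= 1
                    logger.error("Dropping event after %d attempts", retry + 1)
                    return
                if retry == 0:
                    pending['count'] += 1   # first failure: event becomes pending
                target = dt_util.utcnow() + timedelta(seconds=RETRY_DELAY)
                track_point_in_utc_time(
                    hass, partial(write, json_body, retry + 1), target)

        return write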

# pylint: disable=broad-except
try:
method(*args, **kwargs)
except Exception:
Member

We should never catch generic exceptions. It should always be a specific one. Add it as an argument to the decorator and default it to IOError

Contributor Author

I can change that. I thought it might be good to act on all exceptions that are not handled within the decorated method.

Member

Catching Exception is never a good idea. So many things will go unnoticed, including a user pressing ctrl+c !
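
As a sketch of the change being requested in this thread, the decorator could take the exception types as a parameter and default to IOError; the exceptions parameter name is an assumption, not from this PR:

    import datetime


    class RetryOnError(object):
        """Sketch: retry only on the configured exception types (default IOError)."""

        def __init__(self, hass, retry_limit=0, retry_delay=20,
                     exceptions=(IOError,)):
            self.hass = hass
            self.retry_limit = retry_limit
            self.retry_delay = datetime.timedelta(seconds=retry_delay)
            self.exceptions = exceptions

        # __call__ stays as in the earlier sketch, except the wrapped call
        # catches `self.exceptions` instead of the bare `Exception`.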

@@ -48,6 +49,7 @@
vol.Optional(CONF_DB_NAME, default=DEFAULT_DATABASE): cv.string,
vol.Optional(CONF_PORT): cv.port,
vol.Optional(CONF_SSL): cv.boolean,
vol.Optional(CONF_RETRY_COUNT): cv.positive_int,
Member

There is no default so this value can be 0

Contributor Author

Ok, so I just set a default here too?
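
That would amount to giving the schema entry from the diff above an explicit default of 0, roughly like this sketch (CONF_RETRY_COUNT is the constant added by this PR; its assumed value here matches the max_retries key from the configuration example):

    import voluptuous as vol

    import homeassistant.helpers.config_validation as cv

    CONF_RETRY_COUNT = 'max_retries'  # assumed key, as in the configuration example

    # Sketch: with an explicit default, the value is never absent from the config.
    SCHEMA_EXTENSION = {
        vol.Optional(CONF_RETRY_COUNT, default=0): cv.positive_int,
    }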

in seconds as arguments.
"""

def __init__(self, hass, retry_limit=0, retry_delay=20):
Member

20 seems very high, I think by default it should be 0.

Member

And since 0 is the default, you can make hass an optional argument.

Contributor Author

0 is actually "higher" - as this tells how long to wait before the next retry. So 0 would mean "instant retry" - which makes no sense most of the time.

Contributor Author

If hass is optional, shall we then throw an exception if someone uses this decorator with a value other than 0 and doesn't pass a hass instance?

Member

Yes
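
Putting this sub-thread together, the constructor could then look roughly like the following sketch, keeping the 20-second delay from the PR (the error type and message are assumptions):

    import datetime


    class RetryOnError(object):
        """Sketch: hass is optional, but required as soon as retries are enabled."""

        def __init__(self, hass=None, retry_limit=0, retry_delay=20):
            if retry_limit > 0 and hass is None:
                raise ValueError(
                    "A hass instance is required when retry_limit is non-zero")
            self.hass = hass
            self.retry_limit = retry_limit
            self.retry_delay = datetime.timedelta(seconds=retry_delay)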

Contributor Author

@janLo janLo left a comment

Ok, so I change it back from the prior request to make a generic decorator and just put it in the component. Is this the final decision?

To argue against the orphaned requests: they do not happen. There is a retry limit, so if you exceed the retry limit, the request gets dropped (as it was before, without any retrying). So if you have 10 requests per second, 20 secs of retry delay and 4 retries, you never have more than 1000 requests queued, as after 100 seconds the first request gets dropped.

So there is never a "getting too big". Of course we can add some heuristics to guess when it is necessary to stop retrying and instead ping whether the server is available again. But I think this would be way too complex for this simple use case.
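
For illustration, the bound from that paragraph works out as follows (the numbers are the ones quoted above, not measurements):

    # Back-of-the-envelope bound on queued retries, using the example numbers above.
    events_per_second = 10   # incoming state changes
    retry_delay = 20         # seconds between attempts
    max_retries = 4          # retries before an event is dropped

    # An event is retried at most max_retries times, each retry_delay apart, so it
    # is dropped roughly (max_retries + 1) * retry_delay seconds after it arrives.
    max_lifetime = (max_retries + 1) * retry_delay   # 100 seconds
    max_queued = events_per_second * max_lifetime    # 1000 events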


@balloob
Member

balloob commented May 2, 2017

The requests are still orphaned, because the influxdb integration has no clue how many requests are still in flight or how many requests will hit the DB in the next X time.

As long as you keep it in influxdb, I'm fine with it.

@balloob
Member

balloob commented Jun 24, 2017

This PR seems to have gone stale. Closing it. You can reopen it when you're ready to finish it.

@balloob balloob closed this Jun 24, 2017
@janLo
Contributor Author

janLo commented Jun 26, 2017

I can't reopen it by myself.

@lwis lwis reopened this Jun 26, 2017
@lwis
Member

lwis commented Jul 8, 2017

Closing this again as there hasn't been any activity since re-opening.

Also is this solved by #8209?

@lwis lwis closed this Jul 8, 2017
@janLo
Contributor Author

janLo commented Jul 10, 2017 via email

@lwis
Member

lwis commented Jul 10, 2017

So we know what's waiting for review, or ready to be merged.

A pull request signifies "My changes are ready, take my changes!", while this is not the case for this PR. I'd suggest continuing to work on the branch, and poke us when you're ready for review.

Perspective: If everyone raised a PR for their unfinished change, it would be a nightmare to triage.

@janLo
Contributor Author

janLo commented Jul 10, 2017

Ok. For me it's dismissing work: "Go away". Especially as I cannot reopen it myself. And I don't want to work on closed PRs.

It works well. It is tested and can be merged; I use it every day. It is in the state "My changes are ready, take my changes!".
Only cosmetic changes were requested - after it was finished and working.

Is this the new policy, to close all working PRs where small changes are requested?

@Landrash
Contributor

@janLo In general, stale PRs are closed to keep the code base and the stream of incoming PRs maintainable. There are quite a few PRs across all the Home Assistant repositories, and the people who review and merge are limited in number and are all doing it in their free time, with limited tools (the GitHub GUI).

Taking it as "go away" is the last thing you should do, since you've gotten valuable feedback on what some of the core maintainers think needs changing to integrate this into HA. There's nothing that prevents you from working on this in your own fork until the requested changes have been made and then simply leaving a reply on this PR to have it reopened at that time.
It's really the minimal amount of effort for everyone involved (submitters and maintainers) that can be done with the limited tools available.
If you have a better way of keeping a project the size of HA organized, I'm sure any feedback on that process would be greatly appreciated (but not in this PR). 😄

@janLo
Contributor Author

janLo commented Jul 17, 2017 via email

@home-assistant home-assistant locked and limited conversation to collaborators Oct 20, 2017