InfluxDB send retry after IOError #7247

Closed
wants to merge 6 commits into from

Conversation

janLo
Contributor

@janLo janLo commented Apr 23, 2017

Description:

This adds an optional max_retries parameter to the InfluxDB component
to specify if, and how often, the component should retry sending the data
if the connection fails due to an IOError.

The sending will be scheduled for a retry in 20 seconds, as often as the
user specified. This can be handy for flaky network connections between
the DB and Home Assistant, or for outages like daily DSL reconnects.

Pull request in home-assistant.github.io with documentation (if applicable): home-assistant/home-assistant.io#2485

Example entry for configuration.yaml (if applicable):

influxdb:
  host: 192.168.1.190
  port: 20000
  database: DB_TO_STORE_EVENTS
  username: MY_USERNAME
  password: MY_PASSWORD
  ssl: true
  verify_ssl: true
  max_retries: 3
  default_measurement: state
  blacklist:
     - entity.id1
     - entity.id2
  whitelist:
     - entity.id3
     - entity.id4
  tags:
    instance: prod
    source: hass

Checklist:

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • Local tests with tox run successfully. Your PR cannot be merged unless tests pass
  • Tests have been added to verify that the new code works.

@mention-bot

@janLo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @lwis, @fabaff and @kk7ds to be potential reviewers.

@pvizeli
Member

pvizeli commented Apr 25, 2017

Please rebase it

@janLo
Contributor Author

janLo commented Apr 25, 2017

Rebased. :)

Member

@pvizeli pvizeli left a comment

I don't like that. I think this functionality needs to be handled on the API side and not inside Home Assistant; I think that would be the wrong place for it. The code looks okay, but it makes things more complex than they should be.

@janLo
Contributor Author

janLo commented Apr 26, 2017

@pvizeli So do you suggest maintaining a fork of the official InfluxDB API? I don't think that this will go upstream, especially as it would mean that the API has to be aware of Home Assistant's scheduling system.

@MartinHjelmare
Member

MartinHjelmare commented Apr 26, 2017

I don't think this belongs in the influxdb component either. It's a matter of connection reliability, not something specific to InfluxDB.

Would it be possible to abstract it into a helper decorator, similar to how we have a throttle decorator, but for retries? That would then be reusable all over Home Assistant and have the same potential configuration.

@janLo
Contributor Author

janLo commented Apr 26, 2017

It would be easy. It just takes away the possibility of having specific log messages - and I'm not sure if it's really worth having a decorator for that. Anyway, it could look like this:

In the influx component:

    @RetryOnError(hass, retry_limit=max_tries, retry_delay=20)
    def _write_data(json_body):
        try:
            influx.write_points(json_body)
        except exceptions.InfluxDBClientError:
            _LOGGER.exception("Error saving event %s to InfluxDB", json_body)

In util:

import datetime
import functools

import homeassistant.util.dt as dt_util
from homeassistant.helpers.event import track_point_in_utc_time


class RetryOnError(object):
    """Retry the decorated method after a delay when it raises."""

    def __init__(self, hass, retry_limit=0, retry_delay=20):
        self.hass = hass
        self.retry_limit = retry_limit
        self.retry_delay = datetime.timedelta(seconds=retry_delay)

    def __call__(self, method):

        @functools.wraps(method)
        def wrapper(*args, **kwargs):

            def scheduled(retry=0, event=None):
                try:
                    method(*args, **kwargs)
                except Exception:
                    target = dt_util.utcnow() + self.retry_delay
                    if retry == self.retry_limit:
                        raise
                    track_point_in_utc_time(self.hass,
                                            functools.partial(scheduled,
                                                              retry + 1),
                                            target)

            scheduled()

        return wrapper

It still doesn't look very nice, it's more code, and it lacks specific log messages.
Shall I go for that or leave it?

@MartinHjelmare
Member

I like the helper decorator approach better. But I think @balloob should have a say.

@pvizeli
Member

pvizeli commented Apr 27, 2017

My point is that we have roughly 300-400 components/platforms that could have the same problem. If we need ~30 lines per component/platform for this, that gives about 10,500 lines that we would need to maintain, and it fills our bus and job handler with data, which can create new problems. In the end we cannot handle that; an API can handle it better, and if the API does not support it, there may be a reason for that...

self.assertEqual(mock_method.call_count, 1)


mock_method.side_effect = Exception()

too many blank lines (2)

@@ -97,7 +103,7 @@ def influx_event_listener(event):
state = event.data.get('new_state')
if state is None or state.state in (
STATE_UNKNOWN, '', STATE_UNAVAILABLE) or \
state.entity_id in blacklist:
state.entity_id in blacklist:

continuation line over-indented for visual indent


import functools

'functools' imported but unused

@@ -5,13 +5,16 @@
https://home-assistant.io/components/influxdb/
"""
import logging
import datetime

'datetime' imported but unused

@janLo
Contributor Author

janLo commented Apr 27, 2017

Rebased and refactored the retry logic into a RetryOnError decorator in homeassistant.util (including unit tests).

janLo added 3 commits April 27, 2017 20:30
This adds an optional max_retries parameter to the InfluxDB component
to specify if and how often the component should try to send the data
if the connection failed due to an IOError.

The sending will be scheduled for a retry in 20 seconds as often as the
user specified. This can be handy for flaky network connections between
the DB and Home Assistant or outages like daily DSL reconnects.

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
This replaces the scheduling logic in the InfluxDB component with the
RetryOnError decorator from homeassistant.util

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
@janLo
Contributor Author

janLo commented Apr 27, 2017

... and fixed another rebase error.

Signed-off-by: Jan Losinski <losinski@wh2.tu-dresden.de>
@pvizeli
Member

pvizeli commented Apr 28, 2017

Sorry, but that is too weird. I.e. you should only schedule new stuff if no other schedule is pending, otherwise you go back to the old problem. If you schedule stuff endlessly, we run into a memory problem in some cases. Let's make a PR here: https://github.com/influxdata/influxdb-python and fix that at the base. But for now I'll stay silent and wait for what Paulus thinks.

@janLo
Contributor Author

janLo commented Apr 28, 2017

@pvizeli No, I only schedule "max retries" times. I don't see this going upstream in any form, because they don't have any scheduling logic - so blocking would be the only thing they could do, which is not an option I think.

Of course, we can add fancy logic to limit the number of "pending" calls - but I think that would add unnecessary overhead. I have been using this solution for a few weeks without any problems. As the calls are always dropped after some retries, it's also not a problem that it somehow grows in memory.

Member

@balloob balloob left a comment

So I don't think that it should per se have to go upstream into influx Python. It's up to the caller of the API to implement retry logic, as the caller knows what the current conditions are.

I do, however, share Pascal's concern that these retry chains are getting orphaned - we do not track them. I think that a generic decorator is not the right solution, because we would be promoting the wrong approach of just fire-and-forget with a retry. This will hurt us in the long term.

For now I would say that the retry logic should stay in the InfluxDB component. Probably also keep track of how many retries are going on and make sure to start dropping events when the list of pending delayed events gets too big.
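
For illustration, a minimal sketch of what that bookkeeping could look like inside the InfluxDB component. The helper name make_writer, the MAX_PENDING cap, and the pending counter are assumptions for this sketch, not code from this PR; track_point_in_utc_time and dt_util are the same Home Assistant helpers the PR already uses:

    # Illustrative sketch only: bounded, tracked retries for InfluxDB writes.
    from datetime import timedelta
    from functools import partial

    from homeassistant.helpers.event import track_point_in_utc_time
    import homeassistant.util.dt as dt_util

    MAX_PENDING = 1000   # assumed cap on events waiting for a retry
    RETRY_DELAY = 20     # seconds between attempts, as proposed in this PR


    def make_writer(hass, influx, logger, max_retries):
        """Return a write function that retries on IOError and drops on overload."""
        pending = {'count': 0}   # events currently scheduled for a retry

        def write(json_body, retry=0, fire_time=None):
            try:
                influx.write_points(json_body)
                if retry:
                    pending['count'] -= 1   # a retried event finally succeeded
            except IOError:
                if retry >= max_retries or pending['count'] >= MAX_PENDING:
                    if retry:
                        pending['count'] -= 1
                    logger.error("Dropping event after %d attempts", retry + 1)
                    return
                if retry == 0:
                    pending['count'] += 1   # first failure: event becomes pending
                target = dt_util.utcnow() + timedelta(seconds=RETRY_DELAY)
                track_point_in_utc_time(
                    hass, partial(write, json_body, retry + 1), target)

        return write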

# pylint: disable=broad-except
try:
method(*args, **kwargs)
except Exception:
Member

We should never catch generic exceptions. It should always be a specific one. Add it as an argument to the decorator and default it to IOError

Contributor Author

I can change that. I thought it might be good to act on all exceptions that are not handled within the decorated method.

Member

Catching Exception is never a good idea. So many things will go unnoticed, including a user pressing ctrl+c !
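
As a sketch of the change being requested in this thread, the decorator could take the exception types as a parameter and default to IOError; the exceptions parameter name is an assumption, not from this PR:

    import datetime


    class RetryOnError(object):
        """Sketch: retry only on the configured exception types (default IOError)."""

        def __init__(self, hass, retry_limit=0, retry_delay=20,
                     exceptions=(IOError,)):
            self.hass = hass
            self.retry_limit = retry_limit
            self.retry_delay = datetime.timedelta(seconds=retry_delay)
            self.exceptions = exceptions

        # __call__ stays as in the earlier sketch, except the wrapped call
        # catches `self.exceptions` instead of the bare `Exception`.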

@@ -48,6 +49,7 @@
vol.Optional(CONF_DB_NAME, default=DEFAULT_DATABASE): cv.string,
vol.Optional(CONF_PORT): cv.port,
vol.Optional(CONF_SSL): cv.boolean,
vol.Optional(CONF_RETRY_COUNT): cv.positive_int,
Member

There is no default so this value can be 0

Contributor Author

Ok, so I just set a default here too?
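
That would amount to giving the schema entry from the diff above an explicit default of 0, roughly like this sketch (CONF_RETRY_COUNT is the constant added by this PR; its assumed value here matches the max_retries key from the configuration example):

    import voluptuous as vol

    import homeassistant.helpers.config_validation as cv

    CONF_RETRY_COUNT = 'max_retries'  # assumed key, as in the configuration example

    # Sketch: with an explicit default, the value is never absent from the config.
    SCHEMA_EXTENSION = {
        vol.Optional(CONF_RETRY_COUNT, default=0): cv.positive_int,
    }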

in seconds as arguments.
"""

def __init__(self, hass, retry_limit=0, retry_delay=20):
Member

20 seems very high, I think by default it should be 0.

Member

And since 0 is the default, you can make hass an optional argument.

Contributor Author

0 is actually "higher" - as this tells how long to wait before the next retry. So 0 would mean "instant retry" - which makes no sense most of the time.

Contributor Author

If hass is optional, shall we then throw an exception if someone uses this decorator with a value other than 0 and doesn't pass a hass instance?

Member

Yes
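
Putting this sub-thread together, the constructor could then look roughly like the following sketch, keeping the 20-second delay from the PR (the error type and message are assumptions):

    import datetime


    class RetryOnError(object):
        """Sketch: hass is optional, but required as soon as retries are enabled."""

        def __init__(self, hass=None, retry_limit=0, retry_delay=20):
            if retry_limit > 0 and hass is None:
                raise ValueError(
                    "A hass instance is required when retry_limit is non-zero")
            self.hass = hass
            self.retry_limit = retry_limit
            self.retry_delay = datetime.timedelta(seconds=retry_delay)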

Contributor Author

@janLo janLo left a comment

Ok, so I change it back from the prior request to make a generic decorator and just put it in the component. Is this the final decision?

To argue against the orphaned requests: they do not happen. There is a retry limit, so if you exceed the retry limit, the request gets dropped (as it was before, without any retrying). So if you have 10 requests per second, 20 secs of retry delay and 4 retries, you never have more than 1000 requests queued, as after 100 seconds the first request gets dropped.

So there is never a "getting too big". Of course we can add some heuristics to guess when it is necessary to stop retrying and instead ping whether the server is available again. But I think this would be way too complex for this simple use case.
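
For illustration, the bound from that paragraph works out as follows (the numbers are the ones quoted above, not measurements):

    # Back-of-the-envelope bound on queued retries, using the example numbers above.
    events_per_second = 10   # incoming state changes
    retry_delay = 20         # seconds between attempts
    max_retries = 4          # retries before an event is dropped

    # An event is retried at most max_retries times, each retry_delay apart, so it
    # is dropped roughly (max_retries + 1) * retry_delay seconds after it arrives.
    max_lifetime = (max_retries + 1) * retry_delay   # 100 seconds
    max_queued = events_per_second * max_lifetime    # 1000 events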


@balloob
Member

balloob commented May 2, 2017

The requests are still orphaned, because the influxdb integration has no clue how many requests are still in flight or how many requests will hit the DB in the next X time.

As long as you keep it in influxdb, I'm fine with it.

@balloob
Member

balloob commented Jun 24, 2017

This PR seems to have gone stale. Closing it. You can reopen it when you're ready to finish it.

@balloob balloob closed this Jun 24, 2017
@janLo
Contributor Author

janLo commented Jun 26, 2017

I can't reopen it by myself.

@lwis lwis reopened this Jun 26, 2017
@lwis
Member

lwis commented Jul 8, 2017

Closing this again as there hasn't been any activity since re-opening.

Also is this solved by #8209?

@lwis lwis closed this Jul 8, 2017
@janLo
Contributor Author

janLo commented Jul 10, 2017 via email

@lwis
Member

lwis commented Jul 10, 2017

So we know what's waiting for review, or ready to be merged.

A pull request signifies "My changes are ready, take my changes!", while this is not the case for this PR. I'd suggest continuing to work on the branch, and poke us when you're ready for review.

Perspective: If everyone raised a PR for their unfinished change, it would be a nightmare to triage.

@janLo
Contributor Author

janLo commented Jul 10, 2017

Ok. For me it's dismissing work: "Go away". Especially as I cannot reopen it myself. And I don't want to work on closed PRs.

It works well. It is tested and can be merged; I use it every day. It is in the state "My changes are ready, take my changes!".
Only cosmetic changes were requested - after it was finished and working.

Is this the new policy, to close all working PRs where small changes are requested?

@Landrash
Contributor

@janLo In general, stale PRs are closed to keep the code base and the stream of incoming PRs maintainable. There are quite a few PRs across all the Home Assistant repositories, and the people who review and merge are limited in number and are all doing it in their free time, with limited tools (the GitHub GUI).

Taking it as "go away" is the last thing you should do, since you've gotten valuable feedback on what some of the core maintainers think needs changing to integrate this into HA. There's nothing that prevents you from working on this in your own fork until the requested changes have been made and then simply leaving a reply on this PR to have it reopened at that time.
It's really the minimal amount of effort for everyone involved (submitters and maintainers) that can be done with the limited tools available.
If you have a better way of keeping a project the size of HA organized, I'm sure any feedback on that process would be greatly appreciated (but not in this PR). 😄

@janLo
Contributor Author

janLo commented Jul 17, 2017 via email

@home-assistant home-assistant locked and limited conversation to collaborators Oct 20, 2017