Refactor hinted-handoff service #4516

Merged: 6 commits merged into master from hh_processor_per_node on Oct 27, 2015

Conversation

@otoolep (Contributor) commented Oct 20, 2015

This change refactors the hinted-handoff service.

Whereas previously the system repeatedly launched a short-lived goroutine per active node (to drain all HH data for that node), it now launches a single long-running goroutine per active node. That goroutine keeps running until the node processor is purged because the node was marked inactive (which may never happen). A node is considered active until it is DROPped.

Giving each target node a dedicated, long-running goroutine makes it easier to deal with some HH edge cases.
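
As a rough sketch of that shape (hypothetical names and heavily simplified logic, not the code in this PR): each target node gets one processor whose goroutine loops until the processor is purged, draining queued writes and forwarding them to the node, so the goroutine persists across retries instead of being re-spawned for every drain attempt.

package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch only; names and details do not match the PR's code.
// One long-running goroutine per target node drains that node's queue and
// exits only when the processor is purged (the node was dropped).
type nodeProcessor struct {
	nodeID uint64
	queue  chan string   // stand-in for the per-node on-disk queue
	done   chan struct{} // closed when the processor is purged
}

func (p *nodeProcessor) run() {
	for {
		select {
		case <-p.done:
			return // node dropped; the goroutine finally exits
		case w := <-p.queue:
			// Forward the hinted write to the target node; real code
			// would retry with exponential backoff on error.
			fmt.Printf("node %d: forwarding %q\n", p.nodeID, w)
		}
	}
}

func main() {
	p := &nodeProcessor{nodeID: 2, queue: make(chan string, 8), done: make(chan struct{})}
	go p.run()
	p.queue <- "cpu value=1"
	time.Sleep(100 * time.Millisecond) // give the drain goroutine time to run
	close(p.done)                      // simulate the node being dropped
}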

The change also cleans up stats and adds HH diagnostics.

> show diagnostics for 'hh'
name: hh
--------
node    active  last modified                           head                            tail
2       yes     2015-10-26T16:33:48.579769662-07:00     /tmp/influxdb8086/hh/2/1:0      /tmp/influxdb8086/hh/2/1:43
3       yes     2015-10-26T16:34:02.055769467-07:00     /tmp/influxdb8086/hh/3/1:0      /tmp/influxdb8086/hh/3/1:2029


> show stats
name: hh
tags: path=/tmp/influxdb8086/hh
wr_shard_req    wr_shard_req_points
------------    -------------------
154             154

name: hh_processor
tags: node=3, path=/tmp/influxdb8086/hh/3
wr_node_req_fail        wr_shard_req    wr_shard_req_points
----------------        ------------    -------------------
1                       154             154

Multiple rounds of testing were done with 3 nodes, with the 2 nodes not receiving the write requests killed every 10 seconds, multiple times. After these 2 nodes were finally brought back online and kept online, they received all the missed writes.

@otoolep force-pushed the hh_processor_per_node branch 15 times, most recently from a105c9a to ff8291f on October 21, 2015 04:01
@otoolep changed the title Initial NodeProcessor code → Refactor HH service on Oct 21, 2015
@otoolep changed the title Refactor HH service → Refactor HH service (WIP) on Oct 21, 2015
@otoolep force-pushed the hh_processor_per_node branch from ff8291f to 623b763 on October 21, 2015 04:14
@otoolep changed the title Refactor HH service (WIP) → Refactor hinted-handoff service (WIP) on Oct 21, 2015
@otoolep force-pushed the hh_processor_per_node branch from 623b763 to 5ad30bd on October 21, 2015 04:25

# Interval between running checks for data that should be purged. Data is purged from
# hinted-handoff queues for two reasons. 1) The data is older than the max age, or
# 2) the target node has been dropped from the server. Data is never dropped until
pauldix (Member):
target node? Do you mean shard?

otoolep (Contributor Author):
HH queues (which are wrapped in a NodeProcessor) are, and always have been, on a per-node basis. By node I mean an InfluxDB process, server, etc., whatever word makes sense.

pauldix (Member):
Then I don't understand what "target node has been dropped from the server" means. That's "target server has been dropped from the server"? Do you actually mean "the target server has been dropped from the cluster"?

otoolep (Contributor Author):
Yes I do, that is a typo.

On Friday, October 23, 2015, Paul Dix wrote, quoting the question above and this section of etc/config.sample.toml (#4516 (comment)):

enabled = true
dir = "/var/opt/influxdb/hh"
max-size = 1073741824
max-age = "168h"
retry-rate-limit = 0

# Hinted handoff will start retrying writes to down nodes at a rate of once per second.
# If any error occurs, it will backoff in an exponential manner, until the interval
# reaches retry-max-interval. Once writes to all nodes are successfully completed the
# interval will reset to retry-interval.
retry-interval = "1s"
retry-max-interval = "1m"

# Interval between running checks for data that should be purged. Data is purged from
# hinted-handoff queues for two reasons. 1) The data is older than the max age, or
# 2) the target node has been dropped from the server. Data is never dropped until

@pauldix (Member) commented Oct 22, 2015

What is a node? I see it referenced a few times, but I only see a NodeProcessor. Is there something I'm missing?

@otoolep (Contributor Author) commented Oct 22, 2015

A node is a distinct InfluxDB server, i.e. a node in a cluster.

@otoolep force-pushed the hh_processor_per_node branch 3 times, most recently from d12ceec to 574a108 on October 23, 2015 19:22
@otoolep force-pushed the hh_processor_per_node branch from 574a108 to 3c9924c on October 26, 2015 21:34
@otoolep force-pushed the hh_processor_per_node branch from 3c9924c to 8e18c18 on October 26, 2015 21:37
@otoolep changed the title Refactor hinted-handoff service (WIP) → Refactor hinted-handoff service on Oct 26, 2015
@otoolep force-pushed the hh_processor_per_node branch 2 times, most recently from 5f091d0 to 605ecaf on October 26, 2015 22:50
@otoolep (Contributor Author) commented Oct 26, 2015

@jwilder

@otoolep (Contributor Author) commented Oct 26, 2015

Re-kicked the tests on Circle; the failure was an unrelated Raft issue.

@otoolep force-pushed the hh_processor_per_node branch 5 times, most recently from bd60ccf to 2e57816 on October 26, 2015 23:56
@otoolep force-pushed the hh_processor_per_node branch 4 times, most recently from edafdd3 to f1f9fc3 on October 27, 2015 01:54
A NodeProcessor wraps an on-disk queue and the goroutine that attempts
to drain that queue and send the data to the associated target node.
@otoolep force-pushed the hh_processor_per_node branch from f1f9fc3 to f38c536 on October 27, 2015 02:00
if s.closing != nil {
close(s.closing)
}
s.wg.Wait()
s.closing = nil
jwilder (Contributor):
Setting this to nil prevents existing goroutines from selecting it. That has created issues in other parts of the code.

otoolep (Contributor Author):
Yeah, I've seen this pattern introduced recently and it made sense to me.

So the nil state of this variable is used to check the "open" state of the service. It sounds like I would then need to use a distinct boolean instead.

@jwilder (Contributor) commented Oct 27, 2015

Not sure if the nil channels will be a problem in this code or not. They did cause problems in other places where they were supposed to abort early from a function call, but did not because they were set to nil after being closed. Up to you.

👍

@otoolep (Contributor Author) commented Oct 27, 2015

The use of closing right now should be OK; future enhancements may come.
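
For reference, a minimal standalone sketch (hypothetical names, not the service code) of the channel behaviour being discussed: receiving from a closed channel succeeds immediately, while receiving from a nil channel blocks forever, so a select that is meant to abort early never fires once the field has been set to nil.

package main

import "fmt"

// Hypothetical sketch of the nil-channel pitfall; not the actual service code.
type service struct {
	closing chan struct{}
}

func (s *service) shouldAbort() bool {
	select {
	case <-s.closing: // never ready once s.closing has been set to nil
		return true
	default:
		return false
	}
}

func main() {
	s := &service{closing: make(chan struct{})}
	close(s.closing)
	fmt.Println(s.shouldAbort()) // true: a closed channel is always ready to receive

	s.closing = nil // mirrors setting s.closing = nil after close
	fmt.Println(s.shouldAbort()) // false: a nil channel is never ready
}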

otoolep added a commit that referenced this pull request Oct 27, 2015
@otoolep merged commit 335e432 into master on Oct 27, 2015
@otoolep deleted the hh_processor_per_node branch on October 27, 2015 21:43