Refactor hinted-handoff service #4516

Merged: 6 commits merged into master from hh_processor_per_node on Oct 27, 2015

Conversation

@otoolep (Contributor) commented Oct 20, 2015

This change refactors the hinted-handoff service.

Whereas previously the system repeatedly launched a short-lived goroutine per active node (to drain all HH data for that node), it now launches a single long-running goroutine per active node. That goroutine keeps running until the node processor is purged because the node was marked inactive (which may never happen). A node is considered active until it is DROPped.

Giving each target node a dedicated, long-running goroutine makes it easier to deal with some HH edge cases.
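
As a rough sketch of that shape (hypothetical names and heavily simplified logic, not the code in this PR): each target node gets one processor whose goroutine loops until the processor is purged, draining queued writes and forwarding them to the node, so the goroutine persists across retries instead of being re-spawned for every drain attempt.

package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch only; names and details do not match the PR's code.
// One long-running goroutine per target node drains that node's queue and
// exits only when the processor is purged (the node was dropped).
type nodeProcessor struct {
	nodeID uint64
	queue  chan string   // stand-in for the per-node on-disk queue
	done   chan struct{} // closed when the processor is purged
}

func (p *nodeProcessor) run() {
	for {
		select {
		case <-p.done:
			return // node dropped; the goroutine finally exits
		case w := <-p.queue:
			// Forward the hinted write to the target node; real code
			// would retry with exponential backoff on error.
			fmt.Printf("node %d: forwarding %q\n", p.nodeID, w)
		}
	}
}

func main() {
	p := &nodeProcessor{nodeID: 2, queue: make(chan string, 8), done: make(chan struct{})}
	go p.run()
	p.queue <- "cpu value=1"
	time.Sleep(100 * time.Millisecond) // give the drain goroutine time to run
	close(p.done)                      // simulate the node being dropped
}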

The change also cleans up stats and adds HH diagnostics.

> show diagnostics for 'hh'
name: hh
--------
node    active  last modified                           head                            tail
2       yes     2015-10-26T16:33:48.579769662-07:00     /tmp/influxdb8086/hh/2/1:0      /tmp/influxdb8086/hh/2/1:43
3       yes     2015-10-26T16:34:02.055769467-07:00     /tmp/influxdb8086/hh/3/1:0      /tmp/influxdb8086/hh/3/1:2029


> show stats
name: hh
tags: path=/tmp/influxdb8086/hh
wr_shard_req    wr_shard_req_points
------------    -------------------
154             154

name: hh_processor
tags: node=3, path=/tmp/influxdb8086/hh/3
wr_node_req_fail        wr_shard_req    wr_shard_req_points
----------------        ------------    -------------------
1                       154             154

Multiple rounds of testing were done with 3 nodes, with the 2 nodes not receiving the write requests killed every 10 seconds, multiple times. After these 2 nodes were finally brought back online and kept online, they received all the missed writes.

@otoolep force-pushed the hh_processor_per_node branch 15 times, most recently from a105c9a to ff8291f on October 21, 2015 04:01
@otoolep changed the title Initial NodeProcessor code → Refactor HH service on Oct 21, 2015
@otoolep changed the title Refactor HH service → Refactor HH service (WIP) on Oct 21, 2015
@otoolep force-pushed the hh_processor_per_node branch from ff8291f to 623b763 on October 21, 2015 04:14
@otoolep changed the title Refactor HH service (WIP) → Refactor hinted-handoff service (WIP) on Oct 21, 2015
@otoolep force-pushed the hh_processor_per_node branch from 623b763 to 5ad30bd on October 21, 2015 04:25

# Interval between running checks for data that should be purged. Data is purged from
# hinted-handoff queues for two reasons. 1) The data is older than the max age, or
# 2) the target node has been dropped from the server. Data is never dropped until
pauldix (Member):
target node? Do you mean shard?

otoolep (Contributor Author):
HH queues (which are wrapped in a NodeProcessor) are, and always have been, on a per-node basis. By node I mean an InfluxDB process, server, etc., whatever word makes sense.

pauldix (Member):
Then I don't understand what "target node has been dropped from the server" means. That's "target server has been dropped from the server"? Do you actually mean "the target server has been dropped from the cluster"?

otoolep (Contributor Author):
Yes I do, that is a typo.

On Friday, October 23, 2015, Paul Dix wrote, quoting the question above and this section of etc/config.sample.toml (#4516 (comment)):

enabled = true
dir = "/var/opt/influxdb/hh"
max-size = 1073741824
max-age = "168h"
retry-rate-limit = 0

# Hinted handoff will start retrying writes to down nodes at a rate of once per second.
# If any error occurs, it will backoff in an exponential manner, until the interval
# reaches retry-max-interval. Once writes to all nodes are successfully completed the
# interval will reset to retry-interval.
retry-interval = "1s"
retry-max-interval = "1m"

# Interval between running checks for data that should be purged. Data is purged from
# hinted-handoff queues for two reasons. 1) The data is older than the max age, or
# 2) the target node has been dropped from the server. Data is never dropped until

@pauldix (Member) commented Oct 22, 2015

What is a node? I see it referenced a few times, but I only see a NodeProcessor. Is there something I'm missing?

@otoolep (Contributor Author) commented Oct 22, 2015

A node is a distinct InfluxDB server, i.e. a node in a cluster.

@otoolep force-pushed the hh_processor_per_node branch 3 times, most recently from d12ceec to 574a108 on October 23, 2015 19:22
@otoolep force-pushed the hh_processor_per_node branch from 574a108 to 3c9924c on October 26, 2015 21:34
@otoolep force-pushed the hh_processor_per_node branch from 3c9924c to 8e18c18 on October 26, 2015 21:37
@otoolep changed the title Refactor hinted-handoff service (WIP) → Refactor hinted-handoff service on Oct 26, 2015
@otoolep force-pushed the hh_processor_per_node branch 2 times, most recently from 5f091d0 to 605ecaf on October 26, 2015 22:50
@otoolep (Contributor Author) commented Oct 26, 2015

@jwilder

@otoolep (Contributor Author) commented Oct 26, 2015

Re-kicked the tests on Circle; the failure was an unrelated Raft issue.

@otoolep force-pushed the hh_processor_per_node branch 5 times, most recently from bd60ccf to 2e57816 on October 26, 2015 23:56
@otoolep force-pushed the hh_processor_per_node branch 4 times, most recently from edafdd3 to f1f9fc3 on October 27, 2015 01:54
A NodeProcessor wraps an on-disk queue and the goroutine that attempts
to drain that queue and send the data to the associated target node.
@otoolep force-pushed the hh_processor_per_node branch from f1f9fc3 to f38c536 on October 27, 2015 02:00
if s.closing != nil {
close(s.closing)
}
s.wg.Wait()
s.closing = nil
jwilder (Contributor):
Setting this to nil prevents existing goroutines from selecting it. That has created issues in other parts of the code.

otoolep (Contributor Author):
Yeah, I've seen this pattern introduced recently and it made sense to me.

So the nil state of this variable is used to check the "open" state of the service. It sounds like I would then need to use a distinct boolean instead.

@jwilder (Contributor) commented Oct 27, 2015

Not sure if the nil channels will be a problem in this code or not. They did cause problems in other places where they were supposed to abort early from a function call, but did not because they were set to nil after being closed. Up to you.

👍

@otoolep (Contributor Author) commented Oct 27, 2015

The use of closing right now should be OK; future enhancements may come.
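
For reference, a minimal standalone sketch (hypothetical names, not the service code) of the channel behaviour being discussed: receiving from a closed channel succeeds immediately, while receiving from a nil channel blocks forever, so a select that is meant to abort early never fires once the field has been set to nil.

package main

import "fmt"

// Hypothetical sketch of the nil-channel pitfall; not the actual service code.
type service struct {
	closing chan struct{}
}

func (s *service) shouldAbort() bool {
	select {
	case <-s.closing: // never ready once s.closing has been set to nil
		return true
	default:
		return false
	}
}

func main() {
	s := &service{closing: make(chan struct{})}
	close(s.closing)
	fmt.Println(s.shouldAbort()) // true: a closed channel is always ready to receive

	s.closing = nil // mirrors setting s.closing = nil after close
	fmt.Println(s.shouldAbort()) // false: a nil channel is never ready
}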

otoolep added a commit that referenced this pull request Oct 27, 2015
@otoolep merged commit 335e432 into master on Oct 27, 2015
@otoolep deleted the hh_processor_per_node branch on October 27, 2015 21:43