Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fresh single-host install: "etcdserver: publish error: etcdserver: request timed out" and "Error: context deadline exceeded" #9670

Closed
cerebrate opened this issue May 1, 2018 · 17 comments

Comments

@cerebrate
Copy link

I've just started playing with etcd, so I've attempted to install the latest release (at this time, 3.3.4) on a single host (actually, an LXC container) to experiment with. Trouble is, I'm getting this error ("etcdserver: publish error: etcdserver: request timed out") showing up repeatedly in the log once the server is started, and attempting to use the etcd with, for example, etcdctl put foo bar produces the error "Error: context deadline exceeded".

The log is as follows:

2018-04-30 22:54:30.714439 I | etcdmain: Loading server configuration from "/etc/etcd.conf"
2018-04-30 22:54:30.716094 I | etcdmain: etcd Version: 3.3.4
2018-04-30 22:54:30.716119 I | etcdmain: Git SHA: fdde8705f
2018-04-30 22:54:30.716132 I | etcdmain: Go Version: go1.9.5
2018-04-30 22:54:30.716142 I | etcdmain: Go OS/Arch: linux/amd64
2018-04-30 22:54:30.716153 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2018-04-30 22:54:30.718055 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-04-30 22:54:30.718199 I | embed: listening for peers on http://0.0.0.0:2380
2018-04-30 22:54:30.718218 I | embed: pprof is enabled under /debug/pprof
2018-04-30 22:54:30.718254 I | embed: listening for client requests on 0.0.0.0:2379
2018-04-30 22:54:30.743954 I | etcdserver: name = arkane
2018-04-30 22:54:30.743980 I | etcdserver: data dir = /mnt/nas/etcd/arkane.etcd
2018-04-30 22:54:30.743993 I | etcdserver: member dir = /mnt/nas/etcd/arkane.etcd/member
2018-04-30 22:54:30.744005 I | etcdserver: heartbeat = 100ms
2018-04-30 22:54:30.744015 I | etcdserver: election = 1000ms
2018-04-30 22:54:30.744024 I | etcdserver: snapshot count = 10000
2018-04-30 22:54:30.744045 I | etcdserver: advertise client URLs = http://etcd.arkane-systems.lan:2379
2018-04-30 22:54:30.750161 I | etcdserver: restarting member da1dd496f65aadf6 in cluster 4164659cc46c0e7a at commit index 0
2018-04-30 22:54:30.750204 I | raft: da1dd496f65aadf6 became follower at term 0
2018-04-30 22:54:30.750220 I | raft: newRaft da1dd496f65aadf6 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-04-30 22:54:30.759231 W | auth: simple token is not cryptographically signed
2018-04-30 22:54:30.761159 I | etcdserver: starting server... [version: 3.3.4, cluster version: to_be_decided]
2018-04-30 22:54:37.761739 E | etcdserver: publish error: etcdserver: request timed out
2018-04-30 22:54:44.762028 E | etcdserver: publish error: etcdserver: request timed out
2018-04-30 22:54:51.762172 E | etcdserver: publish error: etcdserver: request timed out
2018-04-30 22:54:58.762419 E | etcdserver: publish error: etcdserver: request timed out
2018-04-30 22:55:05.762626 E | etcdserver: publish error: etcdserver: request timed out
2018-04-30 22:55:12.762856 E | etcdserver: publish error: etcdserver: request timed out

...and the configuration file I am using is as follows:

# This is the configuration file for the etcd server.

# Human-readable name for this member.
name: 'arkane'

# Path to the data directory.
data-dir: /mnt/nas/etcd/arkane.etcd

# Number of committed transactions to trigger a snapshot to disk.
snapshot-count: 10000

# List of comma separated URLs to listen on for peer traffic.
listen-peer-urls: http://0.0.0.0:2380

# List of comma separated URLs to listen on for client traffic.
listen-client-urls: http://0.0.0.0:2379

# List of this member's peer URLs to advertise to the rest of the cluster.
# The URLs needed to be a comma-separated list.
initial-advertise-peer-urls: http://etcd.arkane-systems.lan:2380

# List of this member's client URLs to advertise to the public.
# The URLs needed to be a comma-separated list.
advertise-client-urls: http://etcd.arkane-systems.lan:2379

# Initial cluster token for the etcd cluster during bootstrap.
initial-cluster-token: 'etcd-cluster'

# Initial cluster state ('new' or 'existing').
initial-cluster-state: 'new'

# Enable runtime profiling data via HTTP server
enable-pprof: true

# Enable debug-level logging for etcd.
debug: true

# Specify a particular log level for each etcd package (eg: 'etcdmain=CRITICAL,etcdserver=DEBUG'.
log-package-levels:

# Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.
log-outputs: [default]

# Force to create a new one member cluster.
force-new-cluster: false

auto-compaction-mode: periodic
auto-compaction-retention: "1"

Curiously, netstat -a does not seem to show any listening sockets on tcp ports 2379 and 2380 for IPv4, only for IPv6:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 *:netbios-ssn           *:*                     LISTEN
tcp        0      0 *:ssh                   *:*                     LISTEN
tcp        0      0 localhost:smtp          *:*                     LISTEN
tcp        0      0 localhost:6010          *:*                     LISTEN
tcp        0      0 *:microsoft-ds          *:*                     LISTEN
tcp        0      0 etcd.arkane-systems:ssh ATHENA.arkane-sys:51219 ESTABLISHED
tcp6       0      0 [::]:2379               [::]:*                  LISTEN
tcp6       0      0 [::]:netbios-ssn        [::]:*                  LISTEN
tcp6       0      0 [::]:2380               [::]:*                  LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 localhost:smtp          [::]:*                  LISTEN
tcp6       0      0 localhost:6010          [::]:*                  LISTEN
tcp6       0      0 [::]:microsoft-ds       [::]:*                  LISTEN
tcp6       0      0 fdc9:b01a:9d26:0::35328 fdc9:b01a::microsoft-ds ESTABLISHED

Any ideas as to what might be going on?

@gyuho
Copy link
Contributor

gyuho commented May 2, 2018

Does this still happen without LXC?

@cerebrate
Copy link
Author

Hm. Unfortunately, I'm not sure right now, because all the test hosts I have access to in the lab are LXCs. I'll dig up some spare hardware and see what I can figure out.

@cerebrate
Copy link
Author

Okay. Looks like it doesn't happen without LXC. Using the same config file (save for adjusting hostnames and data-dir path appropriately for the different host), the log looks like:

2018-05-03 05:33:30.341643 I | etcdmain: Loading server configuration from "/etc/etcd.conf"
2018-05-03 05:33:30.342871 I | etcdmain: etcd Version: 3.3.4
2018-05-03 05:33:30.343007 I | etcdmain: Git SHA: fdde8705f
2018-05-03 05:33:30.343105 I | etcdmain: Go Version: go1.9.5
2018-05-03 05:33:30.343209 I | etcdmain: Go OS/Arch: linux/amd64
2018-05-03 05:33:30.343301 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-05-03 05:33:30.343503 I | embed: listening for peers on http://0.0.0.0:2380
2018-05-03 05:33:30.343595 I | embed: pprof is enabled under /debug/pprof
2018-05-03 05:33:30.343704 I | embed: listening for client requests on 0.0.0.0:2379
2018-05-03 05:33:30.355475 I | pkg/netutil: resolving x-ubuntu.arkane-systems.lan:2380 to 172.16.1.14:2380
2018-05-03 05:33:30.356079 I | pkg/netutil: resolving x-ubuntu.arkane-systems.lan:2380 to 172.16.1.14:2380
2018-05-03 05:33:30.356423 I | etcdserver: name = arkane
2018-05-03 05:33:30.356571 I | etcdserver: data dir = /home/ubuntu/arkane.etcd
2018-05-03 05:33:30.356714 I | etcdserver: member dir = /home/ubuntu/arkane.etcd/member
2018-05-03 05:33:30.356841 I | etcdserver: heartbeat = 100ms
2018-05-03 05:33:30.356967 I | etcdserver: election = 1000ms
2018-05-03 05:33:30.357134 I | etcdserver: snapshot count = 10000
2018-05-03 05:33:30.357284 I | etcdserver: advertise client URLs = http://x-ubuntu.arkane-systems.lan:2379
2018-05-03 05:33:30.357436 I | etcdserver: initial advertise peer URLs = http://x-ubuntu.arkane-systems.lan:2380
2018-05-03 05:33:30.357571 I | etcdserver: initial cluster = arkane=http://x-ubuntu.arkane-systems.lan:2380
2018-05-03 05:33:30.374677 I | etcdserver: starting member 644c1adcf43ff14c in cluster 4aaf6489aa151969
2018-05-03 05:33:30.374897 I | raft: 644c1adcf43ff14c became follower at term 0
2018-05-03 05:33:30.375046 I | raft: newRaft 644c1adcf43ff14c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2018-05-03 05:33:30.375171 I | raft: 644c1adcf43ff14c became follower at term 1
2018-05-03 05:33:30.401136 W | auth: simple token is not cryptographically signed
2018-05-03 05:33:30.425528 I | etcdserver: starting server... [version: 3.3.4, cluster version: to_be_decided]
2018-05-03 05:33:30.429608 I | etcdserver: 644c1adcf43ff14c as single-node; fast-forwarding 9 ticks (election ticks 10)
2018-05-03 05:33:30.430412 I | etcdserver/membership: added member 644c1adcf43ff14c [http://x-ubuntu.arkane-systems.lan:2380] to cluster 4aaf6489aa151969
2018-05-03 05:33:30.776047 I | raft: 644c1adcf43ff14c is starting a new election at term 1
2018-05-03 05:33:30.776507 I | raft: 644c1adcf43ff14c became candidate at term 2
2018-05-03 05:33:30.776810 I | raft: 644c1adcf43ff14c received MsgVoteResp from 644c1adcf43ff14c at term 2
2018-05-03 05:33:30.777101 I | raft: 644c1adcf43ff14c became leader at term 2
2018-05-03 05:33:30.777430 I | raft: raft.node: 644c1adcf43ff14c elected leader 644c1adcf43ff14c at term 2
2018-05-03 05:33:30.778345 I | etcdserver: setting up the initial cluster version to 3.3
2018-05-03 05:33:30.778870 I | etcdserver: published {Name:arkane ClientURLs:[http://x-ubuntu.arkane-systems.lan:2379]} to cluster 4aaf6489aa151969
2018-05-03 05:33:30.779573 I | embed: ready to serve client requests
2018-05-03 05:33:30.780010 E | etcdmain: forgot to set Type=notify in systemd service file?
INFO: 2018/05/03 05:33:30 dialing to target with scheme: ""
INFO: 2018/05/03 05:33:30 could not get resolver for scheme: ""
2018-05-03 05:33:30.783342 N | embed: serving insecure client requests on [::]:2379, this is strongly discouraged!
2018-05-03 05:33:30.784529 N | etcdserver/membership: set the initial cluster version to 3.3
2018-05-03 05:33:30.784870 I | etcdserver/api: enabled capabilities for version 3.3

@cerebrate
Copy link
Author

Okay, that gave me an idea and I realized I've been barking up the wrong tree.

It's not the LXC that does it; it's the location of the data directory. /mnt/nas is a CIFS filesystem mounted from our big fileserver. Whether in-LXC or not, putting the data directory on the network filesystem triggers the error, and putting it locally does not.

@gyuho
Copy link
Contributor

gyuho commented May 3, 2018

@cerebrate

the data directory on the network filesystem triggers the error

What error do you see from etcd? All etcd needs is write access to the filesystem whether it's mounted locally or over the network. I don't think this is an etcd issue. If you can provide an easy way to reproduce, we can take a look.

Please reopen or create a new one if you still have issues.

Thanks.

@gyuho gyuho closed this as completed May 3, 2018
@cerebrate
Copy link
Author

@gyuho The error in question is the etcdserver: publish error: etcdserver: request timed out error as seen in the logs above.

Also, it appears that etcd does have write access to the filesystem, as the data directory with its subdirectories (member, with subdirectories in turn snap and wal) are created successfully as are the files in them, appearing identical to those in a local data directory. File permissions back this up, and while etcd's running, these files show as opened in lsof.

As such, and given that I haven't had other issues reading or writing the mounted filesystem so far, I'm fairly confident there's something etcd-specific going on.

All I have to do to reproduce this is:

1, Install etcd (latest 3.3.4 release) on a freshly-installed and updated Ubuntu Xenial.
2. Mount a CIFS share, using a user that has full permissions to the share and all folders therein:

mount -t cifs //mnemosyne.arkane-systems.lan/Antlets /mnt/nas -o username=etcd,cifsacl,sec=ntlm

  1. Start etcd with the data directory set to that share:

etcd --data-dir /mnt/nas/etcddir

...hm. Okay, on the Ubuntu test host, when it's creating the data directory on first run, it produces a different error (which did not occur in the LXC):

etcdserver: create wal error: sync /mnt/nas/etcddir/member: invalid argument

Running it a second time immediately (or subsequently) reverts to the former error as seen above. (This might suggest this is related to #6984 ?) I've attached straces of both the first and second run here:

etctrace1.txt
etctrace2.txt

I believe the error on first-run is relating to the known issue that the CIFS filesystem doesn't support fsync() on directories (always returning EINVAL); not sure about the second and subsequent.

@cerebrate
Copy link
Author

(Evidently can't reopen this myself; I'll not create a new one for now to avoid repo spam.)

@hexfusion hexfusion reopened this May 3, 2018
@gyuho
Copy link
Contributor

gyuho commented May 3, 2018

@cerebrate Try this on your mounted disk?

package main

import (
	"fmt"
	"os"
)

func main() {
	p := "aaa.txt"
	f, err := os.OpenFile(p, os.O_WRONLY|os.O_CREATE, 0600)
	if err != nil {
		panic(err)
	}
	fmt.Println("sync:", f.Sync())
}

And see if it returns the same error message?

@cerebrate
Copy link
Author

@gyuho No, doesn't seem so:

root@x-ubuntu:/mnt/nas/etcd# /home/ubuntu/test
sync: <nil>

@gyuho
Copy link
Contributor

gyuho commented May 3, 2018

@cerebrate

I believe the error on first-run is relating to the known issue that the CIFS filesystem doesn't support fsync() on directories (always returning EINVAL); not sure about the second and subsequent.

To verify this, let me add more detailed error logging around fsync code paths.

@gyuho
Copy link
Contributor

gyuho commented May 3, 2018

@cerebrate Could you try again with current master branch? To build, you need Go 1.10+. You can just run ./build script or make build from repository root.

Also need to enable new logger for more detailed error message:

etcd --data-dir /mnt/nas/etcddir --logger zap

Thanks!

@cerebrate
Copy link
Author

Sure, can do.

First run:

{"level":"info","ts":1525392195.6629446,"caller":"etcdmain/etcd.go:85","msg":"starting etcd","etcd-version":"3.3.0+git","git-sha":"2ad0acd","go-version":"go1.10","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2}
{"level":"info","ts":1525392195.66423,"caller":"embed/etcd.go:115","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1525392195.6655223,"caller":"embed/etcd.go:125","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":1525392195.7386835,"caller":"etcdserver/backend.go:65","msg":"opened backend db","path":"/mnt/nas/etcddir/member/snap/db","took":0.004826483}
{"level":"info","ts":1525392195.7392898,"caller":"etcdserver/config.go:292","msg":"server configuration","name":"default","data-dir":"/mnt/nas/etcddir","member-dir":"/mnt/nas/etcddir/member","dedicated-wal-dir":"","force-new-cluster":false,"heartbeat-tick-ms":100,"heartbeat-interval":"100ms","election-tick-ms":10,"election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"advertise-client-urls":["http://localhost:2379"],"initial-advertise-peer-urls":["http://localhost:2380"],"initial":true,"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":0,"auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":1525392195.8071294,"caller":"wal/wal.go:252","msg":"releasing flock to rename","from":"/mnt/nas/etcddir/member/wal.tmp","to":"/mnt/nas/etcddir/member/wal"}
{"level":"warn","ts":1525392195.8541706,"caller":"wal/wal.go:202","msg":"failed to fsync the parent data directory file","parent-dir-path":"/mnt/nas/etcddir/member","dir-path":"/mnt/nas/etcddir/member/wal","error":"sync /mnt/nas/etcddir/member: invalid argument"}
{"level":"fatal","ts":1525392195.85429,"caller":"etcdserver/raft.go:430","msg":"failed to create WAL","error":"sync /mnt/nas/etcddir/member: invalid argument","stacktrace":"github.com/coreos/etcd/etcdserver.startNode\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/etcdserver/raft.go:430\ngithub.com/coreos/etcd/etcdserver.NewServer\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/etcdserver/server.go:390\ngithub.com/coreos/etcd/embed.StartEtcd\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/embed/etcd.go:198\ngithub.com/coreos/etcd/etcdmain.startEtcd\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/etcdmain/etcd.go:314\ngithub.com/coreos/etcd/etcdmain.startEtcdOrProxyV2\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/etcdmain/etcd.go:172\ngithub.com/coreos/etcd/etcdmain.Main\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/etcdmain/main.go:47\nmain.main\n\t/home/ubuntu/go_projects/src/github.com/coreos/etcd/main.go:28\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:198"}

Second run:

{"level":"info","ts":1525392284.8419576,"caller":"etcdmain/etcd.go:85","msg":"starting etcd","etcd-version":"3.3.0+git","git-sha":"2ad0acd","go-version":"go1.10","go-os":"linux","go-arch":"amd64","max-cpu-set":2,"max-cpu-available":2}
{"level":"info","ts":1525392284.8526986,"caller":"etcdmain/etcd.go:146","msg":"server has been already initialized","data-dir":"/mnt/nas/etcddir","dir-type":"member"}
{"level":"info","ts":1525392284.8529255,"caller":"embed/etcd.go:115","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1525392284.854151,"caller":"embed/etcd.go:125","msg":"configuring client listeners","listen-client-urls":["http://localhost:2379"]}
{"level":"info","ts":1525392284.8892043,"caller":"etcdserver/backend.go:65","msg":"opened backend db","path":"/mnt/nas/etcddir/member/snap/db","took":0.003343559}
{"level":"info","ts":1525392284.9069676,"caller":"etcdserver/config.go:292","msg":"server configuration","name":"default","data-dir":"/mnt/nas/etcddir","member-dir":"/mnt/nas/etcddir/member","dedicated-wal-dir":"","force-new-cluster":false,"heartbeat-tick-ms":100,"heartbeat-interval":"100ms","election-tick-ms":10,"election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"advertise-client-urls":["http://localhost:2379"],"initial-advertise-peer-urls":["http://localhost:2380"],"initial":false,"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":0,"auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":1525392285.0336146,"caller":"etcdserver/raft.go:496","msg":"restarting local member","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","commit-index":0}
{"level":"info","ts":1525392285.0345345,"caller":"raft/raft.go:656","msg":"8e9e05c52164694d became follower at term 0"}
{"level":"info","ts":1525392285.0346575,"caller":"raft/raft.go:364","msg":"newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"}
{"level":"warn","ts":1525392285.0477502,"caller":"auth/store.go:1245","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":1525392285.0501785,"caller":"etcdserver/quota.go:83","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":1525392285.052591,"caller":"embed/etcd.go:209","msg":"configured CORS","cors":["*"]}
{"level":"info","ts":1525392285.0526836,"caller":"embed/etcd.go:221","msg":"configured host whitelist","hosts":["*"]}
{"level":"info","ts":1525392285.0527534,"caller":"etcdserver/server.go:736","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.3.0+git","cluster-version":"to_be_decided"}
{"level":"info","ts":1525392285.0529552,"caller":"etcdserver/quota.go:83","msg":"enabled backend quota with default value","quota-name":"kv","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":1525392285.0531054,"caller":"etcdserver/quota.go:83","msg":"enabled backend quota with default value","quota-name":"lease","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":1525392285.0554845,"caller":"etcdserver/server.go:644","msg":"starting initial election tick advance","election-ticks":10}
{"level":"info","ts":1525392285.0621436,"caller":"embed/etcd.go:253","msg":"now serving peer/client/metrics","local-member-id":"8e9e05c52164694d","initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["http://localhost:2379"],"listen-metrics-urls":[]}
{"level":"info","ts":1525392285.0622025,"caller":"embed/etcd.go:493","msg":"serving peer traffic","address":"127.0.0.1:2380"}
{"level":"warn","ts":1525392292.056707,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
{"level":"warn","ts":1525392299.0574567,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
{"level":"warn","ts":1525392306.0580823,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
{"level":"warn","ts":1525392313.0589206,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
{"level":"warn","ts":1525392320.060187,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
{"level":"warn","ts":1525392327.0606666,"caller":"etcdserver/server.go:1799","msg":"failed to publish local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://localhost:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","publish-timeout":7,"error":"etcdserver: request timed out"}
^C

@gyuho
Copy link
Contributor

gyuho commented May 4, 2018

@cerebrate

{"level":"warn","ts":1525392195.8541706,"caller":"wal/wal.go:202","msg":"failed to fsync the parent data directory file","parent-dir-path":"/mnt/nas/etcddir/member","dir-path":"/mnt/nas/etcddir/member/wal","error":"sync /mnt/nas/etcddir/member: invalid argument"}

https://github.com/coreos/etcd/blob/2ad0acdea8bb6cc820256e8c2f6f87547303f28b/wal/wal.go#L200-L210

This verifies that:

I believe the error on first-run is relating to the known issue that the CIFS filesystem doesn't support fsync() on directories (always returning EINVAL); not sure about the second and subsequent.

Not sure how etcd would work around this...

@cerebrate
Copy link
Author

As far as I can tell from my digging, other projects that have had this issue (with CIFS and other network file systems) work around it by logging a warning telling you to set the cache=none option and ignoring the EINVAL return from fsync().

That'd certainly work for me, seeing as the NAS in question has all the battery backup it needs to prevent data loss at its end, but I can understand if you don't want to do that. On the other hand, my other option is to use NFS, which on Linux supports but dummies out fsync() on directories anyway (per https://github.com/torvalds/linux/blob/master/fs/nfs/dir.c#L928-L931 ), so it's not like using the network file system that supposedly supports fsync() would gain anyone anything.

@gyuho
Copy link
Contributor

gyuho commented May 4, 2018

As far as I can tell from my digging, other projects that have had this issue (with CIFS and other network file systems) work around it by logging a warning telling you to set the cache=none option and ignoring the EINVAL return from fsync().

Given that we will keep fsync on directory, can you help document CIFS behavior around https://github.com/coreos/etcd/blob/master/Documentation/faq.md?

@cerebrate
Copy link
Author

Sure, I can do that.

@gyuho gyuho closed this as completed May 17, 2018
@hasethuraman
Copy link

@cerebrate and @gyuho
Could you please let me know if fsync limitation in cifs with cache enabled is still there today atleast in etcd:3.4.15-debian-10-r43?

Because when I tried to bring up just etcd server (in above mentioned version) on cifs without cache, I get the below error.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants