RC32: Frequent write errors #2849

Closed
bww opened this issue Jun 9, 2015 · 6 comments

bww commented Jun 9, 2015

After upgrading to RC32 I'm seeing write errors on anywhere from 25% to 75% of requests, for no apparent reason. Sometimes there is no problem for an hour, followed by an extended period of high error rates.

I would expect either all (or essentially all) writes to succeed, or all writes to fail if there were some kind of misconfiguration. A large percentage of writes failing, but not all of them, is concerning.

There isn't any obvious problem with my setup that I can see, and I've had earlier 0.9.x versions running successfully for months on this host.

The service responds with the following opaque error:

write failed

I can't find a way to get the service to trace anything more useful than the HTTP request log line, e.g.:

[http] 2015/06/09 16:36:17 172.31.34.132 - root [09/Jun/2015:16:36:17 +0000] POST /write?consistency=&db=xyz&precision=&rp=default HTTP/1.1 500 36 - Go Metrics a7099361-0ec5-11e5-92c2-000000000000 1.400526ms

I tried setting write-tracing to true in the config to get more diagnostic information, but as far as I can tell it didn't do anything.
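
For reference, this is roughly the change I made. I'm assuming the [http] section of the config file is the right place for that setting; everything else was left at its defaults:

[http]
  # excerpt only; assumed location of the write-tracing flag
  write-tracing = true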

I'm using the RC32 client from this repo in /client to send writes. The following is typical of the input I'm sending:

collector.request_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909362984541
collector.timeout.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363035654
service.notify.clients.count,environment=devel,host=the.host.com count=0 1433867909363045462
service.request_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363051340
collector.request.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363109642
notify.delivery.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363114736
service.requests.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363124859
store.fetch.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363128004
store.store.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363136722
aggregator.metric.count.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363145322
aggregator.aggregate.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363171425
aggregator.timescale.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363180985
aggregator.timescale.count.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363198635
collector.collect_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363201891
collector.errors.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363217543
notify.routing.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363220560
notify.notifications.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363229164
aggregator.metric.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363242469
notify.backlog.count,environment=devel,host=the.host.com count=0 1433867909363251125
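
For completeness, this is a minimal sketch of how one of those lines reaches the server, written against the /write endpoint from the request log above rather than through the /client package itself. The host and port are assumptions for my setup; db and rp match the log line:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

func main() {
    // One line from the sample above, posted to the same endpoint that
    // appears in the HTTP request log (db=xyz, rp=default). The host and
    // port below are assumptions for this particular setup.
    body := "notify.backlog.count,environment=devel,host=the.host.com count=0 1433867909363251125"

    resp, err := http.Post(
        "http://localhost:8086/write?db=xyz&rp=default",
        "text/plain",
        strings.NewReader(body),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // A failing write comes back as a 500 with the opaque "write failed" body.
    msg, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(resp.StatusCode, string(msg))
}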

Server OS: 64-bit Ubuntu 14.04.2 LTS, from .deb package
Client OS: tested with Ubuntu 14.04.2 LTS and OS X 10.10.3
Release: 0.9 RC32

toddboom (Contributor) commented Jun 9, 2015

@bww RC32 had some significant changes in the write path that might be causing this. We're actively trying to stabilize that new code, so this is really helpful. Is this just running on a single node? Can you tell us a little more about the average write rate when this happens?

bww (Author) commented Jun 9, 2015

@toddboom Yeah, it's a single node. Write frequency is pretty light: it's collecting metrics sampled every 5 seconds from 5 or 6 different client hosts.

toddboom (Contributor) commented Jun 9, 2015

@bww Thanks for the quick response. Do the errors go away after a while, or does it stay in a high-error state once it starts?

jwilder (Contributor) commented Jun 9, 2015

#2853 adds more logging to help troubleshoot write errors.

bww (Author) commented Jun 9, 2015

@toddboom Once it gets into a high-error state (which it invariably does), it seems to remain that way. Restarting the service doesn't seem to help. However, clearing out the database (rm -rf style) and recreating it significantly lowers the error rate for a while (not to zero, but nearly).

Just in case it's somehow significant, I'll also mention that the service runs on a different volume than its data directory.

I'll try the logging in #2853 when I get a minute and see if that yields anything useful.

beckettsean added this to the 0.9.0 milestone Jun 10, 2015

Jhors2 commented Jun 10, 2015

For what it's worth, I'm having the same issue. More specifically, I'm seeing the following:
[write] 2015/06/10 06:07:08 write failed for shard 4 on node 3: error code 1: write shard 4: input field "value" is type int64, already exists as type float

[write] 2015/06/10 06:07:08 write failed for shard 4 on node 1: error code 1: write shard 4: input field "value" is type float64, already exists as type integer

I am running a 3 node cluster.
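
If I'm reading those errors right, they mean the same field is being written sometimes as an integer and sometimes as a float, e.g. two points like these (hypothetical measurement name) writing "value" with different numeric types:

requests,host=a value=1      <- parsed as int64
requests,host=a value=1.0    <- parsed as float, conflicts with the field type created by the first write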
