RC32: Frequent write errors #2849

Closed
bww opened this issue Jun 9, 2015 · 6 comments

bww commented Jun 9, 2015

After upgrading to RC32 I'm seeing write errors on anywhere from 25% to 75% of requests, for no apparent reason. Sometimes there is no problem for an hour, followed by an extended period of high error rates.

I would expect either all (or essentially all) writes to succeed, or all writes to fail if there were some kind of misconfiguration. A large percentage of writes failing, but not all of them, is concerning.

There isn't any obvious problem with my setup that I can see, and I've had earlier 0.9.x versions running successfully for months on this host.

The service responds with the following opaque error:

write failed

I can't find a way to get the service to trace anything more useful than the HTTP request log line, e.g.:

[http] 2015/06/09 16:36:17 172.31.34.132 - root [09/Jun/2015:16:36:17 +0000] POST /write?consistency=&db=xyz&precision=&rp=default HTTP/1.1 500 36 - Go Metrics a7099361-0ec5-11e5-92c2-000000000000 1.400526ms

I tried setting write-tracing to true in the config to get more diagnostic information, but as far as I can tell it didn't do anything.
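
For reference, this is roughly the change I made. I'm assuming the [http] section of the config file is the right place for that setting; everything else was left at its defaults:

[http]
  # excerpt only; assumed location of the write-tracing flag
  write-tracing = true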

I'm using the RC32 client from this repo in /client to send writes. The following is typical of the input I'm sending:

collector.request_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909362984541
collector.timeout.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363035654
service.notify.clients.count,environment=devel,host=the.host.com count=0 1433867909363045462
service.request_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363051340
collector.request.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363109642
notify.delivery.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363114736
service.requests.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363124859
store.fetch.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363128004
store.store.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363136722
aggregator.metric.count.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363145322
aggregator.aggregate.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363171425
aggregator.timescale.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363180985
aggregator.timescale.count.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363198635
collector.collect_duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363201891
collector.errors.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363217543
notify.routing.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363220560
notify.notifications.meter,environment=devel,host=the.host.com count=0,fifteen-minute=0.0,five-minute=0.0,mean=0.0,one-minute=0.0 1433867909363229164
aggregator.metric.duration.timer,environment=devel,host=the.host.com 50-percentile=0.0,75-percentile=0.0,95-percentile=0.0,99-percentile=0.0,999-percentile=0.0,count=0,fifteen-minute=0.0,five-minute=0.0,max=0,mean=0.0,mean-rate=0.0,min=0,one-minute=0.0,std-dev=0.0 1433867909363242469
notify.backlog.count,environment=devel,host=the.host.com count=0 1433867909363251125
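
For completeness, this is a minimal sketch of how one of those lines reaches the server, written against the /write endpoint from the request log above rather than through the /client package itself. The host and port are assumptions for my setup; db and rp match the log line:

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

func main() {
    // One line from the sample above, posted to the same endpoint that
    // appears in the HTTP request log (db=xyz, rp=default). The host and
    // port below are assumptions for this particular setup.
    body := "notify.backlog.count,environment=devel,host=the.host.com count=0 1433867909363251125"

    resp, err := http.Post(
        "http://localhost:8086/write?db=xyz&rp=default",
        "text/plain",
        strings.NewReader(body),
    )
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // A failing write comes back as a 500 with the opaque "write failed" body.
    msg, _ := ioutil.ReadAll(resp.Body)
    fmt.Println(resp.StatusCode, string(msg))
}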

Server OS: 64-bit Ubuntu 14.04.2 LTS, from .deb package
Client OS: tested with Ubuntu 14.04.2 LTS and OS X 10.10.3
Release: 0.9 RC32

toddboom (Contributor) commented Jun 9, 2015

@bww RC32 had some significant changes in the write path that might be causing this. We're actively trying to stabilize that new code, so this is really helpful. Is this just running on a single node? Can you tell us a little more about the average write rate when this happens?

bww (Author) commented Jun 9, 2015

@toddboom Yeah, it's a single node. Write frequency is pretty light: it's collecting metrics sampled every 5 seconds from 5 or 6 different client hosts.

toddboom (Contributor) commented Jun 9, 2015

@bww Thanks for the quick response. Do the errors go away after a while, or does it stay in a high-error state once it starts?

jwilder (Contributor) commented Jun 9, 2015

#2853 adds more logging to help troubleshoot write errors.

bww (Author) commented Jun 9, 2015

@toddboom Once it gets into a high-error state (which it invariably does), it seems to remain that way. Restarting the service doesn't seem to help. However, clearing out the database (rm -rf style) and recreating it significantly lowers the error rate for a while (not to zero, but nearly).

Just in case it's somehow significant, I'll also mention that the service runs on a different volume than its data directory.

I'll try the logging in #2853 when I get a minute and see if that yields anything useful.

beckettsean added this to the 0.9.0 milestone Jun 10, 2015

Jhors2 commented Jun 10, 2015

For what it's worth, I'm having the same issue. More specifically, I'm seeing the following:
[write] 2015/06/10 06:07:08 write failed for shard 4 on node 3: error code 1: write shard 4: input field "value" is type int64, already exists as type float

[write] 2015/06/10 06:07:08 write failed for shard 4 on node 1: error code 1: write shard 4: input field "value" is type float64, already exists as type integer

I am running a 3 node cluster.
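
If I'm reading those errors right, they mean the same field is being written sometimes as an integer and sometimes as a float, e.g. two points like these (hypothetical measurement name) writing "value" with different numeric types:

requests,host=a value=1      <- parsed as int64
requests,host=a value=1.0    <- parsed as float, conflicts with the field type created by the first write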
