
Filling up logfile / IOError: No space left on device #3561

Closed
johnarnold opened this issue Jul 6, 2015 · 5 comments

Comments

@johnarnold
Contributor

Running version 1.5.1-1

johnar@logr1west1:/etc/logstash/conf.d$ apt-cache policy logstash
logstash:
Installed: 1:1.5.1-1
Candidate: 1:1.5.1-1
Version table:
*** 1:1.5.1-1 0
100 /var/lib/dpkg/status

Disk appears to be getting filled up; every event is being logged to the logfile.

cat logstash.err:

IOError: No space left on device
write at org/jruby/RubyIO.java:1412
write at org/jruby/RubyIO.java:2490
puts at org/jruby/RubyIO.java:2380
<< at /opt/logstash/vendor/bundle/jruby/1.9/gems/cabin-0.7.1/lib/cabin/outputs/io.rb:52
synchronize at org/jruby/ext/thread/Mutex.java:149
<< at /opt/logstash/vendor/bundle/jruby/1.9/gems/cabin-0.7.1/lib/cabin/outputs/io.rb:50
publish at /opt/logstash/vendor/bundle/jruby/1.9/gems/cabin-0.7.1/lib/cabin/channel.rb:176
each at org/jruby/RubyHash.java:1341
publish at /opt/logstash/vendor/bundle/jruby/1.9/g
synchronize at org/jruby/ext/thread/Mutex.java:149
publish at /opt/logstash/vendor/bundle/jruby/1.9/g
_log at /opt/logstash/vendor/bundle/jruby/1.9/g
log_with_level at /opt/logstash/vendor/bundle/jruby/1.9/g
warn at /opt/logstash/vendor/bundle/jruby/1.9/g
buffer_flush at /opt/logstash/vendor/bundle/jruby/1.9/g
buffer_flush at /opt/logstash/vendor/bundle/jruby/1.9/g
each at org/jruby/RubyHash.java:1341
buffer_flush at /opt/logstash/vendor/bundle/jruby/1.9/g
buffer_flush at /opt/logstash/vendor/bundle/jruby/1.9/g
buffer_receive at /opt/logstash/vendor/bundle/jruby/1.9/g
receive at /opt/logstash/vendor/bundle/jruby/1.9/g /outputs/elasticsearch.rb:455
handle at /opt/logstash/vendor/bundle/jruby/1.9/g :88
worker_setup at /opt/logstash/vendor/bundle/jruby/1.9/g :79

johnar@logr1west1:/var/log/logstash$ ls -lrt
total 26888176
-rw-r--r-- 1 root root 57 Jul 6 21:00 logstash.stdout
-rw-rw-rw- 1 logstash logstash 27533471744 Jul 6 21:01 logstash.log
-rw-r--r-- 1 root root 11029 Jul 6 21:01 logstash.err

Logstash options:

$ sudo ps ax | grep logstash

21963 pts/2 SNl 0:17 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xmx500m -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log -w 6

Sanitized config:
http://termbin.com/95v7

@jordansissel
Contributor

every event is being logged to the logfile.

Can you show me some examples of these?

While I wait, here are some initial thoughts given the data you provided:

  • Possible: You may be using output { stdout { ... } } without realizing it. I see this commented out in the config you provided, but your Logstash uses -f /etc/logstash/conf.d, which I assume is a directory full of configs - one of them may still have an active stdout output (a quick grep like the one below would confirm this).
  • Unlikely: You may have Logstash configured with debug or verbose logging. I do not see this in your ps output, though (I looked for --debug, --verbose, and -v flags).
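
If it helps, a quick check along these lines (assuming everything Logstash loads lives under /etc/logstash/conf.d) would surface any stdout output that is still active:

    # List every mention of stdout in the config directory; -R follows
    # symlinked config files, and commented-out lines will start with '#'.
    grep -Rn "stdout" /etc/logstash/conf.d/

Anything that shows up uncommented inside an output { } block would explain the per-event logging.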

@johnarnold
Contributor Author

Jordan, I think this is related to:
logstash-plugins/logstash-output-elasticsearch#144

I'm getting logs like:
{:timestamp=>"2015-07-06T23:32:37.791000+0000", :message=>"failed action with response of 400, dropping action: ["index", {:_id=>nil, :_index=>"logstash-2015.07.06", :_type=>"logs", :_routing=>nil}, #<LogStash::Event:0x53a3fd32 @metadata_accessors=#<LogStash::Util::Accessors:0x510314e5 @store={"retry_count"=>0}, @lut={}>, @cancelled=false, @data={"message"=>"logger: [ssl_req][06/Jul/2015:23:16:21 +0000] 64.39.105.54 TLSv1 DHE-RSA-AES256-SHA \"GET /2014/users/10/shipping_addresses HTTP/1.1\" 1296", "@version"=>"1", "@timestamp"=>"2015-07-06T23:16:21.000Z", "source"=>"syslog", "host"=>"bn3-f89-bspos-7b", "path"=>"/var/log/firehose.log", "host_ip"=>"10
....etc

The config file is a symlink up one directory:
$ ls -l /etc/logstash/conf.d
total 0
lrwxrwxrwx 1 root root 16 Jun 26 21:13 logstash.conf -> ../logstash.conf

@jordansissel
Contributor

Ahh. I'm not sure your problem is logstash-plugins/logstash-output-elasticsearch#144 at this time - that ticket is very long and has lots of mixed reports.

Focusing just on your issue, the error (400) comes from Elasticsearch because it is rejecting something about your data - what exactly, I don't know at this time. If possible, see what Elasticsearch's logs have to say. Alternatively, you could sniff the HTTP response coming back from Elasticsearch.
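
For example, something along these lines (assuming Elasticsearch is reachable at localhost:9200; the index and type names are taken from your log excerpt above) would show the exact 400 body for a single replayed event:

    # Index one of the rejected events by hand and print the full HTTP
    # response; the 400 body from Elasticsearch says why it was refused.
    curl -i -XPOST 'http://localhost:9200/logstash-2015.07.06/logs' \
         -d '{"message": "paste one rejected event body here"}'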

@johnarnold
Contributor Author

Good advice... found this on one of the masters:

[2015-07-06 23:45:42,080][WARN ][cluster.action.shard ] [logmaster2] [logstash-2015.07.06][3] received shard failed for [logstash-2015.07.06][3], node[srkzaGS5TNiN8H7JWfpZdA], [R], s[INITIALIZING], indexUUID [MaXO6tx7TDO7g74ObXkfxg], reason [shard failure [failed recovery][RecoveryFailedException[[logstash-2015.07.06][3]: Recovery failed from [logdata5][x8kENbNRSiCki0JaMp2uow][logdata5][inet[/10.80.80.11:9300]]{master=false} into [logdata9][srkzaGS5TNiN8H7JWfpZdA][logdata9][inet[/10.80.80.15:9300]]{master=false}]; nested: RemoteTransportException[[logdata5][inet[/10.80.80.11:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[logstash-2015.07.06][3] Phase[2] Execution failed]; nested: RemoteTransportException[[logdata9][inet[/10.80.80.15:9300]][internal:index/shard/recovery/translog_ops]]; nested: MapperParsingException[failed to parse [device_timestamp]]; nested: MapperParsingException[failed to parse date field [Jul 6 23:33:11], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "Jul 6 23:33:11"]; ]]

This is weird because according to the index mapping, device_timestamp should be a string:
device_timestamp string
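
A check along these lines (assuming Elasticsearch is reachable at localhost:9200) shows the live mapping for the field:

    # Dump the mapping of today's index and pull out the device_timestamp
    # entry; host and index name match the setup described above.
    curl -s 'http://localhost:9200/logstash-2015.07.06/_mapping?pretty' | grep -A 3 '"device_timestamp"'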

Assuming there is a fixable mapping error here, it would be nice to handle this more gracefully in Logstash rather than spamming the logfile. I'll see if I can fix it and then update the issue.

@johnarnold
Contributor Author

Root cause:
Elasticsearch was unable to index certain device_timestamp fields due to a conflict between the dynamically mapped template (which cast these as datetime) and the received field (which could be a really ugly string). This caused Elasticsearch to return HTTP 400 codes to Logstash, which mishandled these events and dumped them to the logfile with no indication of what was wrong. The logfile would fill up the disk, eventually causing Logstash to die (and not restart successfully).

Resolution:

es-template.txt has been updated to hardcode the "device_timestamp" field as a string:

    "device_timestamp" : { "type": "string", "index": "not_analyzed" }

This resolves indexing errors seen on the master due to invalid mapping of events:
[2015-07-06 23:45:42,080][WARN ][cluster.action.shard ] [logmaster2] [logstash-2015.07.06][3] received shard failed for [logstash-2015.07.06][3], node[srkzaGS5TNiN8H7JWfpZdA], [R], s[INITIALIZING], indexUUID [MaXO6tx7TDO7g74ObXkfxg], reason [shard failure [failed recovery][RecoveryFailedException[[logstash-2015.07.06][3]: Recovery failed from [logdata5][x8kENbNRSiCki0JaMp2uow][logdata5][inet[/10.80.80.11:9300]]{master=false} into [logdata9][srkzaGS5TNiN8H7JWfpZdA][logdata9][inet[/10.80.80.15:9300]]{master=false}]; nested: RemoteTransportException[[logdata5][inet[/10.80.80.11:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[logstash-2015.07.06][3] Phase[2] Execution failed]; nested: RemoteTransportException[[logdata9][inet[/10.80.80.15:9300]][internal:index/shard/recovery/translog_ops]]; nested: MapperParsingException[failed to parse [device_timestamp]]; nested: MapperParsingException[failed to parse date field [Jul 6 23:33:11], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: "Jul 6 23:33:11"]; ]]

I had to nuke today’s index in order to recreate the mappings. Seems to be running pretty smoothly now.
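
In case it is useful to anyone hitting the same thing, the fix amounted to roughly the following (a sketch, assuming Elasticsearch at localhost:9200, a template named "logstash", and es-template.txt in the current directory; adjust names and paths to your setup):

    # Re-upload the corrected index template, then delete the bad index so
    # it is recreated with the new mapping on the next indexed event.
    # Note: deleting the index drops today's data.
    curl -XPUT 'http://localhost:9200/_template/logstash' -d @es-template.txt
    curl -XDELETE 'http://localhost:9200/logstash-2015.07.06'

Pointing the elasticsearch output's template and template_overwrite options at the updated file and restarting Logstash should accomplish the same template upload.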

I think the plugin should not spam the log for mapping problems like this (at least with debug off). I'll close this case and comment on the other open case under the plugin. Thanks again, Jordan.
