Poor Logstash Throughput #27

Closed
ingshtrom opened this issue May 3, 2018 · 23 comments

@ingshtrom

Hello,

We're trying to send logs to SumoLogic through Logstash. We currently send our logs to an ELK stack, where we sustain a throughput of 25k messages per second, but with every configuration we have tried for the SumoLogic Logstash plugin, our throughput drops dramatically.

Here are the configurations we have tried and their results, as shown by graphs of message throughput.

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
}

The deviation here shows the dramatic loss in throughput and the widening gap between logs being sent into the indexer and logs being sent out of the indexer (to SumoLogic/ELK).
[screenshot: message throughput graph]

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 30
}

The deviation isn't as big, but it is still not acceptable (throughput drops to about 13k messages per second).
[screenshot: message throughput graph]

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 60
    compress => true
    compress_encoding => "gzip"
}
sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 30
    compress => true
    compress_encoding => "gzip"
}

[screenshot: message throughput graph for the compressed configurations]

Can you help us achieve higher throughput? We are running a trial of SumoLogic, but we have not yet been able to reach the throughput we expect from our Logstash indexers.

bin3377 self-assigned this May 9, 2018
@bin3377

bin3377 commented May 9, 2018

Thanks for reporting this!
I'm working on a version with multiple pipelines/senders in parallel. Will update the thread when I have a working version.

@bin3377

bin3377 commented May 15, 2018

@ingshtrom I now have a beta version for improving throughput in branch https://github.com/SumoLogic/logstash-output-sumologic/tree/byi-asyc-mem-queue

Since the code structure has changed significantly, I may need more time to test and refine it before pushing it to RubyGems.

But if you want, you can get the beta version now for testing:
https://github.com/SumoLogic/logstash-output-sumologic/raw/byi-asyc-mem-queue/logstash-output-sumologic-1.1.9.gem

Install:
logstash-plugin install ./logstash-output-sumologic-1.1.9.gem

Here is a sample config file:

output
{
    sumologic
    {
        url => "https://stag-events.sumologic.net/receiver/v1/http/XXXXX"
        compress => true
        compress_encoding => "gzip"
        source_name => "log test"
        source_category => "logstash sample"
        format => "%{@timestamp} %{@json}"
        interval => 5 # one request can hold up to 5 sec of events; if 0, every event goes in its own request
        pile_max => 1024000 # one request can hold up to 1000KB of data
        queue_max => 4096 # up to 4096 requests can be queued in memory
        sender_max => 10 # use up to 10 HTTP clients working in parallel
    }
}

@ingshtrom

ingshtrom commented May 16, 2018

I set up my Dockerfile to do the following:

ADD --chown=logstash:logstash https://github.com/SumoLogic/logstash-output-sumologic/raw/byi-asyc-mem-queue/logstash-output-sumologic-1.1.9.gem /tmp/logstash-output-sumologic-1.1.9-beta.gem
...
RUN /usr/share/logstash/bin/logstash-plugin install /tmp/logstash-output-sumologic-1.1.9-beta.gem

This is the error I get when building the image:
[screenshot: build error output]

Sorry, I'm really not familiar enough with Ruby to be able to debug this :(

@bin3377

bin3377 commented May 16, 2018

Looks like your image has a lower Logstash version than my build environment.

14:41 # logstash --version
logstash 6.2.2

Can you update the image to 6.2.x?
https://www.elastic.co/guide/en/logstash/current/docker.html

@ingshtrom

Gotcha, that is correct. Let me update.

@ingshtrom

ingshtrom commented May 16, 2018

@bin3377 Do you have a recommended config for high throughput? I've tried several setups that increase the size of requests and the amount of parallel processing, but to no avail. I also tried the config you posted above as a starting point.

Examples:

sumologic {
      url=>"{{ .Env "SUMO_URL" }}"
      format=>"%{@json}"
      compress => true
      compress_encoding => "gzip"
      interval => 0 # send every event in its own request
      pile_max => 102400000 # one request can hold up to ~100MB of data
      queue_max => 4096000 # up to 4,096,000 requests can be queued in memory
      sender_max => 10 # use up to 10 HTTP clients working in parallel
    }
sumologic {
      url=>"{{ .Env "SUMO_URL" }}"
      format=>"%{@json}"
      compress => true
      compress_encoding => "gzip"
      interval => 15 # one request can hold up to 15 sec of events
      pile_max => 10240000 # one request can hold up to ~10MB of data
      queue_max => 409600 # up to 409,600 requests can be queued in memory
      sender_max => 30 # use up to 30 HTTP clients working in parallel
    }

@bin3377

bin3377 commented May 16, 2018

It depends on where the bottleneck in the process is:
Increasing interval and pile_max piles more messages together, which reduces the per-request transmission overhead and improves compression efficiency, but it may add latency because messages are held on the client for longer.
Increasing queue_max lets more memory be used to cache messages from the input, so if the input is being blocked by the plugin's consuming speed, increasing this may help.
Increasing sender_max makes more HTTP connections work in parallel to send the messages in the queue.
And depending on the message content, adjusting compress/compress_encoding may help balance CPU consumption against payload size.

On the other side, this version is thread safe, so theoretically you can also run multiple plugin instances in parallel as workers (https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html)
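
Conceptually, the flow of those settings looks roughly like this (a simplified Ruby sketch; the class and method names here are made up and do not match the plugin source):

require "net/http"
require "uri"

# Sketch of the pile -> queue -> senders flow (illustration only).
class SumoPipelineSketch
  def initialize(url, interval: 5, pile_max: 1_024_000, queue_max: 4096, sender_max: 10)
    @url      = URI(url)
    @interval = interval
    @pile_max = pile_max
    @pile     = +""                        # events accumulate here until flushed
    @lock     = Mutex.new
    @queue    = SizedQueue.new(queue_max)  # blocks producers when full (back-pressure)
    @senders  = Array.new(sender_max) { Thread.new { sender_loop } }
    @flusher  = Thread.new { loop { sleep(@interval); flush } } if interval > 0
  end

  # Called for every event coming out of the Logstash pipeline.
  def receive(line)
    @lock.synchronize do
      @pile << line << "\n"
      flush_locked if @interval <= 0 || @pile.bytesize >= @pile_max
    end
  end

  private

  def flush
    @lock.synchronize { flush_locked }
  end

  def flush_locked
    return if @pile.empty?
    @queue << @pile                        # hand the finished pile to the sender threads
    @pile = +""
  end

  def sender_loop
    while (payload = @queue.pop)           # sender_max threads POST in parallel
      Net::HTTP.post(@url, payload, "Content-Type" => "text/plain")
    end
  end
end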

@ingshtrom

That was a really awesome explanation of each property! Thank you. I will spend a bit more time testing and report back!

@ingshtrom

ingshtrom commented May 21, 2018

I'm currently running into a weird error while trying to debug things. In the Logstash logs I get errors like these:

[2018-05-21T14:57:47,364][ERROR][logstash.outputs.sumologic] Server did not accept request
...
[2018-05-21T14:59:01,095][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:02,948][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:04,109][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,101][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,379][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,700][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:06,289][ERROR][logstash.outputs.sumologic] HTTP request rejected

When I look in Logstash I can see the logs there, so it's odd that I get this error.

I guess my question is what this error is actually supposed to mean. I would presume that I shouldn't be getting it.

@ingshtrom

I found the lines in https://github.com/SumoLogic/logstash-output-sumologic/blob/byi-asyc-mem-queue/lib/logstash/outputs/sumologic/sender.rb that generate these errors. I have seen the error here as well, but less often.

After looking at this file and testing the URL we pass to the SumoLogic plugin, I have reached two conclusions about potential bugs in the code. Please correct me if I am wrong or you have other findings:

  1. the collection endpoint in SumoLogic seems to respond with 200 for me, but not for the plugin
  2. the plugin still attempts to send data to the collection endpoint even after determining that it is “down”

I should also note that these server errors do NOT show up when we only supply the url and format properties to the output plugin. We are trying to determine which property is causing the issue, but even after resetting back to the properties you suggested above as a default, we haven't run into the error again. Will keep you updated.

Thank you, and please advise on any recommendations.

@bin3377

bin3377 commented May 22, 2018

The

HTTP request rejected

error means the server response was not 200. It could be a server-side problem (like a 429 or 503) or a plugin problem (like a 400 if log lines somehow get corrupted). And

Error in network transmission

normally means the connection to Sumo is broken or, in some rare cases, that the request timed out because the server did not respond in time (possibly adjustable with request_timeout, socket_timeout or connect_timeout - link).
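
As a generic illustration of what those timeouts control (this uses Ruby's Net::HTTP directly, not the plugin's own HTTP client; the endpoint below is a placeholder):

require "net/http"

# Connect vs. read timeouts; the plugin's connect_timeout / socket_timeout /
# request_timeout options play an analogous role for its HTTP client.
uri  = URI("https://stag-events.sumologic.net/receiver/v1/http/XXXXX")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = true
http.open_timeout = 10   # max seconds to establish the TCP/TLS connection
http.read_timeout = 30   # max seconds to wait for the server's response

begin
  http.post(uri.path, "test message", "Content-Type" => "text/plain")
rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError => e
  warn "Error in network transmission: #{e.class}"
end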

For more details you can refer to this help doc

I updated the plugin with some extra logging points, could you please download the plugin again and give it a try?

To enable the logging, you need to update logstash/config/log4j2.properties; here is an example - https://github.com/SumoLogic/logstash-output-sumologic/blob/byi-asyc-mem-queue/samples/log4j2.properties

Thanks!

@ingshtrom

I used the logging config exactly as it is in the example. Here is the sumologic output config that is being used:

sumologic {
    url => "<url>"
    format => "%{@json}"
    compress => true
    compress_encoding => "gzip"
    interval => 0
    pile_max => 1024000
    queue_max => 4096
    sender_max => 30
  }

and the logs are available at this gist.

It isn't outputting the status code, so I'm not sure, but it would be worth knowing whether we are getting rate-limited. That would make the most sense, as all the requests seem to go through... until they don't.

@ingshtrom

The documentation here gives an example of throttling for a 10GB account: roughly 71MB per minute is allowed before throttling kicks in. The SumoLogic UI shows that we have uploaded around 2.4GB of data so far today, and when I look at when those logs were ingested, it was primarily during a 5-minute window from 9:20 to 9:25 EST. I'm 99% sure we're being throttled, but you should be able to confirm that on your end, correct? I cannot on mine, since the status codes are not being logged.

[screenshot: SumoLogic ingestion volume]

I did the math: even if the 2.4GB were spread out evenly over 30 minutes, that is about 2.4 × 1024 / 30 ≈ 82MB per minute, still above the theoretical threshold of 71MB/minute.

So this throttling, coupled with the fact that Logstash applies back-pressure in an attempt to not lose any data, creates a scenario where adding the SumoLogic output plugin essentially grinds all of our log ingestion to a halt. source

@bin3377

bin3377 commented May 22, 2018

Your guess should be correct. @frankreno will help you adjust the account provisioning to unblock the throttling. Currently the plugin will retry sending in the following cases (roughly as sketched below):

  1. 429 - throttling
  2. 503 - server in maintenance
  3. 504 - gateway timeout
  4. network failure
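
Roughly, the retry behavior looks like this (a simplified sketch; it is not the exact code in sender.rb, and the attempt count and backoff here are made up):

require "net/http"

RETRYABLE_STATUSES = [429, 503, 504].freeze

def post_with_retry(url, payload, max_attempts: 5)
  uri     = URI(url)
  attempt = 0
  begin
    attempt += 1
    response = Net::HTTP.post(uri, payload, "Content-Type" => "text/plain")
    raise "retryable HTTP #{response.code}" if RETRYABLE_STATUSES.include?(response.code.to_i)
    response
  rescue StandardError
    # Catches the retryable statuses raised above as well as network
    # failures (connection reset, timeout, DNS error, ...).
    raise if attempt >= max_attempts
    sleep(2**attempt)   # back off before retrying
    retry
  end
end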

@ingshtrom

We got the throttling figured out. I'm still not getting the expected throughput, but I am still testing different configurations. In the meantime, I noticed a new error I hadn't seen before. Whenever I change the config via live-reload, the next startup results in lots of errors like this:

2018-05-22 19:37:57.203 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:02.204 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:07.204 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:12.205 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:17.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:22.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:27.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:32.207 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:37.207 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:42.208 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:47.208 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:52.209 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:57.209 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now

These are almost always preceded by errors from the previous process that look like this. I just tested, and this stack trace happens with our current config too, but I wanted to paste it here in case an issue with one plugin is causing further issues with other plugins.

I will update once I do more testing without the throttling.

@bin3377

bin3377 commented May 22, 2018

This line is not an error. It just means the pile was sent to the queue. Normally you will see this line when the pile has not filled up to pile_max but the interval has been reached. It shouldn't happen continuously, but if it's during the shutdown process then it's probably OK.

@ingshtrom

OK. They showed up more on startup after a reload. I wonder if the queue is somehow not filling at all, since it keeps logging that without any "HTTP request accepted" messages. Hmmm

@ingshtrom

Update after a few days. I've gotten Logstash to achieve throughput very similar to what we had prior to SumoLogic. The problem is that it is consuming a lot of CPU and a lot more memory. Now that I've proven it's possible to get the throughput we need, I'm playing with the numbers to try to sustain that throughput without as much of a change in resource usage. A lot of other projects have taken my attention away from this over the last few days, but I should hopefully have a question or an answer by tomorrow.

@bin3377

bin3377 commented May 24, 2018

Thanks a lot for your update! I do expect it to consume more memory/CPU to keep throughput high, since the in-memory cache and the multiple compressing/sending threads have a cost. Looking forward to seeing your test results.

@ingshtrom

OK, I think we have achieved the expected throughput! I want to reiterate that the increased resource cost is higher than I expected, but it's well within the headroom of the machines we were already running, so I'm not too worried. Here is a graph with metrics of throughput, CPU, and memory usage over the last day (look for the bright yellow annotations):

[screenshot: throughput, CPU, and memory metrics over the last day]

Essentially we are achieving the same message throughput and network usage while using the same CPU and almost double the memory. These machines have plenty of free memory, though, so it's not a problem.

Here are the changes we made. The SumoLogic output config:

sumologic {
    url => "{{ .Env "SUMO_URL" }}"
    format => "%{@json}"
    compress => true
    compress_encoding => "gzip"
    interval => 10
    pile_max => 10240000
    queue_max => 409600
    sender_max => 100
  }

We increased the JVM Heap from 1GB to 2GB via the jvm.options file.

We changed the logging config to the following. Notice that it is the same as you provided, but uses warn instead of debug:

status = warn
name = LogstashPropertiesConfig
property.basePath = .

appender.console.type = Console
appender.console.name = consoleLogger
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} %level [%t] [%c] [%M] [%l] - %msg%n

appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName = ${basePath}/logstash-output-sumologic.log
appender.rolling.filePattern = ${basePath}app_%d{yyyyMMdd}.log.gz
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} %level [%t] [%c] [%M] [%l] - %msg%n
appender.rolling.policies.type = Policies

logger.sumologic.name = logstash.outputs.sumologic
logger.sumologic.level = warn
logger.sumologic.additivity = false
logger.sumologic.appenderRef.rolling.ref = fileLogger
logger.sumologic.appenderRef.console.ref = consoleLogger

rootLogger.level = ${sys:ls.log.level}
rootLogger.appenderRef.rolling.ref = consoleLogger

Prior to introducing SumoLogic, we ran with 4 pipeline workers and a batch size of 1000. This has been changed to 4 workers and a batch size of 2000.

I would love to see a more official build. In the meantime I will continue to let this run over the weekend to see how it fares.

Thank you for all of your help thus far!

@ingshtrom

I also used this query in SumoLogic to view the throughput from your end:

_index=sumologic_volume sizeInBytes
| where _sourceCategory="collector_volume"
| parse regex "\"(?<collector>[^\"]+)\"\:\{\"sizeInBytes\"\:(?<bytes>\d+),\"count\"\:(?<count>\d+)\}" multi
| timeslice 5m
| bytes/1024/1024/1024 as gbytes
| sum(gbytes) as gbytes by collector, _timeslice
| transpose row _timeslice column collector

Worked like a charm.

Also, in Kibana I could see that we were still getting the same throughput, so all of the places where we gauge throughput were showing the improved results.

@bin3377

bin3377 commented May 25, 2018

That's an awesome report and valuable analysis! Very happy the new version unblocks you. I will update the documentation today and push a new version to RubyGems so it's installable from the official channel. Thank you!

@bin3377

bin3377 commented May 30, 2018

The new version is on RubyGems now https://rubygems.org/gems/logstash-output-sumologic
Closing the issue.
