Poor Logstash Throughput #27

Closed
ingshtrom opened this issue May 3, 2018 · 23 comments

@ingshtrom

Hello,

We're trying to send logs to SumoLogic through Logstash. We currently send our logs to an ELK stack, where we sustain a throughput of 25k messages per second, but with every configuration we have tried for the SumoLogic Logstash plugin, our throughput drops dramatically.

Here are the configurations we have tried and their results, as shown by graphs of message throughput.

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
}

The deviation here shows the dramatic loss in throughput and the widening gap between logs being sent into the indexer and logs being sent out of the indexer (to SumoLogic/ELK).
[screenshot: message throughput graph]

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 30
}

The deviation isn't as big, but it is still not acceptable (throughput drops to about 13k messages per second).
[screenshot: message throughput graph]

sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 60
    compress => true
    compress_encoding => "gzip"
}
sumologic {
    url=>"{{ .Env "SUMO_URL" }}"
    format=>"%{@json}"
    interval => 30
    compress => true
    compress_encoding => "gzip"
}

[screenshot: message throughput graph for the compressed configurations]

Can you help us achieve higher throughput? We are running a trial of SumoLogic, but we have not yet been able to reach the throughput we expect from our Logstash indexers.

bin3377 self-assigned this May 9, 2018
@bin3377

bin3377 commented May 9, 2018

Thanks for reporting this!
I'm working on a version with multiple pipelines/senders in parallel. Will update the thread when I have a working version.

@bin3377

bin3377 commented May 15, 2018

@ingshtrom I now have a beta version for improving throughput in branch https://github.com/SumoLogic/logstash-output-sumologic/tree/byi-asyc-mem-queue

Since the code structure has changed significantly, I may need more time to test and refine it before pushing it to RubyGems.

But if you want, you can get the beta version now for testing:
https://github.com/SumoLogic/logstash-output-sumologic/raw/byi-asyc-mem-queue/logstash-output-sumologic-1.1.9.gem

Install:
logstash-plugin install ./logstash-output-sumologic-1.1.9.gem

Here is a sample config file:

output
{
    sumologic
    {
        url => "https://stag-events.sumologic.net/receiver/v1/http/XXXXX"
        compress => true
        compress_encoding => "gzip"
        source_name => "log test"
        source_category => "logstash sample"
        format => "%{@timestamp} %{@json}"
        interval => 5 # one request can hold up to 5 sec of events; if 0, every event goes in its own request
        pile_max => 1024000 # one request can hold up to 1000KB of data
        queue_max => 4096 # up to 4096 requests can be queued in memory
        sender_max => 10 # use up to 10 HTTP clients working in parallel
    }
}

@ingshtrom

ingshtrom commented May 16, 2018

I set up my Dockerfile to do the following:

ADD --chown=logstash:logstash https://github.com/SumoLogic/logstash-output-sumologic/raw/byi-asyc-mem-queue/logstash-output-sumologic-1.1.9.gem /tmp/logstash-output-sumologic-1.1.9-beta.gem
...
RUN /usr/share/logstash/bin/logstash-plugin install /tmp/logstash-output-sumologic-1.1.9-beta.gem

This is the error I get when building the image:
[screenshot: build error output]

Sorry, I'm really not familiar enough with Ruby to be able to debug this :(

@bin3377

bin3377 commented May 16, 2018

Looks like your image has a lower Logstash version than my build environment.

14:41 # logstash --version
logstash 6.2.2

Can you update the image to 6.2.x?
https://www.elastic.co/guide/en/logstash/current/docker.html

@ingshtrom

Gotcha, that is correct. Let me update.

@ingshtrom

ingshtrom commented May 16, 2018

@bin3377 Do you have a recommended config for high throughput? I've tried several setups that increase the size of requests and the amount of parallel processing, but to no avail. I also tried the config you posted above as a starting point.

Examples:

sumologic {
      url=>"{{ .Env "SUMO_URL" }}"
      format=>"%{@json}"
      compress => true
      compress_encoding => "gzip"
      interval => 0 # send every event in its own request
      pile_max => 102400000 # one request can hold up to ~100MB of data
      queue_max => 4096000 # up to 4,096,000 requests can be queued in memory
      sender_max => 10 # use up to 10 HTTP clients working in parallel
    }
sumologic {
      url=>"{{ .Env "SUMO_URL" }}"
      format=>"%{@json}"
      compress => true
      compress_encoding => "gzip"
      interval => 15 # one request can hold up to 15 sec of events
      pile_max => 10240000 # one request can hold up to ~10MB of data
      queue_max => 409600 # up to 409,600 requests can be queued in memory
      sender_max => 30 # use up to 30 HTTP clients working in parallel
    }

@bin3377

bin3377 commented May 16, 2018

It depends on where the bottleneck in the process is:
Increasing interval and pile_max piles more messages together, which reduces the per-request transmission overhead and improves compression efficiency, but it may add latency because messages are held on the client for longer.
Increasing queue_max lets more memory be used to cache messages from the input, so if the input is being blocked by the plugin's consuming speed, increasing this may help.
Increasing sender_max makes more HTTP connections work in parallel to send the messages in the queue.
And depending on the message content, adjusting compress/compress_encoding may help balance CPU consumption against payload size.

On the other side, this version is thread safe, so theoretically you can also run multiple plugin instances in parallel as workers (https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html)
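
Conceptually, the flow of those settings looks roughly like this (a simplified Ruby sketch; the class and method names here are made up and do not match the plugin source):

require "net/http"
require "uri"

# Sketch of the pile -> queue -> senders flow (illustration only).
class SumoPipelineSketch
  def initialize(url, interval: 5, pile_max: 1_024_000, queue_max: 4096, sender_max: 10)
    @url      = URI(url)
    @interval = interval
    @pile_max = pile_max
    @pile     = +""                        # events accumulate here until flushed
    @lock     = Mutex.new
    @queue    = SizedQueue.new(queue_max)  # blocks producers when full (back-pressure)
    @senders  = Array.new(sender_max) { Thread.new { sender_loop } }
    @flusher  = Thread.new { loop { sleep(@interval); flush } } if interval > 0
  end

  # Called for every event coming out of the Logstash pipeline.
  def receive(line)
    @lock.synchronize do
      @pile << line << "\n"
      flush_locked if @interval <= 0 || @pile.bytesize >= @pile_max
    end
  end

  private

  def flush
    @lock.synchronize { flush_locked }
  end

  def flush_locked
    return if @pile.empty?
    @queue << @pile                        # hand the finished pile to the sender threads
    @pile = +""
  end

  def sender_loop
    while (payload = @queue.pop)           # sender_max threads POST in parallel
      Net::HTTP.post(@url, payload, "Content-Type" => "text/plain")
    end
  end
end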

@ingshtrom

That was a really awesome explanation of each property! Thank you. I will spend a bit more time testing and report back!

@ingshtrom

ingshtrom commented May 21, 2018

I'm currently running into a weird error while trying to debug things. In the Logstash logs I get errors like these:

[2018-05-21T14:57:47,364][ERROR][logstash.outputs.sumologic] Server did not accept request
...
[2018-05-21T14:59:01,095][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:02,948][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:04,109][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,101][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,379][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:05,700][ERROR][logstash.outputs.sumologic] HTTP request rejected
[2018-05-21T14:59:06,289][ERROR][logstash.outputs.sumologic] HTTP request rejected

When I look in Logstash I can see the logs there, so it's odd that I get this error.

I guess my question is what this error is actually supposed to mean. I would presume that I shouldn't be getting it.

@ingshtrom

I found the lines in https://github.com/SumoLogic/logstash-output-sumologic/blob/byi-asyc-mem-queue/lib/logstash/outputs/sumologic/sender.rb that generate these errors. I have seen the error here as well, but less often.

After looking at this file and testing the URL we pass to the SumoLogic plugin, I have reached two conclusions about potential bugs in the code. Please correct me if I am wrong or you have other findings:

  1. the collection endpoint in SumoLogic seems to respond with 200 for me, but not for the plugin
  2. the plugin still attempts to send data to the collection endpoint even after determining that it is “down”

I should also note that these server errors do NOT show up when we only supply the url and format properties to the output plugin. We are trying to determine which property is causing the issue, but even after resetting back to the properties you suggested above as a default, we haven't run into the error again. Will keep you updated.

Thank you, and please advise on any recommendations.

@bin3377

bin3377 commented May 22, 2018

The

HTTP request rejected

error means the server response was not 200. It could be a server-side problem (like a 429 or 503) or a plugin problem (like a 400 if log lines somehow get corrupted). And

Error in network transmission

normally means the connection to Sumo is broken or, in some rare cases, that the request timed out because the server did not respond in time (possibly adjustable with request_timeout, socket_timeout or connect_timeout - link).
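
As a generic illustration of what those timeouts control (this uses Ruby's Net::HTTP directly, not the plugin's own HTTP client; the endpoint below is a placeholder):

require "net/http"

# Connect vs. read timeouts; the plugin's connect_timeout / socket_timeout /
# request_timeout options play an analogous role for its HTTP client.
uri  = URI("https://stag-events.sumologic.net/receiver/v1/http/XXXXX")
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = true
http.open_timeout = 10   # max seconds to establish the TCP/TLS connection
http.read_timeout = 30   # max seconds to wait for the server's response

begin
  http.post(uri.path, "test message", "Content-Type" => "text/plain")
rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError => e
  warn "Error in network transmission: #{e.class}"
end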

For more details you can refer to this help doc

I updated the plugin with some extra logging points, could you please download the plugin again and give it a try?

To enable the logging, you need to update logstash/config/log4j2.properties; here is an example - https://github.com/SumoLogic/logstash-output-sumologic/blob/byi-asyc-mem-queue/samples/log4j2.properties

Thanks!

@ingshtrom

I used the logging config exactly as it is in the example. Here is the sumologic output config that is being used:

sumologic {
    url => "<url>"
    format => "%{@json}"
    compress => true
    compress_encoding => "gzip"
    interval => 0
    pile_max => 1024000
    queue_max => 4096
    sender_max => 30
  }

and the logs are available at this gist.

It isn't outputting the status code, so I'm not sure, but it would be worth knowing whether we are getting rate-limited. That would make the most sense, as all the requests seem to go through... until they don't.

@ingshtrom

The documentation here gives an example of throttling for a 10GB account: roughly 71MB per minute is allowed before throttling kicks in. The SumoLogic UI shows that we have uploaded around 2.4GB of data so far today, and when I look at when those logs were ingested, it was primarily during a 5-minute window from 9:20 to 9:25 EST. I'm 99% sure we're being throttled, but you should be able to confirm that on your end, correct? I cannot on mine, since the status codes are not being logged.

[screenshot: SumoLogic ingestion volume]

I did the math: even if the 2.4GB were spread out evenly over 30 minutes, that is about 2.4 × 1024 / 30 ≈ 82MB per minute, still above the theoretical threshold of 71MB/minute.

So this throttling, coupled with the fact that Logstash applies back-pressure in an attempt to not lose any data, creates a scenario where adding the SumoLogic output plugin essentially grinds all of our log ingestion to a halt. source

@bin3377

bin3377 commented May 22, 2018

Your guess should be correct. @frankreno will help you adjust the account provisioning to unblock the throttling. Currently the plugin will retry sending in the following cases (roughly as sketched below):

  1. 429 - throttling
  2. 503 - server in maintenance
  3. 504 - gateway timeout
  4. network failure
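
Roughly, the retry behavior looks like this (a simplified sketch; it is not the exact code in sender.rb, and the attempt count and backoff here are made up):

require "net/http"

RETRYABLE_STATUSES = [429, 503, 504].freeze

def post_with_retry(url, payload, max_attempts: 5)
  uri     = URI(url)
  attempt = 0
  begin
    attempt += 1
    response = Net::HTTP.post(uri, payload, "Content-Type" => "text/plain")
    raise "retryable HTTP #{response.code}" if RETRYABLE_STATUSES.include?(response.code.to_i)
    response
  rescue StandardError
    # Catches the retryable statuses raised above as well as network
    # failures (connection reset, timeout, DNS error, ...).
    raise if attempt >= max_attempts
    sleep(2**attempt)   # back off before retrying
    retry
  end
end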

@ingshtrom

We got the throttling figured out. I'm still not getting the expected throughput, but I am still testing different configurations. In the meantime, I noticed a new error I hadn't seen before. Whenever I change the config via live-reload, the next startup results in lots of errors like this:

2018-05-22 19:37:57.203 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:02.204 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:07.204 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:12.205 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:17.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:22.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:27.206 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:32.207 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:37.207 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:42.208 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:47.208 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:52.209 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now
2018-05-22 19:38:57.209 DEBUG [Ruby-0-Thread-10: :1] [logstash.outputs.sumologic] [invoke] [sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)] - timeout, enqueue pile now

These are almost always preceded by errors from the previous process that look like this. I just tested, and this stack trace happens with our current config too, but I wanted to paste it here in case an issue with one plugin is causing further issues with other plugins.

I will update once I do more testing without the throttling.

@bin3377

bin3377 commented May 22, 2018

This line is not an error. It just means the pile was sent to the queue. Normally you will see this line when the pile has not filled up to pile_max but the interval has been reached. It shouldn't happen continuously, but if it's during the shutdown process then it's probably OK.

@ingshtrom

OK. They showed up more on startup after a reload. I wonder if the queue is somehow not filling at all, since it keeps logging that without any "HTTP request accepted" messages. Hmmm

@ingshtrom

Update after a few days. I've gotten Logstash to achieve throughput very similar to what we had prior to SumoLogic. The problem is that it is consuming a lot of CPU and a lot more memory. Now that I've proven it's possible to get the throughput we need, I'm playing with the numbers to try to sustain that throughput without as much of a change in resource usage. A lot of other projects have taken my attention away from this over the last few days, but I should hopefully have a question or an answer by tomorrow.

@bin3377

bin3377 commented May 24, 2018

Thanks a lot for your update! I do expect it to consume more memory/CPU to keep throughput high, since the in-memory cache and the multiple compressing/sending threads have a cost. Looking forward to seeing your test results.

@ingshtrom

OK, I think we have achieved the expected throughput! I want to reiterate that the increased resource cost is higher than I expected, but it's well within the headroom of the machines we were already running, so I'm not too worried. Here is a graph with metrics of throughput, CPU, and memory usage over the last day (look for the bright yellow annotations):

[screenshot: throughput, CPU, and memory metrics over the last day]

Essentially we are achieving the same message throughput and network usage while using the same CPU and almost double the memory. These machines have plenty of free memory, though, so it's not a problem.

Here are the changes we made. The SumoLogic output config:

sumologic {
    url => "{{ .Env "SUMO_URL" }}"
    format => "%{@json}"
    compress => true
    compress_encoding => "gzip"
    interval => 10
    pile_max => 10240000
    queue_max => 409600
    sender_max => 100
  }

We increased the JVM Heap from 1GB to 2GB via the jvm.options file.

We changed the logging config to the following. Notice that it is the same as you provided, but uses warn instead of debug:

status = warn
name = LogstashPropertiesConfig
property.basePath = .

appender.console.type = Console
appender.console.name = consoleLogger
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} %level [%t] [%c] [%M] [%l] - %msg%n

appender.rolling.type = RollingFile
appender.rolling.name = fileLogger
appender.rolling.fileName = ${basePath}/logstash-output-sumologic.log
appender.rolling.filePattern = ${basePath}app_%d{yyyyMMdd}.log.gz
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss.SSS} %level [%t] [%c] [%M] [%l] - %msg%n
appender.rolling.policies.type = Policies

logger.sumologic.name = logstash.outputs.sumologic
logger.sumologic.level = warn
logger.sumologic.additivity = false
logger.sumologic.appenderRef.rolling.ref = fileLogger
logger.sumologic.appenderRef.console.ref = consoleLogger

rootLogger.level = ${sys:ls.log.level}
rootLogger.appenderRef.rolling.ref = consoleLogger

Prior to introducing SumoLogic, we ran with 4 pipeline workers and a batch size of 1000. This has been changed to 4 workers and a batch size of 2000.

I would love to see a more official build. In the meantime I will continue to let this run over the weekend to see how it fares.

Thank you for all of your help thus far!

@ingshtrom

I also used this query in SumoLogic to view the throughput from your end:

_index=sumologic_volume sizeInBytes
| where _sourceCategory="collector_volume"
| parse regex "\"(?<collector>[^\"]+)\"\:\{\"sizeInBytes\"\:(?<bytes>\d+),\"count\"\:(?<count>\d+)\}" multi
| timeslice 5m
| bytes/1024/1024/1024 as gbytes
| sum(gbytes) as gbytes by collector, _timeslice
| transpose row _timeslice column collector

Worked like a charm.

Also, in Kibana I could see that we were still getting the same throughput, so all of the places where we gauge throughput were showing the improved results.

@bin3377

bin3377 commented May 25, 2018

That's an awesome report and valuable analysis! Very happy the new version unblocks you. I will update the documentation today and push a new version to RubyGems so it's installable from the official channel. Thank you!

@bin3377

bin3377 commented May 30, 2018

The new version is on RubyGems now https://rubygems.org/gems/logstash-output-sumologic
Closing the issue.
