
Enabling basicauth causes CPU usage to spike #3462

Closed
mikelococo opened this issue May 30, 2020 · 17 comments · Fixed by #3465
Labels
optimization 📉 Performance or cost improvements
Milestone
2.1

Comments

mikelococo commented May 30, 2020

Enabling basicauth for a route causes CPU usage to spike significantly even at very modest request rates. On a tiny Linode instance (1 vCPU and 1GB of RAM, admittedly modest hardware), 10-20 requests per second are sufficient to occupy 60-80% of the single CPU available.

Here's a profile:

  • Generated using /debug/pprof/
  • Caddy was under modest load. I suspect one could reproduce this with any synthetic load, but I noticed it while browsing an authenticated reverse_proxy route to a netdata instance. So this was a single web-browser using javascript to fetch metrics maybe 10-ish times per second for in-browser display.
  • The profile shows that >19s of CPU time was used during the 30s sampling period, all from this single user's traffic.
  • The profile shows that over 92% of that CPU time was spent in blowfish/encryptBlock, presumably from repeatedly doing the expensive hashing of the user's password for each request:

[attached image: profile001, a pprof CPU profile graph dominated by blowfish/encryptBlock]

Perhaps caddyauth/BcryptHash or some similar function could be memoized to cache results for a given password input, reducing the number of bcrypt calculations that must be performed for a logged-in user submitting multiple requests with the same username/password.
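For illustration, here's a minimal Go sketch of the kind of memoization being suggested. Everything in it (the `authCache` type, the `verify` method) is hypothetical rather than Caddy's actual caddyauth code, and it deliberately ignores two issues discussed later in this thread: the map grows without bound, and its keys hold the plaintext credentials.

```go
package basicauthcache

import (
	"sync"

	"golang.org/x/crypto/bcrypt"
)

// authCache memoizes bcrypt verification results so that only the
// first request for a given username/password pair pays the full
// bcrypt cost; repeats become a cheap map lookup.
type authCache struct {
	mu    sync.Mutex
	known map[string]bool // "user\x00pass" -> verified?
}

func newAuthCache() *authCache {
	return &authCache{known: make(map[string]bool)}
}

func (c *authCache) verify(storedHash []byte, user, pass string) bool {
	key := user + "\x00" + pass // NUL separator avoids key ambiguity

	c.mu.Lock()
	ok, hit := c.known[key]
	c.mu.Unlock()
	if hit {
		return ok
	}

	// Cache miss: do the expensive bcrypt comparison exactly once.
	ok = bcrypt.CompareHashAndPassword(storedHash, []byte(pass)) == nil

	c.mu.Lock()
	c.known[key] = ok
	c.mu.Unlock()
	return ok
}
```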

@mholt mholt added this to the 2.x milestone May 30, 2020
@mholt mholt added the optimization 📉 Performance or cost improvements label May 30, 2020
mholt (Member) commented May 30, 2020

Thank you for profiling!! (As opposed to just complaining that basicauth is "slow" like so many people would usually do.)

So yeah, that's expected with a password hash like bcrypt. Caching the results is a good idea -- if we assume that the program memory is absolutely a safe place in the long term. (In general, we do assume this with Caddy since if your server is popped, it's game over already.)

Should caching it be opt-in or opt-out?

mikelococo (Author) commented:

> Thank you for profiling!! (As opposed to just complaining that basicauth is "slow" like so many people would usually do.)

👍 The debug endpoint makes profiling pretty straightforward, and I had no idea what was slow until I profiled; I didn't realize it was auth out of the gate.

> Caching the results is a good idea -- if we assume that the program memory is absolutely a safe place in the long term.

I'm no expert in modern AppSec, but it seems to me that unless we have an encrypted in-memory path for the basicauth header, that memory is already "tainted" and caching doesn't really change the game. I get that variable lifetime counts for something, but I'm not sure that it's a qualitative change.

> Should caching it be opt-in or opt-out?

Having a single user occupy most of an entire CPU seems like an untenable default condition, so I think I'd recommend opt-out... but maybe basicauth is niche enough these days that opt-in is viable. In my mind, the bigger concern would be controlling the tradeoff between cache size and hit rate, either with a global-scoped or a directive-scoped config that exposes whatever tweakables the underlying cache implementation has.

Though, to be clear, my own hobbyist use-case is the most trivial possible and would be improved by any caching at all regardless of configurability.

mikelococo (Author) commented:

Also, maybe if folks are worried about plaintext passwords hanging around forever in memory... perhaps a salted fast hash like SHA-512 could be the cache key used to look up the bcrypt value. Obviously fast hashes are susceptible to brute-force attacks if the cache memory is dumped, but at least one would have to conduct an attack at all... you wouldn't just see "my-aw3some-password!" hanging around in plaintext in a memory dump.
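A rough sketch of that keying scheme, again with hypothetical names (`cacheSalt`, `cacheKey`): a random per-process salt means a memory dump exposes only salted SHA-512 digests rather than the password itself, and the digests can't be precomputed in advance.

```go
package basicauthcache

import (
	"crypto/rand"
	"crypto/sha512"
)

// cacheSalt is generated once per process, so cached digests are
// useless across restarts and precomputed tables don't apply.
var cacheSalt = func() []byte {
	s := make([]byte, 32)
	if _, err := rand.Read(s); err != nil {
		panic(err) // no entropy available; fail loudly
	}
	return s
}()

// cacheKey derives a map key to use in place of the plaintext
// "user\x00pass" pair from the earlier sketch.
func cacheKey(user, pass string) [sha512.Size]byte {
	buf := make([]byte, 0, len(cacheSalt)+len(user)+1+len(pass))
	buf = append(buf, cacheSalt...)
	buf = append(buf, user...)
	buf = append(buf, 0) // separator avoids user/pass ambiguity
	buf = append(buf, pass...)
	return sha512.Sum512(buf)
}
```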

It's not obvious to me that the complexity is worth the protection it gives, but it's something to consider. I'll pipe down now; designing in-memory password handling is well outside my expertise.

mholt (Member) commented May 30, 2020

> Also, maybe if folks are worried about plaintext passwords hanging around forever in memory... perhaps a salted fast hash like SHA-512 could be the cache key used to look up the bcrypt value. Obviously fast hashes are susceptible to brute-force attacks if the cache memory is dumped, but at least one would have to conduct an attack at all... you wouldn't just see "my-aw3some-password!" hanging around in plaintext in a memory dump.

Funny, I thought of the same thing but didn't even bother suggesting it for the same reason. It makes attacks harder but the point of using expensive hashes in the first place is to... keep them expensive...

Go doesn't give us a lot of control over what stays in memory and what doesn't, except that not caching them is certainly more likely to allow them to be GC'ed sooner rather than later, whereas if we cache them they will absolutely stay in memory (either in plaintext or a weakly-hashed form) for a long time.

Meh, again, if the machine is so compromised that arbitrary memory can be read by the attacker, all bets are off anyway -- since the password is always plaintext coming in after TLS termination.

href commented May 31, 2020

I just stumbled upon the same issue when switching some of my personal sites to Caddy 2 from v1. I was wondering why the new Caddy was slower and more resource hungry than expected.

I have a few personal services behind basic auth, and they all return stuff from reverse proxies. Sites with lots of thumbnails would take quite a while to load, and I would see a big CPU spike. I use some low-cost VMs for some of my stuff, and there the impact is very noticeable.

Disabling basic auth makes the problem go away. I had already figured it might be due to the use of bcrypt/scrypt.

Personally I think caching would be fine, but I would also be totally happy with the ability to choose a fast hash method. Before this I stored the password in plaintext, since it's really not that critical a password.

Anyway, thanks to @mikelococo for confirming my suspicion and doing all the legwork 😅

href commented May 31, 2020

Oh, and one more thought: since the hash function is called for each request that carries a password, the current implementation could easily be misused to perform a simple denial-of-service attack.
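To put rough numbers behind that concern, here's a small standalone measurement using the golang.org/x/crypto/bcrypt package (timings depend on hardware and the configured cost, so treat the output as illustrative):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// One stored credential, hashed at bcrypt's default cost.
	hash, err := bcrypt.GenerateFromPassword([]byte("hunter2"), bcrypt.DefaultCost)
	if err != nil {
		panic(err)
	}

	// A wrong guess costs just as much to reject as the right
	// password costs to accept, which is what lets unauthenticated
	// request floods burn server CPU.
	const n = 20
	start := time.Now()
	for i := 0; i < n; i++ {
		_ = bcrypt.CompareHashAndPassword(hash, []byte("wrong-guess"))
	}
	fmt.Printf("avg per bcrypt comparison: %v\n", time.Since(start)/n)
}
```

At the default cost, a single comparison typically takes on the order of tens of milliseconds, so even a modest request rate can pin a small VM's CPU.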

mholt (Member) commented May 31, 2020

Yeah, okay, so probably opt-in caching is a good idea. It's a little more conservative to be opt-in and it's easier to transition from opt-in to opt-out than the other way around, I think.

@href

> the current implementation could easily be misused to perform a simple denial-of-service attack.

It's no different from any other "simple" DoS attack -- and caching hashes won't prevent it, since an attacker would just use a different password each time. As with other DoS attacks, if you're worried about this, implement some rate limiting or firewall rules, or put a CDN in front.

href commented May 31, 2020

It's true that caching won't help, but a fast hash function would. Fair enough, though: a regular login site would in principle experience the same issue.

mholt (Member) commented May 31, 2020

> It's true that caching won't help, but a fast hash function would.

As mentioned above, this would leave passwords more vulnerable to brute-force attacks.

So, really, DoS is not a particularly relevant concern here -- if you need to defend against that, there are rate limiting, firewalling, and CDN rules you can establish to do that properly.

mholt added a commit that referenced this issue Jun 1, 2020
Cache capacity is currently hard-coded at 1000 with random eviction.
It is enabled by default from Caddyfile configurations because I assume
this is the most common preference.
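For anyone curious what fixed-capacity random eviction can look like in Go, here's an illustrative sketch; it is not the code from PR #3465, just one cheap way to express the behavior the commit message describes:

```go
package basicauthcache

import "sync"

// maxEntries mirrors the hard-coded capacity from the commit message.
const maxEntries = 1000

type boundedCache struct {
	mu sync.Mutex
	m  map[string]bool
}

func newBoundedCache() *boundedCache {
	return &boundedCache{m: make(map[string]bool, maxEntries)}
}

func (c *boundedCache) put(key string, verified bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.m) >= maxEntries {
		// Go randomizes map iteration order, so deleting the first
		// key the range loop yields evicts an arbitrary entry with
		// no bookkeeping overhead.
		for k := range c.m {
			delete(c.m, k)
			break
		}
	}
	c.m[key] = verified
}
```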
mholt (Member) commented Jun 1, 2020

@mikelococo Can you please try the latest at #3465? It works for me and offers a considerable speedup (well, unnoticeable on my laptop, tbh), but I haven't done the fancy profiling on it... just used the time command.

@mholt mholt modified the milestones: 2.x, 2.1 Jun 1, 2020
mikelococo (Author) commented:

Looks good. I didn't profile, but I can confirm that my test use-case went from generating >60% CPU load from caddy alone to effectively 0% CPU load from caddy.

mikelococo (Author) commented:

Uh... maybe I spoke too soon. Profiling...

mikelococo (Author) commented Jun 2, 2020

Looks good after all. I had accidentally built and tested master instead of the hash-cache branch, so my initial result was a misreading. Not sure how I momentarily saw a reduction from the master build... probably a slow metrics response plus confirmation bias. Since I bothered to profile again, here it is (my Caddyfile has evolved a bit since the last test, but not in ways that matter here):

[attached image: profile_cached, a pprof CPU profile graph with caddyauth reduced to a small box]

It's hard to find caddyauth in there, but it's hanging off the big caddyhttp/wrapMiddleware box on the right, representing 30ms of 370ms of processing, down from 19.28s of 19.62s in the original profile.

mholt (Member) commented Jun 2, 2020

Great!! Small boxes, that's what we love to see. Thank you! (Had me worried for a hot second.)

Now, I will point to this in the future when people complain about Caddy's performance, as a constructive way to improve it. Thank you for the example.

mholt added a commit that referenced this issue Jun 2, 2020
douglasparker commented Aug 3, 2021

Enabling basic auth still causes extreme CPU usage for me while streaming NetData metrics. I have an 8 core CPU constantly pegged at 75-100% CPU usage while basic auth is enabled.

I really wanted to like Caddy but this needs to be addressed or I'll be forced to go back to nginx.

[attached image: netdata-cpu, showing sustained high CPU usage while streaming NetData metrics]

francislavoie (Member) commented:

@douglasparker this issue is closed as resolved, and nothing is actionable based on the information you've provided. Please open a new issue, filling out the information below.


It's not immediately clear to us what is going on, so we'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! It will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍

I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```


### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Helpful tips

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

@caddyserver caddyserver locked as resolved and limited conversation to collaborators Aug 3, 2021
mholt (Member) commented Aug 3, 2021

To follow up with anyone who is reading this: @douglasparker's new issue is #4267, and the problem there was staying on a development pre-release version of Caddy 2 that was over a year old (the problem had been fixed a year earlier).
