
Token refresh doesn't work in traffic storm #1193

Closed
inlined opened this issue Mar 10, 2021 · 10 comments

inlined commented Mar 10, 2021

This is a reassignment to close out firebase/firebase-functions#579. The customer's function makes a very large number of Admin SDK API calls every 15 minutes. Previously it would crash with 401 errors because the API server no longer recognized the token as valid. The customer was able to work around the error by calling admin.app().INTERNAL.getToken(/*force_refresh=*/true) before their code ran, so it's definitely an Admin SDK bug.

Given the frequency of the crashes, I expect that this bug happens right around the point where a token is about to expire and the Admin SDK isn't properly refreshing it. I have two theories for what could cause this:

  1. There may be a bug in the code that actually refreshes the tokens. For example, here is a bug where a token that is due to expire in the [1m, 2m) window will never get refreshed.
  2. The network pool for the auth token may be full of serialized requests and take longer to flush than the token is valid for.

My recommendation would be something surgical, like changing this line to subtract a few minutes from the actual expiration time so that we don't serialize requests with a token that's about to expire.
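
For illustration, that fix amounts to treating a token as expired a few minutes early. A minimal sketch, assuming a hypothetical isTokenStillUsable helper and a five-minute skew (neither is actual Admin SDK code):

// Treat a token as expired EXPIRY_SKEW_MS early, so a token with only
// seconds of real lifetime left is refreshed instead of handed out.
const EXPIRY_SKEW_MS = 5 * 60 * 1000; // 5 minutes, illustrative

function isTokenStillUsable(cachedToken) {
  return cachedToken.expirationTime - EXPIRY_SKEW_MS > Date.now();
}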

hiranya911 commented:

@inlined how is this function staying hot for 15 minutes (and possibly longer) to experience a token expiration?

inlined commented Mar 10, 2021

It's a periodic function that runs every 15m. Thanks to the (newer) architecture in the Cloud Functions backend, the code that implements metadata.google.internal lives in the load balancer, not the container, anymore. This means there can be aggressive caching across all functions and even across container death. We can no longer assume that a "fresh" token will last very long, other than that I think the backend returns a fresh token within a few minutes of the cached token expiring. We can try to dig into the backend source together if you want.

inlined assigned hiranya911 and unassigned inlined on Mar 10, 2021

hiranya911 commented Mar 10, 2021

Thanks @inlined.

@trentbrooks, can I ask you to share the token expiry time logs you've collected from your runs? I see you have shared a couple of log entries at firebase/firebase-functions#579 (comment). Do you have any more?

Update: Also, how long does each invocation of your function last, approximately (i.e., the time to send all notifications)?


trentbrooks commented Mar 11, 2021

Hi @hiranya911

Here are a couple of screenshots from today with the logs. It's been working consistently since @inlined suggested the fix a few days ago. The function is scheduled to run every 15 minutes and sends 250-300 topic messages (promises); each run finishes in 1500-5500 milliseconds.

https://pasteboard.co/JS4AoFY.png
https://pasteboard.co/JS4AENC.png

Code:
let now = Date.now();
// Force-refresh the token up front, and note how long it's valid for.
let token = await admin.app().INTERNAL.getToken(/* forceRefresh= */ true);
let promises = sendTopicNotifications(indicators, timeframe, exchange);
console.log("promises length", promises.length, "force token expire", (token.expirationTime - now) / 1000);
return Promise.all(promises);


hiranya911 commented Mar 11, 2021

Thanks for verifying the execution time of your triggers. That takes some of the uncertainty out of the picture. It looks like you're logging the token expiry time after a forced refresh. @trentbrooks, can you please confirm that's indeed what we're seeing in the logs?

Assuming that's the case, this is the summary of the log:

3:00 → 900s (15 min)
3:15 → 1800s (30 min)
3:30 → not visible
3:45 → 1800s (30 min)
4:00 → 900s (15 min)
4:15 → 1800s (30 min)

@inlined does this look reasonable to you? Why would the auth/metadata server send a 15 min token at 3:00, and then send a 30 min token at 3:15? We see that again at 4:00 and 4:15. Is it juggling multiple tokens for the same client?

Ideally I would have expected to see some tokens with the full 60-minute expiry time, and even a few with a 45-minute expiry time. @trentbrooks, do you see anything like that in the remainder of your logs?


inlined commented Mar 11, 2021

I'll try to comment while using as few codewords as possible...

In App Engine, Cloud Functions, and Cloud Run, there's a single server (we'll call it the "load balancer") per machine that is responsible for creating and deleting the sandboxes that run users' code. It provides a fully virtualized environment, including the network stack, and may directly implement things like the filesystem or HTTP calls to metadata.google.internal. Those token requests are handled by code in the load balancer that places an aggressive cache in front of the actual call to mint new tokens.

Because the cache lives in the load balancer, it can be refreshed from any instance of the function, and the TTL is effectively global. As to why we're getting 30m tokens instead of 60m tokens... I can't answer that. But it seems that we get 30m tokens and, before Trent started doing forced refreshes, would execute with only a few seconds remaining.

My suspicion is that the token had less than 5s remaining; since it wasn't expired yet, we returned it, and the requests then queued in the socket pool until the token was no longer valid.
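
To make that failure mode concrete, a caller-side guard against it could look like the sketch below; getUsableToken and the 30-second threshold are hypothetical, not part of the SDK:

// If the cached token is within 30 seconds of expiring, force a refresh
// rather than queueing requests behind a token that may lapse in flight.
const MIN_REMAINING_MS = 30 * 1000;

async function getUsableToken(app) {
  let token = await app.INTERNAL.getToken(/* forceRefresh= */ false);
  if (token.expirationTime - Date.now() < MIN_REMAINING_MS) {
    token = await app.INTERNAL.getToken(/* forceRefresh= */ true);
  }
  return token;
}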

hiranya911 commented:

I assume there's only one instance of this particular function (the logs certainly seem to suggest so). That makes it even stranger that we are not getting 60min tokens.

Also, this implies the SDK's built-in proactive token refresher (which uses Node.js timers) no longer works in Functions.
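
For context, a proactive refresher of that kind works roughly like the sketch below; this is a simplification, not the SDK's actual implementation, and scheduleRefresh and the five-minute lead time are illustrative:

// Schedule a forced token refresh shortly before the current token
// expires, then re-arm the timer for the next token.
function scheduleRefresh(app, token) {
  const leadTimeMs = 5 * 60 * 1000;
  const delayMs = Math.max(token.expirationTime - Date.now() - leadTimeMs, 0);
  setTimeout(async () => {
    const fresh = await app.INTERNAL.getToken(/* forceRefresh= */ true);
    scheduleRefresh(app, fresh);
  }, delayMs);
}

If the runtime freezes the container between invocations, a timer like this never fires, which would explain why the proactive refresh stops helping here.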

trentbrooks commented:

Here are a couple more pages of fresh logs from today. All working as expected, so no complaints from me. Happy to change the logging if something else would be more useful.

https://pasteboard.co/JSbmQHW.png
https://pasteboard.co/JSbneVq.png

Yes, there is one instance of this particular function, '15m_binance', which does the forced token refresh. I have similar functions (without the forced token refresh) for the 1h, 4h, 1d (daily), and 1w (weekly) timeframes. Is the refreshed admin token shared across all functions? Or is each function completely independent and requests its own token?

Note: I also have 5 similar scheduled functions for a different exchange, i.e. '15m_fx', '1h_fx', etc. I have not touched any of these yet, and the 15-minute one is still producing the 401 auth error regularly. Happy to provide any assistance.

hiranya911 commented:

Is the refreshed admin token shared across all functions? Or is each function completely independent and requests its own token?

It used to be independent, but with the new architecture it's anybody's guess. Maybe that's why we don't see 45min and 60min tokens in the logs? Maybe the token is being refreshed at the load balancer by other function instances?

hiranya911 commented:

@inlined I've proposed a set of changes in a new PR that should solve this type of issue. PTAL.
