
Token refresh doesn't work in traffic storm #1193

Closed
inlined opened this issue Mar 10, 2021 · 10 comments

inlined commented Mar 10, 2021

This is a reassignment to close out firebase/firebase-functions#579. The customer's function makes a very large number of Admin SDK API calls every 15 minutes. Previously it would crash with 401 errors because the API server no longer recognized the token as valid. The customer was able to work around the error by calling admin.app().INTERNAL.getToken(/*force_refresh=*/true) before their code ran, so it's definitely an Admin SDK bug.

Given the frequency of the crashes, I expect that this bug happens right around the point where a token is about to expire and the Admin SDK isn't properly refreshing it. I have two theories for what could cause this:

  1. There may be a bug in the code that actually refreshes the tokens. For example, here is a bug where a token that is due to expire in the [1m, 2m) window will never get refreshed.
  2. The network pool for the auth token may be full of serialized requests and take longer to flush than the token is valid for.

My recommendation would be something surgical, like changing this line to subtract a few minutes from the actual expiration time so that we don't serialize requests with a token that's about to expire.
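
For illustration, that fix amounts to treating a token as expired a few minutes early. A minimal sketch, assuming a hypothetical isTokenStillUsable helper and a five-minute skew (neither is actual Admin SDK code):

// Treat a token as expired EXPIRY_SKEW_MS early, so a token with only
// seconds of real lifetime left is refreshed instead of handed out.
const EXPIRY_SKEW_MS = 5 * 60 * 1000; // 5 minutes, illustrative

function isTokenStillUsable(cachedToken) {
  return cachedToken.expirationTime - EXPIRY_SKEW_MS > Date.now();
}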

hiranya911 commented:

@inlined how is this function staying hot for 15 minutes (and possibly longer) to experience a token expiration?

inlined commented Mar 10, 2021

It's a periodic function that runs every 15m. Thanks to the (newer) architecture in the Cloud Functions backend, the code that implements metadata.google.internal lives in the load balancer, not the container, anymore. This means there can be aggressive caching across all functions and even across container death. We can no longer assume that a "fresh" token will last very long, other than that I think the backend returns a fresh token within a few minutes of the cached token expiring. We can try to dig into the backend source together if you want.

inlined assigned hiranya911 and unassigned inlined on Mar 10, 2021

hiranya911 commented Mar 10, 2021

Thanks @inlined.

@trentbrooks, can I ask you to share the token expiry time logs you've collected from your runs? I see you have shared a couple of log entries at firebase/firebase-functions#579 (comment). Do you have any more?

Update: Also, how long does each invocation of your function last, approximately (i.e., the time to send all notifications)?


trentbrooks commented Mar 11, 2021

Hi @hiranya911

Here are a couple of screenshots from today with the logs. It's been working consistently since @inlined suggested the fix a few days ago. The function is scheduled to run every 15 minutes and sends 250-300 topic messages (promises); each run finishes in 1500-5500 milliseconds.

https://pasteboard.co/JS4AoFY.png
https://pasteboard.co/JS4AENC.png

Code:
let now = Date.now();
// Force-refresh the token up front, and note how long it's valid for.
let token = await admin.app().INTERNAL.getToken(/* forceRefresh= */ true);
let promises = sendTopicNotifications(indicators, timeframe, exchange);
console.log("promises length", promises.length, "force token expire", (token.expirationTime - now) / 1000);
return Promise.all(promises);


hiranya911 commented Mar 11, 2021

Thanks for verifying the execution time of your triggers. That takes some of the uncertainty out of the picture. It looks like you're logging the token expiry time after a forced refresh. @trentbrooks, can you please confirm that's indeed what we're seeing in the logs?

Assuming that's the case, this is the summary of the log:

3:00 → 900s (15 min)
3:15 → 1800s (30 min)
3:30 → not visible
3:45 → 1800s (30 min)
4:00 → 900s (15 min)
4:15 → 1800s (30 min)

@inlined does this look reasonable to you? Why would the auth/metadata server send a 15 min token at 3:00, and then send a 30 min token at 3:15? We see that again at 4:00 and 4:15. Is it juggling multiple tokens for the same client?

Ideally I would have expected to see some tokens with the full 60-minute expiry time, and even a few with a 45-minute expiry time. @trentbrooks, do you see anything like that in the remainder of your logs?


inlined commented Mar 11, 2021

I'll try to comment while using as few codewords as possible...

In App Engine, Cloud Functions, and Cloud Run, there's a single server (we'll call it the "load balancer") per machine that is responsible for creating and deleting the sandboxes that run users' code. It provides a fully virtualized environment, including the network stack, and may directly implement things like the filesystem or HTTP calls to metadata.google.internal. Those token requests are handled by code in the load balancer that places an aggressive cache in front of the actual call to mint new tokens.

Because the cache lives in the load balancer, it can be refreshed from any instance of the function, and the TTL is effectively global. As to why we're getting 30m tokens instead of 60m tokens... I can't answer that. But it seems that we get 30m tokens and, before Trent started doing forced refreshes, would execute with only a few seconds remaining.

My suspicion is that the token had less than 5s remaining; since it wasn't expired yet, we returned it, and the requests then queued in the socket pool until the token was no longer valid.
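
To make that failure mode concrete, a caller-side guard against it could look like the sketch below; getUsableToken and the 30-second threshold are hypothetical, not part of the SDK:

// If the cached token is within 30 seconds of expiring, force a refresh
// rather than queueing requests behind a token that may lapse in flight.
const MIN_REMAINING_MS = 30 * 1000;

async function getUsableToken(app) {
  let token = await app.INTERNAL.getToken(/* forceRefresh= */ false);
  if (token.expirationTime - Date.now() < MIN_REMAINING_MS) {
    token = await app.INTERNAL.getToken(/* forceRefresh= */ true);
  }
  return token;
}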

hiranya911 commented:

I assume there's only one instance of this particular function (the logs certainly seem to suggest so). That makes it even stranger that we are not getting 60min tokens.

Also, this implies the SDK's built-in proactive token refresher (which uses Node.js timers) no longer works in Functions.
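
For context, a proactive refresher of that kind works roughly like the sketch below; this is a simplification, not the SDK's actual implementation, and scheduleRefresh and the five-minute lead time are illustrative:

// Schedule a forced token refresh shortly before the current token
// expires, then re-arm the timer for the next token.
function scheduleRefresh(app, token) {
  const leadTimeMs = 5 * 60 * 1000;
  const delayMs = Math.max(token.expirationTime - Date.now() - leadTimeMs, 0);
  setTimeout(async () => {
    const fresh = await app.INTERNAL.getToken(/* forceRefresh= */ true);
    scheduleRefresh(app, fresh);
  }, delayMs);
}

If the runtime freezes the container between invocations, a timer like this never fires, which would explain why the proactive refresh stops helping here.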

trentbrooks commented:

Here are a couple more pages of fresh logs from today. All working as expected, so no complaints from me. Happy to change the logging if something else would be more useful.

https://pasteboard.co/JSbmQHW.png
https://pasteboard.co/JSbneVq.png

Yes, there is one instance of this particular function, '15m_binance', which does the forced token refresh. I have similar functions (without the forced token refresh) for the 1h, 4h, 1d (daily), and 1w (weekly) timeframes. Is the refreshed admin token shared across all functions? Or is each function completely independent and requests its own token?

Note: I also have 5 similar scheduled functions for a different exchange, i.e. '15m_fx', '1h_fx', etc. I have not touched any of these yet, and the 15-minute one is still producing the 401 auth error regularly. Happy to provide any assistance.

hiranya911 commented:

Is the refreshed admin token shared across all functions? Or is each function completely independent and requests its own token?

It used to be independent, but with the new architecture it's anybody's guess. Maybe that's why we don't see 45min and 60min tokens in the logs? Maybe the token is being refreshed at the load balancer by other function instances?

hiranya911 commented:

@inlined I've proposed a set of changes in a new PR that should solve this type of issue. PTAL.
