
APNs server does not respond to some notifications #816

Closed
jchambers opened this issue Oct 9, 2020 · 45 comments

Comments

@jchambers
Owner

jchambers commented Oct 9, 2020

We've received reports that, starting on or around September 19, 2020, APNs servers have stopped responding to some notifications. From Pushy's perspective, this can look like a Future that never resolves (completion handlers are never called and calls to .get() time out or wait forever). Please see the mailing list thread on this topic for additional background and discussion.

From HTTP/2 frame logs, the problem appears to be that the server simply never sends a HEADERS (or DATA) frame in response to a push notification and never closes the HTTP/2 stream associated with the notification.

At this point, the goal is to identify some specific notifications affected by this problem. If you've encountered this issue, we're hoping to get the UUIDs (apns-id) and approximate timestamps of some affected notifications in the interest of sharing information upstream. Because the problem is that the server isn't responding, you'll need to assign your own apns-id values to outbound notifications (SimpleApnsPushNotification has a pair of constructors that accept an apnsId argument; using UUID.randomUUID() is recommended) to be able to uniquely identify which notifications are having this problem.

This issue is intended to consolidate a number of other reports on this topic, including #807, #814, and #815.

@maelaouane

Hi @jchambers ,

We are facing the same issue as stated above. Unfortunately, we do not set apnsId on our notification since we build them with new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, null).

Nevertheless, we include a message-id in the payload to track a given message sent to APNs (I can DM you some examples on Twitter if needed).

We got 3k+ notification timeouts (5-second timeout on the get) on one cluster last Wednesday, and we managed to work around the issue by re-initializing the connection as follows:
client.close().await()
and then creating a new APNs client with a builder, as advised:

clientBuilder = new ApnsClientBuilder()
        .setApnsServer(host)
        .setSigningKey(ApnsSigningKey.loadFromPkcs8File(file, team, key));
client = clientBuilder.build();

Our environment: Pushy 0.13.10, JDK 1.8.

Issues started at the beginning of this week and concern all iOS users (iOS 12, 13, and 14).

Thanks!

@jchambers
Owner Author

Nevertheless, we include a message-id in the payload to track a given message sent to APNs (I can DM you some examples on Twitter if needed).

Thank you for the offer, but I'm almost positive that in-payload IDs won't be searchable by Apple. If you have an opportunity to include a UUID in your outbound notifications, we can use that going forward:

new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, UUID.randomUUID())

If you get an opportunity to give that a go, please do let me know and we'll figure things out from there. Thanks!

@QILI92

QILI92 commented Oct 10, 2020

Hi @jchambers might this swift-server-community/APNSwift#93 be the same cause?

@jchambers
Owner Author

might this swift-server-community/APNSwift#93 be the same cause?

It certainly sounds like it's the same problem, yes.

@vsharm22

@jchambers
As you suggested, we can pass a custom UUID using UUID.randomUUID(). I am trying to understand what benefit we will get from this. Will it resolve the JVM hang issue? As far as I can see, if we don't pass a UUID, Pushy creates one internally, sends it, and returns the same identifier.

Also, if I set a timeout on future.get(), we will receive a TimeoutException after the specified timeout. Would it be a reasonable solution to catch that exception, create a new APNs client, and resend the same notification with the new client?

Please share your input. Thank you.
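The "time out, then rebuild the client" idea above can be sketched with plain java.util.concurrent. Here, sendFuture stands in for the future returned by a real send call, and rebuildClient for closing the old client and building a fresh one; both names are hypothetical, not Pushy API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRetrySketch {

    // Returns true if the notification resolved in time; otherwise runs the
    // caller-supplied rebuild action and returns false so the caller can retry.
    static boolean awaitOrRebuild(CompletableFuture<String> sendFuture,
                                  Runnable rebuildClient,
                                  long timeoutMillis)
            throws ExecutionException, InterruptedException {
        try {
            // Wait a bounded amount of time for the notification to resolve.
            sendFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stream never resolved; discard the (possibly wedged) client
            // and build a fresh one before retrying the notification.
            rebuildClient.run();
            return false;
        }
    }
}
```

Note that, as discussed further down this thread, retrying with a fresh client does not release the stalled stream on the old connection; this only restores throughput going forward.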

@maelaouane

Hi @jchambers,

Thank you for your reply. I will add the apns-id to the notification creation call. We have some additional information to share. The first exception we encountered was a timeout:

send: Failed to send push notification. Encoutered Exception: java.util.concurrent.TimeoutException
--
java.util.concurrent.TimeoutException at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:56)

We have found in another deployment a different error:

send: Failed to send push notification. Encoutered Exception: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Client has been closed and can no longer send push notifications.
--
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Client has been closed and can no longer send push notifications.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)

In fact, the service has been restarted in order to re-initialize the connection:

15:04:26.885 [ApnsClient][O]Shutting down.

But we had to wait more than 30 seconds before the service could be stopped:

Future<Void> disconnectFuture = client.close().await();
if (disconnectFuture.isSuccess()) {
    LOG.info("destroyConnector: Disconnect %s successful", disconnectFuture.toString());
}

The resulting trace:
15:04:59.058 [APNSClientConnector][O]destroyConnector: Disconnect DefaultPromise@7a34f4ac(success) successful

Further restarts were more straightforward:

2020/10/09-15:05:10.774 [ApnsClient][O]Shutting down.
2020/10/09-15:05:12.985 [APNSClientConnector][O]destroyConnector: Disconnect DefaultPromise@20abb2c9(success) successful

@jchambers Is it normal for client shutdown to block for more than half a minute?
@vsharm22 Re-initializing the connection seems to be a workaround that gets notifications flowing again, but we are not sure whether it is a viable solution.

Thanks!

@julienpouget

Hi @jchambers

We're experiencing the same issue on our platform. As a workaround we periodically re-initialize the connection, but this is definitely not a viable solution.

Are you still investigating this? Is there any news you can share?

Thanks !

@jchambers
Owner Author

Yes, I'm still investigating the issue. Thank you for your patience.

@huangjunchao

Hi @jchambers, I have the same problem. After the APNs client has been running for about an hour, I stop getting responses from APNs and user devices no longer receive messages. I send messages with:

new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, UUID.randomUUID())

UUIDs of notifications that were never delivered:

  1. 2a77ab4f-4d7c-4968-b036-11ab3795994f
  2. 213d79e0-d339-462c-bf80-a0df193b5531

@vsharm22

Hi @jchambers ,

Since we added a timeout to future.get(), we have started getting TimeoutExceptions. We catch the exception and, in the catch block, try to cancel the future task as shown in the code snippet below. But our application still hangs, and after analyzing a thread dump we observed that the task is still not canceled. Could you please help us understand why we are not able to cancel the task?

Sorry for the inconvenience.

catch (Exception e) {
    if (sendNotificationFuture != null) {
        if (sendNotificationFuture.isCancellable()) {
            notificationLogger.debug(String.format("The task is cancellable for client_id: [%s] and identifier: [%s].",
                    request.getClientId(), apnsMessage.getIdentifier()));

            if (!sendNotificationFuture.isCancelled()) {
                notificationLogger.debug(String.format("The task is NOT canceled. Now attempting to cancel the task for client_id: [%s] and identifier: [%s].",
                        request.getClientId(), apnsMessage.getIdentifier()));
                sendNotificationFuture.cancel(true);
            }
        } else {
            notificationLogger.debug(String.format("The task is NOT cancellable. Cannot cancel the task for client_id: [%s] and identifier: [%s].",
                    request.getClientId(), apnsMessage.getIdentifier()));
        }
    }

    notificationResult = new NotificationWithFullMessageResultImpl(NotificationResultStatus.FAILURE, false);
    apnsMessage = null;
    notificationLogger.error(String.format("Could not notify endpoint [%s]", request.getEndpoint()), e);
    notificationLogger.error(String.format("Could not notify endpoint for client_id: [%s]", request.getClientId()), e);
}

@jchambers
Owner Author

Pushy generally hasn't added support for canceling futures. Part of the problem is that once a notification has been written, it has consumed an HTTP/2 stream. Even if we cancel the future locally, streams are limited by the server, and we won't be able to "reclaim" the stream until the notification resolves remotely.
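The stream-exhaustion behavior Jon describes is also why callers who cap in-flight notifications with a Semaphore see permits leak when streams stall. A minimal sketch of that pattern (the class and method names are illustrative, not Pushy API; sendFuture stands in for the future a real send call returns):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public class InFlightLimiter {
    private final Semaphore permits;

    public InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // A permit is acquired before each send and released only when the future
    // completes, so a notification whose stream never resolves holds its
    // permit forever, just as a stalled HTTP/2 stream stays consumed.
    public void send(CompletableFuture<?> sendFuture) throws InterruptedException {
        permits.acquire();
        sendFuture.whenComplete((response, cause) -> permits.release());
    }

    public int availablePermits() {
        return permits.availablePermits();
    }
}
```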

@vsharm22

@jchambers

Thank you for the explanation. Since we are continuously facing this issue, can you suggest a workaround to use while you investigate?

@huangjunchao

Hi @jchambers, at present we still face the following problems:

With asynchronous requests, our APNs push service can only run for about two hours. For now, we plan to perform a synchronous send every 10 minutes; if it times out, we will close the connection and reinitialize it. But we don't know whether this will solve the problem. Can you give us some suggestions? Thank you very much!
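The periodic synchronous probe described above might be sketched like this; sendProbe and reconnect are hypothetical stand-ins for the real Pushy calls (sending one notification and rebuilding the client, respectively), not Pushy API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class ConnectionWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Sends one probe notification and waits synchronously. Returns true if
    // the server responded in time; otherwise runs the reconnect action.
    public static boolean probeOnce(Supplier<CompletableFuture<?>> sendProbe,
                                    Runnable reconnect,
                                    long timeoutMillis)
            throws ExecutionException, InterruptedException {
        try {
            sendProbe.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;  // the server responded; connection looks healthy
        } catch (TimeoutException e) {
            reconnect.run();  // stream stalled; tear down and rebuild the client
            return false;
        }
    }

    // Runs the probe at a fixed interval (e.g. every 10 minutes).
    public void start(Supplier<CompletableFuture<?>> sendProbe,
                      Runnable reconnect,
                      long periodMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                probeOnce(sendProbe, reconnect, TimeUnit.SECONDS.toMillis(30));
            } catch (Exception ignored) {
                // A failed probe is handled by reconnect; don't kill the scheduler.
            }
        }, periodMinutes, periodMinutes, TimeUnit.MINUTES);
    }
}
```

Note the caveat from earlier in the thread: reconnecting restores throughput but does not recover notifications already stalled on the old connection, so those should still be retried.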

@jchambers
Owner Author

Friends, I understand this is a serious problem for many of you. I promise I'll share updates as soon as they're available.

@petrdvorak
Contributor

@jchambers Did you open a radar issue for that? I opened FB8816555 but would reference your issue to point to a possible duplicate...

@jchambers
Owner Author

Did you open a radar issue for that?

I didn't; I've been working through other channels.

@jchambers
Owner Author

Folks, I'll be putting out a new build shortly that includes some more verbose/specific logging that I hope will help get to the bottom of this issue. I don't think it will solve the problem in its own right, but it should help get us the information we need to make more progress.

@jchambers
Owner Author

Folks, Pushy 0.14.2 has just been released and should be making its way to Maven Central within the next hour or so. Could you please update to the latest version? It includes a few logging changes that could be helpful in getting to the root of this problem. In particular, turning on DEBUG logging for com.eatthepath.pushy.apns.ApnsChannelFactory will help us understand when channels are opened/closed, which could be a really important clue.
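If you use Logback, enabling the requested DEBUG logging might look like the fragment below. The logger name is taken from Jon's comment; the rest of the file is an illustrative sketch, and the equivalent setting applies to whatever SLF4J backend you use:

```xml
<configuration>
  <!-- Channel open/close events from Pushy 0.14.2 -->
  <logger name="com.eatthepath.pushy.apns.ApnsChannelFactory" level="DEBUG"/>
</configuration>
```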

Thank you!

@jchambers
Owner Author

With thanks to @lkesteloot and @dcollens, we now have some high-quality frame logs showing:

  1. A bunch of streams getting sent/acknowledged by the server
  2. One single stream getting sent, but not acknowledged
  3. A bunch of subsequent streams on the same connection getting acknowledged by the server

Another curious observation is that this problem seems to affect a small subset of device tokens. As an example, most "stalled" notifications are headed for device tokens ABC123 and DEF456, but not all notifications for those tokens will fail. Is this behavior consistent with other folks' experience?

@papo2608

@jchambers we can confirm exactly that behavior. We already talked to Apple, and they told us that those requests never hit their application layer; their server team is now investigating.

@huangjunchao

@papo2608 Did Apple say when it will be fixed?

@robertoprato

Looks like this issue is the same as #787; we have been experiencing this on and off for at least the last 6 months. We resorted to expiring the future and recreating the ApnsClient when messages are not being sent.

@jchambers
Owner Author

Folks, I've just received word that the problem may have been fixed upstream. Could you please check whether you're still experiencing this problem or if it appears to have been resolved and report back?

Thanks!

@christophemaillot

Folks, I've just received word that the problem may have been fixed upstream. Could you please check whether you're still experiencing this problem or if it appears to have been resolved and report back?

It seems so. Our platform had been experiencing this issue since mid-September; we had to reinitialize the client 3 or 4 times a day. But we have just run 4 days in a row without any notification hitches.

@vsharm22

received word that the problem may have been fixed upstream.

@jchambers ,
When you say upstream, what do you mean? Apple's servers?

@jchambers
Owner Author

When you say upstream, what do you mean? Apple's servers?

Yes, I mean that I believe Apple fixed a problem on their end.

@leiwei999

Is it possible to reproduce the same event (the APNs server not responding) using this mock server?
https://pushy-apns.org/apidocs/0.13/com/eatthepath/pushy/apns/server/MockApnsServerBuilder.html
https://pushy-apns.org/apidocs/0.13/com/eatthepath/pushy/apns/server/MockApnsServer.html

Thanks!

@jchambers
Owner Author

Is it possible to reproduce the same event (the APNs server not responding) using this mock server?

Respectfully, I think that's beyond the scope of the current issue and sounds more like a request for general support. @leiwei999 could you please move this discussion to the mailing list instead?

@babuv2

babuv2 commented Oct 28, 2020

Hi @jchambers,

Jon, we are currently on version 0.13.9 and are still facing the issue. We are restarting our service periodically to make sure the streams don't all become blocked.

  1. Do we need to update to the latest version for the issue to be fixed?
  2. If the issue was fixed upstream (on Apple's side), do we have any tickets that you opened with Apple? If you have a ticket ID, could you share it so that I can pass it along to our stakeholders?

Thanks In Advance Jon
Vivek

@petrdvorak
Contributor

@babuv2 We opened issue FB8816555 with Apple for this. It is still open on Apple's side and does not reference any duplicates. Updating the Pushy version is a good idea anyway, but as for this issue, I don't think it will help.

@babuv2

babuv2 commented Oct 28, 2020

@petrdvorak Great. Understood. Thank you very much Petr

Vivek

@jchambers
Owner Author

Folks,

If you're still experiencing this issue, please:

  1. Let us know, of course!
  2. Update to the latest version of Pushy (currently 0.14.2). This is unlikely to address the issue, but includes some additional logging that may help us identify what's happening.
  3. Send messages with locally-assigned apns-id values (UUID.randomUUID() is fine!). The more detailed SimpleApnsPushNotification constructors will accept an APNs ID as an argument.
  4. Turn on HTTP/2 frame logging.

For recurrences of this issue, we need to be able to show which notifications are getting lost (a timestamp and APNs ID will cover this) and what else was happening on the connection at the time (the frame logs will cover this).

Thank you very much for your continued patience and support!
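Steps 3 and 4 above imply a little bookkeeping on the sender's side. A minimal sketch (the class and method names are illustrative, not Pushy API) of recording locally-assigned apns-id values with send timestamps, so that stalled notifications can be reported with a UUID and an approximate time:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ApnsIdTracker {
    private final Map<UUID, Instant> inFlight = new ConcurrentHashMap<>();

    // Assign a fresh apns-id and remember when the notification was sent.
    // The returned UUID would be passed as the apnsId constructor argument
    // of SimpleApnsPushNotification.
    public UUID register() {
        final UUID apnsId = UUID.randomUUID();
        inFlight.put(apnsId, Instant.now());
        return apnsId;
    }

    // Called when the server responds (success or error); nothing to report.
    public void resolved(UUID apnsId) {
        inFlight.remove(apnsId);
    }

    // Anything still present long after sending is a candidate "lost"
    // notification to report, together with its timestamp.
    public Map<UUID, Instant> unresolved() {
        return new HashMap<>(inFlight);
    }
}
```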

@janzar

janzar commented Oct 30, 2020

Hello,

we too experienced a lot of timeouts on 27, 28, and 30 October. We have already updated Pushy to 0.14.2. Some apns-ids:
Oct 30 14:50:03
apns-id: 00000083-ebe9-9f57-3579-54f778027d82
Oct 30 15:30:06
apns-id: 0000007a-18ba-2071-3579-79a280009042
Oct 30 16:16:32
apns-id: 00000085-a985-245b-3579-a4279f842e65
Oct 30 16:40:05
apns-id: 00000085-f011-0f9d-3579-b9b27739283a

Times are in GMT+03. If we retry a push, we use the same apns-id; hopefully that won't make them hard to identify.

Thanks,
Zigurds

@mkieloch-352

I was also experiencing this issue at my job. It seemed to increase with higher traffic. At the time, we were sending around 300-350 notifications per minute and our JVMs were locking up. Updating Pushy seems to have resolved the issue, but we will monitor closely.

@jchambers
Owner Author

Even though there are still some reports of occasional timeouts, it sounds like the main issue where HTTP/2 streams were getting lost entirely has been resolved upstream.

I'm going to mark this issue as "resolved," but please let me know if you think that's a mistake.

@lkesteloot

Thank you Jon for dealing with this!

@floifyarul

We have been seeing a ton of timeouts today versus the usual handful per day. We are still on an older release of Pushy (0.11.0). Is anyone else experiencing these timeouts today?

@dcollens

dcollens commented Dec 7, 2020

Yeah, it looks like we have about 25 stranded requests today (they never completed, and have consumed our Semaphore count). I don't have frame logs for them, though. Around 1-2pm Eastern time, roughly.

@floifyarul

OK, thanks. A problem upstream? For us, it started around 11:14am Eastern time and is still happening.

@jchambers
Owner Author

Uuuuuuuugh… while I don't doubt that this is the same problem, frame logs and UUIDs of affected messages would be a big help in diagnosing this (even if that just means forwarding that information to Apple). I do hope this is just a brief hiccup that self-resolves, but if anybody's in a position to capture logs, that'd be awesome.

In the meantime, I'll try to think through some appropriate resiliency strategies for "sometimes HTTP/2 streams just disappear."

@floifyarul

Thank you, Jon. We ran fine yesterday and returned to a normal error rate. Do you think upgrading to the newest version will help improve the situation?

@jchambers
Owner Author

I always recommend using the latest version, but we haven't shipped anything that addresses this situation specifically.

@floifyarul

ok thanks

@babuv2

babuv2 commented Dec 21, 2020

@jchambers Jon, we have:

  1. Updated to the latest pushy version (0.14.2)
  2. Started sending messages with locally-assigned apns-id.

We are still facing the issue on a regular basis. Please find below some of the failed apns-ids:

5eb1bdac-f632-43d7-9060-8b77473c5f64
76a52c3e-f559-4a71-ab09-4646af9288ce
c883a202-9639-4c01-93e4-3cf506e3cff7
2f1e02e1-6bf9-4e8d-8687-b20d9effb6e0
e3447696-d68e-495a-99b8-ec280161505a

Any help in debugging/fixing this is highly appreciated.

Thanks In Advance

Vivek

@babuv2

babuv2 commented Jan 8, 2021

Hi @jchambers,
Jon, can we reopen this ticket, since we are facing this issue regularly? I have already shared some apns-ids; we would be happy to provide more if you need them.

Thanks In Advance
Vivek
