
APNs server does not respond to some notifications #816

Closed
jchambers opened this issue Oct 9, 2020 · 45 comments

Comments

@jchambers
Owner

jchambers commented Oct 9, 2020

We've received reports that, starting on or around September 19, 2020, APNs servers have stopped responding to some notifications. From Pushy's perspective, this can look like a Future that never resolves (completion handlers are never called and calls to .get() time out or wait forever). Please see the mailing list thread on this topic for additional background and discussion.

From HTTP/2 frame logs, the problem appears to be that the server simply never sends a HEADERS (or DATA) frame in response to a push notification and never closes the HTTP/2 stream associated with the notification.

At this point, the goal is to identify some specific notifications affected by this problem. If you've encountered this issue, we're hoping to get the UUIDs (apns-id) and approximate timestamps of some affected notifications in the interest of sharing information upstream. Because the problem is that the server isn't responding, you'll need to assign your own apns-id values to outbound notifications (SimpleApnsPushNotification has a pair of constructors that accept an apnsId argument; using UUID.randomUUID() is recommended) to be able to uniquely identify which notifications are having this problem.

This issue is intended to consolidate a number of other reports on this topic, including #807, #814, and #815.

@maelaouane

Hi @jchambers ,

We are facing the same issue as stated above. Unfortunately, we do not set apnsId on our notification since we build them with new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, null).

Nevertheless, we include a message-id in the payload to track a given message sent to APNs (I can DM you some examples on Twitter if needed).

We got 3k+ notification timeouts (5-second timeout on the get) on one cluster last Wednesday, and we managed to work around the issue by re-initializing the connection as follows:
client.close().await()
and then creating a new APNs client with a builder, as advised:

clientBuilder = new ApnsClientBuilder()
        .setApnsServer(host)
        .setSigningKey(ApnsSigningKey.loadFromPkcs8File(file, team, key));
client = clientBuilder.build();

Our environment: Pushy 0.13.10, JDK 1.8.

Issues started at the beginning of this week and concern all iOS users (iOS 12, 13, and 14).

Thanks!

@jchambers
Owner Author

Nevertheless, we include a message-id in the payload to track a given message sent to APNs (I can DM you some examples on Twitter if needed).

Thank you for the offer, but I'm almost positive that in-payload IDs won't be searchable by Apple. If you have an opportunity to include a UUID in your outbound notifications, we can use that going forward:

new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, UUID.randomUUID())

If you get an opportunity to give that a go, please do let me know and we'll figure things out from there. Thanks!

@QILI92

QILI92 commented Oct 10, 2020

Hi @jchambers might this swift-server-community/APNSwift#93 be the same cause?

@jchambers
Owner Author

might this swift-server-community/APNSwift#93 be the same cause?

It certainly sounds like it's the same problem, yes.

@vsharm22

@jchambers
As you suggested, we can pass a custom UUID using UUID.randomUUID(). I am trying to understand what benefit we will get from this. Will it resolve the JVM hang issue? As far as I can see, if we don't pass a UUID, Pushy creates one internally, sends it, and returns the same identifier.

Also, if I set a timeout on future.get(), we will receive a TimeoutException after the specified timeout. Would it be a reasonable solution to catch that exception, create a new APNs client, and resend the same notification with the new client?

Please share your input. Thank you.
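The "time out, then rebuild the client" idea above can be sketched with plain java.util.concurrent. Here, sendFuture stands in for the future returned by a real send call, and rebuildClient for closing the old client and building a fresh one; both names are hypothetical, not Pushy API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRetrySketch {

    // Returns true if the notification resolved in time; otherwise runs the
    // caller-supplied rebuild action and returns false so the caller can retry.
    static boolean awaitOrRebuild(CompletableFuture<String> sendFuture,
                                  Runnable rebuildClient,
                                  long timeoutMillis)
            throws ExecutionException, InterruptedException {
        try {
            // Wait a bounded amount of time for the notification to resolve.
            sendFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stream never resolved; discard the (possibly wedged) client
            // and build a fresh one before retrying the notification.
            rebuildClient.run();
            return false;
        }
    }
}
```

Note that, as discussed further down this thread, retrying with a fresh client does not release the stalled stream on the old connection; this only restores throughput going forward.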

@maelaouane

Hi @jchambers,

Thank you for your reply. I will add the apns-id to the notification creation call. We have some additional information to share. The first exception we encountered was a timeout:

send: Failed to send push notification. Encoutered Exception: java.util.concurrent.TimeoutException
--
java.util.concurrent.TimeoutException at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:56)

We have found in another deployment a different error:

send: Failed to send push notification. Encoutered Exception: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Client has been closed and can no longer send push notifications.
--
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Client has been closed and can no longer send push notifications.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:54)

In fact, the service has been restarted in order to re-initialize the connection:

15:04:26.885 [ApnsClient][O]Shutting down.

But we had to wait more than 30 seconds before the service could be stopped:

Future<Void> disconnectFuture = client.close().await();
if (disconnectFuture.isSuccess()) {
    LOG.info("destroyConnector: Disconnect %s successful", disconnectFuture.toString());
}

The resulting trace:
15:04:59.058 [APNSClientConnector][O]destroyConnector: Disconnect DefaultPromise@7a34f4ac(success) successful

Further restarts were more straightforward:

2020/10/09-15:05:10.774 [ApnsClient][O]Shutting down.
2020/10/09-15:05:12.985 [APNSClientConnector][O]destroyConnector: Disconnect DefaultPromise@20abb2c9(success) successful

@jchambers Is it normal for client shutdown to block for more than half a minute?
@vsharm22 Re-initializing the connection seems to be a workaround that gets notifications flowing again, but we are not sure whether it is a viable solution.

Thanks!

@julienpouget

Hi @jchambers

We're experiencing the same issue on our platform. As a workaround we periodically re-initialize the connection, but this is definitely not a viable solution.

Are you still investigating this? Is there any news you can share?

Thanks !

@jchambers
Owner Author

Yes, I'm still investigating the issue. Thank you for your patience.

@huangjunchao

Hi @jchambers, I have the same problem. After the APNs client has been running for about an hour, I stop getting responses from APNs and user devices no longer receive messages. I send messages with:

new SimpleApnsPushNotification(token, topic, payload, invalidationTime, priority, pushType, null, UUID.randomUUID())

UUIDs of notifications that were never delivered:

  1. 2a77ab4f-4d7c-4968-b036-11ab3795994f
  2. 213d79e0-d339-462c-bf80-a0df193b5531

@vsharm22

Hi @jchambers ,

Since we added a timeout to future.get(), we have started getting TimeoutExceptions. We catch the exception and, in the catch block, try to cancel the future task as shown in the code snippet below. But our application still hangs, and after analyzing a thread dump we observed that the task is still not canceled. Could you please help us understand why we are not able to cancel the task?

Sorry for the inconvenience.

catch (Exception e) {
    if (sendNotificationFuture != null) {
        if (sendNotificationFuture.isCancellable()) {
            notificationLogger.debug(String.format("The task is cancellable for client_id: [%s] and identifier: [%s].",
                    request.getClientId(), apnsMessage.getIdentifier()));

            if (!sendNotificationFuture.isCancelled()) {
                notificationLogger.debug(String.format("The task is NOT canceled. Now attempting to cancel the task for client_id: [%s] and identifier: [%s].",
                        request.getClientId(), apnsMessage.getIdentifier()));
                sendNotificationFuture.cancel(true);
            }
        } else {
            notificationLogger.debug(String.format("The task is NOT cancellable. Cannot cancel the task for client_id: [%s] and identifier: [%s].",
                    request.getClientId(), apnsMessage.getIdentifier()));
        }
    }

    notificationResult = new NotificationWithFullMessageResultImpl(NotificationResultStatus.FAILURE, false);
    apnsMessage = null;
    notificationLogger.error(String.format("Could not notify endpoint [%s]", request.getEndpoint()), e);
    notificationLogger.error(String.format("Could not notify endpoint for client_id: [%s]", request.getClientId()), e);
}

@jchambers
Owner Author

Pushy generally hasn't added support for canceling futures. Part of the problem is that once a notification has been written, it has consumed an HTTP/2 stream. Even if we cancel the future locally, streams are limited by the server, and we won't be able to "reclaim" the stream until the notification resolves remotely.
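The stream-exhaustion behavior Jon describes is also why callers who cap in-flight notifications with a Semaphore see permits leak when streams stall. A minimal sketch of that pattern (the class and method names are illustrative, not Pushy API; sendFuture stands in for the future a real send call returns):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

public class InFlightLimiter {
    private final Semaphore permits;

    public InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // A permit is acquired before each send and released only when the future
    // completes, so a notification whose stream never resolves holds its
    // permit forever, just as a stalled HTTP/2 stream stays consumed.
    public void send(CompletableFuture<?> sendFuture) throws InterruptedException {
        permits.acquire();
        sendFuture.whenComplete((response, cause) -> permits.release());
    }

    public int availablePermits() {
        return permits.availablePermits();
    }
}
```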

@vsharm22

@jchambers

Thank you for the explanation. Since we are continuously facing this issue, can you suggest a workaround to use while you investigate?

@huangjunchao

Hi @jchambers, at present we still face the following problems:

With asynchronous requests, our APNs push service can only run for about two hours. For now, we plan to perform a synchronous send every 10 minutes; if it times out, we will close the connection and reinitialize it. But we don't know whether this will solve the problem. Can you give us some suggestions? Thank you very much!
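The periodic synchronous probe described above might be sketched like this; sendProbe and reconnect are hypothetical stand-ins for the real Pushy calls (sending one notification and rebuilding the client, respectively), not Pushy API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class ConnectionWatchdog {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Sends one probe notification and waits synchronously. Returns true if
    // the server responded in time; otherwise runs the reconnect action.
    public static boolean probeOnce(Supplier<CompletableFuture<?>> sendProbe,
                                    Runnable reconnect,
                                    long timeoutMillis)
            throws ExecutionException, InterruptedException {
        try {
            sendProbe.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;  // the server responded; connection looks healthy
        } catch (TimeoutException e) {
            reconnect.run();  // stream stalled; tear down and rebuild the client
            return false;
        }
    }

    // Runs the probe at a fixed interval (e.g. every 10 minutes).
    public void start(Supplier<CompletableFuture<?>> sendProbe,
                      Runnable reconnect,
                      long periodMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                probeOnce(sendProbe, reconnect, TimeUnit.SECONDS.toMillis(30));
            } catch (Exception ignored) {
                // A failed probe is handled by reconnect; don't kill the scheduler.
            }
        }, periodMinutes, periodMinutes, TimeUnit.MINUTES);
    }
}
```

Note the caveat from earlier in the thread: reconnecting restores throughput but does not recover notifications already stalled on the old connection, so those should still be retried.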

@jchambers
Owner Author

Friends, I understand this is a serious problem for many of you. I promise I'll share updates as soon as they're available.

@petrdvorak
Contributor

@jchambers Did you open a radar issue for that? I opened FB8816555 but would reference your issue to point to a possible duplicate...

@jchambers
Owner Author

Did you open a radar issue for that?

I didn't; I've been working through other channels.

@jchambers
Owner Author

Folks, I'll be putting out a new build shortly that includes some more verbose/specific logging that I hope will help get to the bottom of this issue. I don't think it will solve the problem in its own right, but it should help get us the information we need to make more progress.

@jchambers
Owner Author

Folks, Pushy 0.14.2 has just been released and should be making its way to Maven Central within the next hour or so. Could you please update to the latest version? It includes a few logging changes that could be helpful in getting to the root of this problem. In particular, turning on DEBUG logging for com.eatthepath.pushy.apns.ApnsChannelFactory will help us understand when channels are opened/closed, which could be a really important clue.
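If you use Logback, enabling the requested DEBUG logging might look like the fragment below. The logger name is taken from Jon's comment; the rest of the file is an illustrative sketch, and the equivalent setting applies to whatever SLF4J backend you use:

```xml
<configuration>
  <!-- Channel open/close events from Pushy 0.14.2 -->
  <logger name="com.eatthepath.pushy.apns.ApnsChannelFactory" level="DEBUG"/>
</configuration>
```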

Thank you!

@jchambers
Owner Author

With thanks to @lkesteloot and @dcollens, we now have some high-quality frame logs showing:

  1. A bunch of streams getting sent/acknowledged by the server
  2. One single stream getting sent, but not acknowledged
  3. A bunch of subsequent streams on the same connection getting acknowledged by the server

Another curious observation is that this problem seems to affect a small subset of device tokens. As an example, most "stalled" notifications are headed for device tokens ABC123 and DEF456, but not all notifications for those tokens will fail. Is this behavior consistent with other folks' experience?

@papo2608

@jchambers we can confirm exactly that behavior. We already talked to Apple, and they told us that those requests never hit their application layer; their server team is now investigating.

@huangjunchao

@papo2608 Did Apple say when it will be fixed?

@robertoprato

Looks like this issue is the same as #787; we have been experiencing this on and off for at least the last 6 months. We resorted to expiring the future and recreating the ApnsClient when messages are not being sent.

@jchambers
Owner Author

Folks, I've just received word that the problem may have been fixed upstream. Could you please check whether you're still experiencing this problem or if it appears to have been resolved and report back?

Thanks!

@christophemaillot

Folks, I've just received word that the problem may have been fixed upstream. Could you please check whether you're still experiencing this problem or if it appears to have been resolved and report back?

It seems so. Our platform had been experiencing this issue since mid-September; we had to reinitialize the client 3 or 4 times a day. But we have just run 4 days in a row without any notification hitches.

@vsharm22

received word that the problem may have been fixed upstream.

@jchambers ,
When you say upstream, what do you mean? Apple's servers?

@jchambers
Owner Author

When you say upstream, what do you mean? Apple's servers?

Yes, I mean that I believe Apple fixed a problem on their end.

@leiwei999

Is it possible to reproduce the same event (the APNs server not responding) using this mock server?
https://pushy-apns.org/apidocs/0.13/com/eatthepath/pushy/apns/server/MockApnsServerBuilder.html
https://pushy-apns.org/apidocs/0.13/com/eatthepath/pushy/apns/server/MockApnsServer.html

Thanks!

@jchambers
Owner Author

Is it possible to reproduce the same event (the APNs server not responding) using this mock server?

Respectfully, I think that's beyond the scope of the current issue and sounds more like a request for general support. @leiwei999 could you please move this discussion to the mailing list instead?

@babuv2

babuv2 commented Oct 28, 2020

Hi @jchambers,

Jon, we are currently on version 0.13.9 and are still facing the issue. We are restarting our service periodically to make sure the streams don't all become blocked.

  1. Do we need to update to the latest version for the issue to be fixed?
  2. If the issue was fixed upstream (on Apple's side), do we have any tickets that you opened with Apple? If you have a ticket ID, could you share it so that I can pass it along to our stakeholders?

Thanks In Advance Jon
Vivek

@petrdvorak
Contributor

@babuv2 We opened issue FB8816555 with Apple for this. It is still open on Apple's side and does not reference any duplicates. Updating the Pushy version is a good idea anyway, but as for this issue, I don't think it will help.

@babuv2

babuv2 commented Oct 28, 2020

@petrdvorak Great. Understood. Thank you very much Petr

Vivek

@jchambers
Owner Author

Folks,

If you're still experiencing this issue, please:

  1. Let us know, of course!
  2. Update to the latest version of Pushy (currently 0.14.2). This is unlikely to address the issue, but includes some additional logging that may help us identify what's happening.
  3. Send messages with locally-assigned apns-id values (UUID.randomUUID() is fine!). The more detailed SimpleApnsPushNotification constructors will accept an APNs ID as an argument.
  4. Turn on HTTP/2 frame logging.

For recurrences of this issue, we need to be able to show which notifications are getting lost (a timestamp and APNs ID will cover this) and what else was happening on the connection at the time (the frame logs will cover this).

Thank you very much for your continued patience and support!
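Steps 3 and 4 above imply a little bookkeeping on the sender's side. A minimal sketch (the class and method names are illustrative, not Pushy API) of recording locally-assigned apns-id values with send timestamps, so that stalled notifications can be reported with a UUID and an approximate time:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class ApnsIdTracker {
    private final Map<UUID, Instant> inFlight = new ConcurrentHashMap<>();

    // Assign a fresh apns-id and remember when the notification was sent.
    // The returned UUID would be passed as the apnsId constructor argument
    // of SimpleApnsPushNotification.
    public UUID register() {
        final UUID apnsId = UUID.randomUUID();
        inFlight.put(apnsId, Instant.now());
        return apnsId;
    }

    // Called when the server responds (success or error); nothing to report.
    public void resolved(UUID apnsId) {
        inFlight.remove(apnsId);
    }

    // Anything still present long after sending is a candidate "lost"
    // notification to report, together with its timestamp.
    public Map<UUID, Instant> unresolved() {
        return new HashMap<>(inFlight);
    }
}
```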

@janzar

janzar commented Oct 30, 2020

Hello,

we too experienced a lot of timeouts on 27, 28, and 30 October. We have already updated Pushy to 0.14.2. Some apns-ids:
Oct 30 14:50:03
apns-id: 00000083-ebe9-9f57-3579-54f778027d82
Oct 30 15:30:06
apns-id: 0000007a-18ba-2071-3579-79a280009042
Oct 30 16:16:32
apns-id: 00000085-a985-245b-3579-a4279f842e65
Oct 30 16:40:05
apns-id: 00000085-f011-0f9d-3579-b9b27739283a

Times are in GMT+03. If we retry a push, we use the same apns-id; hopefully that won't make them hard to identify.

Thanks,
Zigurds

@mkieloch-352

I was also experiencing this issue at my job. It seemed to increase with higher traffic. At the time, we were sending around 300-350 notifications per minute and our JVMs were locking up. Updating Pushy seems to have resolved the issue, but we will monitor closely.

@jchambers
Owner Author

Even though there are still some reports of occasional timeouts, it sounds like the main issue where HTTP/2 streams were getting lost entirely has been resolved upstream.

I'm going to mark this issue as "resolved," but please let me know if you think that's a mistake.

@lkesteloot

Thank you Jon for dealing with this!

@floifyarul

We have been seeing a ton of timeouts today versus the usual handful per day. We are still on an older release of Pushy (0.11.0). Is anyone else experiencing these timeouts today?

@dcollens

dcollens commented Dec 7, 2020

Yeah, it looks like we have about 25 stranded requests today (they never completed, and have consumed our Semaphore count). I don't have frame logs for them, though. Around 1-2pm Eastern time, roughly.

@floifyarul

OK, thanks. A problem upstream? For us, it started around 11:14am Eastern time and is still happening.

@jchambers
Owner Author

Uuuuuuuugh… while I don't doubt that this is the same problem, frame logs and UUIDs of affected messages would be a big help in diagnosing this (even if that just means forwarding that information to Apple). I do hope this is just a brief hiccup that self-resolves, but if anybody's in a position to capture logs, that'd be awesome.

In the meantime, I'll try to think through some appropriate resiliency strategies for "sometimes HTTP/2 streams just disappear."

@floifyarul

Thank you, Jon. We ran fine yesterday and returned to a normal error rate. Do you think upgrading to the newest version will help improve the situation?

@jchambers
Owner Author

I always recommend using the latest version, but we haven't shipped anything that addresses this situation specifically.

@floifyarul

ok thanks

@babuv2

babuv2 commented Dec 21, 2020

@jchambers Jon, we have:

  1. Updated to the latest pushy version (0.14.2)
  2. Started sending messages with locally-assigned apns-id.

We are still facing the issue on a regular basis. Please find below some of the failed apns-ids:

5eb1bdac-f632-43d7-9060-8b77473c5f64
76a52c3e-f559-4a71-ab09-4646af9288ce
c883a202-9639-4c01-93e4-3cf506e3cff7
2f1e02e1-6bf9-4e8d-8687-b20d9effb6e0
e3447696-d68e-495a-99b8-ec280161505a

Any help in debugging/fixing this is highly appreciated.

Thanks In Advance

Vivek

@babuv2

babuv2 commented Jan 8, 2021

Hi @jchambers,
Jon, can we reopen this ticket, since we are facing this issue regularly? I have already shared some apns-ids; we would be happy to provide more if you need them.

Thanks In Advance
Vivek
