Periodical loss of connection on devices (+- 24H) #292

Closed
maxstefaniv opened this issue May 10, 2021 · 12 comments · Fixed by #294
Labels
bug Something isn't working

Comments

@maxstefaniv

maxstefaniv commented May 10, 2021

Expected Behavior

Devices report telemetry continuously and reach IoT Hub without issues.

Current Behavior

After setup, devices reported properly for around 24 hours, then started displaying:
could not send message to IoTHub/Edge with error: The operation timed out.
Unless IoT Edge is restarted, all devices keep reporting the same message.

Steps to Reproduce

  1. Deploy the project as it is (gateway device AIOT-ILRA01) (Devices ELSYS Co2 and Netvox R311W)
  2. Add an additional decoder (this should not be an issue, as the module only decodes messages coming from the device as telemetry).
  3. Wait 24h

Context (Environment)

Device (Host) Operating System

Ubuntu 16.04

Architecture

amd64

LoRaWAN Module Version

LoRaWanNetworkSrvModule  running          Up 2 hours       loraedge/lorawannetworksrvmodule:1.0.5
LoRaWanPktFwdModule      running          Up 2 hours       loraedge/lorawanpktfwdmodule:1.0.5
edgeAgent                running          Up 2 hours       mcr.microsoft.com/azureiotedge-agent:1.0.9.5
edgeHub                  running          Up 2 hours       mcr.microsoft.com/azureiotedge-hub:1.0.9.5

Logs

Attached support_bundle.zip
support_bundle.zip

Story description:

Starting with the LoRaWAN gateway (AAEON AIOT-ILRA01), we deployed the default modules from the Microsoft Git repository, modified one additional module, “DecoderValueSensor”, adding logic to decode telemetry, and deployed it on the Edge Device.
Afterwards, we enrolled a total of 24 multi-sensors and confirmed that their telemetry reached IoT Hub and that stored messages were decoded correctly. Everything was fine.
We left the setup connected to both the network and a power source. The following day, we connected to the Edge Device over SSH and noticed a series of error messages:
could not retrieve device twin with error: The operation timed out.
In response, we restarted IoT Edge with sudo systemctl restart iotedge, after which all modules restarted and the logs no longer showed error messages, so everything went back to normal.

  1. Can you propose any other solutions that would allow us to detect this anomaly as soon as possible?
  2. Is this behavior normal? Have you encountered it before?
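One possible way to detect the condition early is to scan recent module log lines for the two timeout messages quoted in this issue and raise an alert once they repeat. The following is a minimal sketch, not part of the project: the function names, the threshold and the window size are assumptions chosen for illustration.

```python
import re
from collections import deque

# Matches the two error lines quoted in this issue; adjust to your log format.
TIMEOUT_PATTERN = re.compile(
    r"could not (send message to IoTHub/Edge|retrieve device twin) "
    r"with error: The operation timed out"
)


def count_timeouts(lines):
    """Return how many log lines contain one of the timeout errors."""
    return sum(1 for line in lines if TIMEOUT_PATTERN.search(line))


def should_alert(lines, threshold=3, window=50):
    """True when at least `threshold` of the last `window` lines are timeouts.

    `threshold` and `window` are hypothetical defaults, not project settings.
    """
    recent = deque(lines, maxlen=window)  # keeps only the newest lines
    return count_timeouts(recent) >= threshold
```

The lines could come, for example, from the output of docker logs --tail 50 LoRaWanNetworkSrvModule run on a schedule; what to do on an alert (notify someone, restart IoT Edge) is left open here.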
maxstefaniv added the bug label May 10, 2021
@Mandur
Contributor

Mandur commented May 11, 2021

Hello @maxstefaniv, thank you for raising the issue. I have a few questions:

  • How old was the IoT Edge installation? Is it brand new?
  • Do you have multiple gateways?
  • Would you by any chance have the edgeHub and edgeAgent logs?

We have seen such cases in the past in CI under very specific conditions. Depending on your answers, I would suggest a quick chat to troubleshoot further.

@maxstefaniv
Author

How old was the IoT Edge installation? Is it brand new?
The resource group with the IoT Hub was created on 27 April, and the Edge Device was given its connection string on 29 April. On Friday 30 April I enrolled 24 devices, and the first error was detected on May 3rd (the error occurred over the weekend). I would say it is brand new.

Do you have multiple gateways?
I have one gateway for each resource group:
  • The first (Raspberry Pi) is for testing purposes, with only 2 devices.
  • The second (AIOT-ILRA01) is deployed with all devices (this is the one I am most interested in fixing).

Would you by any chance have the edgeHub and edgeAgent logs?
I have them from the first time it happened:
edgeHub partial log.txt
edgeAgent logs.txt

I am available for a chat to troubleshoot this issue whenever it suits you.

@Mandur
Contributor

Mandur commented May 11, 2021

Thank you for the prompt reply.
I assume this is similar to the issue I am currently troubleshooting. As a workaround for now, I recommend setting the environment variable ENABLE_GATEWAY to false on the LoRaWAN network server module.
You can read more about the setting in the reply here, but basically it skips the edge queue and interacts directly with IoT Hub. As a result, you lose the following:

  • Routing messages through IoT Edge to other edge modules (decoders are not affected, because they are invoked over HTTP rather than through message routing)
  • Web proxy support
  • Local queuing in case of an intermittent internet connection
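For illustration, the setting can be expressed as an env entry on the module in an IoT Edge deployment manifest. The sketch below builds such an entry in Python; the module entry shown is abbreviated and hypothetical (a real manifest carries more fields), and the helper name set_module_env is an assumption made for this example.

```python
import copy
import json


def set_module_env(module_spec, name, value):
    """Return a copy of a module spec with env var `name` set to `value`."""
    updated = copy.deepcopy(module_spec)  # leave the input untouched
    updated.setdefault("env", {})[name] = {"value": value}
    return updated


# Abbreviated, hypothetical module entry; a real manifest has more fields.
network_server = {
    "type": "docker",
    "settings": {"image": "loraedge/lorawannetworksrvmodule:1.0.5"},
}

patched = set_module_env(network_server, "ENABLE_GATEWAY", "false")
print(json.dumps(patched, indent=2))
```

The value is passed as the string "false" because environment variables are strings; the printed JSON can be compared against the module section of the actual deployment.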

Please let me know if that makes it better. In case you want to IM, my Skype handle is mandurlevrai.

@ronniesa
Contributor

ronniesa commented May 11, 2021

@maxstefaniv does this happen only after deploying a new decoder, or every time after 24h?

It makes sense to first try what @Mandur suggests just above: setting ENABLE_GATEWAY to false.

@maxstefaniv
Author

@Mandur Hello, yes, ENABLE_GATEWAY was already set to false, otherwise I was getting an error. That was a fix I had already found in existing issues.

@ronniesa It happens every time after 24 hours.

@Mandur
Contributor

Mandur commented May 12, 2021

@maxstefaniv When I look at your docker inspect logs, I cannot see the environment variable ENABLE_GATEWAY set to false. May I ask you to quickly double-check? You can run:
docker exec LoRaWanNetworkSrvModule bash -c printenv |grep ENABLE_GATEWAY
(I am just double checking as it solved the issue for me)
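If this check needs to be repeated across several gateways, the printenv output can also be evaluated in a script. The following is a small sketch under assumptions: both function names are hypothetical, and the parser expects one NAME=value pair per line, so multi-line values are not handled.

```python
def parse_printenv(output):
    """Turn printenv output (NAME=value per line) into a dict."""
    env = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[name.strip()] = value.strip()
    return env


def gateway_mode_disabled(env):
    """True only if ENABLE_GATEWAY is present and set to false (any case)."""
    return env.get("ENABLE_GATEWAY", "").lower() == "false"
```

A missing variable is treated as "not disabled", which matches the concern raised above: the variable not appearing in the container environment at all.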

I have a device running to try to get a repro, but at the moment I am failing to reproduce the problem. I think it might be beneficial to have a call to look at the problem in more depth.

@maxstefaniv
Author

@Mandur Sure, I executed it:
image

When are you available? I can add you to my Teams account and we can have a call.

@Mandur
Contributor

Mandur commented May 12, 2021

Sure @maxstefaniv, just add me and we can sync ad hoc!

@Mandur
Contributor

Mandur commented May 12, 2021

Following the discussion, we are testing a new version with updated client libraries to see if that fixes the issue.

@maxstefaniv
Author

maxstefaniv commented May 13, 2021

I applied the update of LoRaWanNetworkSrvModule from 1.0.5 to 1.0.5.1 suggested by @Mandur on the test edge, and it seems to help. If it works on the prod edge as well, I will close the issue.
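To confirm which gateways already run the updated module, the image tag can be read from a module listing line like the ones under "LoRaWAN Module Version" and compared numerically. This is only a sketch: the function names are hypothetical, the minimum of 1.0.5.1 is taken from this comment as the version that appeared to help, and the parser assumes the image is the last column and carries a purely numeric dotted tag.

```python
def image_tag(listing_line):
    """Extract the image tag from a module listing line (last column)."""
    image = listing_line.split()[-1]
    return image.rsplit(":", 1)[1]  # assumes the image reference has a tag


def version_tuple(tag):
    """Convert a dotted tag such as '1.0.5.1' into a tuple of ints."""
    return tuple(int(part) for part in tag.split("."))


def is_patched(listing_line, minimum="1.0.5.1"):
    """True when the listed image tag is at least `minimum`."""
    return version_tuple(image_tag(listing_line)) >= version_tuple(minimum)
```

Comparing tuples of integers rather than strings keeps tags such as 1.0.10 ordered after 1.0.9, which a plain string comparison would get wrong.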

@Mandur Mandur mentioned this issue May 13, 2021
Mandur added a commit that referenced this issue May 16, 2021
* Upgrade device sdk versions

* Fix Mac issues

* Fix indentation

* Correct test
@Mandur
Contributor

Mandur commented May 16, 2021

As our changes seem to have solved the issue, we merged them into dev. The fix will be included in the next LoRaWAN release.
Feel free to reopen if you face any more issues with it.

Thank you for raising this!

@maxstefaniv
Author

Indeed, I monitored the situation on both devices over the weekend and there were no errors. All devices reauthenticate with the gateways and there are no drops. This solution works. Thank you very much, @Mandur.
