Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Something went wrong" when commissioning through Android app #463

Closed
agners opened this issue Dec 13, 2023 · 14 comments · Fixed by #469 or home-assistant/android#4069
Closed

"Something went wrong" when commissioning through Android app #463

agners opened this issue Dec 13, 2023 · 14 comments · Fixed by #469 or home-assistant/android#4069

Comments

@agners
Copy link
Collaborator

agners commented Dec 13, 2023

When trying to commission a device through the Android App sometimes simply "Something went wrong" is returned.

Note: There might be multiple issue behind that particular error, but this outlines what I've been able to reproduce here.

There are two ways to on-board a Matter device to Home Assistant using Android:

  1. Waiting for the half-screen pop-up to appear, then choose the Home Assistant app
  2. Open the Home Assistant App, then adding a Matter device by navigating to Settings > Devices & services > Devices > Add Device > Add Matter device

When using method 2, the Android app uses the matter/commission_on_network WS command is used. The passcode is passed to this endpoint, no further information. The Matter Server (SDK) then searches for any commissionable device on the network, and (presumably) picks the first it hears about. This of course could be a different one.

Furthermore, a BR might cache DNS-SD entry for a commissionable device (any BR runs a SRP server, which caches service announcements on behalf of Thread devices). Now if a Thread device got reset during commissioning at one point, such a stale entry might linger around for up to 2h (in my particular test case).

I've noticed this by checking the service announcements during a failed commissioning attempt. Two announcements were present:

$ sudo dns-sd -B _matterc._udp local
Browsing for _matterc._udp.local
DATE: ---Wed 13 Dec 2023---
19:45:23.143  ...STARTING...
Timestamp     A/R    Flags  if Domain               Service Type         Instance Name
19:45:23.143  Add        3   2 local.               _matterc._udp.       14C551B182BE1DB4
19:45:23.143  Add        2   2 local.               _matterc._udp.       9FD5C3686D913371  

And it turned out that in this case the second announcement was from a stale Thread device from the OTBR Add-on:

(I've run this command for debugging from within the container)

$ ot-ctl srp server service
...
9FD5C3686D913371._matterc._udp.default.service.arpa.
    deleted: false
    subtypes: _V4447,_T65535,_S15,_L3840
    port: 5540
    priority: 0
    weight: 0
    ttl: 7200
    lease: 7200
    key-lease: 680400
    TXT: [VP=343434372b38313934, DT=3635353335, SII=35333030, SAI=353530, T=30, D=33383430, CM=30, RI=303230304539314145334235424534423833304331314237334346463637464334423335, PH=3336, PI=]
    host: 86579A470F4CE412.default.service.arpa.
    addresses: [fd82:c9e9:5cb7:1:3b2a:c0f6:fcfd:7e61]
@agners agners changed the title Something went wrong" "Something went wrong" when commissioning through Android app Dec 13, 2023
@agners
Copy link
Collaborator Author

agners commented Dec 13, 2023

It seems we get the information from the CommissioningRequestMetadata object at https://github.com/home-assistant/android/blob/5d779a27aa98bd5f075c843537a46a71be8ba387/app/src/full/java/io/homeassistant/companion/android/matter/MatterCommissioningService.kt#L61. We could get more specific information there e.g. the device location (IP) directly. With that we'd not need to rediscover the device on the Matter Server side.

@marcelveldt
Copy link
Contributor

If we get some more info of the device, we could so a call to "get discoverable nodes" first

@agners
Copy link
Collaborator Author

agners commented Dec 13, 2023

I first considered using CommissionIP since we have the IP. But it is deprecated, and talking with Tennessee, he says the preferred way is to discover the device again from the server instead of reaching out to the IP directly (I guess the outcome of a mDNS request might be different in certain setups, so doing another request from server side is the better approach).

The CommissioningRequest object returns a DeviceInfo object (see https://developers.home.google.com/reference/com/google/android/gms/home/matter/commissioning/DeviceInfo), which returns the long discriminator. We can filter by that using CommissionOnNetwork with DiscoveryFilterType.LONG_DISCRIMINATOR.

@jpelgrom
Copy link

FWIW, the DeviceInfo doesn't guarantee that it includes a discriminator (emphasis added):

The device's discriminator, if known, used to discriminate between Matter device beacons to identify the target device.

Let me know if you want a build of the app with specific changes for the on network commissioning.

@agners
Copy link
Collaborator Author

agners commented Dec 13, 2023

The discriminator is a static per device unique value. I expect Google just passes the one they learned from the QR code or manual pairing code there, so I'd guess that value is there. We'll have to try it out.

What I am not sure about is if and how we can distinguish between a short and a long discriminator. It does matter for the device as we need to tell the CommissionOnNetwork API which one to filter for. Reading the Discriminator docs I don't think there is a way to know for sure 😢

If we can't learn from the Android API, then we probably just pass the discriminator as integer to the backend for now, and see how to go about it there.

I am guessing that we get 0x300 (768) as short discriminator, if the discriminator is 0x348 (the long discriminator example of 840 from the spec). If that is the case, then we can just mask the lower 8-bit and if they are zero, assume a short discriminator.

So we'd introduce the following websocket API in Core:

@websocket_api.websocket_command(
    {
        vol.Required(TYPE): "matter/commission_on_network",
        vol.Required("pin"): int,
        vol.Optional("discriminator"): int,
    }
)

@agners
Copy link
Collaborator Author

agners commented Dec 13, 2023

Let me know if you want a build of the app with specific changes for the on network commissioning.

Yeah if you can create a build which passes the discriminator as outlined above would be perfect 👍

@agners
Copy link
Collaborator Author

agners commented Dec 14, 2023

The Matter server logs look something like this (this capture has some additional hacked in log entries):

2023-12-13 17:48:05 core-matter-server matter_server.server.device_controller[1166] INFO Fetching certificates...
2023-12-13 17:48:05 core-matter-server matter_server.server.helpers.paa_certificates[1166] INFO Fetching the latest PAA root certificates from DCL.
2023-12-13 17:48:06 core-matter-server matter_server.server.helpers.paa_certificates[1166] INFO Fetched 0 PAA root certificates from DCL.
2023-12-13 17:48:06 core-matter-server matter_server.server.helpers.paa_certificates[1166] INFO Fetching the latest PAA root certificates from Git.
2023-12-13 17:48:06 core-matter-server matter_server.server.helpers.paa_certificates[1166] INFO Fetched 0 PAA root certificates from Git.
2023-12-13 17:48:06 core-matter-server matter_server.server.device_controller[1166] INFO Commissioning new device on network using Node ID 33 setup_pin_code 60617843.
2023-12-13 17:48:36 core-matter-server chip.CTL[1166] ERROR Mdns discovery timed out
2023-12-13 17:48:36 core-matter-server matter_server.server.device_controller[1166] WARNING Commission on network failed for node 33
2023-12-13 17:48:36 core-matter-server matter_server.server.client_handler[1166] ERROR [548017525264] Error handling message: CommandMessage(message_id='97901fa041c34d3e82e6f8d9e9a00d43', command='commission_on_network', args={'setup_pin_code': 60617843})
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/client_handler.py", line 188, in _run_handler
    result = await result
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/device_controller.py", line 242, in commission_on_network
    raise NodeCommissionFailed(
matter_server.common.errors.NodeCommissionFailed: Commission on network failed for node 33

@satmandu
Copy link

I would also be happy to use such a build... since I am having the same issue:

home-assistant/addons#3359

@jpelgrom
Copy link

jpelgrom commented Dec 14, 2023

Yeah if you can create a build which passes the discriminator as outlined above would be perfect 👍

Looking at it again, it looks like we confused who creates each object... The CommissioningRequest is created and sent by the app when it wants to start commissioning. Adding a discriminator seems to be a way to limit which devices can be commissioned.

After scanning completes, the app receives a CommissioningRequestMetadata object from the system. This does not include the discriminator, but only passcode + network location (IP) + device descriptor (PID/VID). Unless the app sets a discriminator before commissioning, I don't see how it'd get it from Android.

@agners
Copy link
Collaborator Author

agners commented Dec 14, 2023

Hm, right, so CommissioningRequest is before any communication has been made with the device? 🤔

I guess this API is mainly meant for device manufacturer. They would limit the device to their PID/VID. It is weird though that CommissioningRequestMetadata does not contain a discriminator then. 🤔

agners added a commit to agners/python-matter-server that referenced this issue Dec 18, 2023
When using the Android Matter commissioning flow from the Home Assistant
App (via CommissioningRequest), we get the passcode only, no QR code or
manual pairing code (see CommissioningRequestMetadata documentation).
This means there is no discriminator. In other words, the server will
try to find a commissionable device, and use the pincode against the
first one found.

It seems that sometimes another device is announced to be commissionable
at the same time, which obvisouly has another passcode, hence the
commission fails quite quickly. The App shows "Something went wrong".

This seems to be particularly common when using Thread: The Thread
border router forwards and caches DNS-SD/mDNS service information about
commissionable Thread devices with its SRP server service. The cache is
up to 2h long. So when a communication breakdown happens after a
particular device went into commission mode, this commissionable service
entry lingers in the network for quite some time.

Now if the OTBR sends this entry before the new/valid entry, the Matter
Server tries to commission a device which is no longer responding.

The CommissioningRequestMetadata have a way to get the device's IP
address. This change extends commission_on_network WS endpoint to also
take an IP address. The SDK CommissionIP service is used to commission
the device.

Note: CommissionIP is marked deprecated currently. This is mainly to
prevent implementation which would ask users for IP addresses, which is
not the intended way to implement commissioning. However, for this
particular usecase the API seems very sensible and works as intended.

Fixes: home-assistant-libs#463
agners added a commit that referenced this issue Dec 18, 2023
When using the Android Matter commissioning flow from the Home Assistant
App (via CommissioningRequest), we get the passcode only, no QR code or
manual pairing code (see CommissioningRequestMetadata documentation).
This means there is no discriminator. In other words, the server will
try to find a commissionable device, and use the pincode against the
first one found.

It seems that sometimes another device is announced to be commissionable
at the same time, which obvisouly has another passcode, hence the
commission fails quite quickly. The App shows "Something went wrong".

This seems to be particularly common when using Thread: The Thread
border router forwards and caches DNS-SD/mDNS service information about
commissionable Thread devices with its SRP server service. The cache is
up to 2h long. So when a communication breakdown happens after a
particular device went into commission mode, this commissionable service
entry lingers in the network for quite some time.

Now if the OTBR sends this entry before the new/valid entry, the Matter
Server tries to commission a device which is no longer responding.

The CommissioningRequestMetadata have a way to get the device's IP
address. This change extends commission_on_network WS endpoint to also
take an IP address. The SDK CommissionIP service is used to commission
the device.

Note: CommissionIP is marked deprecated currently. This is mainly to
prevent implementation which would ask users for IP addresses, which is
not the intended way to implement commissioning. However, for this
particular usecase the API seems very sensible and works as intended.

Fixes: #463
@agners
Copy link
Collaborator Author

agners commented Dec 18, 2023

Actually, for this error it seems more typical that the commissioning fails quickly:

2023-12-18 19:41:50 core-matter-server matter_server.server.device_controller[936] INFO Starting Matter commissioning on network using Node ID 53.
2023-12-18 19:41:50 core-matter-server chip.CTL[936] ERROR Unknown filter type; all matches will fail
2023-12-18 19:41:54 core-matter-server chip.EM[936] ERROR Failed to Send CHIP MessageCounter:140956251 on exchange 50461i sendCount: 4 max retries: 4
2023-12-18 19:41:56 core-matter-server chip.SC[936] ERROR PASESession timed out while waiting for a response from the peer. Expected message type was 33
2023-12-18 19:41:56 core-matter-server chip.ZCL[936] ERROR Secure Pairing Failed
2023-12-18 19:41:56 core-matter-server matter_server.server.client_handler[936] ERROR [548113578960] Error handling message: CommandMessage(message_id='7e653a0e055c42328922db59af330c06', command='commission_on_network', args={'setup_pin_code': 59702531, 'ipaddr': None})
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/client_handler.py", line 188, in _run_handler
    result = await result
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/device_controller.py", line 242, in commission_on_network
    raise NodeCommissionFailed(
matter_server.common.errors.NodeCommissionFailed: Commission on network failed for node 53

From start to failure it just took 6s, which indicates the the Matter Server was communicating to a wrong device.

This is with the logging enhancements. In the current 5.0.3. Matter server INFO Fetched 0 PAA root certificates from Git. is the indicatition that commissioning started.

@agners
Copy link
Collaborator Author

agners commented Jan 4, 2024

Further investigations show that certain devices announce themself twice not the network, e.g. the "Aqara Door and Window Sensor P2" seems to announce twice as a commission-able device when using the App commissioning mode:

12:52:01.235  Add        2   2 local.               _matterc._udp.       A7C33C8FBFF222B4
12:52:06.323  Rmv        1   2 local.               _matterc._udp.       A7C33C8FBFF222B4
12:52:06.323  Add        2   2 local.               _matterc._udp.       8444A7BF539974D6

Looking at the SRP entries on the OTBR shows that the two entries have different:

A7C33C8FBFF222B4._matterc._udp.default.service.arpa.
    deleted: false
    subtypes: _V4447,_T65535,_S15,_L3840
    port: 5540
    priority: 0
    weight: 0
    ttl: 7200
    lease: 7200
    key-lease: 680400
    TXT: [VP=343434372b38313934, DT=3635353335, SII=35333030, SAI=353530, T=30, D=33383430, CM=30, RI=303130303945413034304236414346344235384633383232304236393746334436453234, PH=3336, PI=]
    host: EAD6CD5F7254EA38.default.service.arpa

And

8444A7BF539974D6._matterc._udp.default.service.arpa.
    deleted: false
    subtypes: _V4447,_T65535,_S5,_L1482,_CM
    port: 5540
    priority: 0
    weight: 0
    ttl: 7200
    lease: 7200
    key-lease: 680400
    TXT: [VP=343434372b38313934, DT=3635353335, SII=35333030, SAI=353530, T=30, D=31343832, CM=32, RI=303230304539314145334235424534423833304331314237334346463637464334423335, PH=3336, PI=]
    host: EAD6CD5F7254EA38.default.service.arpa.

It seems that the Matter SDK does not like when two devices announce themselfs as commissionable. They do use different discriminator, but point to the same device, so in theory when not checking the discriminator the SDK should resolve to any of these two hosts (and it shouldn't matter which one as they resolve to the same host). But in testing the SDK trips every time with this device when using CommissionOnNetwork (just the pin code without discriminator).

When using CommissionIP things seem to work reliably.

@agners
Copy link
Collaborator Author

agners commented Jan 9, 2024

I just realized that the current implementation is problematic for WiFi devices if they resolve to link-local addresses:

2024-01-09 17:13:50 core-matter-server matter_server.server.client_handler[607] ERROR [548216738832] SDK Error during handling message: CommandMessage(message_id='aaced745b3d54ed2ac6f752d66ceef2a', command='commission_on_network', args={'setup_pin_code': 70280329, 'ip_addr': 'fe80::daa0:11ff:fe92:6a65%47'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/client_handler.py", line 188, in _run_handler
    result = await result
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/device_controller.py", line 268, in commission_on_network
    success = await self._call_sdk(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/device_controller.py", line 856, in _call_sdk
    await self.server.loop.run_in_executor(
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/chip/ChipDeviceCtrl.py", line 1759, in CommissionIP
    self._ChipStack.CallAsync(
  File "/usr/local/lib/python3.11/site-packages/chip/ChipStack.py", line 383, in CallAsync
    raise res.to_exception()
chip.exceptions.ChipStackError: src/controller/python/ChipDeviceController-ScriptBinding.cpp:390: CHIP Error 0x0000002F: Invalid argument

@agners
Copy link
Collaborator Author

agners commented Jan 11, 2024

With #27 merged, the scope id is accepted. But with an invalid scope id the API will just revert to host OS routing table which on systems with containers running typically doesn't prefer the primary interface, hence commissioning attempts still fail:

2024-01-11 19:55:33 core-matter-server matter_server.server.client_handler[614] DEBUG [548429312336] Received CommandMessage(message_id='6959f559728e4590bda3f6f400880759', command='commission_on_network', args={'setup_pin_code': 89446025, 'ip_addr': 'fe80::daa0:11ff:fe92:6a65%47'})
2024-01-11 19:55:33 core-matter-server matter_server.server.client_handler[614] DEBUG [548429312336] Handling command commission_on_network
2024-01-11 19:55:33 core-matter-server matter_server.server.device_controller[614] INFO Starting Matter commissioning with IP using Node ID 94 via IP address fe80::daa0:11ff:fe92:6a65.
2024-01-11 19:55:33 core-matter-server matter_server.server.storage[614] DEBUG Saved data to persistent storage
2024-01-11 19:55:33 core-matter-server chip.IN[614] DEBUG SecureSession[0x7fa4157350]: Allocated Type:1 LSID:17082
2024-01-11 19:55:33 core-matter-server chip.SC[614] DEBUG Assigned local session key ID 17082
2024-01-11 19:55:33 core-matter-server chip.EM[614] INFO <<< [E:51093i S:0 M:56542096] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fe80::daa0:11ff:fe92:6a65]:5540] --- Type 0000:20 (SecureChannel:PBKDFParamRequest)
2024-01-11 19:55:33 core-matter-server chip.SC[614] DEBUG Sent PBKDF param request
2024-01-11 19:55:33 core-matter-server chip.CTL[614] INFO Setting attestation nonce to random value
2024-01-11 19:55:33 core-matter-server chip.CTL[614] INFO Setting CSR nonce to random value
2024-01-11 19:55:33 core-matter-server chip.CTL[614] INFO Commission called for node ID 0x000000000000005E
2024-01-11 19:55:33 core-matter-server chip.EM[614] INFO Retransmitting MessageCounter:56542096 on exchange 51093i Send Cnt 1
2024-01-11 19:55:33 core-matter-server chip.EM[614] INFO Retransmitting MessageCounter:56542096 on exchange 51093i Send Cnt 2
2024-01-11 19:55:34 core-matter-server chip.EM[614] INFO Retransmitting MessageCounter:56542096 on exchange 51093i Send Cnt 3
2024-01-11 19:55:35 core-matter-server chip.EM[614] INFO Retransmitting MessageCounter:56542096 on exchange 51093i Send Cnt 4
2024-01-11 19:55:37 core-matter-server chip.EM[614] ERROR Failed to Send CHIP MessageCounter:56542096 on exchange 51093i sendCount: 4 max retries: 4
2024-01-11 19:55:39 core-matter-server chip.SC[614] ERROR PASESession timed out while waiting for a response from the peer. Expected message type was 33
2024-01-11 19:55:39 core-matter-server chip.IN[614] DEBUG SecureSession[0x7fa4157350]: Released - Type:1 LSID:17082
2024-01-11 19:55:39 core-matter-server chip.ZCL[614] ERROR Secure Pairing Failed
2024-01-11 19:55:39 core-matter-server matter_server.server.client_handler[614] ERROR [548429312336] Error handling message: CommandMessage(message_id='6959f559728e4590bda3f6f400880759', command='commission_on_network', args={'setup_pin_code': 89446025, 'ip_addr': 'fe80::daa0:11ff:fe92:6a65%47'})
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/client_handler.py", line 188, in _run_handler
    result = await result
             ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/matter_server/server/device_controller.py", line 276, in commission_on_network
    raise NodeCommissionFailed(
matter_server.common.errors.NodeCommissionFailed: Commission using IP failed for node 94

agners added a commit to agners/python-matter-server that referenced this issue Jan 11, 2024
When receiving link-local IPv6 addresses, remove the scope by default
as the scope id is typically from a remote machine (e.g. from the Android
phone).

If the Matter server got started with a primary interface set, scope the
link-local address to that interface instead.

This fixes commissioning issue for WiFi devices when using the Android
in-app commissioning flow which sends the IP address alongside.

Related-to: home-assistant-libs#463
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants