Ubuntu 18.04 unable to resolve cognitiveservices DNS names #798

Closed
1 of 5 tasks
mikeharder opened this issue Apr 28, 2020 · 27 comments
Assignees
Labels
bug Something isn't working OS: Ubuntu

Comments

@mikeharder

Describe the bug
Ubuntu 18.04 is unable to resolve *.cognitiveservices.azure.com DNS names by default. As a workaround, we are bypassing the local (stub) DNS server using the following command:

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

This may be related to https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1822416 and/or #397.

Area for Triage:
Servers

Question, Bug, or Feature?:
Bug

Virtual environments affected

  • macOS 10.15
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS
  • Windows Server 2016 R2
  • Windows Server 2019

Expected behavior
Ubuntu 18.04 should be able to resolve *.cognitiveservices.azure.com DNS names by default.

Actual behavior
If we try to resolve a *.cognitiveservices.azure.com DNS name, it fails with SERVFAIL:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=3dc411e8-b5bf-57f2-a8a7-b25d565c86b1&t=f636eda2-37c8-5cad-c3dc-807f9e9ed0bb&l=59

However, if we bypass the local (stub) DNS server using the following command:

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

Then the DNS name can be resolved successfully:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=3dc411e8-b5bf-57f2-a8a7-b25d565c86b1&t=aef3c3f0-973b-547b-f96d-ab903995d1d8&l=59
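
To compare the two cases from a shell, one option is to read the response status out of `dig` output. The following is a minimal sketch, assuming `dig` is installed on the agent. The helper name `dns_status` is hypothetical and is not part of the linked pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the status field (e.g. NOERROR, SERVFAIL)
# from dig output supplied on stdin.
dns_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Example (needs dig and network access, so it is left commented out):
#   dig @127.0.0.53 mharder-formrec.cognitiveservices.azure.com A | dns_status
#   dig @168.63.129.16 mharder-formrec.cognitiveservices.azure.com A | dns_status
```

The first query goes through the local stub and the second goes straight to the upstream server reported later in this thread, so differing statuses would point at the stub.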

This doesn't repro on Ubuntu 16.04:

https://dev.azure.com/mharder/public/_build/results?buildId=634&view=logs&j=88c4e28e-b89e-5514-cbb8-a3c153cbe716&t=a85e08da-6704-59a1-1759-4e62f4964eb3&l=59

Pipeline Sources: https://github.com/mikeharder/AzurePipelineTests/blob/f920bf50f72fe45c1d653a3bbaac9dcaf3df7682/azure-pipelines.yml

mikeharder added a commit to Azure/azure-sdk-for-java that referenced this issue Apr 28, 2020
- Workaround issue resolving cognitiveservices names
- actions/runner-images#798
@vmapetr vmapetr added bug Something isn't working OS: Ubuntu and removed needs triage labels Apr 29, 2020
@Darleev Darleev self-assigned this Apr 29, 2020
@mikeharder
Author

Also, this does not repro on a new Azure Ubuntu 18.04 VM. It only repros on a DevOps Ubuntu 18.04 Hosted Agent.

@Darleev
Contributor

Darleev commented Apr 30, 2020

@mikeharder Hello, thank you for the details and the investigation. I was able to reproduce the issue on an Ubuntu 18.04 agent, and it seems something is configured incorrectly here: the file indicates that the systemd-resolve service should be responsible for the name servers, but it is not.

systemd-resolve --status
.........
Link 2 (eth0)
          DNS Servers: 168.63.129.16
          DNS Domain: mqljooayxuiuncg2ys3siqhtpb.xx.internal.cloudapp.net

Locally, we still use the 127.0.0.53 address, which is recorded in the /etc/resolv.conf file. It seems you are right: the local /etc/resolv.conf file needs to be linked to the systemd-resolve file.
I will keep you posted.
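
A quick way to tell which of the two configurations a machine is using is to check whether the resolver file lists the stub address. This is a minimal sketch. The helper name `uses_stub_resolver` is hypothetical, and the file argument defaults to /etc/resolv.conf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given resolv.conf-style file lists the
# systemd-resolved stub listener (127.0.0.53) as a nameserver.
uses_stub_resolver() {
    local file="${1:-/etc/resolv.conf}"
    grep -Eq '^[[:space:]]*nameserver[[:space:]]+127\.0\.0\.53([[:space:]]|$)' "$file"
}

# Example:
#   uses_stub_resolver && echo "queries go through the local stub"
#   uses_stub_resolver /run/systemd/resolve/resolv.conf || echo "upstream servers are listed directly"
```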

@Darleev
Contributor

Darleev commented May 1, 2020

@mikeharder I have created a pull request with the suggested workaround. As soon as all verification processes are complete, the workaround will be applied to the Ubuntu 18 agents; until then, your workaround is the best option.
I will let you know when the PR is merged.

@Darleev
Contributor

Darleev commented May 5, 2020

@mikeharder We have added a fix for the issue to the image, and it will be rolled out next week.
If you have any questions, please let us know; we will be glad to assist you further.

@miketimofeev miketimofeev added the awaiting-deployment Code complete; awaiting deployment and/or deployment in progress label May 5, 2020
@nerijusk

nerijusk commented May 5, 2020

@mikeharder I have created Pull Request with suggested workaround. As soon as all verification processes are complete, workaround will be applied on Ubuntu 18 agents, until then your workaround is the best option here.
I will let you know additionally, when PR will be merged.

The PR breaks agent build for self hosted agent pools:

Create resolv.conf link.
ln: failed to create symbolic link '/etc/resolv.conf': Device or resource busy

@Darleev
Contributor

Darleev commented May 5, 2020

@nerijusk
Agreed. We didn't take into account that Docker self-hosted agents occupy the /etc/resolv.conf file and use it inside containers. We have rolled the pull request back.
@mikeharder it seems the suggested solution cannot be applied to the Ubuntu 18 image because of the limitation described above. I will try to find another solution that hopefully does not affect anything else.

@Darleev Darleev removed the awaiting-deployment Code complete; awaiting deployment and/or deployment in progress label May 5, 2020
@mikeharder
Author

@Darleev: Sounds good. I don't think my workaround is the correct long-term solution; it's just sufficient to unblock our builds. The local (stub) DNS server should be able to resolve all DNS names.

As I mentioned earlier, this doesn't repro on an Azure Ubuntu 18 VM, so you might want to start by figuring out which additional component on a DevOps Hosted Ubuntu 18 VM is causing this behavior difference. Maybe Docker?

@mikeharder
Author

Last time I tested it I am pretty sure it did not repro on an Azure Ubuntu 18 VM. However, I just created a new Azure Ubuntu 18 VM and now it does repro until I use the same workaround.

@mikeharder
Author

However, it does not repro on a Hyper-V VM created from ubuntu-18.04.4-live-server-amd64.iso.

@al-cheb
Contributor

al-cheb commented May 5, 2020

@mikeharder I have created Pull Request with suggested workaround. As soon as all verification processes are complete, workaround will be applied on Ubuntu 18 agents, until then your workaround is the best option here.
I will let you know additionally, when PR will be merged.

The PR breaks agent build for self hosted agent pools:

Create resolv.conf link.
ln: failed to create symbolic link '/etc/resolv.conf': Device or resource busy

@nerijusk, could you please validate the script with the following small change?

if [[ -f /run/systemd/resolve/resolv.conf ]]; then
    echo "Create resolv.conf link."
    ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
fi
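
The "Device or resource busy" error is consistent with /etc/resolv.conf being a mount target inside a container, which matches the earlier note that Docker self-hosted agents occupy the file. As a hedged sketch of an additional guard (not the change that was actually merged), the script could also skip the link when the file appears in the mount table. The helper name `is_mount_target` is hypothetical, and its second argument exists only to make it testable against a /proc/mounts-style file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given path appears as a mount target
# (second whitespace-separated field) in a /proc/mounts-style table.
is_mount_target() {
    local path="$1" table="${2:-/proc/mounts}"
    awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$table"
}

# Sketch of the guarded link step (commented out because it modifies the system):
#   if [[ -f /run/systemd/resolve/resolv.conf ]] && ! is_mount_target /etc/resolv.conf; then
#       echo "Create resolv.conf link."
#       ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
#   fi
```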

@Darleev
Contributor

Darleev commented May 14, 2020

@mikeharder @nerijusk we have applied a new fix based on the comment above, and it will be rolled out next week.

@Darleev Darleev added the awaiting-deployment Code complete; awaiting deployment and/or deployment in progress label May 14, 2020
@chkimes

chkimes commented May 19, 2020

However, it does not repro on a Hyper-V VM created from ubuntu-18.04.4-live-server-amd64.iso.

Are we pulling in kernel upgrades with each new image? Do we suspect a DNS bug that was recently introduced into systemd? If so, we should probably file an issue upstream.

@mikeharder
Author

@chkimes: It might be this bug, but I am not an expert in this area so I am not certain:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1822416

While it does not repro on a new VM created from the ISO, it does also repro on a new Azure Ubuntu 18.04 image, so the issue does appear to be upstream from DevOps.

One thing I know recently changed in the Azure Ubuntu 18.04 image is the default culture was changed from en-US to C (invariant), which caused changes like the enumeration order of filesystem items. I wouldn't expect this to be related to this DNS issue, but maybe?

@chkimes

chkimes commented May 19, 2020

It's possible that they're related, but the behaviors appear to be different from the bug description. I'm seeing the DNS resolver return SERVFAIL while the linked ticket describes NXDOMAIN responses.

I took a packet capture and, interestingly, I see the systemd resolver making an external DNS query and the response making it back to systemd, however it appears to completely ignore the response and re-issue queries 1 and 2 seconds later (likely due to a configured timeout).

#	TIMESTAMP	Source		Destination	Protocol Length	Details
176	17:02:13.621323	127.0.0.1	127.0.0.53	DNS	 91	Standard query 0xbd0c A sandbox.app.blackduck.com
177	17:02:13.621535	10.1.0.4	168.63.129.16	DNS	 102	Standard query 0x289f A sandbox.app.blackduck.com OPT
178	17:02:13.724710	168.63.129.16	10.1.0.4	DNS	 474	Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
271	17:02:14.430476	10.1.0.4	168.63.129.16	DNS	 102	Standard query 0x289f A sandbox.app.blackduck.com OPT
272	17:02:14.445568	168.63.129.16	10.1.0.4	DNS	 474	Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
283	17:02:16.180575	10.1.0.4	168.63.129.16	DNS	 102	Standard query 0x289f A sandbox.app.blackduck.com OPT
284	17:02:16.217957	168.63.129.16	10.1.0.4	DNS	 474	Standard query response 0x289f A sandbox.app.blackduck.com A 34.66.31.136 A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 OPT
349	17:02:18.621333	127.0.0.1	127.0.0.53	DNS	 91	Standard query 0xbd0c A sandbox.app.blackduck.com
350	17:02:19.430426	10.1.0.4	168.63.129.16	DNS	 102	Standard query 0x289f A sandbox.app.blackduck.com OPT
351	17:02:19.432270	168.63.129.16	10.1.0.4	DNS	 474	Standard query response 0x289f A sandbox.app.blackduck.com A 34.67.84.124 A 35.184.46.39 A 35.184.79.99 A 35.184.144.251 A 35.184.217.249 A 35.188.45.49 A 35.188.165.96 A 35.192.71.147 A 35.193.54.152 A 35.193.180.118 A 35.194.28.222 A 35.202.66.146 A 35.222.40.6 A 35.222.109.216 A 35.222.135.89 A 35.222.182.121 A 35.232.109.144 A 35.232.252.6 A 35.238.225.91 A 35.239.6.69 A 35.239.172.178 A 35.239.196.223 OPT
481	17:02:23.621557	127.0.0.1	127.0.0.53	DNS	 91	Standard query 0xbd0c A sandbox.app.blackduck.com
482	17:02:23.930485	127.0.0.53	127.0.0.1	DNS	 91	Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
483	17:02:23.930567	127.0.0.53	127.0.0.1	DNS	 91	Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
484	17:02:23.930587	127.0.0.53	127.0.0.1	DNS	 91	Standard query response 0xbd0c Server failure A sandbox.app.blackduck.com
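
The retry pattern can be tallied mechanically from a text export of the capture. The sketch below assumes the export uses the same layout as the listing above. `count_queries` is a hypothetical helper name. Applied to that listing, it counts four upstream queries with ID 0x289f, each of which has a matching response, while the client's query 0xbd0c still ends in Server failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count query packets (not responses) carrying the given
# DNS transaction ID in a text capture supplied on stdin.
count_queries() {
    local id="$1"
    grep -c "Standard query ${id} "
}

# Example, with the capture saved to a file named capture.txt (an assumed name):
#   count_queries 0x289f < capture.txt
#   count_queries 0xbd0c < capture.txt
```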

@Darleev
Contributor

Darleev commented May 26, 2020

@mikeharder the fix has been applied to the current images, and the initial DNS issue should no longer reproduce. Could you please verify?
@chkimes if you believe we need to investigate the issue further and report it to the systemd team, please let us know.

@miketimofeev miketimofeev removed the awaiting-deployment Code complete; awaiting deployment and/or deployment in progress label May 26, 2020
@chkimes

chkimes commented May 26, 2020

I think at least reporting the bug to systemd is the responsible thing to do here. It was clearly regressed in a recent release, so something broke and we shouldn't have to work around it.

@miketimofeev
Contributor

@Darleev @chkimes this workaround breaks things for some users, so we have to roll back the changes:
#929 (comment)

@Darleev
Contributor

Darleev commented May 27, 2020

@chkimes @mikeharder This looks like a systemd-resolve bug that cannot be fixed on our side because of the possible unpredictable impact on other customers (example). As a workaround, I suggest using the approach described in the initial message.
To find the root cause, please file a question here for the systemd-resolve team or report the bug directly in their bug tracking system.
If you have any questions, feel free to contact us.

@mikeharder
Author

@Darleev, @chkimes: Last time I tested, I could repro this on a new Azure Ubuntu 18 VM, but not on a new Hyper-V VM created from the latest Ubuntu Server ISO. And I believe both VMs were using the same version of systemd-resolve.

So I am not sure if this issue is in base Ubuntu Server image, or specific to the Azure Ubuntu image. Do you know how to report issues against the Azure Ubuntu images?

@Darleev
Contributor

Darleev commented May 27, 2020

@mikeharder Could you please provide an output of commands:

sudo systemd-resolve --status
sudo systemd-resolve --version
apt-cache policy libnss-resolve

from the Hyper-V VM? It will help us understand the difference.

@Darleev
Contributor

Darleev commented Jun 1, 2020

@mikeharder,
Just a gentle reminder that we are waiting for your reply regarding the Ubuntu virtual machine where the systemd issue does not reproduce.
Could you please provide the output of the aforementioned commands?

@mikeharder
Author

@Darleev: The output of the latter two commands appears to be identical on both an Azure Ubuntu 18 VM and a Hyper-V Ubuntu 18 VM (created from the Ubuntu Server ISO).

$ sudo systemd-resolve --version

*** Azure ***
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN
-PCRE2 default-hierarchy=hybrid

*** Hyper-V ***
systemd 237
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP
+GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN
-PCRE2 default-hierarchy=hybrid
$ apt-cache policy libnss-resolve

*** Azure ***
libnss-resolve:
  Installed: (none)
  Candidate: 237-3ubuntu10.41
  Version table:
     237-3ubuntu10.41 500
        500 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages
     237-3ubuntu10.38 500
        500 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages
     237-3ubuntu10 500
        500 http://azure.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages

*** Hyper-V ***
libnss-resolve:
  Installed: (none)
  Candidate: 237-3ubuntu10.41
  Version table:
     237-3ubuntu10.41 500
        500 http://us.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages
     237-3ubuntu10.38 500
        500 http://us.archive.ubuntu.com/ubuntu bionic-security/universe amd64 Packages
     237-3ubuntu10 500
        500 http://us.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages

The first command appears to be identical in the "Global" section, with slight differences in the "Link" sections:

$ sudo systemd-resolve --status

*** Azure ***
Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 3 (rename3)
      Current Scopes: none
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no

Link 2 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 168.63.129.16
          DNS Domain: vy4dqvqijknelj0nz0uufugejc.xx.internal.cloudapp.net

*** Hyper-V ***
Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 2 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: <redacted>
                      <redacted>
          DNS Domain: <redacted>

@Darleev
Contributor

Darleev commented Jun 9, 2020

Just for reference #191441649

@chkimes

chkimes commented Jun 15, 2020

After much digging, I believe this is relevant: systemd/systemd#10672

I see that with EDNS extensions, we are specifying a maximum 512-byte response size. When Azure responds, the response does not include the final A record. If the EDNS extension is removed, Azure then responds with the final A record. I find this strange since neither response goes over 512 bytes. I'm still following up with Azure here, but perhaps we may be able to work around it by disabling the EDNS extension or attempting to raise the max response size.

@chkimes

chkimes commented Jun 15, 2020

Successful query:

1135	16:27:25.964386	10.1.0.4	168.63.129.16	DNS	128	Standard query 0xc2d4 A mharder-formrec.cognitiveservices.azure.com OPT

Domain Name System (query)
    Transaction ID: 0xc2d4
    Flags: 0x0120 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ..1. .... = AD bit: Set
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        mharder-formrec.cognitiveservices.azure.com: type A, class IN
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 4096
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 12
            Option: COOKIE

Unsuccessful query:

1128	16:27:25.713886	10.1.0.4	168.63.129.16	DNS	116	Standard query 0x198d A mharder-formrec.cognitiveservices.azure.com OPT

Domain Name System (query)
    Transaction ID: 0x198d
    Flags: 0x0100 Standard query
        0... .... .... .... = Response: Message is a query
        .000 0... .... .... = Opcode: Standard query (0)
        .... ..0. .... .... = Truncated: Message is not truncated
        .... ...1 .... .... = Recursion desired: Do query recursively
        .... .... .0.. .... = Z: reserved (0)
        .... .... ...0 .... = Non-authenticated data: Unacceptable
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 1
    Queries
        mharder-formrec.cognitiveservices.azure.com: type A, class IN
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 512
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 0

Notable difference:

Success:
            UDP payload size: 4096

Failure:
            UDP payload size: 512

And notable differences in the responses:

Success:
    Flags: 0x8180 Standard query response, No error
        .... ..0. .... .... = Truncated: Message is not truncated

Failure:
    Flags: 0x8380 Standard query response, No error
        .... ..1. .... .... = Truncated: Message is truncated
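
The two flags words differ only in the TC (truncated) bit, which is mask 0x0200 in the 16-bit DNS header flags field. As a small worked check, the sketch below tests that bit with shell arithmetic. The helper name `is_truncated` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the TC (truncated) bit, mask 0x0200,
# is set in a DNS header flags word such as 0x8380.
is_truncated() {
    local flags="$1"
    (( (flags & 0x0200) != 0 ))
}

# Flags values taken from the two responses above:
#   is_truncated 0x8180   # success case: TC bit clear, returns non-zero
#   is_truncated 0x8380   # failure case: TC bit set, returns zero
```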

Interestingly, systemd-resolved is setting the maximum payload size to 512 regardless of whether EDNS0 is configured and regardless of what is sent to it for the payload size. I'm reasonably sure that the way to fix this is to increase the payload size that systemd-resolved is using but I can't find any details about how to do that in the docs.

This explains why bypassing the local resolver was effective as a workaround.
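
One way to test the payload-size hypothesis directly against the upstream server is to send the same question with each advertised size. The sketch below only prints the `dig` command lines so they can be reviewed before being run by hand. It assumes `dig` is available, and `probe_command` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a dig command line that asks SERVER for NAME over
# UDP only, advertising SIZE bytes as the EDNS0 UDP payload size.
#   +bufsize=N  sets the advertised EDNS0 UDP payload size
#   +notcp      keeps the query on UDP
#   +ignore     reports a truncated reply instead of retrying over TCP
probe_command() {
    local server="$1" name="$2" size="$3"
    printf 'dig +notcp +ignore +bufsize=%s @%s %s A\n' "$size" "$server" "$name"
}

# Example: print both probes, then run them on an affected machine.
#   probe_command 168.63.129.16 mharder-formrec.cognitiveservices.azure.com 512
#   probe_command 168.63.129.16 mharder-formrec.cognitiveservices.azure.com 4096
```

If the hypothesis holds, the 512-byte probe should come back with the `tc` flag set and the 4096-byte probe should not.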

@Darleev
Contributor

Darleev commented Jul 3, 2020

Hello @mikeharder,
In the end, we could not find a way to change the UDP payload size for virtual machines; it seems it can only be changed on the systemd-resolved side. I have filed a bug in the Ubuntu issue tracker.

If you have any questions or issues, feel free to contact us.

@mikeharder
Author

Suspected root cause: Azure/WALinuxAgent#1673
