Regression 2.1.13: AuthenticationException in Centos 7 using WCF #31110
Comments
Crossposted from here |
Moved to corefx repo as this isn't a bug/change introduced in WCF code. @davidsh, are you aware of any changes in NegotiateStream between 2.1.3 and 2.1.13 which would have caused this? |
Thanks for the reply, and thanks for moving it to the correct place.
Yes, I’ve tested this on several hosts, both docker containers and Linux headnodes, and the behaviour is consistent.
Let me know right away if there’s anything I can provide to help diagnose the issue.
… On 8 Oct 2019, at 21:52, Matt Connew ***@***.***> wrote:
Moved to corefx repo as this isn't a bug/change introduced in WCF code. @davidsh, are you aware of any changes in NegotiateStream between 2.1.3 and 2.1.13 which would have caused this?
@Cronan, if this works with 2.1.3 on the same host, then that should rule out any problems with kinit. I also wouldn't expect an error saying it couldn't find the server in the database if there was a problem with kinit as that implies it's communicating with the Kerberos server successfully.
|
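@Cronan, it might be useful to find the exact version which caused the break. I wasn't going to suggest that as it can be a lot of work and it's possible that the appropriate devs might know which changes may have broken this, but if you have it working in docker, it might not be a lot of work. Especially if you do a binary search through the versions.
Something quicker that would likely be useful is to capture a Kerberos trace for when it's working and when it's not working. You can do this by setting the environment variable KRB5_TRACE before you run your code to capture Kerberos tracing. You can do one of the following:
Output log to stdout
export KRB5_TRACE=/dev/stdout
Output log to file
export KRB5_TRACE=~/krb5_trace.log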
|
Thanks for bringing this issue to my attention. Yes, there were changes between 2.1.3 and 2.1.13. We ported some fixes from .NET Core 3.0 into the latest servicing releases for 2.1 and 2.2. These were high priority Linux Kerberos related fixes. See: dotnet/corefx#40109 I'll need to dig into your particular scenario to see why it appears to have been regressed by these changes. |
Can you please provide a repro for this? Or describe in great detail the scenario, networking environment and especially how Kerberos is being used. What OS is the server? What OS is the client? For the Linux client, please include the krb5.conf file on the machine and a KRB5_TRACE.LOG as described above. But, please don't post any confidential information to this GitHub issue. If necessary, you can mask off any information in the traces that is confidential. |
Thank you, both great suggestions, I’ll try first thing tomorrow.
I’m finding it hard to find up to date Kerberos rpms, and the details of working with the tarballs are not very complete, do you have some tips for trying other versions that might be quicker?
… On 8 Oct 2019, at 22:53, Matt Connew ***@***.***> wrote:
@Cronan, it might be useful to find the exact version which caused the break. I wasn't going to suggest that as it can be a lot of work and it's possible that the appropriate devs might know which changes may have broken this, but if you have it working in docker, it might not be a lot of work. Especially if you do a binary search through the versions.
Something quicker that would likely be useful is to capture a Kerberos trace for when it's working and when it's not working. You can do this by setting the environment variable KRB5_TRACE before you run your code to capture Kerberos tracing. You can do one of the following:
Output log to stdout
export KRB5_TRACE=/dev/stdout
Output log to file
export KRB5_TRACE=~/krb5_trace.log
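Since a trace is wanted for both the working and the failing runtime, it may help to keep the two logs in separate files. The following is a minimal sketch, not something from the thread: the helper name `trace_run` and the log file naming are hypothetical, and it sets KRB5_TRACE only for the one command rather than exporting it for the whole shell.

```shell
# Run a command with Kerberos tracing written to ./krb5_trace_<label>.log.
# Usage: trace_run <label> <command> [args...]
# The helper name and the log naming scheme are illustrative assumptions.
trace_run() {
    label=$1
    shift
    # env sets KRB5_TRACE for this one process only
    env KRB5_TRACE="$PWD/krb5_trace_${label}.log" "$@"
}
```

For example, `trace_run 2.1.3 dotnet MyClient.dll` and `trace_run 2.1.13 dotnet MyClient.dll` (where `MyClient.dll` is a placeholder for the WCF client) would leave two logs that can be compared side by side.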
|
@Cronan, I wouldn't try changing your Kerberos stack at this point, it will just add noise. Things were working with .NET Core 2.1.3 so unless we find a reason to believe there's an issue with the version of Kerberos you are running with, just leave everything which isn't .NET Core in the same state as when it's working with .NET Core 2.1.3. |
@davidsh, He's provided details of the client in the initial bug description. He also explained the server is a Windows WCF service. Does the specific Windows version matter? @Cronan, for your WCF client, are you explicitly specifying a service identity? You would have a reference to one of the classes SpnEndpointIdentity, UpnEndpointIdentity or DnsEndpointIdentity. If you don't have a reference to any of these in your code, we default to using a targetName of host/hostname where hostname is extracted from the URI of the service you are connecting to. So depending on the address you provided to WCF, it might not be an FQDN. |
No. But we need a repro. At the very least, snippets of WCF code for the server and client and how the service is configured. Also, can you please get KRB5_TRACES for both the working scenario (2.1.3) and broken scenario (2.1.13)? As a workaround can you try setting this environment variable on the Linux client? It will enable the 'legacy' HTTP stack (uses libcurl on Linux). I'm curious if it will help with this problem at least as a workaround for now: export DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0 |
@davidsh, this is not using HTTP, this is NegotiateStream. |
Oh. Good point. Yes, setting that environment variable won't change anything. |
Sorry, I actually meant Centos rpms, I was typing that on a phone ... |
|
KRB5_TRACE.LOG for 2.1.3 (working scenario):
|
KRB5_TRACE.LOG for 2.1.13 (failing scenario):
|
So, the first thing that jumps out to me is that in the working version the server principal used is mywindowsservice-prod, which is the alias used in the SPN and URI by the client to connect. In the failing version, the server principal used is the server name, not the alias used as above.
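One way to make this comparison repeatable is to pull the requested server principal out of each trace log. This is a rough sketch only: the function name `server_principals` is hypothetical, and it assumes the trace contains lines of the form `Getting credentials <client> -> <server> using ccache <name>`, which may vary between MIT Kerberos versions.

```shell
# Print the distinct server principals requested in a KRB5_TRACE log.
# Assumes trace lines shaped like:
#   ... Getting credentials <client> -> <server> using ccache <name>
# Usage: server_principals <trace-file>
server_principals() {
    sed -n 's/.*Getting credentials [^ ]* -> \([^ ]*\) using ccache.*/\1/p' "$1" | sort -u
}
```

Running it against the working and the failing log should show the alias-based principal in one and the host-name-based principal in the other, if the traces match the assumed format.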
|
Sample code below:
|
Service config:
|
I did the binary search and 2.1.12 is the most recent version that works, and 2.1.13 is the version that starts failing. |
As an experiment, please try adding this to your krb5.conf file in the [libdefaults] section
Normally, the best practice is to define SPNs that use the FQDN of the server, which means using the A record DNS name. But it seems that your environment has defined the SPN against the CNAME (alias) DNS record. Setting the above in your krb5.conf file should prevent forward lookups of the hostname. I understand that this worked before upgrading to 2.1.13, but I suspect other changes we made in that servicing release have now affected using CNAMEs as SPNs. Please try this out and also send us updated KRB5_TRACE logs. |
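The setting itself did not survive in this copy of the thread. Given the description (preventing forward lookups of the hostname) and the later mention of dns_canonicalize_hostname, one plausible candidate is `dns_canonicalize_hostname = false`, which is an MIT Kerberos [libdefaults] option. Treat that as an assumption, not as what was actually posted. The sketch below only writes the candidate fragment to a scratch file (the path is hypothetical) so it can be reviewed before being merged into a real krb5.conf.

```shell
# Write a candidate [libdefaults] fragment to a scratch file for review.
# ASSUMPTION: dns_canonicalize_hostname = false is a guess at the setting
# elided from the thread; the scratch path is illustrative only.
fragment="${TMPDIR:-/tmp}/krb5-libdefaults-fragment.conf"
cat > "$fragment" <<'EOF'
[libdefaults]
    dns_canonicalize_hostname = false
EOF
cat "$fragment"
```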
@Cronan, you are including the TCP port number your service is running on as part of the SPN name. Is that intentional? Is the port number part of the SPN as set up in your DC? @davidsh, only a process running as the SYSTEM user or one of its equivalents such as Network Service is able to use the hostname or FQDN SPN. If you are running with any other user, you normally need to create a different SPN to use. @davidsh, I could understand the CNAME/FQDN thing when using HttpClient, but this is NegotiateStream. WCF is explicitly providing the targetName from the provided SpnEndpointIdentity. Are you saying that NegotiateStream will use something different than what was provided to the targetName argument? |
NegotiateStream, HttpClient, and SqlClient all use a common native library, System.Net.Security.Native, on Linux. That library makes the various GSS-API calls to support the Negotiate/Kerberos/NTLM protocols. GSS-API uses the Linux krb5.conf file for various settings. The 'targetName' passed in from NegotiateStream is still present, unchanged, at the call into the GSS-API. But in the implementation of the GSS-API call, the SPNEGO protocol uses the settings in the krb5.conf file and apparently transforms and normalizes the 'targetName' (i.e., CNAME -> A record) before doing a Kerberos principal lookup. That transformation is not something controllable at the .NET layer. That is why I suggested to try the |
I found the documentation for this behavior. This seems an odd design decision as the app can do that dance if needed but as you said, nothing that can be done about it. Based on those docs, it looks like this was caused by a change to use GSS_C_NT_HOSTBASED_SERVICE. |
We made the change in .NET Core 3.0 and servicing for .NET Core 2.1/2.2 because a lot of scenarios were broken until we started using GSS_C_NT_HOSTBASED_SERVICE, which appears to be the preferred format for the SPNEGO protocol plugin for GSS-API. But this issue here raises some interesting test areas that we probably want to address in a better way long-term in .NET Core. However, short-term, adjusting krb5.conf might be the best way to unblock things for now. |
Yes, the port number is part of the SPN. We tend to use the same name but different ports for TCP vs REST, for example. |
I'm definitely going to try this today, but it's obviously not a fix for an entire environment - for example, it would prevent short names (like
It also leaves my systems in a weird position, where most Linux calls that need authentication (e.g. Or am I misunderstanding the situation? |
Are all your SPNs defined only against FQDN CNAMES? Or do you have a mixture of services defined with SPNs using either A or CNAME? |
Interesting article talking about changes to MIT Kerberos libraries that will allow for trying different behavior (CNAME first, then A if CNAME fails) when trying to find proper SPN. This new version of MIT Kerberos (Release 1.18) will have a new option for dns_canonicalize_hostname = fallback See: http://k5wiki.kerberos.org/wiki/Projects/Server_Hostname_Canonicalization |
There are a mixture of services defined with SPNs using either A or CNAME. |
It looks to me like Centos 7 is currently on 1.15
Yes, definitely 1.15.1:
But this looks interesting - do you think this will work with |
I haven't tried it out yet. But those krb5.conf options do affect GSS_C_NT_HOSTBASED_SERVICE name types. So, I do expect it to work with the new |
@davidsh, it would seem that the switch to using GSS_C_NT_HOSTBASED_SERVICE is a breaking change. Would it be possible to retroactively add an AppSetting to control this on a per-app basis? Requiring a system wide krb5.conf configuration which affects all Kerberos usage (and could break a different scenario when setting things so .NET Core works) seems a little too coarse to control behavior. |
Apps can use a custom krb5.conf file by setting an environment variable: http://web.mit.edu/kerberos/krb5-1.13/doc/admin/conf_files/krb5_conf.html
Are all your channels/services using the same type of WCF bindings? Are any using HTTP related WCF bindings? I would like to understand if any of them use .NET Core classes besides System.Net.Security.NegotiateStream. |
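To keep the change scoped to one app rather than the whole machine, the KRB5_CONFIG environment variable can be set for just that process. A minimal sketch, assuming a private config file already exists; the helper name `run_with_krb5_config` and the example paths are hypothetical.

```shell
# Run a command with KRB5_CONFIG pointing at an app-private krb5.conf.
# Usage: run_with_krb5_config <conf-file> <command> [args...]
# The helper name is illustrative; KRB5_CONFIG is the documented override.
run_with_krb5_config() {
    conf=$1
    shift
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    # env limits the override to this one process; /etc/krb5.conf is untouched
    env KRB5_CONFIG="$conf" "$@"
}
```

For example, `run_with_krb5_config "$HOME/myapp/krb5.conf" dotnet MyClient.dll` (both names are placeholders) would leave other Kerberos users on the machine reading the system-wide file.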
All the services I’m currently using are WCF TCP - I’ll try the custom Conf file and report back.
I’ll also investigate and see whether there are HTTP services in use that I can test.
… On 19 Oct 2019, at 18:59, David Shulman ***@***.***> wrote:
Requiring a system wide krb5.conf configuration which affects all Kerberos usage (and could break a different scenario when setting things so .NET Core works) seems a little too coarse to control behavior.
Apps can use a custom krb5.conf file by setting an environment variable:
http://web.mit.edu/kerberos/krb5-1.13/doc/admin/conf_files/krb5_conf.html
krb5.conf
The krb5.conf file contains Kerberos configuration information, including the locations of KDCs and admin servers for the Kerberos realms of interest, defaults for the current realm and for Kerberos applications, and mappings of hostnames onto Kerberos realms. Normally, you should install your krb5.conf file in the directory /etc. You can override the default location by setting the environment variable KRB5_CONFIG. Multiple colon-separated filenames may be specified in KRB5_CONFIG; all files which are present will be read.
@Cronan
There are a mixture of services defined with SPNs using either A or CNAME.
Either seem to work OK in Windows, some are tricky from Linux.
Are all your channels/services using the same type of WCF bindings? Are any using HTTP related WCF bindings? I Would like to understand if any of them use .NET Core class besides System.Net.Security.NegotiateStream.
|
Have you been able to test using the custom .conf file? Also, do you have NTLM installed on the Linux machines (gss-ntlmssp)? And what credential are you using in the .NET layer? Is it The reason for the questions is that when Negotiate (SPNEGO) protocol is used but has errors with Kerberos, NTLM fallback would be possible assuming you are using explicit credentials and have installed the NTLM package on the Linux machine. |
Yes, this worked! 🎉
I don't see anything like this call in my WCF client code, where would it be?
|
This is the first of several PRs that add Enterprise Scenarios Testing capability to the repo. This PR focusses on Linux which allows for docker containers to be used in an enterprise network configuration. I focussed on 2 workflows: 1) The 'dev' workflow, and 2) The PR/CI workflow. The dev workflow works well since it's using containers in a docker-compose environment along with volume mounting your current dev's repo enlistment. The PR/CI workflow gives us an Azure DevOps pipeline to automate verification. I still need to work with the infra team to add a real pipeline that will run. I can't do that until this is merged. In the meantime, I have my own DevOps pipeline that verified this PR. See: https://dev.azure.com/systemnetncl/Enterprise%20Testing/_build/results?buildId=141 I will be linking a follow-up GitHub issue describing the roadmap for building on this system including adding Windows environments, NTLM protocol, proxies, and other libraries such as System.Net.Mail and System.Data.SqlClient. Those libraries also use Negotiate/Kerberos/NTLM enterprise-oriented protocols. Contributes to: https://github.com/dotnet/corefx/issues/41652 https://github.com/dotnet/corefx/issues/41489 https://github.com/dotnet/corefx/issues/36896 https://github.com/dotnet/corefx/issues/30150 https://github.com/dotnet/corefx/issues/24707 https://github.com/dotnet/corefx/issues/10041 https://github.com/dotnet/corefx/issues/6606 https://github.com/dotnet/corefx/issues/6161
* Address PR feedback
* Change pipeline *.yml to only run on selected filepaths for PRs
* Change kdc container Dockerfile to be based on ubuntu:18.04
* Fix typo in README.md
* Update .yml file
* Link (instead of copy) apache kerb module to the right place
@karelz What more info do you need from me? |
@Cronan it was part of our triage, so I assume @davidsh didn't feel like he has all the info to make it actionable yet. |
Actually, I do understand the root cause here. This issue is caused by a regression in behavior due to a PR fixing related Linux Kerberos issues as described above. See: #31110 (comment) I know how to reproduce this issue. The fix for this issue is complex because it will require some changes to the PAL layer to fix both this issue and preserve the other fixes previously made. I have been thinking about the proper fix. We can also test the fix in the 'runtime-libraries enterprise-linux' DevOps pipeline that I previously created. |
@davidsh do you think we can address this realistically in 5.0? |
It's a non-trivial fix. The fix has to be done in the Linux PAL layer. It has to be able to try two different forms of Kerberos principal names. I.e. try one form and if it doesn't match to an SPN in Kerberos, then try the alternate format. Based on current priorities and staffing I don't see this as achievable for 5.0. |
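The try-one-form-then-the-other idea can be illustrated outside the runtime. The sketch below is not the PAL change and does not touch GSS-API name types; it only shows the fallback pattern from the command line. It assumes the MIT `kvno` tool is installed and that it exits non-zero when the KDC does not know the principal; the function name `first_known_spn` and the `SPN_PROBE` variable are hypothetical.

```shell
# Probe command used to ask the KDC for a service ticket.
# ASSUMPTION: kvno (MIT krb5) fails when the principal is unknown.
SPN_PROBE="${SPN_PROBE:-kvno}"

# Print the first candidate SPN the probe accepts; return 1 if none match.
# Usage: first_known_spn <spn> [<spn>...]
first_known_spn() {
    for spn in "$@"; do
        if "$SPN_PROBE" "$spn" >/dev/null 2>&1; then
            printf '%s\n' "$spn"
            return 0
        fi
    done
    return 1
}
```

Called with the alias-based name first and the host-based name second, it reports which form the KDC actually has registered, which is the same decision the proposed fix would have to make.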
That’s disappointing but understandable. If someone can point me to the relevant place in the code to start looking I’m prepared to try to take this on. |
Describe the bug
I use the WCF client from .NET Core to access Windows WCF services from Linux.
Everything works correctly using .NET Core 2.1.3
Upgrading to 2.1.13 results in the exception below when calling the WCF client.
I also see the same problem with .NET Core 2.2 latest or 3.0 latest.
Expected behavior
I use the WCF client from .NET Core to access Windows WCF services from Linux.
Everything works correctly using .NET Core 2.1.3
I expected this to continue working in newer versions of .NET Core.
Additional context
I use kinit before making this call to ensure that I'm authenticated correctly. Am I missing some dependencies, or could this be something else?
Linux Version
.NET Core Version
Full Stack trace