jnlp failing on Windows nodes due to dns resolution issues #264

Closed
salluvada opened this issue Feb 1, 2020 · 8 comments
Labels: bug (Something isn't working)

Comments

@salluvada

I have an EKS cluster with two node groups, Linux and Windows. The Jenkins build pipelines run fine on Linux, but when I run a pipeline on the Windows nodes, the jnlp container keeps failing because it cannot connect to the operator's Jenkins service.

Below is the error I see in jnlp when I tail the logs.

seshu@SeshuZ820:~$ kubectl logs -c jnlp -f k8s-e2e-windows-11-04n3v-l50vg-2xb7x  
WARNING: Work directory is defined twice; in command-line arguments and the environment variable
Feb 01, 2020 5:35:23 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: k8s-e2e-windows-11-04n3v-l50vg-2xb7x
Feb 01, 2020 5:35:23 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 01, 2020 5:35:23 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.0
Feb 01, 2020 5:35:23 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using \home\jenkins\agent\remoting as a remoting work directory
Feb 01, 2020 5:35:23 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to \home\jenkins\agent\remoting
Feb 01, 2020 5:35:23 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-operator-http-example.default:8080/]
Feb 01, 2020 5:35:23 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://jenkins-operator-http-example.default:8080/tcpSlaveAgentListener/: jenkins-operator-http-example.default
java.io.IOException: Failed to connect to http://jenkins-operator-http-example.default:8080/tcpSlaveAgentListener/: jenkins-operator-http-example.default
        at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:217)
        at hudson.remoting.Engine.innerRun(Engine.java:690)
        at hudson.remoting.Engine.run(Engine.java:518)
Caused by: java.net.UnknownHostException: jenkins-operator-http-example.default
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:607)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:357)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1226)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1162)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1056)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:990)
        at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:214)
        ... 2 more

The underlying issue is DNS resolution. When I run curl from a Windows container against the operator service URL http://jenkins-operator-http-example.default:8080, I get the output below.

C:\>curl http://jenkins-operator-http-example.default:8080
curl: (6) Could not resolve host: jenkins-operator-http-example.default

But when I curl the FQDN of the operator service, I get a response.

C:\>curl http://jenkins-operator-http-example.default.svc.cluster.local:8080
<html><head><meta http-equiv='refresh' content='1;url=/login?from=%2F'/><script>window.location.replace('/login?from=%2F');</script></head><body style='background-color:white; color:white;'>


Authentication required
<!--
You are authenticated as: anonymous
Groups that you are in:

Permission you need to have (but didn't): hudson.model.Hudson.Read
 ... which is implied by: hudson.security.Permission.GenericRead
 ... which is implied by: hudson.model.Hudson.Administer
-->

</body></html>

What surprises me most is that on the Linux nodes I can curl http://jenkins-operator-http-example.default:8080 and get a response.

root@dotnetcore-57f64b754f-2qwzn:/# curl http://jenkins-operator-http-example.default:8080
<html><head><meta http-equiv='refresh' content='1;url=/login?from=%2F'/><script>window.location.replace('/login?from=%2F');</script></head><body style='background-color:white; color:white;'>


Authentication required
<!--
You are authenticated as: anonymous
Groups that you are in:
  
Permission you need to have (but didn't): hudson.model.Hudson.Read
 ... which is implied by: hudson.security.Permission.GenericRead
 ... which is implied by: hudson.model.Hudson.Administer
-->

</body></html>       

Do you think this is an operator issue, or is it more of an EKS issue? This is the last missing piece before I can successfully test the Jenkins operator in a multi-node-group scenario with Linux and Windows nodes.
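One possible explanation for the Linux/Windows difference is the resolver's search list. The sketch below is illustrative only: it mimics how a glibc-style resolver expands a name using the "search" and "ndots" settings from /etc/resolv.conf. The search list and ndots value are assumed typical values for a pod in the default namespace, not values taken from this cluster, and candidateNames is a hypothetical helper. Windows resolvers reportedly treat dotted names as already qualified, which would match the curl failure above, but that should be verified on the Windows nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames returns the names a glibc-style resolver would try, in order,
// for the given search list and ndots setting. Hypothetical helper for illustration.
func candidateNames(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // a trailing dot marks the name as absolute
	}
	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+domain)
	}
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as given first, then the search list.
		return append([]string{name}, expanded...)
	}
	// Too few dots: walk the search list first, then try the name as given.
	return append(expanded, name)
}

func main() {
	// Assumed search list for a pod in the "default" namespace.
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, c := range candidateNames("jenkins-operator-http-example.default", search, 5) {
		fmt.Println(c)
	}
}
```

Under these assumptions, the second candidate is the service FQDN, which is why the short name can resolve on a Linux pod even though it is not fully qualified.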

@salluvada
Author

salluvada commented Feb 1, 2020

As a workaround for now, I overrode the JENKINS_URL and JENKINS_TUNNEL parameters in the pod template and was able to run the pipeline successfully on a Windows node.

podTemplate(yaml: '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: jnlp
    image: jenkins/jnlp-agent:latest-windows
    env:
    - name: "JENKINS_URL"
      value: "http://jenkins-operator-http-example.default.svc.cluster.local:8080"
    - name: "JENKINS_TUNNEL"
      value: "jenkins-operator-slave-example.default.svc.cluster.local:50000"
  - name: shell
    image: mcr.microsoft.com/windows/servercore/iis
  nodeSelector:
    beta.kubernetes.io/os: windows
''') {
    node(POD_LABEL) {
        container('shell') {
            powershell 'Get-ChildItem Env: | Sort Name'
        }
    }
}

@tumevoiz

tumevoiz commented Feb 3, 2020

Hi @salluvada
We are happy that you found a solution.
Can we close the issue?

Cheers

@tumevoiz added the bug label Feb 3, 2020
@salluvada
Author

salluvada commented Feb 3, 2020

@jakalkhalili The problem with my workaround is that the name of the CR is hardcoded in the pipeline Groovy scripts, so if we deploy Jenkins with a different name for the CR, it would break all the pipelines. Can the operator generate the pods that run the pipelines with the fully qualified domain names of the Jenkins HTTP service and the Jenkins slave service?

Only the ".svc.cluster.local" suffix needs to be appended so that resolution is guaranteed, as per https://github.com/kubernetes/dns/blob/master/docs/specification.md

@tomaszsek

@salluvada The operator builds the names for the HTTP and agent Kubernetes services. I think the best solution is to read the cluster domain at pod runtime, as described here: https://stackoverflow.com/questions/52940774/kubernetes-how-to-check-current-domain-set-by-cluster-domain-from-pod. Would you like to contribute a fix for that? The modifications should go here:

func GetJenkinsHTTPServiceName(jenkins *v1alpha2.Jenkins) string {
and
func GetJenkinsSlavesServiceName(jenkins *v1alpha2.Jenkins) string {
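One way to read the cluster domain from inside a pod is to look at the resolver search list. This is a minimal sketch, not the operator's code: clusterDomainFromResolvConf is a hypothetical helper, and it assumes the pod's /etc/resolv.conf contains a search entry of the form "svc.<cluster-domain>", as is typical on Linux pods. The sample content, including the nameserver address, is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterDomainFromResolvConf scans resolv.conf content for a search entry of
// the form "svc.<cluster-domain>" and returns <cluster-domain>.
// Hypothetical helper; in a pod the content would come from /etc/resolv.conf.
func clusterDomainFromResolvConf(content string) (string, bool) {
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "search" {
			continue
		}
		for _, entry := range fields[1:] {
			if strings.HasPrefix(entry, "svc.") {
				domain := strings.TrimSuffix(strings.TrimPrefix(entry, "svc."), ".")
				return domain, true
			}
		}
	}
	return "", false
}

func main() {
	// Made-up sample of what a pod's resolv.conf might look like.
	sample := "nameserver 10.100.0.10\n" +
		"search default.svc.cluster.local svc.cluster.local cluster.local\n" +
		"options ndots:5\n"
	domain, ok := clusterDomainFromResolvConf(sample)
	fmt.Println(domain, ok)
}
```

The sketch would misread the list for a pod running in a namespace literally named "svc", and it does not cover Windows pods, which have no /etc/resolv.conf, so a configurable fallback such as "cluster.local" is probably still needed.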

@salluvada
Author

@tomaszsek Great, thank you for your input. I will take a shot at it.

@salluvada
Author

salluvada commented Mar 22, 2020

@tomaszsek I think that is the wrong place to change the code. When the operator creates the service, its name should not contain the FQDN. But when the jnlp container is created while spinning up a pod to run a build, the service environment variables there should use FQDNs.

2020-03-22T22:35:25.709Z        WARN    controller-jenkins      jenkins/jenkins_controller.go:152       Reconcile loop failed 10 times with the same error, giving up: Service "jenkins-operator-http-example.cluster.local" is invalid: metadata.name: Invalid value: "jenkins-operator-http-example.cluster.local": a DNS-1035 label must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name',  or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')       {"cr": "example"}

That is configured in the Jenkins Kubernetes plugin. Is the operator configuring the plugin with these values? If so, the code there needs to be fixed.
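The validation error in the log above can be reproduced locally. The sketch below applies the regex quoted in that error message, anchored to the whole string, together with the 63-character limit Kubernetes applies to DNS-1035 labels. isDNS1035Label is a hypothetical helper written for illustration, not a function from the operator or from Kubernetes.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1035Label is the regex quoted in the API server error, anchored so that
// it must match the entire name.
var dns1035Label = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// isDNS1035Label reports whether name is usable as a Service metadata.name.
// Hypothetical helper for illustration.
func isDNS1035Label(name string) bool {
	return len(name) <= 63 && dns1035Label.MatchString(name)
}

func main() {
	names := []string{
		"jenkins-operator-http-example",
		"jenkins-operator-http-example.cluster.local",
	}
	for _, name := range names {
		fmt.Printf("%-45s valid=%v\n", name, isDNS1035Label(name))
	}
}
```

The second name fails because a label cannot contain dots, so the domain suffix has to be added where the URL is built rather than in the Service name.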


@salluvada
Author

The changes should be made in the file below:

fmt.Sprintf("http://%s.%s:%d", GetJenkinsHTTPServiceName(jenkins), jenkins.ObjectMeta.Namespace, jenkins.Spec.Service.Port),

I am going to add two new methods to service.go, GetJenkinsHTTPServiceFQDN and GetJenkinsSlavesServiceFQDN, and use them in base_configuration_configmap.go.
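A rough sketch of what such helpers could produce is shown below. It is self-contained and does not use the operator's types: jenkinsRef, httpServiceName, slaveServiceName, and serviceFQDN are hypothetical stand-ins, the service name prefixes are inferred from the names seen earlier in this thread, and the cluster domain is passed in as a parameter rather than discovered.

```go
package main

import "fmt"

// jenkinsRef is a hypothetical stand-in for the fields of the Jenkins custom
// resource that matter here; the real code would use *v1alpha2.Jenkins.
type jenkinsRef struct {
	Name      string
	Namespace string
}

// Prefixes inferred from the service names seen in this thread (assumption).
func httpServiceName(j jenkinsRef) string  { return "jenkins-operator-http-" + j.Name }
func slaveServiceName(j jenkinsRef) string { return "jenkins-operator-slave-" + j.Name }

// serviceFQDN builds <service>.<namespace>.svc.<cluster-domain>.
func serviceFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

func main() {
	j := jenkinsRef{Name: "example", Namespace: "default"}
	clusterDomain := "cluster.local" // assumed default; may differ per cluster

	jenkinsURL := fmt.Sprintf("http://%s:%d", serviceFQDN(httpServiceName(j), j.Namespace, clusterDomain), 8080)
	jenkinsTunnel := fmt.Sprintf("%s:%d", serviceFQDN(slaveServiceName(j), j.Namespace, clusterDomain), 50000)

	fmt.Println(jenkinsURL)
	fmt.Println(jenkinsTunnel)
}
```

With the CR named "example" in the "default" namespace, this yields the same JENKINS_URL and JENKINS_TUNNEL values used in the pod template workaround, while the Service objects keep their short, valid names.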
