Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debugging stalled service #326

Merged
merged 1 commit into from
Jan 22, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
203 changes: 203 additions & 0 deletions notes/zrq/20210115-01-service-debug.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,203 @@
#
# <meta:header>
# <meta:licence>
# Copyright (c) 2021, ROE (http://www.roe.ac.uk/)
#
# This information is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This information is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
# </meta:licence>
# </meta:header>
#
#zrq-notes-time
#zrq-notes-indent
#zrq-notes-crypto
#zrq-notes-ansible
#zrq-notes-osformat
#zrq-notes-zeppelin
#

Target:

Service unresponsive, debugging from outside.

email report from Nigel

> Subject: zeppelin.aglais.uk unstable?
>
> Trying to do some linux-shell level stuff on the above but keep getting chucked off with:
>
> packet_write_wait: Connection to 128.232.227.165: Broken pipe
>
> Is something wrong somewhere?



Result:

Work in progress ....


# -----------------------------------------------------
# Check if we can access the live Zeppelin via HTTP.
#[user@desktop]

curl --head --verbose 'http://zeppelin.aglais.uk/'

> * Trying 128.232.227.165:80...
> * connect to 128.232.227.165 port 80 failed: Connection timed out
> * Failed to connect to zeppelin.aglais.uk port 80: Connection timed out
> * Closing connection 0
> curl: (28) Failed to connect to zeppelin.aglais.uk port 80: Connection timed out


# -----------------------------------------------------
# Check if we can access a dev Zeppelin via HTTP.
#[user@desktop]

curl --head --insecure 'https://zeppelin.metagrid.xyz/'

> HTTP/2 200
> date: Fri, 15 Jan 2021 14:48:12 GMT
> content-type: text/html
> content-length: 4660
> ....
> ....


# -----------------------------------------------------
# Check if we can access the gateway VM via ssh.
#[user@desktop]

ssh -v \
fedora@128.232.227.134 \
'
hostname
date
'

> OpenSSH_8.3p1, OpenSSL 1.1.1i FIPS 8 Dec 2020
> ....
> ....
> debug1: Connecting to 128.232.227.134 [128.232.227.134] port 22.
> debug1: Connection established.
> ....
> ....
> debug1: Authenticating to 128.232.227.134:22 as 'dmr'
> ....
> ....
> debug1: Server host key: ecdsa-sha2-nistp256 SHA256:AM3s0UqRFdjRrbnNu5F+dj8Y1Kh/Am39uvpwtvpeVh8
> ....
> ....
> debug1: Next authentication method: publickey
> ....
> debug1: Offering public key: Cambridge HPC OpenStack RSA SHA256:n5J+DL1a4Ly6YPxUGo+f68Gcuhy8IPepFe6vPcX6Q7o agent
> ....
> ....
> Received disconnect from 128.232.227.134 port 22:2: Too many authentication failures
> Disconnected from 128.232.227.134 port 22

#
# Service is running OK, I just can't login.
# This suggests the VM is healthy, sshd is running and responding to requests.
#



# -----------------------------------------------------
# Check if we can access the zeppelin VM via ssh.
#[user@desktop]

ssh -v \
fedora@128.232.227.165 \
'
hostname
date
'

> OpenSSH_8.3p1, OpenSSL 1.1.1i FIPS 8 Dec 2020
> ....
> ....
> debug1: Connecting to 128.232.227.165 [128.232.227.165] port 22.
> debug1: Connection established.
> ....
> ....
> debug1: Authenticating to 128.232.227.165:22 as 'dmr'
> ....
> ....
> debug1: Server host key: ecdsa-sha2-nistp256 SHA256:KD/NaO4F8TbSX2eB9njVbyu83rAFyEanCHgs5By46UE
> ....
> ....
> debug1: Next authentication method: publickey
> ....
> ....
> debug1: Offering public key: Cambridge HPC OpenStack RSA SHA256:n5J+DL1a4Ly6YPxUGo+f68Gcuhy8IPepFe6vPcX6Q7o agent
> ....
> ....
> Received disconnect from 128.232.227.165 port 22:2: Too many authentication failures
> Disconnected from 128.232.227.165 port 22

#
# Service is running OK, I just can't login.
# This suggests the VM is healthy, sshd is running and responding to requests.
#


# -----------------------------------------------------
# Checking the Horizon console interface.
#[user@desktop]

Gateway machine:

Login prompt is working OK.
I can't login using name/pass, but the prompt is working OK.

Zeppelin machine:

Login prompt is working OK.
I can't login using name/pass, but the prompt is working OK.



# -----------------------------------------------------
# Check if we can access the live Zeppelin via HTTP.
#[user@desktop]

curl --head --verbose 'http://zeppelin.aglais.uk/'

> * Trying 128.232.227.165:80...
> * connect to 128.232.227.165 port 80 failed: Connection timed out
> * Failed to connect to zeppelin.aglais.uk port 80: Connection timed out
> * Closing connection 0
> curl: (28) Failed to connect to zeppelin.aglais.uk port 80: Connection timed out

curl --head --verbose --insecure 'https://zeppelin.aglais.uk/'



#
# Looks like the VM is fine but the web server is down.
# Specifically the front facing HTTP proxy.
# If it was the Zeppelin service behind the proxy then the proxy would respond with a 503 error.
#

#
# Without access to the cluster via the gateway not sure I can do more here.
# Need to have the project ssh key working so that we can share access to these machines.
#