Upgrading k8s yb cluster fails when bind address is private ip of pod #459

Closed
rkarthik007 opened this issue Sep 3, 2018 · 3 comments

rkarthik007 commented Sep 3, 2018

Upon upgrading the YB Kubernetes cluster, the IP addresses of the various pods change, which causes the new nodes to not recognize the older ones. In a trial upgrade run, I see the following log spew from the yb-master:

I0903 22:38:30.605554   138 raft_consensus.cc:473] T 00000000000000000000000000000000 P 5fc6489268d64228b2e9fec6b53b399c [term 51 FOLLOWER]: Starting election with config: opid_index: -1 peers { permanent_uuid: "5fc6489268d64228b2e9fec6b53b399c" member_type: VOTER last_known_private_addr { host: "10.4.1.9" port: 7100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "e252b57a7f3541359a6d009b6b1ac2f6" member_type: VOTER last_known_private_addr { host: "10.4.0.9" port: 7100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } } peers { permanent_uuid: "3bd1164406294a8d988961435f3d03c6" member_type: VOTER last_known_private_addr { host: "10.4.2.8" port: 7100 } cloud_info { placement_cloud: "cloud1" placement_region: "datacenter1" placement_zone: "rack1" } }

Note that the peer IP addresses here are 10.4.1.9, 10.4.0.9 and 10.4.2.8, which are the original IP addresses.

But after the upgrade, the IP addresses are 10.4.1.11, 10.4.0.11 and 10.4.2.11:

yb-demo       yb-master-0                                          1/1       Running   0          19m       10.4.1.11    gke-yugabyte-default-pool-a45c66d3-h3ts
yb-demo       yb-master-1                                          1/1       Running   0          19m       10.4.0.11    gke-yugabyte-default-pool-a45c66d3-sctc
yb-demo       yb-master-2                                          1/1       Running   0          20m       10.4.2.11    gke-yugabyte-default-pool-a45c66d3-n92l
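
(For reference, a pod listing like the one above can be produced with:

kubectl get pods --all-namespaces -o wide
)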

The tservers are also not able to locate the masters:

W0903 22:52:48.196835    28 heartbeater.cc:542] Failed to heartbeat to 10.4.1.11:7100: Service unavailable (yb/tserver/heartbeater.cc:455): master is no longer the leader tries=935, num=3, masters=0x00000000013f4910 -> [[10.4.1.11:7100], [10.4.2.11:7100], [10.4.0.11:7100]], code=Service unavailable

This diff changed the start command of yb-master to bind on POD_IP (the private IP of the pod) instead of using the pod name. This was done because the master tries to infer whether it is present in the master addresses, but it breaks the Kubernetes functionality. Could we back out that change, or use another way to detect this (like using broadcast addresses or resolving the hostname)? cc @bbaddepudi @kmuthukk
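
For illustration only (the port and the per-pod DNS name below are hypothetical, not taken from the actual chart), the two binding styles being discussed look roughly like:

# current behavior after the diff: bind the RPC endpoint to the pod's private IP
yb-master --rpc_bind_addresses=${POD_IP}:7100 ...

# alternative: bind/advertise via a stable per-pod DNS name instead
yb-master --rpc_bind_addresses=yb-master-0.yb-masters.yb-demo.svc.cluster.local:7100 ...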

rkarthik007 added the kind/bug label on Sep 3, 2018
@bbaddepudi commented:

The parameter that controls use of the public IP mentions:
DEFINE_string(use_private_ip, "never",
              "When to use private IP for connection. "
              "cloud - would use private IP if destination node is located in the same cloud. "
              "region - would use private IP if destination node is located in the same cloud and region. "
              "zone - would use private IP if destination node is located in the same cloud, region and zone. "
              "never - would never use private IP if broadcast address is specified.");

However, the code in this API does not check for the 'never' mode:
bool UsePublicIp(const CloudInfoPB& connect_to, const CloudInfoPB& connect_from)

Fixed by adding that check and ensuring that a new master with the old hostname and a new IP comes up and joins the quorum as expected.
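
For reference, a minimal sketch of the kind of check being described (illustrative only, not the actual diff; it assumes the FLAGS_use_private_ip string flag and the CloudInfoPB placement fields shown in the log above):

bool UsePublicIp(const CloudInfoPB& connect_to, const CloudInfoPB& connect_from) {
  const auto& mode = FLAGS_use_private_ip;
  if (mode == "never") {
    // The missing case: with "never", always prefer the public/broadcast address.
    return true;
  }
  if (mode == "cloud") {
    return connect_to.placement_cloud() != connect_from.placement_cloud();
  }
  if (mode == "region") {
    return connect_to.placement_cloud() != connect_from.placement_cloud() ||
           connect_to.placement_region() != connect_from.placement_region();
  }
  // mode == "zone": use the private IP only within the same cloud, region and zone.
  return connect_to.placement_cloud() != connect_from.placement_cloud() ||
         connect_to.placement_region() != connect_from.placement_region() ||
         connect_to.placement_zone() != connect_from.placement_zone();
}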

yugabyte-ci pushed a commit that referenced this issue Sep 5, 2018
…of pod.

Summary:
We wanted to use the public/broadcast address or hostname even if the private IP changes (in a Kubernetes setup). The API for this, `UsePublicIp`, was missing a check to return true when `use_private_ip` is set to `never`.

Test Plan:
Tested via the following steps:
- Added 127.0.0.1 as node1 in /etc/hosts, and similarly 127.0.0.2 as node2 and 127.0.0.3 as node3.
- Then started masters of a local RF=3 cluster (on Mac) using: ~/code/yugabyte/build/latest/bin/yb-master --webserver_interface 127.0.0.1 --rpc_bind_addresses=127.0.0.1 --server_broadcast_addresses=node1:7100 --master_addresses node1:7100,node2:7100,node3:7100 --fs_data_dirs "/tmp/yblocal1/" >& /tmp/yb-master_1.out &
- Killed 3rd master.
- Remapped node3 in /etc/hosts to 127.0.0.4.
- Restarted yb-master with 127.0.0.4 for rpc/web addresses, but with node3 in broadcast.
Ensured the new node became a follower in the quorum.

Repeated the same with yb-tserver, but ran into an issue - tracked in #461.

Reviewers: bogdan, sergei, karthik

Reviewed By: karthik

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5428
@rkarthik007 (author) commented:

@bbaddepudi - could you please post the command-line params for yb-master and yb-tserver to get this working?

@bbaddepudi commented:

Sample commands used to start yb-master and yb-tserver:
./bin/yb-master --webserver_interface 127.0.0.3 --rpc_bind_addresses=127.0.0.3 --server_broadcast_addresses=node3:7100 --master_addresses node1:7100,node2:7100,node3:7100 --fs_data_dirs "/tmp/yblocal3/" >& /tmp/yb-master_3.out &

./bin/yb-tserver --webserver_interface 127.0.0.3 --rpc_bind_addresses=127.0.0.3 --server_broadcast_addresses=node3:9100 --tserver_master_addrs node1:7100,node2:7100,node3:7100 --fs_data_dirs "/tmp/yblocal3/" >& /tmp/yb-tserver_3.out &

After the /etc/hosts entry was remapped, 127.0.0.3 was changed to 127.0.0.4 in the above commands before restarting each of them.
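
For illustration, the corresponding /etc/hosts mappings in this test would look roughly like this (a sketch, assuming the node1/node2/node3 names used above):

# before the simulated IP change
127.0.0.1   node1
127.0.0.2   node2
127.0.0.3   node3

# after remapping node3 to a new IP
127.0.0.1   node1
127.0.0.2   node2
127.0.0.4   node3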
