
DNS based cross Gslb communication #30

Merged
ytsarev merged 4 commits into master from grab_external_records on Jan 21, 2020

Conversation

@ytsarev (Member) commented Jan 20, 2020

* We need to exchange information between multiple Gslb
  instances which are deployed to different clusters.
* Instead of exposing k8s or any other form of API, we
  can rely on DNS itself.
* We expose only the working IP addresses for a specific Gslb
  as an A record for the service `hostsz.$gslb.Name.$dnsZone`,
  which is created automatically by the operator.
* The data we expose is totally non-sensitive, so we simplify
  configuration: no service account tokens / TLS certificates
  or similar are required for Gslb information exchange.
* External Gslb enabled clusters are specified as a configuration
  environment variable in the operator deployment and abstracted
  as the `ohmyglb.extGslbClusters` value in the operator helm chart
  (see the sketch below).
* First and naive implementation of the `roundRobin` Gslb strategy.
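For illustration, a minimal sketch of how the operator could read that configuration; the environment variable name `EXT_GSLB_CLUSTERS` and the comma-separated format are assumptions made for this example, not taken from this PR:

```go
package gslb

import (
	"os"
	"strings"
)

// extGslbClusters returns the DNS endpoints of external Gslb enabled
// clusters, read from an environment variable set on the operator
// deployment (which the helm chart would render from the
// ohmyglb.extGslbClusters value). The variable name and the
// comma-separated format are illustrative assumptions.
func extGslbClusters() []string {
	raw := os.Getenv("EXT_GSLB_CLUSTERS")
	if raw == "" {
		return nil
	}
	var clusters []string
	for _, c := range strings.Split(raw, ",") {
		if c = strings.TrimSpace(c); c != "" {
			clusters = append(clusters, c)
		}
	}
	return clusters
}
```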

Example of this code working on a local cluster:

```
$ k -n test-gslb get dnsendpoints.externaldns.k8s.io -o yaml
...
  spec:
    endpoints:
    - dnsName: hostsz.test-gslb.example.com
      recordTTL: 30
      recordType: A
      targets:
      - 172.17.0.2
    - dnsName: app3.cloud.example.com
      recordTTL: 30
      recordType: A
      targets:
      - 172.17.0.2
      - 172.17.0.2
...
```

Here we observe the populated service `hostsz` entry and also the
extended target list for `app3.cloud.example.com` built by the `roundRobin`
strategy (the IPs are duplicates given the local testing scenario).
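For illustration only, a minimal Go sketch of the consuming side of this exchange, assuming the external cluster's DNS endpoint is reachable on plain 53/udp; the server address, record name and function names are placeholders rather than the actual operator code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveExternalTargets asks an external Gslb cluster's DNS endpoint
// (plain 53/udp, no tokens or TLS needed since the data is non-sensitive)
// for the A record of the automatically created service hostname.
func resolveExternalTargets(server, host string) ([]string, error) {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Always dial the external cluster's DNS endpoint instead of
			// the local resolver.
			return d.DialContext(ctx, "udp", server)
		},
	}
	return r.LookupHost(context.Background(), host)
}

func main() {
	// Placeholder external DNS endpoint and record name.
	external, err := resolveExternalTargets("172.17.0.3:53", "hostsz.test-gslb.example.com")
	if err != nil {
		fmt.Println("external cluster unreachable, keeping local targets only:", err)
	}
	local := []string{"172.17.0.2"}
	// Naive roundRobin strategy: publish the union of local and external
	// targets in one A record and let DNS clients rotate among them.
	fmt.Println(append(local, external...))
}
```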

@ytsarev force-pushed the grab_external_records branch 4 times, most recently from 408ffbb to ca2f5f5 on January 20, 2020 17:07

@donovanmuller (Contributor) commented Jan 20, 2020

@ytsarev Given two clusters (A and B) with external Gslb enabled on both, and cluster A suffers network issues, what cleans up the records for the Gslb on cluster A so that no traffic is sent there?

I.e. if the Gslb controller cannot clean up due to catastrophic failure or network issues, how does garbage collection work so as to prevent stale records for the affected cluster (which, for argument's sake, cannot accept any ingress traffic)?

@ytsarev (Member, Author) commented Jan 20, 2020

@donovanmuller
Clusters perform a cross-check of each other.
Records are updated during each Gslb reconciliation.
If cluster B can't get anything from 53/udp of cluster A, then it contains only its own (B) targets in the A record.
Similarly, if cluster A is suffering from a network partition, it will 'think' that B is dead and will contain only its own records until the network connection recovers.

So as further implementation steps we need to think about:

  1. Enabling periodic reconciliation (currently it is based purely on reacting to in-cluster Events), or figuring out how to track external Event(s); see the requeue sketch below.
  2. Return only healthy targets for healthy services.

Speaking of 2), I think we can remove the special `hostsz` entry and append the extended targets directly to the matching ingress host FQDN on the external Gslb.
So the `app3.cloud.example.com` records on cluster A will be amended with the `app3.cloud.example.com` records of cluster B and vice versa. With this approach the backend service healthcheck is embedded.
What do you think?
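For point 1, a hedged sketch of how periodic reconciliation could be added via controller-runtime's requeue mechanism; the reconciler type name and the 30-second interval are illustrative assumptions, not taken from this PR:

```go
package gslb

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ReconcileGslb is a stand-in for the operator's Gslb reconciler type.
type ReconcileGslb struct{}

// Reconcile still reacts to in-cluster Events, but additionally asks
// controller-runtime to requeue the request after a fixed interval so
// that changes on the external cluster (which produce no local Events)
// are eventually picked up.
func (r *ReconcileGslb) Reconcile(request reconcile.Request) (reconcile.Result, error) {
	// ... regular reconciliation of the Gslb and its DNSEndpoint would run here ...

	// Illustrative interval; RequeueAfter schedules the next reconciliation.
	return reconcile.Result{RequeueAfter: 30 * time.Second}, nil
}
```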

* Register per Gslb Ingress host `localtargets.*` records instead of the global `hostsz`
* The `localtargets.*` A record is populated only if the backend service is
  healthy
* Make Gslb return healthy records of the *external* Gslb even if the associated
  service in its own cluster is `Unhealthy/NotFound` (see the sketch below)
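A hedged sketch of the target selection described by this commit (function and parameter names are illustrative only): local targets appear only while the backend service is healthy, while healthy external targets are always kept, so the other cluster keeps serving traffic even when the local service is `Unhealthy/NotFound`.

```go
package gslb

// buildTargets assembles the A record targets for one Gslb host,
// e.g. app3.cloud.example.com.
//   localHealthy    - IPs of the local cluster's healthy backends
//                     (empty when the service is Unhealthy/NotFound)
//   externalHealthy - IPs resolved from the external cluster's
//                     localtargets.<host> record
func buildTargets(localHealthy, externalHealthy []string) []string {
	// Start with the local healthy targets (none if the local service is down).
	targets := append([]string{}, localHealthy...)
	// Always append the external cluster's healthy targets so traffic can
	// still be served from the other cluster.
	return append(targets, externalHealthy...)
}
```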
@ytsarev (Member, Author) commented Jan 20, 2020

@donovanmuller I've implemented 2) in 2bcff9f, please check it out. Not yet sure if we need 1) (scheduled reconciliation).

@donovanmuller (Contributor) left a comment


👍

@donovanmuller (Contributor) commented

@ytsarev understood, I like the updated implementation 👍

@ytsarev merged commit 3e2d8c2 into master on Jan 21, 2020
@ytsarev deleted the grab_external_records branch on January 21, 2020 09:07