dropping store, external labels are not unique while using AWS NLB #1356

pablokbs · 2019-07-25T21:18:17Z

Hello, I'm having some issues using an AWS NLB (Network Load Balancer). It seems like the IP addresses of the NLB nodes are confusing the thanos-query service:

thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.520410254Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.18.76:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2
thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.52043309Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.17.176:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2

Below is a diagram of my scenario:

                                                       +-------------------+                                  
                                                       |                   |                                  
                                         +------------ |                   |                                  
                                         |             |      Grafana      |                                  
                                         |             |                   |                                  
        srv+ ?                 +---------|--------+    |                   |                                  
                               |                  |    +-------------------+                                  
   10 10901 aws nlb1           |   Thanos Proxy   |                                                           
   10 10901 aws nlb2           |   (thanos query) |                                                           
   10 10901 aws nlb3           |                  |                                                           
                               +------------------+                                                           
                                         |                                                                    
              +----------------------- ------------------------------+                                        
              |                          |                           |                                        
      +-------|-------+          +-------+-------+           +-------|-------+                                
      |    AWS NLB 1  |          |    AWS NLB 2  |           |    AWS NLB 3  |                                
      +---------------+          +---------------+           +---------------+                                
                                                                                                              
 +----------+ +----------+   +----------+ +----------+   +----------+ +----------+                            
 |          | |          |   |          | |          |   |          | |          |                            
 |Query     | |Query     |   |Query     | |Query     |   |Query     | |Query     |                            
 |(nodeport)| |(nodeport)|   |(nodeport)| |(nodeport)|   |(nodeport)| |(nodeport)|                            
 |          | |          |   |          | |          |   |          | |          |                            
 +----------+ +----------+   +----------+ +----------+   +----------+ +----------+                            
                                                                                                              
  label: cluster= cluster1   label: cluster= cluster2    label: cluster= cluster3

The problem, I believe, is that thanos-query uses the IP address (which belongs to the AWS NLB nodes, not to my Kubernetes workers) to decide whether the labels are unique, comparing IP addresses against labels.

If I modify the SRV records to point to the nodes directly (using a single thanos-query pod per cluster), everything works fine, but that means I can't move that pod to another node.

Is there a way to make thanos-query not use the IP address for the comparison?

Thanks!

PS:

Thanos v0.6.0
Prometheus v2.7.2


GiedriusS · 2019-07-26T13:21:53Z

Hi, it doesn't use the IP address to compare the connected nodes, only the external labels. I believe the problem here is that at the leaf nodes you have the same external labels and they get marked as duplicates. It's hard to tell from your diagram what is happening but it seems like through the DNS discovery both of the leaf nodes get picked up. Is this what's happening?
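To illustrate the check described above, here is a minimal Python sketch. It is not the actual `storeset.go` code. It assumes, based on the log lines in this thread (both addresses are dropped with `duplicates=2`), that every store sharing an external label set with another store is dropped. The names `Store`, `label_key` and `drop_duplicate_label_sets` are hypothetical, as are the third address and its `use1-prev-2` cluster label.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """A discovered store endpoint (hypothetical type, for illustration only)."""
    address: str
    ext_labels: tuple  # sorted (name, value) pairs, so the value is hashable


def label_key(labels: dict) -> tuple:
    """Turn a label dict into a hashable, order-independent key."""
    return tuple(sorted(labels.items()))


def drop_duplicate_label_sets(stores):
    """Split stores into (kept, dropped) by uniqueness of their label set.

    The address plays no part in the comparison; only the labels do.
    """
    counts = Counter(s.ext_labels for s in stores)
    kept = [s for s in stores if counts[s.ext_labels] == 1]
    dropped = [s for s in stores if counts[s.ext_labels] > 1]
    return kept, dropped


labels = {
    "cluster": "use1-prev-1",
    "prometheus": "monitoring/k8s",
    "prometheus_replica": "prometheus-k8s-0",
}
stores = [
    # The two addresses from the log lines in this issue.
    Store("10.10.18.76:9090", label_key(labels)),
    Store("10.10.17.176:9090", label_key(labels)),
    # Hypothetical third store with a different cluster label.
    Store("10.10.20.5:9090", label_key({**labels, "cluster": "use1-prev-2"})),
]
kept, dropped = drop_duplicate_label_sets(stores)
for s in dropped:
    print("dropping store, external labels are not unique:", s.address)
```

In this sketch the two addresses that announce identical labels are both dropped, regardless of whether they are two real stores or one store seen through two IPs.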

pablokbs · 2019-07-26T13:41:40Z

I believe this answers your question:

thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.520410254Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.18.76:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2
thanos-query-9dcfd5cdb-gtzvf thanos level=warn ts=2019-07-25T17:57:25.52043309Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=10.10.17.176:9090 extLset="{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}" duplicates=2

That is the exact same pod replying from 2 different IP addresses (as the NLB has its own backend nodes).
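One way to check this from the querier logs is to group the dropped addresses by their `extLset`. The following is a minimal sketch that assumes the log format shown above; `group_dropped_stores` is a hypothetical helper name, and the sample lines are shortened versions of the two warnings (pod name and timestamp omitted).

```python
import re
from collections import defaultdict

# Captures the address and the quoted extLset field of the warning.
# The extLset value contains backslash-escaped quotes, so \" is allowed inside.
LINE_RE = re.compile(
    r'msg="dropping store, external labels are not unique"\s+'
    r'address=(?P<address>\S+)\s+'
    r'extLset="(?P<ext>(?:\\.|[^"\\])*)"'
)


def group_dropped_stores(lines):
    """Map each external label set to the list of addresses that announced it."""
    groups = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            ext = match.group("ext").replace('\\"', '"')  # unescape quotes
            groups[ext].append(match.group("address"))
    return dict(groups)


# Label set exactly as it appears (escaped) in the log lines.
EXT = r'{cluster=\"use1-prev-1\",prometheus=\"monitoring/k8s\",prometheus_replica=\"prometheus-k8s-0\"}'
TEMPLATE = (
    'level=warn caller=storeset.go:252 component=storeset '
    'msg="dropping store, external labels are not unique" '
    'address={addr} extLset="{ext}" duplicates=2'
)
sample = [
    TEMPLATE.format(addr=addr, ext=EXT)
    for addr in ("10.10.18.76:9090", "10.10.17.176:9090")
]

groups = group_dropped_stores(sample)
for ext, addresses in groups.items():
    print(ext, "->", addresses)
```

If a single label set maps to several addresses that all belong to the load balancer, the "duplicates" are most likely one store reached through different NLB nodes.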

Edit: If I remove the NLB and point to the node IP directly, everything works fine

pablokbs · 2019-07-29T19:22:32Z

I've also found something else: it seems like one of the thanos-query pods (the ones running on each cluster) is missing the "Announced LabelSets" labels, which is weird, as they are all generated with the exact same Terraform code.

EDIT: Never mind, those pods had some DNS issues resolving their Kubernetes services.

pablokbs · 2019-07-29T21:47:31Z

So, going back to the issue with having 2 pods exposing the labels through their NLB IP address:

I've tried enabling Proxy Protocol v2 on the NLB to expose the IP address of the Kubernetes nodes instead of the IP address of the NLB nodes, but it seems like it breaks gRPC:

thanos-query-787b59b6c7-lzlrd thanos level=warn ts=2019-07-29T21:45:40.165622344Z caller=storeset.go:322 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = Unavailable desc = transport is closing" address=10.80.8.83:9090
thanos-query-787b59b6c7-5b6vs thanos level=warn ts=2019-07-29T21:45:44.204853593Z caller=storeset.go:322 component=storeset msg="update of store node failed" err="initial store client info fetch: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.80.8.83:9090

pablokbs · 2019-07-30T15:49:53Z

I think this is related to #1338 ... I'll try to use a newer version of thanos with this fix

pablokbs · 2019-07-30T17:27:31Z

Using the latest master fixes the issue, but I'm concerned about the fix: it seems to just choose one of the IP addresses and keep using it forever (or until it becomes unhealthy), which can cause issues because the traffic will always go to a single NLB node. The solution is not ideal.
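A rough Python sketch of the behaviour described above follows. It only illustrates the description in this comment (one address per label set, kept until it becomes unhealthy) and is not the actual Thanos implementation. `StickyStorePicker` and its `update` method are hypothetical names, and the health information is passed in by hand.

```python
class StickyStorePicker:
    """Keeps one address per external label set until it is reported unhealthy."""

    def __init__(self):
        self._chosen = {}  # label set key -> address currently in use

    def update(self, discovered, healthy):
        """Refresh the selection.

        discovered: list of (address, label_key) pairs, in discovery order.
        healthy: set of addresses currently considered healthy.
        """
        # Forget earlier choices that are no longer healthy.
        self._chosen = {k: a for k, a in self._chosen.items() if a in healthy}
        # Pick the first healthy address for label sets without a choice.
        for address, key in discovered:
            if address in healthy and key not in self._chosen:
                self._chosen[key] = address
        return dict(self._chosen)


picker = StickyStorePicker()
key = 'cluster="use1-prev-1"'
discovered = [("10.10.18.76:9090", key), ("10.10.17.176:9090", key)]
both = {"10.10.18.76:9090", "10.10.17.176:9090"}

first = picker.update(discovered, healthy=both)
second = picker.update(discovered, healthy=both)
third = picker.update(discovered, healthy={"10.10.17.176:9090"})
print(first, second, third)
```

As long as the first NLB address stays healthy, every refresh returns it again, so all traffic goes through that one NLB node; the second address is only used after the first one fails.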

GiedriusS · 2019-07-30T18:22:36Z

Yes and that's why you need a load balancer like Envoy/Nginx in front of those two nodes with identical labels. Thanos here is doing the correct thing and protecting you from having needless 2x load. What do you think Thanos should do in such cases?

pablokbs · 2019-07-30T18:39:38Z

These are not two nodes; it's only one Kubernetes node behind 2 NLB nodes. (AWS creates a load balancer and adds its own nodes to it; each of them has its own IP, and those IPs are the ones returned when you resolve myawsnlb.amazon.com.)

Then my Kubernetes node (the one that runs the thanos-query pod) sits behind that NLB, but the IP addresses exposed to the thanos-query proxy are those of the NLB nodes. So thanos-query can receive the same traffic from 1 pod with 2 different IP addresses.

I use a load balancer so that I can register a group of nodes that may run the thanos-query pods; I can't maintain a list of nodes manually if the pods move around between nodes.

If I add an Envoy/Nginx load balancer behind an NLB, I'll have the same problem: the thanos-query proxy node will see these Nginx/Envoy nodes through the NLB IP addresses, of which there is usually more than 1.

stale · 2020-01-11T06:42:32Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

GiedriusS added the question label Jul 26, 2019

stale bot added the stale label Jan 11, 2020

stale bot closed this as completed Jan 18, 2020
