Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: AWS X-Ray tracing: remote subsegments should name the remote service, not the local service #354

Open
GrahamLea opened this issue Jun 9, 2021 · 10 comments
Assignees
Labels
Bug Something isn't working

Comments

@GrahamLea
Copy link

Summary
App Mesh Proxies can log segment documents to AWS X-Ray that include subsegments that record calls out to other services. In these subsegments, the name field is the same as the name used for the proxy's own segment, but it should instead name the downstream service being contacted. From the AWS docs on the Segment document format:

For downstream calls, name the subsegment after the resource or service called.

While this is a somewhat minor issue, it affects tooling which makes use of data from X-Ray and expects it to conform to the documented format.

Steps to Reproduce
Set up App Mesh proxy tracing to AWS X-Ray. Look at the raw data of a trace and inspect the names of remote subsegments.

Are you currently working around this issue?
The information that should be in the name property can often be gleaned from the http.request.url property, but this is not straightforward.

@GrahamLea GrahamLea added the Bug Something isn't working label Jun 9, 2021
@kiranmeduri
Copy link

kiranmeduri commented Aug 20, 2021

For some reason when I am testing I am not seeing subsegments from App Mesh Proxy. I see two independent segments for ingress and egress. @GrahamLea can you confirm if this is what you are seeing. Below is an example trace using https://github.com/kiranmeduri/aws-app-mesh-examples/tree/howto-ecs-basics

"Id": "1-611ef56c-3e5b732747131c467377229b",
    "Duration": 0.004,
    "LimitExceeded": false,
    "Segments": [
        {
            "Id": "643a5eba72e10b45",
            "Document": {
                "id": "643a5eba72e10b45",
                "name": "color",
                "start_time": 1629418860.6141655,
                "trace_id": "1-611ef56c-3e5b732747131c467377229b",
                "end_time": 1629418860.6142209,
                "parent_id": "e231bf85c88a176c",
                "http": {
                    "request": {
                        "url": "/",
                        "method": "GET",
                        "user_agent": "Go-http-client/1.1",
                        "client_ip": "127.0.0.1:50750"
                    },
                    "response": {
                        "status": 200
                    }
                }
            }
        },
        {
            "Id": "e231bf85c88a176c",
            "Document": {
                "id": "e231bf85c88a176c",
                "name": "howto-ecs-basics/howto-ecs-basics-color-node",
                "start_time": 1629418860.613789,
                "trace_id": "1-611ef56c-3e5b732747131c467377229b",
                "end_time": 1629418860.6146402,
                "parent_id": "a435624f881b0676",
                "http": {
                    "request": {
                        "url": "http://color.howto-ecs-basics.mesh.local:8080/",
                        "method": "GET",
                        "user_agent": "Go-http-client/1.1",
                        "client_ip": "10.0.70.160",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 5
                    }
                },
                "aws": {
                    "app_mesh": {
                        "mesh_name": "howto-ecs-basics",
                        "virtual_node_name": "howto-ecs-basics-color-node"
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "cds_ingress_howto-ecs-basics_howto-ecs-basics-color-node_http_8080",
                    "request_size": "0",
                    "downstream_cluster": "-",
                    "node_id": "mesh/howto-ecs-basics/virtualNode/howto-ecs-basics-color-node"
                },
                "origin": "AWS::AppMesh::Proxy"
            }
        },
        {
            "Id": "a435624f881b0676",
            "Document": {
                "id": "a435624f881b0676",
                "name": "howto-ecs-basics/howto-ecs-basics-front-node",
                "start_time": 1629418860.613093,
                "trace_id": "1-611ef56c-3e5b732747131c467377229b",
                "end_time": 1629418860.6155586,
                "parent_id": "f1db8af9fc1ba721",
                "http": {
                    "request": {
                        "url": "http://color.howto-ecs-basics.mesh.local:8080/",
                        "method": "GET",
                        "user_agent": "Go-http-client/1.1",
                        "client_ip": "10.0.70.160",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 5
                    }
                },
                "aws": {
                    "app_mesh": {
                        "mesh_name": "howto-ecs-basics",
                        "virtual_node_name": "howto-ecs-basics-front-node"
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "cds_egress_howto-ecs-basics_howto-ecs-basics-color-node_http_8080",
                    "downstream_cluster": "-",
                    "request_size": "0",
                    "node_id": "mesh/howto-ecs-basics/virtualNode/howto-ecs-basics-front-node"
                },
                "origin": "AWS::AppMesh::Proxy"
            }
        },
        {
            "Id": "b2ebdad055907d25",
            "Document": {
                "id": "b2ebdad055907d25",
                "name": "front",
                "start_time": 1629418860.6127875,
                "trace_id": "1-611ef56c-3e5b732747131c467377229b",
                "end_time": 1629418860.6159291,
                "parent_id": "20e2fa1ed7df8cbd",
                "http": {
                    "request": {
                        "url": "/color",
                        "method": "GET",
                        "user_agent": "curl/7.64.1",
                        "client_ip": "205.251.233.52",
                        "x_forwarded_for": true
                    },
                    "response": {
                        "status": 200
                    }
                },
                "subsegments": [
                    {
                        "id": "f1db8af9fc1ba721",
                        "name": "color.howto-ecs-basics.mesh.local:8080",
                        "start_time": 1629418860.6128187,
                        "end_time": 1629418860.6158834,
                        "http": {
                            "request": {
                                "url": "http://color.howto-ecs-basics.mesh.local:8080",
                                "method": "GET"
                            },
                            "response": {
                                "status": 200,
                                "content_length": 5
                            }
                        },
                        "namespace": "remote",
                        "subsegments": [
                            {
                                "id": "17c8b14277300637",
                                "name": "request",
                                "start_time": 1629418860.612922,
                                "end_time": 1629418860.6129484
                            },
                            {
                                "id": "bde5b81609785fc1",
                                "name": "response",
                                "start_time": 1629418860.6129577,
                                "end_time": 1629418860.6158335
                            }
                        ]
                    }
                ]
            }
        },
        {
            "Id": "20e2fa1ed7df8cbd",
            "Document": {
                "id": "20e2fa1ed7df8cbd",
                "name": "howto-ecs-basics/howto-ecs-basics-front-node",
                "start_time": 1629418860.6123674,
                "trace_id": "1-611ef56c-3e5b732747131c467377229b",
                "end_time": 1629418860.616197,
                "http": {
                    "request": {
                        "url": "http://appme-publi-2egemd7ho73v-1851548281.us-west-2.elb.amazonaws.com/color",
                        "method": "GET",
                        "user_agent": "curl/7.64.1",
                        "client_ip": "10.0.49.69",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 39
                    }
                },
                "aws": {
                    "app_mesh": {
                        "mesh_name": "howto-ecs-basics",
                        "virtual_node_name": "howto-ecs-basics-front-node"
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "cds_ingress_howto-ecs-basics_howto-ecs-basics-front-node_http_8080",
                    "request_size": "0",
                    "downstream_cluster": "-",
                    "node_id": "mesh/howto-ecs-basics/virtualNode/howto-ecs-basics-front-node"
                },
                "origin": "AWS::AppMesh::Proxy"
            }
        }
    ]
}

@kiranmeduri
Copy link

@abaptiste @lavignes can you confirm if this is right behavior? I think marking the segment as ingress or egress is useful. But not sure if subsegments make sense unless we have richer instrumentation in Envoy such as RDS/CDS/EDS processing.

@GrahamLea
Copy link
Author

GrahamLea commented Aug 20, 2021

Hey, thanks for picking this up, Kiran.
It's been a while, but I've had another look at the telemetry (coming from a client's system, not mine).
I can't find exactly what I reported originally. It's possible I was confused at the time, but I have a pretty clear recollection of seeing what I reported.
Anyway, what I've been able to find this afternoon is remote subsegments for the ingress segments, but not for the egress ones.
So here's an example of an ingress segment (I've summarised and elided it a bit):

  {
    "id": "6f0d250993e4e647",
    "parent_id": "f20174ca837cc8b6",
    "name": "production/mesh-adrian",
    "origin": "AWS::AppMesh::Proxy",
    "http.request.url": "http://mesh-adrian.default.svc.cluster.local/wo...",
    "aws.app_mesh.mesh_name": "production",
    "subsegments": [
      {
        "id": "9301a3ebae7c57b3",
        "name": "production/mesh-adrian",
        "http.request.url": "http://mesh-adrian.default.svc.cluster.local/wo...",
        "namespace": "remote"
      }
    ]
  }

And here's the egress segment on another service's proxy that is the parent of the one above:

  {
    "id": "f20174ca837cc8b6",
    "parent_id": "1744723913e67295",
    "name": "production/mesh-mary",
    "origin": "AWS::AppMesh::Proxy",
    "http.request.url": "http://mesh-adrian.default.svc.cluster.local/wo...",
    "aws.app_mesh.mesh_name": "production",
    "subsegments": []
  }

(We can tell the latter is an egress bc the request URL is for a different service to the name field.)

So, it seems my opening sentence:

App Mesh Proxies can log segment documents to AWS X-Ray that include subsegments that record calls out to other services.

...is not represented in the data I've looked at today. The proxies are logging subsegments for calls into their own services rather than out to other services.

Two thoughts, having looked at this now:

Subsegments
I think that, in the spirit of X-Ray's format, if proxies are going to trace, they should really log trace outbound calls as remote subsegments, for both ingress and egress requests. I'm sure the X-Ray team can give you a more informed opinion on this, though.

Service Names
You can see in the example above that the ingress subsegment has the same name value as its containing segment. If that were happening on an egress remote subsegment, it would definitely be wrong. (That's what I thought I saw originally.) Whether it's right or wrong on an ingress is debatable. On the one hand, both the service and its proxy are part of the one logical service. On the other hand, in X-Ray land when two components talking to each other share a name, it kind of implies that the service is making remote requests to itself. It seems to me that, if proxies are tracing, they should maybe use different names to the service they are proxying.

Hope that's helpful!

@kiranmeduri
Copy link

@GrahamLea sorry for the delay. If I can trouble you once more, can you let me know which version of Envoy you are using? Thanks

@GrahamLea
Copy link
Author

Sorry, this was from a client's system, so I don't have that information.

@abaptiste
Copy link

After spelunking through Envoy and the X-ray extension, it does not appear that the X-ray extension creates or updates the subsegments object in a trace. We are continuing to research to see how this can be addressed.

@abaptiste
Copy link

Another update, the subsegments object can be added to the trace using the AWSXraySDK, however Envoy's tracer does not use that SDK.

The tracer does create child spans which are equivalent. There is a type field which the tracer should set to subsegment per the X-Ray documentation

Here's a link to the existing tracer fields where we can begin adjusting the span content.

@rexnp
Copy link

rexnp commented Mar 2, 2022

An update on this issue: we've identified the necessary changes to trace remote calls as a subsegment and have the subsegment name be identified as the remote service (instead of the caller's).

  1. The behavior to trace remote calls is controlled by Envoy's start_child_span configuration. Today App Mesh does not expose this option in the API and does not have it enabled; prompted by this issue, our plan is to eventually enable this option by default for all customers.
  2. However, we also discovered that in the existing Xray extension implementation, we are not correctly classifying traces to remote (Envoy child spans) as xray subsegments, but instead treating each subsegment as an independent segment. We have opened a PR to fix this: xray tracer: correctly annotate child spans as xray subsegments envoyproxy/envoy#20170

To demonstrate, this gap is most noticeable in the XRay service map (the graphs below are captured from App Mesh & Xray integration walkthrough):

figure 1.
Screen Shot 2022-02-09 at 6 01 54 PM

The service map above shows the outcome of simply enabling the Envoy start_child_span option with the existing Envoy Xray extension, which contains additional nodes attributed to the "child spans" in Envoy that were misclassified as independent segments. In contrast, the expected service map should look like

figure 2.
Screen Shot 2022-02-09 at 6 05 03 PM

while having segment details about the remote service, e.g.

{
            "Id": "8eecc8a82d35c902",
            "Document": {
                "id": "8eecc8a82d35c902",
                "name": "howto-ecs-basics/howto-ecs-basics-front-node",
                "start_time": 1644451809.443302,
                "trace_id": "1-620457e1-47c3587c6ede1c08098a7273",
                "end_time": 1644451809.4465587,
                "parent_id": "257d8a179102651d",
                "http": {
                    "request": {
                        "url": "http://color.howto-ecs-basics.mesh.local:8080/",
                        "method": "GET",
                        "user_agent": "Go-http-client/1.1",
                        "client_ip": "10.0.77.8",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 5
                    }
                },
                "aws": {
                    "app_mesh": {
                        "mesh_name": "howto-ecs-basics",
                        "virtual_node_name": "howto-ecs-basics-front-node"
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "cds_egress_howto-ecs-basics_howto-ecs-basics-color-node_http_8080",
                    "request_size": "0",
                    "downstream_cluster": "-",
                    "node_id": "mesh/howto-ecs-basics/virtualNode/howto-ecs-basics-front-node",
                    "direction": "egress"
                },
                "origin": "AWS::AppMesh::Proxy",
                "subsegments": [
                    {
                        "id": "e6badfd202e93ce9",
                        "name": "router cds_egress_howto-ecs-basics_howto-ecs-basics-color-node_http_8080 egress",
                        "start_time": 1644451809.4434495,
                        "end_time": 1644451809.446304,
                        "http": {
                            "request": {},
                            "response": {
                                "status": 200
                            }
                        },
                        "aws": {},
                        "annotations": {
                            "response_flags": "-",
                            "component": "proxy",
                            "upstream_cluster": "cds_egress_howto-ecs-basics_howto-ecs-basics-color-node_http_8080",
                            "upstream_address": "10.0.66.192:8080",
                            "direction": "egress"
                        }
                    }
                ]
            }
        }

Therefore, today we do not want to simply enable the trace remote call option in Envoy for customers without also having them upgrade their Envoy image with the Xray extension fix.

Path forward:

  1. The Xray extension fix needs to be upstreamed and App Mesh will have to release a new image containing this fix.
  2. We will use an opt-in system where customers can reach out to us to enable the start_child_span option on a per account basis. The reason we cannot simply enable this option for all customers is because we do not want to break their existing integration with Xray (without tracing remote calls, there are no subsegments to piece together in the service map).

I will update back when 1) is complete.

@suniltheta
Copy link

suniltheta commented Mar 18, 2022

This fix envoyproxy/envoy#20170 is merged in Envoy github repo. This fix should be available from next version v1.22.x App Mesh envoy image release.

@shsahu shsahu closed this as completed Apr 6, 2022
@shsahu shsahu reopened this Apr 7, 2022
@suniltheta
Copy link

This capability is added in envoyproxy v1.22 release and should be available from aws-appmesh-envoy:v1.22.0.0 #407 onwards. However, this feature is not enabled by default for all the customers so as to not break their existing behavior. If you want this feature to be enabled cut a customer ticket towards AWS App Mesh with the AWS Account ID, so that we can we can opt your aws account into this feature.

We shall keep this issue open so that customers can discover this capability and cut us a ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants