
segfault/problems with error categories/std::error_codes when using dynamic.ot tracing. #5481

Closed
Oberon00 opened this issue Jan 4, 2019 · 9 comments

Oberon00 (Contributor) commented Jan 4, 2019

Description:
Envoy crashes when the error_category output parameter of the OpenTracingMakeTracerFactory implementation is assigned by a dynamic.ot plugin. The cause of this is the combination of:

  1. Envoy links opentracing and libstdc++ statically, forcing plugins to link their own copy.
  2. Envoy unloads the plugin .so if it fails to load, before processing the error.

Point 1 alone causes a lot of problems. Even if point 2 is fixed, error categories, and thus std::error_codes, from plugins will always compare unequal to the ones from Envoy, because libstdc++ basically compares the addresses of the error category objects, and the copy in the plugin lives at a different address than the statically linked copy in the envoy binary. This affects, for example, the LookupKey function in opentracing-cpp's mocktracer (mocktracer/src/propagation.cpp): result.error() != opentracing::lookup_key_not_supported_error is always true, since the lhs comes from envoy and the rhs from the opentracing shared library.

In combination with point 2, this means that envoy tries to dereference a pointer to memory that belongs to an unloaded shared library (whether it is in the so image itself or was deallocated in unload callbacks/static destructors).
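
To make point 1 concrete, here is a minimal sketch (the include path is an assumption; the identifiers are the ones referenced above) of why such a comparison can never succeed once the two sides reference different copies of the library:

#include <opentracing/propagation.h>  // assumed location of lookup_key_not_supported_error
#include <system_error>

// One side of the comparison is an error_code produced by one copy of
// opentracing (e.g. the one statically linked into the envoy binary), while
// the other side references the category singleton in the other copy (e.g.
// the plugin's). std::error_category::operator== compares object addresses,
// so the two singletons never compare equal and this check is always false.
bool is_lookup_key_not_supported(const std::error_code& result_error) {
  return result_error == opentracing::lookup_key_not_supported_error;
}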

Repro steps:
For example, start with the /examples/jaeger-native-tracing sample (I'm using commit 7d2e84d). Replace the jaeger tracing plugin in the front-envoy with the opentracing_mocktracer from opentracing/opentracing-cpp@v1.5.0. Modify the plugin's source code (in opentracing-cpp) as follows:

diff --git a/mocktracer/src/dynamic_load.cpp b/mocktracer/src/dynamic_load.cpp
index 0fc7371..50e9c08 100644
--- a/mocktracer/src/dynamic_load.cpp
+++ b/mocktracer/src/dynamic_load.cpp
@@ -4,11 +4,21 @@
 #include <cstring>
 #include <exception>

+namespace {
+struct DummyUnloadDetection {
+  ~DummyUnloadDetection() {
+    fprintf(stderr, "~DummyUnloadDetection\n");
+    fflush(stderr);
+  }
+};
+}  // namespace
+
 static int OpenTracingMakeTracerFactoryFct(const char* opentracing_version,
                                            const char* opentracing_abi_version,
                                            const void** error_category,
                                            void* error_message,
                                            void** tracer_factory) try {
+  static DummyUnloadDetection unload_guard;
   if (opentracing_version == nullptr || opentracing_abi_version == nullptr ||
       error_category == nullptr || tracer_factory == nullptr) {
     fprintf(stderr,
@@ -28,9 +38,18 @@ static int OpenTracingMakeTracerFactoryFct(const char* opentracing_version,
     return opentracing::incompatible_library_versions_error.value();
   }

-  *tracer_factory = new opentracing::mocktracer::MockTracerFactory{};
+  fprintf(stderr, "error category addr: %p\n",
+          &opentracing::dynamic_load_error_category());
+  fflush(stderr);
+  auto& message = *static_cast<std::string*>(error_message);
+  message = "dummy error";
+  *error_category =
+      static_cast<const void*>(&opentracing::dynamic_load_error_category());
+  return opentracing::dynamic_load_failure_error.value();
+
+  //*tracer_factory = new opentracing::mocktracer::MockTracerFactory{};

-  return 0;
+  //return 0;
 } catch (const std::bad_alloc&) {
   *error_category = static_cast<const void*>(&std::generic_category());
   return static_cast<int>(std::errc::not_enough_memory);

Then compile it (fortunately, opentracing-cpp has a very straightforward cmake build so this should be no problem).

Change the docker-compose.yml of the envoy example:

diff --git a/examples/jaeger-native-tracing/docker-compose.yml b/examples/jaeger-native-tracing/docker-compose.yml
index 659c4da8b..2225c3d2f 100644
--- a/examples/jaeger-native-tracing/docker-compose.yml
+++ b/examples/jaeger-native-tracing/docker-compose.yml
@@ -7,11 +7,14 @@ services:
       dockerfile: front-proxy/Dockerfile-frontenvoy
     volumes:
       - ./front-envoy-jaeger.yaml:/etc/front-envoy.yaml
-      - ./install-jaeger-plugin.sh:/install-jaeger-plugin.sh
+      - PATH/TO/COMPILED/libopentracing_mocktracer.so.1.5.0:/libopentracing_mocktracer.so
+      - PATH/TO/COMPILED/libopentracing.so.1.5.0:/usr/lib/libopentracing.so.1
       - ./start-front.sh:/start-front.sh
     entrypoint: /start-front.sh
     networks:
       - envoymesh
+    environment:
+      - LD_LIBRARY_PATH=/usr/lib/
     expose:
       - "80"
       - "8001"

For better debugging, I also changed the envoy dockerfile:

diff --git a/examples/front-proxy/Dockerfile-frontenvoy b/examples/front-proxy/Dockerfile-frontenvoy
index 98413046a..38694b522 100644
--- a/examples/front-proxy/Dockerfile-frontenvoy
+++ b/examples/front-proxy/Dockerfile-frontenvoy
@@ -1,5 +1,7 @@
-FROM envoyproxy/envoy:latest
+FROM envoyproxy/envoy-alpine-debug:latest
 
-RUN apt-get update && apt-get -q install -y \
-    curl
+RUN apk update && \
+    apk add curl bash 'libstdc++<7' wget

Admin and Stats Output: Should not be relevant.

Config: The same as in examples/jaeger-native-tracing/front-envoy-jaeger.yaml, except that the tracing config is changed to the following:

tracing:
  http:
    name: envoy.dynamic.ot
    config:
      library: /libopentracing_mocktracer.so
      config:
        output_file: /spans.log

Logs:

$ docker-compose logs front-envoy
Attaching to jaegernativetracing_front-envoy_1
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:206] initializing epoch 0 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:208] statically linked extensions:
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:210]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:213]   filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:216]   filters.listener: envoy.listener.original_dst,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:219]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:221]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:223]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.zipkin
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:226]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
front-envoy_1  | [2019-01-04 12:13:03.559][000008][info][main] [source/server/server.cc:229]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
front-envoy_1  | [2019-01-04 12:13:03.566][000008][info][main] [source/server/server.cc:271] admin address: 0.0.0.0:8001
front-envoy_1  | [2019-01-04 12:13:03.567][000008][info][config] [source/server/configuration_impl.cc:50] loading 0 static secret(s)
front-envoy_1  | [2019-01-04 12:13:03.567][000008][info][config] [source/server/configuration_impl.cc:56] loading 1 cluster(s)
front-envoy_1  | [2019-01-04 12:13:03.569][000008][info][config] [source/server/configuration_impl.cc:67] loading 1 listener(s)
front-envoy_1  | [2019-01-04 12:13:03.572][000008][info][config] [source/server/configuration_impl.cc:92] loading tracing configuration
front-envoy_1  | [2019-01-04 12:13:03.572][000008][info][config] [source/server/configuration_impl.cc:101]   loading tracing driver: envoy.dynamic.ot
front-envoy_1  | error category addr: 0x7f73a7a348f0
front-envoy_1  | ~DummyUnloadDetection
front-envoy_1  | [2019-01-04 12:13:03.580][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:125] Caught Segmentation fault, suspect faulting address 0x7f73a7a348f0
front-envoy_1  | [2019-01-04 12:13:03.580][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:94] Backtrace thr<0> obj</usr/local/bin/envoy> (If unsymbolized, use tools/stack_decode.py):
front-envoy_1  | [2019-01-04 12:13:03.585][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #0 0x689199 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingDriver::formatErrorMessage()
front-envoy_1  | [2019-01-04 12:13:03.589][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #1 0x689727 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingDriver::DynamicOpenTracingDriver()
front-envoy_1  | [2019-01-04 12:13:03.594][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #2 0x688204 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingTracerFactory::createHttpTracerTyped()
front-envoy_1  | [2019-01-04 12:13:03.598][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #3 0x688008 Envoy::Extensions::Tracers::Common::FactoryBase<>::createHttpTracer()
front-envoy_1  | [2019-01-04 12:13:03.602][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #4 0x7620fb Envoy::Server::Configuration::MainImpl::initializeTracers()
front-envoy_1  | [2019-01-04 12:13:03.607][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #5 0x76326a Envoy::Server::Configuration::MainImpl::initialize()
front-envoy_1  | [2019-01-04 12:13:03.611][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #6 0x74616a Envoy::Server::InstanceImpl::initialize()
front-envoy_1  | [2019-01-04 12:13:03.615][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #7 0x748949 Envoy::Server::InstanceImpl::InstanceImpl()
front-envoy_1  | [2019-01-04 12:13:03.619][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #8 0x441b0f Envoy::MainCommonBase::MainCommonBase()
front-envoy_1  | [2019-01-04 12:13:03.624][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #9 0x4426b4 Envoy::MainCommon::MainCommon()
front-envoy_1  | [2019-01-04 12:13:03.628][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #10 0x41afe5 main
front-envoy_1  | [2019-01-04 12:13:03.628][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] thr<0> obj</usr/glibc-compat/lib/libc.so.6>
front-envoy_1  | [2019-01-04 12:13:03.628][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:117] thr<0> #11 0x7f73a9f68ead (unknown)
front-envoy_1  | [2019-01-04 12:13:03.628][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] thr<0> obj</usr/local/bin/envoy>
front-envoy_1  | [2019-01-04 12:13:03.632][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:114] thr<0> #12 0x4353d8 _start
front-envoy_1  | [2019-01-04 12:13:03.632][000008][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:121] end backtrace thread 0
front-envoy_1  | /start-front.sh: line 3:     8 Segmentation fault      (core dumped) /usr/local/bin/envoy -c /etc/front-envoy.yaml --service-cluster front-proxy

Notice how the suspected fault address matches the address of the error category, and that the .so was already unloaded (~DummyUnloadDetection in the output) before the segfault occurred.

Call Stack: (snip from above)

 Caught Segmentation fault, suspect faulting address 0x7f73a7a348f0
Backtrace thr<0> obj</usr/local/bin/envoy> (If unsymbolized, use tools/stack_decode.py):
 thr<0> #0 0x689199 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingDriver::formatErrorMessage()
 thr<0> #1 0x689727 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingDriver::DynamicOpenTracingDriver()
 thr<0> #2 0x688204 Envoy::Extensions::Tracers::DynamicOt::DynamicOpenTracingTracerFactory::createHttpTracerTyped()
 thr<0> #3 0x688008 Envoy::Extensions::Tracers::Common::FactoryBase<>::createHttpTracer()
 thr<0> #4 0x7620fb Envoy::Server::Configuration::MainImpl::initializeTracers()
 thr<0> #5 0x76326a Envoy::Server::Configuration::MainImpl::initialize()
 thr<0> #6 0x74616a Envoy::Server::InstanceImpl::initialize()
 thr<0> #7 0x748949 Envoy::Server::InstanceImpl::InstanceImpl()
 thr<0> #8 0x441b0f Envoy::MainCommonBase::MainCommonBase()
 thr<0> #9 0x4426b4 Envoy::MainCommon::MainCommon()
 thr<0> #10 0x41afe5 main
 thr<0> obj</usr/glibc-compat/lib/libc.so.6>
 thr<0> #11 0x7f73a9f68ead (unknown)
 thr<0> obj</usr/local/bin/envoy>
 thr<0> #12 0x4353d8 _start
 end backtrace thread 0
mattklein123 (Member) commented:

@rnburn

rnburn (Contributor) commented Jan 4, 2019

I think this can be fixed by changing this code to always return the dynamic_load_failure error code (so that there's no reference to an error_category from the loaded library).

And adding the error category symbols to the global section of the export map so that they resolve to the versions of the opentracing library linked in by envoy, which would make them comparable.
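
As a rough illustration of the export-map part (this is only a guess at the mechanism, assuming a GNU ld version script; the wildcard pattern is a placeholder, not the actual mangled name), the idea is to keep the error category accessors in the global: section so the dynamic linker can bind the plugin's references to the copy that is already linked in:

/* placeholder pattern only; the real export map would list the
   actual (mangled) symbols of the opentracing error categories */
{
  global:
    *dynamic_load_error_category*;
  local:
    *;
};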

Oberon00 (Contributor, Author) commented Jan 5, 2019

The fix you suggest sounds good to me, provided the error_message output parameter is set to something that preserves the information that would be contained in the error_message from the plugin and the error code + category (calling error_category::message should be safe before unloading the shared object).
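
Something along these lines, for example (a minimal sketch with illustrative names, not the actual opentracing-cpp code): everything needed from the plugin-provided category is copied into plain values while the .so is still loaded, so nothing dereferences it afterwards.

#include <string>
#include <system_error>

// plugin_category points into the plugin's .so; it is only dereferenced here,
// while the shared object is still loaded, and only plain strings escape.
std::string describe_plugin_error(const std::error_category& plugin_category,
                                  int error_value,
                                  const std::string& plugin_error_message) {
  std::string text = plugin_category.message(error_value);
  if (!plugin_error_message.empty()) {
    text += ": " + plugin_error_message;
  }
  return text;  // independent of the plugin's lifetime
}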

Regarding the export map, I thought so too but in this repro, I'm dynamically linking libopentracing.so. I do not completely grok ELF/Linux dynamic symbol resolution, but I thought that this would be sufficient for any symbols in the executable to override symbols from libopentracing.

man 3 dlopen has this to say:

External references in the library are resolved using the libraries in that library's dependency list and any other libraries previously opened with the RTLD_GLOBAL flag. If the executable was linked with the flag "-rdynamic" (or, synonymously, "--export-dynamic"), then the global symbols in the executable will also be used to resolve references in a dynamically loaded library.

So maybe envoy is not linked with -rdynamic?

rnburn (Contributor) commented Jan 8, 2019

It supports a build with exported symbols (See #2251).

But not sure what the default is.

rnburn (Contributor) commented Jan 8, 2019

Another (though less elegant and efficient) option that would work without -rdynamic is to first check the error_category with dynamic_cast, then check the std::error_code's value.

Oberon00 (Contributor, Author) commented Jan 8, 2019

I think dynamic_cast would also not work without the types being exported. So far I've worked around this by comparing the error code value (integer) and the category name (string), which seems to work well enough.
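
For reference, a minimal sketch of that workaround (illustrative helper, not Envoy code):

#include <cstring>
#include <system_error>

// Compare the integer value and the category's name() string instead of
// relying on std::error_code::operator==, which compares category addresses
// and therefore fails across two copies of the library.
bool equivalent_across_libraries(const std::error_code& lhs,
                                 const std::error_code& rhs) {
  return lhs.value() == rhs.value() &&
         std::strcmp(lhs.category().name(), rhs.category().name()) == 0;
}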

rnburn (Contributor) commented Jan 8, 2019

Right, those classes would need to be available. But since std::error_category is a polymorphic class, you can use the typeid operator to accomplish the same thing.
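
For illustration, a minimal sketch of that idea (not the code from the PR; it assumes the category types are visible enough that their type_info compares equal across the library boundary):

#include <system_error>
#include <typeinfo>

// std::error_category is polymorphic, so the dynamic type of two category
// objects can be compared even when they live at different addresses because
// they come from different copies of the library.
bool same_category_type(const std::error_category& lhs,
                        const std::error_category& rhs) {
  return typeid(lhs) == typeid(rhs);
}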

Could you review the PR (opentracing/opentracing-cpp#102) to verify that it fixes the segfault?

Oberon00 (Contributor, Author) commented Jan 9, 2019

Thanks @rnburn, that fixes the crash! 👍 I rebuilt envoy with the opentracing from your PR, and now get the following output, as expected:

Attaching to jaegernativetracing_front-envoy_1
front-envoy_1  | total 428388
front-envoy_1  | -r-xr-xr-x    1 1000     1000     438665184 Jan  9 11:52 envoy
front-envoy_1  | [2019-01-09 12:00:14.483][000009][info][main] [source/server/server.cc:206] initializing epoch 0 (hot restart version=10.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363 size=2654312)
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:208] statically linked extensions:
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:210]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:213]   filters.http: envoy.buffer,envoy.cors,envoy.ext_authz,envoy.fault,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.rbac,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:216]   filters.listener: envoy.listener.original_dst,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:219]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:221]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:223]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.zipkin
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:226]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
front-envoy_1  | [2019-01-09 12:00:14.499][000009][info][main] [source/server/server.cc:229]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.capture,raw_buffer,tls
front-envoy_1  | [2019-01-09 12:00:14.769][000009][info][main] [source/server/server.cc:271] admin address: 0.0.0.0:8001
front-envoy_1  | [2019-01-09 12:00:14.780][000009][info][config] [source/server/configuration_impl.cc:50] loading 0 static secret(s)
front-envoy_1  | [2019-01-09 12:00:14.780][000009][info][config] [source/server/configuration_impl.cc:56] loading 1 cluster(s)
front-envoy_1  | [2019-01-09 12:00:14.803][000009][info][config] [source/server/configuration_impl.cc:67] loading 1 listener(s)
front-envoy_1  | [2019-01-09 12:00:15.166][000009][info][config] [source/server/configuration_impl.cc:92] loading tracing configuration
front-envoy_1  | [2019-01-09 12:00:15.167][000009][info][config] [source/server/configuration_impl.cc:101]   loading tracing driver: envoy.dynamic.ot
front-envoy_1  | error category addr: 0x7f0cefce88f0
front-envoy_1  | ~DummyUnloadDetection
front-envoy_1  | [2019-01-09 12:00:15.776][000009][critical][main] [source/server/server.cc:86] error initializing configuration '/etc/front-envoy.yaml': opentracing: failed to load dynamic library: dummy error

One should just consider whether error_code.message() should be prepended to error_message, or whether one should be able to configure opentracing-cpp to return the actual error code (in case libopentracing is linked dynamically).

Oberon00 (Contributor, Author) commented Jan 10, 2019

Maybe interesting (if only in the long run) in this regard is the C++ standardization proposal P1196R0, "Value-based std::error_category comparison" (via reddit).
