Runtime context fails to detach token #2606

Open
jha-hitesh opened this issue Apr 15, 2022 · 36 comments
Labels
bug Something isn't working

Comments

@jha-hitesh

Describe your environment
python 3.7
ujson
sanic==20.9.1
opentelemetry-api==1.9.1
opentelemetry-sdk==1.9.1
opentelemetry-propagator-jaeger==1.9.1
opentelemetry-exporter-jaeger-thrift==1.9.1
opentelemetry-instrumentation==0.28b1
opentelemetry-exporter-otlp-proto-http==1.10.0

Steps to reproduce
I used opentelemetry-instrumentation to create a middleware for the Sanic web framework. The middleware traces each request. Everything seems to work, but the error Failed to detach context appears randomly for some requests.
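
For reference, a minimal sketch of the kind of middleware described (a hypothetical reconstruction; the app and handler names are illustrative, not the actual code):

from sanic import Sanic
from opentelemetry import context, trace
from opentelemetry.propagate import extract

app = Sanic("traced-app")
tracer = trace.get_tracer(__name__)

@app.middleware("request")
async def start_trace(request):
    # Extract any remote parent context and make a request span current.
    ctx = extract(dict(request.headers))
    request.ctx.otel_span = tracer.start_span(request.path, context=ctx)
    request.ctx.otel_token = context.attach(
        trace.set_span_in_context(request.ctx.otel_span, ctx)
    )

@app.middleware("response")
async def end_trace(request, response):
    request.ctx.otel_span.end()
    # This detach is the call that intermittently logs "Failed to detach context".
    context.detach(request.ctx.otel_token)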

What is the expected behavior?
Instead of the Failed to detach context error, the original exception should be logged so that the actual issue can be debugged with a proper stacktrace.
In opentelemetry-python/opentelemetry-api/src/opentelemetry/context/__init__.py, instead of

except Exception:  # pylint: disable=broad-except
    logger.error("Failed to detach context")

it should be

except Exception as e:  # pylint: disable=broad-except
    logger.error(e)

What is the actual behavior?
The generic Failed to detach context message is logged instead of the original exception.

Additional context

@jha-hitesh jha-hitesh added the bug Something isn't working label Apr 15, 2022
@kasium
Contributor

kasium commented Jun 1, 2022

I have the same request. Maybe the code could also be changed to

except Exception as e:  # pylint: disable=broad-except
    logger.exception("Failed to detach context")

Is a PR possible?

@lzchen
Contributor

lzchen commented Jun 7, 2022

@jha-hitesh
Any idea why detaching the context would throw an exception? We wouldn't want an instrumentation to error out on detaching the context anyway. Maybe make the change locally in the context module to debug this. Changing the try/except to be less generic could be a separate issue.
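
For reference, the exception can be surfaced without editing the installed package by wrapping the runtime context's detach (a debugging sketch; the wrapper name is made up):

import logging
from opentelemetry.context import contextvars_context

_orig_detach = contextvars_context.ContextVarsRuntimeContext.detach

def _noisy_detach(self, token):
    try:
        _orig_detach(self, token)
    except Exception:
        # Log the underlying exception with a stacktrace before it is
        # swallowed by the generic handler in opentelemetry.context.detach.
        logging.getLogger(__name__).exception("detach failed for token %r", token)
        raise

contextvars_context.ContextVarsRuntimeContext.detach = _noisy_detach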

@sihrc

sihrc commented Jun 16, 2022

@lzchen

Traceback (most recent call last):
  File "/project/.venv/lib/python3.10/site-packages/opentelemetry/context/__init__.py", line 153, in detach
    _RUNTIME_CONTEXT.detach(token)  # type: ignore
  File "/project/.venv/lib/python3.10/site-packages/opentelemetry/context/contextvars_context.py", line 56, in detach
    self._current_context.reset(token)  # type: ignore
ValueError: <Token var=<ContextVar name='current_context' default={} at 0x7fa9ac2df380> at 0x7fa9ac341780> was created in a different Context

This kind of exception :)

@lzchen
Contributor

lzchen commented Jun 16, 2022

@sihrc
Does this not seem like the instrumentation itself has a bug (in which case detach is actually being called incorrectly)?

@kasium
Contributor

kasium commented Jun 20, 2022

@lzchen I'm not able to change this locally. Would you accept a change which switches the log call from logger.error to logger.exception?

@lzchen
Contributor

lzchen commented Jun 20, 2022

@kasium
Feel free to submit a pr :)

@srikanthccv
Member

This is fixed in #2774

@junli-lyft

@lzchen

Traceback (most recent call last):
  File "/project/.venv/lib/python3.10/site-packages/opentelemetry/context/__init__.py", line 153, in detach
    _RUNTIME_CONTEXT.detach(token)  # type: ignore
  File "/project/.venv/lib/python3.10/site-packages/opentelemetry/context/contextvars_context.py", line 56, in detach
    self._current_context.reset(token)  # type: ignore
ValueError: <Token var=<ContextVar name='current_context' default={} at 0x7fa9ac2df380> at 0x7fa9ac341780> was created in a different Context

This kind of exception :)

Hi @sihrc, were you able to root-cause this issue and fix it? I am hitting this exception and am not sure how to fix it.
I used traceback.print_stack() to print the call stack:

File "/opt/homebrew/Cellar/python-bundle-arm64/0.0.0/brew/opt/python38/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/homebrew/Cellar/python-bundle-arm64/0.0.0/brew/opt/python38/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python-bundle-arm64/0.0.0/brew/opt/python38/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/junli/.cache/lyftvenv/v/hello-world/tmp58xgye3z/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 268, in worker
    self._export(flush_request)
  File "/Users/junli/.cache/lyftvenv/v/hello-world/tmp58xgye3z/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 333, in _export
    self._export_batch()
  File "/Users/junli/.cache/lyftvenv/v/hello-world/tmp58xgye3z/lib/python3.8/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 361, in _export_batch
    detach(token)
  File "/Users/junli/.cache/lyftvenv/v/hello-world/tmp58xgye3z/lib/python3.8/site-packages/opentelemetry/context/__init__.py", line 67, in wrapper
    return func(*args, **kwargs)  # type: ignore[misc]
  File "/Users/junli/.cache/lyftvenv/v/hello-world/tmp58xgye3z/lib/python3.8/site-packages/opentelemetry/context/__init__.py", line 159, in detach

But I am not sure how to troubleshoot further. Any ideas?

@eliatcodecov

eliatcodecov commented Aug 12, 2022

I'm having a similar issue to @junli-lyft: I'm hitting this exception in detach and cannot seem to reliably reproduce it in development. I would love some insight on how to address this; in my case the issue appears to occur when _RUNTIME_CONTEXT is None.

Is it an actual error for _RUNTIME_CONTEXT to be None? If so, how can we troubleshoot this further? If None is a valid state for _RUNTIME_CONTEXT, should detach account for that before throwing the exception?

@srikanthccv srikanthccv changed the title Raise Original Exception instead of Failed to detach context Runtime context fails to detach token Aug 12, 2022
@srikanthccv
Member

Since this appears to be an issue for multiple users, I am going to re-open this with a different title. The original issue was about showing the exception with a log instead of a simple message.

Is it an actual error for _RUNTIME_CONTEXT to be None ?

Yes, _RUNTIME_CONTEXT should never be None.

@srikanthccv srikanthccv reopened this Aug 12, 2022
@srikanthccv
Member

Do any of you use tornado/gevent/eventlet in your applications?

@junli-lyft

Yeah, I am using gevent 21.12.0 (with monkey patching) and flask 1.1.4.
And here is the list of otel packages in case you need it:

opentelemetry-api==1.12.0rc2
opentelemetry-instrumentation==0.32b0
opentelemetry-instrumentation-flask==0.32b0
opentelemetry-instrumentation-wsgi==0.32b0
opentelemetry-sdk==1.12.0rc2
opentelemetry-semantic-conventions==0.32b0
opentelemetry-util-http==0.32b0

@sfc-gh-sili

Is there any update on this issue? I am having a similar issue.

# stored in a module-level global
span = tracer.start_span("span_name")

# In another thread
span.end()

would work, but

span = tracer.start_span("span_name")
global _internal_token
_internal_token = context_api.attach(context_api.set_value(_SPAN_KEY, span))

# In another thread
trace.get_current_span().end()
context_api.detach(_internal_token)

would fail, because _RUNTIME_CONTEXT is None. I suspected that another global _RUNTIME_CONTEXT was interfering, but that was not the case.

Any guess what is causing the issue?

@faynburd

Any updates on this?

@jessereich-mcs

I'm starting to see this error too, after upgrading to Python 3.11 and updating opentelemetry-api and opentelemetry-sdk to 1.14.0. I don't fully understand why it is thrown only in our deployed GCS Flask application and not in development. We are using gunicorn.

@srikanthccv
Member

What's the worker class you are using with gunicorn?

@jessereich-mcs

What's the worker class you are using with gunicorn?

gthread

@jessereich-mcs

jessereich-mcs commented Nov 17, 2022

Are there any leads on how to reproduce this consistently? I'm getting this at random times in my deployed app and will basically need to turn off all tracing if I can't identify the issue.

I have a pubsub listener endpoint which triggers work in the background; that seems particularly susceptible to it, but I see it happening elsewhere too.

I don't think _RUNTIME_CONTEXT is None, or else it would be an AttributeError on _RUNTIME_CONTEXT.detach().

I'm not sure how it determines that the token has already been used once. How could this fail more gracefully?

"Failed to detach context
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/opentelemetry/context/__init__.py", line 157, in detach
    _RUNTIME_CONTEXT.detach(token)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/opentelemetry/context/contextvars_context.py", line 50, in detach
    self._current_context.reset(token)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: <Token used var=<ContextVar name='current_context' default={} at 0x3eab1ffcebb0> at 0x3eaafc381500> has already been used once
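
That RuntimeError is what contextvars itself raises when the same token is reset twice, which suggests detach is somehow being called twice with the same token. A minimal illustration:

import contextvars

var = contextvars.ContextVar("current_context", default={})
token = var.set({"key": "value"})
var.reset(token)
var.reset(token)  # RuntimeError: <Token used ...> has already been used once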

@morrissimo

server: FastAPI (uvicorn with uvloop)
database: Postgres (peewee + psycopg2)

Chiming in with yet another observed instance of this behavior. We're instrumenting a FastAPI app and first started seeing this Failed to detach context exception when we changed our database connection generator from the start_as_current_span() decorator approach to instrumenting the method's logic with a with block (plus events). FWIW, the exception being raised does NOT appear to prevent spans from being delivered to our OTEL endpoint 🤷

original method (does NOT cause the exception):

@tracer.start_as_current_span("get_db")
def get_db(db_state: Any = Depends(reset_db_state)) -> Generator[None, None, None]:
    with db.connection_context():
        yield

tweaked method (DOES cause the exception):

def get_db(db_state: Any = Depends(reset_db_state)) -> Generator[None, None, None]:
    with tracer.start_as_current_span("get_db") as span:
        span.add_event(name="entry", attributes=dict(db_is_closed=db.is_closed()))
        with db.connection_context():
            span.add_event(name="in context, pre-yield", attributes=dict(db_is_closed=db.is_closed()))
            yield
            span.add_event(name="in context, post-yield", attributes=dict(db_is_closed=db.is_closed()))
        span.add_event(name="exit", attributes=dict(db_is_closed=db.is_closed()))

The original motivation for the instrumentation change was that we noticed we weren't getting full spans with the decorator approach, and we wanted to add some events around the method's yield to help us understand database connection timing.

Since trying out the tweaked method and observing this exception, we've tried a few different things:

  1. removing the span events: unsurprisingly, this made no difference - the exception was still raised
  2. using the decorator approach and then grabbing the current span inside the method so the span events could be added: this reverted to the original instrumentation behavior (the full span wasn't captured, and no events were captured either), but the exception stopped being thrown as well
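
One workaround that may be worth trying (a sketch based on the snippet above, not verified against this app): start the span without making it the current context across the yield, so there is no token to detach when the dependency resumes in a different context:

def get_db(db_state: Any = Depends(reset_db_state)) -> Generator[None, None, None]:
    # start_span() creates the span without attaching it as current,
    # so nothing needs to be detached after the yield.
    span = tracer.start_span("get_db")
    try:
        with db.connection_context():
            span.add_event(name="pre-yield", attributes=dict(db_is_closed=db.is_closed()))
            yield
            span.add_event(name="post-yield", attributes=dict(db_is_closed=db.is_closed()))
    finally:
        span.end()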

@emdej

emdej commented Jul 4, 2023

I'm getting the same on 1.17.0.

@srikanthccv
Member

Are you using any of gevent/eventlet/tornado etc...?

@emdej

emdej commented Jul 5, 2023

No.

@marrobi

marrobi commented Oct 24, 2023

Getting the same here, using

azure-monitor-opentelemetry            1.0.0
azure-monitor-opentelemetry-exporter   1.0.0b17
opentelemetry-api                      1.20.0
opentelemetry-instrumentation          0.41b0
opentelemetry-instrumentation-asgi     0.41b0
opentelemetry-instrumentation-dbapi    0.41b0
opentelemetry-instrumentation-django   0.41b0
opentelemetry-instrumentation-fastapi  0.41b0
opentelemetry-instrumentation-flask    0.41b0
opentelemetry-instrumentation-logging  0.41b0
opentelemetry-instrumentation-psycopg2 0.41b0
opentelemetry-instrumentation-requests 0.41b0
opentelemetry-instrumentation-urllib   0.41b0
opentelemetry-instrumentation-urllib3  0.41b0
opentelemetry-instrumentation-wsgi     0.41b0
opentelemetry-resource-detector-azure  0.1.0
opentelemetry-sdk                      1.20.0
opentelemetry-semantic-conventions     0.41b0
opentelemetry-util-http                0.41b0

Is a FastAPI app using with tracer.start_as_current_span("span_name") as span:

@benliddicott

benliddicott commented Nov 1, 2023

I've been experiencing the same or a similar issue, and after digging into the code of opentelemetry and Flask I think I have the beginnings of a solution ("works for me").

  • Running under gunicorn
  • Using auto-telemetry not programmatic
  • Making remote REST calls using requests library and concurrent.futures.ThreadPoolExecutor
  • The library flask-executor does not solve the problem (ref https://pypi.org/project/Flask-Executor/ ).

My use-case is I want to use concurrent.futures.ThreadPoolExecutor to make REST/Json api calls in parallel.

# Suppose executor is a ThreadPoolExecutor
# Same results from the flask-executor package

f1 = executor.submit(remote_call_1, "foo")
f2 = executor.submit(remote_call_2, "bar")
f3 = executor.submit(remote_call_3, "baz")

r1 = f1.result()
r2 = f2.result()
r3 = f3.result()

do_something(r1, r2, r3)

When I have the logger log threading.current_thread().name, I can see that the ValueError ... was created in a different Context is logged from the ThreadPool thread (e.g. ThreadPoolExecutor-0_2).

However, should (for example) remote_call_2 throw an exception, we instead get a "ValueError: generator already executing" exception, this time from the MainThread thread.

Note that I have not been able to reproduce this with programmatic instrumentation; I can only get it to occur in a Kubernetes environment with auto-instrumentation.

The solution is threefold.

  • Obtain the Flask context, and copy it (as seen in flask.copy_current_request_context)
  • The wrapper needs to remove the otel activation teardown_request hook "opentelemetry-flask.activation_key" from the context after the function has executed:
    • This prevents the ValueError ... was created in a different Context
    • but it also prevents the hook from being run at all, so the otel span lasts forever (which is no good)
    • so we must remove it from the context as seen by otel without removing it from the real context, so that when the real request ends, the span ends.
    • That means we have to create a wrapper/proxy for the Request object so that, when the (copied) Flask context ends, the teardown_request hook does not end the span, but when the real request ends, the hook does run.
  • The wrapper needs to execute the function in a copy of the Python contextvars using contextvars.copy_context(); otherwise the remote REST/Json calls don't get included in the otel span (see the sketch below).

If this is interesting ping me and I will see if I am allowed to share the code.
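
The third point can be illustrated in isolation (a generic sketch, not the poster's actual code): run each submitted function inside a copy of the caller's contextvars so the worker thread sees the current otel span:

import contextvars
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()

def submit_with_context(fn, *args, **kwargs):
    # Copy the submitting thread's contextvars (including the current
    # otel context) and run the function inside that copy on the worker.
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args, **kwargs)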

@ethervoid

We have the same issue using dramatiq. We're using opentelemetry-python with gunicorn and threads without any problems, but with dramatiq and a middleware similar to this one we're hitting this issue.

@jgibo

jgibo commented Apr 22, 2024

Are you using any of gevent/eventlet/tornado etc...?

@srikanthccv, I'm actively getting this issue with a tornado app. python3.11, tornado 6.3.2. Why do you ask?

@alexmojaki

I've run into this. I can reproduce it with the script below. I've tried my absolute best to reduce it to a minimal example, but it's very weird and sensitive; even running this code in a Docker container removes the error. Here's the script:

import openai
from opentelemetry import context

import logfire

logfire.configure(
    token='LCyjlJqNbFXqVmQ6lZVLKwxSrPJmKWgXmZG0rJpkS0c4'
    # Uncommenting this removes the error:
    # send_to_logfire=False
    # This might be because sending to logfire adds a thread for a BatchSpanProcessor.
)


class MyStream:
    def __iter__(self):
        # Replacing self._it with a plain variable removes the error,
        # probably because it affects garbage collection.
        self._it = self.stream()
        for item in self._it:
            yield item

        # Replacing the above loop with this removes the error.
        # yield from self._it

    def stream(self):
        with logfire.span('streaming'):
            yield 1
            yield 2


def patched_post(*_, **__):
    # Removing this line removes the error. It returns None.
    context.get_value('foo')
    return MyStream()


# Uncommenting this removes the error, whether or not you keep the same line above within the function.
# context.get_value('foo')

# You DON'T need to fill in a real API key here.
client = openai.Client(api_key='...')
client.post = patched_post

# Uncommenting this removes the error, as does messing with the create method in various ways.
# client.chat.completions.create = patched_post

# Extracting the result of this method call into a global variable removes the error,
# because it means the generator will only be garbage collected on shutdown.
for _ in client.chat.completions.create(model=1, messages=1, stream=1):
    # This is critical - it ensures that the `stream()` generator is left suspended when garbage collected.
    break

And the error output:

Failed to detach context
Traceback (most recent call last):
  File "test.py", line 27, in stream
    yield 1
GeneratorExit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/alex/work/logfire/.venv/lib/python3.12/site-packages/opentelemetry/context/__init__.py", line 163, in detach
    _RUNTIME_CONTEXT.detach(token)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alex/work/logfire/.venv/lib/python3.12/site-packages/opentelemetry/context/contextvars_context.py", line 50, in detach
    self._current_context.reset(token)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: <Token var=<ContextVar name='current_context' default={} at 0x7f1ffc906430> at 0x7f1ffc730780> was created in a different Context

To run this code you will need to pip install logfire openai. You may or may not get the error. For what it's worth, here are some versions:

openai==1.23.3
logfire==0.28.3
Python 3.12.3 (main, Apr 15 2024, 18:25:56) [Clang 17.0.6 ]
Ubuntu 22.04.4 LTS

The essential part is that a context is attached in a generator which never completes properly, so the generator exits when it's garbage collected, and that's when detaching the context is attempted and fails. I don't know why it fails; I'm guessing garbage collection happens 'in a different context'? It seems like a CPython bug.

@sylvainmouquet

sylvainmouquet commented May 9, 2024

Another example: the error occurs when the reset is done in an async/await function:

import contextvars
import asyncio

var = contextvars.ContextVar('var')
var.set('spam')
token_ok_1 = var.set(1)
token_ok_2 = var.set(1)
token_ko = var.set(1)

async def main_ko():
    try:
        var.reset(token_ko)
    except Exception as e:
        print('KO')
        raise e

def main_ok():
    var.reset(token_ok_2)
    
if __name__ == "__main__":
    var.reset(token_ok_1)
    main_ok()
    asyncio.run(main_ko())

result :

python3 error.py
KO
Traceback (most recent call last):
 ...
  File "app/error.py", line 14, in main
    raise e
  File "app/error.py", line 11, in main
    var.reset(token_ko)
    ~~~~~~~~~^^^^^^^^^^
ValueError: <Token var=<ContextVar name='var' at 0x10026e5c0> at 0x1002b3140> was created in a different Context

@alexmojaki

@sylvainmouquet that example can be reduced to:

import asyncio
import contextvars

var = contextvars.ContextVar('var')
token = var.set(1)


async def main():
    var.reset(token)


asyncio.run(main())

Here the error message makes sense. The token was created when calling var.set(1) globally, before starting an async event loop; then we're trying to reset it inside an async context.
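
asyncio.run() executes the coroutine in a fresh task, and each task runs in a copy of the current context, so a token created outside the task cannot be reset inside it. Setting and resetting within the same task works:

import asyncio
import contextvars

var = contextvars.ContextVar('var')

async def main():
    token = var.set(1)  # token created inside this task's context copy
    var.reset(token)    # resetting in the same context succeeds

asyncio.run(main())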

@alexmojaki

Here's a good repro:

import asyncio

from opentelemetry import context, trace


async def gen():
    token = context.attach(trace.set_span_in_context('span'))
    try:
        yield 1
        yield 2
    finally:
        context.detach(token)


async def main():
    async for i in gen():
        print(i)
        break


asyncio.run(main())

The error isn't there if you remove the break or make gen a non-async generator. So while #2606 (comment) shows that it can happen in sync code, it's much easier to reproduce with async generators.

@sylvainmouquet

When using a generator, the solution is to use async with contextlib.aclosing(gen()) as agen: as explained here:
python/cpython#118944
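
Applied to the async repro above, aclosing() (available from Python 3.10) closes the generator inside the consuming task, so the finally block and the detach run in the same context that attached the token:

import asyncio
import contextlib

from opentelemetry import context, trace

async def gen():
    token = context.attach(trace.set_span_in_context('span'))
    try:
        yield 1
        yield 2
    finally:
        context.detach(token)

async def main():
    # aclosing() guarantees the generator is closed here rather than
    # at garbage-collection time in some other context.
    async with contextlib.aclosing(gen()) as agen:
        async for i in agen:
            print(i)
            break

asyncio.run(main())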

@alexmojaki

Here's a minimal repro without async:

import threading

from opentelemetry import context, trace


class Foo:
    pass


def gen(_foo):
    token = context.attach(trace.set_span_in_context('span'))
    try:
        yield 1
        yield 2
    finally:
        context.detach(token)


def main():
    # Create a circular reference to delay garbage collection
    foo = Foo()
    foo.gen = gen(foo)

    for _ in foo.gen:
        # Leave the generator suspended
        break


threading.Thread(target=main).start()

@sylvainmouquet

There is also contextlib.closing for the sync case:

import contextlib

with contextlib.closing(gen(foo)) as agen:
    for _ in agen:
        break  # the with block closes the generator here, in the same context

@alexmojaki

The problem is that OTEL SDKs can typically only add spans to iterators to instrument them; they can't ensure/require that the code creating/using those spans closes/finishes the iterators.

For example, consider this code in opentelemetry-instrumentation-wsgi:

def _end_span_after_iterating(iterable, span, token):
    try:
        with trace.use_span(span):
            yield from iterable
    finally:
        close = getattr(iterable, "close", None)
        if close:
            close()
        span.end()
        if token is not None:
            context.detach(token)

Here iterable is the WSGI response. If iteration is not completed (e.g. if the client doesn't read the full response), then it's unclear if/when/how the finally block will end up running. It may depend on how the WSGI server cleans up disconnected clients or on Python's garbage collection. This means the span may remain open for an unpredictable amount of time, and detaching the token may log the error discussed in this thread.

@enrico-stauss

This issue just came up for me. I'm executing asyncio.run() on an async function of mine that runs a LlamaIndex Workflow, which applies LlamaIndex instrumentation logic. I can see if I'm able to come up with an MWE for this, but do you have an idea where I should start digging in order to debug it?
This is the token that cannot be detached:
[screenshot of the token]

And the callstack:
[screenshot of the call stack]

Thanks and kind regards

@alexmojaki

Ensure that the body of with trace.use_span/tracer.start_as_current_span/logfire.span never directly contains a yield. See https://logfire.pydantic.dev/docs/guides/advanced/generators/
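
In other words, one way to follow that advice (a sketch, not the guide's exact code):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Problematic: the context is attached when the generator starts and must
# be detached wherever the generator happens to exit, possibly during GC.
def rows_bad():
    with tracer.start_as_current_span('rows'):
        yield 1
        yield 2

# Safer: create the span without attaching it as the current context
# across the yields, so there is no token to detach.
def rows_good():
    span = tracer.start_span('rows')
    try:
        yield 1
        yield 2
    finally:
        span.end()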
