Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot query Loki using otel-lgtm #16202

Open
johnnyggalt opened this issue Feb 11, 2025 · 3 comments
Open

Cannot query Loki using otel-lgtm #16202

johnnyggalt opened this issue Feb 11, 2025 · 3 comments
Labels
type/bug Somehing is not working as expected type/question

Comments

@johnnyggalt
Copy link

(migrated from here)

Describe the bug
I'm using latest otel-lgtm and hooked into it via an ASP.NET backend using OpenTelemetry.Exporter.OpenTelemetryProtocol. Metrics seem to work fine, but not so for logs. If I attempt a query over a time range that includes some log entries I get the following error (enabled Loki logging to see this):

level=info ts=2025-02-07T03:24:55.591285638Z caller=metrics.go:237 component=querier org_id=fake traceID=1bd6f1bd0c08d322 latency=fast query="{service_name=\"backend_web\"} | json | logfmt | drop __error__,__error_details__" query_hash=2839858091 query_type=limited range_type=range length=24m41.15s start_delta=24m55.591261438s end_delta=14.441261538s step=14s duration=1.738757ms status=500 limit=100 returned_lines=1 throughput=5.3MB total_bytes=9.2kB total_bytes_structured_metadata=5.3kB lines_per_second=20129 total_lines=35 post_filter_lines=35 total_entries=1 store_chunks_download_time=0s queue_time=88.203µs splits=0 shards=0 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=179.206µs cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=1 ingester_requests=1 ingester_chunk_head_bytes=9.2kB ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=35 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=grafana-lokiexplore-app disable_pipeline_wrappers=false has_labelfilter_before_parser=false
level=error ts=2025-02-07T03:24:55.591670751Z caller=retry.go:107 org_id=fake traceID=1bd6f1bd0c08d322 msg="error processing request" try=4 type=queryrange.LokiRequest query="{service_name=\"backend_web\"} | json | logfmt | drop __error__,__error_details__" query_hash=2839858091 start=2025-02-07T03:00:00Z end=2025-02-07T03:24:41.15Z start_delta=24m55.591665251s end_delta=14.441665551s length=24m41.15s retry_in=4.406740283s code=Code(500) err="rpc error: code = Code(500) desc = failed to parse series labels to categorize labels: 1:2: parse error: unexpected \"=\" in label set, expected identifier or \"}\""

Maybe it's because I'm new to this stack but I have no idea what it's complaining about or how to rectify. Help would be very much appreciated.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki via otel-lgtm latest (sha256:7b7644781f8f801bb8639872e6a8aef28ee94b59bc82f41bd75725aef872c02d)
  2. Started an ASP.NET backend that is integrated via <PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.11.1" />
  3. Interacted with the ASP.NET backend to add a few logs to the system
  4. Attempt to view the logs per the below video
repro.mp4

Expected behavior
I was expecting to see the captured logs.

Environment:

  • Infrastructure: Docker
  • Deployment tool: Docker compose

Screenshots, Promtail config, or terminal output
See above.

@johnnyggalt
Copy link
Author

Here is a bit more info on ASP.NET side of things, in case there is a need to repro.

Dependencies are:

        <PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.11.1" />
        <PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.11.0-beta.1" />
        <PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.11.1" />
        <PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.11.0" />
        <PackageReference Include="OpenTelemetry.Instrumentation.Process" Version="1.11.0-beta.1" />
        <PackageReference Include="OpenTelemetry.Instrumentation.Runtime" Version="1.11.0" />

Startup/registration logic:

    webApplicationBuilder
        .Services
        .AddOpenTelemetry()
        .ConfigureResource(fun builder ->
            builder
                .AddService(
                    serviceName = "backend_web",
                    serviceVersion =
                        (
                            typeof<Startup>.Assembly.GetName().Version
                            |> Option.ofObj
                            |> Option.map string
                            |> Option.defaultValue "unknown"
                        ),
                    serviceInstanceId = Environment.MachineName
                )
           |> ignore
        )
        .WithMetrics(fun builder ->
            builder
                .AddMeter(requiredValue<string> telemetrySection "MeterName")
                .AddAspNetCoreInstrumentation()
                .AddRuntimeInstrumentation()
                .AddProcessInstrumentation()
                .AddOtlpExporter(fun opts ->
                    opts.Endpoint <- openTelemetryUri
                )
            |> ignore
        )
        .WithTracing(fun builder ->
            builder
                .AddAspNetCoreInstrumentation()
                .AddOtlpExporter(fun opts ->
                    opts.Endpoint <- openTelemetryUri
                )
            |> ignore
        )
        .WithLogging(fun builder ->
            builder
                .AddOtlpExporter(fun opts ->
                    opts.Endpoint <- openTelemetryUri
                )
            |> ignore
        )
    |> ignore

From there, it was just a case of logging with an ILogger. If necessary, I'm happy to take the time to put together a standalone repro.

@JStickler JStickler added type/question type/bug Somehing is not working as expected labels Feb 13, 2025
@aceslick911
Copy link

+1 having this same issue... No idea what's happening if this is Grafana or Loki actually causing the issue...

@johnnyggalt
Copy link
Author

This seems to come down to a bug in the .NET OpenTelemetry implementation.

On startup, ASP.NET uses the HostingRequestStartingLog class to log a message. It bundles a number of attributes into that message per this code:

    public KeyValuePair<string, object?> this[int index] => index switch
    {
        0 => new KeyValuePair<string, object?>(nameof(_request.Protocol), _request.Protocol),
        1 => new KeyValuePair<string, object?>(nameof(_request.Method), _request.Method),
        2 => new KeyValuePair<string, object?>(nameof(_request.ContentType), _request.ContentType),
        3 => new KeyValuePair<string, object?>(nameof(_request.ContentLength), _request.ContentLength),
        4 => new KeyValuePair<string, object?>(nameof(_request.Scheme), _request.Scheme),
        5 => new KeyValuePair<string, object?>(nameof(_request.Host), _request.Host.Value),
        6 => new KeyValuePair<string, object?>(nameof(_request.PathBase), _request.PathBase.Value),
        7 => new KeyValuePair<string, object?>(nameof(_request.Path), _request.Path.Value),
        8 => new KeyValuePair<string, object?>(nameof(_request.QueryString), _request.QueryString.Value),
        9 => new KeyValuePair<string, object?>("{OriginalFormat}", OriginalFormat),
        _ => throw new ArgumentOutOfRangeException(nameof(index)),
    };

If any of those attribute have a null value (in my scenario, it is _request.ContentType and _request.ContentLength) then the problem is triggered.

Downstream of this, the OpenTelemetry implementation has this code in its TagWriter class:

    public bool TryWriteTag(
        ref TTagState state,
        string key,
        object? value,
        int? tagValueMaxLength = null)
    {
        if (value == null)
        {
            return false;
        }

Thus, when any attribute is written with a null value, it is not written to the stream correctly. As far as I can tell from reading the protobuf schema, it should be writing the key and a zero-length value, but instead it is just writing nothing.

I will open an issue with the .NET team and link back to here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Somehing is not working as expected type/question
Projects
None yet
Development

No branches or pull requests

3 participants