Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource merge semantics make upgrades problematic #2341

Closed
pkwarren opened this issue Nov 2, 2021 · 24 comments · Fixed by #4876
Closed

Resource merge semantics make upgrades problematic #2341

pkwarren opened this issue Nov 2, 2021 · 24 comments · Fixed by #4876
Labels
bug Something isn't working
Milestone

Comments

@pkwarren
Copy link
Contributor

pkwarren commented Nov 2, 2021

Description

We've encountered issues upon a recent upgrade to OTEL v1.1.0 (which upgraded to semconv v1.7.0) and existing code or contrib code which still used semconv v1.4.0. See open-telemetry/opentelemetry-go-contrib#1384 for an example.

The spec (https://github.com/open-telemetry/opentelemetry-specification/blob/bad49c714a62da5493f2d1d9bafd7ebe8c8ce7eb/specification/resource/sdk.md#merge) seems a bit contradictory in how this should be handled:

The interface MUST provide a way for an old resource and an updating resource to be merged into a new resource.

while also mentioning the merge algorithm (which is implemented to the spec in opentelemetry-go) as:

Else this is a merging error (this is the case when the Schema URL of the old and updating resources are not empty and are different).

From a consumer perspective, there is no real way to determine if all of the resources are compatible prior to calling resource.New(...), and thus we're failing to create a resource detector.

Instead of the current algorithm, it feels like if the two resource schemas being merged are compatible (same major version), the merge should be able to proceed. Otherwise, we'll need to coordinate an upgrade of every telemetry package together to ensure we're on a consistent version of semconv.

Possible algorithms:

  • Create merged resource using the oldest version of semconv. This might introduce new attributes which are introduced with later specs, however it should still be compatible with an older version.
  • Create merged resource using the newest version of semconv (perhaps with migration of certain well known attributes - i.e. rpc.jsonrpc.method -> rpc.method).
  • Create unversioned resource if the versions don't match.

Environment

  • OS: Linux/Mac
  • Architecture: x86_64
  • Go Version: 1.16+
  • opentelemetry-go version: v1.1.0

Steps To Reproduce

  1. Configure a resource with a combination of resource detectors which use different versions of semconv package (i.e. v1.4.0 + v1.7.0).
  2. resource.New fails because two different semconv versions are in use:
    } else {
    return Empty(), errMergeConflictSchemaURL
    }

Expected behavior

Resource detector should be able to be initialized.

@pkwarren pkwarren added the bug Something isn't working label Nov 2, 2021
@Aneurysm9
Copy link
Member

This is a problem we're in the process of addressing. The semantic conventions now include a schema that describes changes from version to version in a way that will enable migrating a resource from one version to the next. We recently landed #2267 which gives us the ability to parse those schema definitions. The next step is to implement the transformations required and then to update the Resource.Merge() method to migrate resources with differing schemas to a common schema.

@seanschade
Copy link
Contributor

@Aneurysm9 I can see where renaming attributes could be helpful in some cases, but also could break dashboards and alerts in an observability platform.

Will there be an option to include both the old and new attributes on a span, metric, or log?
Are there plans for the collector to use the same merge logic? I could see having a processor that transforms attributes as well.

We try and keep our internal libraries in sync with otel. My concern with semconv is that there could be subtle, breaking changes in the telemetry data that affects our observability.

@Aneurysm9
Copy link
Member

There are plans for the collector to have a processor with the same logic and the ability to normalize all incoming attributes to a given schema version.

@MrAlias
Copy link
Contributor

MrAlias commented May 3, 2022

Closing as this is now addressed with the OTel schema.

@MrAlias MrAlias closed this as completed May 3, 2022
@terinjokes
Copy link

@MrAlias How was this resolved? I got bit by this today.

@bboreham
Copy link
Contributor

bboreham commented Mar 8, 2023

In case it helps someone else, I had to track down this version number, and copy it into my program so it matched:

semconv "go.opentelemetry.io/otel/semconv/v1.17.0"

(notwithstanding the release notes telling me v1.18.0 is part of the 1.14 release)

@jufemaiz
Copy link

In case it helps someone else, I had to track down this version number, and copy it into my program so it matched:

semconv "go.opentelemetry.io/otel/semconv/v1.17.0"

(notwithstanding the release notes telling me v1.18.0 is part of the 1.14 release)

This looks like a bug to me – if the schema is being returned as v1.17.0 when v1.18.0 had been released.

@jufemaiz
Copy link

Closing as this is now addressed with the OTel schema.

Can you identify how that is the case?

https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/resource.go#L150-L193

This explicitly requires identical schemas between two resources.

@jufemaiz
Copy link

This is a problem we're in the process of addressing. The semantic conventions now include a schema that describes changes from version to version in a way that will enable migrating a resource from one version to the next. We recently landed #2267 which gives us the ability to parse those schema definitions. The next step is to implement the transformations required and then to update the Resource.Merge() method to migrate resources with differing schemas to a common schema.

Is there an update here? Is there anything I could look to contribute?

@jufemaiz
Copy link

@Aneurysm9 do you have further details or should I open something separately up?

@pellared pellared reopened this Mar 28, 2023
@Aneurysm9
Copy link
Member

I'm not sure there is anything we can do in the Go implementation at this time to improve the situation. The specification does not detail how to perform resource merging with multiple schema URLs involved, instead it defines this as an error and states that the resulting resource is undefined. There is a proposal to separate semantic conventions from the specification, which may result in a new schema family and versions, which I think would only further complicate this situation.

While I think it might be argued that it would be spec compliant to return an error and a merged resource with the schema URL removed, doing so would likely condition users to ignore the error and result in surprising behavior when the implementation changes and/or different behavior is specified.

@jufemaiz
Copy link

jufemaiz commented Mar 30, 2023

Cheers @Aneurysm9. Yep, agreed that the spec does not detail how to manage, however, as highlighted here (and elsewhere through open-telemetry/opentelemetry-go-contrib#3657 with a response @ #3944) opentelemetry-go hard codes schema URLs in a number of locations, as opposed to having either (a) passing it in as a function argument (optional or not); or (b) allowing for a "schema migration" to a given target. This is also not something that is specified in the otel spec itself (afaik, happy to be told otherwise). As such, that hard coding, and its entwinement with the release cycle and dependency management (security issues especially) means that there's some critical issues wrt the "brittleness" of this package (as I've noted in the linked issue raised in otel-go-contrib). This is especially the case if the diff in schema from one to the next is minor, and there is a growing use of third party services that necessarily reference schema definitions. A better "best practice" IMHO is needed across the go library to allow for keeping the maintenance of the package separate from the availability and provision of different schemas.

I'm keen to get more involved in this and understand that the wee hours of the Thursday morning (AEST) is the meeting time so I can likely jump on from next week onwards.

@Aneurysm9
Copy link
Member

Telemetry schema and their description of changes from one version to the next are documented in the specification. The gap lies in the resource specification including a requirement for schema URL support and stating that merging resources with different schema URLs is an error without including a requirement that schema translation be applied or that there be a mechanism for indicating that schema translation should be applied.

There was significant discussion of this topic at the maintainers' meeting today and I expect there will be more discussion of this issue at the specification SIG meeting tomorrow. I'm sorry if my earlier comment appeared to imply that I did not care about this issue or that we were not intending to do anything about it. We certainly do intend to ensure that resources with associated telemetry schema can be used easily and safely. I think that the proposal @MrAlias has made in #3944 is good, and the implementation is the logical conclusion of work started almost two years ago when we began introducing telemetry schema. However, I think it does not go far enough and the spec should instead direct that telemetry schema migration SHOULD be used to align attributes on resources to be merged, either implicitly at the newer of the two versions or explicitly at the version provided by the user. We do, however, need to be careful to ensure that we do not introduce changes to our stable SDK that are incompatible with the specification or that get ahead of specification language under consideration that may change before it is stabilized.

@MrAlias
Copy link
Contributor

MrAlias commented May 13, 2023

Here's a new repo I created to handle schema translations: https://github.com/MrAlias/otel-schema-utils

Only thing supported currently is resource translations.

JamieDanielson pushed a commit to honeycombio/otel-config-go that referenced this issue Aug 14, 2023
## Which problem is this PR solving?

When using detectors with mis-matched Schema URLs, resource.New will
return an error. Previously, this was dropped and consequently it looked
like configuration worked but many resource attributes would be missing
in the resulting telemetry.

With the recent addition of
#48, this error is
much easier to hit. For example, the detectors built into
opentelemetry-go
[import](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/builtin.go#L25)
semconv `v1.17.0` but this project [pulls
in](https://github.com/honeycombio/otel-config-go/blob/main/otelconfig/otelconfig.go#L21)
`v1.18.0`. This means that adding something like
`otelconfig.WithResourceOption(resource.WithHost())` results in a
mis-configured OTel and no error to warn you what happened.

This is all … very bad. This PR is really the bare minimum that needs to
be done here. That is, it at least causes a runtime error so that you
can tell what's going on — but it doesn't solve the problem that you
fundamentally cannot use this library with _any_ other library that
doesn't happen to use semconv `v1.18.0`. There are [a
number](open-telemetry/opentelemetry-go#2341)
of
[open](open-telemetry/opentelemetry-go#3769)
[issues](open-telemetry/opentelemetry-go-contrib#3657)
in OTel and adjacent repositories regarding this problem, which probably
warrant further investigation into a long-term sustainable solution.

## Short description of the changes

Return the `err` from `resource.New` up through the stack. Update tests
to expect this.

⚠️ I guess technically this could be considered a backwards incompatible
change, in that folks may have configurations that are "working" today
but will throw an error after this change. I put that in scare quotes
though because it's not _really_ working — it's just not throwing an
error. I feel like actually returning an appropriate error here and
letting folks know their configuration is likely not working as expected
is the right choice.

## How to verify that this has the expected result

There's a new test that specifically uses a detector with a different
schema URL.

Co-authored-by: Vera Reynolds <vreynolds@users.noreply.github.com>
@maximkosov
Copy link

maximkosov commented Aug 31, 2023

We hit this issue multiple times already because of routine package updates and without any changes from our side.
The main problem is that often the issue can only be noticed in production, and it feels like the impact is not justified by the severity of the problem we're trying to solve using schema versions.
I mean, should we really care if name of some field (which may never be really looked at) was changed and now we temporarily have duplicated field?
Especially considering, that often resources define mutually exclusive set of fields.

While I can imagine the cases where it is important to enforce compatible schemas, maybe we can just introduce a flag to ignore schema version incompatibility?

@pellared

This comment was marked as outdated.

@jufemaiz
Copy link

jufemaiz commented Sep 8, 2023

maybe we can just introduce a flag to ignore schema version incompatibility?

It would be not compliant with the specification.

I created #4484 that should help to mitigate the issue.

Note for those following here: #4484 (comment)

SIG meeting: This API will not help a lot. Closing.

@MrAlias Will try to revive the translation layer.

@tonglil
Copy link
Contributor

tonglil commented Oct 12, 2023

Something I haven't seen in the discussion currently is that semconv doesn't seem to follow semver. The way these are versioned is that not all versions are compatible, even across minor/patch releases :/

For example, I need to upgrade my code from v1.7.0 to v1.21.0, but both are under v1? Forcing me to do this to continue having working code seems like a major version bump?

@pellared
Copy link
Member

pellared commented Oct 12, 2023

Something I haven't seen in the discussion currently is that semconv doesn't seem to follow semver. The way these are versioned is that not all versions are compatible, even across minor/patch releases :/

For example, I need to upgrade my code from v1.7.0 to v1.21.0, but both are under v1? Forcing me to do this to continue having working code seems like a major version bump?

@tonglil, semantic conventions versioning policy is defined here: https://opentelemetry.io/docs/specs/otel/versioning-and-stability/#semantic-conventions-stability. Take notice that v1.7.0 and v1.21.0 are different packages. This not "usual" SemVer.

@tonglil
Copy link
Contributor

tonglil commented Oct 12, 2023

@pellared sorry to follow up but I'm looking for clarity on a couple of things.

  1. as casual consumers of otel, it's a bit unreasonable to expect them to "know" that it follows a different meaning of semver.

  2. I understand they are different packages, but how come I have to change packages if I haven't changed my code other than a dependency? If my app uses v1.7.0 to implement a schemaless detector (to set some attributes from the environment), shouldn't it continue working even if the cloud provider detector (detecting other attributes) was upgraded to v1.21.0?

  3. It feels like resource.New should be versioned with semconv if it's reliant on all detectors' semconv versions matching.

@jufemaiz
Copy link

Something I haven't seen in the discussion currently is that semconv doesn't seem to follow semver. The way these are versioned is that not all versions are compatible, even across minor/patch releases :/
For example, I need to upgrade my code from v1.7.0 to v1.21.0, but both are under v1? Forcing me to do this to continue having working code seems like a major version bump?

@tonglil, semantic conventions versioning policy is defined here: https://opentelemetry.io/docs/specs/otel/versioning-and-stability/#semantic-conventions-stability. Take notice that v1.7.0 and v1.21.0 are different packages. This not "usual" SemVer.

Excepting that the current strategy adopted in the related opentelemetry-go-contrib packages fails to ensure that each detector/instrumentation is separately implemented and available for each semconv – instead it afixes only a single semconv per release.

open-telemetry/opentelemetry-go-contrib#3657 (comment)

What's more challenging still is that within that single package for a given release, there are currently three different semconvs in use – gcp even has differing semconv within the same 'namespace'.

While it is not necessarily one to discuss here, the problems with interoperability due to brittle design decisions – particularly with distributed systems – is rather concerning to me as a consumer of the standard, especially as it is currently adopted in Go.

Right now I'm seeing version bumps due to security related issues for the contrib packages. The problem is that they don't abide by the compatibility contract of golang modules. Arguably the semconv strategy and contract of otel conflicts with golang's own module contract to the extent that IMHO each semconv alteration should see a similar vN release of a module in the opentelemetry-go-contrib space. That is, each otel semconv has its own package for each detector/instrumentation/instrgen etc module.

@wafer-bw
Copy link

wafer-bw commented Jan 19, 2024

At the very least the error message from func Merge(a, b *Resource) (*Resource, error) "cannot merge resource due to conflicting Schema URL" should be updated to include the SchemaURL() values for both a & b.

Edit1: Why is errMergeConflictSchemaURL not an exported error so we can handle this case via errors.Is?

Edit2: Forgive me if this is naive but a and b are both structs of the same type, why is the responsibility of the underlying API schema being pushed to the caller? It feels like a responsibility the resource or other otel packages would be better fit to manage.

yurishkuro pushed a commit to jaegertracing/jaeger that referenced this issue Jan 24, 2024
## Which problem is this PR solving?
- Fixes #5116 
- Replaces #5115 

## Description of the changes
- Upgrade OTel SDK to v1.22 (dependabot)
- Upgrade semconv version to v1.24 ([Reference
issue](open-telemetry/opentelemetry-go#2341)).
I tried using semconv v1.22 but it was still failing.
- Replace deprecated function `semconv.HTTPStatusCodeKey` with
`semconv.HTTPResponseStatusCodeKey`

## How was this change tested?
- 

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Chirag Ghosh <cghosh828049@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@MrAlias MrAlias added this to the v1.23.0 milestone Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment