Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jaeger-v2] Define an internal interface of storage v2 spanstore #5399

Merged
merged 18 commits into from
May 1, 2024

Conversation

james-ryans
Copy link
Contributor

@james-ryans james-ryans commented Apr 29, 2024

Which problem is this PR solving?

Description of the changes

How was this change tested?

  • This PR hasn't been tested yet since it only contains interfaces and no actual implementation to be tested.

Checklist

@james-ryans james-ryans requested a review from a team as a code owner April 29, 2024 17:48
@james-ryans james-ryans requested a review from pavolloffay April 29, 2024 17:48
Copy link

codecov bot commented Apr 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.20%. Comparing base (df086dd) to head (6cbd5d8).
Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5399       +/-   ##
===========================================
+ Coverage   55.80%   95.20%   +39.40%     
===========================================
  Files         171      346      +175     
  Lines        9048    16908     +7860     
===========================================
+ Hits         5049    16098    +11049     
+ Misses       3527      610     -2917     
+ Partials      472      200      -272     
Flag Coverage Δ
badger_v1 10.47% <ø> (ø)
badger_v2 6.48% <ø> (+0.05%) ⬆️
cassandra-3.x 18.37% <ø> (ø)
cassandra-4.x 18.37% <ø> (ø)
elasticsearch-5.x 20.91% <ø> (ø)
elasticsearch-6.x 20.90% <ø> (ø)
elasticsearch-7.x 20.95% <ø> (ø)
elasticsearch-8.x 21.13% <ø> (-0.05%) ⬇️
grpc_v1 12.64% <ø> (ø)
grpc_v2 11.54% <ø> (ø)
kafka 10.18% <ø> (ø)
opensearch-1.x 20.99% <ø> (-0.03%) ⬇️
opensearch-2.x 20.98% <ø> (-0.06%) ⬇️
unittests 91.54% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Comment on lines 14 to 16
// WriteTrace writes batches of spans at once and
// compatible with OTLP Exporter API.
WriteTraces(ctx context.Context, td ptrace.Traces) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// WriteTrace writes batches of spans at once and
// compatible with OTLP Exporter API.
WriteTraces(ctx context.Context, td ptrace.Traces) error
// WriteTrace writes a batch of spans to storage. Idempotent.
// Implementations are not required to support atomic transactions,
// so if any of the spans fail to be written an error is returned.
// Compatible with OTLP Exporter API.
WriteTraces(ctx context.Context, td ptrace.Traces) error

// GetTrace retrieves the trace with a given id.
//
// If no spans are stored for this trace, it returns ErrTraceNotFound.
GetTrace(ctx context.Context, traceID model.TraceID) (*model.Trace, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return value should be in ptrace model, similar to Writer

GetTrace(ctx context.Context, traceID model.TraceID) (*model.Trace, error)

Isn't there a TraceID type in ptrace? I would use that - we want to do a clean break from legacy /model/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there a TraceID type in ptrace?

Yes, they have TraceID type defined in the pcommon pkg.

we want to do a clean break from legacy /model/

I see, I'll keep that in mind!

storage_v2/spanstore/reader.go Show resolved Hide resolved
// by different spans.
//
// If no matching traces are found, the function returns (nil, nil).
FindTraces(ctx context.Context, query *TraceQueryParameters) ([]*model.Trace, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, this one is tricky. We want the result to be ptrace model, but in ptrace there is not clear separation of spans from different traces, instead the separation is by resource/scope, which are not very relevant on retrieval (usually by then they are de-normalized onto individual spans in storage, although some storage backends can theoretically store fully normalized model). Plus, in order to reproduce the grouping of spans into resource/scope hierarchy we need to do relatively expensive comparison (fwiw the query service does this today for Process deduping, even though the UI does not make use of such normalization).

I believe in my RFC I proposed returning []*ptrace.TraceData from these. The deduping could be an optional step/enrichment in the query-service, rather than forcing each storage backend do that.

storage_v2/spanstore/reader.go Outdated Show resolved Hide resolved
storage_v2/proto/storage.proto Outdated Show resolved Hide resolved
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
storage_v2/spanstore/writer.go Show resolved Hide resolved
type Factory interface {
// Initialize performs internal initialization of the factory, such as opening connections to the backend store.
// It is called after all configuration of the factory itself has been done.
Initialize() error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend comparing this with OTEL collector component lifecycle methods. Eg would you need to accept telemetry objects here, or would it be passed to the constructor function? Do we need Close if we're adding lifecycle methods?

Also, different storage factories (eg dependencies storage) would need the same lifestyle interface so it should be pulled to the higher pkg.

Copy link
Contributor Author

@james-ryans james-ryans May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will pass the telemetry objects to the constructor and only receive the logger and config, which is similar to the existing factory, e.g. grpc.NewFactory(*zap.Logger, grpc.Config). I've looked into other OTEL collector extension implementations and they (mostly) only accept logger from the telemetry object.

I've added the flowchart of OTEL collector component's lifecycle below. Inside the Component.Start(context.Context, component.Host), we will call spanstore.NewFactory(*zap.Logger, component.Config) and then Factory.Initialize(context.Context). Once the component is shutting down through Component.Shutdown(context.Context), we will call Factory.Close(context.Context).

flowchart TD
    extension.Factory --> Factory.CreateExtension
    Factory.CreateExtension -->|returns| component.Component
    component.Component --> Component.Start
    Component.Start --> spanstore.NewFactory
    spanstore.NewFactory -->|returns| spanstore.Factory
    spanstore.Factory --> Factory.Initialize
    Factory.Initialize --> Running
    Running --> Factory.Close
    Factory.Close -->|called by| Component.Shutdown

    subgraph "OTEL Collector"
        extension.Factory
        Factory.CreateExtension
        component.Component
        Component.Start
        Component.Shutdown

        subgraph "Jaeger Storage Extension"
            spanstore.NewFactory
            spanstore.Factory
            Factory.Initialize
            Running
            Factory.Close
        end
    end
Loading

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
// It lives within the OTEL collector extension component's lifecycle.
// The Initialize and Close functions supposed to be called from the
// OTEL component's Start and Shutdown functions.
type Factory interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
type Factory interface {
type FactoryBase interface {

Even though it is namespaced separately from the actual factories, this interface itself is not a "Factory"

// Implementations are not required to support atomic transactions,
// so if any of the spans fail to be written an error is returned.
// Compatible with OTLP Exporter API.
WriteTraces(ctx context.Context, td *ptrace.Traces) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTEL Exporter interface passes ptrace by Value, let's do the same:

func (exp *storageExporter) pushTraces(ctx context.Context, td ptrace.Traces) error {

// by different spans.
//
// If no matching traces are found, the function returns (nil, nil).
FindTraces(ctx context.Context, query *TraceQueryParameters) ([]*ptrace.Traces, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
FindTraces(ctx context.Context, query *TraceQueryParameters) ([]*ptrace.Traces, error)
FindTraces(ctx context.Context, query *TraceQueryParameters) ([]ptrace.Traces, error)

if you return 30 traces, []*... requires 30 heap allocations for no practical gains.

// GetTrace retrieves the trace with a given id.
//
// If no spans are stored for this trace, it returns ErrTraceNotFound.
GetTrace(ctx context.Context, traceID pcommon.TraceID) (*ptrace.Traces, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GetTrace(ctx context.Context, traceID pcommon.TraceID) (*ptrace.Traces, error)
GetTrace(ctx context.Context, traceID pcommon.TraceID) (ptrace.Traces, error)

type Factory interface {
// Initialize performs internal initialization of the factory,
// such as opening connections to the backend store.
Initialize(ctx context.Context) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read through the discussions and the latest approach suggested that the Start function shouldn't throw error anymore and uses the TelemetrySettings.ReportStatus to report StatusFatalError to halt the startup asynchronously, cmiiw.

I'm thinking that any changes to the component lifecycles should be handled by the Jaeger storage extension, we can capture the errors from the storage factory in the extension and report them instead, so the storage factory interface should not be affected by it.

Copy link
Member

@yurishkuro yurishkuro May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, this will be similar to what jaeger-v1 does today. However, jaeger-v1 has a failure mode today where if the db is not ready for any reason when the binary starts (even though the Jaeger configuration pointing to the db is correct), Jaeger fails to start and the binary exists. That's not how a component of a distributed system technically should be operating, i.e. a db temporarily not available at start-up vs. becoming temporarily not available during normal operation should not be treated as different conditions. We don't exit the binary if we lose a connection later, so why do we kill it if we cannot establish the connection during Init()?

If we were to fix this, we need to establish a different expectation from storage implementations, and this is where async runtime status reporting instead of returning an error from Init is coming from. This open-telemetry/opentelemetry-collector#9324 (comment) has a good explanation.

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: James Ryans <james.ryans2012@gmail.com>
@yurishkuro yurishkuro merged commit c9036ae into jaegertracing:main May 1, 2024
38 checks passed
varshith257 pushed a commit to varshith257/jaeger that referenced this pull request May 3, 2024
…gertracing#5399)

## Which problem is this PR solving?
- Part of jaegertracing#5334

## Description of the changes
- An API design of Storage V2 spanstore and its proto file.
- For the detailed discussion and how we arrived to this design, please
take a look at
https://docs.google.com/document/d/1IvUcDsdRxMPK-xTUE32w3NSAGTkUcmnDQttN6ijUnWs/edit?usp=sharing

## How was this change tested?
- This PR hasn't been tested yet since it only contains interfaces and
no actual implementation to be tested.

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Signed-off-by: Vamshi Maskuri <gwcchintu@gmail.com>
Pushkarm029 pushed a commit to Pushkarm029/jaeger that referenced this pull request May 4, 2024
…gertracing#5399)

## Which problem is this PR solving?
- Part of jaegertracing#5334

## Description of the changes
- An API design of Storage V2 spanstore and its proto file.
- For the detailed discussion and how we arrived to this design, please
take a look at
https://docs.google.com/document/d/1IvUcDsdRxMPK-xTUE32w3NSAGTkUcmnDQttN6ijUnWs/edit?usp=sharing

## How was this change tested?
- This PR hasn't been tested yet since it only contains interfaces and
no actual implementation to be tested.

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: James Ryans <james.ryans2012@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants