DB Convention does not cover batch/multi/envelope operations #712

Closed
ndimiduk opened this issue Dec 6, 2021 · 10 comments · Fixed by #1072

Comments

@ndimiduk

ndimiduk commented Dec 6, 2021

Our API has a small alphabet of relatively simple key-value operations (get, put, delete, etc.). For these, the operation names seem clear. We also have a set of operations that support bulk/batching of operations. These can be homogeneous or heterogeneous. For example, batch can accept a set of any combination of get, put, delete, etc. We also support a generic system for server-side compare-and-mutate, where some predicate based on a query over existing data is provided, and when the predicate returns true, some operation is applied; that operation can be a simple or a batch operation. For these collections of heterogeneous operations, how should we annotate the span?

@arminru
Member

arminru commented Dec 13, 2021

Hey! Thanks for your question.
Would it be possible and make sense for you to track the individual operations within a batch with separate spans? Then you'd have one generic span for the batch (not following any OTel conventions) and the individual spans (following the DB conventions) would be children of that one.

@ndimiduk
Author

Would it be possible and make sense for you to track the individual operations within a batch with separate spans?

Anything is possible ;)

Do you mean that the server side of the batch operation should make spans for each op nested within the batch? Or would you want to create these child spans on the client side? They all funnel through a single RPC and associated span pair.

Then you'd have one generic span for the batch (not following any OTel conventions) ...

Why do you say that a span for the batch operation is not following any OTel conventions? We have a data-action method in the API called batch. Why should it not be recorded as a db.operation?

@arminru
Member

arminru commented Dec 14, 2021

Do you mean that the server side of the batch operation should make spans for each op nested within the batch? Or would you want to create these child spans on the client side? They all funnel through a single RPC and associated span pair.

Why do you say that a span for the batch operation is not following any OTel conventions? We have a data-action method in the API called batch. Why should it not be recorded as a db.operation?

The database semantic conventions were only designed for client-side calls, not for the server end. If you could share, in a separate issue, some details on what your server instrumentation looks like and what you would expect from a semantic convention for calls from the server's perspective, we can look into crafting such conventions based on the client ones.

Ah, I did not consider batch an operation on its own but rather a more abstract set of operations. Let's try something else then.
I think you could have a parent span for the batch and then multiple children for the respective operations as they are added to the batch, each following the DB semantic conventions. However, I assume the call for the batch will be executed at once, so you wouldn't be able to track the timing of each individual operation and would thus have a set of zero-duration spans just for the purpose of their attributes. This way you can inspect the content of your batch on your tracing backend and still have the same selectors in place to find the spans as if the operations were executed individually. An error would likely only be reflected collectively on the batch span, however.

    [=DB batch=========================]
    [] <- DB delete
    [] <- DB put
      [=RPC=========================]
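
The layout above can be pictured with a small stand-in data model. This is only a sketch, not the OpenTelemetry API: `Span`, `record_batch`, the `mydb` system name, and the integer timestamps are all assumptions made for illustration. It shows child spans that carry `db.operation` with zero duration while the batch span carries the real timing; the RPC span is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    name: str
    start: int
    end: int
    attributes: dict = field(default_factory=dict)
    parent: Optional["Span"] = None

    @property
    def duration(self) -> int:
        return self.end - self.start


def record_batch(operations, start, end):
    """Build one batch span plus a zero-duration child span per operation."""
    batch = Span("DB batch", start, end,
                 {"db.system": "mydb", "db.operation": "batch"})
    children = [
        # Children start and end at the batch start: they exist only for
        # their attributes, not for timing.
        Span(f"DB {op}", start, start,
             {"db.system": "mydb", "db.operation": op}, parent=batch)
        for op in operations
    ]
    return batch, children


batch, children = record_batch(["delete", "put"], start=100, end=180)
for child in children:
    print(child.name, child.duration, child.parent.name)
```

A backend query on `db.operation = delete` would match the child span here exactly as it would match a standalone delete, which is the point of this layout.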

Alternatively, if your database client library accepts the operations individually and only combines them into a batch later on, you could make each individual operation span the child of its "actual" causal parent and use span links to link the batch span to them instead.

    [====] <- SomeActionCausingADeleteOperationExecutedInDb 
     [] <- DB delete
--
        [===] <- SomeActionCausingAGetOperationExecutedInDb
         [] <- DB get
--
                         [=DB batch=========================] (linking to both DB delete and DB get from above)
                           [=RPC=========================]

The batch going over the wire in one RPC call would be modeled as a child of the batch span in either case.
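
The span-link variant can be sketched the same way. Again this is a stand-in model and not the OpenTelemetry API: `Span` and `BatchingClient` are hypothetical names, and a link is represented simply as a reference to another span object.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    name: str
    attributes: dict = field(default_factory=dict)
    parent: Optional["Span"] = None
    links: List["Span"] = field(default_factory=list)


class BatchingClient:
    """Toy client that accepts operations one by one and flushes them together."""

    def __init__(self):
        self._pending: List[Span] = []

    def add(self, operation: str, cause: Span) -> Span:
        # Each operation span is parented to the action that caused it.
        span = Span(f"DB {operation}", {"db.operation": operation}, parent=cause)
        self._pending.append(span)
        return span

    def flush(self):
        # The batch span has no single causal parent; it links to every
        # operation span it carries instead.
        batch = Span("DB batch", {"db.operation": "batch"},
                     links=list(self._pending))
        self._pending.clear()
        # The single RPC over the wire is a child of the batch span.
        rpc = Span("RPC", parent=batch)
        return batch, rpc


client = BatchingClient()
delete_cause = Span("SomeActionCausingADeleteOperationExecutedInDb")
get_cause = Span("SomeActionCausingAGetOperationExecutedInDb")
client.add("delete", delete_cause)
client.add("get", get_cause)
batch, rpc = client.flush()
print([link.name for link in batch.links])
```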

@Apache9

Apache9 commented Dec 16, 2021

In HBase, we apply the batch on the server side as a whole (almost). The workflow is like this:

RPC server receives a batch -> grab all the row locks -> Build the WAL edit for all the operations -> Write out the WAL edit -> Apply all the operations to memstore -> advance the MVCC number -> return

So typically I do not think it is possible to use different spans to trace different operations in the batch. As you can see, although in every step we will likely process the operations one by one, at a higher level each step processes all the operations before moving on to the next step. It would be very strange to create a span for each operation and switch between them all the time...
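
The interleaving described here can be made concrete with a toy model. This is not HBase code: the phase names are paraphrased from the workflow above, every phase is treated as per-operation for simplicity, and `apply_batch` and `span_switches` are hypothetical helpers that only record and count the order in which operations are visited.

```python
PHASES = [
    "acquire row lock",
    "build WAL edit",
    "write WAL edit",
    "apply to memstore",
]


def apply_batch(operations):
    """Return the order in which a phase-by-phase server touches each operation."""
    visited = []
    for phase in PHASES:        # outer loop: one phase at a time
        for op in operations:   # each phase walks the whole batch
            visited.append((phase, op))
    return visited


def span_switches(visited):
    """Count how often the active per-operation span would have to change."""
    return sum(1 for prev, cur in zip(visited, visited[1:]) if prev[1] != cur[1])


order = apply_batch(["put row1", "delete row2", "put row3"])
print(span_switches(order))
```

With three operations and four phases, every consecutive step touches a different operation, so a per-operation span would have to be swapped 11 times. Processing each operation start to finish would need only 2 swaps.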

Thanks.

@ndimiduk
Author

ndimiduk commented Jan 6, 2022

@arminru @bogdandrutu I wonder if you have any thoughts about the PR linked here. The idea is to expose a summary of the content of a batch operation as an additional attribute that is implementation-specific. Specifically, I hope that a span storage/query system would be able to make use of that attribute to enable operators to find all spans that execute a given operation, whether that operation is executed at the top level or it is a part of a batch operation.

jack-berg transferred this issue from open-telemetry/opentelemetry-specification Feb 7, 2024
@lmolkova
Contributor

lmolkova commented Feb 9, 2024

Assuming there is just one bulk operation that deals with the batch as a whole, I can think of the following solutions.

For example, take a bulk operation consisting of ["get foo", "delete bar"].

Option 1. Attributes with array values

  • db.operation = bulk
  • db.mydb.sub_operations = [get, delete], db.mydb.some_other_attribute = [foo, bar], ...

Cons:

  • bulk operations are common and we should consider defining sub-operation attributes in top db namespace
  • the relationship between elements in attribute arrays is based on index (if we record more than one attribute per sub-operation), which is subtle and error-prone
  • bulk operations with a lot of sub-operations might produce attribute values that are too long for backends with low limits on attribute length
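
A minimal sketch of how a consumer would have to read Option 1, assuming span attributes arrive as a plain dictionary. The attribute names are the placeholder ones from the example above; `sub_operations` and `spans_running` are hypothetical helpers. It shows both the index-based pairing the second con refers to and a query that finds an operation whether it ran standalone or inside a bulk call.

```python
def sub_operations(attributes):
    """Pair up parallel array attributes by index; fail loudly on a length mismatch."""
    ops = attributes["db.mydb.sub_operations"]
    keys = attributes["db.mydb.some_other_attribute"]
    if len(ops) != len(keys):
        # Nothing in the data model enforces this, which is why the
        # index-based relationship is fragile.
        raise ValueError("parallel attribute arrays have different lengths")
    return list(zip(ops, keys))


def spans_running(spans, operation):
    """Find spans that run `operation` directly or as part of a bulk call."""
    return [
        s for s in spans
        if s.get("db.operation") == operation
        or operation in s.get("db.mydb.sub_operations", [])
    ]


bulk = {
    "db.operation": "bulk",
    "db.mydb.sub_operations": ["get", "delete"],
    "db.mydb.some_other_attribute": ["foo", "bar"],
}
single = {"db.operation": "get"}

print(sub_operations(bulk))
print(len(spans_running([bulk, single], "get")))
```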

Option 2. Events/logs

  • db.operation = bulk
  • Plus we emit an event for each sub-operation that contains grouped attributes describing that operation

We do something similar in messaging (with links though):

  • if a batch of messages is sent, the send operation should have links to all messages being sent, and their unique per-message properties should be on the link.
  • if there is just one message, its properties should be recorded either on the link or on the span.

DB operations don't have an individual trace context, so links are not suitable here, but events could work. It should then also be easier to enable or disable sub-operation reporting depending on the needs. The drawback is that events/logs could go to a different backend.

Cons:

  • if we don't have a case for multiple attributes describing one operation, this seems like overkill and Option 1 would be a better choice.
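
Option 2 might look roughly like the following. This is a stand-in model rather than the OpenTelemetry API: `Span`, `Event`, `record_bulk`, the event name `db.mydb.sub_operation`, and the `record_events` switch are all assumptions for illustration. Each event groups the attributes of one sub-operation, so no index matching is needed, and reporting can be turned off without changing the span itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """Hypothetical span event carrying the attributes of one sub-operation."""
    name: str
    attributes: Dict[str, str]


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not the OpenTelemetry API)."""
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)


def record_bulk(sub_ops, record_events=True):
    """Create the bulk span and, optionally, one event per sub-operation."""
    span = Span("bulk", {"db.operation": "bulk"})
    if record_events:
        for op, key in sub_ops:
            span.events.append(Event(
                "db.mydb.sub_operation",  # assumed event name
                {"db.operation": op, "db.mydb.some_other_attribute": key},
            ))
    return span


verbose = record_bulk([("get", "foo"), ("delete", "bar")])
quiet = record_bulk([("get", "foo"), ("delete", "bar")], record_events=False)
print(len(verbose.events), len(quiet.events))
```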

Option 3. Creating artificial spans per sub-operation

Cons:

  • misleading
  • costly (both perf and volume)

Additional things to consider:

  • we should have an attribute that records the size of a batch (in case of messaging it's called messaging.batch.message_count)
  • metrics:
    • we should consider having a metric that measures number of sub-operations (since db.operation.duration would not provide a count for bulk operations)
    • bulk duration could be a different metric (e.g. with different histogram boundaries)

@jcocchi
Contributor

jcocchi commented Feb 9, 2024

Cosmos DB is currently creating a string attribute db.cosmosdb.batch_operations with each operation type in the batch and the count for that type. Adding an attribute to the convention would be useful to standardize this.

We should be able to capture:

  • Overall count for batch
  • Operations in batch
  • Optionally: more information for each batch operation according to each db's requirements (count, status code etc.)

I prefer the simplicity of Option 1 (attributes with array values), but agree it creates a challenge for additional information about each batch operation. The most important piece of additional information for Cosmos DB is operation count, so maybe something like the following could work: db.batch.count = 6 and db.mydb.sub_operations = [get:2, delete:4]. Capturing information beyond operation count may be too clunky in this format, though.
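
As a rough illustration of the `op:count` format, the sketch below builds both attributes from a list of operation names and parses them back. The attribute names are the ones proposed in this comment and are not part of any published convention; `batch_attributes` and `parse_sub_operations` are hypothetical helpers.

```python
from collections import Counter


def batch_attributes(operations):
    """Summarize a batch as a total count plus 'operation:count' entries."""
    counts = Counter(operations)  # keeps first-seen order of operation types
    return {
        "db.batch.count": len(operations),
        "db.mydb.sub_operations": [f"{op}:{n}" for op, n in counts.items()],
    }


def parse_sub_operations(values):
    """Turn ['get:2', 'delete:4'] back into {'get': 2, 'delete': 4}."""
    parsed = {}
    for value in values:
        op, _, count = value.rpartition(":")
        parsed[op] = int(count)
    return parsed


attrs = batch_attributes(["get", "delete", "delete", "get", "delete", "delete"])
print(attrs)
print(parse_sub_operations(attrs["db.mydb.sub_operations"]))
```

Every consumer needs a parser like `parse_sub_operations`, which hints at why packing more than a count into each entry would get clunky.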

@roji

roji commented Feb 9, 2024

@lmolkova isn't that conflating "batch" with "bulk", by proposing db.operation = bulk for something containing two things (get and delete)? The standard naming for this seems to be batching, whereas a bulk command usually corresponds to a command that changes multiple records (like a SQL UPDATE statement).

Regardless, the commands contained in a batch are typically the same in every aspect as a standalone command not executed in a batch; each command has a SQL statement (so db.statement), a db.operation (select, insert...), a set of parameters (something not currently represented in the semantic conventions, but which could/should be), and any other attributes which a specific database may add. For me this strongly points towards representing the batched commands as spans, which would allow querying/interpreting them just like commands which aren't batched. Introducing a new way to represent batched commands may seem simpler at first look, but actually creates two ways to represent the same logical thing, and makes the data more difficult to interpret. In effect, a batch is conceptually just a container for commands.

Note that it's true that certain attributes must be the same across all commands in the batch, e.g. the hostname, network info, etc. So these attributes could optionally be lifted up to the span representing the batch, leaving on the command only attributes which can vary (e.g. SQL, parameter info).
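
The lifting step could be sketched as below, assuming each command's attributes are available as a dictionary. `lift_shared_attributes` is a hypothetical helper, and the sample attribute values are made up for illustration. An attribute is lifted only when every command in the batch carries it with the same value.

```python
def lift_shared_attributes(commands):
    """Split per-command attribute dicts into (shared, per-command remainders)."""
    if not commands:
        return {}, []
    first, rest = commands[0], commands[1:]
    # An attribute is shared only if every command has it with the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in c and c[key] == value for c in rest)
    }
    remainders = [
        {k: v for k, v in c.items() if k not in shared} for c in commands
    ]
    return shared, remainders


commands = [
    {"server.address": "db1.example.com", "db.operation": "select",
     "db.statement": "SELECT * FROM t WHERE id = ?"},
    {"server.address": "db1.example.com", "db.operation": "insert",
     "db.statement": "INSERT INTO t VALUES (?)"},
]
batch_attrs, command_attrs = lift_shared_attributes(commands)
print(batch_attrs)
print(command_attrs)
```

One edge case to be aware of: a batch with a single command lifts every attribute, leaving the command span empty, so an implementation might want to exclude attributes such as db.operation from lifting.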

@jcocchi
Contributor

jcocchi commented Feb 9, 2024

@roji one difference between batch operations and standalone operations is the duration. If you add each operation in a batch as its own span, is the duration of each sub-operation the same as the parent? Is it 0? This could also create confusion because it may make those operations appear either abnormally quick or abnormally long.

@roji

roji commented Feb 9, 2024

@jcocchi that's true indeed... I don't know if there are other OTel cases where a larger "logical container" span wraps nested spans as in this case, and how that's best represented...


8 participants