
Make Tracing Great Again #17693

Open
breezewish opened this issue Jun 4, 2020 · 4 comments
Labels
epic/slow-query feature/accepted This feature request is accepted by product managers priority/P1 The issue has P1 priority. sig/execution SIG execution

Comments

@breezewish
Member

breezewish commented Jun 4, 2020

Feature Request

Is your feature request related to a problem? Please describe:

TiDB already supports the TRACE statement (see the example after the list below), but it is rarely useful, for several reasons:

  1. It is not enabled by default because it is slow, which limits its use cases. For example, it cannot be used to find out why jitter happens.
  2. It does not trace TiKV, so an important piece is missing.
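
For reference, this is how the existing TRACE statement is invoked today. The syntax follows the TiDB documentation; the table `t` and the filter are just placeholders:

```sql
-- Trace a single statement; the output format can be 'row' (default) or 'json'.
-- The statement is executed, and only TiDB-side spans are returned -- nothing from TiKV.
TRACE SELECT * FROM t WHERE id = 1;
TRACE FORMAT='json' SELECT * FROM t WHERE id = 1;
```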

Tracing is in fact a very useful pillar of observability (logs, metrics, traces). We have done pretty well on logs and metrics, but pretty poorly on traces.

For now it is a bit hard to discover what causes a SQL statement to run slowly in some scenarios. The existing facilities are:

  • EXPLAIN ANALYZE: Only useful for finding out which executor takes a long time (see the example after this list); it cannot show RocksDB time, Raft time, snapshot time, etc., and it cannot inspect a SQL statement that ran in the past.
  • Tracing: As said before, it is only available in TiDB and cannot trace TiKV, and enabling it greatly affects performance, so it is rarely useful for inspecting a SQL statement that ran in the past.
  • Metrics: They aggregate execution information over all SQL statements, so when the workload is mixed it is very difficult to attribute time to a single statement. Also, metrics are not grouped in a way that is easy to follow along the lifetime of a SQL statement.
  • Log / slow log: Different components output their own logs and it is very hard to link them together. Some logs only contain aggregated data and are not precise enough.
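
As an illustration of the first facility above, a minimal example (output columns vary by TiDB version and are omitted here; the table `t` is a placeholder):

```sql
-- EXPLAIN ANALYZE executes the statement and reports per-executor statistics
-- (execution time, rows produced, etc.), but it does not break time down into
-- RocksDB / Raft / snapshot phases, and it cannot be applied retroactively to
-- a statement that already finished.
EXPLAIN ANALYZE SELECT * FROM t WHERE id = 1;
```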

The new tracing implementation is supposed to provide diagnostic capability for such scenarios.

Describe the feature you'd like:

Improve the tracing feature.

Stage 1.

  1. Tracing should be efficient enough to be enabled by default for all SQL queries, with traces stored for expensive or slow queries. In this way, we can know why a slow query happened. However, some users may care a lot about performance; in that case, we provide a global variable and/or session variable to disable it (see the sketch after this list).
  2. Tracing integrates spans from TiKV (and maybe PD) in one place.
  3. The existing low-performance tracing facility in TiDB shall be dropped.
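
A minimal sketch of the switch from item 1, assuming it is exposed as an ordinary system variable; the variable name below is hypothetical and not decided by this issue:

```sql
-- Hypothetical switch: tracing stays on by default, but performance-sensitive
-- users can turn it off globally or per session.
SET GLOBAL tidb_enable_trace = OFF;
SET SESSION tidb_enable_trace = OFF;
```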

The result of tracing can be further displayed in TiDB Dashboard in a nice way, for example, like what DataDog does:

[image: example of DataDog's trace view]

Stage 2.

Users usually think in terms of executors rather than spans. Tracing should be integrable with execution plans, allowing users to easily map each span to an executor, or providing a way to know why a specific executor is slow.

However, an overall tracing view is still necessary, since some spans do not belong to any executor.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

The tracing facility implementation in TiKV is already finished: https://github.com/pingcap-incubator/minitrace . It is super-efficient and can trace a span within 20 ns. Even for the shortest requests, like point get, tracing 100 spans only introduces a 6% performance loss (note that in real life we are likely to trace fewer than 10 spans for a point get, so the performance loss is negligible).

Tracing spans can be a data source for the corresponding fields of the slow log and metrics, to avoid measuring the same duration repeatedly (and ending up with inconsistent durations).

@breezewish breezewish added the type/feature-request Categorizes issue or PR as related to a new feature. label Jun 4, 2020
@SunRunAway
Contributor

cc @qw4990 I think it's a considerable reimplementation of the entries in the slow log.

@shenli
Member

shenli commented Jul 16, 2020

It would be great if we could show the tracing result in DataDog. DataDog is popular, and I have heard from several users that they expect to see metrics or tracing in DataDog.
https://github.com/DataDog/dd-trace-go
The lack of an official Rust client may be an issue. I can only find this: https://github.com/pipefy/datadog-apm-rust

@zz-jason
Member

@breeswish @IANTHEREAL I moved it to the P1 priority, please confirm if that's correct.

@IANTHEREAL
Contributor

@zz-jason yes, do it

@zz-jason zz-jason added the feature/accepted This feature request is accepted by product managers label Jul 29, 2020
@scsldb scsldb added the priority/P1 The issue has P1 priority. label Jul 29, 2020
@IANTHEREAL IANTHEREAL removed the type/feature-request Categorizes issue or PR as related to a new feature. label Aug 10, 2020