Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[crashtracker] RFC for structured log format #554

Merged
merged 25 commits into from
Nov 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
217 changes: 217 additions & 0 deletions docs/RFCs/0005-crashtracker-structured-log-format.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
# RFC 0005: Crashtracker Structured Log Format (Version 1.0)

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [IETF RFC 2119](https://datatracker.ietf.org/doc/html/rfc2119).

## Summary

This document describes version 1.0 of the crashinfo data format.

## Motivation

The `libdatadog` crashtracker detects program crashes.
It automatically collects information relevant to the characterizing and debugging the crash, including stack-traces, the crash-type (e.g. SIGSIGV, SIGBUS, etc) crash, the library version, etc.
This RFC establishes a standardized logging format for reporting this information.

### Why structured json

As a text-based format, json can be written to standard logging endpoints.
It is (somewhat) human readable, so users can directly interpret the crash info off their log if necessary.
As a structured format, it avoids the ambiguity of standard semi-structured stacktrace formats (as used by e.g. Java, .Net, etc).
Due to the use of native extensions, it is possible for a single stack-trace to include frames from multiple languages (e.g. python may call C code, which calls Rust code, etc).
Having a single structured format allows us to work across languages.

## Proposed format

A natural language description of the proposed json format is given here.
An example is given in Appendix A, and the schema is given in Appendix B.
Any field not listed as "Required" is optional.
Consumers MUST accept json with elided optional fields.

### Extensibility

The data-format has a REQUIRED `data_schema_version` field, which represents the semver version ID of the data.
Following semver, collectors may add additional fields without affecting the major version number.
Parsers SHOULD therefore accept unexpected fields, either by ignoring them, or by displaying them as additional data.

### Fields

- `counters`: **[optional]**
A map of names to integer values.
At present, this is used by the profiler to track which operations were active at the time of the crash.
- `data_schema_version`: **[required]**
A string containing the semver ID of the crashtracker data schema ("1.0" for the current version).
- `error`: **[required]**
- `threads`: **[optional]**
An array of `Thread` objects.
In a multi-threaded program, the collector SHOULD collect the stacktraces of all active threads, and report them here.
A `Thread` object has the following fields:
- `crashed`: **[required]**
A boolean which tells if the thread crashed.
- `name`: **[required]**
Name of the thread (e.g. 'Thread 0').
- `stack`: **[required]**
The `StackTrace` of the thread.
See below for more details on how stacktraces are formatted.
- `state`: **[optional]**
Platform-specific state of the thread when its state was captured (CPU registers dump for iOS, thread state enum for Android, etc.).
Currently, this is a platform-dependent string.
- `is_crash`: **[required]**
Boolean true if the error was a crash, false otherwise.
- `kind`: **[required]**
The kind of error that occurred.
For example, "Panic", "UnhandledException", "UnixSignal".
- `message`: **[optional]**
A human readable string containing an error message associated with the stack trace.
- `source_type`: **[required]**
The string "Crashtracking".
- `stack`: **[required]**
This represents the stack of the crashing thread.
See below for more details on how stacktraces are formatted.
- `files`: **[optional]**
A `Map<filename, contents>` where `contents` is an array of plain text strings, one per line.
Useful files for triage and debugging, such as `/proc/self/maps` or `/proc/meminfo`.
- `fingerprint`: **[optional]**
A string containing a summary or hash of crash information which can be used for deduplication.
- `incomplete`: **[required]**
Boolean `false` if the crashreport is complete (i.e. contains all intended data), `true` if there is expected missing data.
This can happen becasue the crashtracker is architected to stream data to an out of process receiver, allowing a partial crash report to be emitted even in the case where the crashtracker itself crashed during stack trace collection.
This MUST be set to `true` if any required field is missing.
- `log_messages`: **[optional]**
An array of strings containing log messages generated by the crashtracker.
- `metadata`: **[required]**
Metadata about the system in which the crash occurred:
- `library_name`: **[required]**
e.g. "dd-trace-python".
- `library_version`: **[required]**
e.g. "2.16.0".
- `family`: **[required]**
e.g. "python".
- `tags`: **[optional]**
A set of key:value pairs, representing any tags the crashtracking system wishes to associate with the crash.
Examples would include "hostname", "service", and any configuration information the system wishes to track.
- `os_info`: **[required]**
The OS + processor architecture on which the crash occurred.
Follows the display format of the [os_info crate](https://crates.io/crates/os_info).
- `architecture`: **[required]**
e.g. "arm64"
- `bitness`: **[required]**
e.g. "64-bit".
- `os_type`: **[required]**
e.g. "Mac OS".
- `version`: **[required]**
e.g. "14.7.0".
- `proc_info`: **[optional]**
A place to store information about the crashing process.
In the future, this may have additional optional fields as more data is collected.
- `pid`: **[required]**
The PID of the crashing process.
- `sig_info`: **[optional]**
UNIX signal based collectors only: Useful information from the [siginfo_t](https://man7.org/linux/man-pages/man2/sigaction.2.html) structure.
- `sid_addr`: **[optional]**
A hexidecimal string with the memory address at which the fault occurred, e.g. "0xDEADBEEF".
- `si_code`: **[required]**
An integer storing the [UNIX signal code](https://man7.org/linux/man-pages/man7/signal.7.html), e.g. `1` for a `SEGV_MAPERR`.
- `si_code_human_readable`: **[required]**
The signal code expressed as a human readable string, e.g. "SEGV_MAPERR" for `SEGV_MAPERR`.
Follows the naming convention in [the manpage](https://man7.org/linux/man-pages/man7/signal.7.html).
- `si_signo`: **[required]**
An integer storing the [UNIX signal number](https://man7.org/linux/man-pages/man7/signal.7.html), e.g. `11` for a segmentation violation.
- `si_signo_human_readable`: **[required]**
The signal name, e.g. "SIGSEGV".
Follows the naming convention in [the manpage](https://man7.org/linux/man-pages/man7/signal.7.html).
- `span_ids`: **[optional]**
A vector representing active span ids at the time of program crash.
The collector MAY cap the number of spans that it tracks.
- `id`: **[required]**
A string containing the span id.
- `thread_name`: **[optional]**
A string containing the thread name for the given span.
- `timestamp`: **[required]**
The time at which the crash occurred, in ISO 8601 format.
- `trace_ids:`: **[optional]**
A vector representing active span ids at the time of program crash.
The collector MAY cap the number of spans that it tracks.
- `id`: **[required]**
A string containing the trace id.
- `thread name`: **[optional]**
A string containing the thread name for the given trace.
- `uuid`: **[required]**
A UUID v4 which uniquely identifies the crash.
This will typically be generated at crash-time, and then associated with the uploaded crash.

### Stacktraces

Different languages and language runtimes have different representations of a stacktrace.
The representation below attempts to collect as much information as possible.
In addition, not all information may be available at crash-time on a given machine.
For example, some libraries may have been shipped with debug symbols stripped, meaning that the only information available about a given frame may be the instruction pointer (`ip`) address, stored as a hex number "0xDEADBEEF".
This address may be given as an absolute address, or a `NormalizedAddress`, which can be used by backend symbolication.

A stacktrace consists of

- `format`: **[required]**
An identifier describing the format of the stack trace.
Allows for extensibility to support different stack trace formats.
The format described below is identified using the string "Datadog Crashtracker 1.0"
- `frames`: **[required]**
An array of `StackFrame`, described below.
Note that each inlined function gets its own stack frame in this schema.

#### StackFrames

- **Absolute Addresses**
The actual in-memory addresses used in the crashing process.
Combined with mapping information, such as from `/proc/self/maps`, and the relevant binaries, this can be used to reconstruct relevant symbols.
These fields follow the scheme used by the [backtrace crate](https://docs.rs/backtrace/latest/backtrace/struct.Frame.html)
- `ip`: **[optional]**
The current instruction pointer of this frame.
This is normally the next instruction to execute in the frame, but not all implementations list this with 100% accuracy (but it’s generally pretty close).
- `sp`: **[optional]**
The current stack pointer of this frame.
- `symbol_address`: **[optional]**
The starting symbol address of the frame of this function.
This will attempt to rewind the instruction pointer returned by ip to the start of the function, returning that value.
In some cases, however, backends will just return ip from this function.
- `module_base_address`: **[optional]**
The base address of the module to which the frame belongs
- **Relative Addresses**
Addresses expressed as an offset into a given library or executable.
Can be used by backend symbolication to generate debug names etc.
Note that tracking this per stack frame can entail significant duplication of information.
Adding a "modules" section and referencing it by index, as in the pprof specification, is future work.
- `build_id`: **[optional]**
A string identifying the build id of the module the address belongs to.
For example, GNU build ids are hex strings "9944168df12b0b9b152113c4ad663bc27797fb15".
Pdb build ids can be stored as a concatenation of the guid and the age (using a well-known separator).
- `build_id_type`: **[required if `build_id` is set, optional otherwise]**
The type of the `build_id`. E.g. "SHA1/GNU/GO/PDB/PE".
- `file_type`: **[required if `relative_address` is set, optional otherwise]**
The file type of the module containing the symbol, e.g. "ELF", "PDB", etc.
- `relative_address`: **[optional]**
The relative offset of the symbol in the base file (e.g. an ELF virtual address), given as a hexidecimal string.
- `path`: **[required if `relative_address` is set, optional otherwise]**
The path to the module containing the symbol.
- **Debug information (e.g. "names")**
Human readable debug information representing the location of the stack frame in the high-level code.
Note that this is a best effort collection: for optimized code, it may be difficult to associate a given instruction back to file, line and column.
- `column`: **[optional]**
The column number in the given file where the symbol was defined.
- `file`: **[optional]**
The file name where this function was defined.
Note that this may be either an absolute or relative path.
- `line`: **[optional]**
The line number in the given file where the symbol was defined.
- `function`: **[optional]**
The name of the function.
This may or may not include module information.
It may or may not be demangled (e.g. "\_ZNSt28**atomic_futex_unsigned_base26_M_futex_wait_until_steadyEPjjbNSt6chrono8durationIlSt5ratioILl1ELl1EEEENS2_IlS3_ILl1ELl1000000000EEEE" vs "std::**atomic_futex_unsigned_base::\_M_futex_wait_until_steady")

### Other data

## Appendix A: Example output

[Available here](artifacts/0002-crashtracker-example.json)

## Appendix B: Json Schema

[Available here](artifacts/0002-crashtracker-schema.json)
Loading
Loading