[exporter/file] Add possibility to write telemetry in Parquet or Delta format #33807

Open
marcinsiennicki95 opened this issue Jun 28, 2024 · 8 comments

@marcinsiennicki95

Component(s)

exporter/file

Is your feature request related to a problem? Please describe.

Parquet Format:
Parquet is a columnar storage file format optimized for big data processing frameworks. It provides efficient data compression and encoding schemes, enhancing performance and reducing storage costs. Telemetry data written in Parquet format is stored in columns, making it faster to read and query specific fields.
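To illustrate the columnar idea described above, here is a minimal sketch in plain Python. It is not Parquet itself and uses no Parquet library; the record fields (`timestamp`, `service`, `duration_ms`) and the helper `to_columns` are hypothetical names chosen for illustration, and the sketch assumes every record carries the same fields.

```python
# Hypothetical telemetry records; field names are illustrative only.
rows = [
    {"timestamp": 1, "service": "checkout", "duration_ms": 12.5},
    {"timestamp": 2, "service": "checkout", "duration_ms": 7.0},
    {"timestamp": 3, "service": "cart", "duration_ms": 30.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping field name -> list of values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query over a single field reads one list,
# without visiting the other fields of every record.
total_duration = sum(columns["duration_ms"])
```

A real columnar format additionally compresses and encodes each column, which this sketch does not attempt to show.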

Delta Format:
Delta Lake is an open-source storage layer that brings ACID transactions to big data workloads. Delta format combines the reliability of data lakes with the performance of data warehouses. Writing telemetry data in Delta format allows for scalable and reliable data processing, supporting complex data pipelines and real-time analytics.

Describe the solution you'd like

Ability to write in Parquet or Delta format

Describe alternatives you've considered

No response

Additional context

No response

marcinsiennicki95 added the "enhancement" (New feature or request) and "needs triage" (New item requiring triage) labels Jun 28, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@marcinsiennicki95
Author

marcinsiennicki95 commented Jun 28, 2024

@jmacd Is this possible with the current state of OTel-Arrow? I ask because I found the following in the documentation:

https://github.com/open-telemetry/otel-arrow
4. Output OpenTelemetry data to the Parquet file format, part of the Apache Arrow ecosystem

@jmacd
Contributor

jmacd commented Jul 8, 2024

@marcinsiennicki95 there is a connection between Arrow and Parquet, but it is not an automatic translation. The way we have structured the OTel-Arrow data stream, there are multiple logical tables being exchanged within an Arrow IPC payload, both because of varying schemas within the telemetry and because of shared data references. These multiple logical tables would naturally translate into multiple Parquet files.

When writing tables of shared data across an OTel-Arrow stream, the OTel-Arrow components repeat shared data once per stream, whereas a database system could refer to past data already in the system. The tradeoffs involved between writing across the network and constructing a database are large, so to make progress on this issue we would have to settle on what the Parquet schema looks like.
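As a rough illustration of the point about multiple logical tables, the sketch below splits a batch of records into one shared-data table plus one table per distinct schema, with rows pointing at shared data by ID. This is not the OTel-Arrow schema or any agreed Parquet layout; the function `split_into_tables`, the `resource` and `resource_id` field names, and the sample records are all assumptions made for illustration. In a design like this, each resulting table would map to its own Parquet file.

```python
from collections import defaultdict


def split_into_tables(records):
    """Split records into a shared-data table and one table per schema.

    Each record is a dict holding a "resource" dict (the shared data)
    plus arbitrary other fields. Resource values must be hashable.
    """
    resource_ids = {}            # frozen resource attributes -> assigned id
    resources = []               # shared-data table, one row per distinct resource
    tables = defaultdict(list)   # schema (sorted field names) -> rows

    for record in records:
        key = tuple(sorted(record["resource"].items()))
        if key not in resource_ids:
            resource_ids[key] = len(resources)
            resources.append({"id": resource_ids[key], **record["resource"]})

        # Replace the embedded shared data with a reference to its row.
        row = {k: v for k, v in record.items() if k != "resource"}
        row["resource_id"] = resource_ids[key]

        # Rows with different field sets land in different logical tables.
        tables[tuple(sorted(row))].append(row)

    return resources, dict(tables)


records = [
    {"resource": {"service": "checkout"}, "name": "GET /pay", "duration_ms": 12},
    {"resource": {"service": "checkout"}, "name": "GET /cart", "duration_ms": 7},
    {"resource": {"service": "cart"}, "name": "db.query", "rows_returned": 3},
]

resources, tables = split_into_tables(records)
```

Here the shared data is stored once and referenced afterwards, which is the database-style choice; a streaming design would instead have to resend the shared table on each new stream.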

cc/ @lquerel

@jmacd
Contributor

jmacd commented Jul 8, 2024

(Teaser: I've been playing around with a Parquet-first telemetry data store, and it has helped me come to concrete opinions about this problem. https://github.com/jmacd/duckpond)

@marcinsiennicki95
Author

Thanks for the answer. I had a conversation on the OpenTelemetry Slack channel and found out that @atoulme was working on the Parquet format.

Contributor

github-actions bot commented Sep 9, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@atoulme
Contributor

atoulme commented Oct 2, 2024

Not anymore. As noted, the parquetexporter was not adopted, and we are working on Apache Arrow instead.

github-actions bot removed the Stale label Oct 3, 2024
Contributor

github-actions bot commented Dec 3, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Dec 3, 2024

3 participants