
sink_parquet_cloud doesn't work when updating from 0.40 -> 0.41 #17172

Closed · a-agmon opened this issue Jun 25, 2024 · 2 comments · Fixed by #18027
Labels
bug (Something isn't working) · P-low (Priority: low) · rust (Related to Rust Polars)

Comments

a-agmon commented Jun 25, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

// `source`, `target`, and `cloud_args` (scan arguments carrying the S3
// credentials as `cloud_options`) are defined elsewhere in the reporter's code.
use polars::prelude::*;

let parquet_write_opt = ParquetWriteOptions {
    compression: ParquetCompression::Uncompressed,
    ..Default::default()
};

// `sink_parquet_cloud` returns a `PolarsResult<()>`; per the report it
// returns without error even though no file is written.
LazyFrame::scan_parquet(source, cloud_args.clone())?
    .with_streaming(true)
    .with_predicate_pushdown(true)
    .with_projection_pushdown(true)
    .select([col("A"), col("B"), col("C")])
    .group_by([col("B"), col("C")])
    .agg([col("A").count().alias("apps")])
    .sink_parquet_cloud(target, cloud_args.cloud_options.clone(), parquet_write_opt)?;

Log output

POLARS PREFETCH_SIZE: 20
RUN STREAMING PIPELINE
[parquet -> generic-group_by -> parquet_sink]
STREAMING CHUNK SIZE: 16666 rows   (repeated 19×)
POLARS ROW_GROUP PREFETCH_SIZE: 128   (repeated 19×)
finish streaming aggregation with local in-memory table

Issue description

On version 0.40 this code was used to write a small aggregation to cloud storage (S3). After updating to 0.41.*, the file is not written and no error is thrown.

Expected behavior

The file is written to S3, or an error is thrown.

Installed versions

polars = { version = "0.41.2", features = [
    "lazy",
    "aws",
    "parquet",
    "streaming",
    "performant",
    "cloud_write",
] }

a-agmon added the bug, needs triage, and rust labels on Jun 25, 2024
ritchie46 added the P-low label and removed the needs triage label on Jun 25, 2024
github-project-automation bot moved this to Ready in Backlog on Jun 25, 2024
ritchie46 (Member) commented:

This functionality was community-provided. At the moment it is fairly low priority for us; we want to redesign it properly in the new engine. I would accept a fix.

philss added a commit to elixir-explorer/explorer that referenced this issue Jul 20, 2024

This makes the "ObjectStore" choose between a single-request upload and
a multipart upload depending on the size of the chunks.

This change is needed because Polars is now calling the writer without
buffering, which was breaking the upload.

There are two tests failing for different reasons:
- the IPC upload to an unknown bucket is failing because the new
  CloudWriter is not propagating the error. This is probably an easy
  fix.
- the Lazy "parquet to cloud" test is failing for the reason written
  above. This is probably related to this issue:

  pola-rs/polars#17172
philss (Contributor) commented Jul 22, 2024

It seems that with the new object_store v0.10.0 (landed in Polars after PR #16920), the trait that puts the parts no longer buffers internally. So each call to writer.put_part makes a new network request, and I believe this is leading to errors.

This is not explicitly documented in the ObjectStore changelog, but it is described in the "removed docs" of this commit: apache/arrow-rs@96c4c0b

As a solution, we may need to use WriteMultipart, which buffers the data up to a certain capacity before making a request, or we can use the new BufWriter, which automatically chooses between a single "put" and a "put multipart" depending on the size of the payload, and buffers as the name suggests.

Do you have any preference? I can try to submit a PR with a fix for this.
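[Editor's note: a minimal sketch of the WriteMultipart option described above, assuming object_store = "0.10" and tokio as dependencies; the in-memory store and path are illustrative stand-ins for the real S3 bucket.]

use object_store::{memory::InMemory, path::Path, ObjectStore, WriteMultipart};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In-memory store stands in for S3; the path is illustrative.
    let store = InMemory::new();
    let path = Path::from("output/example.parquet");

    // In object_store v0.10, `put_multipart` returns a raw `MultipartUpload`
    // with no internal buffering; `WriteMultipart` wraps it and buffers
    // writes into appropriately sized parts before issuing requests.
    let upload = store.put_multipart(&path).await?;
    let mut writer = WriteMultipart::new(upload);
    writer.write(b"payload bytes"); // buffered; parts upload as chunks fill
    writer.finish().await?; // uploads any remainder and completes the upload
    Ok(())
}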

philss added a commit to philss/polars that referenced this issue Aug 2, 2024

The issue started after the bump of `ObjectStore` to v0.10. Before that,
ObjectStore did its own internal buffering.

The implementation uses `ObjectStore::BufWriter`, which performs a single
"put" request if the size of the data is below the capacity, and a
"put multipart" otherwise.

Fixes pola-rs#17172
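[Editor's note: a minimal sketch of the BufWriter approach the fix describes, again assuming object_store = "0.10" and tokio, with an in-memory store and an illustrative path.]

use std::sync::Arc;

use object_store::{buffered::BufWriter, memory::InMemory, path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());
    let path = Path::from("output/example.parquet");

    // BufWriter buffers internally: payloads below its capacity go out as a
    // single `put`; anything larger transparently switches to multipart.
    let mut writer = BufWriter::new(Arc::clone(&store), path);
    writer.write_all(b"payload bytes").await?;
    writer.shutdown().await?; // flushes the buffer and completes the upload
    Ok(())
}

Calling shutdown (from tokio's AsyncWriteExt) is what flushes and finalizes the upload; dropping the writer without it leaves the upload unfinished.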
github-project-automation bot moved this from Ready to Done in Backlog on Aug 6, 2024