
Spark partial limit push down #10943

Open · wants to merge 8 commits into main

Conversation

@huaxingao (Contributor) commented Aug 14, 2024

For SQL such as SELECT * FROM table LIMIT n, push down Spark's partial limit to Iceberg so that Iceberg can stop reading data once the limit is reached.
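
On the Spark side this goes through the DSv2 limit push-down hooks. A minimal sketch of that shape is below; the class and field names are illustrative, not the exact code in this PR:

import org.apache.spark.sql.connector.read.SupportsPushDownLimit;

// Illustrative ScanBuilder showing the two DSv2 limit push-down hooks.
abstract class LimitPushDownScanBuilder implements SupportsPushDownLimit {
  private Integer pushedLimit = null;

  @Override
  public boolean pushLimit(int limit) {
    this.pushedLimit = limit; // remembered and later handed to the Iceberg readers
    return true;              // tell Spark the source accepted the limit
  }

  @Override
  public boolean isPartiallyPushed() {
    // Each read task stops after `limit` rows, but Spark still applies the
    // global LIMIT on top, hence "partial" push-down.
    return true;
  }
}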


@@ -1151,6 +1152,11 @@ public ReadBuilder withAADPrefix(ByteBuffer aadPrefix) {
return this;
}

public ReadBuilder pushedlimit(int limit) {
Member:

Should we assert this is greater than 0? I assume an input of 0 or a negative value is a bad call.

Contributor Author:

I think we are OK because Spark already asserts the limit. If the limit is negative, e.g. SELECT * FROM table LIMIT -2, Spark will throw

org.apache.spark.sql.AnalysisException: [INVALID_LIMIT_LIKE_EXPRESSION.IS_NEGATIVE] The limit like expression "-2" is invalid. The limit expression must be equal to or greater than 0, but got -2.;

If the limit is 0, e.g. SELECT * FROM table LIMIT 0, Spark rewrites the query to an empty table scan, so it won't reach here.

Member:

This is public though, and anyone can call it, so we can't rely on Spark to cover us.

Contributor Author:

Right, I forgot that. I've added a check to throw an IllegalArgumentException if the limit is <= 0.

Member:

Usually we do Preconditions.checkArgument for things like this.

Contributor Author:

Changed to Preconditions.checkArgument. Thanks.
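
A sketch of the resulting builder method with that check, assuming a pushedLimit field on the builder (the exact message text here is illustrative):

public ReadBuilder pushedlimit(int limit) {
  // Reject non-positive limits up front rather than relying on Spark's own validation.
  Preconditions.checkArgument(limit > 0, "Invalid limit: %s, must be > 0", limit);
  this.pushedLimit = limit;
  return this;
}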

advance();
int expectedBatchSize;
if (numValsToRead < 0) {
  throw new IllegalStateException("numValsToRead has invalid value");
Member:

"Cannot X (because Y) (recover by Z)" - > "Cannot read a negative number of values. numValsToRead = %D"

Contributor Author:

Changed. Thanks
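
For reference, a check in that message style could look like the following (a sketch; the actual wording in the PR may differ):

Preconditions.checkState(
    numValsToRead >= 0,
    "Cannot read a negative number of values: numValsToRead = %s",
    numValsToRead);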

}

private boolean hasDeletes(org.apache.iceberg.Scan scan) {
try (CloseableIterable<FileScanTask> fileScanTasks = scan.planFiles()) {
Member:

We are now potentially scanning the file scan tasks several times before beginning a query:

  1. Push aggregates
  2. Has Deletes
  3. Actually creating tasks

I'm wondering if we should be caching this

Member:

Probably save this for another PR

Contributor Author:

I will have a follow-up PR for this.
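
For context, a hasDeletes check over the planned tasks could look roughly like this (an assumed sketch written against TableScan; the PR's signature uses the raw org.apache.iceberg.Scan type):

import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.io.CloseableIterable;

private boolean hasDeletes(TableScan scan) {
  // planFiles() is itself another pass over the table metadata, which is why
  // caching the planned tasks is worth a follow-up.
  try (CloseableIterable<FileScanTask> fileScanTasks = scan.planFiles()) {
    for (FileScanTask task : fileScanTasks) {
      if (!task.deletes().isEmpty()) {
        return true; // at least one task carries delete files
      }
    }
    return false;
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
}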

    Schema expectedSchema,
    List<Expression> filters,
    Supplier<ScanReport> scanReportSupplier,
    Integer pushedLimit) {
Member:

I'm wondering if we should include this in SparkReadConf rather than having a separate argument. I have similar thoughts around SparkInputPartition. I'm not a big fan of having to plumb the new arguments all the way through the code base, but those two options may not look great either since they aren't a great fit, imho.

Ideally I think I would want something like SparkReadContext but I don't know how often more things like this will come up

Contributor Author:

Thanks @RussellSpitzer for sharing your concern and suggestion!
I think we can borrow @aokolnychyi's idea of adding ParquetBatchReadConf and OrcBatchReadConf. We can have something like:

@Value.Immutable
public interface ParquetBatchReadConf extends Serializable {
  ParquetReaderType readerType();  // this is for comet, we don't need this for now
  int batchSize();
  @Nullable
  Integer limit();
}

@Value.Immutable
public interface OrcBatchReadConf extends Serializable {
  int batchSize();
}

Similarly, we can also have ParquetRowReadConf and OrcRowReadConf.

I have changed the code to add ParquetBatchReadConf and OrcBatchReadConf. We still need to pass pushedLimit through the SparkPartitioningAwareScan, SparkScan, and SparkBatch constructors so that it can travel from SparkScanBuilder to SparkBatch; pushedLimit is not available in SparkReadConf, so we have to call SparkScanBuilder.pushLimit to get it.

Please let me know if this approach looks OK to you. Thanks!
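
For illustration, the Immutables-generated builder would then be populated at scan construction time roughly like this (readConf and pushedLimit are assumed names from the surrounding code):

ParquetBatchReadConf parquetConf =
    ImmutableParquetBatchReadConf.builder()
        .batchSize(readConf.parquetBatchSize())
        .limit(pushedLimit) // may be null when no LIMIT was pushed down
        .build();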

@github-actions github-actions bot added the build label Sep 28, 2024