
[SPARK-18775][SQL] Limit the max number of records written per file #16204

Closed
wants to merge 7 commits

Conversation

@rxin (Contributor) commented Dec 8, 2016

What changes were proposed in this pull request?

Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.

This patch introduces a new write config option `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the max number of records written to a single file. A non-positive value indicates there is no limit (the same behavior as not setting this flag).
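
For illustration, a minimal sketch of how this knob could be used, assuming an existing SparkSession `spark` and DataFrame `df` (the path and the value 10000 are made up; the option and conf names are the ones this patch introduces):

```scala
// Per-write option: cap each output file of this write at 10,000 records.
df.write
  .option("maxRecordsPerFile", 10000L)
  .parquet("/tmp/output")   // hypothetical output path

// Session-wide default, used when the per-write option is not set.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000L)

// A non-positive value disables the limit (the default behavior).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0L)
```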

How was this patch tested?

Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.
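
A rough sketch of the kind of check such a test can make (illustrative only, not the actual PartitionedWriteSuite code; it assumes an existing SparkSession `spark`):

```scala
import java.nio.file.Files

// Write 100 rows with a cap of 10 records per file, then verify
// that the write produced at least 10 data files.
val dir = Files.createTempDirectory("max-records-per-file-test").toFile
spark.range(100)
  .write
  .mode("overwrite")                  // the temp directory already exists
  .option("maxRecordsPerFile", 10L)
  .parquet(dir.getCanonicalPath)

val dataFiles = dir.listFiles().filter(_.getName.endsWith(".parquet"))
assert(dataFiles.length >= 10)
```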

@@ -821,8 +831,6 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {

def warehousePath: String = new Path(getConf(StaticSQLConf.WAREHOUSE_PATH)).toString

def ignoreCorruptFiles: Boolean = getConf(IGNORE_CORRUPT_FILES)
@rxin (Contributor, Author):

I moved this closer to the other file-related configs.

@SparkQA commented Dec 8, 2016

Test build #69837 has finished for PR 16204 at commit fca401f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ssimeonov (Contributor) left a comment

This will be a much-loved feature. :)

@@ -225,32 +228,50 @@ object FileFormatWriter extends Logging {
taskAttemptContext: TaskAttemptContext,
committer: FileCommitProtocol) extends ExecuteWriteTask {

private[this] var outputWriter: OutputWriter = {
private[this] var currentWriter: OutputWriter = _
@ssimeonov (Contributor) commented Dec 8, 2016

Looking through the code, three things stand out:

  1. There is code duplication between SingleDirectoryWriteTask and DynamicPartitionWriteTask when it comes to current writer management and cleanup.

  2. There is duplication within releaseResources() and newOutputWriter() of the write tasks when it comes to releasing resources.

  3. Write task state management is leaky because releaseResources() is called explicitly by executeTask(). Also, releaseResources() will be called twice when there are no exceptions and once if there is an exception in execute(), which is a bit confusing.

What about asking the base trait to do a bit more work and present a stronger contract to its users, e.g.:

```scala
private trait ExecuteWriteTask {

  protected[this] var currentWriter: OutputWriter = null

  def execute(iterator: Iterator[InternalRow]): Set[String] = {
    try {
      executeImp(iterator)
    } finally {
      releaseResources()
    }
  }

  /**
   * Writes data out to files, and then returns the list of partition strings written out.
   * The list of partitions is sent back to the driver and used to update the catalog.
   */
  protected def executeImp(iterator: Iterator[InternalRow]): Set[String]

  protected def resetCurrentWriter(): Unit = {
    if (currentWriter != null) {
      currentWriter.close()
      currentWriter = null
    }
  }

  protected def releaseResources(): Unit = {
    resetCurrentWriter()
  }
}
```

A simpler implementation would omit releaseResources() and simply call resetCurrentWriter() in finally. That is OK since all the classes are private but slightly less readable when it comes to unexpected future changes.

Contributor:

I'm wondering if we should just remove SingleDirectoryWriteTask. The few tests I tried still seem to pass with the dynamic implementation as well.

@rxin (Contributor, Author):

The dynamic one forces a sort, which is highly inefficient.

Contributor:

Ah ok. I guess you could still refactor out the record counting code to be shared, but I'm not sure it's worth it.

@rxin (Contributor, Author):

Indeed, I wanted to do that at the beginning, but given that there are only two implementations with almost no code in this part, it would be over-abstraction.

@SparkQA commented Dec 8, 2016

Test build #69853 has started for PR 16204 at commit f77730f.

@rxin changed the title from "[SPARK-18775][SQL] Limit the max number of records written per file - WIP" to "[SPARK-18775][SQL] Limit the max number of records written per file" on Dec 8, 2016
@SparkQA commented Dec 8, 2016

Test build #69854 has started for PR 16204 at commit d2172d1.

@SparkQA commented Dec 8, 2016

Test build #69843 has finished for PR 16204 at commit 3199f8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Dec 8, 2016

cc @ericl

@SparkQA commented Dec 8, 2016

Test build #3477 has finished for PR 16204 at commit d2172d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (partitionPath.nonEmpty) {
updatedPartitions.add(partitionPath)
}
} else if (description.maxRecordsPerFile > 0 &&
recordsInFile == description.maxRecordsPerFile) {
Contributor:

`>=` probably conveys the intent a little more clearly, here and in the other write task.
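
To make the suggested check concrete, here is a tiny self-contained simulation of the roll-over rule using `>=` (plain Scala; the variable names echo the patch, but nothing here is quoted from it):

```scala
// Records stream in; once the current "file" reaches the cap, roll over to a new one.
val maxRecordsPerFile = 5L                 // a non-positive value would mean "no limit"
var recordsInFile = 0L
var fileCounter = 0
val fileOfRecord = scala.collection.mutable.ArrayBuffer.empty[Int]

for (record <- 1 to 12) {
  if (maxRecordsPerFile > 0 && recordsInFile >= maxRecordsPerFile) {
    fileCounter += 1                       // close the current file, start the next one
    recordsInFile = 0
  }
  fileOfRecord += fileCounter              // "write" the record to the current file
  recordsInFile += 1
}

println(fileOfRecord.distinct.length)      // 3: 12 records with a cap of 5 => files 0, 1, 2
```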

@SparkQA commented Dec 9, 2016

Test build #69899 has finished for PR 16204 at commit ceeacde.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 9, 2016

Test build #3484 has finished for PR 16204 at commit ceeacde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl (Contributor) commented Dec 10, 2016

LGTM

}

override def execute(iter: Iterator[InternalRow]): Set[String] = {
var fileCounter = 0
var recordsInFile = 0
Contributor:

Should this be a `Long`?
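
A brief aside on why a `Long` counter is the safer choice (plain Scala, not code from the patch; it assumes the configured limit is long-valued): an `Int` counter silently wraps past about 2.1 billion records.

```scala
// Int arithmetic wraps around on overflow; Long gives ample headroom.
val intCount: Int = Int.MaxValue
println(intCount + 1)                 // -2147483648 (wrapped)

val longCount: Long = Int.MaxValue.toLong
println(longCount + 1)                // 2147483648
```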


// This must be in a form that matches our bucketing format. See BucketingUtils.
val ext = f"$bucketId.c$fileCounter%03d" +
description.outputWriterFactory.getFileExtension(taskAttemptContext)
Contributor:

Is there an assumption here on number of files per output partition ?

@rxin (Contributor, Author):

No.
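
To see why the extension format itself does not cap the number of files, here is a quick illustration of standard Scala `f` interpolation (not code from the patch): `%03d` is only a minimum width, so counters beyond 999 still produce valid, unique suffixes.

```scala
// %03d zero-pads to at least three digits and never truncates.
val bucketId = ""                       // empty for a non-bucketed write in this sketch
println(f"$bucketId.c${7}%03d")         // .c007
println(f"$bucketId.c${1234}%03d")      // .c1234
```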

@@ -379,30 +405,40 @@ object FileFormatWriter extends Logging {
val sortedIterator = sorter.sortedIterator()

// If anything below fails, we should abort the task.
var recordsInFile = 0
Contributor:

`Long` here as well.

@rxin (Contributor, Author) commented Dec 14, 2016

cc @hvanhovell

@SparkQA commented Dec 14, 2016

Test build #70145 has finished for PR 16204 at commit 5bf2b32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin (Contributor, Author) commented Dec 20, 2016

@hvanhovell don't forget this one!

@hvanhovell (Contributor) left a comment

One tiny comment regarding documentation. Otherwise LGTM.

.booleanConf
.createWithDefault(false)

val MAX_RECORDS_PER_FILE = SQLConfigBuilder("spark.sql.files.maxRecordsPerFile")
Contributor:

Should we also mention that there is a limit to the number of files produced? This might not be the best location.

@rxin (Contributor, Author):

Realistically, the limit is so high that I doubt it'd matter unless this value is set to 1.

Contributor:

Okay, that is fair. Let's merge this.

@hvanhovell (Contributor) commented:

LGTM - merging to master. Thanks!

@asfgit closed this in 354e936 on Dec 21, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Reynold Xin <rxin@databricks.com>

Closes apache#16204 from rxin/SPARK-18775.
vkorukanti pushed a commit to delta-io/delta that referenced this pull request Apr 8, 2022
Today, parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the max number of records written per file so that users can control the parquet file size to avoid humongous files. For example,

```
spark.range(100)
          .write
          .format("parquet")
          .option("maxRecordsPerFile", 5)
          .save(path)
```

The above code will generate 20 parquet files, each containing 5 rows.

This is missing in Delta. This PR adds support for it in Delta by passing the `maxRecordsPerFile` option down to ParquetFileFormat.

Note: today both Delta and parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR is just adding the `DataFrameWriter` option support to mimic the parquet format behavior.
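
For illustration, the Delta-side usage this change enables would look roughly like the following (the output path is hypothetical; the expected result mirrors the parquet example above):

```scala
spark.range(100)
  .write
  .format("delta")
  .option("maxRecordsPerFile", 5)
  .save("/tmp/delta-table")   // hypothetical path; should likewise yield 20 files of 5 rows each
```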

Fixes #781

Closes #1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944