[MINOR][SQL][DOC] Correct parquet nullability documentation #22759

dima-asana · 2018-10-17T21:14:33Z

What changes were proposed in this pull request?

Parquet files appear to keep nullability info when they are written; columns are converted to nullable when the files are read, not when they are written.

How was this patch tested?

Some test code (running Spark 2.3, but the relevant code in DataSource looks identical on master):

case class NullTest(bo: Boolean, opbol: Option[Boolean])
val testDf = spark.createDataFrame(Seq(NullTest(true, Some(false))))

defined class NullTest
testDf: org.apache.spark.sql.DataFrame = [bo: boolean, opbol: boolean]

testDf.write.parquet("s3://asana-stats/tmp_dima/parquet_check_schema")

spark.read.parquet("s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet").printSchema()
root
|-- bo: boolean (nullable = true)
|-- opbol: boolean (nullable = true)

Meanwhile, the Parquet file that was written does contain nullability info:

[]batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 ls s3://asana-stats/tmp_dima/parquet_check_schema/
2018-10-17 21:03:52 0 _SUCCESS
2018-10-17 21:03:50 504 part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batch@prod-report000:/tmp/dimakamalov-batch$ aws s3 cp s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet .
download: s3://asana-stats/tmp_dima/parquet_check_schema/part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet to ./part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
[]batch@prod-report000:/tmp/dimakamalov-batch$ java -jar parquet-tools-1.8.2.jar schema part-00000-b1bf4a19-d9fe-4ece-a2b4-9bbceb490857-c000.snappy.parquet
message spark_schema {
required boolean bo;
optional boolean opbol;
}
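The same check can be sketched locally without S3 or parquet-tools. This is a minimal sketch, not the test added in this PR: it assumes Spark and parquet-hadoop 1.10 or later on the classpath, and the object name `ParquetNullabilityCheck`, its `run` method, and the temp-directory layout are hypothetical. It writes a two-column DataFrame with one non-nullable column, then compares the schema Spark reports on read with the repetition recorded in the file footer.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical helper object; none of these names come from the PR.
object ParquetNullabilityCheck {

  /** Returns (nullable flag per column as reported by Spark on read,
    * repetition per column as stored in the Parquet file footer). */
  def run(spark: SparkSession): (Map[String, Boolean], Map[String, String]) = {
    val schema = StructType(Seq(
      StructField("cl1", IntegerType, nullable = false),
      StructField("cl2", IntegerType, nullable = true)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1, 2))), schema)

    // Write to a fresh local directory (the target itself must not exist yet).
    val outDir = new File(Files.createTempDirectory("nullability").toFile, "out")
    df.write.parquet(outDir.getAbsolutePath)

    // Read side: the schema Spark reports for the files it just wrote.
    val sparkNullable = spark.read.parquet(outDir.getAbsolutePath)
      .schema.fields.map(f => f.name -> f.nullable).toMap

    // Write side: the schema stored in the footer of one part file.
    val partFile = outDir.listFiles().filter(_.getName.endsWith(".parquet")).head
    val reader = ParquetFileReader.open(
      HadoopInputFile.fromPath(new Path(partFile.toURI), new Configuration()))
    val fileRepetition =
      try {
        val parquetSchema = reader.getFooter.getFileMetaData.getSchema
        Seq("cl1", "cl2")
          .map(c => c -> parquetSchema.getType(c).getRepetition.name())
          .toMap
      } finally reader.close()

    (sparkNullable, fileRepetition)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("nullability-check").getOrCreate()
    try {
      val (sparkNullable, fileRepetition) = run(spark)
      println(s"Spark schema after read (nullable): $sparkNullable")
      println(s"Parquet footer repetition: $fileRepetition")
    } finally spark.stop()
  }
}
```

If the behavior described above holds, the first map should mark both columns as nullable while the second should still distinguish the required column from the optional one.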

srowen · 2018-11-08T17:25:44Z

@dima-asana it looks like this was on purpose: 2f38378 CC @gatorsmile
I agree, though: it doesn't appear that columns are actually forced to be nullable on write.

gatorsmile · 2018-11-10T05:08:59Z

LGTM

Could you do us a favor and add test cases ensuring that the generated Parquet files have the correct nullability values?

gatorsmile · 2018-11-10T05:09:45Z

docs/sql-programming-guide.md

@@ -706,7 +706,7 @@ data across a fixed number of buckets and can be used when a number of unique va

 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
 Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
-of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
+of the original data. When reading Parquet files, all columns are automatically converted to be nullable for


This file has been reorganized. Could you merge the latest master?

srowen · 2018-11-26T12:47:31Z

@dima-asana can you rebase, and add a simple test case or else find one that does demonstrate the behavior here?

srowen · 2018-12-04T14:48:09Z

Ping @dima-asana to rebase or close

dima-asana · 2018-12-05T23:27:51Z

> @dima-asana can you rebase, and add a simple test case or else find one that does demonstrate the behavior here?

done, sorry for the delay

srowen

It seems fine, I just had a few trivial suggestions.

srowen · 2018-12-06T14:35:30Z

sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala

@@ -542,6 +551,35 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
    }
  }

+  test("parquet - column nullability -- write only") {
+    val schema = StructType(
+      StructField("cl1", IntegerType, nullable = false) ::


Nit: could we indent these at the same level?

srowen · 2018-12-06T14:36:12Z

sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala

+        new PrimitiveType(Repetition.OPTIONAL, PrimitiveTypeName.INT32, "cl2")
+      )
+
+      assert (expectedParquetSchema == parquetSchema)


Nit: I think ideally you'd use the === test operator, so that failures generate a better message.
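As a rough illustration of the suggestion, the sketch below runs a deliberately failing `===` assertion and captures the message ScalaTest attaches to it. It assumes ScalaTest is on the classpath; the object name `TripleEqualsSketch` and the two string sequences are hypothetical stand-ins for the Parquet schemas compared in the test. How much the message differs from a plain `==` comparison depends on the ScalaTest version in use.

```scala
import org.scalatest.Assertions._
import org.scalatest.exceptions.TestFailedException

// Hypothetical stand-ins for the schemas compared in the test under review.
object TripleEqualsSketch {
  val expectedSchema = Seq("required int32 cl1", "optional int32 cl2")
  val actualSchema = Seq("optional int32 cl1", "optional int32 cl2")

  /** Runs a deliberately failing `===` assertion and returns its failure message. */
  def failureMessage(): String = {
    val failure = intercept[TestFailedException] {
      assert(expectedSchema === actualSchema)
    }
    failure.getMessage
  }

  def main(args: Array[String]): Unit = println(failureMessage())
}
```

Printing the message this way makes it easy to see what a failing test run would report for the two operands.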

srowen · 2018-12-06T14:36:30Z

sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala

+      f.close
+
+      // the write keeps nullable info from the schema
+      val expectedParquetSchema: Seq[PrimitiveType] = Seq(


This also really doesn't matter, but you can simplify the code by omitting type annotations like this one.

Parquet files appear to have nullability info when being written. However, they do lose nullability info while read

SparkQA · 2018-12-06T23:53:27Z

Test build #4459 has finished for PR 22759 at commit ac138ff.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes apache#22759 from dima-asana/dima-asana-nullable-parquet-doc.

Authored-by: dima-asana <42555784+dima-asana@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

gatorsmile reviewed Nov 10, 2018

dima-asana force-pushed the dima-asana-nullable-parquet-doc branch from 0c6ae56 to acede3e Compare December 5, 2018 23:25

srowen reviewed Dec 6, 2018

Correct parquet nullability documentation

ac138ff

Parquet files appear to have nullability info when being written. However, they do lose nullability info while read

dima-asana force-pushed the dima-asana-nullable-parquet-doc branch from acede3e to ac138ff Compare December 6, 2018 19:10

srowen approved these changes Dec 7, 2018

asfgit closed this in bd00f10 Dec 7, 2018
