[HUDI-1722] Hive beeline/spark-sql query of specified fields on a MOR table throws NPE #2722

Merged — 1 commit, May 12, 2021

Conversation

@xiarixiaoyao (Contributor) commented Mar 25, 2021


What is the purpose of the pull request

Fix a bug introduced by HUDI-892. That PR skips adding the Hudi projection columns when there are no log files in the hoodieRealtimeSplit, but it does not consider that multiple getRecordReaders share the same jobConf.

Consider the following scenario with four getRecordReaders:

reader1 (its hoodieRealtimeSplit contains no log files)
reader2 (its hoodieRealtimeSplit contains log files)
reader3 (its hoodieRealtimeSplit contains log files)
reader4 (its hoodieRealtimeSplit contains no log files)

Now reader1 runs first: HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is set to true, and no additional Hudi projection columns are added to the jobConf (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf).

When reader2 runs later, HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP in the jobConf is already true, so no additional Hudi projection columns are added for it either (see HoodieParquetRealtimeInputFormat.addProjectionToJobConf). As a result, _hoodie_record_key is missing from the projection and the merge step throws:

2021-03-25 20:23:14,014 | INFO | AsyncDispatcher event handler | Diagnostics report from attempt_1615883368881_0038_m_000000_0: Error: java.lang.NullPointerException
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:101)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:36)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:92)
    at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.next(RealtimeCompactedRecordReader.java:43)
    at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.next(HoodieRealtimeRecordReader.java:79)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:68)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:77)
    at org.apache.hudi.hadoop.realtime.HoodieCombineRealtimeRecordReader.next(HoodieCombineRealtimeRecordReader.java:42)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:205)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:191)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:52)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapRunner.run(ExecMapRunner.java:37)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:465)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
    at org.apache.hadoop.mapred.YarnChild$1.run(YarnChild.java:183)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1761)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:177)

Obviously, this is an intermittent problem: if reader2 happened to run first, the Hudi projection columns would be added to the jobConf and the query would succeed.
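To make the ordering dependence concrete, here is a minimal, self-contained sketch of the pre-patch flow. It is illustrative only: a plain Map stands in for the shared JobConf, and the constant values and method shape are assumptions, not the actual Hudi code.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- NOT the actual Hudi source. A plain Map stands in
// for the single JobConf that all record readers of one query share; the
// property names are stand-ins mirroring the real ones.
public class SharedJobConfOrderingSketch {

  static final String HOODIE_READ_COLUMNS_PROP = "hoodie.read.columns.set"; // assumed value
  static final String PROJECTED_COLUMNS = "projected.columns";              // assumed key

  // Pre-patch behavior of addProjectionToJobConf, reduced to the bug: the first
  // reader to arrive marks the conf as handled, whether or not it actually
  // added the Hudi meta columns.
  static void addProjectionToJobConf(boolean splitHasLogFiles, Map<String, String> jobConf) {
    if (jobConf.get(HOODIE_READ_COLUMNS_PROP) == null) {
      if (splitHasLogFiles) {
        jobConf.merge(PROJECTED_COLUMNS, "_hoodie_record_key", (a, b) -> a + "," + b);
      }
      jobConf.put(HOODIE_READ_COLUMNS_PROP, "true");
    }
  }

  public static void main(String[] args) {
    Map<String, String> jobConf = new HashMap<>();
    addProjectionToJobConf(false, jobConf); // reader1: split without log files runs first
    addProjectionToJobConf(true, jobConf);  // reader2: split with log files runs later
    // _hoodie_record_key was never projected, so the merge step NPEs at runtime.
    System.out.println("projected columns: " + jobConf.get(PROJECTED_COLUMNS)); // prints: null
  }
}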

Brief change log

  • Fix HoodieParquetRealtimeInputFormat.addProjectionToJobConf to add the required projection fields when a split contains log files but the fields are missing from the shared jobConf

  • Add a unit test covering the record-reader creation order

Verify this pull request

This pull request is verified by a unit test added during review (see the discussion below).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@xiarixiaoyao (Contributor, Author) commented Mar 25, 2021

Test steps.

Before the patch:

Step 1:

val df = spark.range(0, 100000).toDF("keyid")
.withColumn("col3", expr("keyid"))
.withColumn("p", lit(0))
.withColumn("p1", lit(0))
.withColumn("p2", lit(7))
.withColumn("a1", lit(Array[String] ("sb1", "rz")))
.withColumn("a2", lit(Array[String] ("sb1", "rz")))

// create hoodie table hive_14b

merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "bulk_insert")

Note: bulk_insert produces 4 base files in the Hudi table.

Step 2:

val df = spark.range(99999, 100002).toDF("keyid")
.withColumn("col3", expr("keyid"))
.withColumn("p", lit(0))
.withColumn("p1", lit(0))
.withColumn("p2", lit(7))
.withColumn("a1", lit(Array[String] ("sb1", "rz")))
.withColumn("a2", lit(Array[String] ("sb1", "rz")))

// upsert table

merge(df, 4, "default", "hive_14b", DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL, op = "upsert")

Now we have four base files and one log file in the Hudi table.

Step 3:

spark-sql/beeline:

select count(col3) from hive_14b_rt;

The query then fails with the same NullPointerException stack trace shown in the PR description above.

After the patch:

spark-sql/hive-beeline:
select count(col3) from hive_14b_rt;
+---------+
|   _c0   |
+---------+
| 100002  |
+---------+

merge function:

// Imports assumed by this snippet (Hudi 0.8.x-era option names):
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
import org.apache.hudi.keygen.ComplexKeyGenerator
import org.apache.spark.sql.SaveMode.{Append, Overwrite}

def merge(df: org.apache.spark.sql.DataFrame, par: Int, db: String, tableName: String,
          tableType: String = DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
          hivePartitionExtract: String = "org.apache.hudi.hive.MultiPartKeysValueExtractor",
          op: String = "upsert"): Unit = {
  // bulk_insert creates the table from scratch; all other ops append to it.
  val mode = if (op.equals("bulk_insert")) Overwrite else Append
  df.write.format("hudi").
    option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType).
    option(HoodieCompactionConfig.INLINE_COMPACT_PROP, false).
    option(PRECOMBINE_FIELD_OPT_KEY, "col3").
    option(RECORDKEY_FIELD_OPT_KEY, "keyid").
    option(PARTITIONPATH_FIELD_OPT_KEY, "p,p1,p2").
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, op).
    option(HoodieWriteConfig.KEYGENERATOR_CLASS_PROP, classOf[ComplexKeyGenerator].getName).
    option("hoodie.bulkinsert.shuffle.parallelism", par.toString).
    option("hoodie.metadata.enable", "false").
    option("hoodie.insert.shuffle.parallelism", par.toString).
    option("hoodie.upsert.shuffle.parallelism", par.toString).
    option("hoodie.delete.shuffle.parallelism", par.toString).
    option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true").
    option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "p,p1,p2").
    option("hoodie.datasource.hive_sync.support_timestamp", "true").
    option(HIVE_STYLE_PARTITIONING_OPT_KEY, "true").
    option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, hivePartitionExtract).
    option(HIVE_USE_JDBC_OPT_KEY, "false").
    option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, db).
    option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName).
    option(TABLE_NAME, tableName).
    mode(mode).
    save(s"/tmp/db/$tableName")
}

@xiarixiaoyao (Contributor, Author):

@garyli1019 could you please help review this PR? Thanks.

@nsivabalan nsivabalan requested a review from garyli1019 March 30, 2021 19:36
@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Mar 30, 2021
@nsivabalan (Contributor):

@garyli1019: replace the "sev:critical" label with "sev:high" if applicable, and fix the corresponding JIRA as well if you had to change it.

@garyli1019 (Member):

@xiarixiaoyao thanks for your contribution. It looks like you were able to reproduce this problem in a unit test. Would it be possible to add that unit test to this PR as well?

@garyli1019 garyli1019 self-assigned this Apr 1, 2021
@xiarixiaoyao (Contributor, Author):

Thanks @garyli1019. OK, I will try to add a unit test.

@xiarixiaoyao xiarixiaoyao force-pushed the hive_npe branch 2 times, most recently from 6070c23 to 0a896f1 Compare April 9, 2021 13:12
@xiarixiaoyao (Contributor, Author):

@garyli1019 the unit test has been added; please review again, thanks.

@garyli1019 (Member) left a comment:

@xiarixiaoyao thanks for adding the test! @lw309637554 would you please review this PR as well?

@garyli1019 garyli1019 requested a review from lw309637554 April 10, 2021 14:00
@lw309637554 (Contributor):

@xiarixiaoyao hello, I will review it this weekend.

(Review comment on the following lines of the new unit test:)

InputSplit[] splits = combineHiveInputFormat.getSplits(jobConf, 1);
// Since the SPLIT_SIZE is 3, we should create only 1 split with all 3 file groups
assertEquals(1, splits.length);
RecordReader<NullWritable, ArrayWritable> recordReader =
Reviewer (Contributor) commented:

Hello, I see only one record reader here?

@xiarixiaoyao (Contributor, Author) replied Apr 19, 2021:

Yes, we only create one combine record reader, but this reader holds three RealtimeCompactedRecordReaders. The creation order of those RealtimeCompactedRecordReaders is what triggers this NPE.

In this test, the combine reader holds three RealtimeCompactedRecordReaders; call them creader1, creader2, and creader3:

creader1: only has a base file
creader2: only has a base file
creader3: has a base file and a log file

If creader3 is created first, the Hudi projection columns are added to the jobConf and the query succeeds. However, if creader1 or creader2 is created first, no Hudi projection columns are added to the jobConf and the query fails.

Reviewer (Contributor) replied:

got it, thanks

@lw309637554 (Contributor):

@xiarixiaoyao thanks for your contribution. Adding the unit test was very necessary. I have also left some comments on the resolution.

@xiarixiaoyao (Contributor, Author) commented Apr 19, 2021:

@lw309637554 thanks for your review. I have answered your questions; please check them, thanks.

Another question: TestHoodieCombineHiveInputFormat.testHoodieRealtimeCombineHoodieInputFormat is disabled by default. I have checked that test function and found some problems in it. Could I fix those problems and enable testHoodieRealtimeCombineHoodieInputFormat by default?

@vinothchandar (Member):

@nsivabalan same over to you to get this ready for review.

@lw309637554 (Contributor):

Regarding testHoodieRealtimeCombineHoodieInputFormat: try it.

@nsivabalan (Contributor):

@vinothchandar: I see that the author is actively responding and working on the PR. I will leave it to the author to address the feedback; if we don't see any activity for some time, I can chime in.

@xiarixiaoyao (Contributor, Author):

@lw309637554 @nsivabalan thanks for your review. I will address testHoodieRealtimeCombineHoodieInputFormat in another PR, since it has nothing to do with this problem.

@garyli1019 (Member) left a comment:

LGTM

@vinothchandar (Member) left a comment:

@lw309637554 can you please let us know if you are okay with this change? This LGTM.

  synchronized (jobConf) {
    LOG.info(
        "Before adding Hoodie columns, Projections :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
            + ", Ids :" + jobConf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
-   if (jobConf.get(HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP) == null) {
+   if (jobConf.get(HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP) == null
+       || (!realtimeSplit.getDeltaLogPaths().isEmpty() && !HoodieRealtimeInputFormatUtils.requiredProjectionFieldsExistInConf(jobConf))) {
Reviewer (Member) commented:

Can we pull this check into a small util method that we can call in both places?

@xiarixiaoyao (Contributor, Author) replied:

Yes, fixed.
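For context, a hypothetical sketch of what such an extracted helper could look like. The method name projectionFieldsMissing is illustrative, not the actual Hudi code; the calls inside it are the ones quoted in the diff above.

// Hypothetical helper (name is illustrative, not the actual Hudi source): wraps
// the condition from the diff above so both call sites share one implementation.
private static boolean projectionFieldsMissing(RealtimeSplit realtimeSplit, JobConf jobConf) {
  return jobConf.get(HoodieInputFormatUtils.HOODIE_READ_COLUMNS_PROP) == null
      || (!realtimeSplit.getDeltaLogPaths().isEmpty()
          && !HoodieRealtimeInputFormatUtils.requiredProjectionFieldsExistInConf(jobConf));
}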

@vinothchandar (Member):

@xiarixiaoyao could you please rebase this PR? I tried doing it myself, and it seems tricky.

@xiarixiaoyao (Contributor, Author):

@vinothchandar I will rebase this PR, thanks.

@codecov-commenter commented May 11, 2021

Codecov Report

Merging #2722 (8632d15) into master (ac72470) will increase coverage by 0.00%.
The diff coverage is 25.00%.


@@            Coverage Diff            @@
##             master    #2722   +/-   ##
=========================================
  Coverage     54.78%   54.78%           
- Complexity     3808     3812    +4     
=========================================
  Files           481      481           
  Lines         23320    23326    +6     
  Branches       2488     2492    +4     
=========================================
+ Hits          12775    12779    +4     
- Misses         9390     9392    +2     
  Partials       1155     1155           
Flag Coverage Δ Complexity Δ
hudicli 39.53% <ø> (ø) 220.00 <ø> (ø)
hudiclient ∅ <ø> (∅) 0.00 <ø> (ø)
hudicommon 50.37% <ø> (ø) 1975.00 <ø> (ø)
hudiflink 63.04% <ø> (ø) 530.00 <ø> (ø)
hudihadoopmr 51.01% <25.00%> (+0.05%) 266.00 <4.00> (+4.00)
hudisparkdatasource 73.33% <ø> (ø) 237.00 <ø> (ø)
hudisync 46.44% <ø> (ø) 144.00 <ø> (ø)
huditimelineservice 64.36% <ø> (ø) 62.00 <ø> (ø)
hudiutilities 69.59% <ø> (ø) 378.00 <ø> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ Complexity Δ
...i/hadoop/utils/HoodieRealtimeInputFormatUtils.java 33.91% <16.66%> (-0.95%) 12.00 <2.00> (+2.00) ⬇️
...oop/realtime/HoodieParquetRealtimeInputFormat.java 81.48% <50.00%> (ø) 7.00 <2.00> (ø)
...hudi/hadoop/hive/HoodieCombineHiveInputFormat.java 44.12% <0.00%> (+0.23%) 17.00% <0.00%> (+1.00%)
...op/realtime/HoodieCombineRealtimeRecordReader.java 83.33% <0.00%> (+6.66%) 9.00% <0.00%> (+1.00%)

@xiarixiaoyao (Contributor, Author):

@vinothchandar, I have rebased this PR; please check, thanks.

@lw309637554 (Contributor):

> @lw309637554 can you please let us know if you are okay with this change? This LGTM.

@vinothchandar @xiarixiaoyao LGTM.

Labels
priority:critical production down; pipelines stalled; Need help asap.
6 participants