Skip to content
This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-400] Native Arrow Row to columnar support #637

Merged
merged 24 commits into from
Jan 8, 2022

Conversation

haojinIntel
Copy link
Collaborator

@haojinIntel haojinIntel commented Dec 15, 2021

What changes were proposed in this pull request?

Native Arrow Row to columnar support

How was this patch tested?

pass jenkins

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/native-sql-engine/issues

Then could you also rename commit message and pull request title in the following format?

[NSE-${ISSUES_ID}] ${detailed message}

See also:

@zhouyuan zhouyuan changed the title Arrow Row to columnar support [DNM] Arrow Row to columnar support Dec 17, 2021
@zhouyuan
Copy link
Collaborator

zhouyuan commented Jan 4, 2022

@haojinIntel can you do a rebase? the format check is fixed in master

val res = new Iterator[ColumnarBatch] {
private val converters = new RowToColumnConverter(localSchema)
private var last_cb: ColumnarBatch = null
private var elapse: Long = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new buf here

// Allocate large buffer to store the numRows rows
val bufferSize = 134217728 // 128M can estimator the buffer size based on the data type
val allocator = SparkMemoryUtils.contextAllocator()
val arrowBuf: ArrowBuf = allocator.buffer(bufferSize)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reuse buf allocated in #95

}
val timeZoneId = SparkSchemaUtils.getLocalTimezoneID()
val arrowSchema = ArrowUtils.toArrowSchema(schema, timeZoneId)
val schemaBytes: Array[Byte] = ConverterUtils.getSchemaBytesBuf(arrowSchema)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these varaiables to init

out_data.null_count = null_count;
*array = MakeArray(std::make_shared<arrow::ArrayData>(std::move(out_data)));
return arrow::Status::OK();
} else if (type->id() == arrow::Int8Type::type_id) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use switch here

@zhouyuan zhouyuan changed the title [DNM] Arrow Row to columnar support [NSE-400] Arrow Row to columnar support Jan 7, 2022
@github-actions
Copy link

github-actions bot commented Jan 7, 2022

#400

@zhouyuan zhouyuan changed the title [NSE-400] Arrow Row to columnar support [NSE-400] Native Arrow Row to columnar support Jan 7, 2022
@zhouyuan zhouyuan merged commit 8e9f4fe into oap-project:master Jan 8, 2022
zhouyuan added a commit to zhouyuan/native-sql-engine that referenced this pull request Jan 8, 2022
* Support ArrowRowToColumnar Optimization

* Replace expired code

* Add the code to convert recordbatch to columnarBatch

* Add unit test on java size

* Update the unit tests

* Fix the bug when reading decimal value from unsafeRow

* Use ArrowRowToColumnarExec instead of RowToArrowColumnarExec

* Use clang-format to standardize the CPP code format

* enable arrowRowToColumnarExec

* Add the metrics for ArrowRowToColumnarExec

* Add the metrics for ArrowRowToColumnarExec and unsupport Codegen

* Add parameter 'spark.oap.sql.columnar.rowtocolumnar' to control ArrowRowToColumnarExec

* Remove useless code

* Release arrowbuf after return recordbatch

* Fix the processTime metric for ArrowRowToColumnarExec

* Refine the code of ArrowRowToColumnar operator

* Add more metrics to detect the elapse time of each action

* Small fix about allocating buffer for unsafeRow

* Remove useless code

* Remove useless metrics for ArrowRowToColumnarExec

* Fall back to use java RowToColumnarExec when the row is not unsafeRow Type

* Fix the bug for decimal format

* fix format

Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
zhouyuan added a commit that referenced this pull request Jan 9, 2022
* Use Hadoop 3.2 as default hadoop.version (#652)

* [NSE-661] Add left/right trim in WSCG

* [NSE-675] Add instr expression support (#676)

* Initial commit

* Add the support in wscg

* [NSE-674] Add translate expression support (#672)

* Initial commit

* Add StringTranslate for subquery checking

* Code refactor

* Change arrow branch for unit test [will revert at last]

* Revert "Change arrow branch for unit test [will revert at last]"

This reverts commit bf74356.

* Port the function to wscg

* Change arrow branch for unit test [will revert at last]

* Format native code

* Fix a bug

* Revert "Change arrow branch for unit test [will revert at last]"

This reverts commit 3a53fa2.

* [NSE-681] Add floor & ceil expression support (#682)

* Initial commit

* Add ceil expression support

* Change arrow branch for unit test [will revert at last]

* Revert "Change arrow branch for unit test [will revert at last]"

This reverts commit 5fb2f4b.

* [NSE-647] Leverage buffered write in shuffle  (#648)

Closes #647

* [NSE-400] Native Arrow Row to columnar support (#637)

* Support ArrowRowToColumnar Optimization

* Replace expired code

* Add the code to convert recordbatch to columnarBatch

* Add unit test on java size

* Update the unit tests

* Fix the bug when reading decimal value from unsafeRow

* Use ArrowRowToColumnarExec instead of RowToArrowColumnarExec

* Use clang-format to standardize the CPP code format

* enable arrowRowToColumnarExec

* Add the metrics for ArrowRowToColumnarExec

* Add the metrics for ArrowRowToColumnarExec and unsupport Codegen

* Add parameter 'spark.oap.sql.columnar.rowtocolumnar' to control ArrowRowToColumnarExec

* Remove useless code

* Release arrowbuf after return recordbatch

* Fix the processTime metric for ArrowRowToColumnarExec

* Refine the code of ArrowRowToColumnar operator

* Add more metrics to detect the elapse time of each action

* Small fix about allocating buffer for unsafeRow

* Remove useless code

* Remove useless metrics for ArrowRowToColumnarExec

* Fall back to use java RowToColumnarExec when the row is not unsafeRow Type

* Fix the bug for decimal format

* fix format

Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>

* fix leakage in rowtocolumn (#683)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

Co-authored-by: Wei-Ting Chen <weiting.chen@intel.com>
Co-authored-by: PHILO-HE <feilong.he@intel.com>
Co-authored-by: Hongze Zhang <hongze.zhang@intel.com>
Co-authored-by: haojinIntel <hao.jin@intel.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants