This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-317]fix columnar cache #346

Merged


xuechendi
Collaborator

This PR aims to fix the current unit test (UT) failures related to the RDD cache:

  1. For tests without the Arrow serializer, we now avoid using ColumnarInMemoryTableScan.
  2. For tests with the Arrow serializer whose cachedPlan doesn't support columnar execution, ArrowSerializer now supports ConvertInternalRowToCachedBatch.
  3. For tests with the Arrow serializer whose cachedPlan also supports columnar execution, we use Arrow for fast caching.
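
The three cases above reduce to a decision on two conditions: whether the Arrow serializer is configured, and whether the cached plan supports columnar output. A minimal Python sketch of that decision follows. It is only an illustration of the logic described in this PR, not the plugin's actual Scala code; `CachePath` and `choose_cache_path` are hypothetical names introduced here.

```python
from enum import Enum


class CachePath(Enum):
    """Hypothetical labels for the three cases listed in the PR description."""
    DEFAULT_ROW_CACHE = "no Arrow serializer: columnar in-memory table scan is not used"
    ARROW_FROM_ROWS = "Arrow serializer: internal rows are converted to cached batches"
    ARROW_COLUMNAR = "Arrow serializer: columnar batches are cached directly"


def choose_cache_path(uses_arrow_serializer: bool,
                      cached_plan_supports_columnar: bool) -> CachePath:
    """Map the two conditions from the PR description to one of the three cases."""
    if not uses_arrow_serializer:
        # Case 1: without the Arrow serializer, stay on the default path.
        return CachePath.DEFAULT_ROW_CACHE
    if not cached_plan_supports_columnar:
        # Case 2: the cached plan produces rows, so they must be converted.
        return CachePath.ARROW_FROM_ROWS
    # Case 3: both sides are columnar, so Arrow batches can be cached as-is.
    return CachePath.ARROW_COLUMNAR
```

Note that when the Arrow serializer is not configured, the second condition does not matter: both combinations fall into case 1.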

TODO:
Backport the Spark PR for manually closing cached blocks: https://issues.apache.org/jira/browse/SPARK-35396
Test with our Jupyter tests (so far only unit-tested).

Fixed: #317

@github-actions

github-actions bot commented Jun 1, 2021

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/oap-project/native-sql-engine/issues

Then could you also rename commit message and pull request title in the following format?

[NSE-${ISSUES_ID}] ${detailed message}

See also:

@xuechendi
Collaborator Author

@rui-mo, please take a look.

@rui-mo
Collaborator

rui-mo commented Jun 1, 2021

Verified in my environment. This PR cleans up my test code and refreshes the UTs: xuechendi#3

@xuechendi xuechendi force-pushed the wip_columnar_cache_fix_for_3.1.1 branch 2 times, most recently from 49d4e03 to acc178a Compare June 30, 2021 09:29
@xuechendi
Collaborator Author

@zhouyuan, this should be mergeable.

from pyspark.sql import SparkSession

native_sql_path = "/mnt/nvme2/chendi/intel-bigdata/OAP/native-sql-engine/native-sql-engine/core/target/spark-columnar-core-1.2.0-snapshot-jar-with-dependencies.jar"
native_arrow_datasource_path = "/mnt/nvme2/chendi/intel-bigdata/OAP/native-sql-engine/arrow-data-source/standard/target/spark-arrow-datasource-standard-1.2.0-snapshot-jar-with-dependencies.jar"
spark = SparkSession.builder.master('yarn')\
        .appName("Recsys2021_data_process")\
        .config("spark.executorEnv.LD_LIBRARY_PATH", "/usr/local/lib64/")\
        .config("spark.driver.extraClassPath",
                f"{native_sql_path}:{native_arrow_datasource_path}")\
        .config("spark.executor.extraClassPath",
                f"{native_sql_path}:{native_arrow_datasource_path}")\
        .config("spark.sql.extensions", "com.intel.oap.ColumnarPlugin")\
        .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")\
        .config("spark.sql.cache.serializer", "org.apache.spark.sql.execution.ArrowColumnarCachedBatchSerializer")\
        .config("spark.executor.memory", "10g")\
        .config("spark.executor.memoryOverhead", "16g")\
        .config("spark.memory.offHeap.enabled", "true")\
        .config("spark.memory.offHeap.size", "12G")\
        .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=25G")\
        .getOrCreate()

Test: cache the join result, then run an aggregation on it.

df = spark.read.format('arrow').load("/recsys2021_0608")
dict_df = spark.read.parquet("/recsys2021_0608_processed/recsys_dicts/language")
df = df.select("tweet_id", "language", "tweet_timestamp", "engaged_with_user_id", "engaging_user_id")
df = df.join(dict_df.withColumnRenamed('dict_col', 'language'), 'language', 'left')
df.cache()
df.groupby('dict_col_id', 'language').count().show()
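
One way to confirm that the aggregation above actually reads from the cache is to inspect the physical plan text for an in-memory table scan node. The helper below is a small sketch of that check. `uses_in_memory_scan` is a hypothetical name, and the check assumes that the scan node's name contains the substring "InMemoryTableScan", which is how stock Spark labels it; the columnar variant in this plugin is assumed, not confirmed, to contain the same substring.

```python
def uses_in_memory_scan(plan_text: str) -> bool:
    """Return True if any line of the plan text mentions an in-memory table scan.

    Assumption: the scan node's name contains "InMemoryTableScan".
    """
    return any("InMemoryTableScan" in line for line in plan_text.splitlines())
```

The plan text can be copied from what `df.groupby('dict_col_id', 'language').count().explain()` prints and passed to the helper as a string.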

* limitations under the License.
*/

package org.apache.spark.sql.travis
Collaborator

@rui-mo rui-mo Jul 1, 2021

Hi @xuechendi, we have renamed the "travis" package to "nativesql". Could you also move this file into the nativesql folder?
The file should also be renamed to NativeCachedTableSuite.scala.

xuechendi and others added 10 commits July 6, 2021 13:29
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…elease function

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
…zedCacheEntry to OffHeap

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the wip_columnar_cache_fix_for_3.1.1 branch from 7c115dc to 7fb4ec3 Compare July 6, 2021 06:29
@xuechendi xuechendi changed the title from "[DNM][NSE-317]Wip columnar cache fix for 3.1.1" to "[NSE-317]Wip columnar cache fix for 3.1.1" Jul 6, 2021
@github-actions

github-actions bot commented Jul 6, 2021

#317

@zhouyuan zhouyuan changed the title from "[NSE-317]Wip columnar cache fix for 3.1.1" to "[NSE-317]fix columnar cache" Jul 7, 2021
@zhouyuan zhouyuan merged commit b584b08 into oap-project:master Jul 7, 2021
Development

Successfully merging this pull request may close these issues.

persistent memory cache issue
3 participants