Cannot run Terasort with pmem-shuffle of branch-1.2 #46

Closed
haojinIntel opened this issue Aug 16, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@haojinIntel
Contributor

Using the latest code of PMEM-SHUFFLE from branch-1.2 to run TeraSort with Spark 3.1.1, we hit the error below:

2021-08-16 17:47:40,924 INFO scheduler.TaskSetManager: Starting task 696.0 in stage 2.0 (TID 3996) (vsr221, executor 4, partition 696, PROCESS_LOCAL, 4282 bytes) taskResourceAssignments Map()
2021-08-16 17:47:40,952 WARN scheduler.TaskSetManager: Lost task 687.0 in stage 2.0 (TID 3987) (vsr221 executor 4): org.apache.spark.shuffle.FetchFailedException: failed to allocate 16777216 byte(s) of direct memory (used: 64424509440, max: 64424509440)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:770)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:685)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:25)
        at scala.collection.TraversableOnce.size(TraversableOnce.scala:110)
        at scala.collection.TraversableOnce.size$(TraversableOnce.scala:108)
        at org.apache.spark.util.CompletionIterator.size(CompletionIterator.scala:25)
        at org.apache.spark.shuffle.BaseShuffleReader.read(BaseShuffleReader.scala:93)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 byte(s) of direct memory (used: 64424509440, max: 64424509440)
        at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:754)
        at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:709)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:755)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:731)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:247)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:227)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:147)
        at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:356)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
        at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:139)
        at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 more
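
For reference, the numbers in the exception line up with the executor heap size configured below: the cap of 64424509440 bytes is exactly 60 GiB, and Netty's pooled allocator limits direct memory to the JVM's maximum direct memory, which in a default JVM setup follows the maximum heap size (spark.executor.memory 60g here) unless -XX:MaxDirectMemorySize or -Dio.netty.maxDirectMemory is set explicitly. The failing request appears to be a single 16 MiB pool chunk on top of an already exhausted pool:

        64424509440 bytes = 60 * 1024^3 = 60 GiB   (matches spark.executor.memory 60g)
        16777216 bytes    = 16 MiB                 (one Netty pool chunk)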

The Spark configuration is shown below:

hibench.streambench.spark.checkpointPath /var/tmp
spark.shuffle.pmof.reduce_serializer_buffer_size 262144
spark.shuffle.pmof.client_buffer_nums 64
hibench.streambench.spark.receiverNumber        4
spark.shuffle.readHostLocalDisk   false
spark.yarn.historyServer.address vsr224:18080
spark.io.compression.codec      snappy
spark.shuffle.pmof.server_pool_size 3
spark.shuffle.pmof.pmem_capacity 246833655808
spark.driver.rhost 10.1.0.124
hibench.yarn.executor.cores   12
spark.executor.memory 60g
hibench.streambench.spark.useDirectMode true
spark.shuffle.compress          true
spark.eventLog.dir hdfs://vsr224:9000/spark-history-server
spark.driver.rport 61000
spark.shuffle.pmof.pmem_list /dev/dax0.0,/dev/dax0.1,/dev/dax1.0,/dev/dax1.1
spark.shuffle.pmof.map_serializer_buffer_size 262144
spark.driver.memory 10g
spark.shuffle.pmof.shuffle_block_size 2097152
spark.memory.offHeap.enabled false
spark.eventLog.enabled true
spark.shuffle.pmof.chunk_size 262144
spark.driver.extraClassPath  /opt/Beaver/OAP/oap_jar/oap-rpmem-shuffle-java-1.2-with-spark3.1.1.jar
spark.shuffle.spill.pmof.MemoryThreshold 16777216
spark.shuffle.pmof.dev_core_set dax0.0:0-95,dax0.1:0-95,dax1.0:0-95,dax1.1:0-95
hibench.yarn.executor.num     12
spark.history.fs.logDirectory hdfs://vsr224:9000/spark-history-server
spark.executor.extraClassPath  /opt/Beaver/OAP/oap_jar/oap-rpmem-shuffle-java-1.2-with-spark3.1.1.jar
spark.history.fs.cleaner.enabled true
spark.default.parallelism     ${hibench.default.map.parallelism}
spark.shuffle.pmof.client_pool_size 3
spark.shuffle.pmof.enable_pmem true
spark.shuffle.pmof.max_task_num 50000
hibench.streambench.spark.storageLevel 2
hibench.streambench.spark.batchInterval         100
spark.shuffle.pmof.max_stage_num 1
spark.shuffle.pmof.node vsr221-10.1.0.121,vsr222-10.1.0.122,vsr223-10.1.0.123
hibench.spark.master     yarn
spark.sql.shuffle.partitions 1000
spark.history.ui.port 18080
spark.shuffle.spill.compress    true
hibench.spark.home       /opt/Beaver/spark
spark.shuffle.pmof.enable_rdma false
spark.shuffle.manager org.apache.spark.shuffle.pmof.PmofShuffleManager
spark.sql.warehouse.dir hdfs://vsr224:9000/spark-warehouse
spark.shuffle.pmof.server_buffer_nums 64

The cluster contains 1 master and 3 workers. Each worker has 96 cores, 384 GB DRAM, and 3 × 1.8 TB NVMe disks. The network bandwidth is 10 Gbps.
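
A possible workaround sketch while the root cause is investigated (assuming the 60 GiB ceiling really is inherited from the 60g executor heap as noted above): raise the direct memory limit explicitly through the executor JVM options and size the container overhead to cover the direct buffers. The values below are placeholders, not validated on this cluster:

spark.executor.extraJavaOptions -XX:MaxDirectMemorySize=<new limit, e.g. 80g>
spark.executor.memoryOverhead <sized to cover the direct buffers plus other native usage>

Whether this only postpones the failure depends on why the shuffle fetch path keeps 60 GiB of direct buffers alive in the first place, which is the actual question of this issue.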

@haojinIntel
Contributor Author

@zhixingheyi-tian @Eugene-Mark Please help track this issue. Thanks!
