
Cannot run ALS algorithm with oap-mllib after commit 2883d3447d07feb55bf5d4fee8225d74b0b1e2b1 #116

Closed
haojinIntel opened this issue Aug 11, 2021 · 1 comment · Fixed by #118
Labels
bug Something isn't working

Comments

@haojinIntel
Collaborator

Since commit 2883d3447d07feb55bf5d4fee8225d74b0b1e2b1 on branch-1.2, running ALS with oap-mllib fails with the following error:

2021-08-11 10:58:40,941 ERROR scheduler.TaskSetManager: Task 2 in stage 7.0 failed 4 times; aborting job
2021-08-11 10:58:40,949 INFO cluster.YarnScheduler: Cancelling stage 7
2021-08-11 10:58:40,949 INFO cluster.YarnScheduler: Killing all running tasks in stage 7: Stage cancelled
2021-08-11 10:58:40,958 INFO cluster.YarnScheduler: Stage 7 was cancelled
2021-08-11 10:58:40,959 INFO scheduler.DAGScheduler: ResultStage 7 (collect at Utils.scala:102) failed in 12.733 s due to Job aborted due to stage failure: Task 2 in stage 7.0 failed 4 times, most recent failure: Lost task 2.3 in stage 7.0 (TID 638) (bdpe-sky3 executor 9): java.lang.UnsatisfiedLinkError: org.apache.spark.ml.util.OneCCL$.c_getAvailPort(Ljava/lang/String;)I
        at org.apache.spark.ml.util.OneCCL$.c_getAvailPort(Native Method)
        at org.apache.spark.ml.util.OneCCL$.getAvailPort(OneCCL.scala:54)
        at org.apache.spark.ml.util.Utils$.$anonfun$checkExecutorAvailPort$1(Utils.scala:103)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
2021-08-11 10:58:40,961 INFO scheduler.DAGScheduler: Job 6 failed: collect at Utils.scala:102, took 27.956369 s
2021-08-11 10:58:40,964 ERROR util.Instrumentation: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 7.0 failed 4 times, most recent failure: Lost task 2.3 in stage 7.0 (TID 638) (bdpe-sky3 executor 9): java.lang.UnsatisfiedLinkError: org.apache.spark.ml.util.OneCCL$.c_getAvailPort(Ljava/lang/String;)I
        at org.apache.spark.ml.util.OneCCL$.c_getAvailPort(Native Method)
        at org.apache.spark.ml.util.OneCCL$.getAvailPort(OneCCL.scala:54)
        at org.apache.spark.ml.util.Utils$.$anonfun$checkExecutorAvailPort$1(Utils.scala:103)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
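The failing call is a JNI binding: `java.lang.UnsatisfiedLinkError` on a `native` method means the JVM could not find a matching implementation in any loaded native library, typically because the shared library was not loaded on the executor or because it was built from a source tree whose exported symbols no longer match the Scala/Java declaration. As a minimal sketch of this failure mode (the class and method below are hypothetical stand-ins mirroring the `c_getAvailPort(String): Int` signature from the log, not oap-mllib code):

```java
// Sketch: a method declared `native` whose implementation is absent from
// every loaded native library throws java.lang.UnsatisfiedLinkError at the
// first call site, not at class-load time.
public class NativeLinkDemo {
    // Hypothetical binding mirroring OneCCL.c_getAvailPort(localIP: String): Int.
    // No System.loadLibrary(...) is ever called, so no implementation is linked.
    private static native int c_getAvailPort(String localIP);

    public static void main(String[] args) {
        try {
            c_getAvailPort("127.0.0.1");
            System.out.println("linked");
        } catch (UnsatisfiedLinkError e) {
            // Same error class as in the executor log above.
            System.out.println("caught UnsatisfiedLinkError");
        }
    }
}
```

Because the error surfaces only when the method is first invoked, a jar that packages stale or mismatched native libraries can pass driver-side setup and only fail once an executor task reaches the native call, which matches the task-level failures in the log.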
@haojinIntel
Collaborator Author

@xwu99 @zhixingheyi-tian Please help to track the issue. Thanks!
