Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-2377] Python API for Streaming #2538

Closed
wants to merge 365 commits into from
Closed

Conversation

davies
Copy link
Contributor

@davies davies commented Sep 25, 2014

This patch brings Python API for Streaming.

This patch is based on work from @giwa

@tdas
Copy link
Contributor

tdas commented Oct 11, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2538 at commit 3e2492b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

Tests timed out for PR 2538 at commit 6db00da after a configured wait of 120m.

@tdas
Copy link
Contributor

tdas commented Oct 11, 2014

@davies unfortunately the last update to statebykey introduced some bug. Here is the error in jenkins logs.

Traceback (most recent call last):
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/util.py", line 57, in call
r = self.func(t, *rdds)
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/dstream.py", line 254, in saveAsTextFile
rdd.saveAsTextFile(path)
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1267, in saveAsTextFile
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in call
self.target_id, self.name)
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
format(target_id, '.', name), value)
Py4JJavaError: An error occurred while calling o62024.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 296848.0 failed 1 times, most recent failure: Lost task 0.0 in stage 296848.0 (TID 780, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "pyspark/worker.py", line 109, in main
process()
File "pyspark/worker.py", line 100, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "pyspark/serializers.py", line 234, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/rdd.py", line 1641, in
map_values_fn = lambda (k, v): (k, f(v))
File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/streaming/dstream.py", line 579, in
state = g.mapValues(lambda (vs, s): updateFunc(vs, s))
TypeError: updater() takes exactly 1 argument (2 given)

    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:166)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:235)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:196)
    org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:196)
    org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1321)
    org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:195)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1179)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1179)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:694)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:694)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:694)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1397)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1352)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

Tests timed out for PR 2538 at commit 3e2492b after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21612/Test FAILed.

if len(sys.argv) != 3:
print >> sys.stderr, "Usage: stateful_network_wordcount.py <hostname> <port>"
exit(-1)
sc = SparkContext(appName="PythonStreamingNetworkWordCount")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appName could be "PythonStreamingStatefulNetworkWordCount"

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2538 at commit 6db00da.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2538 at commit 331ecce.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Oct 11, 2014

@tdas it's my mistake, the updateStateByKey() was used in another tests, it's fixed now.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2538 at commit 6db00da.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have started for PR 2538 at commit 64561e4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have finished for PR 2538 at commit 6db00da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingContext(object):
    • class DStream(object):
    • class TransformedDStream(DStream):
    • class TransformFunction(object):
    • class TransformFunctionSerializer(object):

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

QA tests have finished for PR 2538 at commit 64561e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingContext(object):
    • class DStream(object):
    • class TransformedDStream(DStream):
    • class TransformFunction(object):
    • class TransformFunctionSerializer(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21631/Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

Tests timed out for PR 2538 at commit 6db00da after a configured wait of 120m.

@SparkQA
Copy link

SparkQA commented Oct 11, 2014

Tests timed out for PR 2538 at commit 331ecce after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21627/Test FAILed.

@tdas
Copy link
Contributor

tdas commented Oct 12, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Oct 12, 2014

QA tests have started for PR 2538 at commit 64561e4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 12, 2014

QA tests have started for PR 2538 at commit 6db00da.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 12, 2014

QA tests have started for PR 2538 at commit 6db00da.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 12, 2014

QA tests have finished for PR 2538 at commit 64561e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingContext(object):
    • class DStream(object):
    • class TransformedDStream(DStream):
    • class TransformFunction(object):
    • class TransformFunctionSerializer(object):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21660/
Test PASSed.

@tdas
Copy link
Contributor

tdas commented Oct 12, 2014

Alright, I am merging this!! This is awesome, great great work @giwa and @davies. Thank a lot to both of you, and also to @JoshRosen for ongoing help, suggestions and discussions.

@asfgit asfgit closed this in 69c67ab Oct 12, 2014
@giwa
Copy link
Contributor

giwa commented Oct 12, 2014

Thank you for merging this! I appreciate to you, @tdas , to give me many advices. Thank you, @davies to refactor code a lot. I learned many thing from your commits. And also thank you to @JoshRosen to give us the insight and discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants