[SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python #3715
Conversation
Test build #24510 has started for PR 3715 at commit
Test build #24510 has finished for PR 3715 at commit
Test FAILed.
Test build #24511 has started for PR 3715 at commit
Test build #24511 has finished for PR 3715 at commit
Test PASSed.
LGTM
Test build #24609 has started for PR 3715 at commit
Test build #24609 has finished for PR 3715 at commit
Test FAILed.
Test build #24610 has started for PR 3715 at commit
Test build #24610 has finished for PR 3715 at commit
Test FAILed.
Test build #25267 has started for PR 3715 at commit
Test build #25267 has finished for PR 3715 at commit
Test FAILed.
      dataOut.write(bytes)
    }
  }
  def writeS(str: String) {
new line missing.
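For context, the `writeS` helper under review sends a length-prefixed UTF-8 string from the JVM to the Python side. A minimal Python sketch of the same framing convention (4-byte big-endian length, then the encoded bytes); the function name `write_s` is hypothetical and this is not the actual Spark implementation:

```python
import struct


def write_s(stream, s):
    # Length-prefixed UTF-8 framing, mirroring the JVM-side writeS helper
    # discussed above (sketch only; names are hypothetical).
    data = s.encode("utf-8")
    stream.write(struct.pack(">i", len(data)))  # 4-byte big-endian length
    stream.write(data)                          # then the raw UTF-8 bytes
```

The Python deserializer can then read the 4-byte length first and know exactly how many payload bytes follow.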
Kafka-assembly for Python API
Test build #25951 has started for PR 3715 at commit
Conflicts:
	make-distribution.sh
	project/SparkBuild.scala
Test build #26331 has started for PR 3715 at commit
Test build #26331 has finished for PR 3715 at commit
Test PASSed.
Conflicts:
	python/pyspark/tests.py
Test build #26339 has started for PR 3715 at commit
Test build #26339 has finished for PR 3715 at commit
Test PASSed.
  def write(obj: Any): Unit = obj match {
    case null =>
      dataOut.writeInt(SpecialLengths.NULL)

Why the extra line?
group them into different categories.
Test build #26361 has started for PR 3715 at commit
Test build #26361 has finished for PR 3715 at commit
Test PASSed.
Test build #26578 has started for PR 3715 at commit
        print "No kafka package, please put the assembly jar into classpath:"
        print " $ bin/spark-submit --driver-class-path external/kafka-assembly/target/" + \
            "scala-*/spark-streaming-kafka-assembly-*.jar"
        raise e
The message that gets printed here is quite scary.
2015-02-02 18:31:31.950 java[76691:5f03] Unable to load realm info from SCDynamicStore
No kafka package, please put the assembly jar into classpath:
$ bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
Traceback (most recent call last):
File "/Users/tdas/Projects/Spark/spark/examples/src/main/python/streaming/kafka_wordcount.py", line 46, in <module>
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
File "/Users/tdas/Projects/Spark/spark/python/pyspark/streaming/kafka.py", line 80, in createStream
raise e
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.util.Utils.classForName.
: java.lang.ClassNotFoundException: kafka.serializer.DefaultDecoder
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:153)
at org.apache.spark.util.Utils.classForName(Utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)
It's easy to miss the real message. Is it possible to quit in such a way that the whole stack trace does not get printed, and instead exit gracefully after printing this message? Perhaps an exit? @JoshRosen Is that a good idea?
I am merging despite a small remaining comment from me. Thanks @davies and others for helping!
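The suggestion above (print the short hint and exit instead of re-raising, which dumps the full Py4J stack trace) could be sketched like this. `create_stream_safely` and the `create_stream` callable are hypothetical names for illustration, not the fix that actually landed:

```python
import sys


def create_stream_safely(ssc, zkQuorum, groupId, topics, create_stream):
    # Wrap the Py4J call; if the Kafka assembly jar is missing, print a
    # short hint and exit instead of re-raising (re-raising prints the
    # entire Java stack trace, which buries the real message).
    # Sketch only; the real KafkaUtils.createStream signature may differ.
    try:
        return create_stream(ssc, zkQuorum, groupId, topics)
    except Exception as e:
        if "ClassNotFoundException" in str(e):
            print("No kafka package, please put the assembly jar into classpath:")
            print("  $ bin/spark-submit --driver-class-path "
                  "external/kafka-assembly/target/scala-*/"
                  "spark-streaming-kafka-assembly-*.jar")
            sys.exit(1)
        raise
```

The user then sees only the two-line hint and a clean exit code, rather than twenty lines of `py4j` frames.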
Test build #26578 has finished for PR 3715 at commit
Test PASSed.
This PR brings the Python API for the Spark Streaming Kafka data source.
Run the example:
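The exact run command is truncated above; a minimal driver sketch mirroring the `kafka_wordcount.py` example seen in the traceback earlier (the `zkQuorum` and `topic` values are placeholders, and the script must be submitted with the kafka-assembly jar on the driver classpath as noted above):

```python
def main():
    # Requires pyspark with the spark-streaming-kafka assembly jar on the
    # driver classpath; imports are deferred so this sketch can be defined
    # without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    zkQuorum, topic = "localhost:2181", "test"  # placeholders
    kvs = KafkaUtils.createStream(ssc, zkQuorum,
                                  "spark-streaming-consumer", {topic: 1})
    counts = (kvs.map(lambda x: x[1])                      # drop the key
                 .flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
```

Invoke `main()` from a script submitted via `bin/spark-submit` with the `--driver-class-path` shown in the error-message discussion above.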