
[SPARK-24630][SS] Support SQLStreaming in Spark #22575

Closed · wants to merge 9 commits

Conversation

jackylee-ch
Contributor

What changes were proposed in this pull request?

This patch proposes support for SQLStreaming in Spark. Please refer to SPARK-24630 for more details.

This patch supports:

  1. Creating stream tables, which can be used as Source and Sink in SQLStreaming:
    create table kafka_sql_test using kafka options( isStreaming = 'true', subscribe = 'topic', kafka.bootstrap.servers = 'localhost:9092')
  2. Adding the keyword 'STREAM' to SQL to support SQLStreaming queries:
    select stream * from kafka_sql_test
  3. Complex queries, which are all supported as long as both SQL and Structured Streaming support them (see the combined sketch after this list).
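
For illustration, a minimal end-to-end sketch of the proposed usage in Scala (table, topic, and server names are taken from the examples above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sqlstreaming-demo")
  .getOrCreate()

// 1. Register a Kafka-backed stream table; isStreaming marks it as usable as a
//    streaming Source or Sink.
spark.sql(
  """CREATE TABLE kafka_sql_test USING kafka OPTIONS(
    |  isStreaming = 'true',
    |  subscribe = 'topic',
    |  kafka.bootstrap.servers = 'localhost:9092')""".stripMargin)

// 2. The STREAM keyword turns this into a streaming query over the table.
spark.sql("SELECT STREAM * FROM kafka_sql_test")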

How was this patch tested?

Unit tests are added to verify SQLStreaming.

@WangTaoTheTonic
Contributor

ok to test

@WangTaoTheTonic
Contributor

Is this still a WIP?
Using an isStreaming tag in DDL to mark whether a table is streaming is brilliant; it keeps streaming tables compatible with batch SQL queries.
If possible, I think it is better not to introduce a STREAM keyword in DML. Maybe we can use properties of the tables participating in a query (like isStreaming) to decide whether to generate a StreamingRelation or a batch relation (see the sketch below). What do you think?
SQLStreaming is an important part of SS from my perspective, as it makes SS more complete and usable. Thanks for your work!
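
A minimal sketch of that idea (illustrative only: DataSource and StreamingRelation are Spark-internal APIs, and looking up an isStreaming table property is an assumption, not part of this patch):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.DataSource
import org.apache.spark.sql.execution.streaming.StreamingRelation

// Resolve a table to a streaming or batch relation based on its isStreaming
// table property, instead of a STREAM keyword in the query.
def resolveRelation(spark: SparkSession, table: CatalogTable): LogicalPlan = {
  val isStreaming = table.properties.get("isStreaming").exists(_.toBoolean)
  if (isStreaming) {
    // Streaming source: wrap the table's data source in a StreamingRelation.
    StreamingRelation(
      DataSource(spark, className = table.provider.get, options = table.properties))
  } else {
    // Batch path: leave resolution to the regular analyzer rules.
    UnresolvedRelation(table.identifier)
  }
}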

@jackylee-ch jackylee-ch changed the title [SPARK-24630][SS][WIP] Support SQLStreaming in Spark [SPARK-24630][SS] Support SQLStreaming in Spark Oct 11, 2018
@jackylee-ch
Contributor Author

@WangTaoTheTonic
Adding the 'stream' keyword has two purposes:

  • Mark the entire SQL query as a stream query and generate the SQLStreaming plan tree.
  • Mark the table type as UnResolvedStreamRelation, so the table can be parsed as a StreamingRelation or another relation; this matters especially in stream-join-batch queries, such as Kafka join MySQL.

Besides, the keyword 'stream' makes it easier to express Structured Streaming in pure SQL.
A small example to show the importance of 'stream': read a stream from a Kafka stream table, and join MySQL to count user messages (a DataFrame-equivalent sketch follows this list):

  • with 'stream'

    • select stream kafka_sql_test.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name
      • This is treated as a streaming query using the console sink; kafka_sql_test is parsed as a StreamingRelation and mysql_test as a JDBCRelation, not a streaming relation.
    • insert into csv_sql_table select stream kafka_sql_test.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name
      • This is treated as a streaming query using the FileStream sink; kafka_sql_test is parsed as a StreamingRelation and mysql_test as a JDBCRelation, not a streaming relation.
  • without 'stream'

    • select kafka_sql_test.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name
      • This is treated as a batch query; kafka_sql_test is parsed as a KafkaRelation and mysql_test as a JDBCRelation.
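
For comparison, roughly the DataFrame-API equivalent of the 'with stream' query above (a sketch: the JDBC connection info and the parsing of the Kafka value into name/door columns are illustrative assumptions):

import org.apache.spark.sql.functions.count

val kafka = spark.readStream.format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load()
  // The binary Kafka `value` would need real parsing (e.g. from_json) into
  // name/door columns; these casts are only placeholders.
  .selectExpr("CAST(key AS STRING) AS name", "CAST(value AS STRING) AS door")

val mysql = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost/test") // assumed connection info
  .option("dbtable", "mysql_test")
  .load()

// Stream-batch join: the Kafka side stays a stream, the JDBC side is batch.
kafka.join(mysql, kafka("name") === mysql("name"))
  .groupBy(kafka("name"))
  .agg(count("door"))
  .writeStream.outputMode("complete").format("console").start()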

@WangTaoTheTonic
Contributor

What should we do if we want to join two Kafka streams and sink the result to another stream?

@jackylee-ch
Contributor Author

What should we do if we want to join two Kafka streams and sink the result to another stream?
insert into kafka_sql_out select stream t1.value from (select cast(value as string), timestamp as time1 from kafka_sql_in1) as t1 inner join (select cast(value as string), timestamp as time2 from kafka_sql_in2) as t2 on time1 >= time2 and time1 <= time2 + interval 10 seconds where t1.value == t2.value
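
Added for illustration: roughly the equivalent in the existing DataFrame API (a sketch; the topic names and checkpoint path are assumptions):

import org.apache.spark.sql.functions.expr

val in1 = spark.readStream.format("kafka")
  .option("subscribe", "in1")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load()
  .selectExpr("CAST(value AS STRING) AS value1", "timestamp AS time1")

val in2 = spark.readStream.format("kafka")
  .option("subscribe", "in2")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load()
  .selectExpr("CAST(value AS STRING) AS value2", "timestamp AS time2")

// Stream-stream inner join with a time-range condition, sinking back to Kafka.
in1.join(in2, expr(
    "value1 = value2 AND time1 >= time2 AND time1 <= time2 + interval 10 seconds"))
  .selectExpr("value1 AS value")
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "kafka_sql_out")
  .option("checkpointLocation", "/tmp/kafka_sql_out_ckpt") // required by the Kafka sink
  .start()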

@jackylee-ch
Contributor Author

cc @xuanyuanking

@xuanyuanking
Member

As commented in https://issues.apache.org/jira/browse/SPARK-24630?focusedCommentId=16523064&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16523064, the current approach supports submitting a streaming job entirely in SQL, but it is mainly based on Hive table support; it needs more discussion for other data sources.

@cloud-fan
Contributor

Do we have a full story for streaming SQL? Is the STREAM keyword the only difference between streaming SQL and normal SQL?

also cc @tdas @zsxwing

@jackylee-ch
Contributor Author

Is the STREAM keyword the only difference between streaming SQL and normal SQL? How could users define a watermark with SQL?

Yes, the 'stream' keyword is the only difference from normal SQL.
We can use configuration to define the watermark (a sketch follows).
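
For example, a sketch of the configuration-based watermark idea (the conf key names here are hypothetical, not from this patch):

// Hypothetical configuration for defining the watermark of a streaming SQL
// query; the key names are illustrative, not part of this patch.
spark.conf.set("spark.sqlstreaming.watermark.column", "timestamp")
spark.conf.set("spark.sqlstreaming.watermark.delay", "10 seconds")

// Internally this could translate to the existing DataFrame API call:
//   df.withWatermark("timestamp", "10 seconds")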

@jackylee-ch
Contributor Author

jackylee-ch commented Oct 26, 2018

@WangTaoTheTonic @cloud-fan @xuanyuanking
I have removed the stream keyword. The Table API is now supported in SQLStreaming.

@WangTaoTheTonic
Contributor

Nice! I am looking forward to it.

@shijinkui
Contributor

@cloud-fan Hi, Wenchen. Is this ready to be merged? This PR is very useful and is exactly what I want and need.
Once Spark supports StreamSQL, it will be easier to develop streaming jobs.
Thanks.

@jackylee-ch
Contributor Author

@tdas @zsxwing @cloud-fan
Hi, are there any other questions blocking this patch from being merged?

@gvramana
Contributor

gvramana commented Nov 12, 2018

What should we do if we want to join two Kafka streams and sink the result to another stream?
insert into kafka_sql_out select stream t1.value from (select cast(value as string), timestamp as time1 from kafka_sql_in1) as t1 inner join (select cast(value as string), timestamp as time2 from kafka_sql_in2) as t2 on time1 >= time2 and time1 <= time2 + interval 10 seconds where t1.value == t2.value

Hi stczwd,
Currently the DataFrame API supports the writeStream.start() API to run a stream in the background, so that a query can be executed on that sink, and multiple stream-to-stream jobs can run in a single session.
How can this be achieved using INSERT INTO stream?
How can multiple streams with different properties be executed in the same session?

@jackylee-ch
Contributor Author

Currently the DataFrame API supports the writeStream.start() API to run a stream in the background, so that a query can be executed on that sink, and multiple stream-to-stream jobs can run in a single session.
How can this be achieved using INSERT INTO stream?
How can multiple streams with different properties be executed in the same session?

SQLStreaming does not support multiple streams. In our use cases, SQLStreaming is mostly used ad hoc; each case runs only one INSERT INTO stream.
Still, SQLStreaming can support multiple streams with the Table API:
spark.table("kafka_stream").groupBy("value").count().writeStream.outputMode("complete").format("console").start()

@jackylee-ch
Contributor Author

@cloud-fan @zsxwing @tdas @xuanyuanking
This patch has been open for a long time. Do you have any questions? Can it be merged?

@sujith71955
Contributor

sujith71955 commented Nov 26, 2018

@stczwd Can you provide a detailed design document for this PR, mentioning the scenarios handled and any constraints? That will give a complete picture of this PR. Thanks.

@jackylee-ch
Contributor Author

@sujith71955
Please refer to SPARK-24630 for more details.

@sujith71955
Contributor

[screenshot]

There is a DataSourceV2 community sync meetup tomorrow, coordinated by Ryan Blue; can we discuss this point there?

@sujith71955
Contributor

cc @koeninger

@jackylee-ch
Contributor Author

[screenshot]

I have removed the 'stream' keyword.

There is a DataSourceV2 community sync meetup tomorrow, coordinated by Ryan Blue; can we discuss this point there?

Yep, it's a good idea.

@sujith71955
Contributor

sujith71955 commented Nov 28, 2018 via email

@jackylee-ch
Contributor Author

Can you send a mail to Ryan Blue to add this SPIP topic to tomorrow's meeting? The meeting will be held tomorrow at 05:00 PM PST. If you confirm, we can also attend the meeting.

I have sent an email to Ryan Blue to attend this meeting.

@sujith71955
Contributor

Can you send a mail to Ryan Blue to add this SPIP topic to tomorrow's meeting? The meeting will be held tomorrow at 05:00 PM PST. If you confirm, we can also attend the meeting.

I have sent an email to Ryan Blue to attend this meeting.

I think you should also ask him to add your SPIP topic to tomorrow's discussion. The agenda has to be set in advance.

@jackylee-ch
Contributor Author

I have sent an email to Ryan Blue.

Can you send a mail to Ryan Blue to add this SPIP topic to tomorrow's meeting? The meeting will be held tomorrow at 05:00 PM PST. If you confirm, we can also attend the meeting.

I have sent an email to Ryan Blue to attend this meeting.

I think you should also ask him to add your SPIP topic to tomorrow's discussion. The agenda has to be set in advance.

Tomorrow's discussion mainly focuses on the DataSource V2 API; I don't think they will spend time discussing the SQL API. However, we can mention it while discussing the Catalog API.

@jackylee-ch
Contributor Author

@cloud-fan You can take a look at my changes here; I will update the design documentation later.
Thank you for your interest in SQLStreaming.

@yaooqinn
Member

@stczwd lgtm. nit: It would be very helpful to add some instructions to the online documentation.

@jackylee-ch
Contributor Author

@stczwd lgtm. nit: It would be very helpful to add some instructions to the online documentation.

@yaooqinn Thanks for the support. I have written a detailed design doc, which will be published soon.

private def parseTrigger(): Trigger = {
  // Convert the configured trigger interval (a time string) to milliseconds.
  val trigger = Utils.timeStringAsMs(sqlConf.sqlStreamTrigger)
  Trigger.ProcessingTime(trigger, TimeUnit.MILLISECONDS)
}
Contributor

Continuous processing mode is supported now; do you plan to support it? If so, I think we can traverse the logical plan to find out whether this is a continuous query and create a ContinuousTrigger (a sketch follows).
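
A sketch of that suggestion against the method above (isContinuousQuery is a hypothetical predicate that would walk the plan and check that every source supports continuous processing; Utils and sqlConf come from the surrounding class):

import java.util.concurrent.TimeUnit
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.streaming.Trigger

private def parseTrigger(plan: LogicalPlan): Trigger = {
  val intervalMs = Utils.timeStringAsMs(sqlConf.sqlStreamTrigger)
  if (isContinuousQuery(plan)) {
    // Continuous processing: low-latency mode with a checkpoint interval.
    Trigger.Continuous(intervalMs, TimeUnit.MILLISECONDS)
  } else {
    // Default microbatch mode.
    Trigger.ProcessingTime(intervalMs, TimeUnit.MILLISECONDS)
  }
}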


@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 6, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 6, 2020
@github-actions github-actions bot closed this Jan 7, 2020