Need a way to sample a topic #261

kgorman · 2017-08-29T23:23:35Z

It's natural to want to create a stream, but what data exists in a topic? It may be completely free form without a schema.

How do I sample the topic a bit to see what data is coming through in order to know what I want to create?

Perhaps to start just something simple like:

select * from mytopic limit 5;

A more powerful query would be something like:

select random_sample(*, 100, 0) from mytopic;

This would select all columns from 100 randomly sampled messages, starting at offset 0.
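To make the intended semantics concrete, here is a minimal Python sketch of what a call like random_sample(*, 100, 0) could mean: keep a uniform random sample of k messages from a stream of unknown length, skipping everything before a starting offset. This is not KSQL code, and random_sample is only the function proposed above. The name reservoir_sample and its parameters are hypothetical, and reservoir sampling (Algorithm R) is just one reasonable way to do this in a single pass.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    messages: Iterable[T],
    k: int,
    start_offset: int = 0,
    seed: Optional[int] = None,
) -> List[T]:
    """Return up to k uniformly sampled messages at or after start_offset.

    Hypothetical helper: uses reservoir sampling (Algorithm R), so the
    stream is read once and its total length need not be known up front.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    seen = 0  # number of messages considered so far (offset >= start_offset)

    for offset, message in enumerate(messages):
        if offset < start_offset:
            continue  # skip messages before the requested starting offset
        seen += 1
        if len(sample) < k:
            # Fill the reservoir with the first k eligible messages.
            sample.append(message)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = message
    return sample
```

For example, reservoir_sample(consumer_records, 100, 0) would return 100 messages chosen uniformly from whatever the iterable yields, where consumer_records stands in for any iterable of messages.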


kgorman · 2017-08-29T23:34:13Z

I mean like, this isn't particularly elegant:

ksql> create stream kg (test varchar) with (value_format='delimited', kafka_topic='defaultsink');

 Message        
----------------
 Stream created 
ksql> select * from kg;
Exception in deserializing the delimited row: {"flight": "", "timestamp_verbose": "2017-08-29 18:33:28.406898", "msg_type": "8", "track": "", "timestamp": 1504049608, "altitude": "", "counter": 1057, "lon": "", "icao": "A365B7", "vr": "", "lat": "", "speed": ""}
Query terminated

;-)

hjafarpour · 2017-08-30T02:25:31Z

@kgorman Assuming your topic is in JSON format, try this:

ksql> REGISTER TOPIC t1 WITH (value_format = 'json', kafka_topic='defaultsink');
ksql> PRINT t1 SAMPLE 10;

This will print one row out of every 10 rows. You can set the value to any integer you like; if the message rate is high, you can use 1000 or even more.
Note that the docs currently omit the above statements.
Let me know if you run into any issues.
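As an illustration of the "one row out of every N" behaviour described above, here is a minimal Python sketch. It is not how KSQL implements PRINT ... SAMPLE. The function name sample_every_nth is hypothetical, and which row of each group of N gets printed is an assumption here (this sketch keeps the first).

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_every_nth(rows: Iterable[T], n: int) -> Iterator[T]:
    """Yield one row out of every n rows, lazily.

    Assumption: the first row of each group of n is the one kept.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # islice with a step of n keeps rows at positions 0, n, 2n, ...
    return islice(rows, 0, None, n)
```

With n set to 10 this passes through roughly a tenth of the traffic, which is usually enough to see the shape of the messages without flooding the terminal.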

miguno · 2017-09-05T17:14:03Z

@kgorman: From what I understand, your actual problem is figuring out how to format/parse the data in a STREAM's or TABLE's underlying Kafka topic, correct? That is, which properties would need to be set (and to which values) in order for e.g. CREATE STREAM to work properly?

In other words, it's not about being able to sample a topic -- it's about knowing how to write the properties part of e.g. CREATE STREAM correctly? I'm asking because sampling might also be a direct use case in its own right (e.g. in situations where you want to work on a lower-volume variant of the actual input data).

kgorman · 2017-09-05T20:44:14Z

Good point(s) @miguno. Yes, correct on all counts. Because of the interactive nature of the shell, it's natural to experiment and make streams and tables to explore and visualize. I think @hjafarpour's solution is spot on! It works for this purpose. It would be good to update the docs/examples to include this type of information. Perhaps I am just missing it. If so, apologies.

miguno · 2017-09-06T07:32:14Z

Thanks for clarifying @kgorman! We'll take a look at how we could update the docs with this information.

hjafarpour self-assigned this Aug 30, 2017

kgorman closed this as completed Sep 5, 2017
