Need a way to sample a topic #261

Closed
kgorman opened this issue Aug 29, 2017 · 5 comments

kgorman commented Aug 29, 2017

It's natural to want to create a stream, but what data exists in a topic? It may be completely free-form, without a schema.

How do I sample the topic a bit to see what data is coming through in order to know what I want to create?

Perhaps to start just something simple like:

select * from mytopic limit 5;

A more powerful query would be something like:

select random_sample(*, 100, 0) from mytopic;

This would select all columns from 100 randomly sampled messages, starting at offset 0.
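
For illustration only, here is a minimal Python sketch of what "100 randomly sampled messages" could mean when the topic length is not known in advance. It uses reservoir sampling (Algorithm R); the function name `reservoir_sample` and the stand-in message data are hypothetical, and nothing here is part of KSQL or a real Kafka consumer.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(messages: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, msg in enumerate(messages):
        if i < k:
            # Fill the reservoir with the first k messages.
            reservoir.append(msg)
        else:
            # Keep the new message with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = msg
    return reservoir


# Stand-in for messages read from offset 0 (hypothetical data, not a real topic).
messages = ({"offset": n} for n in range(1000))
sample = reservoir_sample(messages, k=100, seed=42)
print(len(sample))
```

The sketch reads the stream once and holds only `k` messages in memory, which is why it suits a topic whose size is unknown.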

Author

kgorman commented Aug 29, 2017

I mean like, this isn't particularly elegant:

ksql> create stream kg (test varchar) with (value_format='delimited', kafka_topic='defaultsink');

 Message        
----------------
 Stream created 
ksql> select * from kg;
Exception in deserializing the delimited row: {"flight": "", "timestamp_verbose": "2017-08-29 18:33:28.406898", "msg_type": "8", "track": "", "timestamp": 1504049608, "altitude": "", "counter": 1057, "lon": "", "icao": "A365B7", "vr": "", "lat": "", "speed": ""}
Query terminated

;-)
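
The error above shows that the topic holds JSON rather than delimited text. As a rough sketch of how one sampled record could guide the column list for CREATE STREAM, the Python below lists each field with the Python type of its value. The helper name `describe_fields` is hypothetical, and mapping these types to KSQL column types is left as a manual step.

```python
import json

# Payload copied from the deserialization error above.
raw = (
    '{"flight": "", "timestamp_verbose": "2017-08-29 18:33:28.406898", '
    '"msg_type": "8", "track": "", "timestamp": 1504049608, "altitude": "", '
    '"counter": 1057, "lon": "", "icao": "A365B7", "vr": "", "lat": "", "speed": ""}'
)


def describe_fields(record_text: str) -> dict:
    """Map each top-level JSON field name to the Python type name of its value."""
    record = json.loads(record_text)
    return {name: type(value).__name__ for name, value in record.items()}


for name, type_name in describe_fields(raw).items():
    print(f"{name}: {type_name}")
```

A single record is only a hint: empty strings such as `"altitude": ""` may hold numbers in other messages, so it is worth checking several samples before settling on types.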

Contributor

hjafarpour commented Aug 30, 2017

@kgorman Assuming your topic is in JSON format, try this:

ksql> REGISTER TOPIC t1 WITH (value_format = 'json', kafka_topic='defaultsink');
ksql> PRINT t1 SAMPLE 10;

This will print one row out of every 10 rows. You can set the value to any integer you like; if the message rate is high, you can use 1000 or even more!
Note that the docs currently omit the above statements.
Let me know if you run into any issues.
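
To make "one row out of every 10" concrete, here is a small Python sketch of interval sampling. It assumes the first row of each group of `n` is the one kept; which row KSQL actually prints is not stated in this thread, and the name `every_nth` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def every_nth(rows: Iterable[T], n: int) -> Iterator[T]:
    """Yield one row out of every n, keeping the first row of each group (an assumption)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # islice with a step of n lazily skips n - 1 rows between yields.
    return islice(rows, 0, None, n)


# Hypothetical rows standing in for messages on the topic.
rows = [f"row-{i}" for i in range(35)]
print(list(every_nth(rows, 10)))
```

Unlike random sampling, this keeps a fixed fraction of the rows, so the output rate scales directly with the input rate.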

hjafarpour self-assigned this Aug 30, 2017

Contributor

miguno commented Sep 5, 2017

@kgorman: From what I understand, your actual problem is figuring out how to format/parse the data in a STREAM's or TABLE's underlying Kafka topic, correct? That is, which properties would need to be set (and to which values) in order for e.g. CREATE STREAM to work properly?

In other words, it's not about being able to sample a topic -- it's about knowing how to write the properties part of e.g. CREATE STREAM correctly? Asking because sampling might be a direct use case (e.g. in situations where you want to work on a lower volume variant of the actual input data).

Author

kgorman commented Sep 5, 2017

Good point(s) @miguno. Yes, correct on all counts. Because of the interactive nature of the shell, it's natural to experiment and make streams and tables to explore and visualize. I think @hjafarpour's solution is spot on! It works for this purpose. It would be good to update the docs/examples to include this type of information. Perhaps I am just missing it. If so, apologies.

kgorman closed this as completed Sep 5, 2017

Contributor

miguno commented Sep 6, 2017

Thanks for clarifying @kgorman! We'll take a look at how we could update the docs with this information.
