From 09e90ae01b70eb2a43401d2820cc2d7e22515877 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:06:23 -0700 Subject: [PATCH 01/15] docs: ports lexical structure, ddl, and primer docs --- docs/overview/apache-kafka-primer.md | 244 ++++++++++++++++++ docs/reference/sql/data-definition.md | 149 +++++++++++ .../reference/sql/syntax/lexical-structure.md | 149 +++++++++++ docs/requirements.txt | 1 + mkdocs.yml | 12 +- 5 files changed, 552 insertions(+), 3 deletions(-) create mode 100644 docs/overview/apache-kafka-primer.md create mode 100644 docs/reference/sql/data-definition.md create mode 100644 docs/reference/sql/syntax/lexical-structure.md diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md new file mode 100644 index 000000000000..f9a52a9a8a72 --- /dev/null +++ b/docs/overview/apache-kafka-primer.md @@ -0,0 +1,244 @@ +--- +layout: page +title: Apache Kafka® primer +tagline: Kafka concepts you need to use ksqlDB +description: Learn the minimum number of Kafka concepts to use ksqlDB effectively +keywords: ksqldb, kafka +--- + +ksqlDB is an event streaming database built specifically for {{ site.aktm }}. +Although it's designed to give you a higher-level set of primitives than +{{ site.ak }} has, it's inevitable that all of {{ site.ak }}'s concepts can't be, and +shouldn't be, abstracted away entirely. This section describes the minimum +number of {{ site.ak }} concepts that you need to use ksqlDB effectively. +For more information, consult the official [Apache Kafka documentation](https://kafka.apache.org/documentation/). + +## Records + +The primary unit of data in {{ site.ak }} is the event. An event models +something that happened in the world at a point in time. In {{ site.ak }}, +you represent each event using a data construct known as a record. A record +carries a few different kinds of data in it: key, value, timestamp, topic, partition, offset, and headers. + +The _key_ of a record is an arbitrary piece of data that denotes the identity +of the event. If the events are clicks on a web page, a suitable key might be +the ID of the user who did the clicking. + +The _value_ is also an arbitrary piece of data that represents the primary data of +interest. The value of a click event probably contains the page that it +happened on, the DOM element that was clicked, and other interesting tidbits +of information. + +The _timestamp_ denotes when the event happened. There are a few different "kinds" +of time that can be tracked. These aren’t discussed here, but they’re useful to +[learn about](../../../concepts/time-and-windows-in-ksqldb-queries/#time-semantics) nonetheless. + +The _topic_ and _partition_ describe which larger collection of events +this event belongs to, and the _offset_ describes its exact position within +that larger collection (more on that below). + +Finally, the _headers_ carry arbitrary, user-supplied metadata about the record. + +ksqlDB abstracts over some of these pieces of information so you don’t need to +think about them. Others are exposed directly and are an integral part of the +programming model. For example, the fundamental unit of data in ksqlDB is the +_row_. A row is a helpful abstraction over a {{ site.ak }} record. Rows have +columns of two kinds: key columns and value columns. They also carry +pseudocolumns for metadata, like a `timestamp`. + +In general, ksqlDB avoids raising up {{ site.ak }}-level implementation details +that don’t contribute to a high-level programming model. 
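For example, here's a minimal sketch of how an event surfaces as a row. The stream name and columns below are hypothetical, but the constructs shown (a `KEY` column, value columns, and the `ROWTIME` pseudocolumn) are the ones described above and covered in more detail in the data definition reference:

```sql
-- Hypothetical stream of click events.
-- 'user_id' is stored in the record key; 'page' and 'element' are stored in the record value.
CREATE STREAM clicks (
    user_id VARCHAR KEY,
    page VARCHAR,
    element VARCHAR
) WITH (
    kafka_topic = 'clicks',
    partitions = 3,
    value_format = 'json'
);

-- ROWTIME is a pseudocolumn that exposes the underlying record's timestamp.
SELECT ROWTIME, user_id, page FROM clicks EMIT CHANGES;
```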
+ +## Topics + +Topics are named collections of records. Their purpose is to let you hold +events of mutual interest together. A series of click records might get stored +in a "clicks" topic so that you can access them all in one place. Topics are +append-only. Once you add a record to a topic, you can’t change or delete it +individually. + +There are no rules for what kinds of records can be placed into topics. They +don't need to conform to the same structure, relate to the same situation, or +anything like that. The way you manage publication to topics is entirely a +matter of user convention and enforcement. + +ksqlDB provides higher-level abstractions over a topic through _streams_ and +_tables_. A stream or table is a {{ site.ak }} topic with a registered schema. +The schema controls the shape of records that are allowed to be stored in the +topic. This kind of static typing makes it easier to understand what sort of +rows are in your topic and generally helps you make fewer mistakes in your +programs that process them. + +## Partitions + +When a record is placed into a topic, it is placed into a particular partition. +A partition is a totally ordered sequence of records. Topics have multiple +partitions to make storage and processing more scalable. When you create a +topic, you choose how many partitions it has. + +When you append a record to a topic, a partitioning strategy chooses which +partition it is stored in. There are many partitioning strategies. The most common +one is to hash the contents of the record's key against the total number of +partitions. This has the effect of placing all records with the same identity +into the same partition, which is useful because of the strong ordering +guarantees. + +The order of the records is tracked by a piece of data known as an offset, +which is set when the record is appended. A record with offset of _10_ happened +earlier than a record in the same partition with offset of _20_. + +Much of the mechanics here are handled automatically by ksqlDB on your behalf. +When you create a stream or table, you choose the number of partitions for the +underlying topic so that you can have control over its scalability. When you +declare a schema, you choose which columns are part of the key and which are +part of the value. Beyond this, you don't need to think about individual partitions +or offsets. Here are some examples of that. + +When a record is processed, its key content is hashed so that its new downstream +partition will be consistent with all other records with the same key. When records are +appended, they follow the correct offset order, even in the presence of +failures or faults. When a stream's key content changes because of how a query +wants to process the rows (via `GROUP BY` or `PARTITION BY`), the underlying +records keys are recalculated, and the records are sent to a new partition set +to perform the computation. + +## Producers and consumers + +Producers and consumers facilitate the movement of records to and from topics. +When an application wants to either publish records or subscribe to them, it +invokes the APIs (generally called the _client_) to do so. Clients communicate +with the brokers over a structured network protocol. + +When consumers read records from a topic, they never delete them or mutate +them in any way. This pattern of being able to repeatedly read the same +information is helpful for building multiple applications over the same data +set in a non-conflicting way. 
It's also the primary building block for
supporting "replay", where an application can rewind its event stream and read
old information again.

Producers and consumers expose a fairly low-level API. You need to construct
your own records, manage their schemas, configure their serialization, and
handle what you send where.

ksqlDB behaves as a high-level, continuous producer and consumer. You simply
declare the shape of your records, then issue high-level SQL commands that
describe how to populate, alter, and query the data. These SQL programs are
translated into low-level client API invocations that take care of the details
for you.

## Brokers

The brokers are servers that store and manage access to topics. Multiple brokers
can cluster together to replicate topics in a highly available, fault-tolerant
manner. Clients communicate with the brokers to read and write records.

When you run a ksqlDB server or cluster, each of its nodes communicates with
the {{ site.ak }} brokers to do its processing. From the {{ site.ak }} brokers'
point of view, each ksqlDB server is like a client. No processing takes place
on the broker. ksqlDB's servers do all of their computation on their own nodes.

## Serializers

Because no data format is a perfect fit for all problems, {{ site.ak }} was
designed to be agnostic to the data contents in the key and value portions of
its records. When records move from client to broker, the user payload (key and
value) must be transformed to byte arrays. This enables {{ site.ak }} to work
with an opaque series of bytes without needing to know anything about what they
are. When records are delivered to a consumer, those byte arrays need to be
transformed back into their original form to be meaningful to the application.
The process that converts to and from byte representations is called
_serialization_.

When a producer sends a record to a topic, it must decide which serializers to
use to convert the key and value to byte arrays. The key and value
serializers are chosen independently. When a consumer receives a record, it
must decide which deserializer to use to convert the byte arrays back to
their original values. Serializers and deserializers come in pairs. If you use
a different deserializer, you won't be able to make sense of the byte contents.

ksqlDB raises the abstraction of serialization substantially. Instead of
configuring serializers manually, you declare formats using configuration
options at stream/table creation time. Instead of having to keep track of which
topics are serialized which way, ksqlDB maintains metadata about the byte
representations of each stream and table. Consumers are configured automatically
to use the correct deserializers.

## Schemas

Although the records serialized to {{ site.ak }} are opaque bytes, they must have
some rules about their structure to make it possible to process them. One aspect of this
structure is the schema of the data, which defines its shape and fields. Is it
an integer? Is it a map with keys `foo`, `bar`, and `baz`? Something else?

Without any mechanism for enforcement, schemas are implicit. A consumer,
somehow, needs to know the form of the produced data. Frequently this happens
by getting a group of people to agree verbally on the schema. This approach,
however, is error prone. It's often better if the schema can be managed
centrally, audited, and enforced programmatically.
[Confluent {{ site.sr }}](https://docs.confluent.io/current/schema-registry/index.html), a project outside of {{ site.ak }}, helps with schema
management. {{ site.sr }} enables producers to register a topic with a schema
so that when any further data is produced, it is rejected if it doesn't
conform to the schema. Consumers can consult {{ site.sr }} to find the schema
for topics they don't know about.

Rather than having you glue together producers, consumers, and schema
configuration, ksqlDB integrates transparently with {{ site.sr }}. By enabling
a configuration option so that the two systems can talk to each other, ksqlDB
stores all stream and table schemas in {{ site.sr }}. These schemas can then be
downloaded and used by any application working with ksqlDB data. Moreover,
ksqlDB can infer the schemas of existing topics automatically, so that you
don't need to declare their structure when you define a stream or table over
them.

## Consumer groups

When a consumer program boots up, it registers itself into a _consumer group_,
which multiple consumers can enter. Each time a record is eligible to be
consumed, exactly one consumer in the group reads it. This effectively provides
a way for a set of processes to coordinate and load balance the consumption of
records.

Because the records in a single partition are meant to be consumed in order,
each partition in the subscription is read by only one consumer at a time. The
number of partitions that each consumer is responsible for is defined by the
total number of source partitions divided by the number of consumers. If a
consumer dynamically joins the group, the ownership is recomputed and the
partitions reassigned. If a consumer leaves the group, the same computation
takes place.

ksqlDB builds on this powerful load balancing primitive. When you deploy a
persistent query to a cluster of ksqlDB servers, the workload is distributed
across the cluster according to the number of source partitions. You don't need
to manage group membership explicitly, because all of this happens automatically.

For example, if you deploy a persistent query with ten source partitions to a
ksqlDB cluster with two nodes, each node processes five partitions. If you lose
a server, the sole remaining server will rebalance automatically and process
all ten. If you add three more servers, the workload rebalances again, and each
of the five servers processes two partitions.

## Retention

When you create a topic, you must set a retention duration. Retention
defines how long a record is stored before it's deleted. Retention is one of the
only ways to delete a record in a topic. This parameter is
particularly important in stream processing because it defines the time
horizon that you can replay a stream of events. Replay is useful if you're
fixing a bug, building a new application, or backtesting some existing piece of
logic.

ksqlDB enables you to control the retention of the underlying topics of base
streams and tables directly, so it's important to understand the concept. For
more information, see [Topics and Logs in the Kafka docs](https://kafka.apache.org/documentation/#intro_topics).

## Compaction

Compaction is a process that runs in the background on each {{ site.ak }}
broker that periodically deletes all but the latest record per key. It is an
optional, opt-in process. Compaction is particularly useful when your records
represent some kind of update to a piece of state, and the latest update is
the only one that matters in the end.
+ +ksqlDB directly leverages compaction to support the underlying changelogs that +back its materialized tables. They allow ksqlDB to store the minimum amount of +information needed to rebuild a table in the event of a failover. For more +information see [Log Compaction in the Kafka docs](https://kafka.apache.org/documentation/#compaction). \ No newline at end of file diff --git a/docs/reference/sql/data-definition.md b/docs/reference/sql/data-definition.md new file mode 100644 index 000000000000..32ebfc27eef0 --- /dev/null +++ b/docs/reference/sql/data-definition.md @@ -0,0 +1,149 @@ +--- +layout: page +title: Data definition +tagline: Use DDL to structure data +description: How to use DDL to structure data in ksqlDB +keywords: ksqldb, sql, ddl +--- + +This section covers how you create the structures that store your events. ksqlDB abstracts events as rows with columns and stores them in streams and tables. + +## Rows and columns + +Streams and tables help you model collections of events that accrete over time. Both are represented as a series of rows and columns with a schema, much like a relational database table. Rows represent individual events. Columns represent the attributes of those events. + +Each column has a data type. The data type limits the span of permissible values that it can take on. For example, if a column is declared as type `INT`, it cannot take on the value of string `'foo'`. + +In contrast to relational database tables, the columns of a row in ksqlDB are divided into "key" and "value" columns. The key columns control which partition a row resides in. The value columns, by convention, are used to store the main data of interest. Being able to control the key columns is useful for manipulating the underlying data locality, and generally allows you to integrate with the wider Kafka ecosystem, which uses the same key/value data model. By default, a column is a value column. Marking a column as a `(PRIMARY) KEY` makes it a key column. + +Internally, each row is backed by a [Kafka record](../../../overview/apache-kafka-primer/#records). In Kafka, the key and value parts of a record are independently [serialized](../../../overview/apache-kafka-primer/#serializers). ksqlDB allows you to exercise that same flexibility, and generally builds on the semantics of Kafka records, rather than hiding them. + +There is no theoretical limit on the number of columns in a stream or table. In practice, the limit is determined by the maximum message size that Kafka can store and the resources dedicated to ksqlDB. + +## Streams + +Streams are partitioned, immutable, append-only collections. They represent a series of historical facts. For example, the rows of a stream could model a sequence of financial transactions, like "Alice sent $100 to Bob”, then “Charlie sent $50 to Bob". + +Once a row is inserted into a stream, it can never change. New rows can be appended at the end of the stream, but existing rows can never be updated or deleted. + +Each row is stored in a particular partition. Every row, implicitly or explicitly, has a key that represents its identity. All rows with the same key reside in the same partition. + +To create a stream, use the `CREATE STREAM` command. In this command, you specify a name for the new stream, the names of the columns, and the data type of each column. 
```sql
CREATE STREAM s1 (
    k VARCHAR KEY,
    v1 INT,
    v2 VARCHAR
) WITH (
    kafka_topic = 's1',
    partitions = 3,
    value_format = 'json'
);
```

This creates a new stream named `s1` with three columns: `k`, `v1`, and `v2`. The column `k` is designated as the key of this stream, which controls which partition each row is stored in. When the data is stored, the value portion of each row's underlying Kafka record is serialized in the JSON format.

Under the covers, each stream corresponds to a [Kafka topic](../../../overview/apache-kafka-primer/#topics) with a registered schema. If the backing topic for a stream doesn't exist when you declare it, ksqlDB creates it on your behalf, as in the example above.

You can also declare a stream on top of an existing topic. When you do that, ksqlDB simply registers its associated schema. If topic `s2` already exists, this command will register a new stream over it:

```sql
CREATE STREAM s2 (
    k1 VARCHAR KEY,
    v1 VARCHAR
) WITH (
    kafka_topic = 's2',
    value_format = 'json'
);
```

Note that when you create a stream on an existing topic, you don't need to declare the number of partitions. ksqlDB simply infers the partition count from the existing topic.

## Tables

Tables are mutable, partitioned collections that model change over time. In contrast to streams, which represent a historical sequence of events, tables represent what is true as of "now". For example, you might model the places that someone has lived as a stream of moves: first Miami, then New York, then London. A table rolls that history up into where the person lives right now.

Tables work by leveraging the keys of each row. If a sequence of rows shares a key, the last row for a given key represents the most up-to-date information for that key's identity. A background process periodically runs and deletes all but the newest rows for each key.

Here is what declaring a table looks like in code. Syntactically, it is almost the same as declaring a stream.

```sql
CREATE TABLE current_location (
    person VARCHAR PRIMARY KEY,
    location VARCHAR
) WITH (
    kafka_topic = 'current_location',
    partitions = 3,
    value_format = 'json'
);
```

As with streams, tables can also be declared directly on top of an existing Kafka topic. Simply omit the number of partitions in the `WITH` clause.

## Keys

Columns can be marked with the `KEY` keyword to indicate that they are key columns. Key columns constitute the key portion of the row's underlying Kafka record. Only streams can mark columns as keys, and it is optional for them to do so. Tables must use the `PRIMARY KEY` constraint instead.

In this example, `k1`'s data is stored in the key portion of the row, while `v1`'s data is stored in the value.

```sql
CREATE STREAM s3 (
    k1 VARCHAR KEY,
    v1 VARCHAR
) WITH (
    kafka_topic = 's3',
    value_format = 'json'
);
```

Being able to explicitly declare key columns is especially useful when you are creating a stream over an existing topic. If ksqlDB cannot infer what data is in the key of the underlying Kafka record, it must internally perform a repartition of the rows. If you're not sure what data is in the key or you simply don't need it, you can omit the `KEY` keyword.

## Default values

If a column is declared in a schema, but no attribute is present in the underlying Kafka record, the value for the row's column is populated as `null`.

## Pseudocolumns

Pseudocolumns are columns that are automatically populated by ksqlDB.
They contain meta-information that can be inferred about the row at creation time. By default, pseudocolumns are not returned when selecting all columns with the star (`*`) special character. You must select them explicitly, as in:

```sql
SELECT ROWTIME, * FROM s1 EMIT CHANGES;
```

The following table lists all pseudocolumns.

| pseudocolumn | meaning                        |
|--------------|--------------------------------|
| `ROWTIME`    | Row timestamp, inferred from the underlying Kafka record if not overridden. |

You cannot create additional pseudocolumns beyond these.

## Constraints

Although data types help limit the range of values that can be accepted by ksqlDB, sometimes it is useful to have more sophisticated restrictions. Constraints allow you to exercise that type of logic directly in your schema.

### Primary key constraints

In a relational database, a primary key indicates that a column will be used as a unique identifier for all rows in a table. If you have a table with a row whose primary key is `5`, you can't insert another row whose primary key is also `5`.

ksqlDB uses primary keys in a similar way, but there are a few differences because it is an event streaming database, not a relational database.

First, only tables can have primary keys. Streams do not support them. Second, adding multiple rows to a table with the same primary key doesn't cause the subsequent rows to be rejected. The reason for both of these behaviors is the same: the purpose of a table is to model change to a particular identity, whereas streams are used to accrete facts. When you insert multiple rows with the same primary key into a table, ksqlDB interprets those rows as changes to a single identity.

Primary keys cannot be null, and they must be used in all declared tables. In this example, `id` acts as the primary key for table `users`:

```sql
CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    name VARCHAR
  ) WITH (
    kafka_topic = 'users',
    partitions = 3,
    value_format = 'json'
  );
```

### Not-null constraints

A not-null constraint designates that a column cannot contain a null value. ksqlDB doesn't yet support this constraint, but you can track the progress towards it in [GitHub issue 4436](https://github.com/confluentinc/ksql/issues/4436).
\ No newline at end of file diff --git a/docs/reference/sql/syntax/lexical-structure.md b/docs/reference/sql/syntax/lexical-structure.md new file mode 100644 index 000000000000..0f187e0102c0 --- /dev/null +++ b/docs/reference/sql/syntax/lexical-structure.md @@ -0,0 +1,149 @@
---
layout: page
title: Lexical structure
tagline: Structure of SQL commands and statements in ksqlDB
description: Details about SQL commands and statements in ksqlDB
keywords: ksqldb, sql, keyword, identifier, constant, operator
---

SQL is a domain-specific language for managing and manipulating data. It’s primarily used to work with structured data, where the types and relationships across entities are well-defined. Originally adopted for relational databases, SQL is rapidly becoming the language of choice for stream processing. It’s declarative, expressive, and ubiquitous.

The American National Standards Institute (ANSI) maintains a standard for the specification of SQL. SQL-92, the third revision to the standard, is generally the most recognized form of that specification. Beyond the standard, there are many flavors and extensions to SQL so that it can express programs beyond the SQL-92 grammar.
ksqlDB’s SQL grammar was initially built around Presto’s grammar and has been judiciously extended. ksqlDB goes beyond SQL-92 because the standard currently has no constructs for streaming queries, a core aspect of this project.

## Syntax

SQL inputs are made up of a series of commands. Each command is made up of a series of tokens and ends in a semicolon (`;`). The tokens that apply depend on the command being invoked.

A token is any keyword, identifier, backticked identifier, literal, or special character. Tokens are conventionally separated by whitespace unless there is no ambiguity in the grammar. This often happens when tokens flank a special character.

As an example, the following is syntactically valid ksqlDB SQL input:

```sql
INSERT INTO s1 (a, b) VALUES ('k1', 'v1');

CREATE STREAM s2 AS
    SELECT a, b
    FROM s1
    EMIT CHANGES;

SELECT * FROM t1 WHERE k1='foo' EMIT CHANGES;
```

## Keywords

Some tokens, such as `SELECT`, `INSERT`, and `CREATE`, are known as keywords. Keywords are reserved tokens that have a specific meaning in ksqlDB’s syntax. They control their surrounding allowable tokens and execution semantics. Keywords are case insensitive, meaning `SELECT` and `select` are equivalent. You cannot create an identifier that is already a keyword (unless you use backticked identifiers).

A complete list of keywords can be found in the appendix.

## Identifiers

Identifiers are symbols that represent user-space entities, like streams, tables, columns, and other objects. For example, if you have a stream named `s1`, `s1` is an _identifier_ for that stream. By default, identifiers are case-insensitive, meaning `s1` and `S1` refer to the same stream. Under the covers, ksqlDB capitalizes all of the characters in the identifier for all future display purposes.

Unless an identifier is backticked, it may only be composed of characters that are a letter, number, or underscore. There is no imposed limit on the number of characters.

To make it possible to use any character in an identifier, you can surround it in backticks (``` ` ```) when it is declared and used. A _backticked identifier_ is useful when you don't control the data, so it might have special characters, or even keywords. When you use backticked identifiers, the case is captured exactly, and any future references to the identifier become case-sensitive. As an example, if you declare the following stream:

```sql
CREATE STREAM `s1` (
    k VARCHAR KEY,
    `@MY-identifier-stream-column!` INT
) WITH (
    kafka_topic = 's1',
    partitions = 3,
    value_format = 'json'
);
```

You must select from it by backticking the stream name and column name and using the original casing:

```sql
SELECT `@MY-identifier-stream-column!` FROM `s1` EMIT CHANGES;
```

## Constants

There are three implicitly typed constants, or literals, in ksqlDB: strings, numbers, and booleans.

### String constants

A string constant is an arbitrary series of characters surrounded by single quotes (`'`), like `'Hello world'`. To include a quote inside of a string literal, escape the quote by prefixing it with another quote, as in `'You can call me ''Stuart'', or Stu.'`

### Numeric constants

Numeric constants are accepted in these forms:

1. **_`digits`_**
2. **_`digits`_**`.[`**_`digits`_**`][e[+-]`**_`digits`_**`]`
3. `[`**_`digits`_**`].`**_`digits`_**`[e[+-]`**_`digits`_**`]`
4. **_`digits`_**`e[+-]`**_`digits`_**

where **_`digits`_** is one or more single-digit integers (`0` through `9`).
At least one digit must be present before or after the decimal point, if there is one. At least one digit must follow the exponent symbol `e`, if there is one. Spaces, underscores, and any other extraneous characters aren't allowed in the constant.

Numeric constants may also have a `+` or `-` prefix, but this is considered a function applied to the constant, not the constant itself.

Here are some examples of valid numeric constants:

- `5`
- `7.2`
- `0.0087`
- `1.`
- `.5`
- `1e-3`
- `1.332434e+2`
- `+100`
- `-250`

### Boolean constants

A boolean constant is represented as either the identifier `true` or `false`. Boolean constants are not case-sensitive, meaning `true` evaluates to the same value as `TRUE`.

## Operators

Operators are infix functions composed of special characters. A complete list of operators can be found in the appendix. ksqlDB does not allow you to add user-space operators.

## Special characters

Some characters have a particular meaning that doesn't correspond to an operator. The following list describes the special characters and their purpose:

1. Parentheses (`()`) retain their usual meaning in programming languages for grouping expressions and controlling the order of evaluation.
2. Brackets (`[]`) are used to work with arrays, both in their construction and subscript access.
3. Commas (`,`) delineate a discrete list of entities.
4. The semicolon (`;`) terminates a SQL command.
5. The asterisk (`*`), when used in particular syntax, is used as an "all" qualifier. This is seen most commonly in a `SELECT` command to retrieve all columns.
6. The period (`.`) accesses a column in a stream or table.
7. The arrow (`->`) accesses a field in a struct data type.

## Comments

A comment is a string beginning with two dashes. It includes all of the content from the dashes to the end of the line:

```sql
-- Here is a comment.
```

You can also span a comment over multiple lines by using C-style syntax:

```sql
/* Here is
   another comment.
 */
```

## Lexical precedence

Operators are evaluated using the following order of precedence:

1. `*`, `/`, `%`
2. `+`, `-`
3. `=`, `>`, `<`, `>=`, `<=`, `<>`, `!=`
4. `NOT`
5. `AND`
6. `BETWEEN`, `LIKE`, `OR`

In an expression, when two operators have the same precedence level, they're
evaluated left-to-right based on their position.

You can enclose an expression in parentheses to force precedence or clarify
precedence, for example, `(5 + 2) * 3`.
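To make the precedence rules concrete, here's a small, hypothetical example. It assumes a stream named `s1` with numeric columns `a` and `b`:

```sql
-- Multiplication binds tighter than addition, and comparisons bind tighter than AND,
-- so this condition is read as: ((a + (b * 2)) > 10) AND (a <> 0).
SELECT * FROM s1 WHERE a + b * 2 > 10 AND a <> 0 EMIT CHANGES;

-- Parentheses force a different grouping: (a + b) is computed before the multiplication.
SELECT * FROM s1 WHERE (a + b) * 2 > 10 AND a <> 0 EMIT CHANGES;
```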
\ No newline at end of file diff --git a/docs/requirements.txt b/docs/requirements.txt index 9a1d8bdf37de..de6290cd024e 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -7,3 +7,4 @@ Pygments==2.4.2 mkdocs-material==5.1.3 python-dateutil==2.8.1 mkdocs-redirects==1.0.1 +mdx_truly_sane_lists==1.2 diff --git a/mkdocs.yml b/mkdocs.yml index 098e6e67ed8c..4cd6c754773a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -26,13 +26,12 @@ extra_javascript: nav: - Overview: index.md - Getting started: quickstart.md # links to Derek's quickstart at ksqldb.io + - Apache Kafka primer: overview/apache-kafka-primer.md - Concepts: - Concepts: concepts/index.md - Events: concepts/events.md - Collections: - Collections Overview: concepts/collections/index.md - - Streams: concepts/collections/streams.md - - Tables: concepts/collections/tables.md - Inserting events: concepts/collections/inserting-events.md - Stream Processing: concepts/stream-processing.md - Materialized Views: concepts/materialized-views.md @@ -76,6 +75,10 @@ nav: - Control the case of identifiers: how-to-guides/control-the-case-of-identifiers.md - Reference: - Syntax Reference: developer-guide/syntax-reference.md + - The SQL language: + - SQL syntax: + - Lexical structure: reference/sql/syntax/lexical-structure.md + - Data definition: reference/sql/data-definition.md - Statements: - SQL quick reference: developer-guide/ksqldb-reference/quick-reference.md - Statement Index: developer-guide/ksqldb-reference/index.md @@ -153,7 +156,7 @@ nav: - ksqlDB with Embedded Connect: tutorials/embedded-connect.md - Integrate with PostgreSQL: tutorials/connect-integration.md - Troubleshooting: troubleshoot-ksqldb.md - - Frequently Asked Questions: faq.md + - Frequently asked questions: faq.md markdown_extensions: - toc: @@ -167,6 +170,7 @@ markdown_extensions: - mdx_gh_links: user: confluentinc repo: ksqldb + - mdx_truly_sane_lists plugins: - search @@ -175,6 +179,8 @@ plugins: - redirects: redirect_maps: developer-guide/implement-a-udf.md: how-to-guides/create-a-user-defined-function.md + concepts/collections/streams.md: reference/sql/syntax/data-definition.md + concepts/collections/tables.md: reference/sql/syntax/data-definition.md extra: site: From c990aa7a9d7107394b257d41aab8cc514cf9232f Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:25:34 -0700 Subject: [PATCH 02/15] docs: remove old docs --- docs/concepts/collections/streams.md | 92 --------------------------- docs/concepts/collections/tables.md | 93 ---------------------------- 2 files changed, 185 deletions(-) delete mode 100644 docs/concepts/collections/streams.md delete mode 100644 docs/concepts/collections/tables.md diff --git a/docs/concepts/collections/streams.md b/docs/concepts/collections/streams.md deleted file mode 100644 index 254723e61023..000000000000 --- a/docs/concepts/collections/streams.md +++ /dev/null @@ -1,92 +0,0 @@ ---- -layout: page -title: Streams -tagline: Stream collections in ksqlDB -description: Learn about streams of events in ksqlDB. -keywords: ksqldb, collection, stream ---- - -A stream is a durable, partitioned sequence of immutable events. When a new -event is added a stream, it's appended to the partition that its key belongs -to. Streams are useful for modeling a historical sequence of activity. For -example, you might use a stream to model a series of customer purchases or a -sequence of readings from a sensor. Under the hood, streams are simply stored -as {{ site.aktm }} topics with an enforced schema. 
You can create a stream from -scratch or declare a stream on top of an existing {{ site.ak }} topic. In both -cases, you can specify a variety of configuration options. - -Create a stream from scratch ------------------------------- - -When you create a stream from scratch, a backing {{ site.ak }} topic is created -automatically. Use the CREATE STREAM statement to create a stream from scratch, -and give it a name, schema, and configuration options. The following statement -registers a `publications` stream on a topic named `publication_events`. Events -in the `publications` stream are distributed over 3 partitions, are keyed on -the `author` column, and are serialized in the Avro format. - -```sql -CREATE STREAM publications ( - author VARCHAR KEY, - title VARCHAR - ) WITH ( - kafka_topic = 'publication_events', - value_format = 'avro', - partitions = 3 - ); -``` - -In this example, a new stream named `publications` is created with two columns: -`author` and `title`. Both are of type `VARCHAR`. ksqlDB automatically creates -an underlying `publication_events` topic that you can access freely. The topic -has 3 partitions, and any new events that are appended to the stream are hashed -according to the value of the `author` column. Because {{ site.ak }} can store -data in a variety of formats, we let ksqlDB know that we want the value portion -of each row stored in the Avro format. You can use a variety of configuration -options in the final `WITH` clause. - -!!! note - If you create a stream from scratch, you must supply the number of - partitions. - -Create a stream over an existing Kafka topic --------------------------------------------- - -You can also create a stream on top of an existing {{ site.ak }} topic. -Internally, ksqlDB simply registers the topic with the provided schema -and doesn't create anything new. - -```sql -CREATE STREAM publications ( - author VARCHAR KEY, - title VARCHAR - ) WITH ( - kafka_topic = 'publication_events', - value_format = 'avro' - ); -``` - -Because the topic already exists, you do not need to specify the number of partitions. - -It's important that the columns you define match the data in the existing topic. -In this case, the message would need a `KAFKA` serialized `VARCHAR` in the message key -and an `AVRO` serialized record containing a `title` field in the message value. - -If both the `author` and `title` columns are in the message value, you can write: - -```sql -CREATE STREAM publications ( - author VARCHAR, - title VARCHAR - ) WITH ( - kafka_topic = 'publication_events', - value_format = 'avro' - ); -``` - -Notice the `author` column is no longer marked with the `KEY` keyword, so it is now -read from the message value. - -If an underlying event in the {{ site.ak }} topic doesn’t conform to the given -stream schema, the event is discarded at read-time, and an error is added to the -[processing log](../../developer-guide/test-and-debug/processing-log.md). diff --git a/docs/concepts/collections/tables.md b/docs/concepts/collections/tables.md deleted file mode 100644 index 81377a18fd09..000000000000 --- a/docs/concepts/collections/tables.md +++ /dev/null @@ -1,93 +0,0 @@ ---- -layout: page -title: Tables -tagline: Table collections in ksqlDB -description: Learn about tables of events in ksqlDB. -keywords: ksqldb, collection, table ---- - -A table is a durable, partitioned collection that models change over time. -It's the mutable counterpart to the immutable [stream](streams.md). 
By contrast -to streams, which represent a historical sequence of events, tables represent -what is true as of “now”. For example, you might model the locations that -someone has lived at as a stream: first Miami, then New York, then London, -and so forth. You can use a table to roll up this information and tell you -where they live right now. Tables can also be used to materialize a view by -incrementally aggregating a stream of events. - -Tables work by leveraging the keys of each event. Keys are used to denote -identity. If a sequence of events shares a key, the last event for a given key -represents the most up-to-date information. Under the hood, ksqlDB uses Kafka’s -notion of a *compacted topic* to make this work. Compaction is a process that -periodically deletes all but the newest events for each key. For more -information, see -[Log Compaction](https://kafka.apache.org/documentation/#compaction). - -You can create a table from scratch or declare a table on top of an existing -{{ site.aktm }} topic. You can supply a variety of configuration options. In -either case, the table is not *materialized*, which limits its ability to be -queried. Only tables that are derived from other collections are materialized. -For more information, see [Materialized Views](../materialized-views.md). - -Create a table from scratch ---------------------------- - -When you create a table from scratch, a backing compacted {{ site.ak }} topic -is created automatically. Use the -[CREATE TABLE](../../developer-guide/ksqldb-reference/create-table.md) -statement to create a table from scratch, and give it a name, schema, and -configuration options. The following statement registers a `movies` table on a -topic named `movies`. Events in the `movies` table are distributed over 5 -partitions, are keyed on the `title` column, and are serialized in the Avro -format. - -```sql -CREATE TABLE movies ( - title VARCHAR PRIMARY KEY, - release_year INT - ) WITH ( - kafka_topic = 'movies', - value_format = 'avro', - partitions = 5 - ); -``` - -In this example, a new table named `movies` is created with two columns: -`title` and `release_year`. ksqlDB automatically creates an underlying `movies` -topic that you can access freely. The topic has 5 partitions, and any new -events that are integrated into the table are hashed according to the value -of the `title` column. Because {{ site.ak }} can store data in a variety of -formats, we let ksqlDB know that we want the value portion of each row stored -in the Avro format. You can use a variety of configuration options in the final -WITH clause. - -!!! note - If you create a table from scratch, you must supply the number of - partitions. - -Create a table over an existing Kafka topic -------------------------------------------- - -You can also create a table on top of an existing {{ site.ak }} topic. -Internally, ksqlDB simply registers the topic with the provided schema -and doesn't create anything new. - -```sql -CREATE TABLE movies ( - title VARCHAR PRIMARY KEY, - release_year INT - ) WITH ( - kafka_topic = 'movies', - value_format = 'avro' - ); -``` - -Because the topic already exists, you do not need to specify the number of partitions. - -It's important that the columns you define match the data in the existing topic. -In this case, the message would need a `KAFKA` serialized `VARCHAR` in the message key -and an `AVRO` serialized record containing a `release_year` field in the message value. 
- -If an underlying event in the {{ site.ak }} topic doesn’t conform to the given -table schema, the event is discarded at read-time, and an error is added to the -[processing log](../../developer-guide/test-and-debug/processing-log.md). From 38a199eebb7f81dd8313ecdd55e6adbc2d836719 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:26:35 -0700 Subject: [PATCH 03/15] docs: fix redirects --- mkdocs.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 4cd6c754773a..b9c7ddd7d640 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -179,8 +179,8 @@ plugins: - redirects: redirect_maps: developer-guide/implement-a-udf.md: how-to-guides/create-a-user-defined-function.md - concepts/collections/streams.md: reference/sql/syntax/data-definition.md - concepts/collections/tables.md: reference/sql/syntax/data-definition.md + concepts/collections/streams.md: reference/sql/data-definition.md + concepts/collections/tables.md: reference/sql/data-definition.md extra: site: From c10b014d6b1c490fca6a319050a353c5336aebfe Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:28:42 -0700 Subject: [PATCH 04/15] docs: remove dead links --- docs/concepts/collections/index.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/concepts/collections/index.md b/docs/concepts/collections/index.md index 400434286443..d417745fc5a3 100644 --- a/docs/concepts/collections/index.md +++ b/docs/concepts/collections/index.md @@ -35,8 +35,7 @@ defined schema. Only data that conforms to the schema can be added to the collection. ksqlDB supports two abstractions for representing collections: -[streams](streams.md) and [tables](tables.md). Both operate under a simple -key/value model. +streams and tables. Both operate under a simple key/value model. Streams ------- From f0fe45a42492d5cda4e9b679d2a246e784230a8e Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:33:27 -0700 Subject: [PATCH 05/15] docs: fixes more dead links --- .../developer-guide/ksqldb-reference/create-stream-as-select.md | 2 +- docs/developer-guide/ksqldb-reference/create-table-as-select.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/developer-guide/ksqldb-reference/create-stream-as-select.md b/docs/developer-guide/ksqldb-reference/create-stream-as-select.md index 0c81e5e6daa7..b3f2cb86ca39 100644 --- a/docs/developer-guide/ksqldb-reference/create-stream-as-select.md +++ b/docs/developer-guide/ksqldb-reference/create-stream-as-select.md @@ -65,7 +65,7 @@ The key of the resulting stream is determined by the following rules, in order o The projection must include all columns required in the result, including any key columns. -For supported [serialization formats](../developer-guide/serialization.md), +For supported [serialization formats](../../developer-guide/serialization.md), ksqlDB can integrate with [Confluent Schema Registry](https://docs.confluent.io/current/schema-registry/index.html). ksqlDB registers the value schema of the new stream with {{ site.sr }} automatically. The schema is registered under the subject `-value`. 
diff --git a/docs/developer-guide/ksqldb-reference/create-table-as-select.md b/docs/developer-guide/ksqldb-reference/create-table-as-select.md index 86cb46b57dc3..096163522a81 100644 --- a/docs/developer-guide/ksqldb-reference/create-table-as-select.md +++ b/docs/developer-guide/ksqldb-reference/create-table-as-select.md @@ -69,7 +69,7 @@ The primary key of the resulting table is determined by the following rules, in The projection must include all columns required in the result, including any primary key columns. -For supported [serialization formats](../developer-guide/serialization.md), +For supported [serialization formats](../../developer-guide/serialization.md), ksqlDB can integrate with the [Confluent Schema Registry](https://docs.confluent.io/current/schema-registry/index.html). ksqlDB registers the value schema of the new table with {{ site.sr }} automatically. The schema is registered under the subject `-value`. From e7857ad6505a29bbf2ebedfa1bab547b573fa536 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Mon, 17 Aug 2020 16:54:27 -0700 Subject: [PATCH 06/15] docs: adds appendix --- docs/reference/sql/appendix.md | 142 ++++++++++++++++++ .../reference/sql/syntax/lexical-structure.md | 4 +- mkdocs.yml | 1 + 3 files changed, 145 insertions(+), 2 deletions(-) create mode 100644 docs/reference/sql/appendix.md diff --git a/docs/reference/sql/appendix.md b/docs/reference/sql/appendix.md new file mode 100644 index 000000000000..fdbe929f3e48 --- /dev/null +++ b/docs/reference/sql/appendix.md @@ -0,0 +1,142 @@ +## Keywords + +The following table shows all keywords in the language. + +| keyword | description | example | +|--------------|-----------------------------------------|----------------------------------------------------------------------| +| `ADVANCE` | hop size in hopping window | `WINDOW HOPPING (SIZE 30 SECONDS, ADVANCE BY 10 SECONDS)` | +| `ALL` | list hidden topics | `SHOW ALL TOPICS` | +| `AND` | logical "and" operator | `WHERE userid<>'User_1' AND userid<>'User_2'` | +| `ARRAY` | one-indexed array of elements | `SELECT ARRAY[1, 2] FROM s1 EMIT CHANGES;` | +| `AS` | alias a column, expression, or type | | +| `BEGINNING` | print from start of topic | `PRINT FROM BEGINNING;` | +| `BETWEEN` | constrain a value to a range | `SELECT event FROM events WHERE event_id BETWEEN 10 AND 20 …` | +| `BY` | specify expression | `GROUP BY regionid`, `ADVANCE BY 10 SECONDS`, `PARTITION BY userid` | +| `CASE` | select a condition from expressions | `SELECT CASE WHEN condition THEN result [ WHEN … THEN … ] … END` | +| `CAST` | change expression type | `SELECT id, CONCAT(CAST(COUNT(*) AS VARCHAR), '_HELLO') FROM views …`| +| `CHANGES` | specify push query | `SELECT * FROM users EMIT CHANGES;` | +| `CONNECTOR` | manage a connector | `CREATE SOURCE CONNECTOR 'jdbc-connector' WITH( …` | +| `CONNECTORS` | list all connectors | `SHOW CONNECTORS;` | +| `CREATE` | create an object | `CREATE STREAM rock_songs (artist VARCHAR, title VARCHAR) …` | +| `DAY` | time unit of one day for a window | `WINDOW TUMBLING (SIZE 30 SECONDS, RETENTION 1 DAY)` | +| `DAYS` | time unit of days for a window | `WINDOW TUMBLING (SIZE 30 SECONDS, RETENTION 1000 DAYS)` | +| `DECIMAL` | decimal numeric type | | +| `DELETE` | remove a {{ site.ak}} topic | `DROP TABLE DELETE TOPIC;` | +| `DESCRIBE` | list details for an object | `DESCRIBE PAGEVIEWS;` | +| `DROP` | delete an object | `DROP CONNECTOR ;` | +| `ELSE` | condition in `WHEN` statement | `CASE WHEN units<2 THEN 'sm' WHEN units<4 THEN 'med' ELSE 'large' …` | +| 
`EMIT` | specify push query | `SELECT * FROM users EMIT CHANGES;` | +| `END` | close a `CASE` block | `SELECT CASE WHEN condition THEN result [ WHEN … THEN … ] … END` | +| `EXISTS` | test whether object exists | `DROP STREAM IF EXISTS ;` | +| `EXPLAIN` | show execution plan | `EXPLAIN ;` or `EXPLAIN ;` | +| `EXTENDED` | list details for an object | `DESCRIBE EXTENDED ;` | +| `FALSE` | Boolean value of false | | +| `FINAL` | specify pull query | `SELECT * FROM users EMIT FINAL;` | +| `FROM` | specify record source for queries | `SELECT * FROM users;` | +| `FULL` | specify `FULL JOIN` | `CREATE TABLE t AS SELECT * FROM l FULL OUTER JOIN r ON l.ID = r.ID;`| +| `FUNCTION` | list details for a function | `DESCRIBE FUNCTION ;` | +| `FUNCTIONS` | list all functions | `SHOW FUNCTIONS;` | +| `GRACE` | grace period for a tumbling window | `WINDOW TUMBLING (SIZE 1 HOUR, GRACE PERIOD 2 HOURS)` | +| `GROUP` | group rows with the same values | `SELECT regionid, COUNT(*) FROM pageviews GROUP BY regionid` | +| `HAVING` | condition expression | `GROUP BY card_number HAVING COUNT(*) > 3` | +| `HOPPING` | specify a hopping window | `WINDOW HOPPING (SIZE 30 SECONDS, ADVANCE BY 10 SECONDS)` | +| `HOUR` | time unit of one hour for a window | `WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 1 DAY)` | +| `HOURS` | time unit of hours for a window | `WINDOW TUMBLING (SIZE 2 HOURS, RETENTION 1 DAY)` | +| `IF` | test whether object exists | `DROP STREAM IF EXISTS ;` | +| `IN` | specify multiple values | `WHERE name IN (value1, value2, ...)` | +| `INNER` | specify `INNER JOIN` | `CREATE TABLE t AS SELECT * FROM l INNER JOIN r ON l.ID = r.ID;` | +| `INSERT` | insert new records in a stream/table | `INSERT INTO ...` | +| `INTEGER` | integer numeric type | `CREATE TABLE profiles (id INTEGER PRIMARY KEY, …` | +| `INTERVAL` | number of messages to skip in `PRINT` | `PRINT INTERVAL 5;` | +| `INTO` | stream/table to insert values | `INSERT INTO stream_name ...` | +| `IS` | | | +| `JOIN` | match records in streams/tables | `CREATE TABLE t AS SELECT * FROM l INNER JOIN r ON l.ID = r.ID;` | +| `KEY` | specify key column | `CREATE TABLE users (userId INT PRIMARY KEY, …` | +| `LEFT` | specify `LEFT JOIN` | `CREATE TABLE t AS SELECT * FROM l LEFT JOIN r ON l.ID = r.ID;` | +| `LIKE` | match pattern | `WHERE UCASE(gender)='FEMALE' AND LCASE (regionid) LIKE '%_6'` | +| `LIMIT` | number of records to output | `SELECT * FROM users EMIT CHANGES LIMIT 5;` | +| `LIST` | list objects | `SHOW STREAMS;` | +| `MAP` | `map` data type | `SELECT MAP(k1:=v1, k2:=v1*2) FROM s1 EMIT CHANGES;` | +| `MILLISECOND` | time unit of one ms for a window | `WINDOW TUMBLING (SIZE 1 MILLISECOND, RETENTION 1 DAY)` | +| `MILLISECONDS` | time unit of ms for a window | `WINDOW TUMBLING (SIZE 100 MILLISECONDS, RETENTION 1 DAY)` | +| `MINUTE` | time unit of one min for a window | `WINDOW TUMBLING (SIZE 1 MINUTE, RETENTION 1 DAY)` | +| `MINUTES` | time unit of mins for a window | `WINDOW TUMBLING (SIZE 30 MINUTES, RETENTION 1 DAY)` | +| `MONTH` | time unit of one month for a window | `WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 1 MONTH)` | +| `MONTHS` | time unit of months for a window | `WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 2 MONTHs)` | +| `NOT` | logical "not" operator | | +| `NULL` | field with no value | | +| `ON` | specify join criteria | `LEFT JOIN users ON pageviews.userid = users.userid` | +| `OR` | logical "or" operator | `WHERE userid='User_1' OR userid='User_2'` | +| `OUTER` | specify `OUTER JOIN` | `CREATE TABLE t AS SELECT * FROM l FULL OUTER JOIN r ON l.ID = 
r.ID;`| +| `PARTITION` | repartition a stream | `PARTITION BY ` | +| `PARTITIONS` | partitions to distribute keys over | `CREATE STREAM users_rekeyed WITH (PARTITIONS=6) AS …` | +| `PERIOD` | grace period for a tumbling window | `WINDOW TUMBLING (SIZE 1 HOUR, GRACE PERIOD 2 HOURS)` | +| `PRIMARY` | specify primary key column | `CREATE TABLE users (userId INT PRIMARY KEY, …` | +| `PRINT` | output records in a topic | `PRINT FROM BEGINNING;` | +| `PROPERTIES` | list all properties | `SHOW PROPERTIES;` | +| `QUERIES` | list all queries | `SHOW QUERIES;` | +| `REPLACE` | string replace | `REPLACE(col1, 'foo', 'bar')` | +| `RETENTION` | time to retain past windows | `WINDOW TUMBLING (SIZE 30 SECONDS, RETENTION 1000 DAYS)` | +| `RIGHT` | | | +| `RUN` | execute queries from a file | `RUN SCRIPT ;` | +| `SCRIPT` | execute queries from a file | `RUN SCRIPT ;` | +| `SECOND` | time unit of one sec for a window | `WINDOW TUMBLING (SIZE 1 SECOND, RETENTION 1 DAY)` | +| `SECONDS` | time unit of secs for a window | `WINDOW TUMBLING (SIZE 30 SECONDS, RETENTION 1 DAY)` | +| `SELECT` | query a stream or table | | +| `SESSION` | specify a session window | `WINDOW SESSION (60 SECONDS)` | +| `SET` | assign a property value | `SET 'auto.offset.reset'='earliest';` | +| `SHOW` | list objects | `SHOW FUNCTIONS;` | +| `SINK` | create a sink connector | `CREATE SINK CONNECTOR …` | +| `SIZE` | time length of a window | `WINDOW TUMBLING (SIZE 5 SECONDS)` | +| `SOURCE` | create a source connector | `CREATE SOURCE CONNECTOR …` | +| `STREAM` | register a stream on a topic | `CREATE STREAM users_orig AS SELECT * FROM users EMIT CHANGES;` | +| `STREAMS` | list all streams | `SHOW STREAMS;` | +| `STRUCT` | struct data type | `SELECT STRUCT(f1 := v1, f2 := v2) FROM s1 EMIT CHANGES;` | +| `TABLE` | register a table on a topic | `CREATE TABLE users (id BIGINT PRIMARY KEY, …` | +| `TABLES` | list all tables | `SHOW TABLES;` | +| `TERMINATE` | end a persistent query | `TERMINATE query_id;` | +| `THEN` | return expression in a CASE block | `CASE WHEN units<2 THEN 'sm' WHEN units<4 THEN 'med' ELSE 'large' …` | +| `TIMESTAMP` | specify a timestamp column | `CREATE STREAM pageviews WITH (TIMESTAMP='viewtime', …` | +| `TOPIC` | specify {{site.ak}} topic to delete | `DROP TABLE DELETE TOPIC;` | +| `TOPICS` | list all streams | `SHOW TOPICS;` | +| `TRUE` | Boolean value of true | | +| `TUMBLING` | specify a tumbling window | `WINDOW TUMBLING (SIZE 5 SECONDS)` | +| `TYPE` | alias a complex type declaration | `CREATE TYPE AS ;` | +| `TYPES` | list all custom type aliases | `SHOW TYPES;` | +| `UNSET` | unassign a property value | `UNSET 'auto.offset.reset';` | +| `VALUES` | list of values to insert | `INSERT INTO foo VALUES ('key', 'A');` | +| `WHEN` | specify condition in a `CASE` block | `SELECT CASE WHEN condition THEN result [ WHEN … THEN … ] …` | +| `WHERE` | filter records by a condition | `SELECT * FROM pageviews WHERE pageid < 'Page_20'` | +| `WINDOW` | groups rows with the same keys | `SELECT userid, COUNT(*) FROM users WINDOW SESSION (60 SECONDS) …` | +| `WITH` | specify object creation params | `CREATE STREAM pageviews WITH (TIMESTAMP='viewtime', …` | +| `WITHIN` | time range in a windowed join | `SELECT * FROM impressions i JOIN clicks c WITHIN 1 minute …` | +| `YEAR` | time unit of one year for a window | `WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 1 YEAR)` | +| `YEARS` | time unit of years for a window | `WINDOW TUMBLING (SIZE 1 HOUR, RETENTION 2 YEARS)` | + +## Operators + +The following table shows all operators in the 
language. + +| operator | meaning | applies to +|--------------|--------------------------------|----------------- +| `=` | is equal to | string, numeric +| `!=` or `<>` | is not equal to | string, numeric +| `<` | is less than | string, numeric +| `<=` | is less than or equal to | string, numeric +| `>` | is greater than | string, numeric +| `>=` | is greater than or equal to | string, numeric +| `+` | addition for numeric, concatenation for string | string, numeric +| `-` | subtraction | numeric +| `*` | multiplication | numeric +| `/` | division | numeric +| `%` | modulus | numeric +| `||` or `+` | concatenation | string +| `:=` | assignment | all +| `->` | struct field dereference | struct +| `.` | source dereference | table, stream +| `E` or `e` | exponent | numeric +| `NOT` | logical NOT | boolean +| `AND` | logical AND | boolean +| `OR` | logical OR | boolean +| `BETWEEN` | test if value within range | numeric, string +| `LIKE` | match a pattern | string \ No newline at end of file diff --git a/docs/reference/sql/syntax/lexical-structure.md b/docs/reference/sql/syntax/lexical-structure.md index 0f187e0102c0..60e575cead9c 100644 --- a/docs/reference/sql/syntax/lexical-structure.md +++ b/docs/reference/sql/syntax/lexical-structure.md @@ -35,7 +35,7 @@ SELECT * FROM t1 WHERE k1='foo' EMIT CHANGES; Some tokens, such as `SELECT`, `INSERT`, and `CREATE`, are known as keywords. Keywords are reserved tokens that have a specific meaning in ksqlDB’s syntax. They control their surrounding allowable tokens and execution semantics. Keywords are case insensitive, meaning `SELECT` and `select` are equivalent. You cannot create an identifier that is already a keyword (unless you use backticked identifiers). -A complete list of keywords can be found in the appendix. +A complete list of keywords can be found in the [appendix](../appendix.md#keywords). ## Identifiers @@ -101,7 +101,7 @@ A boolean constant is represented as either the identifer `true` or `false`. Boo ## Operators -Operators are infix functions composed of special characters. A complete list of operators can be found in the appendix. ksqlDB does not allow you to add user-space operators. +Operators are infix functions composed of special characters. A complete list of operators can be found in the [appendix](../appendix.md#operators). ksqlDB does not allow you to add user-space operators. 
## Special characters diff --git a/mkdocs.yml b/mkdocs.yml index b9c7ddd7d640..97c4a7b365c5 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -79,6 +79,7 @@ nav: - SQL syntax: - Lexical structure: reference/sql/syntax/lexical-structure.md - Data definition: reference/sql/data-definition.md + - Appendix: reference/sql/appendix.md - Statements: - SQL quick reference: developer-guide/ksqldb-reference/quick-reference.md - Statement Index: developer-guide/ksqldb-reference/index.md From 52cee8f2674b5558b17c9da6897f3f6f90a62b26 Mon Sep 17 00:00:00 2001 From: Jim Galasyn Date: Tue, 18 Aug 2020 11:00:24 -0700 Subject: [PATCH 07/15] docs: copy edit new ddl and lexical structure topics (DOCS-5143) (#6046) * docs: copy edit new sql reference topics * docs: copy edit lexical structure topic --- docs/reference/sql/appendix.md | 8 + docs/reference/sql/data-definition.md | 149 ++++++++++++++---- .../reference/sql/syntax/lexical-structure.md | 112 +++++++++---- 3 files changed, 204 insertions(+), 65 deletions(-) diff --git a/docs/reference/sql/appendix.md b/docs/reference/sql/appendix.md index fdbe929f3e48..f83a38687b9f 100644 --- a/docs/reference/sql/appendix.md +++ b/docs/reference/sql/appendix.md @@ -1,3 +1,11 @@ +--- +layout: page +title: ksqlDB SQL keywords and operators +tagline: SQL language keywords +description: Tables listing all valid keywords and operators in ksqlDB SQL +keywords: ksqldb, sql, keyword, operators +--- + ## Keywords The following table shows all keywords in the language. diff --git a/docs/reference/sql/data-definition.md b/docs/reference/sql/data-definition.md index 32ebfc27eef0..51f751ccda3b 100644 --- a/docs/reference/sql/data-definition.md +++ b/docs/reference/sql/data-definition.md @@ -6,29 +6,57 @@ description: How to use DDL to structure data in ksqlDB keywords: ksqldb, sql, ddl --- -This section covers how you create the structures that store your events. ksqlDB abstracts events as rows with columns and stores them in streams and tables. +This section covers how you create the structures that store your events. +ksqlDB abstracts events as rows with columns and stores them in streams +and tables. ## Rows and columns -Streams and tables help you model collections of events that accrete over time. Both are represented as a series of rows and columns with a schema, much like a relational database table. Rows represent individual events. Columns represent the attributes of those events. - -Each column has a data type. The data type limits the span of permissible values that it can take on. For example, if a column is declared as type `INT`, it cannot take on the value of string `'foo'`. - -In contrast to relational database tables, the columns of a row in ksqlDB are divided into "key" and "value" columns. The key columns control which partition a row resides in. The value columns, by convention, are used to store the main data of interest. Being able to control the key columns is useful for manipulating the underlying data locality, and generally allows you to integrate with the wider Kafka ecosystem, which uses the same key/value data model. By default, a column is a value column. Marking a column as a `(PRIMARY) KEY` makes it a key column. - -Internally, each row is backed by a [Kafka record](../../../overview/apache-kafka-primer/#records). In Kafka, the key and value parts of a record are independently [serialized](../../../overview/apache-kafka-primer/#serializers). 
ksqlDB allows you to exercise that same flexibility, and generally builds on the semantics of Kafka records, rather than hiding them. - -There is no theoretical limit on the number of columns in a stream or table. In practice, the limit is determined by the maximum message size that Kafka can store and the resources dedicated to ksqlDB. +Streams and tables help you model collections of events that accrete over time. +Both are represented as a series of rows and columns with a schema, much like a +relational database table. Rows represent individual events. Columns represent +the attributes of those events. + +Each column has a data type. The data type limits the span of permissible values +that you can assign. For example, if a column is declared as type `INT`, it can't +be assigned the value of string `'foo'`. + +In contrast to relational database tables, the columns of a row in ksqlDB are +divided into _key_ and _value_ columns. The key columns control which partition +a row resides in. The value columns, by convention, store the main data of +interest. Controlling the key columns is useful for manipulating the underlying +data locality, and enables you to integrate with the wider {{ site.ak }} +ecosystem, which uses the same key/value data model. By default, a column is a +value column. Marking a column as a `(PRIMARY) KEY` makes it a key column. + +Internally, each row is backed by a [Kafka record](../../../overview/apache-kafka-primer/#records). +In {{ site.ak }}, the key and value parts of a record are +[serialized](../../../overview/apache-kafka-primer/#serializers) independently. +ksqlDB enables you to exercise this same flexibility and builds on the semantics +of {{ site.ak }} records, rather than hiding them. + +There is no theoretical limit on the number of columns in a stream or table. +In practice, the limit is determined by the maximum message size that {{ site.ak }} +can store and the resources dedicated to ksqlDB. ## Streams -Streams are partitioned, immutable, append-only collections. They represent a series of historical facts. For example, the rows of a stream could model a sequence of financial transactions, like "Alice sent $100 to Bob”, then “Charlie sent $50 to Bob". +A stream is a partitioned, immutable, append-only collection that represents a +series of historical facts. For example, the rows of a stream could model a +sequence of financial transactions, like "Alice sent $100 to Bob", followed by +"Charlie sent $50 to Bob". -Once a row is inserted into a stream, it can never change. New rows can be appended at the end of the stream, but existing rows can never be updated or deleted. +Once a row is inserted into a stream, it can never change. New rows can be +appended at the end of the stream, but existing rows can never be updated or +deleted. -Each row is stored in a particular partition. Every row, implicitly or explicitly, has a key that represents its identity. All rows with the same key reside in the same partition. +Each row is stored in a particular partition. Every row, implicitly or explicitly, +has a key that represents its identity. All rows with the same key reside in the +same partition. -To create a stream, use the `CREATE STREAM` command. In this command, you specify a name for the new stream, the names of the columns, and the data type of each column. +To create a stream, use the `CREATE STREAM` command. The following example +statement specifies a name for the new stream, the names of the columns, and +the data type of each column. 
```sql CREATE STREAM s1 ( @@ -42,11 +70,20 @@ CREATE STREAM s1 ( ); ``` -This creates a new stream named `s1` with three columns: `k`, `v1`, and `v2`. The column `k` is designated as the key of this stream, which controls which partition each row is stored in. When the data is stored, the value portion of each row's underlying Kafka record will be serialized in the JSON format. +This creates a new stream named `s1` with three columns: `k`, `v1`, and `v2`. +The column `k` is designated as the key of this stream, which controls the +partition that each row is stored in. When the data is stored, the value +portion of each row's underlying {{ site.ak }} record is serialized in the +JSON format. -Under the covers, each stream corresponds to a [Kafka topic](../../../overview/apache-kafka-primer/#topics) with a registered schema. If the backing topic for a stream doesn't exist when you declare it, ksqlDB creates it on your behalf, as in the example above. +Under the hood, each stream corresponds to a [Kafka topic](../../../overview/apache-kafka-primer/#topics) +with a registered schema. If the backing topic for a stream doesn't exist when +you declare it, ksqlDB creates it on your behalf, as shown in the previous +example statement. -You can also declare a stream on top of an existing topic. When you do that, ksqlDB simply registers its associated schema. If topic `s2` already exists, this command will register a new stream over it: +You can also declare a stream on top of an existing topic. When you do that, +ksqlDB simply registers its associated schema. If topic `s2` already exists, +the following statement register a new stream over it: ```sql CREATE STREAM s2 ( @@ -58,17 +95,29 @@ CREATE STREAM s2 ( ); ``` -Note that when you create a stream on an existing topic, you don't need to declare the number of partitions in it. ksqlDB simply infers the partition count from the existing topic. +!!! tip + When you create a stream on an existing topic, you don't need to declare + the number of partitions for the topic. ksqlDB infers the partition count + from the existing topic. ## Tables -Tables are mutable, partitioned collections that model change over time. By contrast to streams, which represent a historical sequence of events, tables represent what is true as of "now". For example, you might use a table to model the locations that someone has lived at as a stream: first Miami, then New York, then London, and so forth. +A table is a mutable, partitioned collection that models change over time. In +contrast with a stream, which represents a historical sequence of events, a +table represents what is true as of "now". For example, you might use a table +to model the locations where someone has lived as a stream: first Miami, then +New York, then London, and so forth. -Tables work by leveraging the keys of each row. If a sequence of rows shares a key, the last row for a given key represents the most up-to-date information for that key's identity. A background process periodically runs and deletes all but the newest rows for each key. +Tables work by leveraging the keys of each row. If a sequence of rows shares a +key, the last row for a given key represents the most up-to-date information +for that key's identity. A background process periodically runs and deletes all +but the newest rows for each key. -Here is what declaring a table looks like in code. Syntactically, it is almost the same as declaring a stream. +Syntactically, declaring a table is similar to declaring a stream. 
The following +example statement declares a `current_location` table that has a key field +named `person`. -```sql +```sql hl_lines="2" CREATE TABLE current_location ( person VARCHAR PRIMARY KEY, location VARCHAR @@ -79,13 +128,18 @@ CREATE TABLE current_location ( ); ``` -As with streams, tables can also be declared directly ontop of an existing Kafka topic. Simply omit the number of partitions in the `WITH` clause. +As with a stream, you can declare a table directly on top of an existing +{{ site.ak }} topic by omitting the number of partitions in the `WITH` clause. ## Keys -Columns can be marked with the `KEY` keyword to indicate that they are key columns. Key columns constitute the key portion of the row's underlying Kafka record. Only streams can mark columns as keys, and it is optional for them to do do. Tables must use the `PRIMARY KEY` constraint instead. +You can mark a column with the `KEY` keyword to indicate that it's a key +column. Key columns constitute the key portion of the row's underlying +{{ site.ak }} record. Only streams can mark columns as keys, and it's optional +for them to do do. Tables must use the `PRIMARY KEY` constraint instead. -In this example, `k1`'s data is stored in the key portion of the row, while `v1`'s data is stored in the value. +In the following example statement, `k1`'s data is stored in the key portion of +the row, and `v1`'s data is stored in the value. ```sql CREATE STREAM s3 ( @@ -97,15 +151,25 @@ CREATE STREAM s3 ( ); ``` -Being able to explicitly declare key columns is especially useful when you are creating a stream over an existing topic. If ksqlDB cannot infer what data is in the key of the underlying Kafka record, it must internally perform a repartition of the rows. If you're not sure what data is in the key or you simply don't need it, you can omit the `KEY` keyword. +The ability to declare key columns explicitly is especially useful when you're +creating a stream over an existing topic. If ksqlDB can't infer what data is in +the key of the underlying {{ site.ak }} record, it must perform a repartition +of the rows internally. If you're not sure what data is in the key or you simply +don't need it, you can omit the `KEY` keyword. ## Default values -If a column is declared in a schema, but no attribute is present in the underlying Kafka record, the value for the row's column is populated as `null`. +If a column is declared in a schema, but no attribute is present in the +underlying {{ site.ak }} record, the value for the row's column is populated as +`null`. ## Pseudocolumns -Pseudocolumns are columns that are automatically populated by ksqlDB. They contain meta-information that can be infered about the row at creation time. By default, pseudocolumns are not returned when selecting all columns with the star (`*`) special character. You must select them explicitly, as in: +A pseudocolumn is a column that's automatically populated by ksqlDB and contains +meta-information that can be inferred about the row at creation time. By default, +pseudocolumns aren't returned when selecting all columns with the star (`*`) +special character. You must select them explicitly, as shown in the following +example statement. ```sql SELECT ROWTIME, * FROM s1 EMIT CHANGES; @@ -117,21 +181,34 @@ The following table lists all pseudocolumns. |--------------|--------------------------------| | `ROWTIME` | Row timestamp, inferred from the underlying Kafka record if not overridden. | -You cannot create additional pseudocolumns beyond these. 
+You can't create additional pseudocolumns beyond these. ## Constraints -Although data types help limit the range of values that can be accepted by ksqlDB, sometimes it is useful to have more sophisticated restrictions. Constraints allow you to exercise that type of logic directly in your schema. +Although data types help limit the range of values that can be accepted by +ksqlDB, sometimes it's useful to have more sophisticated restrictions. +_Constraints_ enable you to exercise this type of logic directly in your schema. ### Primary key constraints -In a relational database, a primary key indicates that a column will be used as a unique identifier for all rows in a table. If you have a table with a row in it who's primary key is `5`, you can't insert another row whose primary key is also `5`. +In a relational database, a primary key indicates that a column will be used as +a unique identifier for all rows in a table. If you have a table that has a row +with primary key `5`, you can't insert another row whose primary key is also `5`. -ksqlDB uses primary keys in a similar way, but there are a few differences because it is an event streaming database, not a relational database. +ksqlDB uses primary keys in a similar way, but there are a few differences, +because ksqlDB is an event streaming database, not a relational database. -First, only tables can have primary keys. Streams do not support them. Second, adding multiple rows to a table with the same primary key doesn't cause the subsequent rows to be rejected. The reason for both of these behaviors is the same: the purpose of tables are to model change of particular identities, whereas streams are used to accrete facts. When you insert multiple rows to a table with the same primary key, ksqlDB inteprets those rows as changes to a single identity. +- Only tables can have primary keys. Streams do not support them. +- Adding multiple rows to a table with the same primary key doesn't cause the + subsequent rows to be rejected. + +The reason for both of these behaviors is the same: the purpose of tables is to +model change of particular identities, but streams are used to accrete facts. +When you insert multiple rows with the same primary key into a table, ksqlDB +interprets these rows as changes to a single identity. -Primary keys cannot be null, and they must be used in all declared tables. In this example, `id` acts as the primary key for table `users`: +Primary keys can't be null, and they must be used in all declared tables. In +the following example statement, `id` acts as the primary key for table `users`: ```sql CREATE TABLE users ( @@ -146,4 +223,6 @@ CREATE TABLE users ( ### Not-null constraints -A not-null constraint designates that a column cannot contain a null value. ksqlDB doesn't yet support this constraint, but you can track the progress towards it in [GitHub issue 4436](https://github.com/confluentinc/ksql/issues/4436). \ No newline at end of file +A _not-null constraint_ designates that a column can't contain a null value. +ksqlDB doesn't support this constraint, but you can track its progress in +[GitHub issue 4436](https://github.com/confluentinc/ksql/issues/4436). 
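To make the primary-key behavior described above concrete, here is a sketch against the `users` table from the earlier example; the `name` value column is assumed for illustration and isn't part of the original declaration:

```sql
-- Both rows carry the primary key 5, so the second statement isn't
-- rejected. ksqlDB interprets it as a change to the same identity, and
-- the table's current value for key 5 becomes 'Alicia'.
INSERT INTO users (id, name) VALUES (5, 'Alice');
INSERT INTO users (id, name) VALUES (5, 'Alicia');
```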
\ No newline at end of file diff --git a/docs/reference/sql/syntax/lexical-structure.md b/docs/reference/sql/syntax/lexical-structure.md index 60e575cead9c..627177b08589 100644 --- a/docs/reference/sql/syntax/lexical-structure.md +++ b/docs/reference/sql/syntax/lexical-structure.md @@ -6,19 +6,33 @@ description: Details about SQL commands and statements in ksqlDB keywords: ksqldb, sql, keyword, identifier, constant, operator --- -SQL is a domain-specific language for managing and manipulating data. It’s primarily used to work with structured data, where the types and relationships across entities are well-defined. Originally adopted for relational databases, SQL is rapidly becoming the language of choice for stream processing. It’s declarative, expressive, and ubiquitous. - -The American National Standards Institute (ANSI) maintains a standard for the specification of SQL. SQL-92, the third revision to the standard, is generally the most recognized form of that specification. Beyond the standard, there are many flavors and extensions to SQL so that it can express programs beyond the SQL-92 grammar. - -ksqlDB’s SQL grammar was initially built around Presto’s grammar and has been judiciously extended. ksqlDB goes beyond SQL-92 because the standard currently has no constructs for streaming queries, a core aspect of this project. +SQL is a domain-specific language for managing and manipulating data. It’s +used primarily to work with structured data, where the types and relationships +across entities are well-defined. Originally adopted for relational databases, +SQL is rapidly becoming the language of choice for stream processing. It’s +declarative, expressive, and ubiquitous. + +The American National Standards Institute (ANSI) maintains a standard for the +specification of SQL. SQL-92, the third revision to the standard, is generally +the most recognized form of the specification. Beyond the standard, there are +many flavors and extensions to SQL so that it can express programs beyond +what's possible with the SQL-92 grammar. + +ksqlDB’s SQL grammar was built initially around Presto's grammar and has been +extended judiciously. ksqlDB goes beyond SQL-92, because the standard currently +has no constructs for streaming queries, which are a core aspect of this project. ## Syntax -SQL inputs are made up of a series of commands. Each command is made up of a series of tokens and ends in a semicolon (`;`). The tokens that apply depend on the command being invoked. +SQL inputs are made up of a series of statements. Each statements is made up of +a series of tokens and ends in a semicolon (`;`). The tokens that apply depend +on the statement being invoked. -A token is any keyword, identifier, backticked identifier, literal, or special character. Tokens are conventionally separated by whitespace unless there is no ambiguity in the grammar. This often happens when tokens flank a special character. +A token is any keyword, identifier, backticked identifier, literal, or special +character. By convention, tokens are separated by whitespace, unless there is +no ambiguity in the grammar. This happens when tokens flank a special character. -As an example, the following is syntactically valid ksqlDB SQL input: +The following example statements are syntactically valid ksqlDB SQL input: ```sql INSERT INTO s1 (a, b) VALUES ('k1', 'v1'); @@ -33,17 +47,34 @@ SELECT * FROM t1 WHERE k1='foo' EMIT CHANGES; ## Keywords -Some tokens, such as `SELECT`, `INSERT`, and `CREATE`, are known as keywords. 
Keywords are reserved tokens that have a specific meaning in ksqlDB’s syntax. They control their surrounding allowable tokens and execution semantics. Keywords are case insensitive, meaning `SELECT` and `select` are equivalent. You cannot create an identifier that is already a keyword (unless you use backticked identifiers). +Some tokens, such as `SELECT`, `INSERT`, and `CREATE`, are _keywords_. +Keywords are reserved tokens that have a specific meaning in ksqlDB's syntax. +They control their surrounding allowable tokens and execution semantics. +Keywords are case insensitive, meaning `SELECT` and `select` are equivalent. +You can't create an identifier that is already a keyword, unless you use +backticked identifiers. A complete list of keywords can be found in the [appendix](../appendix.md#keywords). ## Identifiers -Identifiers are symbols that represent user-space entities, like streams, tables, columns, and other objects. For example, if you have a stream named `s1`, `s1` is an _identifier_ for that stream. By default, identifiers are case-insensitive, meaning `s1` and `S1` refer to the same stream. Under the covers, ksqlDB will capitalize all of the characters in the identifier for all future display purposes. +Identifiers are symbols that represent user-space entities, like streams, +tables, columns, and other objects. For example, if you have a stream named +`s1`, `s1` is an _identifier_ for that stream. By default, identifiers are +case-insensitive, meaning `s1` and `S1` refer to the same stream. Under the +hood, ksqlDB capitalizes all of the characters in the identifier for all +future display purposes. -Unless an identifier is backticked, it may only be composed of characters that are a letter, number, or underscore. There is no imposed limit on the number of characters. +Unless an identifier is backticked, it may be composed only of characters that +are a letter, number, or underscore. There is no imposed limit on the number of +characters. -To make it possible to use any character in an identifier, you can surround it in backticks (``` ` ```) when it is declared and used. A _backticked identifier_ is useful when you don't control the data, so it might have special characters, or even keywords. When you use backticked identifers, the case is captured exactly, and any future references to the identifer become case-sensitive. As an example, if you declare the following stream: +To make it possible to use any character in an identifier, you can enclose it +in backticks (``` ` ```) when you declare and use it. A _backticked identifier_ +is useful when you don't control the data, so it might have special characters, +or even keywords. When you use backticked identifers, ksqlDB captures the case +exactly, and any future references to the identifer become case-sensitive. For +example, if you declare the following stream: ```sql CREATE STREAM `s1` ( @@ -56,7 +87,8 @@ CREATE STREAM `s1` ( ); ``` -You must select from it by backticking the stream name and column name and using the original casing: +You must select from it by backticking the stream name and column name and +using the original casing: ```sql SELECT `@MY-identifier-stream-column!` FROM `s3` EMIT CHANGES; @@ -64,24 +96,33 @@ SELECT `@MY-identifier-stream-column!` FROM `s3` EMIT CHANGES; ## Constants -There are three implicitly typed constants, or literals, in ksqlDB: strings, numbers, and booleans. +There are three implicitly typed constants, or literals, in ksqlDB: strings, +numbers, and booleans. 
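Before the details in the following sections, here is a sketch of a single query that uses all three kinds of constants; the stream `s1` and its value columns are hypothetical:

```sql
-- 'foo' is a string constant, 42 and 1.5e2 are numeric constants,
-- and true is a boolean constant.
SELECT *
  FROM s1
  WHERE k1 = 'foo' AND v1 > 42 AND v2 < 1.5e2 AND v3 = true
  EMIT CHANGES;
```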
### String constants -A string constant is an arbitrary series of characters surrounded by single quotes (`'`), like `'Hello world'`. To include a quote inside of a string literal, escape the quote by prefixing it with another quote, as in `'You can call me ''Stuart'', or Stu.'` +A string constant is an arbitrary series of characters surrounded by single +quotes (`'`), like `'Hello world'`. To include a quote inside of a string +literal, escape the quote by prefixing it with another quote, for example +`'You can call me ''Stuart'', or Stu.'` ### Numeric constants -Numeric constants are accepted in these forms: +Numeric constants are accepted in the following forms: 1. **_`digits`_** 2. **_`digits`_**`.[`**_`digits`_**`][e[+-]`**_`digits`_**`]` 3. `[`**_`digits`_**`].`**_`digits`_**`[e[+-]`**_`digits`_**`]` 4. **_`digits`_**`e[+-]`**_`digits`_** -where **_`digits`_** is one or more single-digit integers (`0` through `9`). At least one digit must be present before or after the decimal point, if there is one. At least one digit must follow the exponent symbol `e`, if there is one. Spaces and underscores (nor any other characters) are allowed in the constant. +where **_`digits`_** is one or more single-digit integers (`0` through `9`). -Numeric constants may also have a `+` or `-` prefix, but this considered a function applied to the constant, not the constant itself. +- At least one digit must be present before or after the decimal point, if + there is one. +- At least one digit must follow the exponent symbol `e`, if there is one. +- Spaces and underscores (nor any other characters) are allowed in the constant. +- Numeric constants may also have a `+` or `-` prefix, but this considered to + be a function applied to the constant, not the constant itself. Here are some examples of valid numeric constants: @@ -97,27 +138,38 @@ Here are some examples of valid numeric constants: ### Boolean constants -A boolean constant is represented as either the identifer `true` or `false`. Boolean constants are not case-sensitive, meaning `true` evaluates to the same value as `TRUE`. +A boolean constant is represented as either the identifer `true` or `false`. +Boolean constants are not case-sensitive, meaning `true` evaluates to the same +value as `TRUE`. ## Operators -Operators are infix functions composed of special characters. A complete list of operators can be found in the [appendix](../appendix.md#operators). ksqlDB does not allow you to add user-space operators. +Operators are infix functions composed of special characters. A complete list +of operators can be found in the [appendix](../appendix.md#operators). ksqlDB +doesn't allow you to add user-space operators. ## Special characters -Some characters have a particular meaning that doesn't correspond to an operator. The following list describes the special characters and their purpose: - -1. Parentheses (`()`) retain their usual meaning in programming languages for grouping expressions and controlling the order of evaluation. -2. Brackets (`[]`) are used to work with arrays, both in their construction and subscript access. -3. Commas (`,`) delineate a discrete list of entities. -4. The semi-colons (`;`) terminates a SQL command. -5. The asterisk (`*`), when used in particular syntax, is used as an "all" qualifier. This is seen most commonly in a `SELECT` command to retrieve all columns. -6. The period (`.`) accesses a column in a stream or table. -7. The arrow (`->`) accesses a field in a struct data type. 
+Some characters have a particular meaning that doesn't correspond to an +operator. The following list describes the special characters and their +purpose. + +- Parentheses (`()`) retain their usual meaning in programming languages for + grouping expressions and controlling the order of evaluation. +- Brackets (`[]`) are used to work with arrays, both in their construction and + subscript access. +- Commas (`,`) delineate a discrete list of entities. +- The semi-colons (`;`) terminates a SQL command. +- The asterisk (`*`), when used in particular syntax, is used as an "all" + qualifier. This is seen most commonly in a `SELECT` command to retrieve all + columns. +- The period (`.`) accesses a column in a stream or table. +- The arrow (`->`) accesses a field in a struct data type. ## Comments -A comment is a string beginning with twos dashes. It includes all of the content from the dashes to the end of the line: +A comment is a string beginning with twos dashes. It includes all of the +content from the dashes to the end of the line: ```sql -- Here is a comment. From f489b4630c72a2cba0ab7b6e6c93c9a79b23adff Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 14:57:53 -0700 Subject: [PATCH 08/15] docs: more carefully describe partition --- docs/overview/apache-kafka-primer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index f9a52a9a8a72..d20faeb4f1ca 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -33,8 +33,8 @@ The _timestamp_ denotes when the event happened. There are a few different "kind of time that can be tracked. These aren’t discussed here, but they’re useful to [learn about](../../../concepts/time-and-windows-in-ksqldb-queries/#time-semantics) nonetheless. -The _topic_ and _partition_ describe which larger collection of events -this event belongs to, and the _offset_ describes its exact position within +The _topic_ and _partition_ describe which larger collection and subset of events +this particular event belongs to, and the _offset_ describes its exact position within that larger collection (more on that below). Finally, the _headers_ carry arbitrary, user-supplied metadata about the record. From 4078867c6bb7b1fbd74d283d0de35ca96c79dc46 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 15:00:45 -0700 Subject: [PATCH 09/15] docs: clarify --- docs/overview/apache-kafka-primer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index d20faeb4f1ca..e0ff1f0f10b3 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -99,8 +99,8 @@ partition will be consistent with all other records with the same key. When reco appended, they follow the correct offset order, even in the presence of failures or faults. When a stream's key content changes because of how a query wants to process the rows (via `GROUP BY` or `PARTITION BY`), the underlying -records keys are recalculated, and the records are sent to a new partition set -to perform the computation. +records keys are recalculated, and the records are sent to a new partition in +the new topic set to perform the computation. 
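For example, both of the following statements cause that kind of repartitioning. The `clicks` stream is hypothetical and is assumed to be keyed on something other than `pageid`:

```sql
-- Re-key the stream explicitly: each record's pageid is hashed to pick
-- its partition in the new stream's six-partition topic.
CREATE STREAM clicks_by_page WITH (PARTITIONS=6) AS
    SELECT * FROM clicks PARTITION BY pageid EMIT CHANGES;

-- Aggregate by a non-key column: ksqlDB repartitions the records by
-- pageid behind the scenes before computing the count.
SELECT pageid, COUNT(*) FROM clicks GROUP BY pageid EMIT CHANGES;
```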
## Producers and consumers From 024834fbf549563caa2a622e70ea2d4298e4eb38 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 15:04:00 -0700 Subject: [PATCH 10/15] docs: link to s/t --- docs/overview/apache-kafka-primer.md | 6 ++++-- docs/reference/sql/data-definition.md | 2 +- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index e0ff1f0f10b3..4398366087d6 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -62,8 +62,10 @@ don't need to conform to the same structure, relate to the same situation, or anything like that. The way you manage publication to topics is entirely a matter of user convention and enforcement. -ksqlDB provides higher-level abstractions over a topic through _streams_ and -_tables_. A stream or table is a {{ site.ak }} topic with a registered schema. +ksqlDB provides higher-level abstractions over a topic through +_[streams](../reference/sql/data-definition.md#streams)_ and +_[tables](../reference/sql/data-definition.md#tables)_. +A stream or table is a {{ site.ak }} topic with a registered schema. The schema controls the shape of records that are allowed to be stored in the topic. This kind of static typing makes it easier to understand what sort of rows are in your topic and generally helps you make fewer mistakes in your diff --git a/docs/reference/sql/data-definition.md b/docs/reference/sql/data-definition.md index 51f751ccda3b..dbc8b796589e 100644 --- a/docs/reference/sql/data-definition.md +++ b/docs/reference/sql/data-definition.md @@ -117,7 +117,7 @@ Syntactically, declaring a table is similar to declaring a stream. The following example statement declares a `current_location` table that has a key field named `person`. -```sql hl_lines="2" +```sql CREATE TABLE current_location ( person VARCHAR PRIMARY KEY, location VARCHAR From 2582edf0eb6257015dd40e0ed073d566a0fc57b1 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 15:07:06 -0700 Subject: [PATCH 11/15] docs: suggestions --- docs/overview/apache-kafka-primer.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index 4398366087d6..6789c1a5e722 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -109,7 +109,7 @@ the new topic set to perform the computation. Producers and consumers facilitate the movement of records to and from topics. When an application wants to either publish records or subscribe to them, it invokes the APIs (generally called the _client_) to do so. Clients communicate -with the brokers over a structured network protocol. +with the brokers (see below) over a structured network protocol. When consumers read records from a topic, they never delete them or mutate them in any way. This pattern of being able to repeatedly read the same @@ -148,8 +148,8 @@ value) must be transformed to byte arrays. This enables {{ site.ak }} to work with an opaque series of bytes without needing to know anything about what they are. When records are delivered to a consumer, those byte arrays need to be transformed back into their original topics to be meaningful to the application. -The process that converts to and from byte representations is called -_serialization_. +The processes that convert to and from byte representations are called +_serialization_ and _deserialization_, respectively. 
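On the ksqlDB side, you choose how the value portion of each record is serialized and deserialized when you declare a stream or table, using the `value_format` property. The following sketch assumes a hypothetical `orders` topic and schema:

```sql
-- ksqlDB uses the JSON serializer and deserializer for the value
-- portion of every record written to or read from the backing topic.
CREATE STREAM orders (
    id VARCHAR KEY,
    amount DOUBLE
) WITH (
    kafka_topic = 'orders',
    partitions = 1,
    value_format = 'json'
);
```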
When a producer sends a record to a topic, it must decide which serializers to use to convert the key and value to byte arrays. The key and value From 124d8c113204e850b9f0d04a8bcf66afcd641e29 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 15:12:30 -0700 Subject: [PATCH 12/15] docs: clarify language --- docs/overview/apache-kafka-primer.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index 6789c1a5e722..e1e6fabd654c 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -201,7 +201,7 @@ consumed, exactly one consumer in the group reads it. This effectively provides a way for a set of processes to coordinate and load balance the consumption of records. -Because the records in a single topic are meant to be consumed in one, each +Because the records in a single topic are meant to be consumed by one process in the group, each partition in the subscription is read by only one consumer at a time. The number of partitions that each consumer is responsible for is defined by the total number of source partitions divided by the number of consumers. If a consumer @@ -216,7 +216,7 @@ to manage group membership explicitly, because all of this happens automatically For example, if you deploy a persistent query with ten source partitions to a ksqlDB cluster with two nodes, each node processes five partitions. If you lose a server, the sole remaining server will rebalance automatically and process -all ten. If you add four more servers, each rebalances to process two partitions. +all ten. If you now add four more servers, each rebalances to process two partitions. ## Retention From ead9b76bd3fe1eccea64ba6585a67db42bf95f15 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Wed, 26 Aug 2020 15:15:35 -0700 Subject: [PATCH 13/15] docs: ombine retention & compaction --- docs/overview/apache-kafka-primer.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index e1e6fabd654c..5c2bb9d74fb2 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -218,10 +218,13 @@ ksqlDB cluster with two nodes, each node processes five partitions. If you lose a server, the sole remaining server will rebalance automatically and process all ten. If you now add four more servers, each rebalances to process two partitions. -## Retention +## Retention and compaction -When you create a topic, you must set a retention duration. Retention -defines how long a record is stored before it's deleted. Retention is one of the +It is often desirable to clean up older records after some period of time. +Retention and compaction are two different options for doing this. They are both +optional and can be used in conjunction. + +Retention defines how long a record is stored before it's deleted. Retention is one of the only ways to delete a record in a topic. This parameter is particularly important in stream processing because it defines the time horizon that you can replay a stream of events. Replay is useful if you're @@ -232,9 +235,7 @@ ksqlDB enables you to control the retention of the underlying topics of base streams and tables directly, so it's important to understand the concept. For more information see [Topics and Logs in the Kafka docs](https://kafka.apache.org/documentation/#intro_topics). 
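As a sketch of what replay looks like from ksqlDB, assuming a stream `s1` whose records are still within the topic's retention:

```sql
-- Start reading from the earliest offset still retained in the topic,
-- instead of only from newly arriving records.
SET 'auto.offset.reset'='earliest';

-- This push query now replays the retained history of s1 before
-- continuing with new events as they arrive.
SELECT * FROM s1 EMIT CHANGES;
```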
-## Compaction - -Compaction is a process that runs in the background on each {{ site.ak }} +Compaction, by contrast, is a process that runs in the background on each {{ site.ak }} broker that periodically deletes all but the latest record per key. It is an optional, opt-in process. Compaction is particularly useful when your records represent some kind of updates to a piece of a state, and the latest update is From 0c1fbf05b1a7441db4a10972f6a989c7a7bbb913 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Thu, 27 Aug 2020 08:14:07 -0700 Subject: [PATCH 14/15] docs: address Vicky's feedback --- docs/reference/sql/syntax/lexical-structure.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/sql/syntax/lexical-structure.md b/docs/reference/sql/syntax/lexical-structure.md index 627177b08589..a9b421e7cb7d 100644 --- a/docs/reference/sql/syntax/lexical-structure.md +++ b/docs/reference/sql/syntax/lexical-structure.md @@ -91,7 +91,7 @@ You must select from it by backticking the stream name and column name and using the original casing: ```sql -SELECT `@MY-identifier-stream-column!` FROM `s3` EMIT CHANGES; +SELECT `@MY-identifier-stream-column!` FROM `s1` EMIT CHANGES; ``` ## Constants @@ -120,8 +120,8 @@ where **_`digits`_** is one or more single-digit integers (`0` through `9`). - At least one digit must be present before or after the decimal point, if there is one. - At least one digit must follow the exponent symbol `e`, if there is one. -- Spaces and underscores (nor any other characters) are allowed in the constant. -- Numeric constants may also have a `+` or `-` prefix, but this considered to +- No spaces, underscores, or any other characters are allowed in the constant. +- Numeric constants may also have a `+` or `-` prefix, but this is considered to be a function applied to the constant, not the constant itself. Here are some examples of valid numeric constants: From 9e3dd5663ae95d4dbfff3858e175b837522efd29 Mon Sep 17 00:00:00 2001 From: Michael Drogalis Date: Tue, 1 Sep 2020 15:52:15 -0700 Subject: [PATCH 15/15] docs: almog feedback --- docs/overview/apache-kafka-primer.md | 4 ++-- docs/reference/sql/appendix.md | 2 +- docs/reference/sql/syntax/lexical-structure.md | 6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/overview/apache-kafka-primer.md b/docs/overview/apache-kafka-primer.md index 5c2bb9d74fb2..4961c457f0f0 100644 --- a/docs/overview/apache-kafka-primer.md +++ b/docs/overview/apache-kafka-primer.md @@ -65,7 +65,7 @@ matter of user convention and enforcement. ksqlDB provides higher-level abstractions over a topic through _[streams](../reference/sql/data-definition.md#streams)_ and _[tables](../reference/sql/data-definition.md#tables)_. -A stream or table is a {{ site.ak }} topic with a registered schema. +A stream or table associates a schema with a {{ site.ak }} topic. The schema controls the shape of records that are allowed to be stored in the topic. This kind of static typing makes it easier to understand what sort of rows are in your topic and generally helps you make fewer mistakes in your @@ -74,7 +74,7 @@ programs that process them. ## Partitions When a record is placed into a topic, it is placed into a particular partition. -A partition is a totally ordered sequence of records. Topics have multiple +A partition is a totally ordered sequence of records by offset. Topics may have multiple partitions to make storage and processing more scalable. 
When you create a topic, you choose how many partitions it has. diff --git a/docs/reference/sql/appendix.md b/docs/reference/sql/appendix.md index f83a38687b9f..bd3d2874a7b9 100644 --- a/docs/reference/sql/appendix.md +++ b/docs/reference/sql/appendix.md @@ -22,7 +22,7 @@ The following table shows all keywords in the language. | `BY` | specify expression | `GROUP BY regionid`, `ADVANCE BY 10 SECONDS`, `PARTITION BY userid` | | `CASE` | select a condition from expressions | `SELECT CASE WHEN condition THEN result [ WHEN … THEN … ] … END` | | `CAST` | change expression type | `SELECT id, CONCAT(CAST(COUNT(*) AS VARCHAR), '_HELLO') FROM views …`| -| `CHANGES` | specify push query | `SELECT * FROM users EMIT CHANGES;` | +| `CHANGES` | specify incremental refinement type | `SELECT * FROM users EMIT CHANGES;` | | `CONNECTOR` | manage a connector | `CREATE SOURCE CONNECTOR 'jdbc-connector' WITH( …` | | `CONNECTORS` | list all connectors | `SHOW CONNECTORS;` | | `CREATE` | create an object | `CREATE STREAM rock_songs (artist VARCHAR, title VARCHAR) …` | diff --git a/docs/reference/sql/syntax/lexical-structure.md b/docs/reference/sql/syntax/lexical-structure.md index a9b421e7cb7d..0209fcacf52b 100644 --- a/docs/reference/sql/syntax/lexical-structure.md +++ b/docs/reference/sql/syntax/lexical-structure.md @@ -51,14 +51,14 @@ Some tokens, such as `SELECT`, `INSERT`, and `CREATE`, are _keywords_. Keywords are reserved tokens that have a specific meaning in ksqlDB's syntax. They control their surrounding allowable tokens and execution semantics. Keywords are case insensitive, meaning `SELECT` and `select` are equivalent. -You can't create an identifier that is already a keyword, unless you use +You can't create an identifier that is already a reserved word, unless you use backticked identifiers. A complete list of keywords can be found in the [appendix](../appendix.md#keywords). ## Identifiers -Identifiers are symbols that represent user-space entities, like streams, +Identifiers are symbols that represent user-defined entities, like streams, tables, columns, and other objects. For example, if you have a stream named `s1`, `s1` is an _identifier_ for that stream. By default, identifiers are case-insensitive, meaning `s1` and `S1` refer to the same stream. Under the @@ -157,7 +157,7 @@ purpose. - Parentheses (`()`) retain their usual meaning in programming languages for grouping expressions and controlling the order of evaluation. - Brackets (`[]`) are used to work with arrays, both in their construction and - subscript access. + subscript access. They also allow you to key into maps. - Commas (`,`) delineate a discrete list of entities. - The semi-colons (`;`) terminates a SQL command. - The asterisk (`*`), when used in particular syntax, is used as an "all"