dbt: Improve entry point page. Absorb community tutorial. #153

Merged
merged 3 commits into from
Dec 23, 2024
2 changes: 1 addition & 1 deletion docs/domain/timeseries/generate/node.rst
@@ -346,7 +346,7 @@ will open up a map view showing the current position of the ISS:
.. _detailed guide: https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/Promises
.. _ground point: https://en.wikipedia.org/wiki/Ground_track
.. _input values: https://node-postgres.com/features/queries#Parameterized%20query
.. _interactive REPL mode: https://www.oreilly.com/library/view/learning-node-2nd/9781491943113/ch04.html
.. _interactive REPL mode: https://web.archive.org/web/20240910181004/https://www.oreilly.com/library/view/learning-node-2nd/9781491943113/ch04.html
.. _International Space Station: https://www.nasa.gov/mission_pages/station/main/index.html
.. _node-postgres: https://www.npmjs.com/package/pg
.. _Node.js: https://nodejs.org/en/
131 changes: 82 additions & 49 deletions docs/integrate/dbt/index.md
@@ -1,14 +1,18 @@
(dbt)=

# dbt

:::{include} /_include/links.md
:::

## About
```{div}
:style: "float: right"
[![](https://www.getdbt.com/ui/img/logos/dbt-logo.svg){w=180px}](https://www.getdbt.com/)
```

[dbt] is an open source tool for transforming data in data warehouses using Python and
SQL. It is an SQL-first transformation workflow platform that lets teams quickly and
[dbt] is a tool for transforming data in data warehouses using Python and SQL.

It is an SQL-first transformation workflow platform that lets teams quickly and
collaboratively deploy analytics code following software engineering best practices
like modularity, portability, CI/CD, and documentation.

@@ -56,69 +60,101 @@ scale.
:::


## Install
### dbt's Features
The data abstraction layer provided by [dbt-core] allows the decoupling of
the models on which reports and dashboards rely from the source data. When
business rules or source systems change, you can still maintain the same models
as a stable interface.

Some of the things that dbt can do include:

* Import reference data from CSV files.
* Track changes in source data with different strategies so that downstream
models do not need to be rebuilt from scratch every time.
* Run tests on data, to confirm assumptions remain valid, and to validate
any changes made to the models' logic.

### CrateDB's Benefits
Due to its unique capabilities, CrateDB is an excellent warehouse choice for
data transformation projects. It offers automatic indexing, fast aggregations,
easy partitioning, and the ability to scale horizontally.


## Setup
Install the most recent version of the [dbt-cratedb2] Python package.
```shell
pip install --upgrade 'dbt-cratedb2'
```


## Connect
**dbt Profile Configuration:** CrateDB targets should be set up using the
following configuration in your `profiles.yml` file.
## Configure
Because CrateDB is compatible with PostgreSQL, the same connectivity
options apply as outlined on the [dbt Postgres Setup] documentation
page.

The dbt connection profile settings for CrateDB, stored in [`profiles.yml`],
are identical to those for PostgreSQL.
```yaml
company-name:
cratedb_analytics:
target: dev
outputs:
dev:
type: cratedb
host: [hostname]
host: [clustername].aks1.westeurope.azure.cratedb.net
port: 5432
user: [username]
password: [password]
port: [port] # Default is 5432.
dbname: crate # Fixed. Do not change.
schema: doc # `doc` is the default schema.
pass: [password]
dbname: crate # CrateDB's only catalog is `crate`.
schema: doc # Define schema. `doc` is the default.
search_path: doc # Use the same value as `schema` by default.
```
dbt-cratedb2 is based on dbt-postgres, which uses [psycopg2] to connect to
the database server.
Because CrateDB is compatible with PostgreSQL, the same connectivity
options apply as outlined on the [dbt Postgres Setup] documentation
page.


## Usage
## Learn

### Custom Schemas
By default, dbt writes the models into the schema you configured in your
profile, but in some dbt projects you may need to write data into different
target schemas. You can adjust the target schema using [custom schemas with
dbt].
Learn how to use CrateDB with dbt by exploring concise examples.

If your dbt project has a custom macro called `generate_schema_name`, dbt
will use it instead of the default macro. This allows you to customize
the name generation according to your needs.
:::{rubric} Tutorials
:::

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
::::{grid} 2
:gutter: 5

:::{grid-item-card}
:link: dbt-usage
:link-type: ref
:link-alt: dbt usage guidelines
:padding: 3
:class-card: sd-text-center sd-pt-4
:class-header: sd-fs-4
{material-outlined}`integration_instructions;2.5em`
Usage Guidelines
^^^
```{toctree}
:maxdepth: 2
:hidden:

usage
```


## Learn

:::{rubric} Tutorials
+++
Usage guidelines, notes, and advanced configuration options.
:::
- [Using dbt with CrateDB]

:::{rubric} Development
:::{grid-item-card}
:link: https://github.com/crate/cratedb-examples/tree/main/framework/dbt/
:link-type: url
:link-alt: dbt CrateDB Examples
:padding: 3
:class-card: sd-text-center sd-pt-4
:class-header: sd-fs-4
{material-outlined}`apps;2.5em`
Example Projects
^^^
+++
Explore a few dbt example projects using CrateDB.
:::
- [dbt CrateDB examples]

::::


:::{rubric} Webinars
@@ -150,12 +186,9 @@ and then publish your project to a GitHub repository.
::::



[custom schemas with dbt]: https://docs.getdbt.com/docs/build/custom-schemas
[dbt]: https://www.getdbt.com/
[dbt-core]: https://github.com/dbt-labs/dbt-core
[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
[dbt Cloud]: https://www.getdbt.com/product/dbt-cloud/
[dbt Postgres Setup]: https://docs.getdbt.com/docs/core/connect-data-platform/postgres-setup
[Using dbt with CrateDB]: https://community.cratedb.com/t/using-dbt-with-cratedb/1566
[dbt CrateDB examples]: https://github.com/crate/cratedb-examples/tree/main/framework/dbt/
[psycopg2]: https://pypi.org/project/psycopg2/
[`profiles.yml`]: https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml
172 changes: 172 additions & 0 deletions docs/integrate/dbt/usage.md
Member Author

@amotl amotl Dec 19, 2024

For details on naming the canonical "getting started" tutorial, see also:

Here, we used `usage.md` for a page titled "Using dbt with CrateDB", which contains general usage information rather than a real tutorial.

@@ -0,0 +1,172 @@
(dbt-usage)=
# Using dbt with CrateDB

:::{include} /_include/links.md
:::

_Setup instructions and guidelines for transforming data using dbt and CrateDB._

:::{div}
To follow the steps below, you need access to a CrateDB cluster and a
Python installation on your workstation. You can use
[CrateDB Self-Managed] or [CrateDB Cloud].
:::

## Setup

To start a CrateDB instance for evaluation purposes, use Docker or Podman.
```shell
docker run --rm \
--publish=4200:4200 --publish=5432:5432 \
--env=CRATE_HEAP_SIZE=2g crate:latest
```

Install the most recent version of the [dbt-cratedb2] Python package.
```shell
pip install --upgrade 'dbt-cratedb2'
```
:::{note}
dbt-cratedb2 is based on dbt-postgres, which uses [psycopg2] to connect to
the database server.
:::

## Configure
The following commands write a minimal **dbt profile configuration** to a
[`profiles.yml`] file at `~/.dbt/profiles.yml`.
```bash
cd ~
mkdir -p .dbt
cat << EOF > .dbt/profiles.yml
cratedb_analytics:
  target: dev
  outputs:
    dev:
      type: cratedb
      host: localhost
      port: 5432
      user: crate
      pass: crate
      dbname: crate
      schema: doc
      search_path: doc
EOF
```
Note the values in this example: `dbname` is CrateDB's fixed catalog name
`crate`, and both `schema` and `search_path` use the default schema `doc`.
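
To check the profile values outside of dbt, it can help to turn them into a
PostgreSQL connection URI, for example to pass it to `psql`. The following is
a minimal sketch using only the Python standard library. The helper name
`profile_to_uri` is hypothetical and not part of dbt or dbt-cratedb2, and the
dictionary simply repeats the values from the example profile above.

```python
from urllib.parse import quote


def profile_to_uri(output: dict) -> str:
    """Build a PostgreSQL connection URI from one dbt profile output.

    Hypothetical helper for illustration; dbt does not provide it.
    """
    user = quote(output["user"], safe="")
    # The example profile uses `pass`; `password` is accepted here as well.
    password = quote(output.get("pass", output.get("password", "")), safe="")
    host = output["host"]
    port = output.get("port", 5432)
    dbname = output.get("dbname", "crate")
    return f"postgresql://{user}:{password}@{host}:{port}/{dbname}"


# Values copied from the example `profiles.yml` above.
dev = {
    "type": "cratedb",
    "host": "localhost",
    "port": 5432,
    "user": "crate",
    "pass": "crate",
    "dbname": "crate",
    "schema": "doc",
    "search_path": "doc",
}

uri = profile_to_uri(dev)
print(uri)  # postgresql://crate:crate@localhost:5432/crate
```

User name and password are percent-encoded, so credentials containing
characters like `@` or `/` do not break the URI.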

## Project
When working with dbt, you always work within a dbt project.
A dbt project has a [specific structure][dbt-project-structure], and contains a
combination of SQL, Jinja, YAML, and Markdown files.
In your project folder, alongside the `models` folder that most projects have,
a folder called `macros` can include macro override files.

At [cratedb-examples » framework/dbt], you can explore a few ready-to-run dbt
projects that demonstrate usage with CrateDB.

## Appendix

A few notes about advanced configuration options and general usage
information.

### Search Path
The `search_path` config controls the CrateDB "search path" that dbt configures
when opening new connections to the database. By default, the CrateDB search
path is `"doc"`, meaning that unqualified table names will be
searched for in the `doc` schema.
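
The effect of the search path on unqualified names can be sketched as a
simplified model in Python. This is an illustration only: the actual lookup
happens inside the database server, and the function `resolve_table` as well
as the schema and table names in `catalog` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def resolve_table(
    name: str, search_path: List[str], catalog: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the schema-qualified name for `name`, or None if not found."""
    if "." in name:
        # Already schema-qualified: the search path is not consulted.
        return name
    for schema in search_path:
        # The first schema on the path that contains the table wins.
        if name in catalog.get(schema, set()):
            return f"{schema}.{name}"
    return None


# Hypothetical schemas and tables, for illustration only.
catalog = {"doc": {"orders", "customers"}, "staging": {"orders_raw"}}

print(resolve_table("orders", ["doc"], catalog))                 # doc.orders
print(resolve_table("orders_raw", ["doc"], catalog))             # None
print(resolve_table("orders_raw", ["staging", "doc"], catalog))  # staging.orders_raw
```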

### Custom Schemas
By default, dbt writes the models into the schema you configured in your
profile, but in some dbt projects you may need to write data into different
target schemas. You can adjust the target schema using [custom schemas with
dbt].

If your dbt project has a custom macro called `generate_schema_name`, dbt
will use it instead of the default macro. This allows you to customize
the name generation according to your needs.

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```
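
To make the effect of the override easier to see, the macro's logic can be
mirrored in plain Python. This is a sketch for illustration, not code that dbt
runs. The function names are hypothetical, and the second function reflects
dbt's built-in behavior as described in the dbt documentation on custom
schemas, where the custom schema name is appended to the target schema.

```python
from typing import Optional


def schema_name_with_override(custom_schema_name: Optional[str], target_schema: str) -> str:
    """Mirror of the macro above: use the custom schema name as-is."""
    if custom_schema_name is None:
        return target_schema
    return custom_schema_name.strip()


def schema_name_dbt_default(custom_schema_name: Optional[str], target_schema: str) -> str:
    """dbt's built-in behavior: prefix the custom name with the target schema."""
    if custom_schema_name is None:
        return target_schema
    return f"{target_schema}_{custom_schema_name.strip()}"


print(schema_name_with_override(None, "doc"))         # doc
print(schema_name_with_override(" staging ", "doc"))  # staging
print(schema_name_dbt_default("staging", "doc"))      # doc_staging
```

With the override in place, a model configured with the custom schema
`staging` is written to `staging` instead of `doc_staging`.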

### Full Connection Options
CrateDB targets should be set up using the following **dbt profile configuration** in
your [`profiles.yml`] file, which is identical to the [setup options of dbt-postgres].
```yaml
cratedb_analytics:
  target: dev
  outputs:
    dev:
      type: cratedb
      host: [clustername].aks1.westeurope.azure.cratedb.net
      user: [username]
      password: [password]
      port: 5432
      dbname: crate  # CrateDB's only catalog is `crate`.
      schema: doc  # You can define any schema. `doc` is the default.
      threads: [optional, 1 or more]
      [keepalives_idle]: 0  # default 0, indicating the system default.
      connect_timeout: 10  # default 10 seconds
      [retries]: 1  # default 1 retry on error/timeout when opening connections
      [search_path]:  # optional, override the default postgres `search_path`
      [role]:  # optional, set the role dbt assumes when executing queries
      [sslmode]:  # optional, set the `sslmode` used to connect to the database
      [sslcert]:  # optional, set the `sslcert` to control the certificate file location
      [sslkey]:  # optional, set the `sslkey` to control the location of the private key
      [sslrootcert]:  # optional, set the `sslrootcert` config value to a new file path
                      # in order to customize the file location that contains root certificates
```


## Notes

### CrateDB's Differences
- CrateDB’s fixed catalog name is `crate`, and the default schema name is `doc`.
- CrateDB does not implement the notion of a database; however, tables can be created in different [schemas](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/create-table.html#ddl-create-table-schemas).
- When asked for a database name, either a schema name (any) or the fixed catalog name `crate` may be applicable.
- If a database/schema name is omitted while connecting, the PostgreSQL drivers may default to the “username”.
- The predefined [superuser](https://cratedb.com/docs/crate/reference/en/latest/admin/user-management.html#administration-user-management) on an unconfigured CrateDB cluster is called `crate`, defined without a password.
- To authenticate properly, see the available [authentication](https://cratedb.com/docs/crate/reference/en/latest/admin/auth/index.html#admin-auth) options.

### Feature Coverage
The following dbt features have been tested successfully with CrateDB.

* [Model materializations](https://docs.getdbt.com/docs/build/materializations):
table, view, incremental, ephemeral
* [Incremental models](https://docs.getdbt.com/docs/build/incremental-models-overview)
* [Source data freshness](https://docs.getdbt.com/docs/build/sources#source-data-freshness)
* [CSV seeds](https://docs.getdbt.com/docs/build/seeds)
* [Data tests](https://docs.getdbt.com/docs/build/tests)

### Caveats
- Model materializations using the "materialized view" strategy are
not supported yet.
- Incremental materializations with CrateDB currently only support the
`delete+insert` strategy.
- Incremental materializations do not support columns using the
{ref}`OBJECT <crate-reference:data-types-objects>` data type yet.
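
The semantics of the `delete+insert` strategy can be sketched in a few lines
of Python. The example below uses an in-memory SQLite database from the
standard library purely as a stand-in, and the table names, column names, and
the helper `delete_insert` are hypothetical. It is not the SQL that
dbt-cratedb2 generates. It only shows the general idea: rows in the target
whose unique key appears in the new batch are deleted, then the batch is
inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, status TEXT)")
conn.execute("CREATE TABLE batch (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "new"), (2, "new")])
conn.executemany("INSERT INTO batch VALUES (?, ?)", [(2, "shipped"), (3, "new")])


def delete_insert(conn: sqlite3.Connection, target: str, batch: str, unique_key: str) -> None:
    """Apply a batch to the target using delete-then-insert.

    Identifiers are interpolated directly, which is acceptable only
    because this sketch uses fixed, trusted names.
    """
    with conn:  # Run both statements in one transaction.
        conn.execute(
            f"DELETE FROM {target} "
            f"WHERE {unique_key} IN (SELECT {unique_key} FROM {batch})"
        )
        conn.execute(f"INSERT INTO {target} SELECT * FROM {batch}")


delete_insert(conn, "target", "batch", "id")
rows = conn.execute("SELECT id, status FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'new'), (2, 'shipped'), (3, 'new')]
```

Row 1 is untouched, row 2 is replaced by its newer version, and row 3 is
added, without rebuilding the whole table.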


:::{note}
CrateDB continuously adds new features, and we will update this article
as improvements become available.
We are tracking interoperability issues per [Tool: dbt], and appreciate
any contributions and reports.
:::


[cratedb-examples » framework/dbt]: https://github.com/crate/cratedb-examples/tree/main/framework/dbt/
[custom schemas with dbt]: https://docs.getdbt.com/docs/build/custom-schemas
[dbt]: https://www.getdbt.com/
[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
[dbt-project-structure]: https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview
[`profiles.yml`]: https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml
[psycopg2]: https://pypi.org/project/psycopg2/
[setup options of dbt-postgres]: https://docs.getdbt.com/docs/core/connect-data-platform/postgres-setup
[Tool: dbt]: https://github.com/crate/crate/labels/tool%3A%20dbt