Skip to content

Commit

Permalink
dbt: Improve entry point page. Absorb community tutorial.
Browse files Browse the repository at this point in the history
  • Loading branch information
amotl committed Nov 29, 2024
1 parent 7aac0f4 commit c35cfb8
Show file tree
Hide file tree
Showing 2 changed files with 198 additions and 46 deletions.
84 changes: 38 additions & 46 deletions docs/integrate/dbt/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,9 @@
[![](https://www.getdbt.com/ui/img/logos/dbt-logo.svg){w=180px}](https://www.getdbt.com/)
```

[dbt] is an open source tool for transforming data in data warehouses using Python and
SQL. It is an SQL-first transformation workflow platform that lets teams quickly and
[dbt] is a tool for transforming data in data warehouses using Python and SQL.

It is an SQL-first transformation workflow platform that lets teams quickly and
collaboratively deploy analytics code following software engineering best practices
like modularity, portability, CI/CD, and documentation.

Expand Down Expand Up @@ -56,69 +57,60 @@ scale.
:::


## Install
## Setup
Install the most recent version of the [dbt-cratedb2] Python package.
```shell
pip install --upgrade 'dbt-cratedb2'
```
dbt-cratedb2 is based on dbt-postgres, which uses [psycopg2] to connect to
the database server.


## Connect
**dbt Profile Configuration:** CrateDB targets should be set up using the
following configuration in your `profiles.yml` file.
## Configure
Because CrateDB is compatible with PostgreSQL, the same connectivity
options apply like outlined on the [dbt Postgres Setup] documentation
page.

The dbt connection profile settings for CrateDB stored in [`profiles.yml`]
are identical with PostgreSQL.
```yaml
company-name:
cratedb_analytics:
target: dev
outputs:
dev:
type: cratedb
host: [hostname]
host: [clustername].aks1.westeurope.azure.cratedb.net
port: 5432
user: [username]
password: [password]
port: [port] # Default is 5432.
dbname: crate # Fixed. Do not change.
schema: doc # `doc` is the default schema.
```
dbt-cratedb2 is based on dbt-postgres, which uses [psycopg2] to connect to
the database server.
Because CrateDB is compatible with PostgreSQL, the same connectivity
options apply like outlined on the [dbt Postgres Setup] documentation
page.
## Usage
### Custom Schemas
By default, dbt writes the models into the schema you configured in your
profile, but in some dbt projects you may need to write data into different
target schemas. You can adjust the target schema using [custom schemas with
dbt].
If your dbt project has a custom macro called `generate_schema_name`, dbt
will use it instead of the default macro. This allows you to customize
the name generation according to your needs.

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
pass: [password]
dbname: crate # CrateDB's only catalog is `crate`.
schema: doc # Define schema. `doc` is the default.
search_path: doc # Use the same value like `schema` by default.
```
## Learn
:::{rubric} Tutorials
:::
- [Using dbt with CrateDB]
:::{rubric} Development
:::
- [dbt CrateDB examples]
:::::{grid}
::::{grid-item-card}
:link: dbt-usage
:link-type: ref
Advanced configuration options and other usage guidelines.
```{toctree}
:maxdepth: 2

usage
```
::::
::::{grid-item-card}
:link: https://github.com/crate/cratedb-examples/tree/main/framework/dbt/
:link-type: url
A few dbt example projects using CrateDB.
::::
:::::


:::{rubric} Webinars
Expand Down Expand Up @@ -157,5 +149,5 @@ and then publish your project to a GitHub repository.
[dbt Cloud]: https://www.getdbt.com/product/dbt-cloud/
[dbt Postgres Setup]: https://docs.getdbt.com/docs/core/connect-data-platform/postgres-setup
[Using dbt with CrateDB]: https://community.cratedb.com/t/using-dbt-with-cratedb/1566
[dbt CrateDB examples]: https://github.com/crate/cratedb-examples/tree/main/framework/dbt/
[psycopg2]: https://pypi.org/project/psycopg2/
[`profiles.yml`]: https://docs.getdbt.com/docs/core/connect-data-platform/profiles.yml
160 changes: 160 additions & 0 deletions docs/integrate/dbt/usage.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
(dbt-usage)=

# Using dbt with CrateDB

_Guidelines for transforming data using dbt and CrateDB._

## Introduction

### dbt's Features
The data abstraction layer provided by [dbt][dbt-core] allows the decoupling of
the models on which reports and dashboards rely from the source data. When
business rules or source systems change, you can still maintain the same models
as a stable interface.

Some of the things that dbt can do include:

* Import reference data from CSV files
* Track changes in source data with different strategies so that downstream
models do not need to be built every time from scratch.
* Run tests on data, to confirm assumptions remain valid, and to validate
any changes made to the models' logic.

### CrateDB's Benefits
Due to its unique capabilities, CrateDB is an excellent warehouse choice for
data transformation projects. It offers automatic indexing, fast aggregations,
easy partitioning, and the ability to scale horizontally.


## Setup

For running the following steps, you will need connectivity to a CrateDB
cluster, and a Python installation on your workstation. The starting point
will be a fresh installation of `dbt-cratedb2`.

```bash
pip install --upgrade 'dbt-cratedb2'
```

To start a CrateDB instance for evaluation purposes, use Docker or Podman.
```shell
docker run --rm \
--publish=4200:4200 --publish=5432:5432 \
--env=CRATE_HEAP_SIZE=2g crate:latest
```

**dbt Profile Configuration:** CrateDB targets should be set up using the
following configuration in your connection profile, e.g. within a
[`profiles.yml`] file at `~/.dbt/profiles.yml`.

Now, create a connection profile `profiles.yaml` file including your
connection details, for example at `~/.dbt/profiles.yml`.
```bash
cd ~
mkdir -p .dbt
cat << EOF > .dbt/profiles.yml
cratedb_analytics:
target: dev
outputs:
dev:
type: cratedb
host: localhost
port: 5432
user: crate
pass: crate
dbname: crate
schema: doc
search_path: doc
EOF
```
(please note the values for `database`, `schema`, and `search_path` in this example)

A dbt project has a [specific structure][dbt-project-structure], and contains a combination of SQL, Jinja, YAML, and Markdown files.
In your project folder, alongside the `models` folder that most projects have,
a folder called `macros` can include macro override files.


Those dbt features have been tested successfully:

* models with [view, table, and ephemeral materializations](https://docs.getdbt.com/docs/build/materializations)
* [dbt source freshness](https://docs.getdbt.com/docs/deploy/source-freshness)
* [dbt test](https://docs.getdbt.com/docs/build/tests)
* [dbt seed](https://docs.getdbt.com/docs/build/seeds)
* [Incremental materializations](https://docs.getdbt.com/docs/build/incremental-models) (with `incremental_strategy='delete+insert'` and without involving [OBJECT](https://crate.io/docs/crate/reference/en/5.4/general/ddl/data-types.html#objects) columns)

We hope you find this useful. CrateDB is continuously adding new features and we will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.


## Appendix

A few notes about advanced configuration options and general usage
information.

### CrateDB's Differences
- CrateDB’s fixed catalog name is `crate`, the default schema name is `doc`.
- CrateDB does not implement the notion of a database, however tables can be created in different [schemas](https://cratedb.com/docs/crate/reference/en/latest/general/ddl/create-table.html#ddl-create-table-schemas).
- When asked for a database name, specifying a schema name (any), or the fixed catalog name `crate` may be applicable.
- If a database-/schema-name is omitted while connecting, the PostgreSQL drivers may default to the “username”.
- The predefined [superuser](https://cratedb.com/docs/crate/reference/en/latest/admin/user-management.html#administration-user-management) on an unconfigured CrateDB cluster is called `crate`, defined without a password.
- For authenticating properly, please learn about the available [authentication](https://cratedb.com/docs/crate/reference/en/latest/admin/auth/index.html#admin-auth) options.

-- https://cratedb.com/docs/crate/clients-tools/en/latest/connect/#configure

### Connection Options
**dbt Profile Configuration:** CrateDB targets should be set up using the
following configuration in your [`profiles.yml`] file.
```yaml
company-name:
target: dev
outputs:
dev:
type: cratedb
host: [clustername].aks1.westeurope.azure.cratedb.net
user: [username]
password: [password]
port: 5432
dbname: crate # CrateDB's only catalog is `crate`.
schema: doc # You can define any schema. `doc` is the default.
threads: [optional, 1 or more]
[keepalives_idle](#keepalives_idle): 0 # default 0, indicating the system default. See below
connect_timeout: 10 # default 10 seconds
[retries](#retries): 1 # default 1 retry on error/timeout when opening connections
[search_path](#search_path): [optional, override the default postgres search_path]
[role](#role): [optional, set the role dbt assumes when executing queries]
[sslmode](#sslmode): [optional, set the sslmode used to connect to the database]
[sslcert](#sslcert): [optional, set the sslcert to control the certifcate file location]
[sslkey](#sslkey): [optional, set the sslkey to control the location of the private key]
[sslrootcert](#sslrootcert): [optional, set the sslrootcert config value to a new file path in order to customize the file location that contain root certificates]
```

### Search Path
The `search_path` config controls the CrateDB "search path" that dbt configures
when opening new connections to the database. By default, the CrateDB search
path is `"doc"`, meaning that unqualified <Term id="table" /> names will be
searched for in the `doc` schema.

### Custom Schemas
By default, dbt writes the models into the schema you configured in your
profile, but in some dbt projects you may need to write data into different
target schemas. You can adjust the target schema using [custom schemas with
dbt].

If your dbt project has a custom macro called `generate_schema_name`, dbt
will use it instead of the default macro. This allows you to customize
the name generation according to your needs.

```jinja
{% macro generate_schema_name(custom_schema_name, node) -%}
{%- set default_schema = target.schema -%}
{%- if custom_schema_name is none -%}
{{ default_schema }}
{%- else -%}
{{ custom_schema_name | trim }}
{%- endif -%}
{%- endmacro %}
```


[dbt]: https://www.getdbt.com/
[dbt-core]: https://github.com/dbt-labs/dbt-core
[dbt-project-structure]: https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview

0 comments on commit c35cfb8

Please sign in to comment.