Transform data
A data platform typically involves people in various roles working together, each contributing in different ways. Some individuals will be more involved in certain areas than others. For example, with dbt, analysts may focus primarily on modeling the data, but still want their models integrated into the overall pipeline.
In this step, we will incorporate a dbt project to model the data we loaded with DuckDB.
1. Add the dbt project
First, we will need a dbt project to work with. Run the following command to add a dbt project to the root of the etl_tutorial project:
git clone --depth=1 https://github.com/dagster-io/jaffle-platform.git transform && rm -rf transform/.git
There will now be a transform directory within the root of the project containing our dbt project.
.
├── pyproject.toml
├── src
├── tests
├── transform # dbt project
└── uv.lock
This dbt project already contains models that work with the raw data we brought in previously and requires no modifications.
2. Scaffold a dbt component definition
Now that we have a dbt project to work with, we need to install both the Dagster dbt integration and the dbt adapter for DuckDB:
uv pip install dagster-dbt dbt-duckdb
Next, we can scaffold a dbt component definition by providing the path to the dbt project we added earlier:
dg scaffold defs dagster_dbt.DbtProjectComponent transform --project-path transform/jdbt
This will add a transform directory to the etl_tutorial module:
src
└── etl_tutorial
    └── defs
        └── transform
            └── defs.yaml
3. Configure the dbt defs.yaml
The dbt component creates a single file, defs.yaml, which configures the dagster_dbt.DbtProjectComponent. Most of the file’s contents were generated when we scaffolded the component and provided the path to the dbt project:
type: dagster_dbt.DbtProjectComponent
attributes:
  project: '{{ project_root }}/transform/jdbt'
The component is correctly configured for our dbt project, but we need to make one addition:
type: dagster_dbt.DbtProjectComponent
attributes:
  project: '{{ project_root }}/transform/jdbt'
  translation:
    key: "target/main/{{ node.name }}"
Adding the translation attribute aligns the keys of our dbt models with the source tables. Associating them ensures proper lineage across our assets.
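To make this concrete, here is a minimal sketch of a downstream asset that depends on one of the dbt models by its translated key. The model name stg_orders is illustrative; substitute any model from the dbt project:
import dagster as dg

# With the translation above, a dbt model named "stg_orders" is registered
# under the key target/main/stg_orders rather than plain stg_orders.
dbt_model_key = dg.AssetKey(["target", "main", "stg_orders"])

# A downstream asset can depend on the dbt model by that translated key,
# keeping lineage intact across the ingestion and transformation layers.
@dg.asset(deps=[dbt_model_key])
def orders_report() -> None:
    ...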
Summary
We have now layered dbt into the project. The etl_tutorial module should look like this:
src
└── etl_tutorial
    ├── __init__.py
    ├── definitions.py
    └── defs
        ├── __init__.py
        ├── assets.py
        └── transform
            └── defs.yaml
Once again, you can materialize your assets within the UI or using dg launch from the command line. You can also use dg list to get a full overview of the definitions in your project by executing:
dg list defs
This will return a table of all the definitions within the Dagster project. As we add more objects, you can rerun this command to see how our project grows.
You might be wondering about the relationship between components and definitions. At a high level, a component builds a definition for a specific purpose.
Components are objects that programmatically build assets and other Dagster objects. They accept specific parameters and use them to build the actual definitions you need. In the case of the DbtProjectComponent, the parameter is the dbt project path, and the definitions it creates are the assets for each dbt model.
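As a rough illustration of that idea (this is not the component machinery itself, and the table names are made up), here is a sketch of building asset definitions programmatically from a parameter:
import dagster as dg

# Take a parameter (a list of table names) and build one asset definition
# per table. A component does something similar with its own parameters.
def build_table_assets(table_names: list[str]) -> list[dg.AssetsDefinition]:
    assets = []
    for table_name in table_names:
        @dg.asset(name=table_name)
        def _table_asset() -> None:
            ...  # materialization logic for this table would go here

        assets.append(_table_asset)
    return assets

defs = dg.Definitions(assets=build_table_assets(["customers", "orders"]))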
Definitions are objects that combine metadata about an entity with a Python function that defines how it behaves -- for example, when we used the @asset decorator on the functions for our DuckDB ingestion. This tells Dagster both what the asset is and how to materialize it.
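For instance, a minimal, illustrative asset definition (not the tutorial's actual ingestion code) looks like this: the decorator attaches the metadata, in this case the asset key sample_numbers, while the function body defines how the asset is materialized.
import dagster as dg

@dg.asset
def sample_numbers() -> None:
    # Stand-in for real ingestion or transformation logic.
    total = sum(range(10))
    print(f"materialized sample_numbers with total={total}")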
Next steps
In the next step, we will add a DuckDB resource to our project to more efficiently manage our database connections.