Databricks Asset Bundle Component
DatabricksAssetBundleComponent is currently in preview. The API may change in future releases.
The DatabricksAssetBundleComponent integrates with Databricks Asset Bundles to provide a way to define Databricks jobs, pipelines, and configuration as code using YAML. The component reads your databricks.yml bundle configuration and automatically creates Dagster assets from the job tasks (notebook, Python wheel, Spark Python, Spark JAR, etc.) defined within it, with dependency information preserved. When the assets are materialized, Dagster submits the tasks to Databricks and monitors execution.
This approach is well suited for teams that are already using Databricks Asset Bundles to manage their Databricks workflows and want to bring them into Dagster without rewriting job definitions.
The component supports the common Databricks task types, including notebook, Python wheel, Spark Python, and Spark JAR tasks.
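For reference, a minimal databricks.yml that the component could read might look like the following. This is an illustrative sketch, not a complete bundle configuration; the bundle name, job name, task keys, and file paths are all hypothetical:

# databricks.yml (illustrative)
bundle:
  name: my_bundle

resources:
  jobs:
    etl_job:
      name: etl_job
      tasks:
        - task_key: extract
          notebook_task:
            notebook_path: ./notebooks/extract.ipynb
        - task_key: transform
          # depends_on is how the component derives asset dependencies
          depends_on:
            - task_key: extract
          spark_python_task:
            python_file: ./src/transform.py

Here, the component would create one asset per task, with the transform asset downstream of the extract asset because of the depends_on declaration.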
Step 1: Prepare a Dagster project
To begin, you'll need a Dagster project. You can use an existing components-ready project or create a new one:
uvx create-dagster project my-project && cd my-project/src
Activate the project virtual environment:
source ../.venv/bin/activate
Finally, add the dagster-databricks library to the project:
With uv:

uv add dagster-databricks

With pip:

pip install dagster-databricks
Step 2: Scaffold the component definition
Now that you have a Dagster project, you can scaffold a DatabricksAssetBundleComponent component definition. You'll need to provide:
- The path to your databricks.yml configuration file
- The URL of your Databricks workspace host
- The name of the environment variable that stores your Databricks workspace token
dg scaffold defs dagster_databricks.DatabricksAssetBundleComponent my_databricks_bundle \
--databricks-config-path /path/to/databricks.yml \
--databricks-workspace-host https://your-workspace.cloud.databricks.com \
--databricks-workspace-token "{{ env.DATABRICKS_TOKEN }}"
The dg scaffold defs call will generate a defs.yaml file:
tree src/my_project/defs
src/my_project/defs
├── __init__.py
└── my_databricks_bundle
└── defs.yaml
2 directories, 2 files
The defs.yaml defines the component in your project:
# my_project/defs/my_databricks_bundle/defs.yaml
type: dagster_databricks.DatabricksAssetBundleComponent
attributes:
  databricks_config_path: '{{ project_root }}/../../../../../path/to/databricks.yml'
  workspace:
    host: https://your-workspace.cloud.databricks.com
    token: '{{ env.DATABRICKS_TOKEN }}'
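To confirm that the component loads and to inspect the assets it creates from your bundle, you can list your project's definitions. This is a sketch that assumes the dg CLI is available in the activated project environment and that DATABRICKS_TOKEN is set:

```shell
# Provide the workspace token the defs.yaml references (value is a placeholder)
export DATABRICKS_TOKEN="<your-token>"

# List the definitions in the project, including assets created from the bundle
dg list defs
```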
Step 3: Customize component configuration
Compute configuration
You can specify how tasks should be executed on Databricks using one of three Databricks compute options:
- Serverless (default)
- New cluster
- Existing cluster
Serverless (default):

# my_project/defs/my_databricks_bundle/defs.yaml
type: dagster_databricks.DatabricksAssetBundleComponent
attributes:
  databricks_config_path: '{{ project_root }}/../../../../../path/to/databricks.yml'
  workspace:
    host: https://your-workspace.cloud.databricks.com
    token: '{{ env.DATABRICKS_TOKEN }}'
  compute_config:
    is_serverless: true
New cluster:

# my_project/defs/my_databricks_bundle/defs.yaml
type: dagster_databricks.DatabricksAssetBundleComponent
attributes:
  databricks_config_path: '{{ project_root }}/../../../../../path/to/databricks.yml'
  workspace:
    host: https://your-workspace.cloud.databricks.com
    token: '{{ env.DATABRICKS_TOKEN }}'
  compute_config:
    spark_version: "13.3.x-scala2.12"
    node_type_id: "i3.xlarge"
    num_workers: 2
Existing cluster:

# my_project/defs/my_databricks_bundle/defs.yaml
type: dagster_databricks.DatabricksAssetBundleComponent
attributes:
  databricks_config_path: '{{ project_root }}/../../../../../path/to/databricks.yml'
  workspace:
    host: https://your-workspace.cloud.databricks.com
    token: '{{ env.DATABRICKS_TOKEN }}'
  compute_config:
    existing_cluster_id: "1234-567890-abcde123"
Custom asset mapping
By default, the Databricks Asset Bundle Component creates one Dagster asset per Databricks task. You can override this by providing a custom mapping from task keys to asset specs:
# my_project/defs/my_databricks_bundle/defs.yaml
type: dagster_databricks.DatabricksAssetBundleComponent
attributes:
  databricks_config_path: '{{ project_root }}/../../../../../path/to/databricks.yml'
  workspace:
    host: https://your-workspace.cloud.databricks.com
    token: '{{ env.DATABRICKS_TOKEN }}'
  assets_by_task_key:
    my_etl_task:
      - key: raw/customers
        description: "Raw customer data extracted from source"
      - key: raw/orders
        description: "Raw order data extracted from source"