Databricks Workspace Component
DatabricksWorkspaceComponent is currently in preview. The API may change in future releases.
DatabricksWorkspaceComponent is a state-backed component, which fetches and caches Databricks workspace metadata. For information on managing component state, see Configuring state-backed components.
The DatabricksWorkspaceComponent connects directly to your Databricks workspace, discovers existing jobs, and exposes them as Dagster assets. Unlike the Asset Bundle component, it doesn't require a local databricks.yml file — it fetches job definitions from the workspace API at build time.
This approach is well suited for:
- Teams with existing Databricks jobs that want to orchestrate them through Dagster without restructuring
- Workspaces with many jobs where manual asset definition would be impractical
- Scenarios where jobs are managed directly in the Databricks workspace UI
How it works
- The component connects to your Databricks workspace using the provided credentials
- It fetches job definitions (filtered by your configuration) and caches them as state
- Each job's tasks are represented as Dagster assets with dependency information preserved
- When materialized, the component triggers a job run via the Databricks API and monitors it to completion
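The monitor-to-completion step above can be sketched as a simple polling loop. This is an illustrative helper, not part of dagster-databricks: `wait_for_run` and `get_state` are hypothetical names, though the lifecycle state values (`PENDING`, `RUNNING`, `TERMINATED`) match those reported by the Databricks jobs API.

```python
import time


def wait_for_run(get_state, poll_interval=5.0, timeout=3600.0):
    """Poll a run's lifecycle state until it reaches a terminal state.

    `get_state` is any callable returning the current lifecycle state string;
    in practice it would wrap a call to the Databricks jobs API.
    Hypothetical helper for illustration only.
    """
    terminal = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state()
        if state in terminal:
            return state
        time.sleep(poll_interval)
    raise TimeoutError("run did not reach a terminal state in time")
```

The real component also surfaces run metadata and failure details; this sketch only captures the wait-until-terminal behavior.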
Step 1: Prepare a Dagster project
To begin, you'll need a Dagster project. You can use an existing components-ready project or create a new one:
uvx create-dagster project my-project && cd my-project/src
Activate the project virtual environment:
source ../.venv/bin/activate
Finally, add the dagster-databricks library to the project:
- uv: uv add dagster-databricks
- pip: pip install dagster-databricks
Step 2: Scaffold the component definition
Now that you have a Dagster project, you can scaffold a DatabricksWorkspaceComponent definition. You'll need to provide:
- The URL of your Databricks workspace host
- The name of the environment variable that stores your Databricks workspace token
dg scaffold defs dagster_databricks.DatabricksWorkspaceComponent my_databricks_workspace
The dg scaffold defs call will generate a defs.yaml file:
tree src/my_project/defs
src/my_project/defs
├── __init__.py
└── my_databricks_workspace
    └── defs.yaml

2 directories, 2 files
The defs.yaml defines the component in your project:
# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent
attributes: {}
Step 3: Customize component configuration
Job filtering
You can filter which Databricks jobs to include using the databricks_filter key:
# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent
attributes:
  databricks_filter:
    include_jobs:
      job_ids:
        - 12345
        - 67890
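The include-list semantics of the filter above can be sketched as a pure function. The helper name `filter_jobs` is illustrative and not part of dagster-databricks; the `job_id` field mirrors the shape of entries returned by the Databricks jobs list API.

```python
def filter_jobs(jobs, include_job_ids=None):
    """Keep only jobs whose id appears in include_job_ids.

    `jobs` is a list of dicts with a "job_id" key. Passing None for
    include_job_ids keeps every job, matching the behavior when no
    filter is configured. Hypothetical helper for illustration only.
    """
    if include_job_ids is None:
        return list(jobs)
    wanted = set(include_job_ids)
    return [job for job in jobs if job["job_id"] in wanted]
```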
Custom asset mapping
Similar to the Asset Bundle component, you can provide a custom mapping from task keys to asset specs:
# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent
attributes:
  assets_by_task_key:
    etl_extract:
      - key: raw/customers
        description: "Customers extracted from source database"
    etl_transform:
      - key: staging/customers_clean
        description: "Cleaned and validated customer data"
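To see how a mapping like the one above resolves, the following sketch flattens it into (task key, asset key parts, description) tuples. The function name is hypothetical; the one assumption drawn from Dagster conventions is that a slash-separated key such as "raw/customers" denotes the multi-part asset key ["raw", "customers"].

```python
def specs_from_mapping(assets_by_task_key):
    """Flatten an `assets_by_task_key` mapping (as written in defs.yaml)
    into (task_key, key_path, description) tuples.

    Hypothetical helper for illustration; the real component builds
    Dagster AssetSpec objects from the same information.
    """
    specs = []
    for task_key, entries in assets_by_task_key.items():
        for entry in entries:
            specs.append(
                (task_key, entry["key"].split("/"), entry.get("description"))
            )
    return specs
```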
Best practices
Workspace organization
Organize your Databricks workspace files in a structured hierarchy that reflects your data pipeline layers, and use descriptive file names that clearly indicate the transformation being performed (for example, extract_customers.py instead of script_1.py):
/Workspace/dagster_assets/
├── raw/                      # Bronze layer
│   ├── extract_customers.py
│   └── extract_orders.py
├── staging/                  # Silver layer
│   ├── clean_customers.py
│   └── clean_orders.py
└── marts/                    # Gold layer
    └── customer_analytics.py
Signaling completion
Ensure your Databricks scripts signal completion status:
# At the end of your notebook/script
def main():
    # Your processing logic
    result = process_data()

    # Save output to Delta Lake
    result.write.format("delta").mode("overwrite").save("/mnt/data/customers")

    # Signal successful completion
    dbutils.notebook.exit("success")


if __name__ == "__main__":
    main()
Managing dependencies
The component can detect dependencies automatically from Delta table paths:
# Upstream asset: extract_customers.py
df.write.format("delta").save("/mnt/data/raw/customers")
# Downstream asset: clean_customers.py
# Component auto-detects dependency from read operation
df = spark.read.format("delta").load("/mnt/data/raw/customers")
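The path-matching idea behind this auto-detection can be sketched as follows. This is an illustrative reconstruction, not the component's actual implementation: a task that reads a path depends on the task that writes that same path.

```python
def infer_dependencies(writes, reads):
    """Infer task dependencies from Delta table paths.

    `writes` maps task name -> path it writes; `reads` maps task name ->
    list of paths it reads. A task depends on whichever task wrote a
    path it reads. Hypothetical sketch of the path-matching idea only.
    """
    writer_by_path = {path: task for task, path in writes.items()}
    deps = {}
    for task, paths in reads.items():
        deps[task] = sorted(
            writer_by_path[p] for p in paths if p in writer_by_path
        )
    return deps
```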
For more complex dependency scenarios, use explicit configuration:
# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent
attributes:
  asset_overrides:
    customer_analytics:
      depends_on:
        - clean_customers
        - clean_orders
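One way to read the overrides above is that an explicit depends_on list replaces whatever was auto-detected for that asset. The sketch below assumes those replace semantics; both the function name and the exact merge behavior are assumptions for illustration, not the documented behavior of dagster-databricks.

```python
def apply_overrides(auto_deps, overrides):
    """Merge explicit `depends_on` overrides (as written under
    asset_overrides in defs.yaml) over auto-detected dependencies.

    Assumes an override replaces the detected list for that asset;
    assets without overrides keep their detected dependencies.
    Hypothetical helper for illustration only.
    """
    merged = dict(auto_deps)
    for asset, cfg in overrides.items():
        merged[asset] = list(cfg.get("depends_on", []))
    return merged
```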