Databricks Workspace Component

Preview

DatabricksWorkspaceComponent is currently in preview. The API may change in future releases.

info

DatabricksWorkspaceComponent is a state-backed component, which fetches and caches Databricks workspace metadata. For information on managing component state, see Configuring state-backed components.

The DatabricksWorkspaceComponent connects directly to your Databricks workspace, discovers existing jobs, and exposes them as Dagster assets. Unlike the Asset Bundle component, it doesn't require a local databricks.yml file — it fetches job definitions from the workspace API at build time.

This approach is well suited for:

  • Teams with existing Databricks jobs that want to orchestrate them through Dagster without restructuring
  • Workspaces with many jobs where manual asset definition would be impractical
  • Scenarios where jobs are managed directly in the Databricks workspace UI

How it works

  1. The component connects to your Databricks workspace using the provided credentials
  2. It fetches job definitions (filtered by your configuration) and caches them as state
  3. Each job's tasks are represented as Dagster assets with dependency information preserved
  4. When materialized, the component triggers a job run via the Databricks API and monitors it to completion

Step 1: Prepare a Dagster project

To begin, you'll need a Dagster project. You can use an existing components-ready project or create a new one:

uvx create-dagster project my-project && cd my-project/src

Activate the project virtual environment:

source ../.venv/bin/activate

Finally, add the dagster-databricks library to the project:

uv add dagster-databricks

Step 2: Scaffold the component definition

Now that you have a Dagster project, you can scaffold a DatabricksWorkspaceComponent component definition. You'll need to provide:

  • The URL of your Databricks workspace host
  • The name of the environment variable that stores your Databricks workspace token

dg scaffold defs dagster_databricks.DatabricksWorkspaceComponent my_databricks_workspace

The dg scaffold defs call will generate a defs.yaml file:

tree src/my_project/defs
src/my_project/defs
├── __init__.py
└── my_databricks_workspace
    └── defs.yaml

2 directories, 2 files

The defs.yaml defines the component in your project:

# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent

attributes: {}
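
The component reads your workspace token from the environment variable you name in its configuration. Before loading definitions, make sure that variable is set in the environment where Dagster runs. For example, assuming you call it DATABRICKS_TOKEN (a placeholder name; use whichever variable you configure):

export DATABRICKS_TOKEN="<your-personal-access-token>"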

Step 3: Customize component configuration

Job filtering

You can filter which Databricks jobs to include using the databricks_filter key:

# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent

attributes:
  databricks_filter:
    include_jobs:
      job_ids:
        - 12345
        - 67890
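
After adjusting the filter, you can confirm which jobs the component now exposes as assets by listing your project's definitions. A quick check, assuming the dg CLI from Step 1 is on your path:

dg list defs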

Custom asset mapping

Similar to the Asset Bundle component, you can provide a custom mapping from task keys to asset specs:

# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent

attributes:
  assets_by_task_key:
    etl_extract:
      - key: raw/customers
        description: "Customers extracted from source database"
    etl_transform:
      - key: staging/customers_clean
        description: "Cleaned and validated customer data"

Best practices

Workspace organization

Organize your Databricks workspace files in a structured hierarchy that reflects your data pipeline layers, and use descriptive file names that clearly indicate the transformation being performed (for example, extract_customers.py instead of script_1.py):

/Workspace/dagster_assets/
├── raw/                      # Bronze layer
│   ├── extract_customers.py
│   └── extract_orders.py
├── staging/                  # Silver layer
│   ├── clean_customers.py
│   └── clean_orders.py
└── marts/                    # Gold layer
    └── customer_analytics.py

Signaling completion

Ensure your Databricks scripts signal completion status:

# At the end of your notebook/script
def main():
    # Your processing logic
    result = process_data()

    # Save output to Delta Lake
    result.write.format("delta").mode("overwrite").save("/mnt/data/customers")

    # Signal successful completion
    dbutils.notebook.exit("success")

if __name__ == "__main__":
    main()
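
Failures should surface the same way: an uncaught exception causes the Databricks task run to fail, and because the component monitors the run to completion, the corresponding Dagster materialization fails as well. A minimal sketch, assuming a hypothetical validate_output helper:

def main():
    result = process_data()

    # Raising here marks the Databricks task run as failed,
    # which in turn fails the Dagster materialization.
    if not validate_output(result):
        raise ValueError("Output validation failed for customers table")

    result.write.format("delta").mode("overwrite").save("/mnt/data/customers")
    dbutils.notebook.exit("success")

if __name__ == "__main__":
    main()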

Managing dependencies

Dependencies can be detected automatically by the component through Delta table paths:

# Upstream asset: extract_customers.py
df.write.format("delta").save("/mnt/data/raw/customers")

# Downstream asset: clean_customers.py
# Component auto-detects dependency from read operation
df = spark.read.format("delta").load("/mnt/data/raw/customers")

For more complex dependency scenarios, use explicit configuration:

# my_project/defs/my_databricks_workspace/defs.yaml
type: dagster_databricks.DatabricksWorkspaceComponent

attributes:
  asset_overrides:
    customer_analytics:
      depends_on:
        - clean_customers
        - clean_orders
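
After editing defs.yaml, you can validate it before loading definitions. A quick check, assuming the dg CLI is available in your project environment:

dg check yaml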