
Fine-tune an LLM

note

To see video of this example

In this example, you'll build a pipeline with Dagster that:

- Loads a public Goodreads JSON dataset into DuckDB
- Performs feature engineering to enhance the data
- Creates and validates the data files needed for an OpenAI fine-tuning job
- Generates a custom model and validates it

Prerequisites

To follow the steps in this guide, you'll need:

- Basic Python knowledge.
- Python 3.9+ installed on your system. Refer to the Installation guide for information.
- Familiarity with SQL and Python data manipulation libraries, such as Pandas.
- Understanding of data pipelines and the extract, transform, and load (ETL) process.
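To confirm the Python requirement before installing anything, a check along these lines can help. This is a minimal sketch: `meets_minimum` is a hypothetical helper (not part of Dagster), and the 3.9 floor is taken from the list above.

```python
import sys

# Minimum version taken from the prerequisites above.
MINIMUM = (3, 9)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if `version` (default: the running interpreter) is at least `minimum`."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); tuple comparison is lexicographic.
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum() else "too old for this guide"
    print(f"Python {current}: {status}")
```

Run it with the same interpreter you plan to use for the project, since a system may have several Python versions installed.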

Step 1: Set up your Dagster environment

First, set up a new Dagster project.

Clone the Dagster repo, then from the repository root navigate to the project:
```
cd examples/docs_projects/project_llm_fine_tune
```
Install the required dependencies with uv:
```
uv sync
```
Activate the virtual environment:
- MacOS: `source .venv/bin/activate`
- Windows: `.venv\Scripts\activate`

Step 2: Launch the Dagster webserver

To make sure Dagster and its dependencies were installed correctly, navigate to the project root directory and start the Dagster webserver:

dg dev
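Once `dg dev` is running, you can confirm from a second terminal that the webserver is reachable. The sketch below assumes the webserver listens at `http://127.0.0.1:3000`; that address is an assumption, so use whatever URL `dg dev` prints if it differs. `webserver_is_up` is a hypothetical helper, not a Dagster API.

```python
import urllib.error
import urllib.request

# Assumed default address; replace with the URL printed by `dg dev`.
DEFAULT_URL = "http://127.0.0.1:3000"


def webserver_is_up(url=DEFAULT_URL, timeout=2.0):
    """Return True if anything answers HTTP at `url`, False if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just with an error status, so it is running.
        return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    print("reachable" if webserver_is_up() else "not reachable")
```

This only shows that something is answering HTTP at that address. Opening the URL in a browser is still the simplest way to confirm the Dagster UI itself loads.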

Next steps

Continue this example with ingestion
