Quickstart

This guide walks through the typical workflow of creating a DuckLake, writing data into a table, and reading it back.

Connecting to a DuckLake

A DuckLake consists of a catalog database (which stores metadata) and a data path (which stores the actual data files). The catalog database can be PostgreSQL, MySQL, or SQLite. The data path can be a local directory or a cloud storage URI such as s3://my-bucket/data.
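To make the two parts concrete, here is a minimal sketch that classifies a catalog URL and a data path using only the standard library. The helpers describe_catalog() and is_remote_data_path() are hypothetical and not part of ducklake. The URL schemes are assumed to match the connection strings used in this guide.

```python
from urllib.parse import urlsplit

# Hypothetical helpers for illustration only; ducklake does not provide them.
# The scheme names are assumed from the connection strings in this guide.
CATALOG_BACKENDS = {
    "postgresql": "PostgreSQL",
    "mysql": "MySQL",
    "sqlite": "SQLite",
}


def describe_catalog(url: str) -> str:
    """Return the name of the catalog backend that a connection URL points at."""
    scheme = urlsplit(url).scheme
    if scheme not in CATALOG_BACKENDS:
        raise ValueError(f"unsupported catalog scheme: {scheme!r}")
    return CATALOG_BACKENDS[scheme]


def is_remote_data_path(path: str) -> bool:
    """Return True for URIs such as s3://..., False for local directories."""
    scheme = urlsplit(path).scheme
    # A one-letter "scheme" is a Windows drive letter, not a URI scheme.
    return len(scheme) > 1 and scheme != "file"


print(describe_catalog("sqlite:///metadata.sqlite"))
print(is_remote_data_path("s3://my-bucket/data"))
print(is_remote_data_path("data_files/"))
```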

To create a new DuckLake, call ducklake.create(). The examples in this guide import the package as dl:

import ducklake as dl

ducklake = dl.create(
    "sqlite:///metadata.sqlite",
    data_path="data_files/",
)

To connect to an existing DuckLake, use ducklake.connect():

ducklake = dl.connect("sqlite:///metadata.sqlite")

When the data files live in remote object storage, you can pass storage credentials via the storage_options argument:

ducklake = dl.connect(
    "postgresql://user:password@localhost:5432/catalog",
    storage_options={
        "aws_region": "us-east-1",
        "aws_access_key_id": "...",
        "aws_secret_access_key": "...",
    },
)
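Hard-coding credentials in source files is easy to get wrong, so one option is to build the storage_options dictionary from environment variables. The sketch below is an assumption-laden example: storage_options_from_env() is a hypothetical helper, and the environment variable names follow the usual AWS conventions rather than anything ducklake prescribes. Only the dictionary keys are taken from the example above.

```python
import os


def storage_options_from_env(env=None):
    """Build a storage_options dict from AWS-style environment variables.

    Hypothetical helper, not part of ducklake. The dict keys come from the
    example in this guide; the variable names are an assumption.
    """
    env = os.environ if env is None else env
    mapping = {
        "aws_region": "AWS_REGION",
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    }
    missing = [var for var in mapping.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in mapping.items()}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "AWS_REGION": "us-east-1",
    "AWS_ACCESS_KEY_ID": "example-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
}
options = storage_options_from_env(example_env)
print(sorted(options))
```

The resulting dictionary can then be passed as storage_options=options in the connect() call shown above.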

Creating a Table

Tables are created from the Ducklake instance. The schema is described using ducklake’s data type primitives:

table = ducklake.create_table(
    "houses",
    schema={
        "zip_code": dl.Varchar(),
        "num_bedrooms": dl.UInt8(),
        "num_bathrooms": dl.UInt8(),
        "price": dl.Float64(),
    },
)

You can also pass a list of Column instances to gain fine-grained control over nullability, defaults, and tags:

table = ducklake.create_table(
    "houses",
    schema=[
        dl.Column("zip_code", dl.Varchar(), nullable=False),
        dl.Column("num_bedrooms", dl.UInt8(), nullable=False),
        dl.Column("num_bathrooms", dl.UInt8(), nullable=False),
        dl.Column("price", dl.Float64(), nullable=False),
    ],
)
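The choice of data type constrains the values a column can hold. An unsigned 8-bit integer such as UInt8 covers 0 through 255, which is plenty for bedroom counts but worth checking for other data. The following plain-Python sketch flags values outside that range before a write. The helper out_of_range_uint8() is hypothetical, and how ducklake itself reacts to out-of-range values is not covered here.

```python
# Bounds of an unsigned 8-bit integer.
UINT8_MIN, UINT8_MAX = 0, 2**8 - 1


def out_of_range_uint8(values):
    """Return the values that an unsigned 8-bit column cannot represent.

    Hypothetical pre-write check for illustration; not part of ducklake.
    """
    return [v for v in values if not (UINT8_MIN <= v <= UINT8_MAX)]


print(out_of_range_uint8([2, 3, 1]))
print(out_of_range_uint8([1, 256, -1]))
```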

Writing Data

Data can be written into a table using one of the framework integrations. For Polars users, the sink_polars() method takes any pl.LazyFrame:

import polars as pl

lf = pl.LazyFrame(
    {
        "zip_code": ["01234", "01234", "12345"],
        "num_bedrooms": [2, 3, 1],
        "num_bathrooms": [1, 2, 1],
        "price": [100_000.0, 250_000.0, 75_000.0],
    }
)
table.sink_polars(lf)

DuckDB users can write Arrow-compatible objects via write() or pass relations directly. See the API reference for the full set of methods.

Reading Data

Reading mirrors the writing API. To get a Polars LazyFrame over the table contents:

lf = table.scan_polars()
df = lf.collect()

ducklake automatically applies any pending deletion files and inline deletions, so you always see a consistent view of the table.

Time Travel

Snapshots are first-class in DuckLake. To read a table as it looked at an earlier point in time, call at() on the Ducklake connection with either a snapshot ID or a timestamp, then look up the table on the returned object:

import datetime as dt

# Time travel by snapshot ID
old = ducklake.at(42)

# Time travel by timestamp
yesterday = ducklake.at(dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=1))

old_table = old.table("houses")
df = old_table.read_polars()
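When time traveling by timestamp, it helps to build timezone-aware datetimes explicitly so that the cutoff does not depend on the machine's local time zone. The sketch below computes midnight UTC a given number of days ago using only the standard library. The helper start_of_utc_day() is hypothetical, and whether at() accepts naive datetimes is not stated in this guide, so the example sticks to aware ones.

```python
import datetime as dt
from typing import Optional


def start_of_utc_day(days_ago: int, now: Optional[dt.datetime] = None) -> dt.datetime:
    """Return midnight UTC of the day that lies `days_ago` days before `now`.

    Hypothetical helper for illustration; not part of ducklake.
    """
    if now is None:
        now = dt.datetime.now(dt.timezone.utc)
    # Normalize to UTC first so the calendar date is taken in UTC.
    day = now.astimezone(dt.timezone.utc).date() - dt.timedelta(days=days_ago)
    return dt.datetime.combine(day, dt.time.min, tzinfo=dt.timezone.utc)


cutoff = start_of_utc_day(1, now=dt.datetime(2024, 3, 10, 15, 30, tzinfo=dt.timezone.utc))
print(cutoff.isoformat())
```

A value like cutoff can be passed to at() in the same way as the timestamp in the example above.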

Transactions

Multiple metadata changes can be grouped into a single atomic transaction using transaction():

with ducklake.transaction() as tx:
    tx.create_schema("analytics")
    tx.create_table(
        ("analytics", "events"),
        schema={"id": dl.Int64(), "name": dl.Varchar()},
    )

If the with block raises an exception, the transaction is rolled back automatically. Otherwise, all changes are committed atomically when the block exits.
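The commit-on-exit, rollback-on-exception pattern described above is the same one that Python's built-in sqlite3 module uses for its connection context manager. The following sketch demonstrates those semantics with sqlite3 only. It is an analogy, not ducklake code, and it says nothing about how ducklake implements transactions internally.

```python
import sqlite3

# In-memory database used purely to demonstrate context-manager semantics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")

# Successful block: both inserts are committed together when the block exits.
with conn:
    conn.execute("INSERT INTO events VALUES (1, 'signup')")
    conn.execute("INSERT INTO events VALUES (2, 'login')")

# Failing block: the insert is rolled back when the exception propagates.
try:
    with conn:
        conn.execute("INSERT INTO events VALUES (3, 'purchase')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# Only the rows from the successful block remain.
committed_ids = [row[0] for row in conn.execute("SELECT id FROM events ORDER BY id")]
print(committed_ids)
```

Note that the exception still propagates out of the with block after the rollback, so callers that want to continue need their own try/except, as shown.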

Next Steps

Explore the API Reference for details on all available classes and methods.
Read the upstream DuckLake documentation to learn about the underlying format.