Quickstart
This guide walks through the typical workflow of creating a DuckLake, writing data into a table, and reading it back.
Connecting to a DuckLake
A DuckLake consists of a catalog database (which stores metadata) and a data path (which stores the actual data
files). The catalog database can be PostgreSQL, MySQL, or SQLite. The data path can be a local directory or a cloud
storage URI such as s3://my-bucket/data.
To create a brand new DuckLake, use ducklake.create():
import ducklake as dl
ducklake = dl.create(
    "sqlite:///metadata.sqlite",
    data_path="data_files/",
)
To connect to an existing DuckLake, use ducklake.connect():
ducklake = dl.connect("sqlite:///metadata.sqlite")
When connecting to remote object storage, you may pass storage credentials via the storage_options argument:
ducklake = dl.connect(
    "postgresql://user:password@localhost:5432/catalog",
    storage_options={
        "aws_region": "us-east-1",
        "aws_access_key_id": "...",
        "aws_secret_access_key": "...",
    },
)
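Hardcoding credentials in source files is easy to get wrong, so one option is to build the storage_options dictionary from environment variables. The sketch below is not part of ducklake: the helper name storage_options_from_env is hypothetical, and the environment variable names are the conventional AWS ones, assumed here for illustration. Only the dictionary keys come from the example above.

```python
import os
from typing import Mapping

# Maps the storage_options keys shown above to conventional AWS environment
# variable names. The variable names are an assumption of this sketch.
_ENV_NAMES = {
    "aws_region": "AWS_REGION",
    "aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
}


def storage_options_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build a storage_options dict, failing early if a variable is missing."""
    missing = [name for name in _ENV_NAMES.values() if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in _ENV_NAMES.items()}
```

The result could then be passed as storage_options=storage_options_from_env() when connecting, which keeps secrets out of the script itself.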
Creating a Table
Tables are created from the Ducklake instance. The schema is described using ducklake’s data type
primitives:
table = ducklake.create_table(
    "houses",
    schema={
        "zip_code": dl.Varchar(),
        "num_bedrooms": dl.UInt8(),
        "num_bathrooms": dl.UInt8(),
        "price": dl.Float64(),
    },
)
You can also pass a list of Column instances to gain fine-grained control over nullability,
defaults, and tags:
table = ducklake.create_table(
    "houses",
    schema=[
        dl.Column("zip_code", dl.Varchar(), nullable=False),
        dl.Column("num_bedrooms", dl.UInt8(), nullable=False),
        dl.Column("num_bathrooms", dl.UInt8(), nullable=False),
        dl.Column("price", dl.Float64(), nullable=False),
    ],
)
Writing Data
Data can be written into a table using one of the framework integrations. For Polars users, the
sink_polars() method takes any pl.LazyFrame:
import polars as pl
lf = pl.LazyFrame(
    {
        "zip_code": ["01234", "01234", "12345"],
        "num_bedrooms": [2, 3, 1],
        "num_bathrooms": [1, 2, 1],
        "price": [100_000.0, 250_000.0, 75_000.0],
    }
)
table.sink_polars(lf)
DuckDB users can write Arrow-compatible objects via write() or pass relations directly.
See the API reference for the full set of methods.
Reading Data
Reading mirrors the writing API. To get a Polars LazyFrame over the table contents:
lf = table.scan_polars()
df = lf.collect()
ducklake automatically applies any pending deletion files and inline deletions, so you always see a consistent view
of the table.
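To make the idea of applying deletions at read time concrete, here is a small conceptual sketch in plain Python. It is not how ducklake is implemented. It assumes deletions are recorded as row positions within a data file, and that inline deletions are kept alongside the metadata. All names are hypothetical.

```python
def apply_deletions(rows, deleted_positions):
    """Return the rows whose position is not marked as deleted."""
    deleted = set(deleted_positions)
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Rows as stored in an immutable data file: (zip_code, num_bedrooms).
data_file = [("01234", 2), ("01234", 3), ("12345", 1)]

deletion_file = [1]     # positions recorded in a separate deletion file (assumed layout)
inline_deletions = [2]  # positions recorded inline with the metadata (assumed layout)

# A reader merges both sources of deletions before returning rows.
visible = apply_deletions(data_file, deletion_file + inline_deletions)
print(visible)  # [('01234', 2)]
```

The data file itself is never rewritten in this model. Readers filter out deleted positions, which is why a scan always reflects the latest committed state.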
Time Travel
DuckLake snapshots are first-class citizens. To read the table as it looked at a previous point in time, time-travel
the entire Ducklake connection:
import datetime as dt
# Time travel by snapshot ID
old = ducklake.at(42)
# Time travel by timestamp
yesterday = ducklake.at(dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=1))
old_table = old.table("houses")
df = old_table.read_polars()
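The two forms of at() shown above suggest two lookup rules: an exact match on snapshot ID, and a search by commit time. The sketch below illustrates one plausible way to resolve such a request against a list of snapshots. It assumes that a timestamp selects the latest snapshot committed at or before it, which is the usual time-travel convention but is not confirmed here for ducklake. Snapshot and resolve_snapshot are hypothetical names.

```python
import bisect
import datetime as dt
from typing import NamedTuple, Union


class Snapshot(NamedTuple):
    snapshot_id: int
    committed_at: dt.datetime


def resolve_snapshot(
    snapshots: list[Snapshot], target: Union[int, dt.datetime]
) -> Snapshot:
    """Resolve an ID or timestamp against snapshots sorted by commit time."""
    if isinstance(target, int):
        # Lookup by ID must match exactly.
        for snap in snapshots:
            if snap.snapshot_id == target:
                return snap
        raise LookupError(f"no snapshot with ID {target}")
    # Lookup by time: pick the latest snapshot committed at or before target.
    times = [snap.committed_at for snap in snapshots]
    idx = bisect.bisect_right(times, target)
    if idx == 0:
        raise LookupError(f"no snapshot at or before {target.isoformat()}")
    return snapshots[idx - 1]


utc = dt.timezone.utc
snapshots = [
    Snapshot(41, dt.datetime(2024, 1, 1, tzinfo=utc)),
    Snapshot(42, dt.datetime(2024, 1, 2, tzinfo=utc)),
    Snapshot(43, dt.datetime(2024, 1, 3, tzinfo=utc)),
]

print(resolve_snapshot(snapshots, 42).snapshot_id)  # 42
print(resolve_snapshot(snapshots, dt.datetime(2024, 1, 2, 12, tzinfo=utc)).snapshot_id)  # 42
```

Using timezone-aware datetimes, as in the example above, avoids ambiguity when the catalog and the client run in different time zones.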
Transactions
Multiple metadata changes can be grouped into a single atomic transaction using transaction():
with ducklake.transaction() as tx:
    tx.create_schema("analytics")
    tx.create_table(
        ("analytics", "events"),
        schema={"id": dl.Int64(), "name": dl.Varchar()},
    )
If the with block raises an exception, the transaction is rolled back automatically. Otherwise, all changes are
committed atomically when the block exits.
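The commit-or-rollback behavior described above follows the standard Python context manager pattern. The toy example below illustrates that pattern with an in-memory stand-in. ToyCatalog is hypothetical and unrelated to the real ducklake classes. It only shows why an exception inside the with block leaves the committed state untouched.

```python
from contextlib import contextmanager


class ToyCatalog:
    """In-memory stand-in for a catalog; not the ducklake API."""

    def __init__(self):
        self.tables = set()

    @contextmanager
    def transaction(self):
        staged = set(self.tables)  # changes go to a private copy
        yield staged
        # Reached only if the with block finished without raising.
        self.tables = staged


catalog = ToyCatalog()

# Successful block: staged changes replace the committed state on exit.
with catalog.transaction() as tx:
    tx.add("analytics.events")

# Failing block: the exception propagates and the staged copy is discarded.
try:
    with catalog.transaction() as tx:
        tx.add("analytics.broken")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(sorted(catalog.tables))  # ['analytics.events']
```

Because the committed state is swapped in a single assignment, observers see either all staged changes or none of them, which mirrors the atomicity guarantee in miniature.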
Next Steps
Explore the API Reference for details on all available classes and methods.
Read the upstream DuckLake documentation to learn about the underlying format.