Ducklake

class ducklake.Ducklake

A connection to a DuckLake instance.

Methods:

  • at – Time travel to a specific snapshot in the catalog.

  • checkpoint – Run all recommended maintenance operations on the catalog.

  • cleanup_old_files – Delete files that have been scheduled for deletion.

  • delete_orphaned_files – Delete files in the data directory that are not referenced by any snapshot.

  • disconnect – Disconnect from the catalog database, gracefully closing all underlying connections.

  • execute_sql – Execute an arbitrary SQL query against the catalog database.

  • expire_snapshots – Expire snapshots in the catalog so the data they reference can be cleaned up.

  • get_latest_snapshot – Get metadata for the latest snapshot in the catalog.

  • get_table – Read a table from the catalog.

  • list_schemas – List all schema names in the catalog.

  • list_snapshots – List metadata for all snapshots in the catalog.

  • list_tables – List all tables in the catalog.

  • merge_adjacent_files – Merge small adjacent data files into larger ones across the catalog.

  • rewrite_data_files – Rewrite data files with a high fraction of deleted rows across the catalog.

  • set_metadata – Set one or more metadata options at the global or schema scope.

  • transaction – Start a new transaction against the catalog.

at(at: int | dt.datetime, /) → Ducklake

Time travel to a specific snapshot in the catalog.

Parameters:

at – Either the ID of the snapshot to time travel to, or a timestamp, in which case the latest snapshot created before that timestamp is used.

Returns:

A new Ducklake instance time traveled to the specified snapshot.
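The two accepted forms of the at argument can be illustrated with a small, self-contained sketch. It is not the library's implementation: the Snapshot record and the resolve_snapshot helper are hypothetical, and treating "before" as strictly earlier is an assumption.

```python
from __future__ import annotations

import datetime as dt
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot record (not SnapshotMetadata)."""

    id: int
    created_at: dt.datetime


def resolve_snapshot(snapshots: list[Snapshot], at: int | dt.datetime) -> Snapshot:
    """Pick a snapshot by ID, or the latest one created before a timestamp.

    `snapshots` is assumed to be sorted by `created_at` in ascending order.
    """
    if isinstance(at, int):
        for snapshot in snapshots:
            if snapshot.id == at:
                return snapshot
        raise LookupError(f"no snapshot with id {at}")
    # bisect_left gives the index of the first snapshot created at or after
    # `at`, so the entry just before it is the latest one strictly before `at`.
    index = bisect_left([s.created_at for s in snapshots], at)
    if index == 0:
        raise LookupError(f"no snapshot created before {at}")
    return snapshots[index - 1]
```

With real catalogs the lookup happens inside at(); the sketch only shows how an integer and a timestamp lead to different selection rules.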

checkpoint() → None

Run all recommended maintenance operations on the catalog.

Executes the DuckDB CHECKPOINT statement which flushes inlined data, expires snapshots, merges adjacent files, rewrites files with deletes and cleans up orphaned files. The behavior is configured via the rewrite_delete_threshold, delete_older_than, expire_older_than and auto_compact metadata options.

Note

This requires duckdb to be installed.

cleanup_old_files(
    *,
    cleanup_all: bool = False,
    older_than: dt.datetime | None = None,
    dry_run: bool = False,
) → list[str]

Delete files that have been scheduled for deletion.

Dispatches to ducklake_cleanup_old_files. Files are only scheduled for deletion when the snapshots referencing them are expired (see expire_snapshots()).

Parameters:
  • cleanup_all – If True, delete all files scheduled for deletion regardless of age.

  • older_than – If provided, only delete files scheduled for deletion before this timestamp.

  • dry_run – If True, no files are actually deleted.

Returns:

The paths that were deleted, or would be deleted when dry_run is True.

Note

This requires duckdb to be installed.

delete_orphaned_files(
    *,
    cleanup_all: bool = False,
    older_than: dt.datetime | None = None,
    dry_run: bool = False,
) → list[str]

Delete files in the data directory that are not referenced by any snapshot.

Dispatches to ducklake_delete_orphaned_files. Useful for cleaning up files that were written but never registered (e.g. due to a crashed writer).

Parameters:
  • cleanup_all – If True, delete all orphaned files regardless of age.

  • older_than – If provided, only delete orphaned files older than this timestamp.

  • dry_run – If True, no files are actually deleted.

Returns:

The paths that were deleted, or would be deleted when dry_run is True.

Note

This requires duckdb to be installed.

disconnect() → None

Disconnect from the catalog database, gracefully closing all underlying connections.

After calling this method, all subsequent operations on this Ducklake instance (or any Table / Transaction derived from it) will fail.

This is normally not required because connections are released when the instance is garbage collected, but it is useful when you need to ensure that all connections are released deterministically (e.g. before dropping the catalog database).

execute_sql(query: str | sa.ReturnsRows) → None

Execute an arbitrary SQL query against the catalog database.

Parameters:

query – The SQL query to execute. This may either be a raw string or a sqlalchemy query. If a raw string is provided, it must use the DuckDB SQL dialect. If a sqlalchemy query is provided, duckdb-engine must be installed.

Note

This requires duckdb to be installed.

expire_snapshots(
    *,
    versions: Sequence[int] | None = None,
    older_than: dt.datetime | None = None,
    dry_run: bool = False,
) → list[int]

Expire snapshots in the catalog so the data they reference can be cleaned up.

Dispatches to ducklake_expire_snapshots. Note that this does not immediately delete the underlying files; call cleanup_old_files() afterwards (or use checkpoint()).

Parameters:
  • versions – Optional list of snapshot ids to expire.

  • older_than – If provided, expire all snapshots created before this timestamp.

  • dry_run – If True, no snapshots are actually expired.

Returns:

The IDs of the snapshots that were expired, or would be expired when dry_run is True.

Note

This requires duckdb to be installed.
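Because expiring snapshots and deleting their files are separate steps, it can help to wrap both in one helper. The sketch below uses only the keyword arguments documented above; the helper name expire_then_cleanup is hypothetical, and lake stands for an already-connected Ducklake instance (how it is constructed is not covered here).

```python
from __future__ import annotations

import datetime as dt
from typing import Any


def expire_then_cleanup(
    lake: Any,
    older_than: dt.datetime,
    *,
    dry_run: bool = True,
) -> tuple[list[int], list[str]]:
    """Expire snapshots created before `older_than`, then clean up files.

    Defaults to a dry run so that nothing is removed by accident.
    """
    expired = lake.expire_snapshots(older_than=older_than, dry_run=dry_run)
    # Files freed by the step above are scheduled for deletion only now, so an
    # `older_than` filter on the cleanup would likely skip them. `cleanup_all`
    # is used instead; note that it also removes files scheduled earlier.
    removed = lake.cleanup_old_files(cleanup_all=True, dry_run=dry_run)
    return expired, removed
```

In a dry run the second list presumably reflects only files that were already scheduled for deletion, since no snapshot has actually been expired yet.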

get_latest_snapshot() → SnapshotMetadata

Get metadata for the latest snapshot in the catalog.

Returns:

Metadata for the latest snapshot in the catalog.

get_table(
    name: str | tuple[str, str] | TableName,
) → Table

Read a table from the catalog.

Parameters:

name – The name of the table, either a string or a TableName tuple. A string is parsed the way DuckDB parses table names: it must have the form <schema>.<table>, where the schema is optional and defaults to “main”. If either the schema or the table name contains special characters, both must be quoted using double quotes.

Returns:

The Table object.

Raises:

NotFoundError – If the table does not exist.
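To make the string form of name concrete, here is a simplified parser for <schema>.<table> names with optional double quotes. It is an illustration only: parse_table_name is a hypothetical helper, it handles doubled quotes ("") as an escaped quote character, and it does not enforce the rule that both parts must be quoted when either contains special characters.

```python
from __future__ import annotations


def parse_table_name(name: str, default_schema: str = "main") -> tuple[str, str]:
    """Split `name` into (schema, table), honoring double-quoted parts."""
    parts: list[str] = []
    current: list[str] = []
    in_quotes = False
    i = 0
    while i < len(name):
        char = name[i]
        if char == '"':
            if in_quotes and name[i + 1 : i + 2] == '"':
                # A doubled quote inside a quoted part is a literal quote.
                current.append('"')
                i += 1
            else:
                in_quotes = not in_quotes
        elif char == "." and not in_quotes:
            # An unquoted dot separates the schema from the table.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
        i += 1
    if in_quotes:
        raise ValueError(f"unterminated quoted identifier in {name!r}")
    parts.append("".join(current))
    if any(part == "" for part in parts) or len(parts) > 2:
        raise ValueError(f"expected <schema>.<table> or <table>, got {name!r}")
    if len(parts) == 1:
        return default_schema, parts[0]
    return parts[0], parts[1]
```

For instance, parse_table_name("orders") falls back to the "main" schema, while a quoted schema may itself contain a dot.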

list_schemas() → list[str]

List all schema names in the catalog.

Returns:

A list of all schema names in the catalog.

list_snapshots() → list[SnapshotMetadata]

List metadata for all snapshots in the catalog.

When time traveling, this returns only the snapshot that was traveled to.

Returns:

A list of metadata for all snapshots in the catalog, ordered from newest to oldest.

list_tables(schema: str | None = None) → list[Table]

List all tables in the catalog.

Parameters:

schema – Optional schema name to filter tables by. If None, returns all tables across all schemas.

Returns:

A list of all Table objects in the catalog, optionally filtered by schema.

merge_adjacent_files(
    *,
    max_compacted_files: int | None = None,
    min_file_size: int | None = None,
    max_file_size: int | None = None,
) → list[MaintenanceResult]

Merge small adjacent data files into larger ones across the catalog.

Dispatches to ducklake_merge_adjacent_files. Only tables with auto_compact enabled are considered.

Parameters:
  • max_compacted_files – Maximum number of compaction operations produced in a single call (per table).

  • min_file_size – Excludes files smaller than this many bytes from compaction.

  • max_file_size – Excludes files at or larger than this many bytes from compaction. Defaults to the target_file_size table option.

Returns:

A row for each output file created by the operation.

Note

This requires duckdb to be installed.
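The size bounds are asymmetric: min_file_size is inclusive and max_file_size is exclusive. A minimal sketch of that filter follows; eligible_for_compaction is a hypothetical helper working on a plain mapping of paths to sizes, not something the library exposes, and it requires max_file_size explicitly instead of reading the target_file_size option.

```python
from __future__ import annotations


def eligible_for_compaction(
    file_sizes: dict[str, int],
    *,
    max_file_size: int,
    min_file_size: int | None = None,
) -> list[str]:
    """Return the paths whose size passes both documented bounds."""
    return [
        path
        for path, size in file_sizes.items()
        # Files smaller than min_file_size are excluded (when it is set) ...
        if (min_file_size is None or size >= min_file_size)
        # ... and so are files at or above max_file_size.
        and size < max_file_size
    ]
```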

rewrite_data_files(
    *,
    delete_threshold: float | None = None,
) → list[MaintenanceResult]

Rewrite data files with a high fraction of deleted rows across the catalog.

Dispatches to ducklake_rewrite_data_files. Files whose fraction of deleted rows exceeds delete_threshold are rewritten without the deleted rows.

Parameters:

delete_threshold – Minimum fraction (0-1) of deleted rows required to trigger a rewrite. Defaults to the rewrite_delete_threshold metadata option (0.95).

Returns:

A row for each output file created by the operation.

Note

This requires duckdb to be installed.
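The threshold is a fraction of rows, not a row count. The following sketch shows the comparison under stated assumptions: needs_rewrite is a hypothetical helper, and whether a file sitting exactly at the threshold is rewritten is not specified above, so the strict comparison here is a guess.

```python
def needs_rewrite(
    total_rows: int,
    deleted_rows: int,
    delete_threshold: float = 0.95,
) -> bool:
    """Decide whether a file's deleted fraction exceeds the threshold."""
    if total_rows <= 0:
        # An empty file has no meaningful deleted fraction.
        return False
    return deleted_rows / total_rows > delete_threshold
```

With the default of 0.95, a file with 96 of 100 rows deleted qualifies, while one with half of its rows deleted does not unless a lower delete_threshold is passed.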

set_metadata(
    *,
    schema: str | None = None,
    **options: bool | int | float | str | None,
) → None

Set one or more metadata options at the global or schema scope.

Provide options as keyword arguments. Pass None as a value to remove the option from the metadata (i.e. revert it to its default).

Parameters:

schema – Optional schema name to scope the table-level options to. If not provided, the options are set globally. Schema scope is only valid for keys in TableMetadataUpdate.

Raises:

ValueError – If a key is read-only and cannot be set.

See also

Table.set_metadata() for setting metadata options at the table scope.
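The keyword-argument convention, where None removes an option, can be modeled with a plain dictionary. This is only a sketch of the semantics described above: apply_options is a hypothetical helper, and the real method writes to the catalog and validates keys, which is not reproduced here.

```python
from __future__ import annotations


def apply_options(
    current: dict[str, object],
    **options: bool | int | float | str | None,
) -> dict[str, object]:
    """Return a copy of `current` with `options` applied.

    A value of None removes the key, which models reverting to the default.
    """
    updated = dict(current)
    for key, value in options.items():
        if value is None:
            updated.pop(key, None)
        else:
            updated[key] = value
    return updated
```

Calling apply_options(settings, auto_compact=False, rewrite_delete_threshold=None) would therefore override one option and drop the other.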

transaction(
    *,
    author: str | None = None,
    message: str | None = None,
    extra_info: str | None = None,
) → Transaction

Start a new transaction against the catalog.

Parameters:
  • author – Optional author attached to the snapshot created on commit.

  • message – Optional commit message attached to the snapshot.

  • extra_info – Optional additional structured info attached to the snapshot.

Returns:

A new Transaction. If used as a context manager, the transaction is automatically committed on successful exit.
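A possible way to use the context-manager form is sketched below. run_in_transaction is a hypothetical helper, lake stands for an already-connected Ducklake instance, and the sketch assumes that the with statement binds the Transaction object itself. What happens when the block raises is not documented above, so the helper simply lets the exception propagate.

```python
from __future__ import annotations

from typing import Any, Callable


def run_in_transaction(
    lake: Any,
    work: Callable[[Any], None],
    *,
    author: str,
    message: str,
) -> None:
    """Run `work` inside a transaction that commits on successful exit."""
    with lake.transaction(author=author, message=message) as tx:
        # All changes made through `tx` belong to the snapshot created on
        # commit, which carries the author and message given above.
        work(tx)
```

Keeping the author and message at the call site makes each snapshot in list_snapshots() easier to attribute later.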