Ducklake

class ducklake.Ducklake

A connection to a DuckLake instance.

Methods:

`at`	Time travel to a specific snapshot in the catalog.
`checkpoint`	Run all recommended maintenance operations on the catalog.
`cleanup_old_files`	Delete files that have been scheduled for deletion.
`create_schema`	Create a new schema in the catalog.
`create_table`	Create a new table in the catalog.
`delete_orphaned_files`	Delete files in the data directory that are not tracked in the catalog database.
`disconnect`	Disconnect from the catalog database, gracefully closing all underlying connections.
`execute_sql`	Execute an arbitrary SQL query against the catalog database.
`expire_snapshots`	Expire snapshots in the catalog so the data they reference can be cleaned up.
`get_latest_snapshot`	Get metadata for the latest snapshot in the catalog.
`get_table`	Read a table from the catalog.
`list_schemas`	List all schema names in the catalog.
`list_snapshots`	List metadata for all snapshots in the catalog.
`list_tables`	List all tables in the catalog.
`merge_adjacent_files`	Merge small adjacent data files into larger ones across the catalog.
`rewrite_data_files`	Rewrite data files with a high fraction of deleted rows across the catalog.
`set_metadata`	Set one or more metadata options at the global or schema scope.
`transaction`	Start a new transaction against the catalog.

Attributes:

metadata

The metadata associated with the catalog.

at(at: int | dt.datetime, /) → Ducklake

Time travel to a specific snapshot in the catalog.

Parameters: at – The ID of the snapshot to time travel to, or a timestamp to find the latest snapshot before that timestamp.
Returns: A new Ducklake instance time traveled to the specified snapshot.
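
To make the timestamp form concrete, the sketch below shows one way such a lookup could resolve. It is an illustration only, not the library's implementation: the `resolve_snapshot` helper and the `(snapshot_id, created_at)` pair representation are hypothetical, and it assumes "before" means strictly earlier.

```python
from __future__ import annotations

import datetime as dt


def resolve_snapshot(
    snapshots: list[tuple[int, dt.datetime]],
    at: int | dt.datetime,
) -> int:
    """Return the snapshot ID that `at` refers to (illustrative only)."""
    if isinstance(at, int):
        # An integer is taken as a snapshot ID and must exist.
        if not any(snapshot_id == at for snapshot_id, _ in snapshots):
            raise LookupError(f"no snapshot with ID {at}")
        return at
    # A timestamp selects the most recent snapshot created before it.
    # Assumption: "before" is strict; the library may treat ties differently.
    earlier = [
        (created, snapshot_id)
        for snapshot_id, created in snapshots
        if created < at
    ]
    if not earlier:
        raise LookupError(f"no snapshot before {at}")
    return max(earlier)[1]
```

With snapshots created on three consecutive days, a timestamp at noon on the second day would resolve to the second snapshot under these assumptions.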

checkpoint() → None

Run all recommended maintenance operations on the catalog.

Executes the DuckDB CHECKPOINT statement which flushes inlined data, expires snapshots, merges adjacent files, rewrites files with deletes and cleans up orphaned files. The behavior is configured via the rewrite_delete_threshold, delete_older_than, expire_older_than and auto_compact metadata options.

Note

This requires duckdb to be installed.
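
When only some of these steps are wanted, the individual maintenance methods documented on this page can be called one by one. The following is a minimal sketch under stated assumptions: `lake` is a connected Ducklake instance, `run_maintenance` is a hypothetical helper, and the sequence is not equivalent to checkpoint() because it does not flush inlined data.

```python
from __future__ import annotations

from typing import Any


def run_maintenance(lake: Any, *, dry_run: bool = False) -> dict[str, int]:
    """Run the individual maintenance methods and count what each reported.

    `lake` is assumed to be a connected Ducklake instance. This is not a
    replacement for checkpoint(): it does not flush inlined data.
    """
    counts: dict[str, int] = {}
    counts["expired_snapshots"] = len(lake.expire_snapshots(dry_run=dry_run))
    if not dry_run:
        # The compaction methods take no dry_run flag, so skip them in a preview.
        counts["merged_files"] = len(lake.merge_adjacent_files())
        counts["rewritten_files"] = len(lake.rewrite_data_files())
    counts["cleaned_files"] = len(lake.cleanup_old_files(dry_run=dry_run))
    counts["orphaned_files"] = len(lake.delete_orphaned_files(dry_run=dry_run))
    return counts
```

Calling `run_maintenance(lake, dry_run=True)` first gives a preview of the snapshot and file deletions without changing the catalog.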

cleanup_old_files(*, cleanup_all: bool = False, older_than: dt.datetime | None = None, dry_run: bool = False) → list[str]

Delete files that have been scheduled for deletion.

Files are only scheduled for deletion when the snapshots referencing them are expired (see expire_snapshots()).

If neither cleanup_all nor older_than is provided, files are deleted according to the delete_older_than metadata option (which defaults to two days). This grace period prevents deleting files that active queries are still using.

Parameters:

cleanup_all – If True, delete all files scheduled for deletion regardless of age.
older_than – If provided, only delete files scheduled for deletion before this timestamp.
dry_run – If True, no files are actually deleted and the returned paths merely indicate what would be deleted.

Returns:

The paths that were deleted, or would be deleted when dry_run is True.
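
Because dry_run returns the paths without deleting anything, it can serve as a confirmation step before the real deletion. The sketch below assumes `lake` is a connected Ducklake instance; `cleanup_with_preview` and its `confirm` callback are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import datetime as dt
from typing import Any, Callable


def cleanup_with_preview(
    lake: Any,
    older_than: dt.datetime,
    confirm: Callable[[list[str]], bool],
) -> list[str]:
    """Preview the deletion with dry_run=True, then delete only if confirmed."""
    candidates = lake.cleanup_old_files(older_than=older_than, dry_run=True)
    if not candidates or not confirm(candidates):
        # Nothing to delete, or the caller declined: leave all files in place.
        return []
    return lake.cleanup_old_files(older_than=older_than)
```

The `confirm` callback could print the candidate paths and prompt an operator, or apply an automated check such as a maximum file count.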

create_schema(name: str, *, data_path: str | None = None, if_exists: Literal['fail', 'skip'] = 'fail') → None

Create a new schema in the catalog.

Parameters:

name – The name of the new schema.
data_path – Optional data path for the schema. If not provided, it defaults to the schema name.
if_exists – The strategy to apply if a schema with the same name already exists. “fail” raises an AlreadyExistsError, while “skip” leaves the existing schema unchanged.
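
The "skip" strategy makes schema creation safe to repeat, which is convenient in setup scripts. A minimal sketch, assuming `lake` is a connected Ducklake instance; the `ensure_schemas` helper is hypothetical.

```python
from typing import Any, Iterable


def ensure_schemas(lake: Any, names: Iterable[str]) -> None:
    """Create each schema if it is missing; existing schemas are left unchanged."""
    for name in names:
        # if_exists="skip" avoids AlreadyExistsError when the schema is present.
        lake.create_schema(name, if_exists="skip")
```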

create_table

Create a new table in the catalog.

Parameters:

name – The fully qualified name of the new table.
schema – The schema of the new table.
partition_by – Optional partitioning for the table.
data_path – Optional data path for the table.
tags – Optional tags to attach to the table.
if_exists – The strategy to apply if a table with the same name already exists. “fail” raises an AlreadyExistsError, while “skip” returns the existing table unchanged.

Returns:

The newly created Table.

delete_orphaned_files(*, cleanup_all: bool = False, older_than: dt.datetime | None = None, dry_run: bool = False) → list[str]

Delete files in the data directory that are not tracked in the catalog database.

This is useful for cleaning up files that were written but never registered (e.g. due to a crashed writer).

If neither cleanup_all nor older_than is provided, files are deleted according to the delete_older_than metadata option (which defaults to two days). This grace period prevents deleting files that are currently being written.

Parameters:

cleanup_all – If True, delete all orphaned files regardless of age.
older_than – If provided, only delete orphaned files last modified before this timestamp.
dry_run – If True, no files are actually deleted and the returned paths merely indicate what would be deleted.

Returns:

The paths that were deleted, or would be deleted when dry_run is True.
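
An explicit cutoff can be passed through older_than instead of relying on the metadata option. The sketch below previews orphaned files older than a chosen grace period. It assumes `lake` is a connected Ducklake instance; `preview_orphans` is a hypothetical helper, and whether the library expects naive or timezone-aware timestamps is left to the caller.

```python
from __future__ import annotations

import datetime as dt
from typing import Any


def preview_orphans(
    lake: Any,
    now: dt.datetime,
    grace: dt.timedelta = dt.timedelta(days=2),
) -> list[str]:
    """List orphaned files last modified before `now - grace` without deleting them."""
    cutoff = now - grace
    # dry_run=True only reports the paths; nothing is removed.
    return lake.delete_orphaned_files(older_than=cutoff, dry_run=True)
```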

disconnect() → None

Disconnect from the catalog database, gracefully closing all underlying connections.

After calling this method, all subsequent operations on this Ducklake instance (or any Table / Transaction derived from it) will fail.

This is normally not required because connections are released when the instance is garbage collected, but it is useful when you need to ensure that all connections are released deterministically (e.g. before dropping the catalog database).
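
For deterministic release, the call can be wrapped in a small context manager so that disconnect() runs even when an error occurs. This is a sketch, not part of the library: `disconnecting` is a hypothetical helper and `lake` is assumed to be a connected Ducklake instance.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def disconnecting(lake: Any) -> Iterator[Any]:
    """Yield `lake` and always call disconnect() on exit, even on error."""
    try:
        yield lake
    finally:
        lake.disconnect()
```

A custom helper is used here because `contextlib.closing` calls `close()`, not `disconnect()`.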

execute_sql(query: str | sa.ReturnsRows) → None

Execute an arbitrary SQL query against the catalog database.

Parameters: query – The SQL query to execute. This may either be a raw string or a sqlalchemy query. If a raw string is provided, it must use the DuckDB SQL dialect. If a sqlalchemy query is provided, duckdb-engine must be installed.

Note

This requires duckdb to be installed.

expire_snapshots(*, versions: Sequence[int] | None = None, older_than: dt.datetime | None = None, dry_run: bool = False) → list[SnapshotMetadata]

Expire snapshots in the catalog so the data they reference can be cleaned up.

This does not immediately delete the underlying files; call cleanup_old_files() afterwards (or use checkpoint()).

The latest snapshot is always retained. If neither versions nor older_than is provided, snapshots are expired according to the "expire_older_than" metadata option. If that option is not set, no snapshots are expired.

Parameters:

versions – A list of snapshot IDs to expire. Versions that do not exist or refer to the latest snapshot are silently ignored.
older_than – If provided, expire all snapshots created before this timestamp.
dry_run – If True, no snapshots are actually expired and the returned snapshots merely indicate what would be expired.

Returns:

The snapshots that were expired, or would be expired when dry_run is True.
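
The two-step flow described above (expire first, then clean up) can be sketched as follows. The `expire_and_cleanup` helper is hypothetical and `lake` is assumed to be a connected Ducklake instance.

```python
from __future__ import annotations

import datetime as dt
from typing import Any


def expire_and_cleanup(lake: Any, older_than: dt.datetime) -> tuple[list, list[str]]:
    """Expire snapshots created before `older_than`, then delete eligible files."""
    expired = lake.expire_snapshots(older_than=older_than)
    # Called without arguments, cleanup_old_files() still honors the
    # delete_older_than grace period, so files scheduled only moments ago
    # may remain until a later cleanup run.
    deleted = lake.cleanup_old_files()
    return expired, deleted
```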

get_latest_snapshot() → SnapshotMetadata

Get metadata for the latest snapshot in the catalog.

Returns: Metadata for the latest snapshot in the catalog.

get_table(name: str | tuple[str, str] | TableName) → Table

Read a table from the catalog.

Parameters: name – The name of the table. This can either be a string or a TableName tuple. If a string is provided, it is parsed just like DuckDB parses table names: it must be of the format <schema>.<table> where the schema is optional and defaults to “main”. If either the schema or table name contains special characters, both must be quoted using double quotes.
Returns: The Table object.
Raises: NotFoundError – If the table does not exist.
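
One way to avoid quoting mistakes when building the string form is to always quote both parts. The helper below is a hypothetical sketch: `qualified_name` is not part of the library, and doubling embedded double quotes follows the standard SQL convention for quoted identifiers, which is assumed to match what the parser accepts. Passing a `(schema, table)` tuple avoids string parsing altogether.

```python
def qualified_name(schema: str, table: str) -> str:
    """Build a '<schema>.<table>' string with both parts double-quoted."""

    def quote(identifier: str) -> str:
        # Standard SQL escaping: a double quote inside a quoted identifier
        # is written as two double quotes.
        return '"' + identifier.replace('"', '""') + '"'

    return f"{quote(schema)}.{quote(table)}"
```

For example, `qualified_name("main", "my table")` produces `"main"."my table"`.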

list_schemas() → list[str]

List all schema names in the catalog.

Returns: A list of all schema names in the catalog.

list_snapshots() → list[SnapshotMetadata]

List metadata for all snapshots in the catalog.

When time traveling, this returns only the snapshot that was traveled to.

Returns: A list of metadata for all snapshots in the catalog, ordered from newest to oldest.

list_tables(schema: str | None = None) → list[Table]

List all tables in the catalog.

Parameters: schema – Optional schema name to filter tables by. If None, returns all tables across all schemas.
Returns: A list of all Table objects in the catalog, optionally filtered by schema.
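
Combining list_schemas() with the schema filter gives a per-schema inventory. A minimal sketch, assuming `lake` is a connected Ducklake instance; `tables_by_schema` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def tables_by_schema(lake: Any) -> dict[str, list]:
    """Map each schema name to the Table objects it contains."""
    # One list_tables() call per schema; schemas without tables map to [].
    return {schema: lake.list_tables(schema) for schema in lake.list_schemas()}
```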

merge_adjacent_files(*, max_compacted_files: int | None = None, min_file_size: int | None = None, max_file_size: int | None = None) → list[MaintenanceResult]

Merge small adjacent data files into larger ones across the catalog.

Dispatches to ducklake_merge_adjacent_files. Only tables with auto_compact enabled are considered.

Parameters:

max_compacted_files – Maximum number of compaction operations produced in a single call (per table).
min_file_size – Excludes files smaller than this many bytes from compaction.
max_file_size – Excludes files at or larger than this many bytes from compaction. Defaults to the target_file_size table option.

Returns:

A row for each output file created by the operation.

Note

This requires duckdb to be installed.
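
The size bounds are asymmetric: min_file_size excludes files smaller than the value, while max_file_size excludes files at or above it. The predicate below illustrates that boundary behavior only. `is_merge_candidate` is a hypothetical helper, not the library's selection logic, and it treats a missing bound as "no limit", whereas the real max_file_size falls back to the target_file_size table option.

```python
from __future__ import annotations


def is_merge_candidate(
    size: int,
    *,
    min_file_size: int | None = None,
    max_file_size: int | None = None,
) -> bool:
    """Return True if a file of `size` bytes falls inside the size bounds."""
    if min_file_size is not None and size < min_file_size:
        return False  # smaller than the lower bound: excluded
    if max_file_size is not None and size >= max_file_size:
        return False  # at or above the upper bound: excluded
    return True
```

Under these assumptions a file exactly at min_file_size is kept, while a file exactly at max_file_size is excluded.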

property metadata: GlobalMetadata

The metadata associated with the catalog.

rewrite_data_files(*, delete_threshold: float | None = None) → list[MaintenanceResult]

Rewrite data files with a high fraction of deleted rows across the catalog.

Dispatches to ducklake_rewrite_data_files. Files whose fraction of deleted rows exceeds delete_threshold are rewritten without the deleted rows.

Parameters: delete_threshold – Minimum fraction (0-1) of deleted rows required to trigger a rewrite. Defaults to the rewrite_delete_threshold metadata option (0.95).
Returns: A row for each output file created by the operation.

Note

This requires duckdb to be installed.

set_metadata

Set one or more metadata options at the global or schema scope.

Provide options as keyword arguments. Pass None as a value to remove the option from the metadata (i.e. revert it to its default).

Parameters: schema – Optional schema name to scope the table-level options to. If not provided, the options are set globally. Only valid for keys in TableMetadataUpdate.
Raises: ValueError – If a key is read-only and cannot be set.
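
Since passing None reverts an option to its default, options can be applied for the duration of a block and removed afterwards. The sketch below assumes `lake` is a connected Ducklake instance and that options are accepted as plain keyword arguments, as described above; `temporary_options` is a hypothetical helper. It restores defaults on exit, not whatever value was set before.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def temporary_options(lake: Any, **options: Any) -> Iterator[Any]:
    """Set global metadata options for a block, then revert them to defaults."""
    lake.set_metadata(**options)
    try:
        yield lake
    finally:
        # None removes each option, which reverts it to its default value
        # (not to a value that may have been set before entering the block).
        lake.set_metadata(**{key: None for key in options})
```

For instance, `with temporary_options(lake, rewrite_delete_threshold=0.5): ...` would lower the rewrite threshold only while the block runs.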