Ducklake
- class ducklake.Ducklake
A connection to a DuckLake instance.
Methods:
at() – Time travel to a specific snapshot in the catalog.
checkpoint() – Run all recommended maintenance operations on the catalog.
cleanup_old_files() – Delete files that have been scheduled for deletion.
delete_orphaned_files() – Delete files in the data directory that are not referenced by any snapshot.
disconnect() – Disconnect from the catalog database, gracefully closing all underlying connections.
execute_sql() – Execute an arbitrary SQL query against the catalog database.
expire_snapshots() – Expire snapshots in the catalog so the data they reference can be cleaned up.
get_latest_snapshot() – Get metadata for the latest snapshot in the catalog.
get_table() – Read a table from the catalog.
list_schemas() – List all schema names in the catalog.
list_snapshots() – List metadata for all snapshots in the catalog.
list_tables() – List all tables in the catalog.
merge_adjacent_files() – Merge small adjacent data files into larger ones across the catalog.
rewrite_data_files() – Rewrite data files with a high fraction of deleted rows across the catalog.
set_metadata() – Set one or more metadata options at the global or schema scope.
transaction() – Start a new transaction against the catalog.
- at(at: int | dt.datetime, /) → Ducklake
Time travel to a specific snapshot in the catalog.
- Parameters:
at – The ID of the snapshot to time travel to, or a timestamp to find the latest snapshot before that timestamp.
- Returns:
A new Ducklake instance time traveled to the specified snapshot.
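The two accepted forms of the at argument can be illustrated with a small, library-independent sketch. SnapshotStub and resolve_snapshot are hypothetical names used only for this illustration; they are not part of ducklake. The sketch also assumes that "before" means strictly before the given timestamp, which the description above does not confirm.

```python
from __future__ import annotations

import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotStub:
    """Hypothetical stand-in for snapshot metadata (not the library's SnapshotMetadata)."""

    snapshot_id: int
    created_at: dt.datetime


def resolve_snapshot(snapshots: list[SnapshotStub], at: int | dt.datetime) -> SnapshotStub:
    """Pick the snapshot a time-travel request would target under the stated assumptions."""
    if isinstance(at, int):
        # Integer case: look the snapshot up by its ID.
        for snap in snapshots:
            if snap.snapshot_id == at:
                return snap
        raise LookupError(f"no snapshot with id {at}")
    # Timestamp case: latest snapshot created before `at` (strictness is an assumption).
    earlier = [snap for snap in snapshots if snap.created_at < at]
    if not earlier:
        raise LookupError(f"no snapshot before {at}")
    return max(earlier, key=lambda snap: snap.created_at)


history = [
    SnapshotStub(0, dt.datetime(2024, 1, 1)),
    SnapshotStub(1, dt.datetime(2024, 2, 1)),
    SnapshotStub(2, dt.datetime(2024, 3, 1)),
]
by_id = resolve_snapshot(history, 1)
by_time = resolve_snapshot(history, dt.datetime(2024, 2, 15))
```

Both lookups in this sketch resolve to the snapshot with ID 1: once directly by ID, once because it is the newest snapshot created before 15 February 2024.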
- checkpoint() → None
Run all recommended maintenance operations on the catalog.
Executes the DuckDB CHECKPOINT statement, which flushes inlined data, expires snapshots, merges adjacent files, rewrites files with deletes, and cleans up orphaned files. The behavior is configured via the rewrite_delete_threshold, delete_older_than, expire_older_than and auto_compact metadata options.
Note
This requires duckdb to be installed.
- cleanup_old_files(...) → list[str]
Delete files that have been scheduled for deletion.
Dispatches to ducklake_cleanup_old_files. Files are only scheduled for deletion when the snapshots referencing them are expired (see expire_snapshots()).
- Parameters:
cleanup_all – If True, delete all files scheduled for deletion regardless of age.
older_than – If provided, only delete files scheduled for deletion before this timestamp.
dry_run – If True, no files are actually deleted.
- Returns:
The paths that were deleted, or would be deleted when dry_run is True.
Note
This requires duckdb to be installed.
- delete_orphaned_files(...) → list[str]
Delete files in the data directory that are not referenced by any snapshot.
Dispatches to ducklake_delete_orphaned_files. Useful for cleaning up files that were written but never registered (e.g. due to a crashed writer).
- Parameters:
cleanup_all – If True, delete all orphaned files regardless of age.
older_than – If provided, only delete orphaned files older than this timestamp.
dry_run – If True, no files are actually deleted.
- Returns:
The paths that were deleted, or would be deleted when dry_run is True.
Note
This requires duckdb to be installed.
- disconnect() → None
Disconnect from the catalog database, gracefully closing all underlying connections.
After calling this method, all subsequent operations on this Ducklake instance (or any Table/Transaction derived from it) will fail.
This is normally not required because connections are released when the instance is garbage collected, but it is useful when you need to ensure that all connections are released deterministically (e.g. before dropping the catalog database).
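One way to get the deterministic release described above is to wrap the connection in a small context manager that always calls disconnect(). The helper below is a sketch, not something the library is known to provide; disconnecting and FakeLake are hypothetical names, and FakeLake exists only so the example runs without a catalog.

```python
from contextlib import contextmanager


@contextmanager
def disconnecting(lake):
    """Yield `lake` and always call its disconnect() afterwards, even on error."""
    try:
        yield lake
    finally:
        lake.disconnect()


class FakeLake:
    """Stand-in for a real connection; not part of the library."""

    def __init__(self):
        self.connected = True

    def disconnect(self):
        self.connected = False


fake = FakeLake()
with disconnecting(fake) as lake:
    # Work with the connection here; it is still open inside the block.
    was_connected_inside = lake.connected
```

With a real connection, the same helper could wrap any object that exposes disconnect(), so the connections are released before, for example, the catalog database is dropped.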
- execute_sql(query: str | sa.ReturnsRows) → None
Execute an arbitrary SQL query against the catalog database.
- Parameters:
query – The SQL query to execute. This may be either a raw string or a sqlalchemy query. If a raw string is provided, it must use the DuckDB SQL dialect. If a sqlalchemy query is provided, duckdb-engine must be installed.
Note
This requires duckdb to be installed.
- expire_snapshots(*, versions: Sequence[int] | None = None, older_than: dt.datetime | None = None, dry_run: bool = False)
Expire snapshots in the catalog so the data they reference can be cleaned up.
Dispatches to ducklake_expire_snapshots. Note that this does not immediately delete the underlying files; call cleanup_old_files() afterwards (or use checkpoint()).
- Parameters:
versions – Optional list of snapshot IDs to expire.
older_than – If provided, expire all snapshots created before this timestamp.
dry_run – If True, no snapshots are actually expired.
- Returns:
The IDs of the snapshots that were expired, or would be expired when dry_run is True.
Note
This requires duckdb to be installed.
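Because expiring snapshots only schedules files for deletion, the two calls are typically paired. The sketch below shows that sequence using only the method and parameter names documented on this page. It assumes cleanup_all and dry_run can be passed to cleanup_old_files() as keyword arguments, which the elided signature does not confirm. expire_and_cleanup and RecordingLake are hypothetical; the stub returns placeholder values so the example runs without a catalog.

```python
import datetime as dt


def expire_and_cleanup(lake, snapshots_older_than, dry_run=True):
    """Expire old snapshots, then delete the files scheduled for deletion."""
    expired = lake.expire_snapshots(older_than=snapshots_older_than, dry_run=dry_run)
    # With dry_run=True nothing was expired above, so this second call can only
    # report files that were already scheduled for deletion beforehand.
    removed = lake.cleanup_old_files(cleanup_all=True, dry_run=dry_run)
    return expired, removed


class RecordingLake:
    """Stand-in that records calls and returns placeholder results; not the library."""

    def __init__(self):
        self.calls = []

    def expire_snapshots(self, *, versions=None, older_than=None, dry_run=False):
        self.calls.append(("expire_snapshots", dry_run))
        return [1, 2]  # placeholder snapshot IDs

    def cleanup_old_files(self, *, cleanup_all=False, older_than=None, dry_run=False):
        self.calls.append(("cleanup_old_files", dry_run))
        return ["data/file-a.parquet"]  # placeholder path


stub = RecordingLake()
expired, removed = expire_and_cleanup(stub, dt.datetime(2024, 1, 1))
```

Defaulting dry_run to True in the helper is a deliberate choice for the sketch: a first run reports what would be affected, and a second run with dry_run=False performs the work.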
- get_latest_snapshot() → SnapshotMetadata
Get metadata for the latest snapshot in the catalog.
- Returns:
Metadata for the latest snapshot in the catalog.
- get_table(...) → Table
Read a table from the catalog.
- Parameters:
name – The name of the table. This can be either a string or a TableName tuple. If a string is provided, it is parsed just like DuckDB parses table names: it must be of the format <schema>.<table>, where the schema is optional and defaults to “main”. If either the schema or table name contains special characters, both must be quoted using double quotes.
- Returns:
The Table object.
- Raises:
NotFoundError – If the table does not exist.
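The string form of name can be illustrated with a simplified parser. This is a sketch of the rules stated above (optional schema defaulting to "main", double-quoted parts for special characters), not the library's or DuckDB's actual parser; parse_table_name is a hypothetical helper. It assumes an embedded double quote is escaped by doubling it, and it does no case folding.

```python
import re

# One identifier: either double-quoted (with "" as an escaped quote) or a bare
# run of characters containing neither a dot nor a double quote.
_IDENT = r'"(?:[^"]|"")*"|[^".]+'
_PATTERN = re.compile(rf'^(?:({_IDENT})\.)?({_IDENT})$')


def _unquote(part):
    """Strip surrounding double quotes and collapse doubled quotes."""
    if part.startswith('"'):
        return part[1:-1].replace('""', '"')
    return part


def parse_table_name(name):
    """Split '<schema>.<table>' into (schema, table); the schema defaults to 'main'."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"cannot parse table name: {name!r}")
    schema, table = match.groups()
    return (_unquote(schema) if schema is not None else "main", _unquote(table))


plain = parse_table_name("orders")
qualified = parse_table_name("sales.orders")
quoted = parse_table_name('"my schema"."my.table"')
```

In this sketch a name with more than two parts, such as a.b.c, is rejected with ValueError, since only the <schema>.<table> form is described above.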
- list_schemas() → list[str]
List all schema names in the catalog.
- Returns:
A list of all schema names in the catalog.
- list_snapshots() → list[SnapshotMetadata]
List metadata for all snapshots in the catalog.
When time traveling, this returns only the snapshot that was traveled to.
- Returns:
A list of metadata for all snapshots in the catalog, ordered from newest to oldest.
- list_tables(schema: str | None = None) → list[Table]
List all tables in the catalog.
- Parameters:
schema – Optional schema name to filter tables by. If None, returns all tables across all schemas.
- Returns:
A list of all Table objects in the catalog, optionally filtered by schema.
- merge_adjacent_files(*, max_compacted_files: int | None = None, min_file_size: int | None = None, max_file_size: int | None = None)
Merge small adjacent data files into larger ones across the catalog.
Dispatches to ducklake_merge_adjacent_files. Only tables with auto_compact enabled are considered.
- Parameters:
max_compacted_files – Maximum number of compaction operations produced in a single call (per table).
min_file_size – Excludes files smaller than this many bytes from compaction.
max_file_size – Excludes files at or larger than this many bytes from compaction. Defaults to the target_file_size table option.
- Returns:
A row for each output file created by the operation.
Note
This requires duckdb to be installed.
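The size bounds described above form a half-open interval: min_file_size is inclusive and max_file_size is exclusive. The following sketch shows that filter in isolation. eligible_for_merge is a hypothetical helper, not the library's implementation, and treating a missing max_file_size as "no upper bound" is a simplification, since the real default comes from the target_file_size table option.

```python
def eligible_for_merge(size_bytes, min_file_size=None, max_file_size=None):
    """Return True if a file of `size_bytes` passes the documented size bounds."""
    if min_file_size is not None and size_bytes < min_file_size:
        return False  # smaller than min_file_size: excluded
    if max_file_size is not None and size_bytes >= max_file_size:
        return False  # at or larger than max_file_size: excluded
    return True


sizes = [512, 4096, 1_000_000]
candidates = [
    size
    for size in sizes
    if eligible_for_merge(size, min_file_size=1024, max_file_size=1_000_000)
]
```

Here only the 4096-byte file remains a candidate: the 512-byte file falls below the minimum, and the 1,000,000-byte file sits exactly at the exclusive maximum.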
- rewrite_data_files(...) → list[MaintenanceResult]
Rewrite data files with a high fraction of deleted rows across the catalog.
Dispatches to ducklake_rewrite_data_files. Files containing more deletes than delete_threshold are rewritten without the deleted rows.
- Parameters:
delete_threshold – Minimum fraction (0-1) of deleted rows required to trigger a rewrite. Defaults to the rewrite_delete_threshold metadata option (0.95).
- Returns:
A row for each output file created by the operation.
Note
This requires duckdb to be installed.
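The threshold check can be written out as simple arithmetic. The sketch below is illustrative only; should_rewrite is a hypothetical helper. It assumes a strict comparison ("more deletes than delete_threshold"), so a file exactly at the threshold is left alone; the actual boundary behavior is not stated here.

```python
def should_rewrite(deleted_rows, total_rows, delete_threshold=0.95):
    """Return True if the deleted fraction exceeds the threshold (strictness assumed)."""
    if total_rows <= 0:
        return False  # nothing to measure in an empty file
    return deleted_rows / total_rows > delete_threshold


mostly_deleted = should_rewrite(96, 100)
at_threshold = should_rewrite(95, 100)
custom_threshold = should_rewrite(50, 100, delete_threshold=0.4)
```

With the default of 0.95, a file with 96 of 100 rows deleted qualifies, while one with exactly 95 of 100 does not. Lowering the threshold to 0.4 makes a half-deleted file qualify.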
- set_metadata(...) → None
Set one or more metadata options at the global or schema scope.
Provide options as keyword arguments. Pass None as a value to remove the option from the metadata (i.e. revert it to its default).
- Parameters:
schema – Optional schema name to scope the table-level options to. If not provided, the options are set globally. Only valid for keys in TableMetadataUpdate.
- Raises:
ValueError – If a key is read-only and cannot be set.
See also
Table.set_metadata() for setting metadata options at the table scope.
- transaction(...) → Transaction
Start a new transaction against the catalog.
- Parameters:
author – Optional author attached to the snapshot created on commit.
message – Optional commit message attached to the snapshot.
extra_info – Optional additional structured info attached to the snapshot.
- Returns:
A new Transaction. If used as a context manager, the transaction is automatically committed on successful exit.
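The commit-on-successful-exit behavior follows the usual Python context manager pattern, sketched below with a stand-in class. SketchTransaction is hypothetical and is not the library's Transaction. The text above only states that a successful exit commits; that a failing block results in a rollback and that the exception propagates are assumptions of this sketch.

```python
class SketchTransaction:
    """Hypothetical stand-in illustrating commit-on-success; not the library's class."""

    def __init__(self, author=None, message=None):
        self.author = author
        self.message = message
        self.state = "open"

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit only when the block finished without raising (rollback is assumed).
        self.state = "committed" if exc_type is None else "rolled back"
        return False  # never swallow the exception


ok = SketchTransaction(author="etl", message="nightly load")
with ok:
    pass  # changes would be staged here

failed = SketchTransaction()
try:
    with failed:
        raise RuntimeError("boom")
except RuntimeError:
    pass
```

After this runs, the first transaction ends in the committed state and the second does not, which mirrors how a with block around transaction() would be expected to behave.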