The iceberg extension is a loadable extension that implements support for the Apache Iceberg format.
Installing and Loading
To install and load the iceberg extension, run:
INSTALL iceberg;
LOAD iceberg;
Updating the Extension
The iceberg extension often receives updates between DuckDB releases. To make sure that you have the latest version, run:
UPDATE EXTENSIONS (iceberg);
Usage
To test the examples, download the iceberg_data.zip file and unzip it.
Common Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
allow_moved_paths |
BOOLEAN |
false |
Allows scanning Iceberg tables that are moved |
metadata_compression_codec |
VARCHAR |
'' |
Treats metadata files as when set to 'gzip' |
version |
VARCHAR |
'?' |
Provides an explicit version string, hint file or guessing |
version_name_format |
VARCHAR |
'v%s%s.metadata.json,%s%s.metadata.json' |
Controls how versions are converted to metadata file names |
Querying Individual Tables
SELECT count(*)
FROM iceberg_scan('data/iceberg/lineitem_iceberg', allow_moved_paths = true);
| count_star() |
|---|
| 51793 |
The
allow_moved_pathsoption ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved.
You can also address specify the current manifest directly in the query, this may be resolved from the catalog prior to the query, in this example the manifest version is a UUID.
SELECT count(*)
FROM iceberg_scan('data/iceberg/lineitem_iceberg/metadata/02701-1e474dc7-4723-4f8d-a8b3-b5f0454eb7ce.metadata.json');
This extension can be paired with the httpfs extension to access Iceberg tables in object stores such as S3.
SELECT count(*)
FROM iceberg_scan(
's3://bucketname/lineitem_iceberg/metadata/02701-1e474dc7-4723-4f8d-a8b3-b5f0454eb7ce.metadata.json',
allow_moved_paths = true
);
Access Iceberg Metadata
SELECT *
FROM iceberg_metadata('data/iceberg/lineitem_iceberg', allow_moved_paths = true);
| manifest_path | manifest_sequence_number | manifest_content | status | content | file_path | file_format | record_count |
|---|---|---|---|---|---|---|---|
| lineitem_iceberg/metadata/10eaca8a-1e1c-421e-ad6d-b232e5ee23d3-m1.avro | 2 | DATA | ADDED | EXISTING | lineitem_iceberg/data/00041-414-f3c73457-bbd6-4b92-9c15-17b241171b16-00001.parquet | PARQUET | 51793 |
| lineitem_iceberg/metadata/10eaca8a-1e1c-421e-ad6d-b232e5ee23d3-m0.avro | 2 | DATA | DELETED | EXISTING | lineitem_iceberg/data/00000-411-0792dcfe-4e25-4ca3-8ada-175286069a47-00001.parquet | PARQUET | 60175 |
Visualizing Snapshots
SELECT *
FROM iceberg_snapshots('data/iceberg/lineitem_iceberg');
| sequence_number | snapshot_id | timestamp_ms | manifest_list |
|---|---|---|---|
| 1 | 3776207205136740581 | 2023-02-15 15:07:54.504 | lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro |
| 2 | 7635660646343998149 | 2023-02-15 15:08:14.73 | lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro |
Selecting Metadata versions
By default, the iceberg extension will look for a version-hint.text file to identify the proper metadata version to use. This can be overridden by explicitly supplying a version number via the version parameter to iceberg table functions. By default, this will look for both v{version}.metadata.json and {version}.metadata.json files, or v{version}.gz.metadata.json and {version}.gz.metadata.json when metadata_compression_codec = 'gzip' is specified. Other compression codecs are not supported.
Additionally, if any .text or .txt file is provided as a version, it is opened and treated as a version-hint file. The iceberg extension will open this file and use the entire contents of the file as a provided version number.
The entire contents of the
version-hint.txtfile will be treated as a literal version name, with no encoding, escaping or trimming. This includes any whitespace, or unsafe characters which will be explicitly passed formatted into filenames in the logic described below.
SELECT *
FROM iceberg_snapshots(
'data/iceberg/lineitem_iceberg',
version = '1',
allow_moved_paths = true
);
| count_star() |
|---|
| 60175 |
Working with Alternative Metadata Naming Conventions
The iceberg extension can handle different metadata naming conventions by specifying them as a comma-delimited list of format strings via the version_name_format parameter. Each format string must take two %s parameters. The first is the location of the version number in the metadata filename and the second is the location of the metadata_compression_codec extension. The behavior described above is provided by the default value of "v%s%s.metadata.gz,%s%smetadata.gz. In the event you had an alternatively named metadata file with such as rev-2.metadata.json.gz, the table could be read via the follow statement.
SELECT *
FROM iceberg_snapshots(
'data/iceberg/alternative_metadata_gz_naming',
version = '2',
version_name_format = 'rev-%s.metadata.json%s',
metadata_compression_codec = 'gzip',
allow_moved_paths = true
);
| count_star() |
|---|
| 60175 |
“Guessing” Metadata Versions
By default, either a table version number or a version-hint.text must be provided for the iceberg extension to read a table. This is typically provided by an external data catalog. In the event neither is present, the iceberg extension can attempt to guess the latest version by passing ? as the table version. The “latest” version is assumed to be the filename that is lexicographically largest when sorting the filenames. Collations are not considered. This behavior is not enabled by default as it may potentially violate ACID constraints. It can be enabled by setting unsafe_enable_version_guessing to true. When this is set, iceberg functions will attempt to guess the latest version by default before failing.
SET unsafe_enable_version_guessing=true;
SELECT count(*)
FROM iceberg_scan('data/iceberg/lineitem_iceberg_no_hint', allow_moved_paths = true);
-- Or explicitly as:
-- FROM iceberg_scan(
-- 'data/iceberg/lineitem_iceberg_no_hint',
-- version = '?',
-- allow_moved_paths = true
-- );
| count_star() |
|---|
| 51793 |
Limitations
Writing (i.e., exporting to) Iceberg files is currently not supported.