Releases
Downloads
The latest version of Iceberg is 1.4.3.
- 1.4.3 source tar.gz -- signature -- sha512
- 1.4.3 Spark 3.5_2.12 runtime Jar -- 3.5_2.13
- 1.4.3 Spark 3.4_2.12 runtime Jar -- 3.4_2.13
- 1.4.3 Spark 3.3_2.12 runtime Jar -- 3.3_2.13
- 1.4.3 Flink 1.18 runtime Jar
- 1.4.3 Flink 1.17 runtime Jar
- 1.4.3 Flink 1.16 runtime Jar
- 1.4.3 Hive runtime Jar
- 1.4.3 aws-bundle Jar
- 1.4.3 gcp-bundle Jar
- 1.4.3 azure-bundle Jar
To use Iceberg in Spark or Flink, download the runtime JAR for your engine version and add it to the jars folder of your installation.
To use Iceberg in Hive 2 or Hive 3, download the Hive runtime JAR and add it to Hive using ADD JAR.
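As a rough illustration of the download step above, the sketch below derives the runtime JAR file name and a download URL for a given engine version. It assumes the artifact names iceberg-spark-runtime-<spark>_<scala>, iceberg-flink-runtime-<flink> and iceberg-hive-runtime, and the standard Maven Central repository layout. The class and method names are hypothetical and are not part of Iceberg.

```java
class RuntimeJarNames {
    // Assumption: standard Maven Central layout for the org.apache.iceberg group.
    static final String REPO = "https://repo1.maven.org/maven2/org/apache/iceberg";

    // Spark runtime artifacts carry both the Spark and the Scala version.
    static String sparkArtifact(String sparkVersion, String scalaVersion) {
        return "iceberg-spark-runtime-" + sparkVersion + "_" + scalaVersion;
    }

    // Flink runtime artifacts carry only the Flink version.
    static String flinkArtifact(String flinkVersion) {
        return "iceberg-flink-runtime-" + flinkVersion;
    }

    // A single Hive runtime artifact is used for Hive 2 and Hive 3.
    static String hiveArtifact() {
        return "iceberg-hive-runtime";
    }

    // File name of the JAR for an artifact and Iceberg version.
    static String jarFileName(String artifact, String icebergVersion) {
        return artifact + "-" + icebergVersion + ".jar";
    }

    // Full download URL under the assumed repository layout.
    static String jarUrl(String artifact, String icebergVersion) {
        return REPO + "/" + artifact + "/" + icebergVersion + "/"
            + jarFileName(artifact, icebergVersion);
    }

    public static void main(String[] args) {
        String version = "1.4.3";
        System.out.println(jarUrl(sparkArtifact("3.5", "2.12"), version));
        System.out.println(jarUrl(flinkArtifact("1.18"), version));
        System.out.println(jarUrl(hiveArtifact(), version));
    }
}
```

The downloaded file would then be copied into the jars folder of the Spark or Flink installation, or registered in Hive with ADD JAR followed by the path to the file.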
Gradle
To add a dependency on Iceberg in Gradle, declare the org.apache.iceberg:iceberg-core:1.4.3 artifact in the dependencies block of build.gradle.
You may also want to include iceberg-parquet for Parquet file support.
Maven
To add a dependency on Iceberg in Maven, add the following to your pom.xml:
<dependencies>
...
<dependency>
<groupId>org.apache.iceberg</groupId>
<artifactId>iceberg-core</artifactId>
<version>1.4.3</version>
</dependency>
...
</dependencies>
1.5.0 release
Apache Iceberg 1.5.0 was released on March 11, 2024. The 1.5.0 release adds a variety of new features and bug fixes.
- API
- Extend FileIO and add EncryptingFileIO. (#9592)
- Track partition statistics in TableMetadata (#8502)
- Add sqlFor API to views to handle resolving a representation for a dialect (#9247)
- Core
- Add view support for REST catalog (#7913)
- Add view support for JDBC catalog (#9487)
- Add catalog type for glue,jdbc,nessie (#9647)
- Support Avro file encryption with AES GCM streams (#9436)
- Add ApplyNameMapping for Avro (#9347)
- Add StandardEncryptionManager (#9277)
- Add REST catalog table session cache (#8920)
- Support view metadata compression (#8552)
- Track partition statistics in TableMetadata (#8502)
- Enable column statistics filtering after planning (#8803)
- Spark
- Remove support for Spark 3.2 (#9295)
- Support views via SQL for Spark 3.4 and 3.5 (#9423, #9421, #9343, #9513, #9582)
- Support executor cache locality (#9563)
- Added support for delete manifest rewrites (#9020)
- Support encrypted output files (#9435)
- Add Spark UI metrics from Iceberg scan metrics (#8717)
- Parallelize reading files in add_files procedure (#9274)
- Support file and partition delete granularity (#9384)
- Flink
- Remove support for Flink 1.15
- Add support for Flink 1.18 (#9211)
- Emit watermarks from the IcebergSource (#8553)
- Watermark read options (#9346)
- Parquet
- Support reading INT96 column in row group filter (#8988)
- Add system config for unsafe Parquet ID fallback. (#9324)
- Kafka-Connect
- Initial project setup and event data structures (#8701)
- Sink connector with data writers and converters (#9466)
- Spec
- Add partition stats spec (#7105)
- Add nanosecond timestamp types (#8683)
- Add multi-arg transform (#8579)
- Vendor Integrations
- AWS: Support setting description for Glue table (#9530)
- AWS: Update S3FileIO test to run when CLIENT_FACTORY is not set (#9541)
- AWS: Add S3 Access Grants Integration (#9385)
- AWS: Glue catalog strip trailing slash on DB URI (#8870)
- Azure: Add FileIO that supports ADLSv2 storage (#8303)
- Azure: Make ADLSFileIO implement DelegateFileIO (#8563)
- Nessie: Support views for NessieCatalog (#8909)
- Nessie: Strip trailing slash for warehouse location (#9415)
- Nessie: Infer default API version from URI (#9459)
- Dependencies
- Bump Nessie to 0.77.1
- Bump ORC to 1.9.2
- Bump Arrow to 15.0.0
- Bump AWS Java SDK to 2.24.5
- Bump Azure Java SDK to 1.2.20
- Bump Google cloud libraries to 26.28.0
Note:
1. To enable view support for JDBC catalog, configure jdbc.schema-version to V1 in catalog properties.
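The note above can be sketched as a set of catalog properties. Only jdbc.schema-version=V1 comes from the release notes; the type, uri and warehouse keys are assumed from common Iceberg catalog configuration, the JDBC URL and warehouse path are placeholders, and the class itself is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcViewCatalogProperties {
    // Builds catalog properties for a JDBC catalog with view support enabled.
    static Map<String, String> build(String jdbcUri, String warehouse) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("type", "jdbc");              // assumed catalog type key and value
        props.put("uri", jdbcUri);              // placeholder JDBC connection URL
        props.put("warehouse", warehouse);      // placeholder warehouse location
        props.put("jdbc.schema-version", "V1"); // from the note: enables view support
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = build(
            "jdbc:postgresql://localhost:5432/iceberg", // hypothetical database
            "s3://my-bucket/warehouse");                // hypothetical warehouse
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

A map like this would typically be handed to whatever mechanism initializes the catalog, such as engine configuration or a direct catalog initialization call.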
For more details, please visit GitHub: https://github.com/apache/iceberg/releases/tag/apache-iceberg-1.5.0
Past releases
1.4.3 Release
Apache Iceberg 1.4.3 was released on December 27, 2023. The main issue it solves is missing files from a transaction retry with conflicting manifests. It is recommended to upgrade if you use transactions.
- Core: Scan only live entries in partitions table (#8969) by @Fokko in #9197
- Core: Fix missing files from transaction retries with conflicting manifest merges by @nastra in #9337
- JDBC Catalog: Fix namespaceExists check with special characters by @ismailsimsek in #9291
- Core: Expired Snapshot files in a transaction should be deleted by @bartash in #9223
- Core: Fix missing delete files from transaction by @nastra in #9356
1.4.2 Release
Apache Iceberg 1.4.2 was released on November 2, 2023. The 1.4.2 patch release fixes a remaining case where split offsets should be ignored when they are deemed invalid.
- Core
- Ignore split offsets array when split offset is past file length (#8925)
1.4.1 Release
Apache Iceberg 1.4.1 was released on October 23, 2023. The 1.4.1 release addresses various issues identified in the 1.4.0 release.
- Core
- AWS
- Avoid static global credentials provider which doesn't play well with lifecycle management (#8677)
- Flink
- Reverting the default custom partitioner for bucket column (#8848)
1.4.0 release
Apache Iceberg 1.4.0 was released on October 4, 2023. The 1.4.0 release adds a variety of new features and bug fixes.
- API
- Core
- Use V2 format by default in new tables (#8381)
- Use zstd compression for Parquet by default in new tables (#8593)
- Add strict metadata cleanup mode and enable it by default (#8397) (#8599)
- Avoid generating huge manifests during commits (#6335)
- Add a writer for unordered position deletes (#7692)
- Optimize DeleteFileIndex (#8157)
- Optimize lookup in DeleteFileIndex without useful bounds (#8278)
- Optimize split offsets handling (#8336)
- Optimize computing user-facing state in data tasks (#8346)
- Don't persist useless file and position bounds for deletes (#8360)
- Don't persist counts for paths and positions in position delete files (#8590)
- Support setting system-level properties via environmental variables (#5659)
- Add JSON parser for ContentFile and FileScanTask (#6934)
- Add REST spec and request for commits to multiple tables (#7741)
- Add REST API for committing changes against multiple tables (#7569)
- Default to exponential retry strategy in REST client (#8366)
- Support registering tables with REST session catalog (#6512)
- Add last updated timestamp and snapshot ID to partitions metadata table (#7581)
- Add total data size to partitions metadata table (#7920)
- Extend ResolvingFileIO to support bulk operations (#7976)
- Key metadata in Avro format (#6450)
- Add AES GCM encryption stream (#3231)
- Fix a connection leak in streaming delete filters (#8132)
- Fix lazy snapshot loading history (#8470)
- Fix unicode handling in HTTPClient (#8046)
- Fix paths for unpartitioned specs in writers (#7685)
- Fix OOM caused by Avro decoder caching (#7791)
- Spark
- Added support for Spark 3.5
- Code for DELETE, UPDATE, and MERGE commands has moved to Spark, and all related extensions have been dropped from Iceberg.
- Support for WHEN NOT MATCHED BY SOURCE clause in MERGE.
- Column pruning in merge-on-read operations.
- Ability to request a bigger advisory partition size for the final write to produce well-sized output files without harming the job parallelism.
- Dropped support for Spark 3.1
- Deprecated support for Spark 3.2
- Support vectorized reads for merge-on-read operations in Spark 3.4 and 3.5 (#8466)
- Increase default advisory partition size for writes in Spark 3.5 (#8660)
- Support distributed planning in Spark 3.4 and 3.5 (#8123)
- Support pushing down system functions by V2 filters in Spark 3.4 and 3.5 (#7886)
- Support fanout position delta writers in Spark 3.4 and 3.5 (#7703)
- Use fanout writers for unsorted tables by default in Spark 3.5 (#8621)
- Support multiple shuffle partitions per file in compaction in Spark 3.4 and 3.5 (#7897)
- Output net changes across snapshots for carryover rows in CDC (#7326)
- Display read metrics on Spark SQL UI (#7447) (#8445)
- Adjust split size to benefit from cluster parallelism in Spark 3.4 and 3.5 (#7714)
- Add fast_forward procedure (#8081)
- Support filters when rewriting position deletes (#7582)
- Support setting current snapshot with ref (#8163)
- Make backup table name configurable during migration (#8227)
- Add write and SQL options to override compression config (#8313)
- Correct partition transform functions to match the spec (#8192)
- Enable extra commit properties with metadata delete (#7649)
- Flink
- Add possibility of ordering the splits based on the file sequence number (#7661)
- Fix serialization in TableSink with anonymous object (#7866)
- Switch to FileScanTaskParser for JSON serialization of IcebergSourceSplit (#7978)
- Custom partitioner for bucket partitions (#7161)
- Implement data statistics coordinator to aggregate data statistics from operator subtasks (#7360)
- Support alter table column (#7628)
- Parquet
- ORC
- Handle filters with transforms by assuming the filter matches (#8244)
- Vendor Integrations
- GCP: Fix single byte read in GCSInputStream (#8071)
- GCP: Add properties for OAuth2 and update library (#8073)
- GCP: Add prefix and bulk operations to GCSFileIO (#8168)
- GCP: Add bundle jar for GCP-related dependencies (#8231)
- GCP: Add range reads to GCSInputStream (#8301)
- AWS: Add bundle jar for AWS-related dependencies (#8261)
- AWS: Support config storage class for S3FileIO (#8154)
- AWS: Add FileIO tracker/closer to Glue catalog (#8315)
- AWS: Update S3 signer spec to allow an optional string body in S3SignRequest (#8361)
- Azure: Add FileIO that supports ADLSv2 storage (#8303)
- Azure: Make ADLSFileIO implement DelegateFileIO (#8563)
- Nessie: Provide better commit message on table registration (#8385)
- Dependencies
- Bump Nessie to 0.71.0
- Bump ORC to 1.9.1
- Bump Arrow to 12.0.1
- Bump AWS Java SDK to 2.20.131
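The two Core defaults listed above for 1.4.0 (V2 format and zstd Parquet compression in new tables) should only apply when a table is created without explicit settings. A minimal sketch of table properties that pin the earlier behavior follows. It assumes the table property keys format-version and write.parquet.compression-codec, and assumes gzip as the earlier Parquet codec; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LegacyTableDefaults {
    // Table properties that opt a new table out of the 1.4.0 defaults.
    static Map<String, String> pre14Defaults() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("format-version", "1");                     // assumed key; V2 is the 1.4.0 default
        props.put("write.parquet.compression-codec", "gzip"); // assumed key and earlier codec; zstd is the 1.4.0 default
        return props;
    }

    public static void main(String[] args) {
        pre14Defaults().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These entries would be supplied as table properties at table creation time, for example through the engine's CREATE TABLE options.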
1.3.1 release
Apache Iceberg 1.3.1 was released on July 25, 2023. The 1.3.1 release addresses various issues identified in the 1.3.0 release.
- Core
- Table Metadata parser now accepts null for fields: current-snapshot-id, properties, and snapshots (#8064)
- Hive
- Fix HiveCatalog deleting metadata on failures in checking lock status (#7931)
- Spark
- Flink
- FlinkCatalog creation no longer creates the default database (#8039)
1.3.0 release
Apache Iceberg 1.3.0 was released on May 30th, 2023. The 1.3.0 release adds a variety of new features and bug fixes.
- Core
- Expose file and data sequence numbers in ContentFile (#7555)
- Improve bit density in object storage layout (#7128)
- Store split offsets for delete files (#7011)
- Readable metrics in entries metadata table (#7539)
- Delete file stats in partitions metadata table (#6661)
- Optimized vectorized reads for Parquet Decimal (#3249)
- Vectorized reads for Parquet INT96 timestamps in imported data (#6962)
- Support selected vector with ORC row and batch readers (#7197)
- Clean up expired metastore clients (#7310)
- Support for deleting old partition spec columns in V1 tables (#7398)
- Spark
- Initial support for Spark 3.4
- Removed integration for Spark 2.4
- Support for storage-partitioned joins with mismatching keys in Spark 3.4 (MERGE commands) (#7424)
- Support for TimestampNTZ in Spark 3.4 (#7553)
- Ability to handle skew during writes in Spark 3.4 (#7520)
- Ability to coalesce small tasks during writes in Spark 3.4 (#7532)
- Distribution and ordering enhancements in Spark 3.4 (#7637)
- Action for rewriting position deletes (#7389)
- Procedure for rewriting position deletes (#7572)
- Avoid local sort for MERGE cardinality check (#7558)
- Support for rate limits in Structured Streaming (#4479)
- Read and write support for UUIDs (#7399)
- Concurrent compaction is enabled by default (#6907)
- Support for metadata columns in changelog tables (#7152)
- Add file group failure info for data compaction (#7361)
- Flink
- Initial support for Flink 1.17
- Removed integration for Flink 1.14
- Data statistics operator to collect traffic distribution for guiding smart shuffling (#6382)
- Data statistics operator sends local data statistics to coordinator and receives aggregated data statistics from coordinator for smart shuffling (#7269)
- Exposed write parallelism in SQL hints (#7039)
- Row-level filtering (#7109)
- Use starting sequence number by default when rewriting data files (#7218)
- Config for max allowed consecutive planning failures in IcebergSource before failing the job (#7571)
- Vendor Integrations
- Dependencies
- Bump Arrow to 12.0.0
- Bump ORC to 1.8.3
- Bump Parquet to 1.13.1
- Bump Nessie to 0.59.0
1.2.1 release
Apache Iceberg 1.2.1 was released on April 11th, 2023. The 1.2.1 release is a patch release to address various issues identified in the prior release. Here is an overview:
- CORE
- Spark
- AWS
1.2.0 release
Apache Iceberg 1.2.0 was released on March 20th, 2023. The 1.2.0 release adds a variety of new features and bug fixes. Here is an overview:
- Core
- Added AES GCM encryption stream spec (#5432)
- Added support for Delta Lake to Iceberg table conversion (#6449, #6880)
- Added support for position_deletes metadata table (#6365, #6716)
- Added support for scan and commit metrics reporter that is pluggable through catalog (#6404, #6246, #6410)
- Added support for branch commit for all operations (#4926, #5010)
- Added FileIO support for ORC readers and writers (#6293)
- Updated all actions to leverage bulk delete whenever possible (#6682)
- Updated snapshot ID definition in Puffin spec to support statistics file reuse (#6272)
- Added human-readable metrics information in files metadata table (#5376)
- Fixed incorrect Parquet row group skipping when min and max values are NaN (#6517)
- Fixed a bug that location provider could generate paths with double slash (//) which is not compatible in a Hadoop file system (#6777)
- Fixed metadata table time travel failure for tables that performed schema evolution (#6980)
- Spark
- Added time range query support for changelog table (#6350)
- Added changelog view procedure for v1 table (#6012)
- Added support for storage partition joins to improve read and write performance (#6371)
- Updated default Arrow environment settings to improve read performance (#6550)
- Added aggregate pushdown support for min, max and count to improve read performance (#6622)
- Updated default distribution mode settings to improve write performance (#6828, #6838)
- Updated DELETE to perform metadata-only update whenever possible to improve write performance (#6899)
- Improved predicate pushdown support for write operations (#6636)
- Added support for reading a branch or tag through table identifier and VERSION AS OF (a.k.a. FOR SYSTEM_VERSION AS OF) SQL syntax (#6717, #6575)
- Added support for writing to a branch through identifier or through write-audit-publish (WAP) workflow settings (#6965, #7050)
- Added DDL SQL extensions to create, replace and drop a branch or tag (#6638, #6637, #6752, #6807)
- Added UDFs for years, months, days and hours transforms (#6207, #6261, #6300, #6339)
- Added partition related stats for add_files procedure result (#6797)
- Fixed a bug that rewrite_manifests procedure produced a new manifest even when there was no rewrite performed (#6659)
- Fixed a bug that statistics files were not cleaned up in expire_snapshots procedure (#6090)
- Flink
- Added support for metadata tables (#6222)
- Added support for read options in Flink source (#5967)
- Added support for reading and writing Avro GenericRecord (#6557, #6584)
- Added support for reading a branch or tag and write to a branch (#6660, #5029)
- Added throttling support for streaming read (#6299)
- Added support for multiple sinks for the same table in the same job (#6528)
- Fixed a bug that metrics config was not applied to equality and position deletes (#6271, #6313)
- Vendor Integrations
- Added Snowflake catalog integration (#6428)
- Added AWS sigV4 authentication support for REST catalog (#6951)
- Added support for AWS S3 remote signing (#6169, #6835, #7080)
- Updated AWS Glue catalog to skip table version archive by default (#6919)
- Updated AWS Glue catalog to not require a warehouse location (#6586)
- Fixed a bug that a bucket-only AWS S3 location such as s3://my-bucket could not be parsed (#6352)
- Fixed a bug that unnecessary HTTP client dependencies had to be included to use any AWS integration (#6746)
- Fixed a bug that AWS Glue catalog did not respect custom catalog ID when determining default warehouse location (#6223)
- Fixed a bug that AWS DynamoDB catalog namespace listing result was incomplete (#6823)
- Dependencies
For more details, please visit GitHub.
1.1.0 release
Apache Iceberg 1.1.0 was released on November 28th, 2022. The 1.1.0 release deprecates various pre-1.0.0 methods, and adds a variety of new features. Here is an overview:
- Core
- Puffin statistics have been added to the Table API
- Support for Table scan reporting, which enables collection of statistics of the table scans.
- Add file sequence number to ManifestEntry
- Support register table for all the catalogs (previously it was only for Hive)
- Support performing merge appends and delete files on branches
- Improved Expire Snapshots FileCleanupStrategy
- SnapshotProducer supports branch writes
- Spark
- Support for aggregate expressions
- SparkChangelogTable for querying changelogs
- Dropped support for Apache Spark 3.0
- Flink
- FLIP-27 reader is supported in SQL
- Added support for Flink 1.16, dropped support for Flink 1.13
- Dependencies
For more details, please visit GitHub.
1.0.0 release
The 1.0.0 release officially guarantees the stability of the Iceberg API.
Iceberg's API has been largely stable since very early releases and has been integrated with many processing engines, but was still released under a 0.y.z version number indicating that breaking changes may happen. From 1.0.0 forward, the project will follow semver in the public API module, iceberg-api.
This release removes deprecated APIs that are no longer part of the API. To make transitioning to the new release easier, it is based on the 0.14.1 release with only important bug fixes:
- Increase metrics limit to 100 columns (#5933)
- Bump Spark patch versions for CVE-2022-33891 (#5292)
- Exclude Scala from Spark runtime Jars (#5884)
0.14.1 release
This release includes all bug fixes from the 0.14.x patch releases.
Notable bug fixes
- API
- API: Fix ID assignment in schema merging (#5395)
- Core
- Spark
- Spark: Fix stats in rewrite metadata action (#5691)
- File Formats
- Parquet: Close zstd input stream early to avoid memory pressure (#5681)
- Vendor Integrations
0.14.0 release
Apache Iceberg 0.14.0 was released on 16 July 2022.
Highlights
- Added several performance improvements for scan planning and Spark queries
- Added a common REST catalog client that uses change-based commits to resolve commit conflicts on the service side
- Added support for Spark 3.3, including AS OF syntax for SQL time travel queries
- Added support for Scala 2.13 with Spark 3.2 or later
- Added merge-on-read support for MERGE and UPDATE queries in Spark 3.2 or later
- Added support to rewrite partitions using zorder
- Added support for Flink 1.15 and dropped support for Flink 1.12
- Added a spec and implementation for Puffin, a format for large stats and index blobs, like Theta sketches or bloom filters
- Added new interfaces for consuming data incrementally (both append and changelog scans)
- Added support for bulk operations and ranged reads to FileIO interfaces
- Added more metadata tables to show delete files in the metadata tree
High-level features
- API
- Added IcebergBuild to expose Iceberg version and build information
- Added binary compatibility checking to the build (#4638, #4798)
- Added a new IncrementalAppendScan interface and planner implementation (#4580)
- Added a new IncrementalChangelogScan interface (#4870)
- Refactored the ScanTask hierarchy to create new task types for changelog scans (#5077)
- Added expression sanitizer (#4672)
- Added utility to check expression equivalence (#4947)
- Added support for serializing FileIO instances using initialization properties (#5178)
- Updated Snapshot methods to accept a FileIO to read metadata files, deprecated old methods (#4873)
- Added optional interfaces to FileIO, for batch deletes (#4052), prefix operations (#5096), and ranged reads (#4608)
- Core
- Added a common client for REST-based catalog services that uses a change-based protocol (#4320, #4319)
- Added Puffin, a file format for statistics and index payloads or sketches (#4944, #4537)
- Added snapshot references to track tags and branches (#4019)
- ManageSnapshots now supports multiple operations using transactions, and added branch and tag operations (#4128, #4071)
- ReplacePartitions and OverwriteFiles now support serializable isolation (#2925, #4052)
- Added new metadata tables: data_files (#4336), delete_files (#4243), all_delete_files, and all_files (#4694)
- Added deleted files to the files metadata table (#4336) and delete file counts to the manifests table (#4764)
- Added support for predicate pushdown for the all_data_files metadata table (#4382) and the all_manifests table (#4736)
- Added support for catalogs to default table properties on creation (#4011)
- Updated sort order construction to ensure all partition fields are added to avoid partition closed failures (#5131)
- Spark
- Spark 3.3 is now supported (#5056)
- Added SQL time travel using AS OF syntax in Spark 3.3 (#5156)
- Scala 2.13 is now supported for Spark 3.2 and 3.3 (#4009)
- Added support for the mergeSchema option for DataFrame writes (#4154)
- MERGE and UPDATE queries now support the lazy / merge-on-read strategy (#3984, #4047)
- Added zorder rewrite strategy to the rewrite_data_files stored procedure and action (#3983, #4902)
- Added a register_table stored procedure to create tables from metadata JSON files (#4810)
- Added a publish_changes stored procedure to publish staged commits by ID (#4715)
- Added CommitMetadata helper class to set snapshot summary properties from SQL (#4956)
- Added support to supply a file listing to remove orphan data files procedure and action (#4503)
- Added FileIO metrics to the Spark UI (#4030, #4050)
- DROP TABLE now supports the PURGE flag (#3056)
- Added support for custom isolation level for dynamic partition overwrites (#2925) and filter overwrites (#4293)
- Schema identifier fields are now shown in table properties (#4475)
- Abort cleanup now supports parallel execution (#4704)
- Flink
- Flink 1.15 is now supported (#4553)
- Flink 1.12 support was removed (#4551)
- Added a FLIP-27 source and builder to 1.14 and 1.15 (#5109)
- Added an option to set the monitor interval (#4887) and an option to limit the number of snapshots in a streaming read planning operation (#4943)
- Added support for write options, like write-format, to Flink sink builder (#3998)
- Added support for task locality when reading from HDFS (#3817)
- Use Hadoop configuration files from hadoop-conf-dir property (#4622)
- Vendor integrations
- Added Dell ECS integration (#3376, #4221)
- JDBC catalog now supports namespace properties (#3275)
- AWS Glue catalog supports native Glue locking (#4166)
- AWS S3FileIO supports using S3 access points (#4334), bulk operations (#4052, #5096), ranged reads (#4608), and tagging at write time or in place of deletes (#4259, #4342)
- AWS GlueCatalog supports passing LakeFormation credentials (#4280)
- AWS DynamoDB catalog and lock supports overriding the DynamoDB endpoint (#4726)
- Nessie now supports namespaces and namespace properties (#4385, #4610)
- Nessie now passes most common catalog tests (#4392)
- Parquet
- ORC
Performance improvements
- Core
- Fixed manifest file handling in scan planning to open manifests in the planning threadpool (#5206)
- Avoided an extra S3 HEAD request by passing file length when opening manifest files (#5207)
- Refactored Arrow vectorized readers to avoid extra dictionary copies (#5137)
- Improved Arrow decimal handling to improve decimal performance (#5168, #5198)
- Added support for Avro files with Zstd compression (#4083)
- Column metrics are now disabled by default after the first 32 columns (#3959, #5215)
- Updated delete filters to copy row wrappers to avoid expensive type analysis (#5249)
- Snapshot expiration supports parallel execution (#4148)
- Manifest updates can use a custom thread pool (#4146)
- Spark
- Flink
- Hive
Notable bug fixes
This release includes all bug fixes from the 0.13.x patch releases.
- Core
- Fixed an exception thrown when metadata-only deletes encounter delete files that are partially matched (#4304)
- Fixed transaction retries for changes without validations, like schema updates, that could ignore an update (#4464)
- Fixed failures when reading metadata tables with evolved partition specs (#4520, #4560)
- Fixed delete files dropped when a manifest is rewritten following a format version upgrade (#4514)
- Fixed missing metadata files resulting from an OOM during commit cleanup (#4673)
- Updated logging to use sanitized expressions to avoid leaking values (#4672)
- Spark
- Flink
- Fixed table property update failures when tables have a primary key (#4561)
- Integrations
Dependency changes
- Updated Apache Avro to 1.10.2 (previously 1.10.1)
- Updated Apache Parquet to 1.12.3 (previously 1.12.2)
- Updated Apache ORC to 1.7.5 (previously 1.7.2)
- Updated Apache Arrow to 7.0.0 (previously 6.0.0)
- Updated AWS SDK to 2.17.131 (previously 2.15.7)
- Updated Nessie to 0.30.0 (previously 0.18.0)
- Updated Caffeine to 2.9.3 (previously 2.8.4)
0.13.2
Apache Iceberg 0.13.2 was released on June 15th, 2022.
- Git tag: 0.13.2
- 0.13.2 source tar.gz -- signature -- sha512
- 0.13.2 Spark 3.2 runtime Jar
- 0.13.2 Spark 3.1 runtime Jar
- 0.13.2 Spark 3.0 runtime Jar
- 0.13.2 Spark 2.4 runtime Jar
- 0.13.2 Flink 1.14 runtime Jar
- 0.13.2 Flink 1.13 runtime Jar
- 0.13.2 Flink 1.12 runtime Jar
- 0.13.2 Hive runtime Jar
Important bug fixes and changes:
- Core
- #4673 fixes table corruption from OOM during commit cleanup
- #4514 fixes row delta delete files being dropped in sequential commits after the table format was updated to v2
- #4464 fixes an issue where conflicting transactions were ignored during a commit
- #4520 fixes an issue with wrong table predicate filtering with evolved partition specs
- Spark
- #4663 fixes NPEs in Spark value converter
- #4687 fixes an issue with incorrect aborts when non-runtime exceptions were thrown in Spark
- Flink
- Note that there's a correctness issue when using upsert mode in Flink 1.12. Given that Flink 1.12 is deprecated, it was decided to not fix this bug but rather log a warning (see also #4754).
- Nessie
- #4509 fixes a NPE that occurred when accessing refreshed tables in NessieCatalog
A more exhaustive list of changes is available under the 0.13.2 release milestone.
0.13.1
Apache Iceberg 0.13.1 was released on February 14th, 2022.
- Git tag: 0.13.1
- 0.13.1 source tar.gz -- signature -- sha512
- 0.13.1 Spark 3.2 runtime Jar
- 0.13.1 Spark 3.1 runtime Jar
- 0.13.1 Spark 3.0 runtime Jar
- 0.13.1 Spark 2.4 runtime Jar
- 0.13.1 Flink 1.14 runtime Jar
- 0.13.1 Flink 1.13 runtime Jar
- 0.13.1 Flink 1.12 runtime Jar
- 0.13.1 Hive runtime Jar
Important bug fixes:
- Spark
- #4023 fixes predicate pushdown in row-level operations for merge conditions in Spark 3.2. Prior to the fix, filters would not be extracted and targeted merge conditions were not pushed down leading to degraded performance for these targeted merge operations.
- #4024 fixes table creation in the root namespace of a Hadoop Catalog.
- Flink
- #3986 fixes manifest location collisions when there are multiple committers in the same Flink job.
0.13.0
Apache Iceberg 0.13.0 was released on February 4th, 2022.
- Git tag: 0.13.0
- 0.13.0 source tar.gz -- signature -- sha512
- 0.13.0 Spark 3.2 runtime Jar
- 0.13.0 Spark 3.1 runtime Jar
- 0.13.0 Spark 3.0 runtime Jar
- 0.13.0 Spark 2.4 runtime Jar
- 0.13.0 Flink 1.14 runtime Jar
- 0.13.0 Flink 1.13 runtime Jar
- 0.13.0 Flink 1.12 runtime Jar
- 0.13.0 Hive runtime Jar
High-level features:
- Core
- Vendor Integrations
- Google Cloud Storage (GCS) FileIO is supported with optimized read and write using GCS streaming transfer [#3711]
- Aliyun Object Storage Service (OSS) FileIO is supported [#3553]
- Any S3-compatible storage (e.g. MinIO) can now be accessed through AWS S3FileIO with custom endpoint and credential configurations [#3656] [#3658]
- AWS S3FileIO now supports server-side checksum validation [#3813]
- AWS GlueCatalog now displays more table information including table location, description [#3467] and columns [#3888]
- Using multiple FileIOs based on file path scheme is supported by configuring a ResolvingFileIO [#3593]
- Spark
- Spark 3.2 is supported [#3335] with merge-on-read DELETE [#3970]
- RewriteDataFiles action now supports sort-based table optimization [#2829] and merge-on-read delete compaction [#3454]. The corresponding Spark call procedure rewrite_data_files is also supported [#3375]
- Time travel queries now use snapshot schema instead of the table's latest schema [#3722]
- Spark vectorized reads now support row-level deletes [#3557] [#3287]
- add_files procedure now skips duplicated files by default (can be turned off with the check_duplicate_files flag) [#2895], skips folder without file [#2895] and partitions with null values [#2895] instead of throwing exception, and supports partition pruning for faster table import [#3745]
- Flink
- Hive
- File Formats
Important bug fixes:
- Core
- Iceberg new data file root path is configured through write.data.path going forward. write.folder-storage.path and write.object-storage.path are deprecated [#3094]
- Catalog commit status is UNKNOWN instead of FAILURE when new metadata location cannot be found in snapshot history [#3717]
- Dropping table now also deletes old metadata files instead of leaving them stranded [#3622]
- history and snapshots metadata tables can now query tables with no current snapshot instead of returning empty [#3812]
- Vendor Integrations
- Spark
- For Spark >= 3.1, REFRESH TABLE can now be used with Spark session catalog instead of throwing exception [#3072]
- Insert overwrite mode now skips partition with 0 record instead of failing the write operation [#2895]
- Spark snapshot expiration action now supports custom FileIO instead of just HadoopFileIO [#3089]
- REPLACE TABLE AS SELECT can now work with tables with columns that have changed partition transform. Each old partition field of the same column is converted to a void transform with a different name [#3421]
- Spark SQL filters containing binary or fixed literals can now be pushed down instead of throwing exception [#3728]
- Flink
- A ValidationException will be thrown if a user configures both catalog-type and catalog-impl. Previously it chose to use catalog-type. The new behavior brings Flink consistent with Spark and Hive [#3308]
- Changelog tables can now be queried without RowData serialization issues [#3240]
- java.sql.Time data type can now be written without data overflow problem [#3740]
- Avro position delete files can now be read without encountering NullPointerException [#3540]
- Hive
- File Formats
Other notable changes:
- The community has finalized the long-term strategy of Spark, Flink and Hive support. See Multi-Engine Support page for more details.
0.12.1
Apache Iceberg 0.12.1 was released on November 8th, 2021.
- Git tag: 0.12.1
- 0.12.1 source tar.gz -- signature -- sha512
- 0.12.1 Spark 3.x runtime Jar
- 0.12.1 Spark 2.4 runtime Jar
- 0.12.1 Flink runtime Jar
- 0.12.1 Hive runtime Jar
Important bug fixes and changes:
- #3264 fixes validation failures that occurred after snapshot expiration when writing Flink CDC streams to Iceberg tables.
- #3264 fixes reading projected map columns from Parquet files written before Parquet 1.11.1.
- #3195 allows validating that commits that produce row-level deltas don't conflict with concurrently added files. This ensures users can maintain serializable isolation for update and delete operations, including merge operations.
- #3199 allows validating that commits that overwrite files don't conflict with concurrently added files. This ensures users can maintain serializable isolation for overwrite operations.
- #3135 fixes equality-deletes using DATE, TIMESTAMP, and TIME types.
- #3078 prevents the JDBC catalog from overwriting the jdbc.user property if any property called user exists in the environment.
- #3035 fixes drop namespace calls with the DynamoDB catalog.
- #3273 fixes importing Avro files via add_files by correctly setting the number of records.
- #3332 fixes importing ORC files with float or double columns in add_files.
A more exhaustive list of changes is available under the 0.12.1 release milestone.
0.12.0🔗
Apache Iceberg 0.12.0 was released on August 15, 2021. It consists of 395 commits authored by 74 contributors over a 139 day period.
- Git tag: 0.12.0
- 0.12.0 source tar.gz -- signature -- sha512
- 0.12.0 Spark 3.x runtime Jar
- 0.12.0 Spark 2.4 runtime Jar
- 0.12.0 Flink runtime Jar
- 0.12.0 Hive runtime Jar
High-level features:
- Core
- Allow Iceberg schemas to specify one or more columns as row identifiers [#2465]. Note that this is a prerequisite for supporting upserts in Flink.
- Added JDBC [#1870] and DynamoDB [#2688] catalog implementations.
- Added predicate pushdown for partitions and files metadata tables [#2358, #2926].
- Added a new, more flexible compaction action for Spark that can support different strategies such as bin packing and sorting [#2501, #2609].
- Added the ability to upgrade to v2 or create a v2 table using the table property format-version=2 [#2887].
- Added support for nulls in StructLike collections [#2929].
- Added key_metadata field to manifest lists for encryption [#2675].
- Flink
- Added support for SQL primary keys [#2410].
- Hive
- Added the ability to set the catalog at the table level in the Hive Metastore. This makes it possible to write queries that reference tables from multiple catalogs [#2129].
- As a result of [#2129], deprecated the configuration property iceberg.mr.catalog which was previously used to configure the Iceberg catalog in MapReduce and Hive [#2565].
- Added table-level JVM lock on commits [#2547].
- Added support for Hive's vectorized ORC reader [#2613].
- Spark
- Added SET and DROP IDENTIFIER FIELDS clauses to ALTER TABLE so people don't have to look up the DDL [#2560].
- Added support for ALTER TABLE REPLACE PARTITION FIELD DDL [#2365].
- Added support for micro-batch streaming reads for structured streaming in Spark3 [#2660].
- Improved the performance of importing a Hive table by not loading all partitions from Hive and instead pushing the partition filter to the Metastore [#2777].
- Added support for UPDATE statements in Spark [#2193, #2206].
- Added support for Spark 3.1 [#2512].
- Added RemoveReachableFiles action [#2415].
- Added add_files stored procedure [#2210].
- Refactored Actions API and added a new entry point.
- Added support for Hadoop configuration overrides [#2922].
- Added support for the TIMESTAMP WITHOUT TIMEZONE type in Spark [#2757].
- Added validation that files referenced by row-level deletes are not concurrently rewritten [#2308].
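As a rough illustration of what a bin-packing compaction strategy does, the sketch below groups small files into rewrite groups that stay under a target size. It uses a generic first-fit-decreasing heuristic and is not Iceberg's implementation; the function name `bin_pack`, the megabyte units, and the target size are assumptions chosen for the example.

```python
def bin_pack(file_sizes, target_size):
    """Group file sizes into bins whose totals do not exceed target_size.

    Generic first-fit-decreasing heuristic (illustrative only, not
    Iceberg's planner). A file larger than target_size gets its own bin.
    """
    bins = []  # each bin is a list of file sizes
    for size in sorted(file_sizes, reverse=True):
        for group in bins:
            # Place the file in the first bin that still has room.
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([size])
    return bins


# File sizes in MB (hypothetical), packed toward a 512 MB target.
groups = bin_pack([100, 400, 300, 200, 50], target_size=512)
print(groups)  # [[400, 100], [300, 200], [50]]
```

Each resulting group would be rewritten into one larger file. A sort-based strategy would instead reorder rows across files, which costs more but improves data clustering.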
Important bug fixes:
- Core
- Hive
- Enabled dropping HMS tables even if the metadata on disk gets corrupted [#2583].
- Parquet
- Fixed Parquet row group filters when types are promoted from int to long or from float to double [#2232]
- Spark
Other notable changes:
- The Iceberg Community voted to approve version 2 of the Apache Iceberg Format Specification. The differences between version 1 and 2 of the specification are documented here.
- Bugfixes and stability improvements for NessieCatalog.
- Improvements and fixes for Iceberg's Python library.
- Added a vectorized reader for Apache Arrow [#2286].
- The following Iceberg dependencies were upgraded:
0.11.1🔗
- Git tag: 0.11.1
- 0.11.1 source tar.gz -- signature -- sha512
- 0.11.1 Spark 3.0 runtime Jar
- 0.11.1 Spark 2.4 runtime Jar
- 0.11.1 Flink runtime Jar
- 0.11.1 Hive runtime Jar
Important bug fixes:
- #2367 prohibits deleting data files when tables are dropped if GC is disabled.
- #2196 fixes data loss after compaction when large files are split into multiple parts and only some parts are combined with other files.
- #2232 fixes row group filters with promoted types in Parquet.
- #2267 avoids listing non-Iceberg tables in Glue.
- #2254 fixes predicate pushdown for Date in Hive.
- #2126 fixes writing of Date, Decimal, Time, UUID types in Hive.
- #2241 fixes vectorized ORC reads with metadata columns in Spark.
- #2154 refreshes the relation cache in DELETE and MERGE operations in Spark.
0.11.0🔗
- Git tag: 0.11.0
- 0.11.0 source tar.gz -- signature -- sha512
- 0.11.0 Spark 3.0 runtime Jar
- 0.11.0 Spark 2.4 runtime Jar
- 0.11.0 Flink runtime Jar
- 0.11.0 Hive runtime Jar
High-level features:
- Core API now supports partition spec and sort order evolution
- Spark 3 now supports the following SQL extensions:
- MERGE INTO (experimental)
- DELETE FROM (experimental)
- ALTER TABLE ... ADD/DROP PARTITION
- ALTER TABLE ... WRITE ORDERED BY
- Invoke stored procedures using CALL
- Flink now supports streaming reads, CDC writes (experimental), and filter pushdown
- AWS module is added to support better integration with AWS, with AWS Glue catalog support and dedicated S3 FileIO implementation
- Nessie module is added to support integration with project Nessie
Important bug fixes:
- #1981 fixes a bug where date and timestamp transforms produced incorrect values for dates and times before 1970. Before the fix, negative values were incorrectly transformed by date and timestamp transforms to 1 larger than the correct value. For example, day(1969-12-31 10:00:00) produced 0 instead of -1. The fix is backwards compatible, which means predicate projection can still work with the incorrectly transformed partitions written using older versions.
- #2091 fixes a ClassCastException for type promotion from int to long and from float to double during Parquet vectorized reads. The Arrow vector is now created by looking at the Parquet file schema instead of the Iceberg schema for int and float fields.
- #1998 fixes a bug in HiveTableOperation where unlock was not called if new metadata could not be deleted. It is now guaranteed that unlock is always called for Hive catalog users.
- #1979 fixes a table listing failure in the Hadoop catalog when the user does not have permission to some tables. Tables with no permission are now ignored in listing.
- #1798 fixes a scan task failure when encountering duplicate entries of data files. Spark and Flink readers can now ignore duplicated entries in data files for each scan task.
- #1785 fixes invalidation of metadata tables in CachingCatalog. When a table is dropped, all the metadata tables associated with it are also invalidated in the cache.
- #1960 fixes a bug where the ORC writer did not read the metrics config and always used the default. Customized metrics config is now respected.
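The pre-1970 off-by-one described for #1981 is the kind of error that arises when a negative offset from the epoch is rounded toward zero instead of toward negative infinity. The sketch below reproduces the example from the note under that assumption. It is an illustration, not the code changed in #1981; the helper names are hypothetical and the timestamp is assumed to be in UTC.

```python
from datetime import datetime, timezone

MICROS_PER_DAY = 86_400 * 1_000_000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_since_epoch(ts):
    """Exact microseconds between the Unix epoch and ts (may be negative)."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def day_truncating(micros):
    """Days from epoch, rounding toward zero (wrong for negative values)."""
    days = abs(micros) // MICROS_PER_DAY
    return days if micros >= 0 else -days


def day_floor(micros):
    """Days from epoch, rounding toward negative infinity (correct)."""
    return micros // MICROS_PER_DAY


# The example from the release note, assumed to be a UTC timestamp.
ts = datetime(1969, 12, 31, 10, 0, 0, tzinfo=timezone.utc)
micros = micros_since_epoch(ts)

print(day_truncating(micros))  # 0, the incorrect pre-fix value
print(day_floor(micros))       # -1, the correct value
```

For timestamps at or after the epoch the two functions agree, which is why only data before 1970 was affected.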
Other notable changes:
- NaN counts are now supported in metadata
- Shared catalog properties are added in core library to standardize catalog level configurations
- Spark and Flink now support dynamically loading customized Catalog and FileIO implementations
- Spark 2 now supports loading tables from other catalogs, like Spark 3
- Spark 3 now supports catalog names in DataFrameReader when using Iceberg as a format
- Flink now uses the number of Iceberg read splits as its job parallelism to improve performance and save resources
- Hive (experimental) now supports INSERT INTO, case insensitive query, projection pushdown, create DDL with schema and auto type conversion
- ORC now supports reading tinyint, smallint, char, varchar types
- Avro to Iceberg schema conversion now preserves field docs
0.10.0🔗
- Git tag: 0.10.0
- 0.10.0 source tar.gz -- signature -- sha512
- 0.10.0 Spark 3.0 runtime Jar
- 0.10.0 Spark 2.4 runtime Jar
- 0.10.0 Flink runtime Jar
- 0.10.0 Hive runtime Jar
High-level features:
- Format v2 support for building row-level operations (MERGE INTO) in processing engines
- Note: format v2 is not yet finalized and does not have a forward-compatibility guarantee
- Flink integration for writing to Iceberg tables and reading from Iceberg tables (reading supports batch mode only)
- Hive integration for reading from Iceberg tables, with filter pushdown (experimental; configuration may change)
Important bug fixes:
- #1706 fixes non-vectorized ORC reads in Spark that incorrectly skipped rows
- #1536 fixes ORC conversion of notIn and notEqual to match null values
- #1722 fixes Expressions.notNull returning an isNull predicate; API only, method was not used by processing engines
- #1736 fixes IllegalArgumentException in vectorized Spark reads with negative decimal values
- #1666 fixes file lengths returned by the ORC writer, using compressed size rather than uncompressed size
- #1674 removes catalog expiration in HiveCatalogs
- #1545 automatically refreshes tables in Spark when not caching table instances
Other notable changes:
- The iceberg-hive module has been renamed to iceberg-hive-metastore to avoid confusion
- Spark 3 is based on 3.0.1, which includes the fix for SPARK-32168
- Hadoop tables will recover from version hint corruption
- Tables can be configured with a required sort order
- Data file locations can be customized with a dynamically loaded LocationProvider
- ORC file imports can apply a name mapping for stats
A more exhaustive list of changes is available under the 0.10.0 release milestone.
0.9.1🔗
- Git tag: 0.9.1
- 0.9.1 source tar.gz -- signature -- sha512
- 0.9.1 Spark 3.0 runtime Jar
- 0.9.1 Spark 2.4 runtime Jar
0.9.0🔗
- Git tag: 0.9.0
- 0.9.0 source tar.gz -- signature -- sha512
- 0.9.0 Spark 3.0 runtime Jar
- 0.9.0 Spark 2.4 runtime Jar
0.8.0🔗
- Git tag: apache-iceberg-0.8.0-incubating
- 0.8.0-incubating source tar.gz -- signature -- sha512
- 0.8.0-incubating Spark 2.4 runtime Jar