UGPL.net/blog
Background

How does DuckLake compare to Databricks, Iceberg, BigQuery and Snowflake?

Found on thegerister.com.

DuckDB has proposed DuckLake as a standard for lakehouse metadata and catalogs. The claim is that it simplifies lakehouses by keeping all metadata in a standard SQL database instead of complex file-based systems, while still storing the data itself in open formats like Parquet. It remains to be seen whether this makes lakehouses more reliable, faster, and easier to manage.
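To make the idea of "all metadata in a standard SQL database" more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`catalog_table`, `data_file`), the helper functions and the file paths are hypothetical and chosen only for illustration; this is not DuckLake's actual schema. It only shows that table names and the list of data files per table can live in ordinary SQL tables and be updated in one transaction.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only (NOT DuckLake's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE catalog_table (
    table_id   INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL UNIQUE
);
CREATE TABLE data_file (
    file_id   INTEGER PRIMARY KEY,
    table_id  INTEGER NOT NULL REFERENCES catalog_table(table_id),
    path      TEXT NOT NULL,
    row_count INTEGER NOT NULL
);
""")


def register_files(conn, table_name, files):
    """Register a table and its data files in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR IGNORE INTO catalog_table (table_name) VALUES (?)",
            (table_name,),
        )
        (table_id,) = conn.execute(
            "SELECT table_id FROM catalog_table WHERE table_name = ?",
            (table_name,),
        ).fetchone()
        conn.executemany(
            "INSERT INTO data_file (table_id, path, row_count) VALUES (?, ?, ?)",
            [(table_id, path, rows) for path, rows in files],
        )


def files_for_table(conn, table_name):
    """Answer 'which files make up this table?' with a plain SQL query."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM data_file AS f
        JOIN catalog_table AS t ON t.table_id = f.table_id
        WHERE t.table_name = ?
        ORDER BY f.file_id
        """,
        (table_name,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up example paths; the Parquet files themselves would live in object storage.
register_files(
    conn,
    "orders",
    [("s3://bucket/orders/part-0.parquet", 1000),
     ("s3://bucket/orders/part-1.parquet", 500)],
)
print(files_for_table(conn, "orders"))
```

In this sketch the catalog entry and the file list are written together or not at all, which is the kind of guarantee a SQL database provides out of the box.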

The corresponding blog post is worth reading because it gives a systematic view of the high-level data architectures of DuckLake, Databricks, Iceberg, BigQuery and Snowflake.

Innovative data systems like BigQuery and Snowflake have shown that separating storage and compute is a great idea at a time when storage is a virtualized commodity. That way, storage and compute can scale independently, and we don't have to buy expensive database machines just to store tables we will never read.

Let us start with the main differences between single data files, collections of data files in a Lakehouse, and database-like systems:

  1. A single data file is usually a file with structured data that can be read via libraries and modules in a programming language or opened with some standard software. The problem is that each file has to be handled individually, and there are many different file formats around, each with its own libraries, modules, APIs etc.
  2. From the user's perspective, the main idea behind Lakehouses is to bring all the different types of data (files) together and use them as if they were all in one place and all usable in the same (standard) way. This is mainly done with technical metadata stored along with the data files. This technical metadata is then also used to optimize data retrieval.
  3. If you add table catalog information to this Lakehouse idea, you basically get a complete database-like data store.
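As a rough illustration of point 2, the sketch below shows one common kind of technical metadata: minimum and maximum values per data file. The class `FileStats`, the function `prune_files` and the file names are hypothetical and only serve as an example; the point is that a reader can skip files that cannot contain matching rows without opening them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: value range of one column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Return paths of files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Made-up statistics for three data files of the same table.
files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A query for values between 150 and 220 only needs two of the three files.
print(prune_files(files, 150, 220))  # ['part-1.parquet', 'part-2.parquet']
```

Whether such statistics sit in metadata files next to the data or in a SQL database does not change the pruning logic; it only changes where the reader looks them up.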

Based on these definitions, we can now answer the question of how the DuckLake data architecture compares to Databricks, Iceberg, BigQuery and Snowflake:

| Data architectural layer | Databricks | Iceberg | DuckLake | BigQuery | Snowflake |
|---|---|---|---|---|---|
| Catalog | Unity Catalog in Databricks, Hive catalog in Delta Lake | Iceberg catalog | Schema in SQL database for both catalog and table metadata | Spanner | FoundationDB |
| Metadata layer | Delta Lake metadata files | Iceberg metadata files (manifest list, manifest files) | Schema in SQL database for both catalog and table metadata | Spanner | FoundationDB |
| Data layer | Data files (Parquet etc.) | Data files (Parquet etc.) | Data files (Parquet etc.) | Data files | Data files |

Please refer also to the following articles for more background: