
Posts from 2026

Explore public datasets with Apache Iceberg & BigLake

Wednesday, January 14, 2026

[Illustration: a vintage-style drawing titled "THE PUBLIC DATASETS OF APACHE ICEBERG" showing a man in a boat named BigLake Explorer viewing a large iceberg.]

The promise of the Open Data Lakehouse is simple: your data should not be locked into a single engine. It should be accessible, interoperable, and built on open standards. Today, we are taking a major step forward in making that promise a reality for developers, data engineers, and researchers everywhere.

We are thrilled to announce the availability of high-quality Public Datasets served via the Apache Iceberg REST Catalog. Hosted on Google Cloud's BigLake, these datasets are available for read-only access to anyone with a Google Cloud account.

Whether you are using Apache Spark, Trino, Flink, or BigQuery, you can now connect to a live, production-grade Iceberg Catalog and start querying data immediately. No copying files, no managing storage buckets. Just configure your catalog and query.

How to Access Public Datasets

This initiative is designed to be engine-agnostic: we provide the storage and the catalog, and you bring the compute. This allows you to benchmark different engines, test new Iceberg features, or simply explore interesting data without setting up infrastructure or finding data to ingest.

How to Connect with Apache Spark

You can connect to the public dataset using any standard Spark environment (local, Google Cloud Dataproc, or other vendors). You only need to point your Iceberg catalog configuration to our public REST endpoint.

Prerequisites:

  • A Google Cloud Project (for authentication).
  • Standard Google Application Default Credentials (ADC) set up in your environment.

Spark Configuration:

Use the following configuration flags when starting your Spark Shell or SQL session. This configures a catalog named bqms (BigQuery Metastore) pointing to our public REST endpoint.

PROJECT_ID=<YOUR_PROJECT_ID>

  spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0,org.apache.iceberg:iceberg-gcp-bundle:1.10.0 \
    --conf spark.hadoop.hive.cli.print.header=true \
    --conf spark.sql.catalog.bqms=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.bqms.type=rest \
    --conf spark.sql.catalog.bqms.uri=https://biglake.googleapis.com/iceberg/v1/restcatalog \
    --conf spark.sql.catalog.bqms.warehouse=gs://biglake-public-nyc-taxi-iceberg \
    --conf spark.sql.catalog.bqms.header.x-goog-user-project=$PROJECT_ID \
    --conf spark.sql.catalog.bqms.rest.auth.type=google \
    --conf spark.sql.catalog.bqms.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO \
    --conf spark.sql.catalog.bqms.header.X-Iceberg-Access-Delegation=vended-credentials \
    --conf spark.sql.defaultCatalog=bqms

Note: Replace <YOUR_PROJECT_ID> with your actual Google Cloud Project ID. This is required for the REST Catalog to authenticate your quota usage, even for free public access.
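
Once the shell starts, a quick way to confirm that the catalog is wired up correctly is to list its namespaces and tables with standard Spark SQL commands (public_data is the namespace used throughout this post):

-- Sanity-check the connection to the REST catalog
SHOW NAMESPACES;
SHOW TABLES IN public_data;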

Exploring the Data: Sample Queries

Once connected, you have full SQL access to the datasets. We are launching with the classic NYC Taxi dataset, modeled as an Iceberg table to showcase partitioning and metadata capabilities.
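
Before aggregating anything, it can help to inspect the table itself. The sketch below uses standard Spark SQL plus Iceberg's partitions metadata table; the partition spec you see in the output is whatever the table currently defines:

-- Show columns, partitioning, and table properties
DESCRIBE TABLE EXTENDED bqms.public_data.nyc_taxicab;

-- Iceberg metadata table: one row per partition, with record and file counts
SELECT partition, record_count, file_count
FROM bqms.public_data.nyc_taxicab.partitions
ORDER BY record_count DESC;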

1. The "Hello World" of Analytics

This query aggregates millions of records to find the average fare and trip distance by passenger count. It demonstrates how Iceberg efficiently scans data files without needing to list directories.

SELECT 
    passenger_count,
    COUNT(1) AS num_trips,
    ROUND(AVG(total_amount), 2) AS avg_fare,
    ROUND(AVG(trip_distance), 2) AS avg_distance
FROM 
    bqms.public_data.nyc_taxicab
WHERE 
    data_file_year = 2021
    AND passenger_count > 0
GROUP BY 
    passenger_count
ORDER BY 
    num_trips DESC;

What this demonstrates:

  • Partition Pruning: The query filters on data_file_year, allowing the engine to skip scanning data from other years entirely (you can verify this with EXPLAIN, as shown after this list).
  • Vectorized Reads: Engines like Spark can process the Parquet files efficiently in batches.
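
To verify the pruning claim yourself, ask Spark for the query plan; in the Iceberg scan node you should see the data_file_year predicate pushed down, so files from other years are never read:

EXPLAIN
SELECT COUNT(1)
FROM bqms.public_data.nyc_taxicab
WHERE data_file_year = 2021;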

2. Time Travel: Auditing Data History

One of Iceberg's most powerful features is Time Travel. You can query the table as it existed at a specific point in the past.

-- Compare the row count of the current version vs. a specific snapshot
SELECT 
    'Current State' AS version, 
    COUNT(*) AS count 
FROM bqms.public_data.nyc_taxicab
UNION ALL
SELECT 
    'Past State' AS version, 
    COUNT(*) AS count 
FROM bqms.public_data.nyc_taxicab VERSION AS OF 2943559336503196801;

Description:

This query allows you to audit changes. By querying the history metadata table (e.g., SELECT * FROM bqms.public_data.nyc_taxicab.history), you can find snapshot IDs and "travel back" to see how the dataset grew over time.
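
A minimal sketch of that workflow, using Iceberg's standard history metadata table (column names per the Iceberg spec):

-- List the table's snapshot lineage, oldest first
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM bqms.public_data.nyc_taxicab.history
ORDER BY made_current_at;

Pick any snapshot_id from the output and plug it into a VERSION AS OF clause; Spark also accepts timestamp-based travel via TIMESTAMP AS OF if you prefer to query by time rather than by snapshot.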

Coming Soon: An Iceberg V3 Playground

We are not just hosting static data; we are building a playground for the future of Apache Iceberg. We plan to release new datasets specifically designed to help you test Iceberg V3 Spec features.

Start Building Today

The goal of these public datasets is to lower the barrier to entry. You don't need to manage infrastructure to learn Iceberg; you just need to connect. Whether you are a data analyst, data scientist, data engineer, or data enthusiast, today you can:

  • Use BigQuery (via BigLake) to query these tables directly using SQL, combining them with your private data (a sketch of this pattern appears after this list).
  • Test your OSS engine (e.g. Spark, Trino, Flink etc.) configurations against a live REST Catalog.
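
As an illustration of the combine-with-your-own-data pattern, here is a minimal Spark SQL sketch. The table my_catalog.ops.trips is hypothetical; substitute a private table of yours with compatible columns:

-- my_catalog.ops.trips is a hypothetical private table; adjust names to your setup
SELECT 'public' AS source,
       COUNT(*) AS num_trips,
       ROUND(AVG(total_amount), 2) AS avg_fare
FROM bqms.public_data.nyc_taxicab
WHERE data_file_year = 2021
UNION ALL
SELECT 'private',
       COUNT(*),
       ROUND(AVG(total_amount), 2)
FROM my_catalog.ops.trips;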

Start building an open, managed, high-performance Iceberg lakehouse for advanced analytics and data science with BigLake (https://cloud.google.com/biglake) today!

Happy Querying!

This Week in Open Source #12

Friday, January 9, 2026


A look around the world of open source

Here we are at the beginning of a new year. What will it bring to the open source world? What new projects will be started? What should we be focusing on? What is your open source resolution for 2026? One of ours is to better connect with various open source communities on social media. We've gotten off to a big start by launching an official Google Open Source account on Bluesky. Already, we are enjoying the community there.

Upcoming Events

  • January 21 - 23: Everything Open 2026 is happening in Canberra, Australia. Everything Open is a conference focused on open technologies, including Linux, open source software, open hardware and open data, and the communities that surround them. The conference provides technical deep-dives as well as updates from industry leaders and experts on a wide array of topics from these areas.
  • January 29: CHAOSScon Europe 2026 is co-located with FOSDEM in Brussels, Belgium. This conference revolves around discussing open source project health, CHAOSS updates, use cases, and hands-on workshops for developers, community managers, project managers, and anyone interested in measuring open source project health. It also shares insights from the CHAOSS context working groups including OSPOs, University Open Source, and Open Source in Science and Research.
  • January 31 - February 1: FOSDEM 2026 is happening at the Université Libre de Bruxelles in Brussels, Belgium. It is a free event for software developers to meet, share ideas and collaborate. Every year, thousands of developers of free and open source software from all over the world gather at the event in Brussels.
  • February 24 - 25: The Linux Foundation Member Summit is happening in Napa, California. It is the annual gathering for Linux Foundation members that fosters collaboration, innovation, and partnerships among the leading projects and organizations working to drive digital transformation with open source technologies.

Open Source Reads and Links

  • [Talk] State of the Source at ATO 2025: State of the "Open" AI - At the end of last year, the Open Source Initiative published a summary of Gabriel Toscano's talk at All Things Open. In the talk, he discusses how AI models call themselves "open" while often lacking the legal or technical freedoms that true open source requires. An analysis of ~20,000 Hugging Face models found that Apache 2.0 and MIT are common, but many models have no license or use restrictive custom terms. The study warns that inconsistent labeling and mutable restrictions muddy openness, and it urges clearer licensing and platform checks.
  • [Article] The Reality of Open Source: More Puppies, Less Beer - Bitnami's removal of popular containers last year shows that open source can suddenly change and disrupt users. Organizations must evaluate who funds and maintains each open source component, not just the code. Plan for business continuity, supply-chain visibility, and the ability to fork or replace critical components.
  • [Blog] The Open Source Community and U.S. Public Policy - The Open Source Initiative is increasing its U.S. policy work to ensure open source developers are part of technology and AI rulemaking. Since policymakers often lack deep knowledge of open source, the community must explain how shared code differs from deployed systems. Joining groups like the Open Policy Alliance helps nonprofits engage and influence policy.
  • [Article] Pebble, the e-ink smartwatch that refuses to die, just went fully open source - Pebble, the e-ink smartwatch with a tumultuous history, is making a move sure to please the DIY enthusiasts that make up the bulk of its fans: Its entire software stack is now fully open source, and key hardware design files are available too.
  • [Article] Forget Predictions: Tech Leaders' Actual 2026 Resolutions - We want to know your open source resolutions and perhaps these resolutions from some tech leaders (open source and otherwise) can point you in a direction. Their plans run the gamut of securing and managing AI responsibly, reducing noise in security data, and creating healthier tech habits. The common theme is intentional, measurable change over speculation.
  • [Paper] Everything is Context: Agentic File System Abstraction for Context Engineering - GenAI systems may produce inaccurate or misleading outputs due to limited contextual awareness and evolving data sources. Mechanisms are therefore needed to govern how persistent knowledge transitions into bounded context in a traceable, verifiable, and human-aware manner, ensuring that human judgment and knowledge are embedded within the system's evolving context for reasoning and evaluation.

    The paper proposes using a file-system abstraction based on the open-source AIGNE framework to manage all types of context for generative AI agents. This unified infrastructure makes context persistent, traceable, and governed so agents can read, write, and version memory, tools, and human input.

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account or our new @opensource.google Bluesky account.
