Interesting links - June 2025

🔥 Not got time for all this? I’ve marked my top reads of the month :)
📧 Want to receive this monthly round-up as an email? Subscribe to my Substack where I cross-post the same content
🔗 Medium posts often skulk behind a gate, so I’ve hyperlinked to the Freedium version. You’ll see [Medium ↗] next to each link if you prefer the original.

Open Table Formats / Data Lakehouses

🔥 Instead of enthusiastically hopping on the Iceberg bandwagon with both webbed feet, DuckDB Labs have been quietly building their own format. DuckLake was announced at the beginning of the month, and is a replacement for both an OTF such as Iceberg and the metadata catalog that an OTF user will invariably need to wire up too. I had a quick poke around it, and Tobias Müller and Thomas F McGeehan V both went into it in more detail, along with MotherDuck who’ll unsurprisingly be offering it as a managed service.
The catalog space is a vital one for OTFs, and there are various projects springing up alongside the more established Unity and Polaris catalogs. Some do slightly different things (such as table management, e.g. compaction) or have a broader ambition (cataloging all your data, not just OTFs):
- apache/amoro
- nimtable (from RisingWave)
- apache/gravitino
I’ve written about writing to Iceberg from Kafka with Flink before, but shipped a new blog this week that throws AWS Glue Data Catalog into the mix.
Details of the latest Iceberg version in Iceberg v3: Moving the Ecosystem Towards Unification
In a blog post that makes me nostalgic for my Oracle days, Yuval Yogev writes about statistics in Iceberg
Some useful points to think about from Jacek Migdal covering where Iceberg might not be the right fit for your requirements
Vu Trinh has a nice summary of how Meta modernised their lakehouse, based on a paper from 2023
Apache Hudi is still kicking around and in use; this blog post from Shiyan Xu details file pruning with the multi-modal index.

Kafka and Event Streaming

🔥 LinkedIn, the birthplace of Kafka, have published details of their Kafka replacement, Northguard and Xinfra.
Chris Riccomini has a hot take on Kafka: The End of the Beginning
Responsive’s Almog Gavra has written about a new serialisation format with performance benefits in data streaming called Imprint
If you run Apache Kafka, this list of CVEs that the project publishes is worth keeping an eye on. There were three CVEs published earlier this month affecting various versions, including releases up to 3.9.0.
Kloia explains how they solved Kafka event sequencing in their online grocery application
ShareChat migrated away from a multi-AZ deployment of Apache Kafka to Warpstream, and explains why in this post.
Agoda describes how they handle Kafka consumer failover across data centers
Expedia Group Tech shares details of their real-time A/B test monitoring system, built on Apache Pinot and Kafka.

Stream Processing

Grab have built a bespoke Flink SQL platform that they describe here.
Running Flink SQL once you get out of your development environment can be tricky. The excellently-named DataSQRL have published a companion to the Flink Kubernetes operator to run SQL jobs.
Luthra Sahil has some practical advice for exception handling in Kafka Streams
Riskified have written about their migration from ksqlDB to Flink
The team at EloElo have been busy, writing a two-part series on running Apache Flink in EKS, as well as details of their batch & realtime data platform.
🔥 Excellent example (as always) from Simon Aubury of using Complex Event Processing (CEP) in Flink with ADS-B aviation data
Details of how Swiggy built their streaming data platform on Flink, and use it for real-time business monitoring.
Trade Republic shares how they built a system to calculate bond yield to maturity in real-time using Kafka and Redis.

AI

🔥 I’m so bored of AI already, even though it is revolutionising how we do things. That said, I did enjoy reading this article: My AI Skeptic Friends Are All Nuts
An example from the Debezium folks showing how Debezium, Milvus, and Ollama can work together to form a Retrieval-Augmented Generation (RAG) pipeline
Pinterest Engineering explains how they use Ray in their ML Infrastructure

CDC / Debezium

🔥 Backfilling Postgres TOAST Columns in Debezium Data Change Events
Some good detail from Mohammad Mahdi Azadjalal on setting up CDC with PostgreSQL, Debezium, and Kafka Connect
snyk/skemium: Generate and Compare Debezium CDC (Change Data Capture) Avro Schema, directly from your Database.
Kleinanzeigen writes about their approach to a zero-downtime user migration using Debezium and Kafka.

Data Platforms & Architecture

Benn Stancil asks the important question: which way from here for the analytics and data industry?
Netflix Engineering introduces their Unified Data Architecture (UDA) with the principle of Model Once, Represent Everywhere
I’d be fascinated to know what proportion of companies are still running Hadoop. Pinterest, for one, are, and they’ve written about the automated migration and scaling of their Hadoop clusters.
Atlassian shares how they are enhancing resiliency in their OpenSearch clusters
Instacart tech talks about building their modern search infrastructure on Postgres
Grab Engineering on how they rewrote their Counter Service in Rust
The tech team at Just Eat on How Data Products Become the Promethean Fire
Uber are prolific in their technical blog posts, and have three for us this month:
- Their migration from Hive to Spark SQL for ETL workloads
- The evolution of their search platform
- Their approach to Config-Driven Data Pipelines

Database Engines & Performance

A look at how OpenAI are scaling PostgreSQL
I love this kind of well-written, deep-dive blog post from TigerBeetle about a bug that Jepsen found.
A deep dive from LanceDB on Columnar File Readers in Depth: Repetition & Definition Levels
Two interesting DuckDB extensions, providing real-time support and support for Apache Arrow
🔥 Chris Riccomini started a good thread on Bluesky about graph databases
LiveStore is a state management framework and local-first data layer for high-performance apps, based on SQLite and event-sourcing.
Jan Nidzwetzki writes about The Art of SQL Query Optimization
InfluxDB’s Andrew Lamb did a talk on accelerating Apache Parquet with metadata stores and specialized indexes using Apache DataFusion
Part of Arrow, Flight SQL is a protocol for interacting with SQL databases using the Arrow in-memory format and the Flight RPC framework. Porter implements Flight SQL to provide a server on top of DuckDB or ClickHouse.

And finally…

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

Human APIs (a.k.a The Soft Squishy Stuff)

If you like this kind of link, you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)
