A bit of a streamlined edition, this month. Lots of interesting links still, but less commentary. You can put that down to me procrastinating on getting my previous blog about Materialized Tables in Apache Flink finished, and leaving myself little time to work on this one :) Not including the detailed narration actually knocks a bunch of time off the preparation, so I’d be interested in your feedback as to how much (if at all) the absence of narration impacts your enjoyment of reading it. Let me know in the comments below!
Something that I’m slowly changing is how I categorise links to do with AI. A few months back anything "AI" got its own section. It wasn’t much more than a novelty really; certainly not something worth distracting the regular link sections with. But now AI is just part and parcel of many people’s workflows, a regular component in their toolbox. So where an article is about credibly using AI as part of an existing topic (such as data engineering), I’ll file it in that section. (And if this news makes you cross because you abhor anything AI, well, I’ve got news for you.)
Current London 2026 - wanna free ticket? 🎟️
If you’re in the UK and interested in Kafka, Flink, Iceberg, etc etc (which, since you’re reading this blog post, I assume you at least have a passing interest in) then you might be interested in Current London in May—and I have a free ticket code for you to use! Register with code L-CMP-LDNKafka and it’s all yours :)
Analytics
- Ben Sykes - Interval-Aware Caching for Druid at Netflix Scale
- Dorothée Clerc - How BlaBlaCar PMs use AI to self-serve data
- DuckDB 1.5.2 has been released, with support for DuckLake 1.0, even better Iceberg support, and fixes as a result of initial Jepsen testing.
- Randy Au - Dashboard rot as org attention grave markers
- Ahmed Youssef - Nobody Is Making Decisions With Your Dashboards
- 🔥 Torsten Grust has published a course about the Design and Implementation of DuckDB Internals
- Hamel Husain - The Revenge of the Data Scientist
Data Platforms, Architectures, and Modelling
- Antonia Badarau and team at Monzo - A “meshy” approach to Data: Enabling 100+ teams to build Data Models
- Justina Bartulevičienė & Benediktas Kazanavičius (Vinted) - Serving Personalised Search Autocomplete
- Rishabh Kumar (Airbnb) - Building a fault-tolerant metrics storage system at Airbnb
- Matt Lawhon and team at Pinterest - Scaling Recommendation Systems with Request-Level Deduplication
- Facundo Agriel (Dropbox) - Improving storage efficiency in Magic Pocket, our immutable blob store
- 🔥 A couple of interesting posts from the teams at Notion: Enabling Multi-Region Data Systems, and Two years of vector search: 10x scale, 1/10th cost
- Nikola Ilic - Data Modeling for Analytics Engineers: The Complete Primer
- Chris Gambill - The Medallion Masterclass: Why Knowing the Colors Isn’t Enough
- Joe Reis - Why Time Matters in Data Modeling
Data Engineering, Pipelines, and CDC
- Alexander Goida - Three Kafka S3 Sink Settings for Easier File Processing
- Chris Gambill - AI Agents are Failing Your Data Engineers
- Sugat Mahanti (Zapier) - Lessons from using the outbox pattern at scale
- Couple of good posts from Chris Hillman - Your Data Platform Costs More Than It Should, and Why Your Pipeline Finishes Later Every Month
- Jin-won Park (Karrot) - In the AI era where everyone handles data, how has the data team changed over the past year? (original Korean)
- Tristan Handy (dbt Labs) - Five things I believe about the future of analytics
- Igor Shurmin (Riskified Tech) - Data Exploration for Software Engineers: Evaluating and Integrating External Datasets
- Aleksandr Klein (Just Eat Takeaway) - Daedalus and the Data Labyrinth
- 🔥 An excellent deep-dive from George Zefkilis, looking at PostgreSQL WAL Internals in the context of building a CDC pipeline.
- Debezium 3.6.0.Alpha1 and Debezium 3.5.0.Final have been released.
- Yaroslav Tkachenko analysed the performance of different technologies for getting data from Postgres into Iceberg.
- Leonard Xu looks at good practices when building Large-Scale Lake Ingestion with Flink CDC and Paimon
- Real-world details from Nathan Smit of how they’ve been using Debezium with Oracle for four years, and how they addressed issues with Oracle CDC Replication Lag.
- 🔥 Yanquan Lv published the announcement of the release of Apache Flink CDC 3.6.0 as well as an excellent Deep Dive into Apache Flink CDC 3.6.0
- Jason Ganz & Benoit Perigaud (dbt Labs) - Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update
Kafka and Event Streaming
- Zapier - Reducing Kafka connections by 10x with a sidecar pattern
- Yunhong Zheng - How Apache Fluss Achieves True Pruning in Streaming Storage
- Bibek Maharjan - AI-Driven Autonomous Optimization of Apache Kafka on AWS MSK for High-Volume Financial Systems
- Piotr Minkowski - Deep Dive into Kafka Offset Commit with Spring Boot
- StrimziCon 2026 is on 3rd June, and the schedule has been published.
Flink
- 🔥 Robin Moffatt (that’s me!) - Materialized Tables in Apache Flink
- Yaroslav Tkachenko - Apache Flink: Reading and Modifying Kafka Consumer Offsets Using the State Processor API
- Lee Seung-min / Choi Won-yong - Extending Real-time Ad Frequency Capping Aggregation to One Week with Apache Flink + RocksDB Tuning (original)
- Katya Gorshkova - Hands-On with Flink — Part 5: Managing State (previously: 1, 2, 3, 4)
- Viktor Gamov digs out the open source toolbox and uses Kafka, Flink, Iceberg, Superset and more in Building a Streaming Lakehouse.
Open Table Formats (OTF), Catalogs, Lakehouses etc.
- Gunnar Morling’s Hardwood project has had its second beta release, which includes a very cool TUI for working with Parquet files.
- Laurent Saint-Félix has written aq - "query and transform Parquet, Arrow IPC, CSV, and NDJSON files using jq-style expressions."
- Yusuf Gözübüyük (TOM Tech) - The Performance Improvement Journey in Apache Iceberg Tables
- Ved Prakash - Deep Dive into Apache Iceberg Architecture
- 🔥 CMU-DB tech talk - Kurt Westerfeld & Mark Cusack - Floe: A SQL Compute Service for the Data Lakehouse
- Apache Iceberg has moved to "adopt" on the latest Technology Radar from Thoughtworks
- Qiegang Long - Preliminary Notes on Open-Source Variant Performance
- Steve Loughran - Benchmarking Parquet Variants through Iceberg
- Anahita Singla (Picnic) - Leveraging contextual data in real-time analytics with Apache Iceberg
- DuckLake version 1.0 has been released, and thus is now deemed production-ready. AFAIK it’s only got real support within DuckDB, but do let me know if you see it supported elsewhere. Thoughtworks have marked it as "assess" on their Tech Radar.
- A nice hands-on guide for setting up a local playground with Iceberg using Minio and Gravitino
- Pedro Holanda describes how DuckLake deals with the small-files problem (often encountered when one starts streaming data to these types of table format). Using Data Inlining in DuckLake, they saw vast performance improvements over the same kind of processing done with Iceberg.
RDBMS
- 🔥 Ohad Ravid - The Best (Query) Plans of Mice and Men
- Radim Marek - PostgreSQL MVCC, Byte by Byte
- Simeon Griggs - Keeping a Postgres queue healthy
- Thomas Kejser - Joins are NOT Expensive!
- Mike Freedman - Introducing TigerFS - a filesystem backed by PostgreSQL, and a filesystem interface to PostgreSQL (Renato Losio wrote an InfoQ article about it)
- Nikita Volkov - My 14-Year Journey Away from ORMs
General Data Stuff
- Almog Gavra - The Broken Economics of Databases
- Kirill Bobrov - The Power of Data Sketches: A Comprehensive Guide
- 🔥 Gergely Orosz (a.k.a. The Pragmatic Engineer) interviews Martin Kleppmann about the second edition of Designing Data-intensive Applications.
- Animesh Kumar - AI-Ready Data vs. Analytics-Ready Data
- I’m slightly fascinated by the idea of ggsql, which brings SQL to the world of ggplot2 and the Grammar of Graphics.
- Akshat Vig & Andrew Davidson (MongoDB) - Open Source, Community, and Consequence: The Story of MongoDB (InfoQ London 2026)
AI
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
- 🔥 Joe Reis - Why Electricity (Not Dot-Com) Is the Right AI Analogy. I like this idea from Joe. It also makes me think of the lift-and-shift that folk did with on-premises workloads to VMs in the Cloud, instead of re-architecting properly.
- Jason Ganz - A Dispatch from the Jagged Frontier of Analytics Engineering (referencing Ethan Mollick’s jagged frontier article from 2023).
- Industry legends Mark Russinovich and Scott Hanselman wrote this opinion piece for ACM: Redefining the Software Engineering Profession for AI.
  Without the hiring of early-in-career developers, the profession’s talent pipeline will collapse, and organizations will face a future without the next generation of experienced engineers.
- 🔥 Elena Verna - Confessions of a Millennial in Tech
- 🎥 Vik Gamov - If Memento was about AI Agents. I watched Memento in preparation for this…I still have no idea what was going on in either 😆
- Addy Osmani - Agent Harness Engineering
- 🔥 Hamel Husain - LLM Evals: Everything You Need to Know
- Robin Moffatt - Kicking the Tyres on Harbor for Agent Evals
- 🔥 Bryan Cantrill - The peril of laziness lost
- Adam Jacob - Laziness, Impatience, and Hubris
- Alex Woods - Don’t Let AI Write For You. (Reminder: I disclose my use of AI, and it’s NEVER for writing!)
And finally…
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
- Mitchell Hashimoto - Ghostty Is Leaving GitHub
Tool
- I love the agility with which one can collaboratively work in GDocs, but I also prefer working with plain text and Markdown (or even better, Asciidoc). mist brings the concept of GDocs collaboration to Markdown files. It’s pretty neat, and it’s now open source.
- A useful reminder from Christian Hofstede-Kuhn of Shell Tricks That Actually Make Life Easier (And Save Your Sanity)
Watch/Listen
- 🔥 A very cool example from the demo-scene: Razor1911
- The Internet Archive isn’t just about finding webpages that have gone offline; it also hosts tons of media, like this recording of Nirvana Live at Dreamerz 1989-07-08
- I love this idea: TrainJazz: Every train, a note.
Nerd
- 😸 Not all specification drafts published are serious. Meow.
- The ways in which one can play Doom continue to increase, with DOOM, played over cURL, and Can it Resolve DOOM? Game Engine in 2,000 DNS Records
- HackerNews members share their memories of What was it like in the era of BBS before the internet?
  My own memories are around Acorn-based BBSes. My favourite was Arcade BBS. Ah, memories. Fidonet, filebases…good times :)
- What’s more important than the code that ~~you're writing~~ Claude’s writing for you? Getting it in the right font of course! Shave many a yak and waste plenty of time at Codingfont picking just the right font…