Phew, what a month! February may be shorter, but that hasn’t diminished the wealth of truly interesting posts I’ve found to share with you.
As this compendium grows in popularity, I’m keen to hear from you 🫵🏻 with your thoughts on:
-
What do you want more of?
-
Less of? (unless it’s AI; that stuff is staying whether we like it or not…)
-
Fewer links?
-
More commentary?
Use the comment section below, or feel free to email me directly.
As is usual (and my right as the author 😁) I’ll share my own articles first.
It’s a good mix this month: some AI in practice, some AI contemplative thoughts, some real data engineering (that even r/dataengineering liked!):
I was also honoured to be a guest on Dan Beach’s podcast: The Evolution of Software, Streaming, and Data Engineering with Robin Moffatt, and interviewed for Cynthia Dunlop and Piotr Sarna’s blog: Robin Moffatt on Technical Blogging
D’ya like DAGs? (or data engineering in general?) 🔗

Confluent are putting on a one-day conference for data engineers and architects in San Francisco next month (March!)
-
Speakers to include Joe Reis, Max Beauchemin, and Holden Karau
-
No vendors, no salesfolk
-
Completely free
-
March 26, in SF
-
Limited to 100 attendees; for full details and to register for free head here

So…on with the interesting links!
Kafka and Event Streaming 🔗
-
🔥 Andrew Schofield has an excellent article showing how Queues for Kafka (added in KIP-932) is now ready for prime time (docs), and Rion Williams has also written a good article: Surviving the Streaming Dungeon with Kafka Queues
-
Armin Aminian at Trivago has details of some neat Kubernetes tricks with Kafka: From Always-On to On-Demand: Scaling Kafka Sinks with KEDA
-
Uber have open-sourced (Apache 2.0) their uForwarder consumer proxy for Kafka. There are details in their blog and this InfoQ summary.
-
Stanislav Kozlovski is back with a deep-dive of a couple of KIPs: How KIP-881 and KIP-392 Reduce Inter-AZ Networking Costs
-
Manu Cupcic and Yusuf Birader from WarpStream cover some cool design details of how they’ve achieved lower latency in WarpStream.
-
Katya Gorshkova has published a book with Manning: Kafka for Architects. You can get 50% off using code PBGORSHKOVA50RE.
-
A postmortem from the folks at Honeycomb about their Kafka Outage in December 2025. Fascinating reading.
-
Neil Buesing has written a new tool for monitoring Kafka consumer group offsets: koffset
-
Andrew Morris has published a set of Kafka agent skills.
Stream Processing 🔗
-
🔥 Rion Williams has some good advice on Enrichment Strategies for Apache Flink
-
🔥 I always find Vu Trinh’s articles to be so well explained, and this one about Spark 4.1’s Real-Time Mode is no exception. If you take the subheading literally, Spark’s real-time mode "brings millisecond latency to structured streaming…it changes everything".
-
🔥 Vik Gamov has published an excellent interactive guide to some of the key concepts in Apache Flink: selectstar.stream
-
Nikhil Gupta at EloElo has written up details of how they take a configuration-driven approach to using Flink in their Real-Time Data Streaming Platform.
-
Flink Agents are part of the Apache Flink project, providing an Agentic AI framework. Tomasz Krol has put together a nice example of them in use: Building Real-Time AI Agent with Apache Flink
-
A couple of posts from Jayanth Reddy looking at doing Iceberg Table Maintenance in Flink, as well as hands-on exploration of Flink and Iceberg Catalogs.
-
A paper from Polytechnic University of Marche about using Flink and Kafka in a Context-Aware Knowledge Graph Platform for Stream Processing in Industrial IoT.
-
A somewhat sad thread on Reddit reflecting on the absence of a first-rate, first-class stream processing library for Python.
-
Is this too meta? A monthly roundup linking to another monthly roundup? Either way, it’s great to see David Radley in the Flink community resurrecting the long-lapsed Community Update.
-
AWS’s Qingyuan Tang gets into some good details in this post about Optimizing Flink’s Join Operations on Amazon EMR with Alluxio
-
Neil Buesing has published a nice demo of Stream-to-Table Join in Kafka Streams (code)
-
Abhishek Bharti at Flipkart details how they’ve built their AdTech platform using Flink, handling 1 Million Events/Second.
-
Shaurya Rawat takes a look at ways to speed up stream processing in Flink (including looking at Fluss), along with a series about Stream Processing with Flink split over four parts 1 2 3 4.
Analytics 🔗
-
🔥 Simon Späti takes a look at StarRocks in this excellent article, covering both the technology and detailed real-world adoption examples. StarRocks originated from a fork of Apache Doris, and Alex Polorotov has an entertaining (if somewhat subjective) account of its history and relationship with VeloDB.
-
🔥 A couple of good articles about BigQuery costs, from Michael Petro in Reddit’s Engineering team about how Reddit optimised their BigQuery costs, and from Kirill Bobrov explaining the mechanics of BigQuery charges.
-
A practical account from the Whatnot Engineering team of their experience using LLMs to support end-users of their data. Contrast this to OpenAI’s account of their in-house data agent.
Data Platforms & Architectures 🔗
-
🔥 Nicoleta Lazar from Fresha does a deep-dive looking at how query federation differs between StarRocks and Trino, with Iceberg as a source.
-
Trade Republic’s Sadeq Dousti has more details from their real-world experience implementing the Outbox pattern (see also part 1).
-
A recording of a panel with Adi Polak, Sarah Usher, and Matthias Niehoff at QCon London 2025 discussing Modern Data Architectures.
-
Details from Emmie Dong and team at Netflix about the batch-based Data Movement platform called Data Bridge.
-
A couple of interesting papers:
Data Engineering and Pipelines 🔗
-
🔥 Good stuff from Ben Rogojan on Backfills: The Necessary Evil of Data. Chris Hillman has also published a good article on the same topic: Self-Healing Tables
-
🔥 In 2017 Max Beauchemin published The rise (and then the downfall) of the Data Engineer—here’s a good reddit thread in which people reflect on what has changed in that time.
-
🔥 Joe Reis has published results in an interactive form from his 2026 Data Engineering Survey (including a wicked EnterpriseSynergy version), as well as commentary on certain aspects of it including The Insanity of Data Education
-
dbt Labs have published a set of agent skills—I’m looking forward to trying these out alongside the dbt project that I built recently.
-
This post from Andrew Hawker at Auth0 is from a few years ago, but is interesting nonetheless for how they built their Data Pipelines on Private Cloud with tools including Kafka Connect and Schema Registry.
-
Very interesting article from Liang Mou and team at Pinterest about their Debezium/Kafka/Flink/Spark/Iceberg pipeline: Next Generation DB Ingestion at Pinterest
-
Celina Amados describes how Netflix have set up automated testing of data to catch issues that their standard code testing would miss: Netflix Data Canary. On a similar subject, Isra Nurul Habibi at Halodoc has written up some good details of how they do multiple levels of data validation against their data warehouse.
-
Details of Wix’s Python-based Iceberg to Clickhouse pipeline for low-latency serving of data to microservices.
-
A good talk from Sarah Usher at QCon London 2025, talking about data lineage and lifecycle.
-
Marcel-Jan Krijgsman has written up a nice three-part exploration of Data Engineering in the European Cloud (covering Iceberg, Nessie, Trino, JupyterHub, and PowerBI): 1 2 3
Data Modelling 🔗
-
🔥 Joe Reis' new book, Practical Data Modeling, is in its final stretch of editing now, and Joe has released Chapter 2 on his Substack: What Data Modeling Is (and Is Not)
-
🔥 Good stuff from Tim Castillo: How I Structure My Data Pipelines (Silver Layer) (he also has a good overview post and bronze layer)
-
A good writeup from Adediwura Boluro-Ajayi on Understanding Dimension Tables in Dimensional Modelling (Kimball Style)
CDC 🔗
-
🔥 The Debezium project are inviting submissions to their Debezium 2026 Community Feedback Survey (open until April 5th).
-
🔥 Yaroslav Tkachenko has been looking at the performance of three different CDC tools (Debezium, Flink CDC, and Supermetal) taking data from Postgres into Kafka, running a set of benchmarks covering both snapshots and live CDC. Which won? You’ll have to read the article (hint: Rust & Arrow apparently is rather quick!)
-
A couple of interesting posts on the Debezium blog this month, looking at Measuring Debezium Server Performance, and a new feature being introduced to provide Reusable Connections in Debezium.
-
A reddit thread discussing the pros and cons of using CDC and Kafka to synchronise two databases, vs dedicated replication software.
-
🔥 Excellent deep-dive troubleshooting post from the team at Zepto, covering Debezium performance problems (and fixes made).
-
Chandrasekar Gnanasambandam and team at Guidewire show how they reduced Debezium snapshot duration by 70%.
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
-
🔥 Russell Spitzer was interviewed by dbt’s Tristan Handy in a podcast (and blog post) about Apache Iceberg and the Catalog Layer
-
🔥 The Apache Iceberg File Format API has been published, opening up the possibility of Iceberg data files being written in other formats including active work to implement Vortex.
-
Apache Polaris has graduated to a top level project (TLP).
-
Useful post from Yossi Reitblat at Ryft describing three strategies for Migrating from Hive to Apache Iceberg.
-
The problem with engineers is that they love naming things following a theme. So Iceberg tools often anchor on related words, meaning that unfortunately floe and floecat are AFAICT completely unrelated despite their names.
-
floe is a project from Neelesh Salian providing policy-based maintenance for Iceberg tables using either Trino or Spark. You can read more about it in this blog post.
-
floecat is a catalog of catalogs, supporting metadata from both Iceberg and Delta Lake.
-
ice-lens is a project from Muhammed Demirbaş that provides a GUI for inspecting your Iceberg tables.
-
A useful article looking at different compaction engines for Iceberg.
-
It’s in Russian, but this article looks useful for anyone comparing Apache Iceberg vs Apache Paimon.
-
A good thread from the Iceberg dev mailing list discussing Streaming Upserts Best Practices.
RDBMS 🔗
-
🔥 RedMonk’s Rachel Stephens discusses hypotheses regarding the Disintermediation of Databases.
-
🔥 OpenAI’s Bohan Zhang discusses how they scale PostgreSQL for 800M ChatGPT Users.
-
Dalto Curvelano has a thorough Introduction to PostgreSQL Indexes, whilst Haki Benita discusses some Unconventional PostgreSQL Optimizations.
-
A nicely written modern guide to SQL JOINs from Alexey Makhotkin.
-
Details from Ioannis Androulidakis at Booking.com of how they migrated their backup catalog for 250+ MySQL clusters to AWS
-
Vinicius Malvestio Grippa has taken Brendan Gregg’s awesome FlameGraph concept and forked it to work with MySQL: myflames: MySQL EXPLAIN ANALYZE Visualizer
-
An impressive line-up of speakers at CMU’s free Spring 2026 Seminar Series, which is titled PostgreSQL vs. The World. Talks are recorded and available along with those from previous series here.
General Data Stuff 🔗
-
🔥 A peek under the covers of How AWS S3 Is Built in this interview with AWS' Mai-Lan Tomsen Bukovec by Gergely Orosz (The Pragmatic Engineer).
-
Gunnar Morling has released the first version of Hardwood: A New Parser for Apache Parquet and has some good details in this blog post.
-
Neelesh Salian has created a project called PARX, looking at optimising performance through persistent metadata caching for Parquet files.
-
Matthew Mullins has a clear, well-articulated opinion on How Not to Run Open Standards.
-
Polyglot is a SQL transpiler for 32+ dialects by Tobias Müller.
-
Dan Harrison at TurboPuffer writes about the distributed queue that they built on S3.
-
A couple of good conference talks that I came across recently:
-
Tim Berglund: Event-Driven Architectures Done Right, Apache Kafka (Devoxx Poland 2021)
-
Lutz Huehnken: Events, Workflows, Sagas? Keep Your Event-driven Architecture Sane (KanDDDinsky 2022)
-
Justin Cormack takes a look at two MinIO replacements: RustFS and Garage (both written in Rust).
-
If you’re interested in alternatives to MinIO, also check out my post from last month: Alternatives to MinIO for single-node local S3
AI 🔗
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
Big Picture & Culture 🔗
-
🔥 Excellent analysis and commentary from RedMonk’s Stephen O’Grady: Besieged
-
Matt Shumer - Something Big Is Happening (discussed on this Hard Fork podcast episode)
-
James Randall - The Thing I Loved Has Changed
-
Elena Verna - There’s a Short Window to Get Radically AI-Native
-
🔥 MotherDuck’s Jordan Tigani shared some important thoughts about the implications of using AI in your writing.
-
Communities including Reddit and Lobste.rs are seeing a deluge of AI-generated articles and tools. Some of it is interesting; a lot of it is low-effort. Unchecked, this is likely to fundamentally shift the culture of these communities. On lobste.rs there’s a proposal to Add 'AI Generated' as a Flag Reason.
-
Sid Sundharam’s short but punchy article is great: ai;dr
AI’s impact on Open-Source Projects 🔗
-
🔥 Important article from RedMonk’s Kate Holterhoff looking at the impact of AI on open-source projects. By the very definition of OSS, they are open to external contributions—and the sheer scale at which AI agents can churn out code is causing problems: AI Slopageddon and the OSS Maintainers. Steef-Jan Wiggers covers a similar subject in his InfoQ article.
-
🔥 Tomas Vondra argues that contributions previously carried a "proof of work": well-intentioned contributors in effect demonstrated their willingness to put in time proportional to that required of the maintainers reviewing their contributions: The AI Inversion
-
Ultimately, code contributions are the responsibility of the human interacting with the agent. Used well, coding agents are fantastic productivity boosters and enable people to create code they could never have done before. Used irresponsibly, they’re little better than monkeys throwing crap around in a zoo. Mitchell Hashimoto has conceived Vouch as a way for projects to deal with the influx of interactions, ultimately forming a web of trust for contributors.
AI in Software Engineering 🔗
-
🔥 Rahul Garg at Thoughtworks discusses the idea of Knowledge Priming to give an agent the best 'understanding' of a code base. The article is part of a series that’s being developed.
-
🔥 Good set of notes and learnings from Andrej Karpathy about his experience with Claude Code
-
🔥 Mitchell Hashimoto - My AI Adoption Journey
-
I enjoyed this thoughtful and well-written discussion from Wes McKinney of some of the limitations and implications of AI coding: The Mythical Agent-Month
-
Simon Willison nails it: "the productivity boost these things can provide is exhausting"
-
Swizec Teller argues that LLMs churning out code is the easy bit; keeping it running is the hard part and thus the future of software engineering is SRE.
AI in Practice 🔗
-
🔥 A proposal from Max Beauchemin for DB-AGENTS, an RDBMS equivalent of AGENTS.md (LinkedIn thread)
-
Nicolas Bustamante - Lessons from Building AI Agents for Financial Services
-
Mikaela Grace and colleagues at Anthropic have advice on Demystifying evals for AI Agents
-
Perfect example from McDonald’s Arth Shah showing where AI can be used to accelerate repetitive but nuanced tasks. In this case they can do migration analysis with Pandas in two days instead of the usual six weeks.
-
Detailed article from Antonio Castelli and team at Booking.com discussing how they evaluate LLM agents.
-
This is fun/interesting: Marc Weistroff gave Claude access to his Pen Plotter.
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
-
🔥 I like this suggestion from Michael Heap for framing the acceptance of requested work: Yes, If…
-
Networking doesn’t have to be the gross BS-fest that LinkedIn suggests it should be. Jay Miller shares a very practical list of 7 rules for authentic networking
-
Leon Adato has written a good guide to how not to answer the salary question
-
After 1.5 BILLION views, YouTube took the Rickroll song offline :-(
-
I’ve followed the SwiftOnSecurity Twitter/X account for many years, and so loved reading their backstory.
-
Cassidy Williams recorded this good video with her thoughts on being a parent in tech.
-
TIL: Lazy Consensus is a concept used in some open source projects for helping ensure progress on work doesn’t stall unnecessarily.
-
Great data viz and analysis from Lauren Leek: Britain Lost 14,000 Third Places. They Were Called Pubs. Is Your Local Next?