Welcome to May’s Interesting Links! This month saw the Current conference in London with the usual 5k run, lots of familiar faces and friendly conversations, and plenty of excellent breakout sessions too. It seems live-tweeting conferences isn’t a thing any more, with only Thomas Cooper and me seeming to post anything, but if you want you can review the hashtag feed on Bluesky for some highlights of the conference.
I got my first Hacker News front page hit with AI Slop is Killing Online Communities (51k views and climbing!), and a nice little halo boost for another rant from earlier this year, AI will fsck you up if you’re not on board.
Oh, and I got involved in some thought leadering over on LinkedIn (which a non-zero number of people thought was serious) with my shitposting about fried breakfasts.
Kafka and Event Streaming 🔗
-
🔥 Apache Kafka 4.3.0 has been released. Check out the release announcement, as well as a video from Sandon Jacobs covering the new features.
-
🔥 After a few quiet months on his blog, Jack Vanlightly is back with a bang! He’s written Dimster, a performance benchmarking tool for Apache Kafka, and several blog posts off the back of it:
-
Benchmarking Apache Kafka Consumer Groups vs Share Groups (overhead test).
-
Kafka Share Groups and Parallelizing Consumption Part 1: Tuning max.poll.records, Part 2: Producer Batches and share.acquire.mode.
-
🔥 I had the absolute pleasure of watching Victor Rentea present at Devoxx UK earlier this month. This guy redefines what it means to be an entertaining, energetic, enthusiastic, and educational presenter. Whilst his specific talk, "Event-Driven Architecture Pitfalls", isn’t online yet, you can find the slides here, and a recording from Devoxx last year of a similar talk.
-
The Parallel Consumer library from Confluent has been marked as no longer maintained, prompting a discussion of alternatives (and the concept itself) on LinkedIn, as well as a fork from one of the original authors, Tony Stubbs.
-
Mariano Gonzalez - Benchmarking KPipe against the parallel-Kafka libraries you would actually pick.
-
Michel Tricot - Event-Driven vs. Polling Architectures for Agent Triggers.
-
An interesting idea from Florent Ramiere and colleagues: what if you specify a set of interesting additions to Kafka’s functionality, with strict rules around the implementation, and then have LLMs take their best shot at it? You can see the ideas and results in the branches of this repository.
-
Viquar Khan - Architecting Cloud-Native Kafka: From Tiered Storage Towards a Diskless Future.
-
Elad Eldor - Kafka’s Real Compression Problem Is Batch Depth, and Kafka Compute Is Cheap. Network Is Not.
-
Kroxylicious version 0.21.0 has been released, and Sam Barker from the Kroxylicious project has been running some benchmarks to look at the impact that the proxy has, both pass-through and when encrypting records.
-
Aiven’s Juha Mynttinen explores why they think Apache Kafka Deserves Topic Types.
-
Details of a Coinbase outage involving their Kafka provider, which, based on blogs from 2022 and 2023, is MSK.
-
Andy Muir - Kafka Schema Registry doesn’t guarantee compatibility (and what actually does).
-
Bruno Cadonna - OpenData Buffer: HA pipelines without Kafka.
-
Jeffrey J. Jennings - Kafka’s quiet observability superpower - Kafka Interceptors.
-
Grzegorz Kocur - Do Kafka metrics have to be so difficult?
Stream Processing 🔗
-
Flink’s Stateful Functions (StateFun) is not maintained by the project any more, so kzmlabs' Oleksandr Kazimirov forked it to continue developing it.
-
Olena Vodzianova - How Chandy-Lamport Inspired Apache Flink Checkpointing.
-
🔥 Two good posts from the team at Grab:
-
Details of how they built their one-click data ingestion platform with Apache Flink.
-
Details of how Smartsheet use Flink for optimising both costs and performance by filtering messages.
-
flink-state-explorer is, as the name suggests, a tool for exploring Apache Flink 1.20 canonical savepoints interactively.
-
A hands-on GitHub repo from Patrick Neff showing off a stream processing pipeline using dbt and Flink on Confluent Cloud.
-
Shuva Jyoti Kar - Designing stateful serverless Agentic Loop with Kafka and Flink.
-
A couple of security issues for Flink to be aware of if you’re running it:
-
CVE-2026-35194 (SQL injection).
-
CVE-2026-40564 (K8s operator).
Analytics 🔗
-
🔥 Tristan Handy - BI’s Second Unbundling.
-
A good writeup from Cloudflare’s James Morrison and Christian Endres about tracing performance issues in ClickHouse.
-
Several posts from StarRocks covering new features in 4.1:
-
Two BigQuery optimisation/cost saving articles, from Christophe Oudar and Azeem Jalageri.
-
Daniel Beach - Spark is Dead. Long Live DuckDB.
-
Alibaba added DuckDB into their fork of MySQL, AliSQL, providing storage and query for OLAP workloads.
-
Simon Aubury - I don’t need an untrusted LLM to tell me I’m spending too much on coffee.
-
The DuckDB team announced Quack: The DuckDB Client-Server Protocol.
-
Ben Fleis explores DuckDB’s support for Delta and Unity Catalog.
-
🔥 I’ve been a fan of Mark Litwintschik’s no-nonsense blog posts showing current technologies and exploring interesting data sets for many years. In this one he uses DuckDB to analyse details of 10K+ Satellites in Space.
Data Platforms, Architectures, and Modelling 🔗
-
🔥 Nikola Ilic - Data Modeling for Analytics Engineers: The Complete Primer.
-
Airtable’s Matthew Jin details how they optimised their costs by moving PBs of cold data from MySQL to S3, and wrote a query engine using DataFusion to serve it.
-
Brian Brunner and his colleagues at Cloudflare published details of how they built Cloudflare’s data platform and an AI agent on top of it.
Data Engineering, Pipelines, and CDC 🔗
-
Caesario Kisty - A Practical Implementation of Medallion Architecture Using ClickHouse.
-
Xinran Waibel - Data Engineering Open Forum 2026 Recap.
-
🔥 After doing a bit of fairly naïve experimentation with Claude and dbt earlier this year, I was very interested to read Jason Ganz’s article What data agent benchmarks do and don’t tell us, and hope to try out the referenced ADE-bench ("a framework for evaluating AI agents on data analyst tasks") soon.
-
Whilst Thijs Nieuwdorp’s article about Handling Schema Issues in Polars is specific to Polars, it’s a useful reference for the kind of schema changes one will want to make in a data pipeline, and the challenges they can cause depending on how (or if) your implementation technology of choice supports them.
-
🔥 Pedram Navid - We need to talk about dbt.
-
A summary/re-write by Alex Yu (a.k.a. ByteByteGo) looking at How Figma Upgraded Data Pipeline from Multi-Day Latency to Real-Time (based on the original blog post by Yichao Zhao from last year).
-
Netflix - The Evolution of Cassandra Data Movement at Netflix.
-
Alexey Makhotkin has a two-parter on Data Quality: part 1 / part 2.
-
Chris Hillman - Don’t Go Dark: Visibility Is a Data Engineering Skill.
-
After the excellent survey and results that Joe Reis published about data engineering earlier this year, he’s now following up with a survey on The Organizational State of Data Engineering (open for submissions until Sunday, June 21).
-
Mahendran Vasagam - From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines.
-
Dana Rabba - Building Self-Healing Data Pipelines at Halodoc.
-
rocky is a dbt alternative that looks quite interesting. It describes itself as "the trust plane for your warehouse", and targets Databricks users primarily, with Snowflake and BigQuery to follow. There’s a built-in playground feature that’s worth poking around to get a feel for it.
CDC 🔗
-
Marc Bowes describes how Aurora DSQL’s CDC feature works. If you want more, there are further details of its use from Vijay Karumajji.
-
🔥 George Zefko - Building a CDC pipeline, part 2: Debezium Internals (I featured part 1 last month; if you missed it, it’s here).
-
A couple of good posts from the Debezium team:
-
Chris Cranford discusses What Nobody Explains About Debezium in 2026 (But Should).
-
Jiri Pechanec explores Change Data Capture with Debezium and Jupyter.
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
-
Apache Iceberg 1.11 has been released (I even got some small contributions merged 🎉). There are more details of the release in these blog posts:
-
Alex Stephen & Talat Uyarer - Announcing Apache Iceberg 1.11.0.
-
Ajantha Bhat - Apache Iceberg 1.11.0 Adds registerView: Closing a Catalog Migration Gap.
-
Alex Merced - An In-Depth Overview of the Apache Iceberg 1.11.0 Release.
-
🔥 The talks from Iceberg Summit 2026 are now online.
-
Alex Merced begins an epic 15-part series about Apache Iceberg by looking at What Are Table Formats and Why Were They Needed?
-
Yelp’s Nick Del Nano looks at How Partition Access Visualizations Reduced Data Lake S3 Cost by 33%.
-
Honest words from Fresha’s Samuel Valente as he looks at the use of Iceberg with Snowflake in practice: Snowflake with Iceberg: Lakekeeper, dbt, and some Sparks Flying.
RDBMS 🔗
-
Daniel Guzman-Burgos describes bintrail, which provides time-travel SQL for MySQL. Renato Losio has a summary on InfoQ.
-
Teiva Harsanyi - How Linux 7.0 Broke PostgreSQL: The Preemption Regression Explained.
-
Radim Marek covers the ORDER BY jungle, as well as PostgreSQL’s TOAST.
-
Markus Winand also looks at ORDER BY and the evolution of support in different RDBMS.
-
🔥 James Blackwood-Sewell writes up details of the benchmarking platform they built, whilst Ben Dicken muses on benchmarking at PlanetScale too.
General Data Stuff 🔗
-
An opinionated, and fairly concise, set of recommendations for the use of different Open Standards for Modern Data Architecture.
-
LinkedIn’s Pratikmohan Srivastav writes about a performance troubleshooting experience - The 58-Million-Key Freeze: What a HashMap Resize Taught Us About Memory Allocation at Scale.
-
Sem Sinchenko - Same buffers, same instructions, same hardware. Where Is the JVM Tax?
-
Gergely Orosz shares some excerpts from Martin Kleppmann’s second edition of Designing Data Intensive Applications.
AI 🔗
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
-
🔥 Ben Evans - AI Eats the World.
-
🔥 TikTok is my guilty pleasure, but instead of dogs misbehaving in comical ways, here’s an excellent piece to camera from Scott Hanselman reflecting on the impact of AI in our lives as software developers.
Pro-tip: yt-dlp works great with TikTok, so you don’t have to actually open the page if you still wanna view the video.
-
Nate Berkopec - Thoughts on LLMs in 2026.
-
Julien Hurault - Time for AI Coding to Turn Boring?
-
Kate Holterhoff - AI Slop & the Vulnerability Treadmill.
-
Paulo Arruda - What I Learned Building Multi-Agent Systems from Scratch at Shopify.
-
🔥 Loris Cro - Contributor Poker and Zig’s AI Ban.
-
Lucia Cerchie - Why You Need More Than a SKILL.md.
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Work and Career 🔗
-
🔥 An oldie (2008!) but a goodie: Jeff Atwood - Don’t Go Dark.
-
🔥 Lara Hogan - Be a thermostat, not a thermometer.
-
As an IC, I endorse this pitch from Elena Verna ;-) IC work is the new career flex.
-
🔥 Ana Rodrigues - It’s 2026 and women are still asked to teach others to think a little bit and not be a prick.
-
Leyla Kazim - I did no work for a year and no one noticed.
Community and Blogging 🔗
-
🔥 Kevin Powell wrote this article which resonated hard with me. I think it’s a boiling-frog situation; if I think about my motivation to write today, vs a year ago, vs 5, it’s definitely very different. AI noise drowns things out, kinda like SEO marketing 'content factories' did but on a bigger and more destructive scale, so as an author is it even worth writing original material? Is anyone even gonna read it?
-
An excellent writeup from Vicki Boykis about Tagging my blog posts with BERTopic and LLMs - definitely need to try this.
-
Mike McQuaid - Open Source Resistance: Keep OSS alive on company time.
-
Very cool idea for conference badges from Shy Ruparel, with an excellent writeup to boot.
Read, Watch, Listen 🔗
I couldn’t think of a good subheading for these :)
-
Some fun nostalgia, with screenshots of various old OSes at typewritten.org and The Virtual OS Museum (the latter even has, IIUC, runnable VMs for download!)
-
Dan Carlin is probably my favourite podcaster, and as well as his well-known Hardcore History he has occasional thoughts on more current affairs, including this one: The Water in Which We Swim.
-
The Middle Class Museum (A memorial to affordable living).
-
Fast16: The Cyberweapon That Predates Stuxnet by Five Years.