Well, it’s that time of year already! Whilst munching on a mince pie, enjoy the final Interesting Links for 2025.
It’s been a busy twelve months for me; this time last year I was signing off from my last company, which went on to be acquired—and last week I found out that my current company (Confluent) is to be acquired by IBM. Despite my reaction against any kind of cheese moving, I figure this is going to be an interesting development and a whole new experience for me :)
Just one blog post of my own to share from this month: a write-up of some investigation that I did using Neo4j and graph analysis to identify astroturfing on Reddit. It turns out that there are marketing agencies out there who think it’s a good idea to spoil things for everyone else by offering astroturfing-as-a-service, and at least two vendors in this space paid them for it 🙄.
My previous employer has kindly allowed me to host my previous blog posts here on rmoff.net, which I’m delighted about. If you’ve not seen them already, here are some of the highlights:
-
Sending Data to Apache Iceberg from Apache Kafka with Apache Flink
-
Troubleshooting Flink SQL S3 & general JAR problems
And so…on with the interesting links!
Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Tip: Medium posts often skulk behind a gate, so I’ve hyperlinked to the Freedium version and included a link to the original using a ⓜ️ icon should you prefer to visit that (or if Freedium goes offline).
Kafka and Event Streaming
-
Aviv Dozorets published a new tool, klag, billed as a replacement for the deprecated Kafka Lag Exporter.
-
🔥 Sandon Jacobs has an excellent lightboard explainer of the new Queues for Kafka introduced with KIP-932.
-
Jepsen’s test reports are always interesting to read, including this recent one, which uncovers issues in NATS JetStream.
-
I mean, if a Kafka alternative isn’t written in Rust these days, is it even worth writing? Snark aside, walrus claims higher performance than Kafka, although it isn’t API compatible.
-
WarpStream have always published good blog posts, and this one from Maud Gautier continues the trend, with technical details of how they added support for Protobuf with Schema Registry.
-
This post from Yifeng Liu gives some practical advice on how to architect Kafka topics, specifically with regard to duplication (which is sometimes totally OK, the author argues).
-
Platformatic’s Node.js client has had a 223% speed boost—Paolo Insogna describes how.
Stream Processing
-
Souquieres Adam has been busy, looking at Kafka Streams vs Apache Flink and Where Are We Really With Streaming Technology Adoption?, as well as detailing Why I Stopped Using High-Level Streaming Joins.
-
Enes Harman takes a look at Watermark Generation in Flink.
-
Shameless plug: If you’re interested in watermarks, make sure to check out Flink Watermarks…WTF :)
-
A deep-dive from Shuo Cheng looking at improvements to ingestion with Flink in Apache Hudi 1.1.
-
Details from Avito about how they use Flink SQL and deploy it with the Flink Kubernetes operator. The pages are in Russian but browser translation works well :)
-
Jaehyeon Kim has a new post, showing how to use Flink with Kotlin.
-
🔥 Great post from Mehul Batra and Luo Yuxia making the case for Apache Fluss, and explaining why Iceberg alone isn’t sufficient for a real-time lakehouse.
-
Details from the team at Uber about their move from batch to realtime with Kafka/Flink/Hudi and some of the problems they solved along the way.
-
Apache Flink 2.2 has been released, and includes improvements to Materialized Tables, Delta Join as well as new vector search and real-time AI features.
-
Yaroslav Tkachenko wrote a thoughtful piece about Flink, defending it from its detractors.
-
Shopify’s Farhan Thawar shared some pretty cool nerd stats from Black Friday Cyber Monday weekend, including Kafka and Flink processing over 150 MB/s.
Analytics
-
Lots of good StarRocks content this month, with Anton Borisov looking at plans for Incremental View Maintenance (IVM) whilst Jeff Ding covers Compaction and I/O.
-
Rachel Herrera from Hex argues that Dashboards were never the destination.
-
🔥 Nice deep-dive from Spotify’s Kirill Bobrov looking at What Really Happens When You Hit “Run” in BigQuery.
-
David Wheeler covers the details of pg_clickhouse, an extension that ClickHouse have released enabling you to run analytics queries on ClickHouse directly from Postgres.
-
Interesting details of the analysis Uber did, and the solutions they implemented, to improve the performance of their Apache Pinot implementation.
Data Platforms, Architectures, and Modelling
-
James Carr is publishing an Advent of Enterprise Integration Patterns.
-
Gunnar Morling argues that it’s OK to have multiple copies of your data in his latest blog post, which looks at the concepts of push vs pull queries and materialized views.
-
After too many years in the wilderness, data modeling is finally coming back into vogue, and not a moment too soon. Not sure where to start? Michael New has written an excellent guide.
-
Data Mesh was sooooo last year, right? Daniel Beach takes a look at quite what happened to it in his entertaining article Data Mesh Theology. Dead or Alive?
-
Simon Späti looks at closed vs open-source data stacks, as well as the idea of "git for data".
-
Tom Schreiber & Lionel Palacin from ClickHouse take a look at the nuts and bolts of the pricing and compute models of the 5 major cloud data warehouses.
-
Matthias Niehoff did an interesting talk recently at QCon London, looking at Reliable Data Flows and Scalable Platforms: Tackling Key Data Challenges.
-
Red Hat’s Vojtěch Juránek has a good blog post showing how Debezium can be used in implementing CQRS Design Patterns.
Data Engineering, Pipelines, and CDC
-
🔥 Joe Reis is looking for folk to answer a few questions in a survey to support the Practical Data 2026 State of Data Engineering Report that he’s writing.
-
A nice primer from Henry Liao for anyone new to dbt or DuckDB on using them to build an ETL pipeline.
-
A couple of good blogs from the team at Karrot this month. Jin-won Park has a deep-dive explanation of how they derive column-level lineage by parsing query logs in BigQuery, whilst Seungki Kim details the evaluation they did of Kafka Connect, Debezium, and Flink for doing CDC from MongoDB and which they chose.
-
It’s not all tools and streaming—here’s a post from McDonald’s describing their batch-based Python+YAML pipelines.
-
What’s old is new, and nowhere is that more true at the moment than with the semantic layer—a concept that has been around in IT for decades. My former boss Mark Rittman has a great write-up here of one of the tools that was there then: An Homage to Oracle Warehouse Builder.
-
Ojesav Srivastava at Flipkart writes up Triton, a ZooKeeper-coordinated Coordinator/Master/Worker platform on Kubernetes StatefulSets for reliable, high‑throughput bulk file ingest at scale.
-
Nurdan Almazbekov has a detailed write-up of how Yelp stores and queries S3 access logs at scale using Parquet and Athena.
-
Good tips from Erfan Hesami on Data Quality Design Patterns based around the WAP (Write-Audit-Publish) concept.
-
The second part of an excellent hands-on blog series from Nicoleta Lazar at Fresha with details of their Postgres/Kafka/Flink/StarRocks pipeline.
-
"One version of the truth" is an aim many have but not all achieve—Shlomit Goldenberg and Lihi Gilboa (Aziz) from Riskified describe their Journey to a Single Source of Truth.
-
A couple of interesting blogs from DoorDash: Omik Mahajan has an excellent deep-dive into the performance of their in-house search engine platform, whilst Dave Press describes how they built an anomaly detection platform.
-
Lokeshbabu Radhakrishnan has an excellent blog post detailing why and how Zalando use Delta Sharing for low-latency and zero-copy access to data with partners.
Open Table Formats (OTF), Catalogs, Lakehouses etc.
-
WarpStream’s Richard Artoul has written an explainer of how their Tableflow product efficiently writes Iceberg metadata.
-
Tips from Zach King on how to avoid expensive mistakes with Delta Lake S3 storage.
-
DuckDB now supports UPDATE/INSERT/DELETE in Iceberg; Tom Ebergen demonstrates it in this article.
-
Alireza Sadeghi takes a detailed look at existing Open Table Formats (OTFs) like Iceberg, and then compares DuckLake’s solution and discusses possible limitations.
-
Just as the advent of OTFs blew up our assumptions about how to store data for tabular access, flat file formats such as Parquet are now in the spotlight. Moshe Derri looks at one of the possible replacements, Vortex.
-
🔥 The PMC chair for Apache Parquet, Julien Le Dem, has written an article looking at the criticisms of Parquet and suggesting various evolutions of the project to address them, which would have the additional benefit of retaining the wide interoperability that is so important.
-
🔥 Staying with the file format theme, this post from Jack Ye and Prashanth Rao at LanceDB is interesting as it positions Lance as not just a file format, but also a table format—a qualification that I’d not come across before.
RDBMS
-
🔥 Very cool deep-dive by Radim Marek into Postgres indexes and storage.
-
A two-part series from Siddharth Singh and team at Uber looking at how they manage high availability for their MySQL clusters.
-
Sometimes the answer is not NoSQL, but No! SQL. Or something like that. Anyway, Trendyol have written about their challenges with Couchbase (NoSQL) and subsequent migration to Postgres.
General Data Stuff
-
🔥 Industry luminaries Mike Stonebraker and Andy Pavlo present their 2025 Year in Review
-
Ergest Xheblati argues that the most important skill as a data professional isn’t fancy SQL or l33t coding; it’s being able to work within an organisation to identify where you can have the most benefit and help identify the actual questions the business want to answer from the data.
-
A new paper from Matthias Jasny and others looking at io_uring for High-Performance DBMSs: When and How to Use It
-
🪦 RIP MinIO :( Previously an excellent choice for building demos and PoCs for projects that needed S3 storage locally, MinIO has spent the last few months dismantling their community offering, removing the GUI, ending Docker builds, and now moving the project into maintenance mode.
-
Arkadiusz Chmura looks at idempotency and effectively-once processing in his blog post Transactions Aren’t Enough: The Need For End-To-End Thinking
AI
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
Note: A request to you. Are there any good blog posts out there documenting how companies are actually implementing their user-facing AI features? For example, Strava has the awfully-named "Athlete Intelligence" (AI - geddit?!) - but I would love to see how it’s built. It feels like there’s a chasm between "ooooh you can build this" from the vendors (hi!) and the reality of actually building with it. Perhaps that’s always the case, but for hype stuff it’s even more valuable to hear [unfiltered] stories of how people really build with it.
-
Brendan Gregg (for it is he) writes about the concept of "AI Brendans" or "Virtual Brendans".
-
Good advice from Jeffrey Snover on how to use AI for practical purposes today.
-
🔥 Annie Hedgpeth analyses the impact of AI in the workplace and specifically on juniors and the traditional growth ladder.
-
Cory Doctorow: The Reverse-Centaur’s Guide to Criticizing AI
-
Christina Wodtke argues that UX Is Your Moat and that, in the context of AI, incremental improvements in models will not be enough to shift users between products.
And finally…
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Work
-
🔥 Ellen Scherr - Aging Out of Fucks: The Neuroscience of Why You Suddenly Can’t Pretend Anymore
-
Joe Reis - Why You Should Start Your New Year in December and Tackle That Scary Goal
-
Carter Baxter - Feedback doesn’t scale
Write
-
Paradoxically, I read this post from Joe Boudreau on 10 Years of Writing a Blog Nobody Reads
Listen
-
I used to love listening to LoFi ATC and was disappointed to find it no longer works. One alternative I found recently is Lo-fi ATC - 🇧🇷 Brazil Edition. Something else I’ve found pretty neat is having two browser tabs open: one with MyNoise’s Cafe Restaurant sounds, the other with some Lo-fi, house, or whatever :)