Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Open Table Formats / Data Lakehouses
-
🔥 Instead of enthusiastically hopping on the Iceberg bandwagon with both webbed feet, DuckDB Labs have been quietly building their own format. DuckLake was announced at the beginning of the month, and is a replacement for both an OTF such as Iceberg and the metadata catalog that an OTF user will invariably need to wire up too. I had a quick poke around it, and Tobias Müller and Thomas F McGeehan V both went into it in more detail, along with MotherDuck who’ll unsurprisingly be offering it as a managed service.
-
The catalog space is a vital one for OTFs, and there are various projects springing up alongside the more established Unity and Polaris catalogs. Some do slightly different things (such as table management tasks like compaction) or have a broader ambition (cataloging all your data, not just OTF):
-
nimtable (from RisingWave)
-
I’ve written about writing to Iceberg from Kafka with Flink before, but shipped a new blog this week that throws AWS Glue Data Catalog into the mix.
-
Details of the latest Iceberg version: Iceberg v3: Moving the Ecosystem Towards Unification
-
In a blog post that makes me nostalgic for my Oracle days, Yuval Yogev writes about statistics in Iceberg
-
Some useful points to think about from Jacek Migdal covering where Iceberg might not be the right fit for your requirements
-
Vu Trinh has a nice summary of how Meta modernised their lakehouse, based on a paper from 2023
-
Apache Hudi is still kicking around and in use—this blog post from Shiyan Xu details file pruning with multi-modal index.
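If you've not come across statistics-based file pruning before, the core idea behind it is small enough to sketch. This is a generic illustration only, not how Hudi's multi-modal index or Iceberg's statistics are actually implemented; `FileStats` and `prune_files` are hypothetical names made up for the example, and it assumes a single integer column with per-file min/max values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column statistics (min and max of one column)."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper].

    A file whose range cannot contain a matching row is skipped without
    being opened, which is the whole point of keeping the statistics.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 1, 10),
    FileStats("part-1.parquet", 11, 20),
    FileStats("part-2.parquet", 21, 30),
]

# A query with the predicate `col BETWEEN 12 AND 15` only needs part-1.
candidates = prune_files(files, 12, 15)
print([f.path for f in candidates])
```

Real table formats keep these statistics in their metadata layer and handle nulls, multiple columns, and stale or missing statistics, all of which this sketch ignores.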
Kafka and Event Streaming
-
🔥 LinkedIn, the birthplace of Kafka, have published details of their Kafka replacement, Northguard and Xinfra.
-
Chris Riccomini has a hot take on Kafka: The End of the Beginning
-
Responsive’s Almog Gavra has written about a new serialisation format with performance benefits in data streaming called Imprint
-
If you run Apache Kafka, this list of CVEs that the project publishes is worth keeping an eye on. There were three CVEs published earlier this month impacting various versions including up to 3.9.0
-
Kloia explains how they solved Kafka event sequencing in their online grocery application
-
ShareChat migrated away from a multi-AZ deployment of Apache Kafka to WarpStream, and explains why in this post.
-
Agoda describes how they handle Kafka consumer failover across data centers
-
Expedia Group Tech shares details of their real-time A/B test monitoring system, built on Apache Pinot and Kafka.
Stream Processing
-
Grab have built a bespoke Flink SQL platform that they describe here.
-
Running Flink SQL once you get out of your development environment can be tricky. The excellently-named DataSQRL have published a companion to the Flink Kubernetes operator to run SQL jobs.
-
Luthra Sahil has some practical advice for exception handling in Kafka Streams
-
Riskified have written about their migration from ksqlDB to Flink
-
The team at EloElo have been busy, writing a two-part series on running Apache Flink in EKS, as well as details of their batch & realtime data platform.
-
🔥 Excellent example (as always) from Simon Aubury of using Complex Event Processing (CEP) in Flink with ADS-B aviation data
-
Details of how Swiggy built their streaming data platform on Flink, and use it for real-time business monitoring.
-
Trade Republic shares how they built a system to calculate bond yield to maturity in real-time using Kafka and Redis.
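For anyone who hasn't met yield to maturity before: it is the single discount rate at which the present value of a bond's remaining cash flows equals its market price, and it has no closed-form solution, so it gets solved numerically. Here is a minimal sketch using bisection. It is my own illustration, not Trade Republic's method; the function names are made up, and it assumes a plain fixed-coupon bond with a whole number of periods remaining and no accrued interest.

```python
def bond_price(face, coupon_rate, years, ytm, freq=1):
    """Present value of a fixed-coupon bond discounted at the given yield."""
    periods = years * freq
    coupon = face * coupon_rate / freq
    rate = ytm / freq
    pv_coupons = sum(coupon / (1 + rate) ** t for t in range(1, periods + 1))
    pv_face = face / (1 + rate) ** periods
    return pv_coupons + pv_face


def yield_to_maturity(price, face, coupon_rate, years, freq=1,
                      lo=-0.5, hi=1.0, tol=1e-12):
    """Solve bond_price(..., ytm) == price for ytm by bisection.

    Price falls as yield rises, so if the model price at the midpoint is
    above the market price, the true yield must be higher. The search is
    limited to yields between `lo` and `hi` (an assumption of this sketch).
    """
    for _ in range(200):
        mid = (lo + hi) / 2
        if bond_price(face, coupon_rate, years, mid, freq) > price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


# A 10-year, 5% annual-coupon bond trading below par yields more than 5%.
print(round(yield_to_maturity(950, 1000, 0.05, 10), 4))
```

In a streaming setup the expensive part is rerunning this solve on every price tick, which is presumably where a fast store for the static bond terms earns its keep.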
AI
-
🔥 I’m so bored of AI already, even though it is revolutionising how we do things. That said, I did enjoy reading this article: My AI Skeptic Friends Are All Nuts
-
An example from the Debezium folks showing how Debezium, Milvus, and Ollama can work together to form a Retrieval-Augmented Generation (RAG) pipeline
-
Pinterest Engineering explains how they use Ray in their ML Infrastructure
CDC / Debezium
-
🔥 Backfilling Postgres TOAST Columns in Debezium Data Change Events
-
Some good detail from Mohammad Mahdi Azadjalal on setting up CDC with PostgreSQL, Debezium, and Kafka Connect
-
Kleinanzeigen writes about their approach to a zero-downtime user migration using Debezium and Kafka.
Data Platforms & Architecture
-
Benn Stancil asks the important question: which way from here for the analytics and data industry?
-
Netflix Engineering introduces their Unified Data Architecture (UDA) with the principle of Model Once, Represent Everywhere
-
I’d be fascinated to know what proportion of companies are still running Hadoop. Pinterest, for one, are, and they’ve written about the automated migration and scaling of their Hadoop clusters.
-
Atlassian shares how they are enhancing resiliency in their OpenSearch clusters
-
Instacart tech talks about building their modern search infrastructure on Postgres
-
Grab Engineering on how they rewrote their Counter Service in Rust
-
The tech team at Just Eat on How Data Products Become the Promethean Fire
-
Uber are prolific in their technical blog posts, and have three for us this month:
-
Their approach to Config-Driven Data Pipelines
Database Engines & Performance
-
A look at how OpenAI are scaling PostgreSQL
-
I love this kind of well-written, deep dive blog from TigerBeetle about a bug that Jepsen found.
-
A deep dive from LanceDB on Columnar File Readers in Depth: Repetition & Definition Levels
-
Two interesting DuckDB extensions, providing real-time support and support for Apache Arrow
-
🔥 Chris Riccomini started a good thread on Bluesky about graph databases
-
LiveStore is a state management framework and local-first data layer for high-performance apps, based on SQLite and event-sourcing.
-
Jan Nidzwetzki writes about The Art of SQL Query Optimization
-
InfluxDB’s Andrew Lamb gave a talk on accelerating Apache Parquet with metadata stores and specialized indexes using Apache DataFusion
-
Part of Arrow, Flight SQL is a protocol for interacting with SQL databases using the Arrow in-memory format and the Flight RPC framework. Porter implements Flight SQL to provide a server on top of DuckDB or ClickHouse.
And finally…
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Human APIs (a.k.a. The Soft Squishy Stuff)
Geek
-
A lovely nerdy example of reverse engineering: Why I no longer have an old-school cert on my https site
-
🔥 The Xenon Death Flash: How a Camera Nearly Killed the Raspberry Pi 2
Data Viz
-
A lovely example of some creativity to illustrate data: A Garden of Sleep: Tracking the Emotional Distance Between Two Bedtimes - Nightingale
Tip: If you like these kinds of links, you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)