Nov 22, 2019

Common mistakes made when configuring multiple Kafka Connect workers

Kafka Connect can be deployed in two modes: Standalone or Distributed. You can learn more about them in my Kafka Summit London 2019 talk.

I usually recommend Distributed for several reasons:

It can scale
It is fault-tolerant
It can be run on a single node sandbox or a multi-node production environment
It is the same configuration method however you run it

I usually find that Standalone is appropriate when:

You need to guarantee locality of task execution, such as picking up a log file from a folder on a specific machine
You don’t care about scale or fault-tolerance ;-)
You like re-learning how to configure something when you realise that you do care about scale or fault-tolerance X-D
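
If you do run several Distributed workers as one cluster, they all need the same group.id and the same three internal topic names. Here’s a minimal sketch of a check for that; the worker configs and topic names in it are made up for illustration.

```python
# Settings that every worker in one distributed Connect cluster must share.
SHARED_KEYS = (
    "group.id",
    "config.storage.topic",
    "offset.storage.topic",
    "status.storage.topic",
)


def find_mismatches(workers):
    """Return {key: set of values} for shared settings that differ between workers."""
    mismatches = {}
    for key in SHARED_KEYS:
        values = {w.get(key) for w in workers}
        if len(values) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical worker configs: the second one has a typo in group.id.
worker_a = {
    "group.id": "connect-cluster-01",
    "config.storage.topic": "_connect-configs",
    "offset.storage.topic": "_connect-offsets",
    "status.storage.topic": "_connect-status",
}
worker_b = {**worker_a, "group.id": "connect-cluster-1"}

print(find_mismatches([worker_a, worker_b]))
```

An empty result means the workers agree on everything that has to match.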

Nov 20, 2019

Streaming data from SQL Server to Kafka to Snowflake ❄️ with Kafka Connect

Snowflake is the data warehouse built for the cloud, so let’s get all ☁️ cloudy and stream some data from Kafka running in Confluent Cloud to Snowflake!

What I’m showing also works just as well for an on-premises Kafka cluster. I’m using SQL Server as an example data source, with Debezium to capture and stream changes from it into Kafka.

I’m assuming that you’ve signed up for Confluent Cloud and Snowflake and are the proud owner of credentials for both. I’m going to use a demo rig based on Docker to provision SQL Server and a Kafka Connect worker, but you can use your own setup if you want.

Nov 12, 2019

Running Dockerised Kafka Connect worker on GCP

I talk and write about Kafka and Confluent Platform a lot, and more and more of the demos that I’m building are around Confluent Cloud. This means that I don’t have to run or manage my own Kafka brokers, Zookeeper, Schema Registry, KSQL servers, etc which makes things a ton easier. Whilst there are managed connectors on Confluent Cloud (S3 etc), I need to run my own Kafka Connect worker for those connectors not yet provided. An example is the MQTT source connector that I use in this demo. Up until now I’d either run this worker locally, or manually build a cloud VM. Locally is fine, as it’s all Docker, easily spun up in a single docker-compose up -d command. I wanted something that would keep running whilst my laptop was off, but that was as close to my local build as possible—enter GCP and its functionality to run a container on a VM automagically.

You can see the full script here. The rest of this article just walks through the how and why.

Oct 16, 2019

Using Kafka Connect and Debezium with Confluent Cloud

This is based on using Confluent Cloud to provide your managed Kafka and Schema Registry. All that you run yourself is the Kafka Connect worker.

Optionally, you can use this Docker Compose to run the worker and a sample MySQL database.
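
As a rough sketch of what the worker needs in order to talk to a SASL_SSL-secured cluster such as Confluent Cloud: the security settings are given once for the worker itself, and again with producer. and consumer. prefixes for the clients it creates on behalf of connectors. The broker address and credentials here are placeholders, not real values.

```python
def ccloud_worker_properties(bootstrap, api_key, api_secret):
    """Render Kafka client security settings for a Connect worker on a SASL_SSL cluster."""
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{api_key}" password="{api_secret}";'
    )
    base = {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
    }
    lines = [f"bootstrap.servers={bootstrap}"]
    # The worker's own clients use the bare keys; the producers and consumers
    # it creates for connectors use the producer./consumer. prefixed keys.
    for prefix in ("", "producer.", "consumer."):
        for key, value in base.items():
            lines.append(f"{prefix}{key}={value}")
    return "\n".join(lines)


# Placeholder host and credentials.
print(ccloud_worker_properties("BROKER_HOST:9092", "API_KEY", "API_SECRET"))
```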

Oct 15, 2019

Skipping bad records with the Kafka Connect JDBC sink connector

The Kafka Connect framework provides generic error handling and dead-letter queue capabilities which are available for problems with [de]serialisation and Single Message Transforms. When it comes to errors that a connector may encounter doing the actual pull or put of data from the source/target system, it’s down to the connector itself to implement logic around that. For example, the Elasticsearch sink connector provides configuration (behavior.on.malformed.documents) that can be set so that a single bad record won’t halt the pipeline. Others, such as the JDBC Sink connector, don’t provide this yet. That means that if you hit this problem, you need to manually unblock it yourself. One way is to manually move the offset of the consumer on past the bad message.

TL;DR : You can use kafka-consumer-groups --reset-offsets --to-offset <x> to manually move the connector past a bad message
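
Here’s a small sketch that assembles that command. It relies on a sink connector’s consumer group being named connect-<connector name>; the connector name, topic, and offset are hypothetical. Note that the tool will only reset offsets while the group is inactive, so the connector’s consumers must not be running at the time.

```python
import shlex


def reset_offset_command(bootstrap, connector, topic, partition, offset, dry_run=True):
    """Build the kafka-consumer-groups call that moves a sink connector's offset."""
    return [
        "kafka-consumer-groups",
        "--bootstrap-server", bootstrap,
        # A sink connector's consumer group is named connect-<connector name>.
        "--group", f"connect-{connector}",
        "--topic", f"{topic}:{partition}",
        "--reset-offsets",
        "--to-offset", str(offset),
        # --dry-run only prints the plan; --execute applies it.
        "--dry-run" if dry_run else "--execute",
    ]


# Hypothetical: the bad record is at offset 42 of partition 0, so move to 43.
cmd = reset_offset_command("localhost:9092", "sink-jdbc-orders", "orders", 0, 43)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Defaulting to a dry run is deliberate: you get to see the planned offsets before anything is changed.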

Oct 7, 2019

Kafka Connect and Elasticsearch

I use the Elastic stack for a lot of my talks and demos because it complements Kafka brilliantly. A few things have changed in recent releases and this blog is a quick note on some of the errors that you might hit and how to resolve them. It was inspired by a lot of the comments and discussion here and here.

Aug 15, 2019

Reset Kafka Connect Source Connector Offsets

Kafka Connect in distributed mode uses Kafka itself to persist the offsets of any source connectors. This is a great way to do things as it means that you can easily add more workers, rebuild existing ones, etc without having to worry about where the state is persisted. I personally always recommend using distributed mode, even if just for a single worker instance - it just makes things easier, and more standard. Watch my talk online here to understand more about this. If you want to reset the offset of a source connector then you can do so by very carefully modifying the data in the Kafka topic itself.
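
To give an idea of what that data looks like: each message in the offsets topic is keyed by a JSON array of the connector name and its source partition, and writing a null value (a tombstone) for that key clears the stored offset. This is only a sketch; the connector name and partition are invented, and in practice you would copy the key exactly from the existing message and write to the same topic partition it sits in.

```python
import json


def source_offset_key(connector, source_partition):
    """Key used in the offsets topic: a JSON array of [connector name, source partition]."""
    return json.dumps([connector, source_partition], separators=(",", ":"))


# Hypothetical connector name and partition; the real key must be copied
# byte-for-byte from the existing message in the offsets topic.
key = source_offset_key("source-file-01", {"filename": "/data/test.txt"})
tombstone = (key, None)  # a null value for that key clears the stored offset
print(key)
```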

Aug 9, 2019

Starting a Kafka Connect sink connector at the end of a topic

When you create a sink connector in Kafka Connect, by default it will start reading from the beginning of the topic and stream all of the existing—and new—data to the target. The setting that controls this behaviour is auto.offset.reset, and you can see its value in the worker log when the connector runs:

[2019-08-05 23:31:35,405] INFO ConsumerConfig values:
        allow.auto.create.topics = true
        auto.commit.interval.ms = 5000
        auto.offset.reset = earliest
…
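
If you’d rather have a new sink connector start from the end of the topic, one option from Apache Kafka 2.3 onwards is a per-connector consumer override, assuming the worker has connector.client.config.override.policy=All set. A minimal sketch, with a hypothetical connector config:

```python
import json


def start_from_end(config):
    """Return a copy of a sink connector config that starts at the end of the topic."""
    return {**config, "consumer.override.auto.offset.reset": "latest"}


# Hypothetical sink connector; the topic and URL are placeholders.
sink_config = start_from_end({
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
})
print(json.dumps(sink_config, indent=2))
```

The setting only has an effect when the connector’s consumer group has no committed offsets yet.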

Aug 9, 2019

Resetting a Consumer Group in Kafka

I’ve been using Replicator as a powerful way to copy data from my Kafka rig at home onto my laptop’s Kafka environment. It means that when I’m on the road I can continue to work with the same set of data and develop pipelines etc. With a VPN back home I can even keep them in sync directly if I want to.

I hit a problem the other day where Replicator was running, but I had no data in my target topics on my laptop. After a bit of head-scratching I realised that my local Kafka environment had been rebuilt (I use Docker Compose so complete rebuilds to start from scratch are easy), hence no data in the topic. But, even after restarting the Replicator Kafka Connect worker, I still had no data loaded into the empty topics. What was going on? Well, Replicator acts as a consumer from the source Kafka cluster (on my home server), and so far as that Kafka cluster was concerned, Replicator had already read the messages. That was because, even though I’d rebuilt everything on my laptop, Replicator was using the same connector name as before, and the connector name is used as the consumer group name, which is how the source Kafka cluster keeps track of the offsets. So my "new" Kafka environment was going back to the source, which viewed it as the existing "old" one, which had already received the messages.

Jun 23, 2019

Manually delete a connector from Kafka Connect

Kafka Connect has a REST API through which all config should be done, including removing connectors that have been created. Sometimes though, you might have reason to want to do this manually, and since Kafka Connect running in distributed mode uses Kafka as its persistent data store, you can achieve this by writing to the topic yourself.

Jun 6, 2019

Automatically restarting failed Kafka Connect tasks

Here’s a hacky way to automatically restart Kafka Connect connectors if they fail. Restarting automatically only makes sense if it’s a transient failure; if there’s a problem with your pipeline (e.g. bad records or a mis-configured server) then you don’t gain anything from this. You might want to check out Kafka Connect’s error handling and dead letter queues too.
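
The core of the idea can be sketched like this: take the JSON that GET /connectors/<name>/status returns, pick out the tasks in the FAILED state, and POST to each one’s restart endpoint. The function below only works out the paths and leaves the HTTP calls to whatever client you prefer; the status payload is a made-up example trimmed to the fields used.

```python
def failed_task_restart_paths(status):
    """Given the JSON from GET /connectors/<name>/status, list REST paths to POST to."""
    name = status["name"]
    return [
        f"/connectors/{name}/tasks/{task['id']}/restart"
        for task in status.get("tasks", [])
        if task.get("state") == "FAILED"
    ]


# Hypothetical status payload, trimmed to the fields used here.
status = {
    "name": "sink-es-orders",
    "connector": {"state": "RUNNING", "worker_id": "connect:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "connect:8083"},
        {"id": 1, "state": "FAILED", "worker_id": "connect:8083"},
    ],
}
for path in failed_task_restart_paths(status):
    print("POST", path)
```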

May 24, 2019

Putting Kafka Connect passwords in a separate file / externalising secrets

Kafka Connect configuration is easy - you just write some JSON! But what if you’ve got credentials that you need to pass? Embedding those in a config file is not always such a smart idea. Fortunately with KIP-297 which was released in Apache Kafka 2.0 there is support for external secrets. It’s extendable to use your own ConfigProvider, and ships with its own for just putting credentials in a file - which I’ll show here. You can read more here.
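
In outline, the worker is told about the file provider, and the connector config then refers to a key in a properties file instead of holding the value itself. A minimal sketch; the file path, key names, and connector properties are hypothetical.

```python
def file_secret(path, key):
    """Placeholder that the file ConfigProvider resolves when the connector starts."""
    return "${file:" + path + ":" + key + "}"


# Worker properties that enable the provider shipped with Apache Kafka.
worker_properties = "\n".join([
    "config.providers=file",
    "config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider",
])

# Hypothetical connector config and secrets file location.
connector_config = {
    "connection.user": file_secret("/secrets/db.properties", "DB_USER"),
    "connection.password": file_secret("/secrets/db.properties", "DB_PASSWORD"),
}
print(worker_properties)
print(connector_config)
```

The secrets file itself is a plain properties file (DB_USER=..., DB_PASSWORD=...) that has to be readable on every worker.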

May 22, 2019

Deleting a Connector in Kafka Connect without the REST API

Kafka Connect exposes a REST interface through which all config and monitoring operations can be done. You can create connectors, delete them, restart them, check their status, and so on. But, I found a situation recently in which I needed to delete a connector and couldn’t do so with the REST API. Here’s another way to do it, by amending the configuration Kafka topic that Kafka Connect in distributed mode uses to persist configuration information for connectors. Note that this is not a recommended way of working with Kafka Connect—the REST API is there for a good reason :)
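
As a sketch of what "amending the topic" means: my understanding is that the worker stores each connector’s config under the key connector-<name> and its target state under target-state-<name>, and that a null value (tombstone) for those keys marks the connector as removed. Treat the key names as an assumption and check them against the messages actually in your config topic before writing anything. The connector name is hypothetical.

```python
def connector_tombstones(name):
    """Messages (key, value) that mark a connector as removed in the config topic."""
    return [
        (f"connector-{name}", None),     # null value = tombstone for the config
        (f"target-state-{name}", None),  # null value = tombstone for the state
    ]


for key, value in connector_tombstones("file-sink"):
    print(key, value)
```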

May 8, 2019

When a Kafka Connect converter is not a converter

Kafka Connect is an API within Apache Kafka and its modular nature makes it powerful and flexible. Converters are part of the API but not always fully understood. I’ve written previously about Kafka Connect converters, and this post is just a hands-on example to show even further what they are (and are not) about.

Note: To understand more about Kafka Connect in general, check out my talk from Kafka Summit London, From Zero to Hero with Kafka Connect.

May 2, 2019

Reading Kafka Connect Offsets via the REST Proxy

When you run Kafka Connect in distributed mode it uses a Kafka topic to store the offset information for each connector. Because it’s just a Kafka topic, you can read that information using any consumer.
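
Whichever consumer you use, each record decodes the same way: the key is a JSON array of connector name and source partition, and the value is the offset as JSON. Here’s a minimal sketch of pulling one apart, assuming the key and value have already been decoded to strings; the record is an invented example shaped like those from a JDBC source connector.

```python
import json


def parse_offset_record(key, value):
    """Split a record from the offsets topic into connector, source partition, and offset."""
    connector, source_partition = json.loads(key)
    offset = json.loads(value) if value is not None else None
    return connector, source_partition, offset


# Hypothetical record, shaped like those written by a JDBC source connector.
key = '["source-jdbc-01",{"protocol":"1","table":"demo.accounts"}]'
value = '{"incrementing":42}'
print(parse_offset_record(key, value))
```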

Jan 29, 2019

Kafka Connect Change Log Level and Write Log to File

By default Kafka Connect sends its output to stdout, so you’ll see it on the console, Docker logs, or wherever. Sometimes you might want to route it to file, and you can do this by reconfiguring log4j. You can also change the configuration to get more (or less) detail in the logs by changing the log level.

Finding the log configuration file

The configuration file is called connect-log4j.properties and usually found in etc/kafka/connect-log4j.properties.
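
For a feel of the kind of change involved, here’s a sketch that renders log4j 1.x properties writing to both the console and a rolling file. It assumes your worker uses the log4j 1.x properties format; the log path, size limit, and backup count are arbitrary examples.

```python
def file_logging_properties(log_path, level="INFO"):
    """Render log4j 1.x properties that log to both the console and a rolling file."""
    pattern = "[%d] %p %m (%c:%L)%n"
    return "\n".join([
        # Root logger: the level, then the names of the appenders to use.
        f"log4j.rootLogger={level}, stdout, file",
        "log4j.appender.stdout=org.apache.log4j.ConsoleAppender",
        "log4j.appender.stdout.layout=org.apache.log4j.PatternLayout",
        f"log4j.appender.stdout.layout.ConversionPattern={pattern}",
        "log4j.appender.file=org.apache.log4j.RollingFileAppender",
        f"log4j.appender.file.File={log_path}",
        "log4j.appender.file.MaxFileSize=10MB",
        "log4j.appender.file.MaxBackupIndex=5",
        "log4j.appender.file.layout=org.apache.log4j.PatternLayout",
        f"log4j.appender.file.layout.ConversionPattern={pattern}",
    ])


# Hypothetical log location.
print(file_logging_properties("/var/log/kafka/connect.log", level="DEBUG"))
```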

Dec 15, 2018

Docker Tips and Tricks with Kafka Connect, ksqlDB, and Kafka

A few years ago a colleague of mine told me about this thing called Docker, and I must admit I dismissed it as a fad…how wrong was I. Docker and Docker Compose are two of my key tools of the trade. With them I can build self-contained environments for tutorials, demos, conference talks etc. Tear it down, run it again, without worrying that somewhere a local config changed and will break things.

Dec 3, 2018

Kafka Connect CLI tricks

I do lots of work with Kafka Connect, almost entirely in Distributed mode, even with just one node, because it makes scaling out much easier when/if needed. Because I’m using Distributed mode, I use the Kafka Connect REST API to configure and manage it. Whilst others might use GUI REST tools like Postman etc, I tend to just use the command line. Here are some useful snippets that I use all the time.

I’m showing the commands split with a line continuation character (\) but you can of course run them on a single line. You might also choose to get fancy and set the Connect host and port as environment variables etc, but I leave that as an exercise for the reader :)
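
To illustrate the sort of thing these snippets do, here’s a sketch in Python rather than shell: it boils the JSON from GET /connectors/<name>/status down to one line showing the connector state and each task’s state. The payload is a made-up example trimmed to the fields used.

```python
def summarise(status):
    """One line per connector: name, connector state, then each task's state."""
    task_states = ", ".join(task["state"] for task in status.get("tasks", []))
    return f"{status['name']} | {status['connector']['state']} | {task_states}"


# Hypothetical status payload, trimmed to the fields used here.
status = {
    "name": "sink-es-orders",
    "connector": {"state": "RUNNING"},
    "tasks": [{"id": 0, "state": "RUNNING"}, {"id": 1, "state": "FAILED"}],
}
print(summarise(status))
```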

May 21, 2018

Kafka Connect and Oracle data types

The Kafka Connect JDBC Connector by default does not cope so well with:

NUMBER columns with no defined precision/scale. You may end up with apparent junk (bytes) in the output, or just errors.
TIMESTAMP WITH LOCAL TIME ZONE. Throws JDBC type -102 not currently supported warning in the log.

Read more about NUMBER data type in the Oracle docs.

tl;dr : How do I make it work?

There are several options:

New in Confluent Platform 4.1.1 : `numeric.mapping`

In the connector configuration, set "numeric.mapping":"best_fit" (Doc)

Avoid the problem in the first place

Change the DDL of the source object. For example:
- Refine the NUMBER’s precision and scale
- Use a TIMESTAMP type that is supported

CAST the datatypes in the `query`

Pull from the object directly, and use query in the JDBC connector (instead of table.whitelist)—and cast the columns appropriately:
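
As an illustration only, a connector config along these lines might look like the sketch below. The table, columns, connection details, and topic name are all hypothetical; the point is that query takes the place of table.whitelist and the casts give each column a type the connector can handle.

```python
import json

# Hypothetical Oracle table and columns; connection details are placeholders.
jdbc_source_config = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@oracle-host:1521/ORCLPDB1",
    "connection.user": "connect_user",
    "connection.password": "<password>",
    "mode": "bulk",
    "topic.prefix": "oracle-orders",
    # query is used instead of table.whitelist; the casts pin down the types.
    "query": (
        "SELECT CAST(ORDER_ID AS NUMBER(10,0)) AS ORDER_ID, "
        "CAST(UPDATED_AT AS TIMESTAMP) AS UPDATED_AT "
        "FROM ORDERS"
    ),
    "numeric.mapping": "best_fit",
}
print(json.dumps(jdbc_source_config, indent=2))
```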

Mar 27, 2018

Streaming Data from MongoDB into Kafka with Kafka Connect and Debezium

Disclaimer: I am not a MongoDB person. These steps may or may not be appropriate and proper. But they worked for me :) Feel free to post in comments if I’m doing something wrong.

MongoDB config - enabling replica sets

For Debezium to be able to stream changes from MongoDB, Mongo needs to have replication configured:

Docs: Replication / Convert a Standalone to a Replica Set

Stop Mongo:

rmoff@proxmox01 ~> sudo service mongod stop

Add replica set config to /etc/mongod.conf: