End-to-end tracing in event-driven architectures
It is crucial to have end-to-end tracing in event-driven architectures to understand the flow of messages and ensure proper processing. We will explore how to implement end-to-end tracing in Kafka Streams and Kafka Connect, two popular components of the Apache Kafka ecosystem. First, let's explain what makes up a trace:
OpenTelemetry
The OpenTelemetry project offers a standardised, vendor-agnostic approach to capturing telemetry data in a structured format. Its collection of APIs produces high-quality, portable data and is made available via the SDK, which is specifically designed for analysing software performance and behaviour.
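As a brief illustration, here is a minimal sketch of acquiring a tracer and recording a span with the OpenTelemetry Java API; the instrumentation name, span name, and attribute below are arbitrary examples, not anything mandated by the project.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class TracingExample {
    public static void main(String[] args) {
        // Obtain a tracer from the globally registered SDK (for example, the one
        // configured by the OpenTelemetry Java agent).
        Tracer tracer = GlobalOpenTelemetry.getTracer("example-instrumentation");

        Span span = tracer.spanBuilder("process-message").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Work done here is attributed to the span.
            span.setAttribute("messaging.system", "kafka");
        } finally {
            span.end();
        }
    }
}
```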
Stateless Processing
In Kafka Streams, stateless processing is straightforward to trace. The Java agent provides hooks to start and finish a trace span at the appropriate points in the topology. However, for consumer interceptors and wrappers, manual implementation is required. One workaround is to add a Transform node at the start of the topology to initiate the trace span. This hack lets us capture the processing of each message and close the span at the end of the topology, as shown in the sketch below. However, this approach can cause partitioning issues, so thorough testing is necessary.
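A minimal sketch of that workaround, assuming the OpenTelemetry Java API is on the classpath, might look like the following. TraceInitTransformer and the tracer and span names are hypothetical, and the span is closed immediately here rather than at the end of the topology to keep the example self-contained.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical transformer placed at the head of the topology to open a span per record.
public class TraceInitTransformer<K, V> implements Transformer<K, V, KeyValue<K, V>> {

    // Reads the propagated trace context out of the Kafka record headers.
    private static final TextMapGetter<Headers> HEADER_GETTER = new TextMapGetter<Headers>() {
        @Override
        public Iterable<String> keys(Headers headers) {
            List<String> keys = new ArrayList<>();
            for (Header header : headers) {
                keys.add(header.key());
            }
            return keys;
        }

        @Override
        public String get(Headers headers, String key) {
            Header header = headers.lastHeader(key);
            return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
        }
    };

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("kafka-streams-tracing");
    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public KeyValue<K, V> transform(K key, V value) {
        // Continue the trace started upstream, if its context is present in the headers.
        Context upstream = GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), context.headers(), HEADER_GETTER);

        Span span = tracer.spanBuilder("streams-process")
                .setParent(upstream)
                .setSpanKind(SpanKind.CONSUMER)
                .startSpan();

        // In the real workaround the span stays open until the end of the topology;
        // here it is closed immediately to keep the sketch self-contained.
        try (Scope ignored = span.makeCurrent()) {
            return KeyValue.pair(key, value);
        } finally {
            span.end();
        }
    }

    @Override
    public void close() {
    }
}
```

It would be wired in with something like builder.stream(topic).transform(TraceInitTransformer::new) ahead of the rest of the topology; note that transform can flag the stream for repartitioning, which is the partitioning concern mentioned above.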
Stateful Processing
Stateful processing in Kafka Streams becomes more complex when caching is involved. Caching in Kafka Streams can break the end-to-end tracing flow because the trace context is lost when messages are stored in the cache and not immediately processed downstream. To address this, we can use the Aggregate operation to capture the trace information. By wrapping the state store with tracing logic, we can create a link between the previous and current values in the state store. This allows us to track the flow of messages and understand how they have changed over time.
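One way to picture this is a decorator around the user's Aggregator that records a span for every aggregation step and attaches the previous and new aggregate values. This is a hedged sketch rather than an existing API, and TracingAggregator is a hypothetical name.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.apache.kafka.streams.kstream.Aggregator;

// Hypothetical decorator that records a span for every aggregation step,
// capturing the previous and new aggregate so the change is visible in the trace.
public class TracingAggregator<K, V, VA> implements Aggregator<K, V, VA> {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("kafka-streams-tracing");
    private final Aggregator<K, V, VA> delegate;

    public TracingAggregator(Aggregator<K, V, VA> delegate) {
        this.delegate = delegate;
    }

    @Override
    public VA apply(K key, V value, VA previousAggregate) {
        Span span = tracer.spanBuilder("aggregate").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("aggregate.previous", String.valueOf(previousAggregate));
            VA newAggregate = delegate.apply(key, value, previousAggregate);
            span.setAttribute("aggregate.current", String.valueOf(newAggregate));
            return newAggregate;
        } finally {
            span.end();
        }
    }
}
```

It can be dropped in wherever the DSL's aggregate call accepts an Aggregator, for example grouped.aggregate(initializer, new TracingAggregator<>(myAggregator), materialized).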
Another approach is to create a custom implementation of the state store, such as the RocksDB timestamp store. By delegating the get and put functions and adding trace information in the wrapper, we can explicitly capture read and write operations to the state store. This provides a more explicit link between the previous and current values, as sketched below.
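A simplified sketch of such a delegating store, assuming Kafka Streams 3.x and the OpenTelemetry API, could look like this. TracingKeyValueStore is a hypothetical name, and only get and put are traced here.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.StateStore;
import org.apache.kafka.streams.processor.StateStoreContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.List;

// Hypothetical wrapper that delegates to an inner store (for example the RocksDB-backed
// store) and records a span for every read and write, so state access shows up in the trace.
public class TracingKeyValueStore implements KeyValueStore<Bytes, byte[]> {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("kafka-streams-tracing");
    private final KeyValueStore<Bytes, byte[]> inner;

    public TracingKeyValueStore(KeyValueStore<Bytes, byte[]> inner) {
        this.inner = inner;
    }

    @Override
    public byte[] get(Bytes key) {
        Span span = tracer.spanBuilder("state-store-get").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return inner.get(key);
        } finally {
            span.end();
        }
    }

    @Override
    public void put(Bytes key, byte[] value) {
        Span span = tracer.spanBuilder("state-store-put").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            inner.put(key, value);
        } finally {
            span.end();
        }
    }

    // The remaining operations simply delegate to the inner store.
    @Override public byte[] putIfAbsent(Bytes key, byte[] value) { return inner.putIfAbsent(key, value); }
    @Override public void putAll(List<KeyValue<Bytes, byte[]>> entries) { inner.putAll(entries); }
    @Override public byte[] delete(Bytes key) { return inner.delete(key); }
    @Override public KeyValueIterator<Bytes, byte[]> range(Bytes from, Bytes to) { return inner.range(from, to); }
    @Override public KeyValueIterator<Bytes, byte[]> all() { return inner.all(); }
    @Override public long approximateNumEntries() { return inner.approximateNumEntries(); }
    @Override public String name() { return inner.name(); }
    @SuppressWarnings("deprecation")
    @Override public void init(ProcessorContext context, StateStore root) { inner.init(context, root); }
    @Override public void init(StateStoreContext context, StateStore root) { inner.init(context, root); }
    @Override public void flush() { inner.flush(); }
    @Override public void close() { inner.close(); }
    @Override public boolean persistent() { return inner.persistent(); }
    @Override public boolean isOpen() { return inner.isOpen(); }
}
```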
However, when caching is involved, it becomes more challenging to integrate tracing logic. Messages are stored in the cache without going through the full Processor API, making it difficult to capture the trace information. Unfortunately, there is no integration point in the DSL to handle this scenario.
Kafka Connect
In Kafka Connect, there are two types of tasks: source and sink. For source tasks, which involve reading from a non-Kafka source system, we can easily extend the tracing logic. By extracting trace information from the headers or payload, we can create a trace span for each record and link them together. This allows us to trace the flow of messages from the source system to Kafka.
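One possible shape for this is a single message transform that opens a span per source record and injects the trace context into the record headers before the record is written to Kafka. The class name, tracer name, and span name below are illustrative and not part of any existing connector.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapSetter;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.header.Headers;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.Transformation;

import java.util.Map;

// Hypothetical single message transform: start a span for every source record and
// inject the trace context into the record headers before it is written to Kafka.
public class TraceSourceRecord implements Transformation<SourceRecord> {

    private static final TextMapSetter<Headers> HEADER_SETTER =
            (headers, key, value) -> headers.addString(key, value);

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("kafka-connect-tracing");

    @Override
    public SourceRecord apply(SourceRecord record) {
        Span span = tracer.spanBuilder("source-record")
                .setSpanKind(SpanKind.PRODUCER)
                .startSpan();
        try (Scope ignored = span.makeCurrent()) {
            // Write the current trace context into the headers so the downstream
            // consumer (or sink task) can continue the trace.
            GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                    .inject(Context.current(), record.headers(), HEADER_SETTER);
            return record;
        } finally {
            span.end();
        }
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}
```

It would be enabled through the connector's transforms configuration, pointing at wherever the class is packaged.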
For sink tasks, which involve writing to a non-Kafka destination system, the tracing implementation is more complex. The consumer pulls a batch of messages, and the transform operation runs on the entire batch. To create a trace span for each individual message, we need a trace-aware connector implementation that can create traces in its put implementation. However, there is a limitation when sending multiple messages in a batch: we can only create one trace for the entire batch, which means we lose the individual trace information for each message. To address this, we can create a batch span and link it to all the individual traces from the source messages, as sketched below.
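A sketch of that batch-span-with-links idea, written as a hypothetical trace-aware sink task base class (the names and the writeBatch hook are assumptions, not part of the Connect API), might look like this:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanBuilder;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import org.apache.kafka.connect.header.Header;
import org.apache.kafka.connect.header.Headers;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical trace-aware sink task base class: one span covers the whole batch,
// with a link back to the trace of every record in that batch.
public abstract class TracingSinkTask extends SinkTask {

    // Reads the propagated trace context out of the Connect record headers.
    private static final TextMapGetter<Headers> HEADER_GETTER = new TextMapGetter<Headers>() {
        @Override
        public Iterable<String> keys(Headers headers) {
            List<String> keys = new ArrayList<>();
            for (Header header : headers) {
                keys.add(header.key());
            }
            return keys;
        }

        @Override
        public String get(Headers headers, String key) {
            Header header = headers.lastWithName(key);
            return header == null || header.value() == null ? null : header.value().toString();
        }
    };

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("kafka-connect-tracing");

    @Override
    public void put(Collection<SinkRecord> records) {
        SpanBuilder builder = tracer.spanBuilder("sink-batch").setSpanKind(SpanKind.CONSUMER);

        // Link the batch span to the individual trace of every record it contains.
        for (SinkRecord record : records) {
            Context extracted = GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                    .extract(Context.current(), record.headers(), HEADER_GETTER);
            SpanContext linked = Span.fromContext(extracted).getSpanContext();
            if (linked.isValid()) {
                builder.addLink(linked);
            }
        }

        Span batchSpan = builder.startSpan();
        try (Scope ignored = batchSpan.makeCurrent()) {
            writeBatch(records); // deliver the batch to the destination system
        } finally {
            batchSpan.end();
        }
    }

    // Connector-specific delivery to the destination system.
    protected abstract void writeBatch(Collection<SinkRecord> records);
}
```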
It's important to note that when using Kafka Connect, the standard producer tracing interceptor only takes a span from the current context. If there is no span in the context, it won't link the trace information. To overcome this limitation, we need to extend the tracing logic to extract information from headers and put it in the appropriate context for the interceptor to pick up.
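In practice that can be as simple as making the extracted context current around the send call, so the interceptor finds a span to link. The helper below is illustrative only, with the extracted context obtained from the record headers as in the earlier sketches.

```java
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustrative helper: make the extracted context current around the send so a producer
// tracing interceptor, which only reads the current context, has a span to link to.
public final class TracedSend {

    private TracedSend() {
    }

    public static <K, V> void send(Producer<K, V> producer,
                                   ProducerRecord<K, V> record,
                                   Context extractedContext) {
        try (Scope ignored = extractedContext.makeCurrent()) {
            producer.send(record);
        }
    }
}
```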
End-to-end tracing is crucial in event-driven architectures to understand the flow of messages and ensure proper processing. In Kafka Streams, tracing stateless processing is straightforward with the Java agent, but stateful processing with caching can be more challenging. Kafka Connect provides support for tracing both source and sink tasks, but there are limitations when sending multiple messages in a batch.
Implementing end-to-end tracing in Kafka Streams and Kafka Connect requires careful consideration and manual implementation in certain scenarios. However, with the right approach, it is possible to achieve comprehensive tracing and gain insights into the flow of messages in event-driven systems.
We have implemented many different types of tracing for a large number of clients. Talk to one of the OSO Kafka experts for help.
For more content:
How to take your Kafka projects to the next level with a Confluent preferred partner
Event driven Architecture: A Simple Guide
Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation
Successfully Reduce AWS Costs: 4 Powerful Ways
Kafka performance best practices for monitoring and alerting