Apache Kafka Best Practices
Marlow Navigation, a world leader in commercial ship management services, wanted to understand whether their current production Kafka architecture complied with best practice and would scale with increased demand. OSO performed a Kafka health check to provide executives with an expert review and a set of prioritised recommendations.
OSO assess Marlow Navigation’s adoption of Apache Kafka best practices
Apache Kafka Health Check Goals
- Review the current Kafka version, configuration and provide a detailed description of the overall health of the Kafka environment.
- Examine the design and deployment processes of Open Source Kafka clusters being used by Marlow Navigation teams.
- Provide a detailed list of recommendations, together with the priority and dependencies of each item.
Marlow, the complete Ship Management platform for Clients
Established in 1982, Marlow Navigation has grown to become a globally recognised and trusted name in the commercial ship management industry. Now, the platform which handles recruitment, payroll, training, health and safety for over 15,000 seafarers is undergoing a complete digital transformation to support the explosive growth in customers.
Marlow helps commercial shipping companies all over the world manage end-to-end shipping procedures and provides first-class care to their vessels. These clients rely on Marlow’s platform, both at sea and ashore to access online fleet performance monitoring and fleet analytics with predictive modelling in real-time.
Marlow Navigation undergoing digital transformation
In 2020, Marlow’s technical team started rebuilding everything from scratch with the goal of embracing a cloud native architecture. True digital transformation implies real-time capability, so Apache Kafka serves as the backbone for this next generation of platform services, greatly simplifying working with data.
The goal was to create a data driven product focused on a world-class customer experience, built in a containerized manner and leveraging canary based deployments. Marlow runs the fully managed Red Hat OpenShift container platform on AWS in order to abstract away the operational complexities of running enterprise grade Kubernetes in the cloud. In turn, the team leveraged the out of the box Strimzi Operator to deploy Apache Kafka to support this transformation and ease the load for the Marlow developers.
Challenge: Is our Kafka cluster set up correctly?
Prior to the replatforming, the Marlow team had never used or even deployed Apache Kafka brokers, so all of this was one big learning curve. Inexperience cost the team countless hours of debugging issues caused by misconfiguration or inconsistencies between environments. The clusters were shared between development teams, in some cases for durable storage and in others for high throughput messaging.
The team found vast swings in performance, producers stalling for unknown reasons and brokers becoming unresponsive at certain times each day. Consumer group lag metrics were growing unexpectedly, with seemingly no single fix: changing topic partition configuration values solved one problem but created others.
Bring in the Kafka experts: OSO On Board to benchmark Marlow against Apache Kafka best practices
The Kafka learning curve is steep, and Marlow’s developers were finding it difficult to balance the speed of developing new features with operational configuration and best practices. Ultimately, Marlow’s platform engineers didn’t have enough confidence in their current setup to move into production, so they decided to partner with OSO – The Kafka Experts!
During our four-day assessment of the configuration, operational practices and skills maturity of Marlow’s Kafka platform, we covered 136 points of analysis, scoring each on its level of compliance with best practices.
We also assessed the current production environment against the business needs (for example, topic retention must be sized against SLAs and operational requirements) and provided feedback on the issues highlighted by the Marlow team during the initial discovery phase.
OSO Kafka best practices
Here is a small example excerpt from a typical assessment:
|Area||Compliance||Comments and Recommendations|
|Is the number of topic partitions on each broker fewer than 4,000?||Compliant||Answer: Yes, from our observation we can confirm that this is correct.|
|Are log.flush.interval.messages and log.flush.interval.ms set to default values? If not, producer performance and memory usage could be degraded.||Compliant||Answer: From our observation this is currently set to the default, 9223372036854775807.|
|Is rack awareness (broker.rack) enabled for multi-rack (or in the cloud, multi-zone) clusters?||Not Compliant||Answer: No, from our observation broker.rack is not set, which means the cluster behaves as a single zone. This should be changed to multi-zone if running EC2 instances in multiple AWS Availability Zones. Recommendation: Please note that the rack awareness feature spreads replicas of the same partition across different racks, which limits the risk of data loss if all brokers in one rack or zone fail at once.|
|Are alerts working for important JMX metrics? In particular: offline partitions, under min ISR partitions, and under-replicated partitions.||Compliant||Answer: Yes, JMX metrics are pushed to Prometheus & Grafana and alerted upon within the Prometheus alert manager.|
|Are there any offline partitions, under-replicated partitions, or under min ISR partitions?||Compliant||Answer: No, we did not observe any offline partitions. Recommendation: Always ensure that there are no under min ISR, offline or under-replicated partitions in the Kafka environment.|
|Are the replication factor and min.insync.replicas set correctly for each topic?||Compliant||Answer: Yes, from our observation the brokers are running with a default replication factor of 3.|
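Checks like the ones in this excerpt can be automated once the relevant facts about each broker have been collected. The following is a minimal sketch, not OSO's assessment tooling: `BrokerSnapshot`, `check_broker` and the field names are hypothetical, and it assumes the partition count, the `broker.rack` value and any explicitly overridden broker settings have already been gathered by some other means.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_PARTITIONS_PER_BROKER = 4000  # threshold quoted in the excerpt above
FLUSH_KEYS = ("log.flush.interval.messages", "log.flush.interval.ms")


@dataclass
class BrokerSnapshot:
    """Hypothetical summary of one broker; how it is collected is up to you."""
    broker_id: int
    partition_count: int
    rack: Optional[str] = None  # value of broker.rack, None when unset
    overrides: Dict[str, str] = field(default_factory=dict)  # explicitly set configs


def check_broker(broker: BrokerSnapshot, multi_zone: bool) -> Dict[str, bool]:
    """Return one True/False compliance flag per checklist item."""
    return {
        # Fewer than 4,000 partitions hosted on this broker
        "partitions_below_limit": broker.partition_count < MAX_PARTITIONS_PER_BROKER,
        # Neither flush setting has been overridden, so defaults apply
        "flush_settings_default": not any(k in broker.overrides for k in FLUSH_KEYS),
        # broker.rack only matters when the cluster spans racks or zones
        "rack_awareness": broker.rack is not None or not multi_zone,
    }


if __name__ == "__main__":
    broker = BrokerSnapshot(broker_id=0, partition_count=1200)
    for item, ok in check_broker(broker, multi_zone=True).items():
        print(f"{item}: {'Compliant' if ok else 'Not Compliant'}")
```

Keeping each check as a plain boolean makes it straightforward to roll the results up into a per-area compliance score.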
Recommendations: A set of actionable Kafka best practices for Marlow platform engineers
Using our Kafka expertise, we wrote up all of our findings and their compliance scores, and provided detailed recommendations on how to remediate or improve each one. These recommendations were used to help Marlow’s executives formulate a strategy in conjunction with the OSO team to successfully operate Kafka at scale.
Based on our work with 20+ similarly sized organisations, we were confident in stating that Marlow Navigation had the opportunity to improve efficiency and reliability and, most importantly, reduce operational costs by adopting our recommendations. Below is an example of the high level results, which can be further broken down into detailed recommendations for non-compliant components.
1. Define a set of productionised configurations for all components of the Strimzi operator.
This configuration should be security focused, operationally resilient and meet the SLA requirements of the business. Topic creation and naming need standardisation, and the number of partitions, throughput and topic retention need to be set correctly for each use case. The OSO team recommended a phased approach to updating this configuration in a dual delivery manner, allowing us to assess and support each change incrementally.
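One way to standardise per-use-case topic settings is to keep a small set of named profiles and render each topic from them. The sketch below is illustrative only: the profile names mirror the two usage patterns mentioned earlier (durable storage and high throughput messaging), but the partition counts, retention values, topic name and cluster name are invented placeholders, and `kafka_topic_manifest` is a hypothetical helper. It assumes the Strimzi `KafkaTopic` resource under the `kafka.strimzi.io/v1beta2` API version; check the schema against the Strimzi release you run.

```python
import json

DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical defaults per use case; real values should come from your SLAs.
PROFILES = {
    "durable-storage": {"partitions": 6, "retention.ms": 30 * DAY_MS},
    "high-throughput": {"partitions": 24, "retention.ms": 1 * DAY_MS},
}


def kafka_topic_manifest(name, cluster, profile, replicas=3, min_isr=2):
    """Build a Strimzi KafkaTopic resource as a plain dict."""
    settings = PROFILES[profile]  # raises KeyError for an unknown profile
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # Tells the Topic Operator which Kafka cluster owns this topic
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": settings["partitions"],
            "replicas": replicas,
            "config": {
                "retention.ms": settings["retention.ms"],
                "min.insync.replicas": min_isr,
            },
        },
    }


if __name__ == "__main__":
    manifest = kafka_topic_manifest("crewing.payroll.v1", "my-cluster", "durable-storage")
    print(json.dumps(manifest, indent=2))
```

Because the output is valid JSON, it can be saved to a file and applied with `kubectl apply -f`, or reviewed in a pull request as part of a phased rollout.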
2. Consider adopting a managed Kafka solution like Confluent Cloud or Aiven
Kafka should be easy to use, and reading messages from Kafka topics should be transparent for each application developer. To reduce the operational burden, complexity and total cost of the platform, OSO recommends moving to a fully managed Kafka solution. Dedicated workshops are needed in order to define a clear set of requirements to adequately compare offerings.
3. Create an onboarding pipeline for applications consuming Kafka
This should include a suite of governance rules which validate things like topic naming conventions and related ACLs. The pipeline should be accompanied by documentation outlining the default number of partitions per topic, one consumer example and the expected data rate for each partition.
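A governance rule for topic names can be as small as a single function run in the onboarding pipeline. The sketch below assumes a hypothetical `<domain>.<dataset>.v<version>` convention; the pattern and the `validate_topic_name` helper are illustrations, not Marlow's actual rules. The only Kafka fact it relies on is the 249-character limit on topic names. ACL validation is left out.

```python
import re
from typing import List

# Hypothetical convention: <domain>.<dataset>.v<version>, e.g. "crewing.payroll.v1"
TOPIC_NAME = re.compile(r"[a-z][a-z0-9-]*\.[a-z][a-z0-9-]*\.v[0-9]+")
MAX_TOPIC_LENGTH = 249  # Kafka's limit on topic name length


def validate_topic_name(name: str) -> List[str]:
    """Return a list of human-readable violations; empty means the name passes."""
    errors = []
    if len(name) > MAX_TOPIC_LENGTH:
        errors.append(f"name is {len(name)} characters, limit is {MAX_TOPIC_LENGTH}")
    if not TOPIC_NAME.fullmatch(name):
        errors.append("name does not match <domain>.<dataset>.v<version>")
    return errors


if __name__ == "__main__":
    for candidate in ("crewing.payroll.v1", "Payroll_Events"):
        problems = validate_topic_name(candidate)
        print(candidate, "->", "ok" if not problems else "; ".join(problems))
```

Returning every violation at once, instead of failing on the first, gives application teams a complete list to fix in one pass.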
Result of Apache Kafka Best Practice:
Marlow’s platform engineers now have a prioritised set of detailed recommendations that they can work through. Adopting these best practices will give the executives the confidence that the platform will scale to support the needs of the business as it grows.
OSO now has a collaborative relationship with the Marlow team and a good understanding of the architecture, requirements and business needs. As a result, Marlow engineers can rely on us to provide support for any of their Kafka needs, enabling them to focus on coding valuable new features into the platform while leveraging the power of Apache Kafka best practices.
Recap: OSO’s Contributions
- Assessed Kafka operational best practices in accordance with Marlow’s business needs.
- Provided an executive strategy on prioritised non-compliant findings.
- Delivered a set of actionable recommendations for platform engineers.
- Benchmarked the level of Kafka maturity in the business.
- Highlighted the risks of major non-compliant configurations.