CASE STUDY
Migration from on-prem Kafka Clusters to AWS MSK
Executive Summary
Siteimprove A/S
Siteimprove’s application transformation journey is split into various phases. The very first phase is to migrate their core platform applications from the data centers located in the US and Europe to corresponding AWS regions. Their team, in partnership with Cornerstone Consulting Group, devised a migration strategy centered around re-platforming and rehosting their applications to AWS.
The core platform has a loosely coupled, highly distributed architecture. This is achieved by using Apache Kafka as the messaging platform, which processes more than 5,000 messages/sec between the core services. One of the key workstreams was to migrate the Apache Kafka clusters from on-prem to Amazon MSK in the US and Europe regions.
Business Impact
With a sharp eye on the business requirements and support from AWS, we successfully migrated all on-prem and self-managed AWS Apache Kafka clusters to Amazon MSK. This migration was a critical deliverable in the overall migration of their data center assets to AWS.
Post migration, the MM2 replication process and all Apache Kafka clusters (within the colos and self-managed in AWS) were decommissioned.
The Danish SaaS company is now positioned to leverage the benefits of running its applications in AWS, such as auto scaling, increased development agility, and a lower total cost of running the infrastructure.
The Solution
Minimize change and protect service
Amazon MSK was selected as the target for this solution. The primary reasons were:
- Eliminate the operational overhead of managing clusters, thereby reducing TCO.
- Seamless application migration with no code changes.
- Highly available and secure cluster provisioning within minutes with automatic cluster scaling.
The team also focused on the following points:
Producer/Consumer Dependency Mapping:
A dependency mapping chart was put together to identify the producers and consumers for various topics. This was a key exercise to ensure that the producers are migrated only after all the consumers are migrated to AWS and to prevent any data loss.
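The ordering rule above (a producer moves only after every consumer of its topics has moved) can be sketched as a small dependency sort. This is a minimal illustration, not the tooling used in the project: the topic names, service names, and the `migration_waves` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dependency map: topic -> services that produce to / consume from it.
DEPENDENCIES = {
    "page-crawled": {"producers": ["crawler"], "consumers": ["indexer", "reporter"]},
    "page-indexed": {"producers": ["indexer"], "consumers": ["reporter"]},
}


def migration_waves(dependencies):
    """Group services into ordered waves so that each producer is migrated
    only after all consumers of the topics it produces to."""
    must_follow = defaultdict(set)  # service -> services that must migrate first
    services = set()
    for topic in dependencies.values():
        producers = set(topic["producers"])
        consumers = set(topic["consumers"])
        services |= producers | consumers
        for producer in producers:
            # A service that both produces and consumes a topic does not block itself.
            must_follow[producer] |= consumers - {producer}

    migrated, waves = set(), []
    while len(migrated) < len(services):
        # Every service whose prerequisites are already migrated can go in this wave.
        wave = sorted(s for s in services - migrated if must_follow[s] <= migrated)
        if not wave:
            raise ValueError("cyclic producer/consumer dependency; plan these services manually")
        waves.append(wave)
        migrated |= set(wave)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(migration_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Services that form a cycle (each producing to a topic the other consumes) cannot be ordered this way, so the sketch raises an error for them rather than guessing an order.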
AWS MSK Capacity Planning:
The workload on the on-prem Apache Kafka clusters was analyzed and the target state MSK architecture was developed taking the service limits into consideration.
Message Replication:
The order of migration was of critical importance to ensure that the customer experience was not impacted. This meant that some consumers were migrated to Amazon MSK, but producers remained on their existing Kafka cluster. Replication was key as the messages produced on the existing Kafka clusters still had to be replicated to Amazon MSK, so that the migrated consumers could consume those messages.
The following design principles were implemented to avoid any performance impact on either the AWS Direct Connect link between the data center and AWS or the MSK cluster itself:
- Topics were migrated in a planned order so that the Kafka migration process never consumed all of the available Direct Connect bandwidth.
- All topics on Amazon MSK were created with the same configuration as on the source clusters, including replication factor and number of partitions.
- The offsets are replicated to make sure the consumers can resume processing from the next message onwards when the services are started in AWS.
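The configuration-parity principle in the list above lends itself to a simple pre-cutover check. The sketch below is an assumption-laden illustration: `TopicSpec` and `find_mismatches` are hypothetical names, the topic data is made up, and it assumes topic names are identical on both clusters. In practice the two dictionaries would be filled from each cluster's topic metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    partitions: int
    replication_factor: int


def find_mismatches(source, target):
    """Return {topic: reason} for topics that are missing on the target
    or whose partition count / replication factor differs from the source."""
    problems = {}
    for name, spec in source.items():
        if name not in target:
            problems[name] = "missing on target"
        elif target[name] != spec:
            problems[name] = f"expected {spec}, found {target[name]}"
    return problems


if __name__ == "__main__":
    # Hypothetical topic metadata for the source cluster and for Amazon MSK.
    on_prem = {
        "page-crawled": TopicSpec(partitions=24, replication_factor=3),
        "page-indexed": TopicSpec(partitions=24, replication_factor=3),
    }
    msk = {
        "page-crawled": TopicSpec(partitions=24, replication_factor=3),
        "page-indexed": TopicSpec(partitions=12, replication_factor=3),
    }
    for topic, reason in find_mismatches(on_prem, msk).items():
        print(f"{topic}: {reason}")
```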
Solution Architecture
MirrorMaker 2 (MM2) was implemented for unidirectional message replication between the on-prem and self-managed AWS Apache Kafka clusters and Amazon MSK. The MM2 service was installed on the on-prem K8S cluster and all topics and messages were replicated to Amazon MSK.
MSK Monitoring and Alerting
The Danish SaaS company leverages Datadog as the core monitoring and alerting solution across its infrastructure and application estate. There are two options to integrate Datadog with the AWS MSK service:
CloudWatch Crawler
Datadog can pull all CloudWatch metrics from the AWS account.
Datadog Agent
The Datadog Agent is software that runs on an EC2 instance. It collects events and metrics from a source, which in our case is the Amazon MSK cluster, and forwards them to Datadog. The agent was used to collect monitoring metrics from the Amazon MSK JMX port (exposed through Open Monitoring) to provide near real-time monitoring.
For the MSK implementation, an Auto Scaling group with pre-configured Datadog Agents was deployed across two AZs. The agents collected metrics from the Amazon MSK Open Monitoring port and forwarded them to Datadog.
TCO
The on-prem cluster had 200 topics with 24 partitions per topic and a replication factor of three, resulting in 14,400 partition replicas (200 × 24 × 3). To accommodate these partitions in AWS without compromising the resiliency and durability of the cluster, a highly available solution was developed consisting of four kafka.m5.4xlarge broker nodes across two AZs. Because of the hard limit of 4,000 partitions per broker node on kafka.m5.4xlarge, four broker nodes was the minimum configuration (14,400 / 4,000 = 3.6, rounded up to 4).
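The sizing arithmetic can be reproduced with a few lines of Python. This is only a sketch of the calculation described above: the `min_brokers` helper is a hypothetical name, and the 4,000-partitions-per-broker limit is the figure quoted in this case study.

```python
import math


def min_brokers(topics, partitions_per_topic, replication_factor, max_partitions_per_broker):
    """Return (total partition replicas, minimum broker count needed to host them)."""
    total = topics * partitions_per_topic * replication_factor
    return total, math.ceil(total / max_partitions_per_broker)


if __name__ == "__main__":
    total, brokers = min_brokers(
        topics=200,
        partitions_per_topic=24,
        replication_factor=3,
        max_partitions_per_broker=4000,  # per-broker limit quoted in the case study
    )
    print(f"{total} partition replicas -> at least {brokers} brokers")
```

The result is a lower bound on partition capacity only. Throughput, storage, and the number of availability zones can push the broker count higher.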