Technical - Instaclustr https://www.instaclustr.com/blog/category/technical/ Managed Services and Support for Apache Cassandra, Apache Kafka, PostgreSQL, OpenSearch, Elasticsearch and Valkey

Rebalance your Apache Kafka® partitions with the next generation Consumer Rebalance Protocol—up to 20x faster! https://www.instaclustr.com/blog/rebalance-your-apache-kafka-partitions-with-the-next-generation-consumer-rebalance-protocol/ Tue, 08 Jul 2025 12:00:34 +0000

Kafka 4.0 was recently released (March 2025) with many new features and improvements (see the release notes for more details), closely followed by general availability on the NetApp Instaclustr Managed Platform. So, what features are both new and released in general availability in 4.0? This blog focuses on KIP-848, the next generation consumer rebalance protocol—which is a bit of a mouthful—how about NGCRP for short? Maybe not… ChatGPT suggested “Kafka Turbo Rebalance” which is cool!

The next generation consumer rebalance protocol (actually called “consumer” for short in the documentation—even though it’s now managed on the server side) first appeared in Kafka 3.7.0 as an early release (release notes and instructions here). Previously, KIP-429 (the Kafka consumer incremental rebalance protocol) was the best rebalancing protocol available, having appeared in Kafka 2.4.0, replacing the original “stop-the-world” rebalancing protocol.

So, what’s the big deal?

The primary mechanism that Kafka uses to achieve high scalability is topic partitions. Each topic is split into 1 or more partitions to enable high concurrency on both the cluster and consumer sides. To achieve consumer concurrency, consumers run in a cooperative group called a consumer group. The number of consumers in a group is limited only by the number of partitions.

Each consumer is single-threaded and can be allocated 1 or more partitions, but each partition can only be allocated to a single consumer, and every consumer should have at least 1 partition. For maximum concurrency and throughput, the best case is exactly one consumer per partition. Here’s a summary of the rules:

  • >= 1 partitions per topic
  • consumers <= partitions
  • >= 1 partition per consumer (else the consumer will time out and be removed from the group)
  • Exact 1:1 mapping between partitions and consumers (i.e. every partition has a consumer, but no sharing of the same partition between consumers)

The following diagram shows a correctly balanced consumer group – each partition has exactly 1 consumer, and the partitions are as evenly distributed as possible (given the odd number of partitions, 3, and even number of consumers, 2) across the available consumers:

Kafka balanced consumer group diagram

However, the following example breaks the rules (Partition 1 is mapped to 2 consumers, so in practice Consumer 2 has no partitions, and Partition 3 is mapped to no consumers):

Kafka imbalanced consumer group diagram

However, there is a problem. To ensure that the mappings between partitions and consumers follow the rules and goals (every partition has exactly one consumer, every consumer has at least one partition, and partitions are evenly balanced across all the available consumers) there needs to be a mechanism to do the mapping initially and ensure that the mapping is still correct even after changes to the number of partitions, consumers timing out or failing, consumers being created and destroyed, etc. This is called consumer rebalancing.

Rebalancing must be performed under all these circumstances and can take significant amounts of time when there are many partitions, many consumers, or rapid changes in partitions or consumers.

For example, in the past, I’ve encountered multiple rebalancing issues and challenges. Problematic rebalancing is often detected by a reduction in throughput and an increase in consumer latency and lag during a rebalancing operation for a consumer group.

Watch out for the Kafka key “parking” problem, rebalancing storms, slow consumers, millions of partitions and ephemeral consumers!

From past experience scaling Kafka applications, there are many things that can exacerbate Kafka consumer rebalancing, including the Kafka key “parking” problem, rebalancing storms, slow consumers, millions of partitions, and ephemeral consumers.

For more information, see my Apache Kafka Anti Patterns blog.

Incremental rebalancing protocol improvements

The original rebalancing algorithm was a “stop the world” (or eager) approach—once a change requiring a rebalance was detected in a group, the consumers were all stopped, partitions were recomputed and reassigned, and finally the consumers were restarted, all controlled by the consumer group leader. There were many steps and much communication between consumers, and consumers could not process records during the rebalancing. Throughput for the group was therefore reduced to zero, and latency increased significantly during the pause in processing.

KIP-429 (Kafka Consumer Incremental Rebalance Protocol) appeared in Kafka 2.4.0 and was a significant improvement. The main advantage of this new protocol was that it was incremental, not “stop the world” like the original. It prevented unnecessary partition reassignments by using an innovative partition assignment strategy, the cooperative “sticky” assignor, which tries to preserve the previous partition assignments where possible; i.e. only the change between the old and new partition assignments needs to be revoked/migrated. And because it’s incremental, consumers can keep processing records for unaffected partitions, so there’s less impact on throughput and latency.

Note that the sticky assignor is not directly related to the other Kafka “sticky” thing, the sticky partitioner, or more completely, the strictly uniform sticky partitioner KIP-794, which distributes non-keyed batches of produced messages evenly in batches among brokers (thereby reducing producer latency).

Also note that the two original rebalancing approaches (Classic Eager, which required consumers to revoke all owned partitions for each rebalance, and Classic Cooperative, which allows a consumer to retain its partitions during rebalancing) are really just the same protocol with a different partition assignor for the Cooperative version (the CooperativeStickyAssignor). The role of the ConsumerPartitionAssignor interface is to map partition assignments for consumers, and the implementations available (in 4.0) are CooperativeStickyAssignor, RangeAssignor, RoundRobinAssignor and StickyAssignor. The ConsumerPartitionAssignor interface will be deprecated in future Kafka versions.
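
For context, here is a minimal sketch of how a classic-protocol consumer opts into the cooperative sticky assignor. It only shows the relevant properties (the broker address is a placeholder); a complete consumer example for the new protocol appears at the end of this post:

// Classic rebalance protocol with the KIP-429 incremental (cooperative) assignor
Properties props = new Properties();
props.put("bootstrap.servers", "IP:9092");
props.put("group.id", "test-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);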

The next generation Consumer Rebalance Protocol

KIP-848 (The next generation consumer rebalance protocol) takes a different approach to rebalancing consumer groups. The broker-side group coordinator now manages the rebalance process, which simplifies the client-side implementation. And now the rebalancing process is completely asynchronous with no blocking, meaning that most consumers in a group will be able to continually process records without impact during a rebalance.

There are two new (server side) assignors:

  • Range
    • org.apache.kafka.coordinator.group.assignor.RangeAssignor
    • An assignor which co-partitions topics.
  • Uniform
    • org.apache.kafka.coordinator.group.assignor.UniformAssignor
    • An assignor which uniformly assigns partitions amongst the members. This is somewhat similar to the existing “sticky” assignor (and is the default for the consumer rebalancing protocol).

Note that they are both sticky, and the goal is to minimize partition movements. These are set with the new server-side group.consumer.assignors configuration (uniform or range, the default is uniform). The previous configuration for setting partition assignment strategies (partition.assignment.strategy) is now deprecated.

20 times faster!

From watching several talks on KIP-848, including one at Current London 2025, it was apparent that the Kafka developers behind the new protocol had goals of ensuring that there were no performance regressions, that the broker CPU wasn’t significantly higher, that the new protocol was more stable, and that for some scenarios, the performance was better (e.g. see this video). This was confirmed, with both Classic Cooperative and the new protocol being more stable than the Classic Eager, and the new protocol having a more consistent throughput than either. The Cooperative protocol performs well but requires all consumers to be responsive. If there are any slow/unresponsive consumers, the new protocol performs better, with minimal impact on latencies.

For our NetApp Instaclustr Managed Kafka service, we always perform benchmarking to detect any potential performance regressions with each new Kafka release, and Kafka 4.0 was no exception. In fact, we developed some new tests specifically to test for performance regressions for the new rebalancing protocol. We found that for the scenarios where consumers are removed or added to a group, the new protocol has better latencies than the previous version. This was to be expected, given the improved completely incremental approach.

But I wondered if the new protocol could do even better in more extreme situations?

For this experiment, I created a topic with 100 partitions, and a consumer group with 10 consumers subscribed to this topic. I then increased the number of partitions for the topic to 1000 (x10) and ran the kafka-consumer-groups.sh command every second to detect how long the rebalancing takes. The results were surprising!

Rebalancing with the classic protocol took 103 seconds, but rebalancing with the new protocol took only 5 seconds – that’s 20x faster. I wasn’t expecting such a big performance improvement, particularly given that both protocols are incremental, so this is very encouraging.

Here’s the bash script I used if you would like to repeat the experiment. For the new protocol, replace --consumer-property group.protocol=classic with --consumer-property group.protocol=consumer. (Note: double check that you are using --consumer-property, as --property is not correct and will be silently ignored. I wasted several hours trying to work out why there was no difference in performance between the two protocols; the reason was that the default was being used for both experiments!)

# Create the test topic with 100 partitions
./kafka-topics.sh --bootstrap-server IP:9092 --topic test100 --create --partitions 100

# Start 10 console consumers in the same group, using the classic rebalance protocol
for c in 1 2 3 4 5 6 7 8 9 10
do
    echo "consumer: $c"
    ./kafka-console-consumer.sh --bootstrap-server IP:9092 --topic test100 --group 101 --consumer-property group.protocol=classic >/dev/null&
done

# Produce a steady load of small records in the background
sleep 10
date
echo "starting producer"
./kafka-producer-perf-test.sh --producer-props bootstrap.servers=IP:9092 --topic test100 --record-size 10 --throughput 1000 --num-records 2000000&

# Trigger a rebalance by increasing the partition count 10x
sleep 10
echo "increasing to 1000 partitions"
./kafka-topics.sh --bootstrap-server IP:9092 --topic test100 --alter --partitions 1000

# Describe the consumer group every second to observe how long the rebalance takes
counter=0
while true
do
    echo "counter is: $counter"
    date
    ./kafka-consumer-groups.sh --bootstrap-server IP:9092 --describe --group 101
    ((counter++))
    sleep 1
done

For the classic protocol, this script worked perfectly with no errors. However, for the new consumer protocol, there were some errors:

ERROR [Consumer clientId=6, groupId=101] OffsetCommit failed for member T1G-Y9lyQQKWFovmGIYHtw with stale member epoch error. 
(org.apache.kafka.clients.consumer.internals.CommitRequestManager) 
… 
Failed org.apache.kafka.common.errors.StaleMemberEpochException: The member epoch is stale. The member must retry after receiving its updated member epoch via the ConsumerGroupHeartbeat API.

Note that STALE_MEMBER_EPOCH is a new error for this protocol:

  • FENCED_MEMBER_EPOCH – The member epoch is fenced by the coordinator. The member must abandon all its partitions and rejoin.
  • STALE_MEMBER_EPOCH – The member epoch is stale. The member must retry after receiving its updated member epoch via the ConsumerGroupHeartbeat API.
  • UNRELEASED_INSTANCE_ID – The instance ID is still used by another member. The member must leave first.
  • UNSUPPORTED_ASSIGNOR – The assignor used by the member or its version range are not supported by the group.
  • INVALID_REGULAR_EXPRESSION – The regular expression used by the member is not a valid RE2J regular expression.

And I found out the hard way (by trying to use them) that some previous consumer configurations are now deprecated:

  • partition.assignment.strategy
  • session.timeout.ms
  • heartbeat.interval.ms

The heartbeat and timeouts are now set on the server side with these configurations:

  • The member uses the ConsumerGroupHeartbeat API to establish a session with the group coordinator.
  • The member is expected to heartbeat every group.consumer.heartbeat.interval.ms in order to keep its session open (default is now 5 seconds, increased from 3 seconds in Kafka 3.X).
  • If it does not heartbeat at least once within the group.consumer.session.timeout.ms, the group coordinator will kick the member out from the group.
  • group.consumer.heartbeat.interval.ms is defined on the server side and the member is told about it in the heartbeat response.
  • The group.consumer.session.timeout.ms is also defined on the server side (default is 45 seconds, the same as session.timeout.ms in Kafka 3.X).
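
For reference, here is a sketch of how those broker-side settings might appear in the broker configuration; the values shown are simply the defaults described above:

# Broker-side group coordinator settings for the new protocol (defaults shown)
group.consumer.heartbeat.interval.ms=5000
group.consumer.session.timeout.ms=45000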

Other things to be aware of:

  • There are no Consumer API changes; you use the same KafkaConsumer
  • But you do need to update Kafka client libraries to 4.0+
  • And handle any new errors and property changes (see below)
  • There is no support for Kafka Connect (and therefore MM2) or Kafka Streams yet.
  • The new protocol will provide good support for KIP-932 (Kafka queues).

The new protocol is available by default, but consumers need to opt in and make the following changes:

  • You need to add group.protocol=consumer to consumer properties
  • Remove unsupported consumer properties such as session.timeout.ms (default still 45s) and heartbeat.interval.ms (note the change in default time from 3s to 5s).
  • Handle any new error types
  • Optionally, set assignor with the consumer property group.remote.assignor
  • If you are using a NetApp Instaclustr managed Kafka service, you can request changes to server-side configurations.

Finally, here’s some example Java code for a Kafka 4.0 Consumer using the new protocol!

import org.apache.kafka.clients.consumer.KafkaConsumer; 
import java.time.Duration; 
import java.util.List; 
import java.util.Properties; 

public class Test4 { 
    public static void main(String[] args) { 
        Properties props = new Properties(); 
        props.put("bootstrap.servers", "IP:9092"); 
        props.put("group.id", "test-group"); 
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); 
        // new 4.0 rebalance protocol properties 
        props.put("group.protocol", "consumer"); 
        props.put("group.remote.assignor", "uniform"); 

        // Create the Kafka consumer 
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props); 

        // Subscribe to topics 
        consumer.subscribe(List.of("t1")); 

        // Poll for records 
        while (true) { 
            var records = consumer.poll(Duration.ofMillis(100)); 
            for (var record : records) { 
                System.out.printf("Consumed record with key %s and value %s%n", record.key(), record.value()); 
            } 
        } 
    } 
}

You also need this entry in your pom.xml file:

<dependency> 
    <groupId>org.apache.kafka</groupId> 
    <artifactId>kafka-clients</artifactId> 
    <version>4.0.0</version> 
</dependency>

Discover Kafka 4.0’s next-gen Consumer Rebalance Protocol on the Instaclustr Managed Platform and elevate your scaling efficiency. Start your free trial today – no credit card required.

Unlock seamless Elasticsearch to OpenSearch® migration with NetApp Instaclustr https://www.instaclustr.com/blog/unlock-seamless-elasticsearch-to-opensearch-migration-with-netapp-instaclustr/ Tue, 01 Jul 2025 04:00:12 +0000 Discover a seamless Elasticsearch to OpenSearch migration with NetApp Instaclustr. Learn benefits, methods, and expert-led solutions to transform your data strategy.

Introduction

Facing high licensing costs for Elasticsearch? Globally, major organizations such as Uber and Slack are transitioning to OpenSearch, an open source alternative to Elasticsearch. Now you can do the same, with hands-on help from NetApp Instaclustr, the go-to experts for seamless Elasticsearch to OpenSearch migration.

OpenSearch is an open source, distributed search and analytics engine that is transforming the way businesses handle their data. Our journey with OpenSearch began at its inception in 2021, and we have been active contributors and members of the OpenSearch community ever since. With our deep expertise and extensive experience, we have successfully managed numerous migrations from Elasticsearch to OpenSearch, helping our customers unlock the full potential of their data infrastructure.

Mastering Elasticsearch to OpenSearch migrations

Migrating from Elasticsearch to OpenSearch can be a challenging process, but with NetApp Instaclustr’s expertise, we’ll help you navigate the complexity. Our team has conducted numerous migrations, making us intimately familiar with the intricacies and best practices involved. We understand the nuances of the migration process and are equipped to handle them efficiently, minimizing downtime and ensuring data integrity.

Why choose OpenSearch over proprietary technologies?

OpenSearch offers several compelling advantages over proprietary technologies like Elasticsearch, making it an attractive option for businesses looking to avoid vendor lock-in and restrictive licenses. Here are some key reasons why customers might want to migrate to OpenSearch:

  1. Open source freedom: OpenSearch is fully open source under the Apache 2.0 License, providing transparency and flexibility without the constraints of proprietary software.
  2. Community-driven innovation: As an open source project and part of the Linux Foundation, OpenSearch benefits from the collaborative efforts of a vibrant community, driving continuous innovation and improvements.
  3. Cost-effective: Avoiding vendor lock-in means you can choose the best hosting and support options for your needs, potentially reducing costs.
  4. Customizability: OpenSearch’s modular architecture allows for extensive customization, enabling businesses to tailor their search and analytics solutions to their specific requirements.

Migration methods: Pros and cons

When it comes to migrating from Elasticsearch to OpenSearch, there are several methods available, each with its own advantages and disadvantages. Here’s a summary of the primary migration methods and their benefits and drawbacks:

  • Reindex from existing cluster
    • Description: Use Elasticsearch’s reindex API to directly index documents from the source cluster to the destination cluster (see the example request after this list)
    • Main pro: Broad version compatibility
    • Main con: Requires direct network connectivity and requires downtime
  • Snapshot/restore
    • Description: Create snapshots of indices and restore them to the destination cluster
    • Main pro: Includes cluster state
    • Main con: Requires downtime
  • Add/remove nodes (over-the-wire)
    • Description: Directly join new nodes to the existing cluster and then stop the original nodes
    • Main pro: No downtime
    • Main con: Limited version compatibility
  • Aging over (dual write)
    • Description: Configure the application to write to both source and destination clusters simultaneously (typically for data with a shorter lifespan)
    • Main pro: No downtime
    • Main con: Only works if application is capable of dual writing
  • Indexing from source of truth
    • Description: Index documents into the destination cluster from the primary source of truth (e.g., an SQL database)
    • Main pro: Low effort required
    • Main con: Requires a source of truth cluster to index from, usually requires downtime
  • OpenSearch Migration Assistant
    • Description: Works by capturing live traffic from your existing Elasticsearch or OpenSearch cluster and replaying it in a new OpenSearch environment
    • Main pro: No downtime and wide range of migration paths supported
    • Main con: Doesn’t yet fully support all OpenSearch features (such as ISM) but broader feature support is expected soon

In this list, “requires downtime” means that we must stop the application from writing during the migration process if we want to avoid data loss or inconsistency.
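
To make the first option more concrete, here is a sketch of a reindex-from-remote request issued against the destination OpenSearch cluster. The hostnames, credentials, and index names are placeholders, and the source cluster must also be allowlisted for remote reindexing in the destination cluster’s settings:

# Illustrative only: pull documents from a remote Elasticsearch cluster into the destination OpenSearch cluster
curl -X POST "https://destination-opensearch:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -u 'admin:<password>' \
  -d '{
    "source": {
      "remote": {
        "host": "https://source-elasticsearch:9200",
        "username": "elastic",
        "password": "<password>"
      },
      "index": "my-index"
    },
    "dest": {
      "index": "my-index"
    }
  }'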

Let NetApp Instaclustr help you migrate

Performing an Elasticsearch to OpenSearch migration can be a daunting prospect—especially if your cluster is business critical. Luckily, Instaclustr are the Elasticsearch to OpenSearch migration experts. We will work with you to understand your bespoke needs and devise a migration plan to ensure minimum impact on your business. We will also help during the migration process to ensure you don’t have any issues when cutting over to your new OpenSearch cluster.

Summary

Migrating from Elasticsearch to OpenSearch with NetApp Instaclustr offers a pathway to greater flexibility, cost savings, and innovation. As experts in both OpenSearch and Elasticsearch migrations, we provide the knowledge and experience needed to ensure a smooth transition. OpenSearch’s open source nature and community-driven development make it a superior choice for businesses looking to leverage powerful search and analytics capabilities without the constraints of proprietary software.

If you have any further questions or need additional assistance, feel free to reach out to our support team at support@instaclustr.com. Or sign up for a free Instaclustr trial.

Top 3 reasons to upgrade to Valkey 8.0 on the Instaclustr Managed Platform https://www.instaclustr.com/blog/top-3-reasons-to-upgrade-to-valkey-8-0-on-the-instaclustr-managed-platform/ Thu, 26 Jun 2025 12:00:03 +0000

NetApp is thrilled to announce the availability of Valkey 8, the latest major release, on the NetApp Instaclustr Managed Platform.

Valkey is an advanced open source, in-memory data structure store, used as a database, cache, and message broker. NetApp Instaclustr is a leading provider of open source data solutions, helping customers avoid the risks of vendor lock-in and unsupported software.

Valkey 8 introduces significant improvements in performance, reliability, and observability, alongside new features that enhance efficiency and user experience:

  • Performance enhancements: Valkey 8 optimizes handling of temporary set objects in SUNION and SDIFF commands, resulting in up to a 41% performance improvement for SUNION and 27% for SDIFF. It also introduces asynchronous I/O threading for improved multi-threaded performance and dual channel replication for more efficient full sync replication.
  • Reliability improvements: Valkey 8 enhances cluster reliability with improvements to slot migration and failover processes. It reduces failover time in Valkey clusters and ensures replicas handle migration correctly when receiving commands.
  • Compatibility: Valkey 8 takes command set compatibility seriously, making no backwards incompatible changes to existing command syntax or responses. However, due to changes in threading, users may need to re-evaluate their cluster’s infrastructure for optimal performance.

With every new release, Instaclustr revises our lifecycle policy for older versions which you can find here.

As with all major releases, please review the official release notes for detailed information on new features, improvements, and potential incompatibilities. Existing customers will be contacted by our support engineers shortly to arrange for upgrade timelines. For customers who want to experience a Valkey cluster on the Instaclustr Managed Platform, you can sign up and create a Valkey cluster with these new versions through a free trial.

Happy upgrading!

Mastering ClickHouse best practices: Infrastructure and operational excellence https://www.instaclustr.com/blog/mastering-clickhouse-best-practices-infrastructure-and-operational-excellence/ Wed, 25 Jun 2025 12:00:09 +0000 https://www.instaclustr.com/?p=16777 Master ClickHouse with key best practices! Optimize infrastructure, boost performance, and achieve operational excellence. Explore expert tips now!

As organizations increasingly seek high-performance analytic database solutions, ClickHouse stands out as a leading choice for its speed and efficiency. To help you get the most out of ClickHouse, we’ve compiled a comprehensive guide of best practices. These recommendations cover various aspects of ClickHouse deployment and usage, ensuring that your implementation is both robust and efficient. This article is the first of a two-part blog series where we will cover topics relating to infrastructure and operational excellence.

ClickHouse infrastructure essentials

Unlocking CPU potential

ClickHouse is heavily multi-threaded, benefiting significantly from multiple CPU cores. It’s recommended to have at least four cores to handle parallel data processing efficiently, each with a high clock speed and strong single-threaded performance, as ClickHouse thrives on fast cores.

Modern CPUs with advanced vector extensions (like AVX2 or AVX-512) can significantly speed up ClickHouse operations, so ensure your hardware supports these extensions. Additionally, configuring the max_threads setting to match the number of CPU cores can optimize CPU utilization, making your ClickHouse deployment even more efficient.

The Instaclustr for ClickHouse service is already designed with these recommendations in mind, allowing you to focus on getting the best from your ClickHouse deployments.

Mastering memory for peak performance

ClickHouse performs most operations directly in memory, making ample RAM crucial. This is even more important if you are using integrations like AWS S3 or Apache Kafka®, and we suggest you consider using even higher-memory machines, as these are high-throughput, streaming-heavy integrations that require plenty of memory to handle buffered writes and reads efficiently.

To prevent queries from hogging all the memory, use settings like max_memory_usage, max_bytes_before_external_group_by, and max_bytes_before_external_sort to limit memory usage per query (see Restrictions on Query Complexity for details). Similarly, setting global memory usage limits per query (max_memory_usage) and overall (max_memory_usage_for_all_queries) can prevent out-of-memory situations. However, be cautious as these settings can impact query performance.

Depending on the frequency and nature of queries, adjust the maximum number of concurrent queries allowed using max_concurrent_queries. This will prevent excessive memory usage and improve overall query performance.
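
As a rough sketch of where these guard rails live, the per-query limits are typically set in a settings profile in users.xml, while max_concurrent_queries is a server-level setting in config.xml (the values below are purely illustrative, not recommendations):

<!-- users.xml: per-query memory limits in a settings profile (illustrative values) -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <max_bytes_before_external_group_by>5000000000</max_bytes_before_external_group_by>
            <max_bytes_before_external_sort>5000000000</max_bytes_before_external_sort>
        </default>
    </profiles>
</clickhouse>

<!-- config.xml: server-wide limit on concurrently executing queries (illustrative value) -->
<max_concurrent_queries>100</max_concurrent_queries>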

Use data skipping indexes to reduce the amount of data read and processed, saving memory. Ensure memory overcommit is not disabled (cat /proc/sys/vm/overcommit_memory should be 0 or 1) to allow the system to allocate more memory than is physically available.

Avoid using swap space, as frequent memory swaps can severely degrade performance; ensure you have sufficient physical memory to handle workloads.

Certain operations like GROUP BY, JOIN, and large array functions can be particularly memory-intensive, so understand their implications and size your system appropriately or adjust queries to manage memory usage.

Turbocharging disk I/O

For optimal I/O performance, preferably use SSDs. In cloud deployments, choose IO-optimized instances. This will ensure data can be written to and read from disk fast enough for queries to run performantly. Lower-performance storage, such as magnetic HDDs or object storage, is better suited for cold storage tiers.

Adequate disk space planning is essential to accommodate data growth and temporary files generated during query execution and merges. Ensure you have more than enough disk space to handle your anticipated data volumes and operations.

Instaclustr for ClickHouse managed service includes support for Tiered Storage allowing you to use predefined patterns for offloading some data to remote storage.

Network secrets for fast and secure queries

Deploy high-speed network interfaces (10GbE or higher) to avoid network throughput bottlenecks, which is crucial for processing distributed queries and replication between nodes. Nodes supported with Instaclustr for ClickHouse have such high-speed network interfaces.

Secure communication is also vital, so use encryption (TLS/SSL) for data in transit, especially if your ClickHouse servers are distributed across different data centers or cloud regions.

Keeping a cluster within a virtual private network without an Internet Gateway makes it inaccessible from the Internet and thereby more secure. NetApp Instaclustr allows you to provision Private Network Clusters, where the setup and configuration of such clusters is all managed for you.

Operational excellence

Version management and updates

It is essential to keep your ClickHouse version up to date. However, upgrading a cluster can be complex. There are several factors that you should carefully consider, including potential breaking changes in the new version, selecting the most appropriate version for the upgrade, thoroughly testing the new version prior to implementation, as well as minimizing human errors. You may also need to ensure the cluster remains available throughout the upgrade process, especially if high availability is critical. You can achieve this by configuring your cluster to have replicas for each shard and upgrading one node per shard at a time.

Additionally, if you also have a Network Load Balancer set up, client connections will be automatically routed to the remaining replicas ensuring high availability. On the other hand, it may be troublesome to keep up with the continuous stream of minor patch releases with important bug fixes and critical security updates. Hence, you should determine an optimal frequency of updates that is practical to maintain while not falling too far behind.

Instaclustr performs regular updates to our managed ClickHouse clusters so that customers can be confident about application security and reliability. At any given time, we support multiple ClickHouse versions in our managed fleet, with a full list available here.

Load balancing like a pro

Implement load balancing to distribute query loads evenly across cluster nodes, either through application logic or dedicated load balancers. This not only optimizes performance but also provides high availability by ensuring client requests are not forwarded to unhealthy or offline nodes. Load balancers add an additional layer of health check, making node replace or resize operations transparent to clients. (On a related note, ClickHouse drivers for most popular languages also support client-side load balancing and failover.)

With Instaclustr for ClickHouse, load balancing is another feature we provide which can be used by simply checking a box when creating a new cluster. Read more about it here.

ClickHouse Keeper best practices

ClickHouse Keepers must be deployed in an ensemble with an odd number of nodes (3 or 5) for high availability and fault tolerance through quorum-based decision making. Running ClickHouse Keeper on separate nodes from your ClickHouse servers helps isolate the coordination workload from query processing, benefiting performance and stability, and is recommended for production workloads. When deployed on independent nodes, with Instaclustr for ClickHouse, your ClickHouse cluster may be eligible for higher uptime SLAs.

Ensure ClickHouse Keeper deployments are updated and compatible with your ClickHouse version and maintain quorum by promptly repairing or replacing downed nodes. Use SSDs for Keeper coordination service due to its sensitivity to disk write latency.

Where to go from here

In this article, we’ve explored some of the essential best practices to optimize your ClickHouse deployment, providing you with valuable insights to maximize the performance and efficiency of your setup. However, we recognize that the implementation and management of these practices can be both complex and time-consuming.

So, why manage it all yourself? With the Instaclustr for ClickHouse Managed Platform we have the right expertise to handle these complexities for you, allowing you to focus on your core business needs. Our team ensures that you benefit from a curated and carefully selected hardware and network configuration tailored to your specific requirements.

There is a lot we can take off your shoulders. To name a few—we take care of regular patching and migration management, so you do not have to worry about staying up to date with the latest updates and security fixes. Additionally, we provide comprehensive monitoring and alerting to keep your ClickHouse deployment running smoothly. Our out-of-the-box, click-and-go, production-ready setups mean you can get started quickly and efficiently. With Infrastructure as Code (IaC) support, we ensure that your infrastructure is managed consistently and reliably.

Let NetApp Instaclustr manage your ClickHouse deployment, so you can concentrate on what you do best—growing your business. Ready to implement these ClickHouse best practices?

Instaclustr for OpenSearch® 3.0: Experience the next level of search and analytics https://www.instaclustr.com/blog/instaclustr-for-opensearch-3-0-experience-the-next-level-of-search-and-analytics/ Mon, 23 Jun 2025 14:00:58 +0000 https://www.instaclustr.com/?p=16581 Boost search & analytics with Instaclustr for OpenSearch 3.0. Explore AI/ML upgrades, seamless scaling, & unmatched performance. Experience a smarter workflow.

OpenSearch® 3.0 is here, bringing game-changing improvements to search performance, AI/ML capabilities, and operational efficiency. NetApp, the leading provider of multi-cloud OpenSearch managed services, is delighted to announce the general availability of OpenSearch 3.0 on the NetApp Instaclustr Managed Platform. OpenSearch, an Apache 2.0 licensed, open source search and analytics engine, has taken a significant leap forward with version 3.0. Let’s explore how this new release and our platform make the perfect match.

Why choose Instaclustr for OpenSearch?

As founding and general members of the OpenSearch Software Foundation and active participants in the OpenSearch community, we are particularly excited about this release—the first major update in nearly three years. Here are some key reasons why you should be excited too:

Enhanced search performance

The big news is that Apache Lucene™ (the underlying search engine library of OpenSearch) is getting a version upgrade to v10. There has also been a significant focus on improving overall performance in OpenSearch 3.0 through streamlining the code base. This has paid big dividends with OpenSearch 3.0 performing an impressive 8.4 times faster than OpenSearch 1.3 on the Big5 benchmark performance aggregate.

What does this all mean for you? It means a faster, leaner, and more efficient OpenSearch helping you find your data quicker while consuming less resources. This means better performance, while adding less to your bottom line. The best thing about this? Upgrading to OpenSearch 3.0 is completely free!

AI and ML improvements

AI and ML are playing a big role in the development of OpenSearch, making it one of the leading vector engines in the market. A big focus of OpenSearch 3.0 has been to improve the usability and performance of AI and ML to enhance the search experience. In particular, k-NN and Neural Search have been boosted by the introduction of parallelized execution of search queries.

What this means is that you can more easily introduce AI/ML functionality into your OpenSearch deployment, making search results more personalized to the end user. This significantly improves usability and makes OpenSearch one of the easiest databases for building AI/ML workflows that power your business.

Operational enhancements

With OpenSearch 3.0, one of the primary goals was to improve the experience of those who need to operate and administer OpenSearch clusters at scale. Notably, the OpenSearch Discover tool, which allows cluster operators to find and visualize their data, has been completely rebuilt. Additionally, the SQL plugin has been improved, and the JavaScript client has been bumped to version 3.0—bringing a range of improvements across the board.

Why should you care about operational improvements? If your DevOps teams have more and better tools at their disposal, they will be able to more effectively diagnose operational problems and resolve them quicker. In turn, this leads to decreased downtime and business impacts for your organization.

Challenges of managing OpenSearch 3.0

While OpenSearch 3.0 offers many new capabilities and opportunities for DevOps and IT teams, managing it presents a set of challenges that need to be considered. These challenges are nuanced and present significant complexities that businesses often underestimate. For example, recruiting or upskilling employees with niche OpenSearch knowledge is both time-consuming and resource-intensive; regular maintenance takes significant IT resources and can leave teams overwhelmed if sufficient processes aren’t in place; and managing version upgrades and patching without disrupting workflows demands foresight, testing, and time that many teams find difficult to spare. For organizations that need assistance, NetApp Instaclustr provides answers to these challenges and more.

Key features of Instaclustr for OpenSearch

This release of OpenSearch means our customers will be some of the first to enjoy the benefits of this exciting new release.

  • Fully managed service: Eliminate operational complexities as we handle provisioning, configuration, maintenance, and updates.
  • Multi-cloud support: Deploy OpenSearch seamlessly on AWS, GCP, Azure, or on-premises with our dynamic provisioning system.
  • Zero-downtime upgrades: Transition smoothly to OpenSearch 3.0 without worrying about downtime or compatibility issues.
  • Enhanced security: Keep your data protected with our robust encryption, monitoring, and compliance measures.
  • Customizable deployment options: Scale your deployment to match your business needs with our flexible hosting solutions.

All you have to do is focus on your business applications, and we’ll do the rest!

Getting started with Instaclustr for OpenSearch

Whether you’re an existing OpenSearch user or considering migration from Elasticsearch, Instaclustr for OpenSearch can get you up and running effortlessly.

  • For OpenSearch users
    • Current customers can upgrade directly to 3.0 with our support. You’ll be among the first to enjoy the benefits of this release. Until then, why not spin up an OpenSearch 3.0 cluster with a free trial on our platform or contact our team for more information.
    • Running self-managed OpenSearch? We can also help! We have extensive experience in OpenSearch and have deep knowledge about OpenSearch 3.0 and how to upgrade with minimal impact.
  • For Elasticsearch users
    Thinking of switching to a fully open source alternative? Our expert migration team will ensure a smooth transition with minimal risk. Contact our team today to discuss your business use case for migration.

Final thoughts

OpenSearch 3.0 is a breakthrough release, and Instaclustr for OpenSearch ensures you capitalize on every feature without the operational burden. From blazing-fast search to AI-powered personalization, we’re here to simplify your data strategy.

Take the first step towards scalable, efficient search today.

Try It Now: Spin up the latest OpenSearch 2.x with a free trial on Instaclustr while planning your upgrade to 3.0.

PostgreSQL® is now available in PCI mode for the NetApp Instaclustr Platform https://www.instaclustr.com/blog/postgresql-is-now-available-in-pci-mode-for-the-netapp-instaclustr-platform/ Mon, 23 Jun 2025 12:00:28 +0000 https://www.instaclustr.com/?p=16763 NetApp Instaclustr is now PCI Certified for PostgreSQL on AWS and GCP.

Announcing PCI Certification for PostgreSQL® on the Instaclustr Managed Platform.

NetApp Instaclustr is now PCI Certified for PostgreSQL on AWS and GCP. PostgreSQL—a powerful, open source, object-relational database system with a strong reputation for reliability, feature robustness, and performance—is now even more secure with the addition of PCI-DSS (Payment Card Industry Data Security Standard) certification.

The PCI-DSS certification for PostgreSQL enhances Instaclustr’s existing PCI-DSS and SOC 2 accreditations, underscoring our unwavering commitment to rigorous data security practices and architecture. Enterprise customers with stringent security requirements can now leverage Instaclustr’s advanced managed open source data solutions, assured that their environment complies with the most stringent security protocols.

PCI-DSS is the payment card industry’s mandated information security standard and applies to all organizations that store, process, and/or transmit cardholder data. PCI-DSS certification requirements dictate that all system components, either within the cardholder data environment or with access to it, must feature specific and strict technical, physical, and operational security controls. Beyond applications directly involving finance and payment card transactions, PCI-DSS controls are increasingly adopted across various industries as the recognized “gold standard” for security practices. Instaclustr has recently achieved the latest PCI-DSS 4.0 compliance.

“We remain proactive in ensuring that the data technologies we deliver for enterprise customers are always held to the highest security standards,” said Ben Slater, General Manager, NetApp Instaclustr. “Instaclustr Managed PostgreSQL now joins our other offerings in achieving PCI-DSS certification.”

Customers requiring full PCI-DSS compliance can opt-in when creating a PostgreSQL cluster on the NetApp Instaclustr Managed Platform to enable the required security options (for example, password complexity in the Instaclustr console). There are additional customer responsibilities involved to achieve full compliance. For more detailed information, please visit the Instaclustr support page.

Instaclustr customers with existing PostgreSQL clusters who are interested in PCI-DSS compliance should contact support@instaclustr.com for options on achieving PCI compliance on their existing clusters. For further information or a technical briefing, contact an Instaclustr Customer Success representative or sales@instaclustr.com.

Thank you for your continued trust and partnership as we strive to provide the most secure and reliable managed open source solutions.

Exploring and creating engaging OpenSearch Dashboards in OpenSearch® 3.0 https://www.instaclustr.com/blog/exploring-and-creating-engaging-opensearch-dashboards-in-opensearch-3-0/ Tue, 17 Jun 2025 12:00:25 +0000

OpenSearch Dashboards are a powerful tool for visualizing and interacting with your OpenSearch data. With version 3.0, new features and enhancements have made it easier than ever to create, customize, and use dashboards for better data-driven decisions. Whether you’re a beginner or looking to level up your skills, this guide walks you through everything you need to know about OpenSearch Dashboards, from setup to visualization.

You can follow along in one of two ways:

  1. Sign up for the NetApp Instaclustr free trial and spin up an OpenSearch 3.0 preview cluster: Instructions for spinning up an OpenSearch cluster can be found here, just change the OpenSearch version to 3.0 (Public Preview)—you may need to scroll down in the version list dropdown:
    (Note: You’ll populate your cluster with sample data in a later step.)
  2. You can use the OpenSearch playground—it has all the data pre-populated that you’ll need.

selecting OpenSearch in the NetApp Instaclustr console

Picture of the NetApp Instaclustr cluster creation screen, with OpenSearch 3.0 selected

When you navigate to your OpenSearch cluster and log in, or go to the playground cluster, you’ll be greeted with two giant buttons: one for Dashboards and one for Discovery. Discovery is outside the scope of this article, but think of it as a query builder for OpenSearch data.

Opensearch walkthrough screenshot

Below are options to populate data on your OpenSearch cluster. If you’re using Instaclustr-hosted OpenSearch, you’ll want to click ‘Add sample data’ on the bottom left:

Opensearch walkthrough screenshot

The following screen will appear, which offers a few different test data sets:

Opensearch walkthrough screenshot

Scroll down to the ‘Sample Web Logs’ data set and click the ‘Add Data’ button:

Opensearch walkthrough screenshot

Once it’s done, click ‘View Data’ and you’ll be greeted with an OpenSearch Dashboard:

Opensearch walkthrough screenshot

To get to this page on the playground, when you click ‘Dashboard’, you’ll see a list of Dashboards:

Opensearch walkthrough screenshot

Click on [Logs] Web Traffic to see the same dashboard.

What is an OpenSearch dashboard?

It may look like an overwhelming amount of information at first, but OpenSearch Dashboards can be broken down into components called Panels. Panels contain visualizations of your OpenSearch data:

Opensearch walkthrough screenshot

Metrics:

Opensearch walkthrough screenshot

Markdown test:

Opensearch walkthrough screenshot

And a control panel (newly added) that allows you to manipulate the dashboard from the UI:

Opensearch walkthrough screenshot

Try playing with the controls in the panel to see how they change the visualizations in real time.

Now that you’ve learned about the panels of a Dashboard and played around with a visualization panel, it’s time to make your first dashboard.

Making your own dashboard

On the Dashboards page (you can click ‘Dashboards’ in the upper left corner to get there) you’ll click the ‘Create Dashboard’ button in the upper right corner:

Opensearch walkthrough screenshot

This will take you to a blank dashboard, where you’ll click the ‘Create New’ button in the dashboard:

Opensearch walkthrough screenshot

In the pop-up window that comes up, select the ‘Area’ chart.

Opensearch walkthrough screenshot

Then for the data source, select opensearch_dashboards_sample_data_logs.

Note: if you are using the playground there will be multiple sources in this list.

Opensearch walkthrough screenshot

Once you’ve chosen your source data, you’ll be taken to the visualization editor screen:

Opensearch walkthrough screenshot

Now this doesn’t look very informative. To enhance this, you’ll add to the Metrics and Buckets boxes on the right side of the screen. Think of them as your Y and X axis data buckets, respectively. To start, in the Metrics window, click the arrow next to ‘Y-axis count’ and change the aggregation from count to average. Then, change the field to ‘bytes’. It should look like the following:

Opensearch walkthrough screenshot

Now you’ll add a date histogram to the ‘Buckets’ section. To do so, click the ‘Add’ button in the Buckets pane. You should see something like this:

Opensearch walkthrough screenshot

Click ‘X-axis’. You’ll see an Aggregation dropdown, select ‘Date Histogram’ from the list:

Opensearch walkthrough screenshot

Leave everything else on default. At the bottom right of the screen, hit the ‘Update’ button to update with the new data buckets:

Opensearch walkthrough screenshot

Now that’s a graph! Hit ‘Save’ at the very top right of the page to save your visualization and, after giving your visualization a title (I used ‘Average Bytes over Time’), hit Save to return to the dashboard with your new area chart:

Opensearch walkthrough screenshot

Next, you’ll create a control to manipulate the data in the area chart you just created. Click ‘Create New’ in the dashboard to create a new panel, and when selecting a type select ‘visualization’. For this panel, use the Controls type:

Opensearch walkthrough screenshot

You’ll be greeted with the same style of editing screen with different options:

Opensearch walkthrough screenshot

You’re going to add two controls: one for request byte size and one for request country. First, the request byte size slider: under ‘Controls’ in the right pane, select ‘Range Slider’ in the dropdown, and select the ‘Add’ button to the right of the dropdown.

Opensearch walkthrough screenshot

Fill out the form: for ‘Control Label’, I put ‘Request byte size’, for the index pattern I selected the opensearch_dashboards_sample_data_logs data set. Then for Field, put ‘bytes’:

Opensearch walkthrough screenshot

Click ‘Update’ at the bottom of the page to see the slider:

Opensearch walkthrough screenshot

You can now filter out requests based on their size in bytes. Now you’ll add a dropdown to select the country the request was from. To start, in the right pane, look at the bottom of the pane for the ‘Add’ button, and in the dropdown to the left of it, select ‘Options List’ and click the ‘Add’ button.

Opensearch walkthrough screenshot

You’ll see a few options for this Options List object:

Opensearch walkthrough screenshot

For ‘Control Label’, put “Request Origin Country”, and for index pattern, select opensearch_dashboards_sample_data_logs, and for field, select geo.src. Hit Update in the lower right corner to see the new control:

Opensearch walkthrough screenshot

Hit ‘Save’ in the upper right corner of the screen twice; once for the control panel and once for the dashboard itself. You’ll be taken to your new dashboard:

Opensearch walkthrough screenshot

Play around with the controls, and hit ‘Apply Changes’ in the control panel to make changes to the visualization on the left:

Opensearch walkthrough screenshot

Wrapping it up

OpenSearch Dashboards are a gateway to making your data more accessible, interactive, and actionable. From setting up your first dashboard to mastering advanced controls, OpenSearch 3.0 empowers users to visualize data like never before.

Don’t wait to leverage these powerful features. Start exploring OpenSearch Dashboards today with the NetApp Instaclustr free trial or OpenSearch playground and unlock key insights from your data. Plus learn more about the OpenSearch 3.0 release on the NetApp Instaclustr blog.

Instaclustr for Apache Kafka® and Kafka® Connect 4.0 are generally available https://www.instaclustr.com/blog/instaclustr-for-apache-kafka-and-kafka-connect-4-0-are-generally-available/ Mon, 02 Jun 2025 12:00:56 +0000

NetApp is announcing the general availability of Instaclustr for Apache Kafka and Kafka Connect 4.0.0 on the NetApp Instaclustr Managed Platform.

Kafka 4.0.0, announced by the Kafka project in March 2025, was the first major release in over 3 years and introduces a few significant changes. A full list of inclusions and changes can be found in the release notes.

Some of the significant changes are:

  • KAFKA-17611: This is the first major release of Kafka that works exclusively with KRaft, not requiring Apache ZooKeeper. This significant change comes some 10 years after Kafka started making use of ZooKeeper for metadata management. What this means for our Instaclustr for Kafka customers:
    • New Kafka clusters being created on Kafka 4.0.0 and later versions will by default run in KRaft mode, and there will be no option to use ZooKeeper. This will happen by default; Customers do not need to take any particular action.
    • Existing Kafka clusters on Kafka versions 3.x must first be upgraded to Kafka 3.9.x and can then be migrated from ZooKeeper to KRaft. Once that’s been successfully completed, these clusters can be upgraded to Kafka 4.x and later. Note: KRaft mode clusters on our managed platform are only supported with current Instaclustr API versions (APIv2 and later). If you’re using the deprecated APIv1, a migration to a current API version will need to be undertaken before planning a move to KRaft. Please reach out to our support team to get started working out the details for the Kafka version upgrade and any required migrations.
  • KIP-848: The next-generation consumer rebalance protocol has been released to general availability. This new protocol improves rebalance performance by reducing downtime and latency, especially for large deployments. It is enabled by default on Kafka 4.0 servers, but consumers need to explicitly add the setting group.protocol=consumer.
  • KIP-932: With the early access release of queue support, the Kafka project has added support for share groups, making it more suitable for additional use cases, for example ones where messages are independent work items that do not require sequential ordering. Share groups allow multiple member consumers to process messages from the same partition and allow more active members than there are partitions, enabling increased parallelism. Instaclustr for Kafka customers wanting to try this feature out are advised to get in touch with our support team, noting this feature is not yet recommended for production use and isn’t yet covered with production SLAs.

There are several other changes we advise customers should evaluate the impacts of in a non-production environment prior to planning a move to Kafka 4.0:

  • Removal of previously deprecated APIs: As in past major releases, this release removes APIs that have been deprecated for over 12 months. We advise all Kafka customers planning to deploy Kafka 4.0 clusters to ensure they have removed any dependency on the affected APIs. The removals are tracked under, among others: KAFKA-18262, KAFKA-18264, KAFKA-18289, KAFKA-18290, KAFKA-18291, KAFKA-18292, KAFKA-18293, KAFKA-18294, KAFKA-18295, KAFKA-18296, KAFKA-18348, KAFKA-12822, KAFKA-12690, KAFKA-15387, KAFKA-15907, KAFKA-16188, KAFKA-16769. For an exhaustive list, please refer to the Kafka 4.0 release notes.
  • KIP-750, KIP-1013 and KIP-1032: Kafka brokers, tools, Kafka Connect and dependent modules (such as MirrorMaker 2) now require Java 17, dropping support for older Java versions. This change does not affect the remaining modules (for example, Kafka clients and Kafka Streams), which continue to support Java 11.
  • KIP-896: To reduce the cost of maintaining support for older protocol API versions, this release raises the minimum supported broker and client versions: the new baseline for protocol API versions is Kafka 2.1. Customers should ensure brokers are at least version 2.1 before upgrading clients to 4.0, and that their Java client version is at least 2.1 before upgrading brokers to 4.0 (see the version check sketch after this list). More details on upgrading Kafka clients are published in KIP-1124. Alternatively, our customers can open a support ticket to discuss this further with our Kafka experts.
  • KIP-724: Writing with message formats v0 or v1 is no longer supported in Kafka 4.0.
  • KIP-1030: The default values or constraints of some configs have changed. Please refer to the KIP for the full list of changes, keeping in mind that some already apply in Kafka 4.0 while others are planned for Kafka 5.0. Where the new defaults do not suit your workload, please let us know and we can change them to something more appropriate.
  • For other deprecations and changes, please review the Kafka 4.0 release notes and upgrade documentation.
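
As one way to sanity-check protocol compatibility before upgrading (see KIP-896 above), the kafka-broker-api-versions.sh tool that ships with Kafka reports the protocol API versions each broker supports. The command below is a sketch; the endpoint is a placeholder and you would run it from a host with connectivity to the cluster.

# List the protocol API versions supported by each broker (endpoint is a placeholder)
kafka-broker-api-versions.sh --bootstrap-server <your-kafka-endpoint>:9092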

With this new release, we have reviewed and updated the lifecycle states of our older supported Kafka versions (as per our lifecycle policy). To ensure you get the full benefit of our support and SLAs, please reach out to us and request an upgrade to a GA version. To stay on a supported version of Kafka, we advise customers to upgrade their Kafka versions at least once a year to the latest GA version. More details on all Kafka versions supported on our managed platform are available here.

Prior to deploying a new Kafka version in production, we recommend trying your preferred Kafka version in a non-production environment to confirm compatibility with your Kafka clients. If you need any help spinning up a new Kafka cluster or upgrading existing managed Kafka clusters, please feel free to get in touch with us via our support website.

The post Instaclustr for Apache Kafka® and Kafka® Connect 4.0 are generally available appeared first on Instaclustr.

]]>
OpenSearch® performance benchmarking for cross-cluster replication (CCR) https://www.instaclustr.com/blog/opensearch-performance-benchmarking/ Tue, 13 May 2025 12:00:35 +0000 https://www.instaclustr.com/?p=16565 OpenSearch's Cross-Cluster Replication (CCR) introduces a solution by replicating data across clusters to improve disaster recovery and availability. To understand the real-world impact of CCR, we performed detailed OpenSearch performance benchmarking, focusing on metrics like CPU usage, throughput, latency, and replication lag.

The post OpenSearch® performance benchmarking for cross-cluster replication (CCR) appeared first on Instaclustr.

]]>
When dealing with large-scale deployments, ensuring data reliability and minimal performance impact can be challenging. OpenSearch’s Cross-Cluster Replication (CCR) introduces a solution by replicating data across clusters to improve disaster recovery and availability. To understand the real-world impact of CCR, we performed detailed OpenSearch performance benchmarking, focusing on metrics like CPU usage, throughput, latency, and replication lag.

This blog explores our key findings, providing actionable insights for anyone looking to optimize their OpenSearch clusters with CCR.

Why test OpenSearch performance for CCR?

Cross-cluster replication enables secondary clusters (followers) to replicate data from a primary cluster (leader). While it increases redundancy, it’s crucial to measure its effect on cluster performance. Our two goals were:

  1. Assess how CCR impacts OpenSearch cluster performance.
  2. Measure replication lag to evaluate its reliability under varying workloads.

This study evaluates the performance impact of cross-cluster replication (CCR) on OpenSearch clusters, focusing on CPU usage, indexing throughput, latency, and replication lag. Using OpenSearch 2.16.0, we established a baseline without CCR and assessed the effects of active replication with 1, 2, and 3 follower clusters. Key findings include:

  • Leader CPU usage: Increased by 12.41% during indexing and 9.45% overall with 3 followers
  • Indexing throughput: Decreased by 3.42% with 3 followers
  • Indexing latency: 50th percentile latency increased by 1.61% and 90th percentile latency by 3.91% with 3 followers
  • Search throughput and latency: Remained largely unaffected
  • Replication lag: Averaged 4380.68 operations during indexing, consistently less than 20 seconds behind the leader

These results suggest that CCR can be effectively implemented to enhance data redundancy and availability without significant degradation of service, supporting its use in production environments for improved disaster recovery and data distribution.

Testing methodology

To ensure accurate and reliable OpenSearch performance benchmarking, we first established baseline performance for clusters without CCR. We then incrementally tested with one, two, and three follower clusters using OpenSearch Benchmark and the PMC workload (574,199 academic documents). Tests were conducted in the AWS us-east-1 region with r6g.xlarge instances running OpenSearch 2.16.0. We had two goals for this performance test:

  • Evaluate the performance impact of an OpenSearch cluster being a cross-cluster replication leader with active replication occurring and see how the performance impact scales with multiple followers. This helped us understand what sort of performance headroom is required to safely add replication to a cluster’s workload.
  • Monitor the replication lag between a leader and its followers. For cross-cluster replication, this is quantified by the number of replication operations by which a follower lags behind the leader. This analysis provided insight into acceptable levels of replication lag and identified when such lag might indicate potential issues.

Baseline for OpenSearch performance benchmarking

First, we established a performance baseline of a cluster without cross-cluster replication and performed 3 test runs using the OpenSearch Benchmark tool. The OpenSearch Benchmark configuration remained the same across all test runs (baseline, 1, 2, and 3 followers). We chose the following OpenSearch Benchmark configuration:

Parameter | Description | Chosen value | Rationale
Workload | Pre-defined dataset and sequence of OpenSearch operations to run against that dataset | Built-in PMC workload that contains academic medical articles | The PMC workload represents a general-purpose use case, with an appropriate mix of indexing and searching tasks for our testing
Bulk indexing clients | Number of clients OpenSearch Benchmark will use for indexing tasks | 1 | Chosen via preliminary testing, to ensure baseline CPU usage was as close to 50% as possible
Search clients | Number of clients OpenSearch Benchmark will use for search tasks | 3 | Chosen via preliminary testing, to ensure baseline CPU usage was as close to 50% as possible
Target search throughput | Number of search requests per second that OpenSearch Benchmark will attempt to perform across all clients combined | 50 | Chosen via preliminary testing, to ensure baseline CPU usage was as close to 50% as possible

These parameters were chosen so the baseline test would not exceed 50% mean CPU usage, to allow headroom for adding followers in later tests without affecting cluster health.

All testing was performed in the AWS us-east-1 region, using OpenSearch 2.16.0 (the latest version at the time this testing was performed), and 3 cluster nodes using the r6g.xlarge instance size. The OpenSearch Benchmark tool was run from a dedicated r6g.xlarge instance in a different VPC but the same region as the test cluster to minimize network latency as a factor.
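
For readers who want to set up a similar run, the command below is a minimal sketch of how an OpenSearch Benchmark test against a remote cluster can be invoked. The endpoint, credentials, and workload parameter names are assumptions for illustration and may differ from the exact options we used.

# Run the PMC workload against a remote test cluster (values are placeholders)
opensearch-benchmark execute-test \
  --workload=pmc \
  --target-hosts=<cluster endpoint>:9200 \
  --pipeline=benchmark-only \
  --client-options="use_ssl:true,basic_auth_user:'<user>',basic_auth_password:'<password>'" \
  --workload-params="bulk_indexing_clients:1,search_clients:3,target_throughput:50"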

The metrics we collected from this baseline case were:

  1. The CPU usage of the cluster nodes.
  2. The mean throughput for the index-append and term tasks, as reported by OpenSearch Benchmark.
  3. The 50th and 90th percentile latencies for the index-append and term tasks.

The index-append task indexes the entire PMC article dataset of 574,199 documents using the bulk indexing API to index 500 documents at a time. The term task performs a search request with a term query for documents that contain the term “physician”.
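
For context, the term task issues a search of roughly the following shape. This is a sketch only; the field name (body) is an assumption rather than the exact field the workload queries.

curl -X GET "<cluster endpoint>:9200/pmc/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "term": {
      "body": "physician"
    }
  }
}'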

Collecting mean throughput gives us a measure of the overall performance of indexing and searching, while collecting 50th and 90th percentile latencies gives insight into the typical (50th percentile) and near-worst-case (90th percentile) latencies.

The OpenSearch Benchmark metrics were taken directly from the output of the tool and then aggregated across each test run to produce a mean result:

Metric | Task | Unit | Round 1 | Round 2 | Round 3 | Mean
Mean Throughput | index-append | docs/s | 890.54 | 861.06 | 896.86 | 882.82
50th percentile latency | index-append | ms | 505.9 | 502.36 | 503.36 | 503.87
90th percentile latency | index-append | ms | 654.64 | 655.74 | 637.53 | 649.30
Mean Throughput | term | ops/s | 49.99 | 49.98 | 49.99 | 49.99
50th percentile latency | term | ms | 10.73 | 11.58 | 11.87 | 11.39
90th percentile latency | term | ms | 12.17 | 12.93 | 13.43 | 12.84

CPU usage was measured via the Instaclustr Monitoring API, using the node with the highest usage in each test round, excluding the initial warmup period of OpenSearch Benchmark. The CPU usage from each run was broken down into overall mean CPU usage, mean during indexing peak, and mean during searching peak, and then aggregated to produce an overall result:

CPU usage period | Round 1 | Round 2 | Round 3 | Mean
Overall mean CPU usage | 27.60% | 30.60% | 31.70% | 29.97%
During indexing peak | 37.80% | 38.20% | 40.80% | 38.93%
During searching peak | 50.20% | 54.70% | 59.30% | 54.73%

We recorded the indexing and searching peaks separately from the overall mean CPU usage because CPU usage changes dramatically throughout the test depending on the tasks being performed at the time, for example:

Figure 1: The different lines on the graph denote the CPU usage for each of the cluster’s three nodes during one test run.

This also allowed us to see the impact of replication on indexing and searching performance specifically.

Active replication testing

Once we established the baseline performance level with cross-cluster replication disabled, we repeated the same testing, this time for cross-cluster replication leader clusters with 1, 2, and 3 actively replicating followers. In addition to the CPU usage, indexing and search mean throughput, and latency recorded in the baseline test, we also recorded the cross-cluster replication lag from each follower for the entire test duration. The lag was calculated by subtracting the follower cluster's reported checkpoint value from the leader cluster's reported checkpoint value at the same point in time (or as close as possible). The lag represents how many replication operations a follower is behind its leader cluster and cannot simply be expressed as a time delay.
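
The checkpoint values themselves can be read from the replication plugin's status API on a follower cluster. The request below is a sketch of the kind of call we mean; the endpoint, credentials, and index name are placeholders, and the exact response fields may vary between plugin versions.

# Query replication status for a follower index; the response includes syncing details
# such as leader_checkpoint and follower_checkpoint, and the lag is their difference
curl -X GET -u <replication username>:<replication user password> "<follower cluster endpoint>:9200/_plugins/_replication/<follower index>/_status?pretty"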

To more easily test cross-cluster replication using OpenSearch Benchmark, we used a modified version of the "pmc" workload with a unique index name for each test run. We did this by changing all references to the pmc index name to pmc{{ now | int }} in an opensearch-benchmark-workloads branch, so each test run used a unique index name, like pmc1741753021. Without this, we would need to manually delete the pmc index from the leader and followers between test runs to avoid replication errors: the default workload behaviour deletes and re-creates the test index on the leader, but since these changes are not replicated to followers, the leader and followers would end up with inconsistent state.
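
For anyone reproducing this, one way to make that kind of substitution across a local checkout of the workload files is sketched below; this is illustrative only and not necessarily how our branch was produced.

# From the root of a local opensearch-benchmark-workloads checkout, replace the fixed
# index name with a templated one in the pmc workload's JSON files (illustrative only)
grep -rl '"pmc"' pmc/ | xargs sed -i 's/"pmc"/"pmc{{ now | int }}"/g'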

Before starting the OpenSearch Benchmark tests, we configured the followers to automatically replicate any indices matching the pmc* pattern, using this API request:

curl -X POST -u <replication username>:<replication user password> "<cluster endpoint>:9200/_plugins/_replication/_autofollow" -H 'Content-Type: application/json' -d' 
{ 
  "leader_alias": "<leader cluster ID>", 
  "name": "pmc-autofollow", 
  "pattern": "pmc*", 
  "use_roles":{ 
    "leader_cluster_role": "cross_cluster_replication_leader_full_access", 
    "follower_cluster_role": "cross_cluster_replication_follower_full_access" 
  } 
}'

Results

OpenSearch-Benchmark Metric | Task | Unit | Baseline Mean | 1 Follower | 2 Followers | 3 Followers
Mean Throughput | index-append | docs/s | 882.82 | +1.00% | -3.22% | -2.23%
50th percentile latency | index-append | ms | 503.87 | -1.59% | -0.56% | +1.61%
90th percentile latency | index-append | ms | 649.30 | -0.81% | -0.41% | +3.91%
Mean Throughput | term | ops/s | 49.99 | -0.05% | -0.02% | -0.03%
50th percentile latency | term | ms | 11.39 | -4.80% | -11.97% | -17.73%
90th percentile latency | term | ms | 12.84 | -4.15% | -10.33% | -17.93%

Leader CPU usage period | Baseline Mean | 1 Follower | 2 Followers | 3 Followers
Overall mean CPU usage | 29.97% | 31.30% | 31.27% | 32.80%
During indexing peak | 38.93% | 40.77% | 40.67% | 43.77%
During searching peak | 54.73% | 53.63% | 52.60% | 52.27%

Follower Replication Lag (# of operations behind Leader) | 1 Follower | 2 Followers | 3 Followers | Overall mean
Replication Lag mean across all followers during indexing | 4575.12 | 4212.14 | 4354.77 | 4380.68

Our results show a measurable increase in leader CPU usage when there are actively replicating followers, and the increase grows as more followers are added. With 3 followers, leader CPU usage increased by 12.41% during indexing and 9.45% over the test as a whole. There was no increase in leader CPU usage during the search part of the test, regardless of the number of followers. This was expected, as no replication is performed during the searching part of the test workload. In fact, we even recorded slightly reduced CPU usage during the searching component, likely due to uncontrolled performance variations between the tests.

Indexing throughput and latency were negatively affected with three followers, but search throughput and latency appeared unaffected. Mean indexing throughput decreased by 3.42% with 3 followers, while 50th and 90th percentile indexing latencies increased by 1.61% and 3.91% respectively.

While indexing was measurably impacted, the impact was relatively minor and showed no clearly defined trend. A clearer trend might emerge if we kept increasing the number of followers well beyond 3, where we are only beginning to see a small impact (2.23% decreased throughput, 1.61% increased 50th percentile latency).

An interesting additional observation was the reduced search latencies in the 1, 2, and 3 follower cases compared to the baseline test. This is similar to the reduced leader CPU usage we saw in the follower cases compared to baseline, but more pronounced. While we are not certain why this is, one theory is that the presence of replicating followers could affect the way documents are cached in memory (e.g. a document could be kept in memory longer than usual if a follower recently replicated it). Because we did not analyze memory usage patterns in the different test cases, we can only speculate for now.

Follower CPU usage was similar for all the 1, 2, and 3 follower cases, averaging 23.32% overall and 36.2% during the indexing peak of the test.

Follower replication lag (how many operations behind the leader cluster a follower is) was not affected by the number of followers and averaged 4380.68 operations during indexing. While this number cannot be directly converted to an elapsed time value, we observed that during the test the follower checkpoint values always exceeded the leader checkpoint value reported at the previous 20 second interval. This indicates that the followers consistently remained less than 20 seconds behind the leader, but our metrics collection interval cannot offer any more granularity than that.

While the replication lag spiked up and down throughout the test, the mean lag remained consistent rather than constantly increasing. The graphs below show how the replication lag spiked up and down during the three test runs, likely due to batch processing of replication operations.

Figures: Replication lag results charts for the three test runs.

OpenSearch performance benchmarking: Conclusion

Our performance testing of OpenSearch’s Cross-Cluster Replication (CCR) revealed that while CCR introduces a measurable increase in CPU usage on the leader cluster, particularly during indexing activities, the overall impact on search performance was minimal. With up to three followers, the indexing throughput saw a slight decrease, and latency experienced a modest increase, but these changes were within acceptable ranges for most use cases. Importantly, the replication lag remained consistent and manageable, indicating that CCR can be effectively used without significant degradation of service.

For customers using CCR, this means that it can be confidently implemented to enhance data redundancy and availability. This ensures that the benefits of CCR, such as improved disaster recovery and data distribution, can be leveraged without compromising the performance and reliability of the OpenSearch cluster.

Interested in optimizing your OpenSearch clusters? Stay tuned for more insights and guides to make the most of your deployment! For professional support in implementing CCR, connect with our team today.

The post OpenSearch® performance benchmarking for cross-cluster replication (CCR) appeared first on Instaclustr.

]]>
Apache Kafka® security: The 5 non-negotiables for secure data streaming https://www.instaclustr.com/blog/apache-kafka-security-the-5-non-negotiables-for-secure-data-streaming/ Mon, 12 May 2025 15:11:45 +0000 https://www.instaclustr.com/?p=16577 Secure your data streams with 5 best practices for Kafka security. From updates to compliance, learn how to protect sensitive data and scale confidently with Apache Kafka.

The post Apache Kafka® security: The 5 non-negotiables for secure data streaming appeared first on Instaclustr.

]]>
Securing real-time data streaming

Imagine you're flying at 30,000 feet, trying to doze off, but also tempted to catch up on that show you've been meaning to finish for the last several years. In the meantime, you're being jolted awake by turbulence along the way, maybe causing a bit of panic here and there. But hey, you still have several hours left until landing, so you manage to doze off after all. And all the while, that airplane, and every other airplane currently in the sky, is streaming real-time data back to different stakeholders:

  • The turbine company monitoring for any abnormalities
  • The airline keeping track of its geographical location
  • Weather services monitoring turbulence reports from other planes on the same route

This data streaming keeps your flight safe, smooth, and on time. Now, imagine how critical security becomes when sensitive data is involved. Enterprises relying on data streaming must prioritize robust security measures to protect their operations.

Real-time fraud detection, hyper-personalized shopping experiences, data from IoT devices, AI insights…the list of use cases for data streaming is extensive and growing.

And along with a growing need for data streaming is the need to make sure that data is secure.

But with so many changes going on (here’s looking at you AI) how can enterprises secure their data streaming? For organizations leveraging Apache Kafka® to power their data infrastructure, here are five essential best practices to ensure Kafka security and long-term scalability.

Secure your Kafka streams with confidence

Download our white paper on Kafka compliance strategies and learn how to build scalable and secure streaming architectures

Download the white paper

1. Stay on top of regular updates and patching notices

This one should be a no-brainer, but unfortunately, it's not. The number of enterprises that don't stay on top of regular updates and patching for their software is uncomfortably large.

In 2024 alone, a record-breaking 40,009 CVEs were announced, and so far in 2025 we're well on our way to surpassing that number by a good margin; some are already estimating that we'll see upwards of 50,000 CVEs this year. With so many vulnerabilities out there (and growing), maintaining regular patching and staying on top of software updates is imperative. Thankfully, the solution is straightforward.

Don't kick the can down the road. When a CVE is announced, action it. When a software update is rolled out, prioritize applying it. This will help save you from far more severe problems later.

2. Create strong authentication and access controls

How many people have access to your database? How easy is it to access? Does everyone really need that access? Ask yourself these questions and you could very well be surprised by the answers.

Use role-based access control (RBAC) or attribute-based access control (ABAC) to limit access to data streams based on user roles or attributes.

Control access to resources by setting up fine-grained access control lists (ACLs). This ensures that only authorized users or applications can read from or write to specific topics/channels.

By creating strong authentication and access controls, you help prevent unauthorized access, protect sensitive data, and ultimately mitigate security risks.
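
As a concrete illustration, Kafka ships with the kafka-acls.sh tool for managing ACLs. The sketch below grants a single application principal read access to one topic and one consumer group; the principal, topic, group, and endpoint names are hypothetical, and the admin client properties file is assumed to hold your authentication settings.

# Allow the analytics application to consume from the payments topic only (names are illustrative)
kafka-acls.sh --bootstrap-server <your-kafka-endpoint>:9092 \
  --command-config admin-client.properties \
  --add \
  --allow-principal User:analytics-app \
  --operation Read \
  --topic payments \
  --group analytics-consumer-group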

3. Keep it private: Use VPC peering and private networks

What's the best way to enhance the confidentiality, integrity, and availability of your data as you scale your data streams? Utilize privacy measures like VPC peering and private networks.

Why? Using VPC peering and private network connections ensures that your data streams are transmitted securely within a controlled environment, reducing exposure to external threats and improving performance.

This is beneficial for a few reasons:

  • It reduces the risk of attacks like man-in-the-middle (MITM), DDoS, or unauthorized access that can occur when data is exposed to the public internet.
  • By keeping data within a private network, you eliminate the need for public IP addresses, which are more vulnerable to exploitation.
  • Publicly accessible endpoints increase the attack surface for your data streaming infrastructure. With private connections, only internal resources (e.g., producers, consumers, and brokers) can communicate, significantly reducing the risk of external threats.

By taking active privacy measures like VPC peering and private networks, you’re helping to build a robust and secure data streaming architecture.
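
At the broker level, one small piece of this picture is making sure listeners are bound to and advertised on private addresses over TLS rather than on public endpoints. The settings below are a minimal server.properties sketch; the addresses, hostname, and file paths are placeholders, and your network setup (VPC peering, security groups, and so on) does the rest.

# server.properties (sketch; addresses, hostname, and paths are placeholders)
listeners=SSL://10.0.1.15:9093
advertised.listeners=SSL://broker-1.internal.example.net:9093
ssl.keystore.location=/etc/kafka/ssl/broker-1.keystore.jks
ssl.keystore.password=<keystore password>
ssl.truststore.location=/etc/kafka/ssl/broker-1.truststore.jks
ssl.truststore.password=<truststore password>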

4. Stay compliant with data privacy rules and regulations

Admittedly, this one is easier said than done—but that doesn’t mean it’s not worth investing in.

It's no surprise that many industries that rely heavily on streaming data (like finance and healthcare) have strict compliance requirements: GDPR, HIPAA, PCI-DSS, and so on. These requirements aren't going away anytime soon; if anything, they'll become more stringent.

While using VPC and private network connections can help meet these regulatory requirements, you can—and should—take it further. If you’re at the early stages of architecture design, then you can start by implementing privacy principles from the get-go. Incorporate privacy controls and data anonymization techniques to minimize the exposure of sensitive data.

Beyond the early-stage position? Not to worry, there’s still a lot you can do:

  • Implement monitoring mechanisms to detect and report any potential data breaches or non-compliance incidents.
  • Consistently audit logs, access attempts, and permissions.
  • Define data retention policies to ensure sensitive data is securely deleted or anonymized (see the sketch after this list).
  • Train employees on data privacy rules, best practices, and regulations to build a culture of security and awareness across your organization.
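
As a small example of the retention point above, per-topic retention can be tightened with the kafka-configs.sh tool. The command below sets a 30-day retention on a hypothetical topic holding personal data; the topic name, endpoint, and retention value are illustrative.

# Limit retention on a topic containing personal data to 30 days (values are illustrative)
kafka-configs.sh --bootstrap-server <your-kafka-endpoint>:9092 \
  --alter --entity-type topics --entity-name customer-profile-events \
  --add-config retention.ms=2592000000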

5. Stick with open source

And finally, what’s the best way to secure data streaming and help you scale for whatever your enterprise may face?

Use open source software.

Sure, that may seem counterintuitive at first; after all, the actual source code for open source (as the name suggests) is literally out in the open for all to see. But open source provides security benefits that its proprietary counterparts simply cannot match.

Apache Kafka® has proven itself as a leading data streaming technology, and for good reason. Its distributed architecture is designed to scale seamlessly, making it capable of handling massive volumes of real-time data without performance degradation.

Plus, Kafka's flexibility enables it to integrate easily with AI-driven systems, a key driver of increased workloads both now and in the future. And being open source, it brings security benefits that proprietary code simply cannot match: a strong community bringing transparency and quick fixes for any security vulnerabilities that can (and will) arise.

Final thoughts on Kafka security

AI is changing everything, and enterprises are already living it.

According to the 2024 Data complexity report from NetApp, 69% of enterprises are already noting an increase in security threats because of AI. Most C-level executives list global security challenges as their main stressor, which will only increase in 2025 and beyond.

But the demand for streaming data is not slowing down anytime soon. Financial institutions still need to stay ahead of the curve with fraud detection. Healthcare providers want to deliver the best possible outcome. Retailers are coming up with new ways to make the shopping experience as personalized as possible.

And airplanes—with all the real-time data monitoring keeping them in the sky—still need to fly; odds are, that data is streaming with Apache Kafka, too.

Want to know exactly how to implement these best practices? Download our white paper on Kafka compliance strategies to learn how.

The post Apache Kafka® security: The 5 non-negotiables for secure data streaming appeared first on Instaclustr.

]]>