OpenSearch® version 3.0 is here! What’s new and shiny in this major release https://www.instaclustr.com/blog/opensearch-3-0-is-here-whats-new-and-shiny-in-this-major-release/ Tue, 06 May 2025 22:30:15 +0000 This post explores the new functionality and performance gains offered by the release of OpenSearch 3.0.

OpenSearch version 3.0 GA is here! What’s new and shiny in this major release

OpenSearch 3.0 is officially GA! This is an exciting time for OpenSearch, as version 3.0 brings a lot of capabilities, while maintaining the scalability and flexibility of OpenSearch. I’ve covered some of the changes over on the OpenSearch blog, and this post will dive just a little bit deeper into the new functionality, performance gains, and more. Specifically, I’ll dig into the Lucene 10 upgrade, OpenSearch dashboard changes, changes to the core code of OpenSearch, performance improvements in OpenSearch 3.0, and finally, I’ll cover migrating from OpenSearch 2.X or Elasticsearch.

These changes let you search your data more quickly and accurately; the performance gains alone will ripple into areas like infrastructure cost, and the revamped features will give your developers and analysts faster insights into your data.

Lucene 10 upgrade

The reasons for updating to Apache Lucene 10 are predominantly related to performance. Lucene 10 is more performant in many categories than its predecessor used in OpenSearch 2.x, meaning faster searches for users. There were also major improvements to I/O performance and to the performance of parallel-executing tasks, and there is even more that Lucene 10 brings:

  • Sparse indexing: Also known as primary-key indexing, this organizes data into blocks with recorded minimum and maximum values. This allows queries to skip non-relevant blocks more efficiently, which increases CPU and storage efficiency. It was made possible through the I/O parallelism improvements mentioned earlier, allowing OpenSearch users to search faster and get more done in less time.
  • KNN/Neural search improvements: Because of the improvements to parallel searches we covered previously, parallelized execution of KNN and neural searches has improved in Lucene 10 as well. Lucene 10 also adds improvements to indexing vectors; better parallelization of I/O allows for optimizations in terms of how vectors are stored, creating a much more scalable neural search engine. This coincides well with the rise of in-house LLM and AI building, as these improvements make OpenSearch a stronger candidate for searching AI-related vector data or other AI/ML applications.

You can learn more about Lucene 10 improvements from its changelog.

OpenSearch dashboard changes

The OpenSearch dashboard has seen a lot of changes in 2024, including the release of a new UI. The updates in OpenSearch 3.0 include OpenSearch Workspaces, which lets you assign individual accounts different dashboard views, and a totally revamped Discover tool. These changes let users tailor data visualizations on the dashboard to each person’s needs and, thanks to improved UX and other functionality, write queries against their OpenSearch data more easily and quickly.

OpenSearch 3.0 dashboard

Core changes in OpenSearch code

The OpenSearch server code monolith has been broken up into modular pieces to increase the speed at which OpenSearch can be developed, making the codebase easier to navigate and understand. For OpenSearch maintainers and committers, this streamlines the PR process: instead of having to navigate hundreds of changes to the server monolith, changes can be more clearly tracked in their respective breakout classes.

This is a monumental undertaking by the OpenSearch team, and you can track their changes on this ticket.

Performance in OpenSearch 3.0

Performance in OpenSearch wasn’t just boosted by Lucene 10. The OpenSearch team has been hard at work fine-tuning and overhauling OpenSearch 3.0 to make it as performant as possible.

From a 20% performance increase on the Big5 queries in 2.19, to being up to 8.4x more performant than OpenSearch 1.x in many cases (and even more in others), OpenSearch 3.0 is squarely focused on the goal of better performance.

You can learn more about these changes from this GitHub ticket.

Migrating from OpenSearch 2.X and Elasticsearch to version 3.0

Migrating from OpenSearch 2.19 to OpenSearch 3.0 is a one-step process, but the same cannot be said for Elasticsearch: migrating from Elasticsearch to OpenSearch 3.0 is a multi-step process (Elasticsearch to OpenSearch 2.19, then to OpenSearch 3.0). If you’re thinking about making the switch, the time to do it is now: there are tools and guides available, but the OpenSearch project has made several innovations since the initial fork, and the more time passes, the more that divergence will make migration harder. Our team can help you migrate—contact us!

I hope you’ve enjoyed this whirlwind tour of the new OpenSearch 3.0! To learn more about OpenSearch, check out these YouTube videos:

NetApp Instaclustr is committed to releasing OpenSearch 3.0 on our platform as soon as possible, but in the meantime, you can try out OpenSearch 2.19 with the free trial.

Instaclustr and the Apache Cassandra® project: A continuing commitment to open source https://www.instaclustr.com/blog/instaclustr-and-the-apache-cassandra-project/ Fri, 11 Apr 2025 00:28:54 +0000

Apache Cassandra: Our original adventure in open source

When we founded Instaclustr over a decade ago, we began with a single open source project: Apache Cassandra.

Since then we have expanded into other Apache Software Foundation (ASF)-sponsored projects like Apache Kafka®, Kafka® Connect, and Apache ZooKeeper™. But Cassandra has always remained a core part of the Instaclustr Managed Platform.

Why Cassandra? Simply put, there just wasn’t anything like it being developed at the time—or since.

Cassandra’s ability to handle incredible amounts of data across multiple commodity servers, along with its high availability, fault tolerance, and linear scalability, offered a way to fundamentally change how people managed NoSQL databases. Its distributed architecture meant there was no single point of failure, and its Cassandra Query Language (CQL), closely related to SQL, made it easy for developers to adapt quickly.

But even more important than its incredible architecture were the ASF and the open source community backing the project. We wanted to be an integral part of collaborating on and growing Cassandra into the NoSQL powerhouse it is today, and to maintain that dynamism well into the future.

Here’s how Instaclustr is contributing to the lasting success of Apache Cassandra.

Powering the Cassandra process

As part of our multi-year Silver Sponsorship of the Apache Software Foundation, we’re excited to announce that Instaclustr has donated instances to provide committers with additional compute capability inside the Apache Software Foundation infrastructure.

With the infrastructure now set up, this provides Cassandra project committers with significant extra capacity on ci-cassandra.apache.org:

  1. More executors means shorter build times and a faster development feedback loop, increasing overall Cassandra progress
  2. More executors also adds stability and improves the throughput of the community’s CI
  3. NetApp Instaclustr now runs 60 of the total 140 executors (roughly 43% of the Continuous Integration (CI) infrastructure for this project), showing our firm commitment to the project and the wider community.

Other recent open source contributions

The AWS instances are not the only way Instaclustr has been contributing to the ongoing success of the Apache Cassandra project. Here’s a breakdown of some of our most recent contributions.

  • CEP-24: Cassandra password validator and generator
    • Although users have always had the ability to create whatever password they wanted in Cassandra, this inadvertently led to security vulnerabilities; while organizations may have internal policies for creating strong passwords, there was no way to enforce these standards.
    • With the development of CEP-24, users now have robust, configurable password policies with built-in enforcement and convenient password generation capabilities.
  • Cassandra 5 support for Go Driver
    • The Apache Cassandra community has adopted the original gocql driver, which is now part of the Cassandra project. We have been supporting this transition and are collaborating with upstream to add Cassandra 5 support as well.
  • Debezium/Cassandra 5
    • Debezium is an open source, distributed, low-latency data streaming platform for change data capture (CDC). We have been adding support for Debezium with Cassandra 5, as well as continuing to support older Cassandra releases.

Proudly maintaining support for the ASF, Cassandra, and the open source community

Open source has been our mission since Day One at Instaclustr, and we continue to offer nothing but 100% open source solutions more than a decade later.

By donating AWS instances to the Apache Software Foundation, we’re excited to see what Cassandra project committers will do with this extra computing power and how Apache Cassandra will maintain its edge as the world’s leading NoSQL database. But most importantly, we’re proud to be a continuing partner of the ASF and a leader in community engagement and open source development.

For more details about the latest open source contributions from Instaclustr, check out:

CEP-24 Behind the scenes: Developing Apache Cassandra®’s password validator and generator https://www.instaclustr.com/blog/cep-24-behind-the-scenes/ Wed, 09 Apr 2025 03:39:36 +0000

Introduction: The need for an Apache Cassandra® password validator and generator

Here’s the problem: while users have always had the ability to create whatever password they wanted in Cassandra–from straightforward to incredibly complex and everything in between–this ultimately created a noticeable security vulnerability.

While organizations might have internal processes for generating secure passwords that adhere to their own security policies, Cassandra itself did not have the means to enforce these standards. To make the security vulnerability worse, if a password initially met internal security guidelines, users could later downgrade their password to a less secure option simply by using “ALTER ROLE” statements.

When internal password requirements are enforced, users face the additional burden of creating compliant passwords. This inevitably involves a lot of trial and error in attempting to create a password that satisfies complex security rules.

But what if there was a way to have Cassandra automatically create passwords that meet all bespoke security requirements–but without requiring manual effort from users or system operators?

That’s why we developed CEP-24: Password validation/generation. We recognized that the complexity of secure password management could be significantly reduced (or eliminated entirely) with the right approach, improving both security and user experience at the same time.

The Goals of CEP-24

A Cassandra Enhancement Proposal (or CEP) is a structured process for proposing, creating, and ultimately implementing new features for the Cassandra project. All CEPs are thoroughly vetted among the Cassandra community before they are officially integrated into the project.

These were the key goals we established for CEP-24:

  • Introduce a way to enforce password strength upon role creation or role alteration.
  • Implement a reference implementation of a password validator which adheres to a recommended password strength policy, to be used for Cassandra users out of the box.
  • Emit a warning (and proceed) or just reject “create role” and “alter role” statements when the provided password does not meet a certain security level, based on user configuration of Cassandra.
  • Enable operators to implement a custom password validator with its own policy, whatever it might be, via a modular/pluggable mechanism.
  • Provide a way for Cassandra to generate a password which would pass the subsequent validation for use by the user.

The Cassandra Password Validator and Generator builds upon an established framework in Cassandra called Guardrails, which was originally implemented under CEP-3 (more details here).

The password validator implements a custom guardrail introduced as part of CEP-24. A custom guardrail can validate and generate values of arbitrary types when properly implemented. In the CEP-24 context, the password guardrail provides CassandraPasswordValidator by extending ValueValidator, while passwords are generated by CassandraPasswordGenerator by extending ValueGenerator. Both components work with passwords as String type values.

Password validation and generation are configured in the cassandra.yaml file under the password_validator section. Let’s explore the key configuration properties available. First, the class_name and generator_class_name parameters specify which validator and generator classes will be used to validate and generate passwords respectively.

Cassandra ships with CassandraPasswordValidator and CassandraPasswordGenerator out of the box. However, if a particular enterprise decides that it needs something very custom, it is free to implement its own validator, put it on Cassandra’s class path, and reference it in the configuration via the class_name parameter. The same applies to the generator and generator_class_name.

CEP-24 provides implementations of the validator and generator that the Cassandra team believes will satisfy the requirements of most users. These default implementations address common password security needs. However, the framework is designed with flexibility in mind, allowing organizations to implement custom validation and generation rules that align with their specific security policies and business requirements.

password_validator: 
 # Implementation class of a validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.validators is prepended. 
 # By default, there is no validator. 
 class_name: CassandraPasswordValidator 
 # Implementation class of related generator which generates values which are valid when 
 # tested against this validator. When not in form of FQCN, the 
 # package name org.apache.cassandra.db.guardrails.generators is prepended. 
 # By default, there is no generator. 
 generator_class_name: CassandraPasswordGenerator

Password quality can be thought of as the number of characteristics a password satisfies. Every password is evaluated against two levels – a warning level and a failure level – which fits nicely with how Guardrails behave. Every guardrail has warning and failure thresholds: depending on the value a specific guardrail evaluates, it will either emit a warning to the user that the value is discouraged (but ultimately allowed) or fail to be set altogether.

This same principle applies to password evaluation: each password is assessed against both warning and failure thresholds. These thresholds are determined by counting the characteristics present in the password. The validator checks the password’s overall length together with four character characteristics: the number of upper-case characters, lower-case characters, digits, and special characters. A comprehensive password security policy can be enforced by configuring minimum requirements for each of these.

 # There are four characteristics:
 # upper-case, lower-case, special character and digit.
 # If this value is set e.g. to 3, a password has to
 # consist of 3 out of 4 characteristics.
 # For example, it has to contain at least 2 upper-case characters,
 # 2 lower-case, and 2 digits to pass,
 # but it does not have to contain any special characters.
 # If the number of characteristics found in the password is
 # less than or equal to this number, it will emit a warning.
 characteristic_warn: 3
 # If the number of characteristics found in the password is
 # less than or equal to this number, it will emit a failure.
 characteristic_fail: 2

Next, there are configuration parameters for each characteristic which count towards warning or failure:

# If the password is shorter than this value, 
# the validator will emit a warning. 
length_warn: 12 
# If a password is shorter than this value, 
# the validator will emit a failure. 
length_fail: 8 
# If a password does not contain at least n 
# upper-case characters, the validator will emit a warning. 
upper_case_warn: 2 
# If a password does not contain at least 
# n upper-case characters, the validator will emit a failure. 
upper_case_fail: 1 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a warning. 
lower_case_warn: 2 
# If a password does not contain at least 
# n lower-case characters, the validator will emit a failure. 
lower_case_fail: 1 
# If a password does not contain at least 
# n digits, the validator will emit a warning. 
digit_warn: 2 
# If a password does not contain at least 
# n digits, the validator will emit a failure. 
digit_fail: 1 
# If a password does not contain at least 
# n special characters, the validator will emit a warning. 
special_warn: 2 
# If a password does not contain at least 
# n special characters, the validator will emit a failure. 
special_fail: 1
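To make the characteristic counting concrete, here is a small illustrative Python sketch (not Cassandra’s actual Java implementation; the way warn and fail levels combine here is my approximation) that evaluates a password against thresholds like those above:

import string

# Thresholds mirroring the example configuration above (illustrative only)
THRESHOLDS = {
    "length":  {"warn": 12, "fail": 8},
    "upper":   {"warn": 2,  "fail": 1},
    "lower":   {"warn": 2,  "fail": 1},
    "digit":   {"warn": 2,  "fail": 1},
    "special": {"warn": 2,  "fail": 1},
}
CHARACTERISTIC_WARN = 3   # warn if <= 3 of the 4 character rules are met at the warn level
CHARACTERISTIC_FAIL = 2   # fail if <= 2 of the 4 character rules are met at the fail level

def evaluate_password(password: str) -> str:
    counts = {
        "length":  len(password),
        "upper":   sum(c.isupper() for c in password),
        "lower":   sum(c.islower() for c in password),
        "digit":   sum(c.isdigit() for c in password),
        "special": sum(c in string.punctuation for c in password),
    }
    char_rules = ["upper", "lower", "digit", "special"]
    met_fail = sum(counts[r] >= THRESHOLDS[r]["fail"] for r in char_rules)
    met_warn = sum(counts[r] >= THRESHOLDS[r]["warn"] for r in char_rules)

    if counts["length"] < THRESHOLDS["length"]["fail"] or met_fail <= CHARACTERISTIC_FAIL:
        return "fail"
    if counts["length"] < THRESHOLDS["length"]["warn"] or met_warn <= CHARACTERISTIC_WARN:
        return "warn"
    return "ok"

print(evaluate_password("T8aum3?"))       # fail: shorter than 8 characters
print(evaluate_password("mYAtt3mp"))      # warn: set, but not all characteristics met
print(evaluate_password("R7tb33?.mcAX"))  # ok: satisfies all configured rules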

It is also possible to forbid illegal sequences of a certain length found in a password:

# If a password contains illegal sequences that are at least this long, it is invalid. 
# Illegal sequences might be either alphabetical (form 'abcde'), 
# numerical (form '34567'), or US qwerty (form 'asdfg') as well 
# as sequences from supported character sets. 
# The minimum value for this property is 3, 
# by default it is set to 5. 
illegal_sequence_length: 5
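As a rough illustration of what this check does (the real validator’s sequence detection may differ in its details), the following Python sketch flags alphabetical, numerical, or US qwerty runs of at least illegal_sequence_length characters:

ALPHA = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def has_illegal_sequence(password: str, min_len: int = 5) -> bool:
    lowered = password.lower()
    for seq in [ALPHA, DIGITS] + QWERTY_ROWS:
        # Slide a window of min_len over the password; a chunk (or its reverse)
        # that appears inside one of the reference sequences is an illegal run
        for i in range(len(lowered) - min_len + 1):
            chunk = lowered[i:i + min_len]
            if chunk in seq or chunk[::-1] in seq:
                return True
    return False

print(has_illegal_sequence("Xy34567!"))      # True: contains the numerical run '34567'
print(has_illegal_sequence("R7tb33?.mcAX"))  # False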

Lastly, it is also possible to configure a dictionary of passwords to check against, which protects against password dictionary attacks. It is up to the operator of a cluster to configure the password dictionary:

# Dictionary to check the passwords against. Defaults to no dictionary. 
# Whole dictionary is cached into memory. Use with caution with relatively big dictionaries. 
# Entries in a dictionary, one per line, have to be sorted per String's compareTo contract. 
dictionary: /path/to/dictionary/file
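For example, a dictionary file could be produced with a short Python script like the one below; for plain ASCII word lists, Python’s default string sort matches Java’s String.compareTo ordering (both compare code units), which is what the sorting requirement above refers to. The file path is just a placeholder.

# Write a password dictionary, one entry per line, sorted per String's compareTo contract
words = {"password", "letmein", "qwerty123", "cassandraisadatabase"}

with open("/path/to/dictionary/file", "w", encoding="utf-8") as f:
    for word in sorted(words):
        f.write(word + "\n")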

Now that we have gone over all the configuration parameters, let’s take a look at an example of how password validation and generation look in practice.

Consider a scenario where a Cassandra super-user (such as the default ‘cassandra’ role) attempts to create a new role named ‘alice’.

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'cassandraisadatabase' AND LOGIN = true; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
contains the dictionary word 'cassandraisadatabase'. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

The operator’s next attempt is not found in the dictionary, but it is not long enough:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'T8aum3?' AND LOGIN = true; 
InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength 
policy. To fix this error, the following has to be resolved: Password 
must be 8 or more characters in length. You may also use 
'GENERATED PASSWORD' upon role creation or alteration."

Seeing this, the operator tries to fix it by making the password longer:

cassandra@cqlsh> CREATE ROLE alice WITH PASSWORD = 'mYAtt3mp' AND LOGIN = true; 

Warnings: 

Guardrail password violated: Password was set, however it might not be 
strong enough according to the configured password strength policy. 
To fix this warning, the following has to be resolved: Password must be 12 or more 
characters in length. Passwords must contain 2 or more digit characters. Password 
must contain 2 or more special characters. Password matches 2 of 4 character rules, 
but 4 are required. You may also use 'GENERATED PASSWORD' upon role creation or alteration.

The password is finally set, but it is not completely secure. It satisfies the minimum requirements but our validator identified that not all characteristics were met. 

At this point, the operator notices the hint about the ‘GENERATED PASSWORD’ clause, which generates a password automatically without the operator needing to invent one on their own. As shown above, inventing a compliant password by hand is often a cumbersome process that is better left to the machine, which is also more efficient and reliable.

cassandra@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD; 

generated_password 
------------------ 
   R7tb33?.mcAX

The generated password shown above automatically satisfies all the rules we have configured in cassandra.yaml, and every generated password will. This is a clear advantage over manual password creation.

When the CQL statement is executed, it will be visible in the CQLSH history (HISTORY command or in cqlsh_history file) but the password will not be logged, hence it cannot leak. It will also not appear in any auditing logs. Previously, Cassandra had to obfuscate such statements. This is not necessary anymore.

We can create a role with generated password like this:

cassandra@cqlsh> CREATE ROLE alice WITH GENERATED PASSWORD AND LOGIN = true; 

or by CREATE USER: 

cassandra@cqlsh> CREATE USER alice WITH GENERATED PASSWORD;

When a password is generated for alice (out of scope of this documentation), she can log in:

$ cqlsh -u alice -p R7tb33?.mcAX 
... 
alice@cqlsh>

Note: It is recommended to save the password to ~/.cassandra/credentials, for example:

[PlainTextAuthProvider] 
username = alice
password = R7tb33?.mcAX

and by setting auth_provider in ~/.cassandra/cqlshrc 

[auth_provider] 
module = cassandra.auth 
classname = PlainTextAuthProvider

It is also possible to configure password validation in such a way that a user does not see why a password failed. This is driven by the password_validator configuration property detailed_messages. When set to false, the violation messages are very brief:

alice@cqlsh> ALTER ROLE alice WITH PASSWORD = 'myattempt'; 

InvalidRequest: Error from server: code=2200 [Invalid query] 
message="Password was not set as it violated configured password strength policy. 
You may also use 'GENERATED PASSWORD' upon role creation or alteration."

The following command will automatically generate a new password that meets all configured security requirements.

alice@cqlsh> ALTER ROLE alice WITH GENERATED PASSWORD;

Several potential enhancements to password generation and validation could be implemented in future releases. One promising extension would be validating new passwords against previous values. This would prevent users from reusing passwords until after they’ve created a specified number of different passwords. A related enhancement could include restricting how frequently users can change their passwords, preventing rapid cycling through passwords to circumvent history-based restrictions.

These features, while valuable for comprehensive password security, were considered beyond the scope of the initial implementation and may be addressed in future updates.

Final thoughts and next steps

The Cassandra Password Validator and Generator implemented under CEP-24 represents a significant improvement in Cassandra’s security posture.

By providing robust, configurable password policies with built-in enforcement mechanisms and convenient password generation capabilities, organizations can now ensure compliance with their security standards directly at the database level. This not only strengthens overall system security but also improves the user experience by eliminating guesswork around password requirements.

As Cassandra continues to evolve as an enterprise-ready database solution, these security enhancements demonstrate a commitment to meeting the demanding security requirements of modern applications while maintaining the flexibility that makes Cassandra so powerful.

Ready to experience CEP-24 yourself? Try it out on the Instaclustr Managed Platform and spin up your first Cassandra cluster for free.

CEP-24 is just our latest contribution to open source. Check out everything else we’re working on here.

How to size Apache Kafka® clusters for Tiered Storage: Part 3 https://www.instaclustr.com/blog/how-to-size-apache-kafka-clusters-for-tiered-storage-part-3/ Tue, 01 Apr 2025 02:54:40 +0000

A Kafka performance model for SSDs/EBS, network, IO, and now—storage!

In my previous series on Apache Kafka® Tiered Storage, we explored how Kafka storage works, including the new Tiered Storage architecture which allows potentially more data to be stored in remote storage (e.g. AWS S3) for less money.

And in Part 3 of that series specifically, we explored some of the space/time trade-offs with Kafka, an interesting point that’s worth digging into some more.

As we discovered already in this new series, records are initially written to local storage on the leader broker (SSD or EBS storage), where they are retained for the “local” retention time (retention.ms if Tiered Storage is not enabled, otherwise local.retention.ms).

If Kafka Tiered Storage is enabled for the cluster and topic, then the records are also written to remote storage after a slight delay (the segment must be closed first, and the write is asynchronous). The remote records/segments are retained for the “remote” retention time (retention.ms).

So, assuming you have a replication factor (RF) of 3 (typical in production) you will actually have 4+ copies of each record (apart from the active segment which only ever has 3) until local segments are deleted, after which the only copy is remote (which is replicated under the hood by AWS S3).

We can easily compute the local and remote data storage quantities using Little’s Law which predicts the population in a system from the throughput and time spent in the system on average:

Users = Throughput x Time

In my previous blog, we discovered that you could pack a certain number of boxes of cheese, rabbits, or eggs in an old-time ‘R’ (Refrigerator) train car. Let’s focus on eggs this time as egg production is easy to understand—it’s all about the chickens!

(Source: Adobe Stock)

Imagine you have a chicken farm with 10,000 chickens, each laying one egg on average every 26 hours. For this part of the system (chicken/eggs sub-system), we know the population of chickens and time to lay, so we can rearrange the formula to compute the egg laying throughput:

Eggs/hour = 10,000/26 = 385 (rounded up)

Let’s assume the eggs are packed and transported once a day, so we are interested in the number of eggs that need to be stored/processed per day (the egg/transportation sub-system), i.e. just throughput times time:

Eggs/day = 385 x 24 = 9240 eggs

Assuming one box of eggs has 30 eggs then you can compute the egg storage requirements as:

Boxes of eggs = 9240/30 = 308

Coincidentally an ‘R’ car could hold 330 boxes of eggs, so the train needs just under 1 ‘R’ car to transport your eggs to market every day (But maybe there should be a saying “Don’t put all your eggs in one railway carriage!”)

For Kafka data storage the equation is:

Data size (TB) = (retention period (s) x data throughput [GB/s])/1000

We can compute this for local and remote storage as follows:

Local data size (TB) = 3 x (local.retention.ms/1000 x producer data volume [GB/s])/1000

Remote data size (TB) = (remote.retention.ms/1000 x producer data volume [GB/s])/1000

Note: these calculations compute the approximate average (ignoring workload spikes) minimum storage space used (it may be more). Due to the way that retention works in Apache Kafka, slightly more space may be needed as once the retention period is reached deletions are typically asynchronous and potentially delayed.

Also note: this doesn’t allow for any storage headroom, and the capacity of the disk will need to be higher—you don’t want to suddenly use up your storage on a Kafka cluster (particularly for fixed sized SSDs—for EBS and S3 this isn’t such an issue).

And another thing: to compute storage sizes, it’s important to use accurate average workload metrics rather than peak metrics; this is one occasion in performance engineering when averages are good! (Why’s that? Normally averages are “wrong”, as they are not statistically robust; medians and percentiles are more robust).

From this we can compute the approximate minimum storage costs per month, given the cloud providers’ pricing for different types of storage. For example, AWS EBS is around $0.08 per GB per month, and AWS S3 is cheaper at $0.023 per GB per month (multiply by 1,000 to get $ per TB per month). Here are some limitations:

  • The cost of SSDs depends on instance type and sizes so I will ignore these for the time being
  • There may also be extra costs for data transfer but I’m just looking at storage costs for now
  • And note that the storage costs are also simplified a bit (as typically prices depend on tiered quantities)

For the same retention period, more data is stored locally due to replication. And noting how Tiered Storage actually works (records are retained on local storage for local.retention.ms, and on remote storage for retention.ms), it really only makes sense for local.retention.ms << retention.ms.

After enabling Tiered Storage, you therefore have a couple of options:

(1) Retaining the original (local) retention period for the remote storage and reducing the local retention period, or

(2) Retaining the local retention period and increasing the remote retention period.

We can add this to our Kafka performance model (Excel and JavaScript versions) and see what some different scenarios reveal. The new inputs are local and remote retention times (s) and costs ($ GB/month), and it computes local and remote storage used (TB) and storage costs per month.

Let’s start with a simple example: assuming we have one day retention time for local storage to start with. Once Tiered Storage is enabled, let’s assume we have 1-day maximum retention still—for remote storage—but a shorter 1-hour local retention time.

That is, we still have the records available for the same period, but access to records within an hour of creation will be faster using local storage, but slower for records older than an hour.

Here’s a screen shot of the updated javascript calculator for the Tiered Storage option (kafka_storage.html on GitHub here):

The following chart shows the substantial (2.6x) reduction in storage with Tiered Storage enabled:

Unsurprisingly, there are potential cost savings (7x):

As mentioned above, in addition to offering the possibility of reduced local storage space (and therefore reduced cost), another scenario is to increase the amount and age of the data available for replaying.

For example, let’s see what happens if we retain the 1 hour local retention, but increase the remote retention to 1 month.

Here’s the total storage comparison (the local storage amount of 10.8 TB for the Tiered Storage scenario is swamped by the remote storage amount of 2,592 TB):

Increasing the remote retention time means a lot more data must be stored, but the majority of it is on cheaper S3 storage as this cost comparison reveals:

Using local-only storage for a month’s worth of records is likely unaffordable at $622,000/month! But using more remote storage is potentially more affordable at $60,000/month (about 10x cheaper).
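To sanity-check those numbers, here is a small Python sketch applying the storage formulas and example prices from above (1 GB/s producer rate, RF of 3, 1 hour local retention, roughly 1 month remote retention, $0.08 per GB-month for EBS and $0.023 per GB-month for S3); it is a rough average-rate estimate, not the full calculator:

# Rough average-rate storage and cost estimate (no headroom, no workload spikes)
RF = 3
producer_gb_per_s = 1.0                    # 1,000 MB/s producer workload
local_retention_s = 60 * 60                # 1 hour
remote_retention_s = 30 * 24 * 60 * 60     # ~1 month
ebs_cost_per_gb_month = 0.08
s3_cost_per_gb_month = 0.023

local_tb = RF * local_retention_s * producer_gb_per_s / 1000
remote_tb = remote_retention_s * producer_gb_per_s / 1000

tiered_cost = local_tb * 1000 * ebs_cost_per_gb_month + remote_tb * 1000 * s3_cost_per_gb_month
local_only_cost = RF * remote_retention_s * producer_gb_per_s * ebs_cost_per_gb_month

print(f"Local storage:  {local_tb:,.1f} TB")    # ~10.8 TB
print(f"Remote storage: {remote_tb:,.0f} TB")   # ~2,592 TB
print(f"Tiered Storage:       ${tiered_cost:,.0f}/month")      # ~$60,000/month
print(f"Local-only (1 month): ${local_only_cost:,.0f}/month")  # ~$622,000/month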

Conclusions

From my past blogs, we noticed that turning Kafka Tiered Storage on will increase I/O and Network resources slightly. We previously computed the minimum I/O and network resources required for different workload scenarios, and the minimum number of brokers that would meet these requirements. But in practice, your existing cluster may have I/O and network resources to spare and not require resizing at all.

So, what are my conclusions here? Well, reducing local retention times (and therefore storage space as well), and increasing remote storage retention times and space, will likely save you money. Why? Because it will enable you to store more streaming data for less, opening up lots of potential Tiered Storage use cases.

Here are some of my final observations about Kafka Tiered Storage sizing:

  • We built a simple “best-guess” (i.e., an in-depth Kafka architectural knowledge combined with benchmarking) Kafka sizing model of a Kafka cluster including Tiered Storage changes and cluster-level metrics (workloads)
    • It hasn’t been formally validated (theoretically or practically) yet, so use it with discretion!
  • In general (Tiered Storage disabled or enabled), EBS uses more network capacity than SSDs
  • Tiered Storage enabled uses more I/O and/or network (maybe 10-30% more), but your current cluster may have sufficient I/O and network resources already
  • Benchmarking and monitoring will tell you if it still works with the desired SLAs, or if you need to resize it
  • Tiered Storage enables you to have far more storage remotely for less money than local-only storage
  • Given that Tiered Storage allows you to scale processing and storage independently,
    • it’s possible to scale your Kafka cluster elastically (bigger or more brokers for a short period of time) to enable higher throughput for consumer read processing for historic remote records
      • (As long as the records were written with sufficient partitions originally, that is)
  • Better models–and therefore better predictions–are possible in the future
    • E.g. including CPU, topics, partitions, consumers, Tiered Storage configurations, and more!

***

And one final piece of cheese/storage trivia!

Q: How much space would you need to store 1.4 billion pounds of cheese (that’s about 635,000 metric tons)? (Well…maybe not exactly “cheese” as most people know it, but “government cheese”)

A: A 3.3 million square foot (306,580 square metres, 74 acres, or 56 American football fields) underground warehouse in Missouri that was used to even out supply and demand for cheese production and consumption in the US, starting with the Reagan era. That’s a lot of storage—and cheese!

 

Follow along in the How to size Apache Kafka clusters for Tiered Storage series

 

And check out my companion series, Why is Apache Kafka Tiered Storage more like a dam than a fountain?

How to size Apache Kafka® clusters for Tiered Storage: Part 2 https://www.instaclustr.com/blog/how-to-size-apache-kafka-clusters-for-tiered-storage-part-2/ Wed, 19 Mar 2025 23:13:16 +0000

A Kafka performance model for SSDs/EBS, network, I/O, and brokers

In Part 1 of this new series, we explored the question: how do you resize a Kafka cluster (using Solid State Drives [SSDs] for local storage) for producer and consumer workloads on topics with remote Tiered Storage enabled?

Now in Part 2, we’re going to explore a Kafka performance model for SSDs/EBS, network, I/O and brokers!

Not all Apache Kafka® clusters use SSDs (Solid State Drives) for local storage. In fact, some of our biggest Kafka clusters opt for AWS Elastic Block Store (EBS) for local storage instead (they can be SSDs or HDDs [hard disk drives] but connected to Kafka brokers via a network).

The “Elastic” in EBS also gives our customers the ability to resize their NetApp Instaclustr Kafka clusters. I’ve also previously used EBS-backed Kafka clusters to achieve 19 billion anomaly checks per day with a massively scalable demonstration application (Anomalia Machina), to scale the “Kongo” (IoT) application on a production cluster, and for benchmarking in The Power of Apache Kafka Partitions.

The main difference we now need to include in our Kafka Tiered Storage performance model is that writes to and reads from EBS storage all occur over a network rather than local I/O (so they are similar to remote storage workloads in our Tiered Storage model in that respect).

Continuing the “cheese” theme from the previous blog, let’s try some stretchy “elastic” cheese (fondue)!
(Source: Adobe Stock)

1. EBS and Tiered Storage enabled

Here’s the amended complete Kafka Tiered Storage model for EBS local storage. I’ve added a new internal component for the total EBS network bandwidth (to and from EBS storage from the Kafka cluster, EBSNet).

The most obvious difference from the previous SSD model is that there is no longer any load on the cluster I/O (there is no “local” storage attached directly to the Kafka brokers); the new EBS network load is 5,000 MB/s, and the total network load has increased to 13,000 MB/s. This is the “default” scenario with a 1,000 MB/s producer workload, a fan-out of 3, and an equal split between each type of consumer workload (real-time, delayed, and remote).

2. EBS and Tiered Storage disabled

Now let’s keep the EBS local storage but disable the Tiered Storage for comparison (representing the initial cluster before resizing for tiered remote storage). The producer workload is still 1,000 MB/s, but real-time and delayed consumer workloads have a 50:50 split still resulting in a fan-out of 3.

The EBS Network load has reduced to 4,500 MB/s and the total Network load is 10,500 MB/s.
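These totals can be roughly reproduced with some simple back-of-the-envelope accounting. The sketch below is only my approximation of the flows (RF of 3, 1,000 MB/s producer, consumers split as described above); the actual Excel/JavaScript model may attribute traffic differently:

# Approximate network accounting for the two EBS scenarios above (MB/s)
PRODUCER = 1000
RF = 3

def ebs_scenario(tiered: bool):
    if tiered:
        realtime = delayed = remote = 1000   # equal three-way split, fan-out of 3
        tier_write = 1000                    # asynchronous segment copies to S3
    else:
        realtime = delayed = 1500            # 50:50 split, fan-out of 3
        remote = tier_write = 0

    # EBS network: RF copies written, plus reads that miss the page cache
    ebs_net = RF * PRODUCER + delayed + tier_write
    consumer_out = realtime + delayed + remote
    s3_traffic = tier_write + remote         # uploads to S3 plus remote fetches
    total_net = PRODUCER + (RF - 1) * PRODUCER + consumer_out + s3_traffic + ebs_net
    return ebs_net, total_net

print(ebs_scenario(tiered=True))    # (5000, 13000)
print(ebs_scenario(tiered=False))   # (4500, 10500)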


Comparing cheeses! (Source: Adobe Stock)

3. Comparing SSDs vs. EBS and remote Tiered Storage enabled vs. disabled

Let’s now summarize our findings so far with the following graphs comparing SSD/EBS and remote Tiered Storage enabled and disabled for the default workloads and the x10 remote consumer workloads (see previous blog):

What do we notice?

Naturally, there’s no I/O demand for the EBS scenarios (as local storage is networked). As we noted in the last blog, enabling Tiered Storage for SSDs increases the network bandwidth (+33%), and the absolute network load is higher for the EBS scenarios as everything happens over the network. However, the percentage increase is smaller (+23% for EBS with Tiered Storage enabled compared with local-only).

4. How many brokers does the cluster need?


Which type and how much cheese do you need? (Source: Adobe Stock)

The NetApp Instaclustr Kafka team regularly runs Kafka benchmarks on Kafka clusters composed of different AWS instance types and sizes, so I had a look at the most recent results and discovered a few things that may be relevant for Kafka Tiered Storage modelling and prediction including:

  • A subset of the instance types and sizes that we benchmark are:
    • AWS R6g, large to 2xlarge (2 to 8 vCPUs)
      • These support EBS for the local storage
    • AWS i3 and i3en (xlarge to 2xlarge, 4 to 8 vCPUs)
      • These include NVMe SSDs for the local storage
  • Test results include the “baseline” network bandwidth, and actual network bandwidth, throughput achieved, I/O wait (< 5% for a successful run) and CPU utilisation (< 50% for a successful run).
  • The maximum measured network bandwidth achieved is typically around 70-80% of the “baseline” bandwidth.
  • Larger instances consistently provide more CPU and network, but not necessarily more I/O.

So, with this information as a starting point, I asked our team a few more questions and they pointed me to the AWS EC2 specifications. This page (and sub-pages for the different EC2 Categories) provides more complete information on the capacity of each instance type and size. I looked up the instances above and found that R6g instances are:

  • memory optimized
  • use AWS Graviton2 Processors
  • have only 1 thread per core
  • have a baseline network bandwidth (minimum)
  • and a burst bandwidth (maximum)
  • and include dedicated baseline and burst network and I/O for the EBS storage.

Given that the burst capacity is only available for relatively short periods of time, we will rely on the baseline values for modelling and prediction.

AWS i3 (Xen hypervisor) and i3en (Nitro v3 hypervisor) are storage optimised instance types. i3 instances use Intel Broadwell E5-2686v4 processors (2 threads per core) and i3en instances use Intel Xeon Platinum 8175 processors (also 2 threads per core). They also have baseline/burst network bandwidths, but are designed for SSDs and include specifications for SSD sizes and read/write IOPs (from which throughput for 4 KB block sizes can be estimated)—reads are faster than writes.
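As a quick illustration of that conversion (the IOPS figures below are made up; the real numbers are in the summary table that follows), throughput is roughly IOPS multiplied by the block size:

# Approximate throughput at a 4 KB block size from an IOPS rating (illustrative numbers only)
def iops_to_mb_per_s(iops: int, block_kb: int = 4) -> float:
    return iops * block_kb / 1024

print(iops_to_mb_per_s(400_000))  # hypothetical read IOPS  -> ~1562 MB/s
print(iops_to_mb_per_s(200_000))  # hypothetical write IOPS -> ~781 MB/s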

A summary of this information is in the following table:

By adding this information to the Kafka performance model (for the default scenario), we can now compute the minimum number of brokers required to satisfy the total I/O and Network demands on the Kafka cluster, taking into account which resource is the first bottleneck encountered. Here are the results comparing local-only storage vs. remote Tiered Storage enabled:

The bottleneck for the blue highlighted results is the network, but for the green results (i3en) the bottleneck is SSD I/O (the prediction is the minimum number of brokers of each type/size for the I/O, so there is spare network capacity). The potential increase in brokers is 24% to 33% for network-limited clusters (blue), and 13% for I/O-limited clusters (green).
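To make that calculation concrete, here is a simplified Python sketch of the broker-count step; the totals and per-broker baseline capacities below are hypothetical placeholders, not the benchmark or AWS figures used in the actual model:

import math

def min_brokers(total_net_mb_s, total_io_mb_s, broker_net_mb_s, broker_io_mb_s=None):
    """Minimum brokers to cover total network and I/O demand; the larger requirement wins."""
    for_network = math.ceil(total_net_mb_s / broker_net_mb_s)
    for_io = math.ceil(total_io_mb_s / broker_io_mb_s) if broker_io_mb_s else 0
    bottleneck = "network" if for_network >= for_io else "I/O"
    return max(for_network, for_io), bottleneck

# Hypothetical example: an SSD-based broker type with 1,200 MB/s baseline network
# and 900 MB/s SSD I/O, against cluster-wide demands of 8,000 and 5,000 MB/s
print(min_brokers(8000, 5000, broker_net_mb_s=1200, broker_io_mb_s=900))  # (7, 'network')

In practice you would then add the 20-30% headroom mentioned in the notes below.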

Also please note that:

  • Because the current model doesn’t compute separate write and read SSD loads, and I averaged the SSD write/read IOPs capacity, these results are only approximate.
  • You will also need to increase these estimates by 20 to 30% to allow for the 70-80% maximum network bandwidth utilisation we obtained with our internal benchmarks.
  • You can choose to use average or peak workload rates, but the resource estimates will naturally increase substantially for peak workloads (and depending on how long the peaks last, you may have sufficient burst credits to cover the increased resource demand, or not).
  • The model doesn’t take into account CPU utilization, yet which is likely to be more with Tiered Storage enabled.
  • These are purely predictions of the absolute minimum cluster wide I/O and Network demands – your existing cluster may already have these to spare, and therefore be “bigger” (more/bigger brokers) than this, due to disk storage and/or CPU requirements etc.
  • Finally, these results are just general guidelines. For more specific and customised predictions, the model needs to be parameterized with actual workloads, fan-out ratio, and percentage of topics that have Tiered Storage enabled, etc.

The potential savings for Tiered Storage vs local storage are also likely to be substantial, but that will depend on the retention settings, throughputs, and local storage type (SSD vs. EBS) etc–we will explore the storage trade-offs and costs in the next blog.

Update! I’ve put some of the functionality from the Excel/Sankey Diagram model (not including broker types/sizes) into a prototype JavaScript/HTML Kafka sizing model available on GitHub. This includes producer and consumer workloads, SSD or EBS, RF, and Tiered Storage enabled/disabled, and predicts and graphs total Kafka cluster I/O and Network bandwidth loads, and fan-out ratio – enjoy!

======

And now for a final historical “cheese” related capacity/sizing fact! How much cheese (or rabbits or eggs) could you fit in a South Australian Railways refrigerator (‘R’) car?


(Source: The History Trust of South Australia, South Australian Government Photo, CC0)

Well, back in the day, you could load 261 boxes of cheese in an ‘R’ car, or 420 boxes of rabbits, or 336 boxes of eggs! Or even overload them by 10% (but only in South Australia!)


Peterborough Railway Museum, SA Railways car loading regulation (Source: Paul Brebner)

 

Follow along in the How to size Apache Kafka clusters for Tiered Storage series

And check out my companion series, Why is Apache Kafka Tiered Storage more like a dam than a fountain?

Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type https://www.instaclustr.com/blog/introduction-to-similarity-search-part-2/ Mon, 17 Mar 2025 12:33:39 +0000

In Part 1 of this series, we explored how you can combine Cassandra 4 and OpenSearch to perform similarity searches with word embeddings. While that approach is powerful, it requires managing two different systems.

But with the release of Cassandra 5, things become much simpler.

Cassandra 5 introduces a native VECTOR data type and built-in Vector Search capabilities, simplifying the architecture by enabling Cassandra 5 to handle storage, indexing, and querying seamlessly within a single system.

Now in Part 2, we’ll dive into how Cassandra 5 streamlines the process of working with word embeddings for similarity search. We’ll walk through how the new vector data type works, how to store and query embeddings, and how the Storage-Attached Indexing (SAI) feature enhances your ability to efficiently search through large datasets.

The power of vector search in Cassandra 5

Vector search is a game-changing feature added in Cassandra 5 that enables you to perform similarity searches directly within the database. This is especially useful for AI applications, where embeddings are used to represent data like text or images as high-dimensional vectors. The goal of vector search is to find the closest matches to these vectors, which is critical for tasks like product recommendations or image recognition.

The key to this functionality lies in embeddings: arrays of floating-point numbers that represent the similarity of objects. By storing these embeddings as vectors in Cassandra, you can use Vector Search to find connections in your data that may not be obvious through traditional queries.

How vectors work

Vectors are fixed-size sequences of non-null values, much like lists. However, in Cassandra 5, you cannot modify individual elements of a vector — you must replace the entire vector if you need to update it. This makes vectors ideal for storing embeddings, where you need to work with the whole data structure at once.

When working with embeddings, you’ll typically store them as vectors of floating-point numbers to represent the semantic meaning.

Storage-Attached Indexing (SAI): The engine behind vector search

Vector Search in Cassandra 5 is powered by Storage-Attached Indexing, which enables high-performance indexing and querying of vector data. SAI is essential for Vector Search, providing the ability to create column-level indexes on vector data types. This ensures that your vector queries are both fast and scalable, even with large datasets.

SAI isn’t just limited to vectors—it also indexes other types of data, making it a versatile tool for boosting the performance of your queries across the board.

Example: Performing similarity search with Cassandra 5’s vector data type

Now that we’ve introduced the new vector data type and the power of Vector Search in Cassandra 5, let’s dive into a practical example. In this section, we’ll show how to set up a table to store embeddings, insert data, and perform similarity searches directly within Cassandra.

Step 1: Setting up the embeddings table

To get started with this example, you’ll need access to a Cassandra 5 cluster. Cassandra 5 introduces native support for vector data types and Vector Search, available on Instaclustr’s managed platform. Once you have your cluster up and running, the first step is to create a table to store the embeddings. We’ll also create an index on the vector column to optimize similarity searches using SAI.

CREATE KEYSPACE aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

 

CREATE TABLE IF NOT EXISTS embeddings ( 
    id UUID, 
    paragraph_uuid UUID, 
    filename TEXT, 
    embeddings vector<float, 300>, 
    text TEXT, 
    last_updated timestamp, 
    PRIMARY KEY (id, paragraph_uuid) 
); 
 

CREATE INDEX IF NOT EXISTS ann_index 
  ON embeddings(embeddings) USING 'sai';

This setup allows us to store the embeddings as 300-dimensional vectors, along with metadata like file names and text. The SAI index will be used to speed up similarity searches on the embeddings column.

You can also fine-tune the index by specifying the similarity function to be used for vector comparisons. Cassandra 5 supports three types of similarity functions: DOT_PRODUCT, COSINE, and EUCLIDEAN. By default, the similarity function is set to COSINE, but you can specify your preferred method when creating the index:

CREATE INDEX IF NOT EXISTS ann_index 
    ON embeddings(embeddings) USING 'sai' 
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

Each similarity function has its own advantages depending on your use case. DOT_PRODUCT is often used when you need to measure the direction and magnitude of vectors, COSINE is ideal for comparing the angle between vectors, and EUCLIDEAN calculates the straight-line distance between vectors. By selecting the appropriate function, you can optimize your search results to better match the needs of your application.
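To build a little intuition for the three options, here is a small NumPy sketch of the standard formulas behind them (Cassandra computes these server-side via the SAI index, and the scores it returns may be rescaled, so exact values can differ):

import numpy as np

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

dot = float(np.dot(a, b))                                # direction and magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # angle between the vectors
euclidean = float(np.linalg.norm(a - b))                 # straight-line distance (smaller = closer)

print(dot, cosine, euclidean)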

Step 2: Inserting embeddings into Cassandra 5

To insert embeddings into Cassandra 5, we can use the same code from the first part of this series to extract text from files, load the FastText model, and generate the embeddings. Once the embeddings are generated, the following function will insert them into Cassandra:

import time
from uuid import uuid4, UUID
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.auth import PlainTextAuthProvider

# Connect to the cluster
cluster = Cluster(
    # Contact points: replace with your node IP list
    ["xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx", "xxx.xxx.xxx.xxx"],
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='AWS_VPC_US_EAST_1'),  # Update the local data centre if needed
    port=9042,
    auth_provider=PlainTextAuthProvider(
        username='iccassandra',
        password='replace_with_your_password'
    )
)
session = cluster.connect()

print('Connected to cluster %s' % cluster.metadata.cluster_name)

# Keyspace created in Step 1
keyspace_name = "aisearch"

def insert_embedding_to_cassandra(session, embedding, id=None, paragraph_uuid=None, filename=None, text=None, keyspace_name=None):
    try:
        embeddings = list(map(float, embedding))

        # Generate UUIDs if not provided
        if id is None:
            id = uuid4()
        if paragraph_uuid is None:
            paragraph_uuid = uuid4()
        # Ensure id and paragraph_uuid are UUID objects
        if isinstance(id, str):
            id = UUID(id)
        if isinstance(paragraph_uuid, str):
            paragraph_uuid = UUID(paragraph_uuid)

        # Create the query string with placeholders
        insert_query = f"""
        INSERT INTO {keyspace_name}.embeddings (id, paragraph_uuid, filename, embeddings, text, last_updated)
        VALUES (?, ?, ?, ?, ?, toTimestamp(now()))
        """

        # Create a prepared statement with the query
        prepared = session.prepare(insert_query)

        # Execute the query
        session.execute(prepared.bind((id, paragraph_uuid, filename, embeddings, text)))

        return None  # Successful insertion

    except Exception as e:
        error_message = f"Failed to execute query:\nError: {str(e)}"
        return error_message  # Return error message on failure

def insert_with_retry(session, embedding, id=None, paragraph_uuid=None,
                      filename=None, text=None, keyspace_name=None, max_retries=3,
                      retry_delay_seconds=1):
    retry_count = 0
    while retry_count < max_retries:
        result = insert_embedding_to_cassandra(session, embedding, id, paragraph_uuid, filename, text, keyspace_name)
        if result is None:
            return True  # Successful insertion
        else:
            retry_count += 1
            print(f"Insertion failed on attempt {retry_count} with error: {result}")
            if retry_count < max_retries:
                time.sleep(retry_delay_seconds)  # Delay before the next retry
    return False  # Failed after max_retries

# Replace the file path pointing to the desired file
file_path = "/path/to/Cassandra-Best-Practices.pdf"
# extract_text_with_page_number_and_embeddings() comes from Part 1 of this series
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

from tqdm import tqdm

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"):
    if not insert_with_retry(
        session=session,
        embedding=paragraph['embedding'],
        id=paragraph['uuid'],
        paragraph_uuid=paragraph['paragraph_uuid'],
        text=paragraph['text'],
        filename=paragraph['filename'],
        keyspace_name=keyspace_name,
        max_retries=3,
        retry_delay_seconds=1
    ):
        # Display an error message if insertion fails
        tqdm.write(f"Insertion failed after maximum retries for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...")

This function handles inserting embeddings and metadata into Cassandra, ensuring that UUIDs are correctly generated for each entry.

Step 3: Performing similarity searches in Cassandra 5

Once the embeddings are stored, we can perform similarity searches directly within Cassandra using the following function:

import numpy as np

# ------------------ Embedding Functions ------------------
def text_to_vector(text):
    """Convert a text chunk into a vector using the FastText model."""
    words = text.split()
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index]
    return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext_model.vector_size)

def find_similar_texts_cassandra(session, input_text, keyspace_name=None, top_k=5):
    # Convert the input text to an embedding
    input_embedding = text_to_vector(input_text)
    input_embedding_str = ', '.join(map(str, input_embedding.tolist()))

    # ANN query: order rows by approximate nearest neighbour distance to the input embedding
    query = f"""
    SELECT text, filename, similarity_cosine(embeddings, ?) AS similarity
    FROM {keyspace_name}.embeddings
    ORDER BY embeddings ANN OF [{input_embedding_str}]
    LIMIT {top_k};
    """

    prepared = session.prepare(query)
    bound = prepared.bind((input_embedding,))
    rows = session.execute(bound)

    # Sort the results by similarity in Python
    similar_texts = sorted([(row.similarity, row.filename, row.text) for row in rows], key=lambda x: x[0], reverse=True)

    return similar_texts[:top_k]

from IPython.display import display, HTML

# The word you want to find similarities for
input_text = "place"

# Call the function to find similar texts in the Cassandra database
similar_texts = find_similar_texts_cassandra(session, input_text, keyspace_name="aisearch", top_k=10)

This function searches for similar embeddings in Cassandra and retrieves the top results based on cosine similarity. Under the hood, Cassandra’s vector search uses Hierarchical Navigable Small Worlds (HNSW). HNSW organizes data points in a multi-layer graph structure, making queries significantly faster by narrowing down the search space efficiently—particularly important when handling large datasets.
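For reference, the ANN ordering and similarity_cosine() function above rely on the vector column and its Storage Attached Index (SAI) created earlier in this series. Here is a minimal, hedged sketch of that setup; the keyspace, table, and index names mirror the examples in this post, and the 300-dimension vector matches the FastText crawl-300d-2M model, so adjust them to your own schema:

# Hedged sketch: schema assumed by the ANN query above (names and the
# 300-dimension vector are assumptions matching this series' examples)
session.execute("""
    CREATE TABLE IF NOT EXISTS aisearch.embeddings (
        id UUID,
        paragraph_uuid UUID,
        filename TEXT,
        embeddings VECTOR<FLOAT, 300>,
        text TEXT,
        last_updated timestamp,
        PRIMARY KEY (id, paragraph_uuid)
    )
""")

# A Storage Attached Index (SAI) on the vector column enables ANN queries;
# setting similarity_function to cosine matches similarity_cosine() above
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS embeddings_ann_index
    ON aisearch.embeddings (embeddings)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'similarity_function': 'cosine'}
""")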

Step 4: Displaying the results

To display the results in a readable format, we can loop through the similar texts and present them along with their similarity scores:

# Print the similar texts along with their similarity scores
for similarity, filename, text in similar_texts:
    html_content = f"""
    <div style="margin-bottom: 10px;">
        <p><b>Similarity:</b> {similarity:.4f}</p>
        <p><b>Text:</b> {text}</p>
        <p><b>File:</b> {filename}</p>
    </div>
    <hr/>
    """

    display(HTML(html_content))

This code will display the top similar texts, along with their similarity scores and associated file names.

Cassandra 5 vs. Cassandra 4 + OpenSearch®

Cassandra 4 relies on an integration with OpenSearch to handle word embeddings and similarity searches. This approach works well for applications that are already using or comfortable with OpenSearch, but it does introduce additional complexity with the need to maintain two systems.

Cassandra 5, on the other hand, brings vector support directly into the database. With its native VECTOR data type and similarity search functions, it simplifies your architecture and improves performance, making it an ideal solution for applications that require embedding-based searches at scale.

Feature            | Cassandra 4 + OpenSearch  | Cassandra 5 (Preview)
Embedding Storage  | OpenSearch                | Native VECTOR Data Type
Similarity Search  | KNN Plugin in OpenSearch  | COSINE, EUCLIDEAN, DOT_PRODUCT
Search Method      | Exact K-Nearest Neighbor  | Approximate Nearest Neighbor (ANN)
System Complexity  | Requires two systems      | All-in-one Cassandra solution

Conclusion: A simpler path to similarity search with Cassandra 5

With Cassandra 5, the complexity of setting up and managing a separate search system for word embeddings is gone. The new vector data type and Vector Search capabilities allow you to perform similarity searches directly within Cassandra, simplifying your architecture and making it easier to build AI-powered applications.

Coming up: more in-depth examples and use cases that demonstrate how to take full advantage of these new features in Cassandra 5 in future blogs!

Ready to experience vector search with Cassandra 5? Spin up your first cluster for free on the Instaclustr Managed Platform and try it out!

The post Introduction to similarity search: Part 2–Simplifying with Apache Cassandra® 5’s new vector data type appeared first on Instaclustr.

]]>
How to size Apache Kafka® clusters for Tiered Storage: Part 1–A Kafka performance model for SSDs, network, and I/O https://www.instaclustr.com/blog/how-to-size-apache-kafka-clusters-for-tiered-storage-part-1/ Wed, 05 Mar 2025 21:57:26 +0000 https://www.instaclustr.com/?p=15929 Introduction: The next phase of Kafka Tiered Storage In my previous blog series, I explored how Apache Kafka Tiered Storage is more like a dam than a fountain by comparing local vs remote storage (Part 1), performance results (Part 2), Kafka time and space (Part 3), and the impact of various consumer behaviors on the...

The post How to size Apache Kafka® clusters for Tiered Storage: Part 1–A Kafka performance model for SSDs, network, and I/O appeared first on Instaclustr.

]]>
Introduction: The next phase of Kafka Tiered Storage

In my previous blog series, I explored how Apache Kafka Tiered Storage is more like a dam than a fountain by comparing local vs remote storage (Part 1), performance results (Part 2), Kafka time and space (Part 3), and the impact of various consumer behaviors on the amount and age of data processed (Part 4).

Now that I have a solid understanding of Tiered Storage—it’s a pretty useful feature to have—this did bring up an entirely new set of questions: does my Apache Kafka cluster have to be bigger (and by approximately how much) if I enable Tiered Storage? And what happens if I change producer and/or consumer workloads?

Or, to put it another way: how do you resize a Kafka cluster (using Solid State Drives [SSDs] for local storage) for producer and consumer workloads on topics with remote Tiered Storage enabled?

This new series will explore doing just that by examining the impact of producer and consumer workloads on cluster sizing by using a performance modeling approach.

We’ll then extend the model to include local Elastic Block Store (EBS) storage, take a look at how many brokers (at minimum) are needed, what size AWS instances to use, how big your storage needs to be, and more.

Setting up the model

But first things first: we need to make some assumptions and caveats.

The initial approach for this model will look at the impact of local vs. Tiered Storage for:

  • Data workloads only
    • Excluding meta-data workloads
  • I/O, EBS bandwidth, and network bandwidth only
    • CPU demand and utilization is ignored
  • Local storage using SSDs (we will compare with EBS storage in the next part)
  • The cluster level
    • Whole cluster resources rather than brokers or availability zones
  • Aggregated/total producer and consumer workloads
    • Total producer/consumer workload data bandwidth (in/out) only
    • Continuous and average rates
    • Ignoring message rates and sizes, the number of topics and partitions
  • Assuming no data compression
    • Although in practice the model just captures the total volume of data written to and read from the cluster
    • Whether it’s compressed or not
  • A single cluster

To simplify things even further for this blog, we need a “representative” illustrative cluster.

One of the talks I gave last year was at Community over Code EU (Bratislava); I presented “Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)”. Part of this talk was an analysis of performance metrics across our top 10 (biggest) Kafka production clusters.

For this blog, I’m going to use those average metrics, including:

  • The average producer input (ingress) bandwidth of 1 GB/s (1,000 MB/s)
  • The average fan-out ratio of 3
    • i.e. for every byte coming into the cluster, 3 bytes are emitted, implying that there are an average of 3 consumer groups at the cluster level
  • Giving an output (egress) bandwidth of 3 GB/s (3,000 MB/s), for our example.

Note that this “average” cluster may not really exist in the top 10 clusters, and we’ll change the assumptions for the fan-out ratio for the last scenario below.

For this blog, I’ve manually built a very simple performance model in Excel and will show the input parameter values and predictions using Sankey Diagrams (which date from the 1800’s when they were first used to show energy flows in steam engines).

I’ve used Sankey diagrams previously to explain system performance to great effect (check out my blogs for Kafka Metrics, OpenTelemetry, and Kafka Kongo).

Why use Sankey Diagrams? Well, because they’re a good choice for performance models as they show values and flows between multiple components on the same diagram. I’ll add one workload type at a time until we complete the model and explore how it works for local vs. remote Tiered Storage, and SSD vs. EBS local storage types.

(Oh, and I’ll do this all with the help of some cheese!)

Cheese is produced from raw ingredients – the first part of its journey to ultimately being consumed… (Source: Adobe Stock)

1. Producers

We’ll start with the basic producer workload, assuming SSD storage for brokers, and that remote Tiered Storage is enabled for all topics.

The following Sankey diagram shows the total producer workload of 1,000 MB/s into the cluster (on the left), resulting in a cluster I/O load (on SSDs) of 1,000 MB/s and network load of 1,000 MB/s (InOutNet).

Basic writes from producers to partition leaders use network bandwidth into the cluster, cluster (local storage) I/O—but no inter broker network or network out of the cluster.

The total of 2,000 MB/s for the ProducersCluster component is a combination of all I/O and Network bandwidth due to the producer workload at the cluster level. For this model at least, it isn’t really a meaningful number at the cluster level, so can just be ignored.

(In practice, however, CPU load is likely to be proportional to the combined I/O and network, and potentially other factors that we haven’t included in the model yet – message rate, number of partitions and consumers, compression, etc.).

The numbers that matter are on the left and right-hand sides.

…and after production, cheese is then replicated… (Source: Adobe Stock)

2. Replication

Next, we add the replication workload due to the producer workload.

This workload copies records from the leader partitions (from active segments, assumed to be in the cache so there's no local I/O on the leaders) to the follower brokers (with RF=3 the number of followers = 2). This contributes double (x2) the producer input bandwidth (1,000 x 2 = 2,000 MB/s) to both the cluster network (broker to broker) and local I/O, taking the total I/O and network bandwidth to 3,000 MB/s each.

Leader partition replication copies from the cache on the leader broker to the number of followers (2 for RF=3) using cluster network and I/O to write on each follower.

…and then stored remotely (even in caves)… (Source: Adobe Stock)

3. Remote Tiered Storage writes

With remote storage enabled (we assume for all topics), closed segments are copied to remote storage—the reading of this data uses cluster I/O (for SSDs) and network for transferring it to remote cloud storage (e.g. AWS S3).

We assume that closed segments are no longer in the cache, so they must be read from local storage using I/O (i.e. there is a delay between records being written to the active segment, the active segment eventually being closed, and the records being asynchronously copied to remote storage).

The next diagram adds the remote storage writes workload, including the reads from local storage (I/O, 1,000 MB/s) and the writes to remote storage over the network (1,000 MB/s). This takes the combined I/O load to 4,000 MB/s and the combined network load to 4,000 MB/s.

Note that the remote storage workload is comparable to adding a delayed-consumer (see below) in terms of the overhead on I/O and network.

So even with an incomplete model, what can we tell about the overhead of Tiered Storage for the write workloads? Comparing the load predicted by this diagram with the previous one, we can see that for writes there's about a 25% overhead for Tiered Storage. For a 1 GB/s producer workload with all topics enabled for Tiered Storage: 4,000 MB/s I/O and network for Tiered Storage, compared with 3,000 MB/s I/O and network for local storage.

So, what happens when we add consumer workloads? Bring on the mice!

…and then ultimately consumed! (Source: Adobe Stock)

4. Consumer workloads

For simplicity, we model consumer workloads as a combination of real-time, delayed or remote. Given the fan-out of 3 for this representative cluster, we assume an initial ratio of 100%:100%:100% across the different consumer workloads (1,000 MB/s per consumer type, a total of 3,000 MB/s).

Real-time consumers are assumed to keep up with the producer rates, so read from local cache only. They use 1,000 MB/s of network only.

Delayed consumers are assumed to be running a bit behind the producers, so are likely to read from local storage. They use 1,000 MB/s each of I/O and Network.

Remote consumers may re-read from the start of topics and are assumed to use remote storage only. They use 2,000 MB/s of network in total (1,000 MB/s with the broker, and 1,000 MB/s with the remote storage).

This gives a grand total of 5,000 MB/s for I/O, and 8,000 MB/s for network bandwidth used for Tiered Storage enabled for this scenario. I used SankeyMATIC, and below is an example model, so you can copy it to https://sankeymatic.com/build/ and try it yourself.

// Tiered Storage, SSDs

// producer workload, 1,000 MB/s in

Producers [1000] ProducersCluster

ProducersCluster [1000] IO

ProducersCluster [1000] InOutNet

 

// producer RF workload, RF=3 so x2 producer workload

ReplicationCluster [2000] IO

ReplicationCluster [2000] ClusterNet

 

// remote storage workload, copies from local storage to remote storage

RemoteStorageWritesCluster [1000] IO

RemoteStorageWritesCluster [1000] RemoteStoreNet

 

RealTimeConsumers [1000] RTConsumerCluster

RTConsumerCluster [1000] InOutNet

 

DelayedConsumers [1000] DConsumersCluster

DConsumersCluster [1000] IO

DConsumersCluster [1000] InOutNet

 

RemoteConsumers [1000] RConsumersCluster

RConsumersCluster [1000] InOutNet

RConsumersCluster [1000] RemoteStoreNet

 

// Total Network is sum of In/Out, Cluster and Remote network

InOutNet [4000] Net

ClusterNet [2000] Net

RemoteStoreNet [2000] Net
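If you'd like to play with the numbers without rebuilding the spreadsheet, here's a minimal Python sketch of the same cluster-level arithmetic (my own reimplementation under the assumptions listed above, not the original Excel model); all rates are in MB/s:

# Minimal sketch of the Tiered Storage (SSD) model described above; all rates in MB/s
def tiered_storage_model(producer, realtime, delayed, remote, rf=3):
    io = producer                      # producer writes to leader partitions (local SSD)
    io += producer * (rf - 1)          # replication writes on follower brokers
    io += producer                     # reads of closed segments for copying to remote storage
    io += delayed                      # delayed consumers read from local storage
    in_out_net = producer + realtime + delayed + remote   # traffic in/out of the cluster
    cluster_net = producer * (rf - 1)                      # leader -> follower replication traffic
    remote_net = producer + remote                         # tiering writes + remote consumer reads
    return io, in_out_net + cluster_net + remote_net

# "Representative" cluster: 1,000 MB/s in, fan-out of 3 (1,000 MB/s per consumer type)
print(tiered_storage_model(producer=1000, realtime=1000, delayed=1000, remote=1000))
# -> (5000, 8000), matching the I/O and network totals above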

Cheese (like clusters) comes in different sizes (Source: Adobe Stock)

5. Local vs. remote storage sizing

Now that we have a complete model of remote storage, we can compare this with a complete model of local storage and see what the size difference is likely to be.

For local-only storage, there are no remote workloads (writes to remote storage, or consumers from remote storage), and we assume a 50/50 split of real-time and delayed consumers and fan-out of 3 still (i.e. the total consumer workload is still 1,500 + 1,500 = 3,000 MB/s but split across two not three workload types as in the remote storage model). Here’s the local storage model:

What overhead is there for Tiered Storage in this scenario? The following graph shows the comparison between network (6,000 vs. 8,000 MB/s) and I/O (4,500 vs. 5,000 MB/s) for local vs. remote storage, revealing an overhead of 33% for network and 11% for I/O.

6. Scaling the remote consumer workload

As I mentioned in the introduction, the “average” cluster may not in fact exist in our top 10 clusters, and there is a significant variation in workloads and cluster sizes across the top 10.

Having a Kafka cluster with remote storage enabled and lots of records stored on cloud native storage opens up the possibilities for new consumer workload use cases, including replaying and reprocessing a larger number of historical records, or migrating all the records to a new sink system, etc.

In common with the delayed consumer workload in our model, the remote consumer workload is also not limited to the incoming producer rate—it can in theory be faster. Let’s explore what happens when the remote consumer workload rate is significantly higher.

Let’s assume a similar real-time consumer workload as the local scenario above, but no delayed workload. In theory, with remote storage, you could reduce the local retention rate and therefore the size of local storage (SSDs in this case) to close to zero and rely entirely on the remote storage for processing older records.

Let’s therefore increase the remote consumer workload rate substantially to 10,000 MB/s (x10 the producer input rate). As there is a lot of data in remote storage it will take longer to read and process than the original producer/real-time workloads, so this assumption is more realistic for processing lots of historical data. This gives a fan-out ratio of 11.5.

The network load rate has jumped substantially to 25,500 MB/s, around 3x the original default Tiered Storage scenario network usage or 4x the local only scenario. This is logical, as increasing the remote consumer workload will result in more network load (as the read path is consumer → broker → remote storage). And more CPU will also be needed.
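As a quick sanity check, plugging this scenario into the model sketch from earlier (again, my own reimplementation rather than the original spreadsheet) reproduces the same network figure:

# No delayed consumers, 1,500 MB/s real-time, 10,000 MB/s remote consumer workload
print(tiered_storage_model(producer=1000, realtime=1500, delayed=0, remote=10000))
# -> (4000, 25500): 4,000 MB/s of I/O and 25,500 MB/s of network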

Apache Kafka can sometimes be…Kafkaesque! (Source: Bing AI Generated)

7. Scaling in practice

But watch out! To significantly increase the remote consumer workload throughput, you need to have previously increased the number of topic partitions—in fact, before the data was written to remote storage.

This is because for higher consumer concurrency you need more consumers and partitions, and even though remote storage doesn’t “use” partitions, it only stores the data as it existed on local storage initially— i.e. with the partitions that originally existed.

So, increasing the number of consumers reading the data back to more than the number of partitions that the remote storage segments were written with won't increase the read concurrency—it will just mean some consumers time out due to read starvation.

However, it will work if there are sufficient partitions in the original data, sufficient consumers and consumer resources, and importantly, as long as there are sufficient cluster resources including CPU and network.

Note that this partition constraint is common to all consumer workload types (real-time and delayed). The difference is that cloud-native storage such as AWS S3 is elastic and scalable, provided the load increases gradually, stays within the same region, and uses sufficient connections and error handling. SSDs, on the other hand, have fixed, non-elastic I/O limits (although real-time consumer workloads are likely faster because they read directly from the kernel page cache).

Also, remember that consuming from remote storage will increase read latencies (see Part 2 of my Kafka Tiered Storage series) which will necessitate an increase in the number of partitions and consumers to achieve higher throughput.

And if you are using Kafka® Connect for a scenario like migrating data from an existing Kafka cluster to a new sink system, then you may need a bigger Kafka Connect cluster, more connector workers/tasks, and to check that the sink system can cope with the expected load.

For more about scaling Kafka Connect, check out my real-time zero-code data pipeline series here.

Conclusion and what’s next

By leveraging performance models and diagrams, we’ve successfully compared local and remote storage setups, showcasing the trade-offs in scalability and resource usage. Scaling workloads, particularly for remote consumers accessing historical data, emphasizes the importance of planning partitions and cluster resources.

That’s all for now! In the next part of this series, we’ll explore a Kafka Tiered Storage model using AWS EBS for local storage (c.f. SSD), and more.

This is the first part in my new series on sizing clusters for Kafka Tiered Storage. Check out my companion series exploring Tiered Storage from the ground up–and why it’s more like a dam than a fountain!

Part 1: Introduction to Kafka Tiered Storage

Part 2: Tiered Storage performance

Part 3: Kafka time and space

Part 4: Tiered Storage use cases

Want to create a NetApp Instaclustr Apache Kafka cluster with Tiered Storage? See our support documentation Using Kafka Tiered Storage. We recently added new metrics for Tiered Storage clusters, making the feature even cooler!

Try it out yourself and spin up your first cluster on the Instaclustr Managed Platform with a free no-obligation 30-day trial of developer-sized instances.

Update!

I’ve put some of the functionality from the Excel model into a prototype JavaScript/HTML Kafka sizing calculator available from GitHub. This includes EBS modelling which you can read about in the next blog.

The post How to size Apache Kafka® clusters for Tiered Storage: Part 1–A Kafka performance model for SSDs, network, and I/O appeared first on Instaclustr.

]]>
Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® https://www.instaclustr.com/blog/introduction-to-similarity-search-with-word-embeddings-part-1/ Wed, 05 Mar 2025 00:34:31 +0000 https://www.instaclustr.com/?p=15949 Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines. They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as...

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

]]>
Word embeddings have revolutionized how we approach tasks like natural language processing, search, and recommendation engines.

They allow us to convert words and phrases into numerical representations (vectors) that capture their meaning based on the context in which they appear. Word embeddings are especially useful for tasks where traditional keyword searches fall short, such as finding semantically similar documents or making recommendations based on textual data.

scatter plot graph

For example: a search for “Laptop” might return results related to “Notebook” or “MacBook” when using embeddings (as opposed to something like “Tablet”) offering a more intuitive and accurate search experience.

As applications increasingly rely on AI and machine learning to drive intelligent search and recommendation engines, the ability to efficiently handle word embeddings has become critical. That’s where databases like Apache Cassandra come into play—offering the scalability and performance needed to manage and query large amounts of vector data.

In Part 1 of this series, we’ll explore how you can leverage word embeddings for similarity searches using Cassandra 4 and OpenSearch. By combining Cassandra’s robust data storage capabilities with OpenSearch’s powerful search functions, you can build scalable and efficient systems that handle both metadata and word embeddings.

Cassandra 4 and OpenSearch: A partnership for embeddings

Cassandra 4 doesn’t natively support vector data types or specific similarity search functions, but that doesn’t mean you’re out of luck. By integrating Cassandra with OpenSearch, an open-source search and analytics platform, you can store word embeddings and perform similarity searches using the k-Nearest Neighbors (kNN) plugin.

This hybrid approach is advantageous over relying on OpenSearch alone because it allows you to leverage Cassandra’s strengths as a high-performance, scalable database for data storage while using OpenSearch for its robust indexing and search capabilities.

Instead of duplicating large volumes of data into OpenSearch solely for search purposes, you can keep the original data in Cassandra. OpenSearch, in this setup, acts as an intelligent pointer, indexing the embeddings stored in Cassandra and performing efficient searches without the need to manage the entire dataset directly.

This approach not only optimizes resource usage but also enhances system maintainability and scalability by segregating storage and search functionalities into specialized layers.

Deploying the environment

To set up your environment for word embeddings and similarity search, you can leverage the Instaclustr Managed Platform, which simplifies deploying and managing your Cassandra cluster and OpenSearch. Instaclustr takes care of the heavy lifting, allowing you to focus on building your application rather than managing infrastructure. In this configuration, Cassandra serves as your primary data store, while OpenSearch handles vector operations and similarity searches.

Here’s how to get started:

  1. Deploy a managed Cassandra cluster: Start by provisioning your Cassandra 4 cluster on the Instaclustr platform. This managed solution ensures your cluster is optimized, secure, and ready to store non-vector data.
  2. Set up OpenSearch with kNN plugin: Instaclustr also offers a fully managed OpenSearch service. You will need to deploy OpenSearch, with the kNN plugin enabled, which is critical for handling word embeddings and executing similarity searches.

By using Instaclustr, you gain access to a robust platform that seamlessly integrates Cassandra and OpenSearch, combining Cassandra’s scalable, fault-tolerant database with OpenSearch’s powerful search capabilities. This managed environment minimizes operational complexity, so you can focus on delivering fast and efficient similarity searches for your application.

Preparing the environment

Now that we’ve outlined the environment setup, let’s dive into the specific technical steps to prepare Cassandra and OpenSearch for storing and searching word embeddings.

Step 1: Setting up Cassandra

In Cassandra, we’ll need to create a table to store the metadata. Here’s how to do that:

  1. Create the keyspace and table:
    Next, create a table to store the file and paragraph metadata. This table will hold details such as the file name, the related text, and a last-updated timestamp:

CREATE KEYSPACE IF NOT EXISTS aisearch WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE aisearch;

DROP TABLE IF EXISTS file_metadata;

CREATE TABLE IF NOT EXISTS file_metadata (
  id UUID,
  paragraph_uuid UUID,
  filename TEXT,
  text TEXT,
  last_updated timestamp,
  PRIMARY KEY (id, paragraph_uuid)
);

Step 2: Configuring OpenSearch

In OpenSearch, you’ll need to create an index that supports vector operations for similarity search. Here’s how you can configure it:

  1. Create the index:
    Define the index settings and mappings, ensuring that vector operations are enabled and that the correct space type (e.g., L2) is used for similarity calculations.

{ 
  "settings": { 
   "index": { 
     "number_of_shards": 2, 
      "knn": true, 
      "knn.space_type": "l2" 
    } 
  }, 
  "mappings": { 
    "properties": { 
      "file_uuid": { 
        "type": "keyword" 
      }, 
      "paragraph_uuid": { 
        "type": "keyword" 
      }, 
      "embedding": { 
        "type": "knn_vector", 
        "dimension": 300 
      } 
    } 
  } 
}

This index configuration is optimized for storing and searching embeddings using the k-Nearest Neighbors algorithm, which is crucial for similarity search.
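If you prefer to create the index programmatically, here's a hedged sketch using the opensearch-py client. The endpoint, credentials, and the index name (index_name) are placeholders you'll need to replace with your own cluster details:

from opensearchpy import OpenSearch

# Placeholder connection details for your managed OpenSearch cluster
os_client = OpenSearch(
    hosts=[{'host': 'xxx.xxx.xxx.xxx', 'port': 9200}],
    http_auth=('icopensearch', 'replace_with_your_password'),
    use_ssl=True,
    verify_certs=True
)

index_name = 'file-embeddings'  # placeholder index name

index_body = {
    "settings": {
        "index": {"number_of_shards": 2, "knn": True, "knn.space_type": "l2"}
    },
    "mappings": {
        "properties": {
            "file_uuid": {"type": "keyword"},
            "paragraph_uuid": {"type": "keyword"},
            "embedding": {"type": "knn_vector", "dimension": 300}
        }
    }
}

# Create the index only if it doesn't already exist
if not os_client.indices.exists(index=index_name):
    os_client.indices.create(index=index_name, body=index_body)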

With these steps, your environment will be ready to handle word embeddings for similarity search using Cassandra and OpenSearch.

Generating embeddings with FastText

Once you have your environment set up, the next step is to generate the word embeddings that will drive your similarity search. For this, we’ll use FastText, a popular library from Facebook’s AI Research team that provides pre-trained word vectors. Specifically, we’re using the crawl-300d-2M model, which offers 300-dimensional vectors for millions of English words.

Step 1: Download and load the FastText model

To start, you’ll need to download the pre-trained model file. This can be done easily using Python and the requests library. Here’s the process:

1. Download the FastText model: The FastText model is stored in a zip file, which you can download from the official FastText website. The following Python script will handle the download and extraction:

import requests 
import zipfile 
import os 

# Adjust file_url  and local_filename  variables accordingly 
file_url = 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip' 
local_filename = '/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec.zip' 
extract_dir = '/content/gdrive/MyDrive/0_notebook_files/model/' 

def download_file(url, filename): 
    with requests.get(url, stream=True) as r: 
        r.raise_for_status() 
        os.makedirs(os.path.dirname(filename), exist_ok=True) 
        with open(filename, 'wb') as f: 
            for chunk in r.iter_content(chunk_size=8192): 
                f.write(chunk) 
 

def unzip_file(filename, extract_to): 
    with zipfile.ZipFile(filename, 'r') as zip_ref: 
        zip_ref.extractall(extract_to) 

# Download and extract 
download_file(file_url, local_filename) 
unzip_file(local_filename, extract_dir)

2. Load the model: Once the model is downloaded and extracted, you’ll load it using Gensim’s KeyedVectors class. This allows you to work with the embeddings directly: 

from gensim.models import KeyedVectors 

# Adjust model_path variable accordingly
model_path = "/content/gdrive/MyDrive/0_notebook_files/model/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(model_path, binary=False)

Step 2: Generate embeddings from text

With the FastText model loaded, the next task is to convert text into vectors. This process involves splitting the text into words, looking up the vector for each word in the FastText model, and then averaging the vectors to get a single embedding for the text.

Here’s a function that handles the conversion:

import numpy as np 
import re 

def text_to_vector(text): 
    """Convert text into a vector using the FastText model.""" 
    text = text.lower() 
    words = re.findall(r'\b\w+\b', text) 
    vectors = [fasttext_model[word] for word in words if word in fasttext_model.key_to_index] 

    if not vectors: 
        print(f"No embeddings found for text: {text}") 
        return np.zeros(fasttext_model.vector_size) 

    return np.mean(vectors, axis=0)

This function tokenizes the input text, retrieves the corresponding word vectors from the model, and computes the average to create a final embedding.

Step 3: Extract text and generate embeddings from documents

In real-world applications, your text might come from various types of documents, such as PDFs, Word files, or presentations. The following code shows how to extract text from different file formats and convert that text into embeddings:

import uuid 
import mimetypes 
import pandas as pd 
from pdfminer.high_level import extract_pages 
from pdfminer.layout import LTTextContainer 
from docx import Document 
from pptx import Presentation 

def generate_deterministic_uuid(name): 
    return uuid.uuid5(uuid.NAMESPACE_DNS, name) 

def generate_random_uuid(): 
    return uuid.uuid4() 

def get_file_type(file_path): 
    # Guess the MIME type based on the file extension 
    mime_type, _ = mimetypes.guess_type(file_path) 
    return mime_type 

def extract_text_from_excel(excel_path): 
    xls = pd.ExcelFile(excel_path) 
    text_list = [] 

    for sheet_index, sheet_name in enumerate(xls.sheet_names):
        df = xls.parse(sheet_name)
        for row in df.iterrows():
            text_list.append((" ".join(map(str, row[1].values)), sheet_index + 1))  # +1 to make it a 1-based index

    return text_list

def extract_text_from_pdf(pdf_path): 
    return [(text_line.get_text().strip().replace('\xa0', ' '), page_num) 
            for page_num, page_layout in enumerate(extract_pages(pdf_path), start=1) 
            for element in page_layout if isinstance(element, LTTextContainer) 
            for text_line in element if text_line.get_text().strip()] 

def extract_text_from_word(file_path): 
    doc = Document(file_path) 
    return [(para.text, (i == 0) + 1) for i, para in enumerate(doc.paragraphs) if para.text.strip()] 

def extract_text_from_txt(file_path): 
    with open(file_path, 'r') as file: 
        return [(line.strip(), 1) for line in file.readlines() if line.strip()] 

def extract_text_from_pptx(pptx_path): 
    prs = Presentation(pptx_path) 
    return [(shape.text.strip(), slide_num) for slide_num, slide in enumerate(prs.slides, start=1) 
            for shape in slide.shapes if hasattr(shape, "text") and shape.text.strip()] 

def extract_text_with_page_number_and_embeddings(file_path, embedding_function=text_to_vector): 
    file_uuid = generate_deterministic_uuid(file_path) 
    file_type = get_file_type(file_path) 

    extractors = { 
        'text/plain': extract_text_from_txt, 
        'application/pdf': extract_text_from_pdf, 
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document': extract_text_from_word, 
        'application/vnd.openxmlformats-officedocument.presentationml.presentation': extract_text_from_pptx, 
        'application/zip': lambda path: extract_text_from_pptx(path) if path.endswith('.pptx') else [], 
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': extract_text_from_excel, 
        'application/vnd.ms-excel': extract_text_from_excel
    }

    text_list = extractors.get(file_type, lambda _: [])(file_path) 

    return [ 
      { 
          "uuid": file_uuid, 
          "paragraph_uuid": generate_random_uuid(), 
          "filename": file_path, 
          "text": text, 
          "page_num": page_num, 
          "embedding": embedding 
      } 
      for text, page_num in text_list 
      if (embedding := embedding_function(text)).any()  # Check if the embedding is not all zeros 
    ] 

# Replace the file path with the one you want to process 

file_path = "../../docs-manager/Cassandra-Best-Practices.pdf"
paragraphs_with_embeddings = extract_text_with_page_number_and_embeddings(file_path)

This code handles extracting text from different document types, generating embeddings for each text chunk, and associating them with unique IDs.

With FastText set up and embeddings generated, you’re now ready to store these vectors in OpenSearch and start performing similarity searches.

Performing similarity searches

To conduct similarity searches, we utilize the k-Nearest Neighbors (kNN) plugin within OpenSearch. This plugin allows us to efficiently search for the most similar embeddings stored in the system. Essentially, you’re querying OpenSearch to find the closest matches to a word or phrase based on your embeddings.

For example, if you’ve embedded product descriptions, using kNN search helps you locate products that are semantically similar to a given input. This capability can significantly enhance your application’s recommendation engine, categorization, or clustering.

This setup with Cassandra and OpenSearch is a powerful combination, but it’s important to remember that it requires managing two systems. As Cassandra evolves, the introduction of built-in vector support in Cassandra 5 simplifies this architecture. But for now, let’s focus on leveraging both systems to get the most out of similarity searches.

Example: Inserting metadata in Cassandra and embeddings in OpenSearch

In this example, we use Cassandra 4 to store metadata related to files and paragraphs, while OpenSearch handles the actual word embeddings. By storing the paragraph and file IDs in both systems, we can link the metadata in Cassandra with the embeddings in OpenSearch.

We first need to store metadata such as the file name, paragraph UUID, and other relevant details in Cassandra. This metadata will be crucial for linking the data between Cassandra, OpenSearch, and the file itself in the filesystem.

The following code demonstrates how to insert this metadata into Cassandra and the embeddings into OpenSearch. Make sure to run the previous script first so that the "paragraphs_with_embeddings" variable is populated:

from tqdm import tqdm 

# Function to insert data into both Cassandra and OpenSearch 
def insert_paragraph_data(session, os_client, paragraph, keyspace_name, index_name): 
    # Insert into Cassandra 
    cassandra_result = insert_with_retry( 
        session=session, 
        id=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        text=paragraph['text'], 
        filename=paragraph['filename'], 
        keyspace_name=keyspace_name, 
        max_retries=3, 
        retry_delay_seconds=1 
    ) 

    if not cassandra_result: 
        return False  # Stop further processing if Cassandra insertion fails 

    # Insert into OpenSearch 
    opensearch_result = insert_embedding_to_opensearch( 
        os_client=os_client, 
        index_name=index_name, 
        file_uuid=paragraph['uuid'], 
        paragraph_uuid=paragraph['paragraph_uuid'], 
        embedding=paragraph['embedding'] 
    ) 

    if opensearch_result is not None: 
        return False  # Return False if OpenSearch insertion fails 

    return True  # Return True on success for both 

# Process each paragraph with a progress bar 
print("Starting batch insertion of paragraphs.") 

for paragraph in tqdm(paragraphs_with_embeddings, desc="Inserting paragraphs"): 
    if not insert_paragraph_data( 
        session=session, 
        os_client=os_client, 
        paragraph=paragraph, 
        keyspace_name=keyspace_name, 
        index_name=index_name 
    ): 

        print(f"Insertion failed for UUID {paragraph['uuid']}: {paragraph['text'][:50]}...") 

print("Batch insertion completed.")

Performing similarity search

Now that we’ve stored both metadata in Cassandra and embeddings in OpenSearch, it’s time to perform a similarity search. This step involves searching OpenSearch for embeddings that closely match a given input and then retrieving the corresponding metadata from Cassandra.

The process is straightforward: we start by converting the input text into an embedding, then use the k-Nearest Neighbors (kNN) plugin in OpenSearch to find the most similar embeddings. Once we have the results, we fetch the related metadata from Cassandra, such as the original text and file name.

Here’s how it works:

  1. Convert text to embedding: Start by converting your input text into an embedding vector using the FastText model. This vector will serve as the query for our similarity search.
  2. Search OpenSearch for similar embeddings: Using the KNN search capability in OpenSearch, we find the top k most similar embeddings. Each result includes the corresponding file and paragraph UUIDs, which help us link the results back to Cassandra.
  3. Fetch metadata from Cassandra: With the UUIDs retrieved from OpenSearch, we query Cassandra to get the metadata, such as the original text and file name, associated with each embedding.

The following code demonstrates this process:

import uuid 
from IPython.display import display, HTML 

def find_similar_embeddings_opensearch(os_client, index_name, input_embedding, top_k=5): 
    """Search for similar embeddings in OpenSearch and return the associated UUIDs.""" 
    query = { 
        "size": top_k, 
        "query": { 
            "knn": { 
                "embedding": { 
                    "vector": input_embedding.tolist(), 
                    "k": top_k 
                } 
            } 
        } 
    }

    response = os_client.search(index=index_name, body=query)

    similar_uuids = [] 
    for hit in response['hits']['hits']: 
        file_uuid = hit['_source']['file_uuid'] 
        paragraph_uuid = hit['_source']['paragraph_uuid'] 
        similar_uuids.append((file_uuid, paragraph_uuid))  

    return similar_uuids 

def fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name): 
    """Fetch the metadata (text and filename) from Cassandra based on UUIDs.""" 
    file_uuid = uuid.UUID(file_uuid) 
    paragraph_uuid = uuid.UUID(paragraph_uuid) 

    query = f""" 
    SELECT text, filename 
    FROM {keyspace_name}.file_metadata 
    WHERE id = ? AND paragraph_uuid = ?; 
    """ 
    prepared = session.prepare(query) 
    bound = prepared.bind((file_uuid, paragraph_uuid)) 
    rows = session.execute(bound)    

    for row in rows: 
        return row.filename, row.text 
    return None, None 

# Input text to find similar embeddings 
input_text = "place" 

# Convert input text to embedding 
input_embedding = text_to_vector(input_text) 

# Find similar embeddings in OpenSearch 
similar_uuids = find_similar_embeddings_opensearch(os_client, index_name=index_name, input_embedding=input_embedding, top_k=10) 

# Fetch and display metadata from Cassandra based on the UUIDs found in OpenSearch 
for file_uuid, paragraph_uuid in similar_uuids: 
    filename, text = fetch_metadata_from_cassandra(session, file_uuid, paragraph_uuid, keyspace_name)

    if filename and text: 
        html_content = f""" 
        <div style="margin-bottom: 10px;"> 
            <p><b>File UUID:</b> {file_uuid}</p> 
            <p><b>Paragraph UUID:</b> {paragraph_uuid}</p> 
            <p><b>Text:</b> {text}</p> 
            <p><b>File:</b> {filename}</p> 
        </div> 

        <hr/> 
        """ 

        display(HTML(html_content))

This code demonstrates how to find similar embeddings in OpenSearch and retrieve the corresponding metadata from Cassandra. By linking the two systems via the UUIDs, you can build powerful search and recommendation systems that combine metadata storage with advanced embedding-based searches.

Conclusion and next steps: A powerful combination of Cassandra 4 and OpenSearch

By leveraging the strengths of Cassandra 4 and OpenSearch, you can build a system that handles both metadata storage and similarity search. Cassandra efficiently stores your file and paragraph metadata, while OpenSearch takes care of embedding-based searches using the k-Nearest Neighbors algorithm. Together, these two technologies enable powerful, large-scale applications for text search, recommendation engines, and more.

Coming up in Part 2, we’ll explore how Cassandra 5 simplifies this architecture with built-in vector support and native similarity search capabilities.

Ready to try vector search with Cassandra and OpenSearch? Spin up your first cluster for free on the Instaclustr Managed Platform and explore the incredible power of vector search.

The post Introduction to similarity search with word embeddings: Part 1–Apache Cassandra® 4.0 and OpenSearch® appeared first on Instaclustr.

]]>
OpenSearch ® 102: How to set up full-text search in Node.js https://www.instaclustr.com/blog/opensearch-102-how-to-set-up-full-text-search-in-node-js/ Wed, 12 Feb 2025 13:57:55 +0000 https://www.instaclustr.com/?p=15901 Ready for another round of OpenSearch lessons? Let’s dive into OpenSearch 102 where we’re learn how to set up full-text search! You’ve done a great job with OpenSearch 101. You’ve set up an OpenSearch cluster, connected to with Node.js, indexed your data, and even initiated a lexical search. Remember when you searched for the term...

The post OpenSearch ® 102: How to set up full-text search in Node.js appeared first on Instaclustr.

]]>
Ready for another round of OpenSearch lessons? Let's dive into OpenSearch 102, where we'll learn how to set up full-text search!

You’ve done a great job with OpenSearch 101. You’ve set up an OpenSearch cluster, connected to with Node.js, indexed your data, and even initiated a lexical search. Remember when you searched for the term ‘manager’ in the title field? OpenSearch was only looking for that exact word. This, my friends, is a lexical term search.

No wiggle room for misspellings! This is a rapid-fire search mechanism that seeks out exact matches of the search term in a tokenized version of the search text, leaving no room for errors. Term search has a wide range of use cases, many of which you use in your day-to-day life on the web, such as autocomplete and type-ahead functionality. There are, however, some ways to integrate wiggle room into your term searches; some allow wildcard or regex matching, and some implement algorithms to give a "fuzzy" search that can help with misspellings, for instance.

However, if you want results that not only exactly match your search terms but are also similar to them, then welcome to the world of lexical full-text search. A tad more lenient, full-text search tokenizes not only the search text but also the search term, and returns any relevant intersections of the two groups of tokens. Because these intersections have different levels of overlap, a confidence score is returned: the level of confidence that this overlap is what you're looking for, based only on the tokenized texts it has.

Now that we’ve covered the vocabulary, let’s dig into some code. Open the OpenSearch-101 code folder that you either created from the 101 post or get it from the GitHub. (P.S. You can also set up full-text search with PostgreSQL!)

First, you’re going to need a few things to get started with semantic search. Dummy text that you can perform more complex searches on would be a good starting point. Create a placeholder_text.txt file in your code directory and paste the text from here into the file. Now you have many (well, 8) paragraphs of text to index and search. It is not a terribly interesting read for us humans, but that’s not what we’re here for.

Now you’re going to clean up the code to match our new use case.

First, at the top of the file with the other require statements, you’ll want to require in the ‘fs’ module, which allows us to read from the file system. It should look like this:

'use strict' 

const fs = require('fs')
require('dotenv').config() 

var host = process.env.OPENSEARCH_HOST

Next, use find and replace to replace ‘devrel-team’ with ‘opensearch-102’. This will change the index name, etc. to opensearch-102, as you can see in this example:

const start = async () => { 
  try { 
    // Check the cluster health 
    const clusterHealthResponse = await client.cluster.health({}) 
    printResponse('Get Cluster Health', clusterHealthResponse) 
    // Check if the 'opensearch-102' index exists 
    const indexExistsResponse = await client.indices.exists({ index: 'opensearch-102' }) 
    if (indexExistsResponse.statusCode === 200) { 
      // Delete the 'opensearch-102' index if it exists 
      const deleteIndexResponse = await client.indices.delete({ index: 'opensearch-102' }) 
      printResponse('Delete existing `opensearch-102` Index', deleteIndexResponse)
    }
  } catch (error) {
    console.error('Error:', error.message)
  }
}

Now, you’re going to remove the cluster health call; it’s not necessary for this demo. That’s these three lines:

// Check the cluster health 
const clusterHealthResponse = await client.cluster.health({}) 
printResponse('Get Cluster Health', clusterHealthResponse)

The code remains the same until you get to the start() function. Before you run this function, you want to pull in the placeholder text with the ‘fs’ module you included earlier. To do this, we’ll use fs.readFileSync() and then use toString() and split() to end up with an array of paragraphs. This code looks like:

// pull the dummy text into an array that we can pass to the OpenSearch cluster 
let placeholderText = fs.readFileSync('./placeholder_text.txt').toString().split('\n')

Now that you have the data, it’s time to index it into your OpenSearch cluster. First, you should set the mapping for the new data. Change the createIndexResponse() function to look like:

const start = async () => { 
  try { // Create the `opensearch-102` index 
    const createIndexResponse = await client.indices.create({ 
      index: 'opensearch-102', 
      body: { 
        mappings: { 
          properties: { 
            lineNumber: { type: 'integer'}, 
            text: { type: 'text' } 
          }, 
        }, 
      }, 
    });
  } catch (error) {
    console.log(error.message);
  }
}

So you’ll be passing in a line number and a line of text for each document you index.

Now that the opensearch-102 index is created, you can index your placeholder data. Go to the indexText() function and modify it to suit the new placeholder data:

const indexText = async () => { 
  await (async () => { 
    for (let i = 0; i < placeholderText.length; i++) { 
      let response = await client.index({ 
        index: 'opensearch-102', 
        id: i, 
        body: { 
          lineNumber: i + 1, 
          text: placeholderText[i] 
        }, 
      });
      printResponse(`Added index ID ${i}:`, response) 
    }  
  })() 
}

The next change comes in the searchText() function. You want to print more than ‘element._source’ in order to see the relevance scores for your full-text search. After the change the function should look as follows:

const searchText = async (query) => { 
  const response = await client.search({ 
    index: 'opensearch-102', 
    body: query, 
  });
  console.log('\nSearch Results:'); 
  response.body.hits.hits.forEach((element) => { 
    console.log(element); 
  }); 
}

Note the change from ‘term’ to ‘match’; this tells OpenSearch that this is not a term search, but a full-text search.

That’s all the code changes; if you want to make sure everything matches, the completed code is in this gist. go ahead and run your program with:

node index.js

You should see many added index results like the following:

opensearch results screenshot

Then you should start to see your search results:

opensearch results screenshot

As you can see, each result comes with a relevance score that tells you how close to the original term the result is.

Try changing the search phrase in the code and see what you can come up with!

In the end, when it comes to lexical searches, you have two major types: term and full-text. While term search is very literal, looking only for exact matches of the search term, full-text search is a bit more flexible, tokenizing both the search term and the search text and adding relevance scores to its results to show how close each match is. Full-text search is also built right into OpenSearch, and you can create a client that makes full-text search calls by changing just a few lines of code.

This wraps up OpenSearch 102! In the next part of our series, OpenSearch 103, we’ll get into dashboards and querying your OpenSearch data. Until then, feel free to spin up your very own OpenSearch cluster with our free trial.

Happy coding!

The post OpenSearch ® 102: How to set up full-text search in Node.js appeared first on Instaclustr.

]]>
Getting started with vector search on Instaclustr for ClickHouse® https://www.instaclustr.com/blog/getting-started-with-vector-search-on-instaclustr-for-clickhouse/ Fri, 24 Jan 2025 01:04:26 +0000 https://www.instaclustr.com/?p=15724 Introduction In the current excitement around Generative AI (GenAI) and Large Language Models (LLMs), the quest for the perfect vector database has become a focal point for many companies! Indeed, vector storage and vector search are both crucial for many GenAI applications – but are dedicated vector databases truly essential for scaling and optimizing GenAI?...

The post Getting started with vector search on Instaclustr for ClickHouse® appeared first on Instaclustr.

]]>
Introduction

In the current excitement around Generative AI (GenAI) and Large Language Models (LLMs), the quest for the perfect vector database has become a focal point for many companies!

Indeed, vector storage and vector search are both crucial for many GenAI applications – but are dedicated vector databases truly essential for scaling and optimizing GenAI? When does investing in these technologies begin to pay off?

Last August, my colleague published an insightful article shining a light on this very phenomenon that suggested part of the answer: what if on the lookout for the latest and greatest in vector database technology, companies are overlooking existing and versatile databases they already trust and use at scale for their critical data workloads? (Think Apache Cassandra 5.0®, OpenSearch® and PostgreSQL®).

The conclusion resonated deeply with me, and it serves as a cornerstone for the discussion in this article:

“The solution to tailored LLM responses isn’t investing in some expensive proprietary vector database and then trying to dodge the very real risks of vendor lock-in or a bad fit. At least it doesn’t have to be.

“Recognizing that available open-source vector databases are among the top options out there for AI development — including some you may already be familiar with or even have on hand — should be a very welcome revelation.”

This perspective is particularly relevant as we explore how ClickHouse can be leveraged for vector search.

The work required to put together this article was driven by the need to highlight open source, multi-purpose, and cost-effective solutions that can empower businesses to harness the full potential of their data without falling into the trap of unnecessary expenditure and dependencies (if not outright vendor lock-in and upskilling investments).

Some background context

Before you jump into the rest of this blog, I recommend that you read through this article. It showcases the paramount role of open source in modern data infrastructure, but with a particular focus on vector databases.

You may have noticed something missing: ClickHouse®!

So, what actually is it?

ClickHouse is an open source, high-performance, columnar-oriented database management system (DBMS) designed for Online Analytical Processing (OLAP) of large datasets. It’s the latest addition to Instaclustr’s managed open source frameworks and databases to help our customers build their advanced data and AI workloads. Our engineering team is ensuring we are aligned with its latest releases, as well as providing the latest features and improvements to our users.

Now that ClickHouse is available on the Instaclustr Managed Platform, it’s time to showcase its amazing capabilities in the vector search space.

Why ClickHouse for vector search?

It’s simple, really: ClickHouse offers several key advantages that make it great for vector search, like:

  1. High Performance: Its columnar storage and efficient compression make ClickHouse ideal for efficient storage and fast queries, enabling rapid data retrieval and analysis.
  2. Scalability and Distributed Architecture: Designed for distribution across multiple nodes, ClickHouse can handle petabytes of data and leverage all available CPU cores and disks.
  3. Compatibility: ClickHouse integrates perfectly with other systems like Apache Kafka® and Apache Spark™, making it easy to incorporate into existing data pipelines and workflows. Additional compatibility with AI frameworks like LangChain and Hugging Face allows integrations within a broader AI ecosystem.
  4. Open Source: Free to use with a strong community and active development.

These points highlight how compelling ClickHouse can be for handling large-scale analytical workloads (and here’s a blog I wrote that explains even more benefits).

ClickHouse has expanded its capabilities to support vector operations, enabling it to handle AI and GenAI tasks such as vector and semantic search without requiring additional infrastructure.

This new functionality allows users to store and query vectors directly within ClickHouse, leveraging its high-performance architecture to perform complex vector-based operations efficiently.
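To make that concrete, here's a hedged sketch (the table name, connection details, and the tiny 3-dimensional toy vectors are placeholders, not a production schema) using the clickhouse-connect Python client to store embeddings in an Array(Float32) column and run a brute-force cosine-distance search:

import clickhouse_connect

# Placeholder connection details for your managed ClickHouse cluster
client = clickhouse_connect.get_client(
    host='xxx.xxx.xxx.xxx', port=8443,
    username='icclickhouse', password='replace_with_your_password', secure=True
)

# Store each passage and its embedding in an Array(Float32) column
client.command("""
    CREATE TABLE IF NOT EXISTS passage_embeddings (
        id UInt64,
        text String,
        embedding Array(Float32)
    ) ENGINE = MergeTree ORDER BY id
""")

client.insert(
    'passage_embeddings',
    [[1, 'first passage', [0.1, 0.2, 0.3]],
     [2, 'second passage', [0.3, 0.2, 0.1]]],
    column_names=['id', 'text', 'embedding']
)

# Exact (brute-force) vector search: order rows by cosine distance to a query vector
query_vector = [0.1, 0.2, 0.25]
vector_literal = ', '.join(map(str, query_vector))
result = client.query(
    f"SELECT id, text, cosineDistance(embedding, [{vector_literal}]) AS dist "
    f"FROM passage_embeddings ORDER BY dist ASC LIMIT 5"
)
print(result.result_rows)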

What can we do with this? Endless opportunities!

Vector search applications are widely used in various domains. For instance, they are essential for real-time Retrieval Augmented Generation (RAG), where integrating vector databases helps retrieve relevant information to enhance LLM outputs. They also improve contextual understanding by enriching generative responses with relevant data stored in the database.

Additionally, they enable cross-modal capabilities, allowing for comparison of multiple data types (e.g., text, images, video). This magic happens thanks to representations (embeddings) that capture the underlying similarities between two pieces of content, going beyond their superficial attributes.

Imagine comparing items based on their deep, intrinsic features rather than just their looks—it’s like judging a book by its content, not its cover! This is a game-changer since we can achieve more nuanced and accurate similarity assessments.

For example, the images accompanying this blog were generated by Microsoft Designer: you enter a phrase and the model generates the images, likely using a vector database to achieve this.

Here’s what pops up when you enter “Vector Search database” or “Vector similarity”:


A “vector search” image generated by Microsoft Designer (Source: Microsoft Designer)

Now, the possibilities are endless and only limited by your imagination. You could create a recommendation system for music based on the books you’re reading, the work you’re doing (like coding or writing futuristic blogs), or even your mood or the weather.

Yes, your playlist could change from Singing in the Rain to Here Comes the Sun (that was an easy one). However, you will still need to create embeddings for all the raw data (books, songs, etc.).

Now, let’s bring the key concepts together using a wildly popular website: Wikipedia. Using Wikipedia, we can develop search systems that fetch relevant information based on a user’s query. I’ll show you how to use Instaclustr Managed ClickHouse to store vector embeddings and perform vector search, and we’ll explore KNN and some other optimization best practices.

Using Instaclustr for ClickHouse to store vectors: a real-life example

With its distributed architecture, ClickHouse can efficiently handle datasets containing millions (or even billions) of embeddings. Instaclustr Managed ClickHouse comes fully equipped and optimized to ensure that the storage and retrieval of high-dimensional vectors are both scalable and performant.

Generating vector embeddings involves using models to convert raw data into numerical vectors that capture the essence of the data. This process is not a built-in function of ClickHouse (although some specialized vector databases offer it as a built-in feature). The widely adopted approach is to preprocess the data and compute embeddings externally using various frameworks.

Options for generating embeddings include using external APIs, which are chargeable services. Additionally, you can rely on seamless integrations with frameworks like Hugging Face and LangChain that allow you to run a model of your choice and create a complete pipeline for embedding generation.

The recommended approach, if you prefer to keep your data private or avoid relying on external services, is to use open source models and frameworks. Be mindful that this process requires some expertise and can carry substantial compute cost, especially if you plan to scale up significantly.
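
As a hedged illustration of that open source route, here is a minimal sketch using the sentence-transformers library; the model name is just an example, and its 384-dimensional output differs from the 768 dimensions of the Cohere model used later in this blog, so you would need to stay consistent with whichever model you embed your corpus with.

# A minimal sketch of local embedding generation with an open source model
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Example model downloaded from Hugging Face; produces 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Linus Torvalds created the Linux kernel.",
    "ClickHouse is a column-oriented OLAP database.",
]

embeddings = model.encode(passages)  # one vector per passage
print(embeddings.shape)              # (2, 384)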

The dataset

Thanks to the vibrant ecosystem and communities with a sharing spirit, companies like Cohere have created millions of Wikipedia article embeddings in different languages (which made this blog possible) and made them freely available as Hugging Face datasets. Using its multilingual embedding model, Cohere broke each article down into passages and calculated an embedding vector for each passage.


Source: The Embedding Archives: Millions of Wikipedia Article Embeddings in Many Languages

Let’s also take a moment to celebrate the incredible role of Hugging Face as the go-to platform for sharing models and datasets. We are excited to highlight that Hugging Face relies on NetApp storage technology to efficiently serve this vast repository of models and datasets, ensuring high performance and reliability.

In what follows, we will use the encoded Wikipedia (Simple English) stored in this Hugging Face dataset, which contains the text, the embedding vector, and additional metadata values. This gives us a simple yet powerful example of how vectors can be stored and used to perform a vector search with Instaclustr’s managed ClickHouse service.

First steps

Before we jump in, if you are new to Instaclustr and want to get hands-on and follow the rest of this blog, please refer to this tutorial to create a free trial account in a few clicks. You can then create a ClickHouse managed cluster under the free trial period. The provisioning may take up to 10 minutes, depending on the AWS region.

I will use DBeaver Community as my companion (as it has been for years now for database development) for all the SQL queries and results you will find below. You can connect DBeaver to your Instaclustr managed ClickHouse by following the guide.

Before creating the table and inserting the data, I spent some time analyzing the data. You can start by familiarizing yourself with the full description of the columns using the Dataset viewer in Hugging Face (screenshot below):

From Cohere’s blog: The emb column contains the embedding of that passage of text (with the title of the article appended to its beginning). This is an array of 768 floats (the embedding dimension of Cohere’s multilingual-22-12 embedding model).1

Jump into your SQL editor and use DESCRIBE to identify the columns and infer their type. ClickHouse’s ability to infer data types is usually sufficient and doesn’t require additional manual effort.

Additionally, you can use the SELECT command to inspect a few lines of the data. This allows you to understand the structure and content of the data, ensuring that the inferred types align with your expectations.

-- Describe the content of the Parquet file
DESCRIBE
url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', 'Parquet')
SETTINGS enable_url_encoding = 0, max_http_get_redirects = 1;

-- Select a few lines to inspect the data in the Parquet file
SELECT *
FROM
url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', 'Parquet')
LIMIT 2
FORMAT Vertical
SETTINGS enable_url_encoding = 0, max_http_get_redirects = 1;

In ClickHouse, the enable_url_encoding setting is used to control whether URL encoding is applied to the URLs used in functions like the url table function. This setting can be important when dealing with URLs that contain special characters or spaces that need to be properly encoded to be valid in HTTP requests.

The max_http_get_redirects setting is used to specify the maximum number of HTTP redirects that the server will follow when making an HTTP GET request. This setting is particularly relevant when ClickHouse is configured to fetch data from external sources over HTTP, such as when using the url table function or the remote table engine. You can set these parameters in your session for all the rest of the remote queries.

The queries above return the inferred column types along with a couple of sample rows.

Based on the inferred data types, we can proceed to create our source table. While starting with a MergeTree table engine is a solid choice, using the id column as the sorting key is somewhat arbitrary (we will discuss some optimizations to consider later on).

CREATE TABLE wiki_emb 
( 
    `id` UInt32, 
    `title` String, 
    `text` String, 
    `url` String, 
    `wiki_id` UInt32, 
    `views` UInt32, 
    `paragraph_id` UInt32, 
    `langs` UInt32, 
    `emb` Array(Float32)
) 
ENGINE = MergeTree 
ORDER BY id;

Inserting data

We will use the url function to read the Parquet files and insert their data into the wiki_emb table. Since you are using the Instaclustr managed service for ClickHouse, this will be done in parallel.

Since the data is still small (1.63 GB), we can do a bulk insert on the free trial Instaclustr managed ClickHouse service using the following query:

SET max_http_get_redirects = 1;
SET enable_url_encoding = 0;

INSERT INTO wiki_emb
SELECT *
FROM (
    SELECT * FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', 'Parquet')
    UNION ALL
    SELECT * FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0001.parquet', 'Parquet')
    UNION ALL
    SELECT * FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0002.parquet', 'Parquet')
    UNION ALL
    SELECT * FROM url('https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings/resolve/refs%2Fconvert%2Fparquet/default/train/0003.parquet', 'Parquet')
) AS data_sources;

We have created the table and inserted the data from the Parquet files containing the embeddings and the additional metadata. We have kept things simple so far by using some basic default choices. This approach gives us a functional baseline, which we can then refine and optimize as we gain more insights into our data and performance requirements.

Next, let’s discuss some optimization considerations to enhance the previous two steps. If we wanted to scale significantly, the Cohere/wikipedia-2023-11-embed-multilingual-v3 dataset, containing the full 2023-11-01 Wikipedia dump in all 300+ languages, is 536 GB; building a full-scale search engine that also covered the extra media content in Wikipedia would require terabytes of embedded data.

Optimizing performance for vector storage and insertions

Besides the usual techniques (indexing, partitioning when necessary, and materialized views) that depend on the applications we want to build on top of these embeddings, there are two points I want to emphasize, one for each of the two previous steps: storage and insertion.

Storage optimization with compression: the compression of floating-point numbers remains an active and evolving area of research due to the unique challenges these data types present. Their variability and lack of repetitive patterns make them less amenable to traditional compression algorithms.

Therefore, don’t expect significant compression ratios from the default LZ4 codec or from codecs such as Delta, which are particularly effective for columns with repetitive or similar values, such as categorical data or time series.

However, slightly better compression ratios for arrays of floating-point numbers can be achieved using ZSTD, which you can specify per column:

ALTER TABLE wiki_emb MODIFY COLUMN emb Array(Float32) CODEC(ZSTD);

Another potential approach commonly used in the AI field is Quantization. While we may not fully test and implement this technique here, we can benefit from its principles.

Quantization involves representing the vectors with lower-precision data types, such as 16-bit floats instead of the usual 32-bit floating points. This can significantly reduce the storage requirements while maintaining an acceptable level of precision for many applications.
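
As a purely illustrative sketch of that principle (not something we apply to the wiki_emb table in this blog), the snippet below quantizes 32-bit embeddings to 16-bit floats with NumPy and checks how little the vectors’ direction changes; whether 16-bit values can be stored natively depends on your ClickHouse version, so treat this as a back-of-the-envelope experiment.

import numpy as np

rng = np.random.default_rng(42)
emb32 = rng.normal(size=(10_000, 768)).astype(np.float32)  # stand-in for real embeddings
emb16 = emb32.astype(np.float16)                           # half the bytes per value

print(f"float32: {emb32.nbytes / 1e6:.1f} MB, float16: {emb16.nbytes / 1e6:.1f} MB")

# How much does quantization change each vector's direction?
a = emb32
b = emb16.astype(np.float32)
cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
print(f"Lowest self-similarity after quantization: {cos.min():.6f}")  # very close to 1.0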

Batch inserts and data management: with ClickHouse, you should ALWAYS use batch inserts to load your data, as this reduces the overhead associated with many small insert operations.

Another good practice is to include a file_name column in the schema containing the name of the original remote Parquet file from which each row was inserted. This lets us track each file’s success during insertion into ClickHouse: if an error occurs during data insertion (network connectivity is a major cause), you can quickly identify which file was not inserted and address it accordingly.
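
Here is a minimal sketch of that pattern, assuming the clickhouse-connect Python client and that a file_name String column has been appended to wiki_emb (for example, ALTER TABLE wiki_emb ADD COLUMN file_name String); the host and credentials are placeholders for your own cluster details.

# pip install clickhouse-connect
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-clickhouse-host",   # placeholder: your Instaclustr cluster endpoint
    username="your-username",
    password="your-password",
    secure=True,
)

BASE_URL = (
    "https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings"
    "/resolve/refs%2Fconvert%2Fparquet/default/train/"
)

failed_files = []
for i in range(4):                      # 0000.parquet .. 0003.parquet
    file_name = f"{i:04d}.parquet"
    insert_sql = f"""
        INSERT INTO wiki_emb
        SELECT *, '{file_name}' AS file_name
        FROM url('{BASE_URL}{file_name}', 'Parquet')
    """
    try:
        # Assumes file_name was added as the last column, so positions line up
        client.command(
            insert_sql,
            settings={"enable_url_encoding": 0, "max_http_get_redirects": 1},
        )
    except Exception as exc:            # e.g. a dropped network connection
        failed_files.append((file_name, str(exc)))

print("Files to retry:", failed_files)

Each file becomes its own batch insert, and any failure is recorded per file so it can simply be retried.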

Using Instaclustr for ClickHouse to search and retrieve similar vectors


Source: Microsoft Designer

In this section, we’ll break down the process of obtaining an embedding and making a ClickHouse request to determine the distance from the query phrase to the closest match. When a user inputs a search phrase like “Who is Linus Torvalds,” we need to convert this phrase into a vector to compare it with the Wikipedia embeddings and find the best match.

Please keep in mind that this blog was built to showcase the vector search capabilities of ClickHouse. By breaking down the process into small sets of basic SQL queries and Python scripts, the goal is to make the concepts accessible and easy to follow.

All the manual work, such as copying and pasting a 768-element embedding, is not only unnecessary but also not recommended outside of this blog! Instead, using a framework like LangChain, we can streamline the workflow by integrating the various components into a pipeline that ties everything together, from taking a query as input to embedding generation, vector search, and result retrieval.

Steps to generate embeddings

To generate embeddings using Cohere’s multilingual-22-12 embedding model, follow the steps below; this example assumes you are using Python and the Cohere API.

You will first need to get an API key from Cohere: sign up at Cohere’s website and get your API key from the dashboard. Cohere API calls made using trial keys are free of charge, rate-limited, and cannot be used for commercial purposes (this blog is intended for informational purposes only).

Here’s a sample code snippet that demonstrates how to do this:

# Install the Cohere Python SDK 
# pip install cohere 
import cohere 

# Initialize the Cohere client with your API key 
api_key = 'your-api-key-here' 
co = cohere.Client(api_key) 

# Define the text you want to generate embeddings for 
text = " Who created Unix " # Replace with your query   

# Generate the embeddings using the multilingual-22-12 model 
response = co.embed( 
    texts=[text], 
    model='multilingual-22-12' 
) 

# Extract the embedding from the response 
embedding = response.embeddings[0] 

# Print the embedding 
print(embedding) 

# Verify the length of the embedding
print(f'Length of embedding: {len(embedding)}')

Output:
[0.12451172, 0.20385742, -0.22717285, 0.39697266, -0.04095459 
… 
0.42578125, 0.23034668, 0.39160156, 0.116760254, 0.046661377, 0.1430664] 
Length of embedding: 768

Querying for vector search

Source: Microsoft Designer

Now comes the exciting part: using these vectors to find the passages most relevant to a given search text. We need to go through all the stored vectors and measure how far each of them is from our search query; ClickHouse calculates the distances, sorts by them, and returns the top results with the smallest distance.

Native vector functions

ClickHouse provides several functions for calculating similarity metrics between vectors directly in SQL queries. The most useful in our context are cosineDistance and L2Distance.

In our case, cosineDistance gives better results than L2Distance:

SELECT 
    title, 
    url, 
    paragraph_id, 
    text, 
    cosineDistance(emb, [Paste the embeddings]) AS distance 
FROM wiki_emb  
ORDER BY distance ASC 
LIMIT 5 
FORMAT Vertical;

Not bad as a first result! With 8 GB of RAM and 4 CPUs, and without any extra tuning or optimization, it took less than a second (0.633 sec) to get the top 5 similar vectors in a table with 485,859 rows. It’s always good to start small and create some baseline metrics.

Now, you might be wondering: if we had a table with 4 billion rows, would it take 10,000 times longer to get the same result?

Spoiler alert: No, it wouldn’t!

ClickHouse is designed to handle massive datasets with impressive efficiency. You can try it out and see for yourself.
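
To close the loop on the pipeline idea from earlier, here is a minimal end-to-end sketch that embeds the question with Cohere and hands the vector straight to ClickHouse, so nothing has to be copied and pasted by hand. It assumes the cohere and clickhouse-connect packages and the wiki_emb table from above; the host and credentials are placeholders.

import cohere
import clickhouse_connect

co = cohere.Client("your-api-key-here")
client = clickhouse_connect.get_client(
    host="your-clickhouse-host",   # placeholder: your Instaclustr cluster endpoint
    username="your-username",
    password="your-password",
    secure=True,
)

def search(question: str, top_k: int = 5):
    # 1. Embed the question with the same model used for the stored passages
    query_vector = co.embed(texts=[question], model="multilingual-22-12").embeddings[0]

    # 2. Render the vector as a ClickHouse array literal and rank by cosine distance
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    result = client.query(f"""
        SELECT title, url, text,
               cosineDistance(emb, {vector_literal}) AS distance
        FROM wiki_emb
        ORDER BY distance ASC
        LIMIT {int(top_k)}
    """)
    return result.result_rows

for title, url, text, distance in search("Who is Linus Torvalds?"):
    print(f"{distance:.4f}  {title}  {url}")

A framework like LangChain can wrap these same two steps behind a retriever interface; the point here is simply that the round trip is a couple of lines of glue code.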

Advantages of using ClickHouse for vector search

Let’s recap some of the benefits of ClickHouse for vector search. Some of these benefits will become even more apparent as we advance in this series:

ClickHouse is a good fit for vector search when you need to combine vector matching with metadata filtering or aggregation, particularly for very large vector datasets that require parallel processing across multiple CPU cores.

If you choose to use (or are already using) ClickHouse for analytics workloads, expanding it to include vector search eliminates the need to deploy and manage additional services, simplifying your tech stack (and making ClickHouse the Swiss Army knife for real-time analytics and AI). To summarize:

  • Leverage existing infrastructure for new workloads.
  • Combine semantic search results with standard SQL analytics in a single query (see the sketch after this list).
  • Highly configurable, allowing you to tailor performance optimizations (e.g., compression, indexing) to your use case.
  • Full control over your data and freedom to customize without vendor lock-in.
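
As a small illustration of combining semantic search with plain SQL analytics, the sketch below reuses the client and vector_literal from the end-to-end example above, filters on the views metadata column, and aggregates per article while ranking by vector distance, all in one ClickHouse query.

# Assumes the same client and a vector_literal built as in the previous sketch
result = client.query(f"""
    SELECT
        title,
        any(url)                                   AS url,
        count()                                    AS matching_passages,
        min(cosineDistance(emb, {vector_literal})) AS best_distance
    FROM wiki_emb
    WHERE views >= 100                  -- plain metadata filter
    GROUP BY title                      -- plain SQL aggregation
    ORDER BY best_distance ASC
    LIMIT 5
""")

for row in result.result_rows:
    print(row)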


Source: Microsoft Designer

Conclusion

ClickHouse offers a performant and cost-effective option that covers most vector storage needs. With its vibrant community and evolving ecosystem, ClickHouse’s vector search capabilities are likely to expand even more, further solidifying its position as a multi-purpose platform for both traditional OLAP and AI-driven workloads.

At NetApp, we address one of the major challenges with ClickHouse: the complexity of setup and maintenance. A dedicated team of experts handles performance, configuration, monitoring, and tuning, helping you get the best from it. This allows you to focus on deriving value for your business without being burdened by the technical intricacies of managing a ClickHouse deployment.

So, if you want to experience Instaclustr for ClickHouse yourself, sign up for a free trial and get started today or reach out to sales@instaclustr.com and discuss your use case.

==

1 From The Embedding Archives: Millions of Wikipedia Article Embeddings in Many Languages by Nils Reimers and Jay Alammar

The post Getting started with vector search on Instaclustr for ClickHouse® appeared first on Instaclustr.

]]>