Top Real-Time Analytics Databases in 2023: Rockset, Apache Druid, ClickHouse and Pinot

February 8, 2023

Shruti Bhat

CPO & SVP Marketing, Rockset

Updated February 2023

We built Rockset with the mission to make real-time analytics easy and affordable in the cloud. We put our users first and obsess about helping them achieve speed, scale and simplicity in their modern real-time data stack (some of which I discuss in depth below). But we, as a team, still take performance benchmarks seriously, because they help us communicate that performance is one of the core product values at Rockset.

Benchmarking Responsibly

We are in complete agreement with Snowflake and Databricks on one thing: anyone who publishes benchmarks should do so in a fair, transparent, and replicable manner. In general, the way vendors conduct themselves during benchmarking is a good signal of how they operate and what their values are. Earlier this week, Imply (one of the companies behind Apache Druid) published what appears to be a tongue-in-cheek blog claiming to be more efficient than Rockset. For the discerning customer, here are the questionable aspects of Imply's benchmark to consider:

Imply has used a hardware configuration that has 20% higher CPU in comparison to Rockset. Good benchmarks aim for hardware parity to show an apples-to-apples comparison.
Rockset’s cloud consumption model allows independently scaling compute & storage. Imply has made inaccurate price-performance claims that misrepresent competitor pricing.

Rockset beat both ClickHouse and Druid on query performance on the Star Schema Benchmark. Rockset is 1.67 times faster than ClickHouse with the same hardware configuration, and 1.12 times faster than Druid, even though Druid used 12.5% more compute.

SSB Benchmark Results

The Star Schema Benchmark (SSB) measures the performance of 13 queries typical of data applications. It is a benchmark based on TPC-H and designed for data warehouse workloads. More recently, it has been used to measure the performance of queries involving aggregations and metrics in column-oriented databases such as ClickHouse and Druid.

To achieve resource parity, we used the same hardware configuration that Altinity used in its last published ClickHouse SSB performance benchmark. The hardware was a single m5.8xlarge Amazon EC2 instance. Imply has also released revised SSB numbers for Druid using a hardware configuration with more vCPU resources. Even so, Rockset was able to beat Druid’s numbers in absolute terms.

We also scaled the dataset size to 100 GB and 600M rows of data, a scale factor of 100, just like Altinity and Imply did. As Altinity and Imply released detailed SSB performance results on denormalized data, we followed suit. This removed the need for query time joins, even though that is something Rockset is well-equipped to handle.

All queries ran in under 88 milliseconds on Rockset, with an aggregate runtime of 664 milliseconds across the entire suite of SSB queries. ClickHouse’s aggregate runtime was 1,112 milliseconds. Druid’s aggregate runtime was 747 milliseconds. With these results, Rockset shows an overall speedup of 1.67 over ClickHouse and 1.12 over Druid.
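The speedup figures follow directly from the aggregate runtimes quoted above. As a minimal sketch, the arithmetic can be reproduced as follows; the variable and function names are illustrative only, and the vCPU counts are taken from the Figure 1 caption.

```python
# Aggregate SSB runtimes in milliseconds, as reported in this post.
AGGREGATE_RUNTIME_MS = {"Rockset": 664, "ClickHouse": 1112, "Druid": 747}

# vCPU counts from the Figure 1 caption (m5.8xlarge vs. c5.9xlarge).
M5_8XLARGE_VCPUS = 32
C5_9XLARGE_VCPUS = 36


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


rockset_ms = AGGREGATE_RUNTIME_MS["Rockset"]
speedup_vs_clickhouse = speedup(AGGREGATE_RUNTIME_MS["ClickHouse"], rockset_ms)
speedup_vs_druid = speedup(AGGREGATE_RUNTIME_MS["Druid"], rockset_ms)

# Fraction of additional vCPUs on the larger instance type.
extra_vcpu_fraction = C5_9XLARGE_VCPUS / M5_8XLARGE_VCPUS - 1

print(f"Speedup over ClickHouse: {speedup_vs_clickhouse:.3f}")
print(f"Speedup over Druid: {speedup_vs_druid:.3f}")
print(f"Extra vCPUs on c5.9xlarge: {extra_vcpu_fraction:.1%}")
```

The ratios come out to roughly 1.67 and 1.125, in line with the figures quoted above, and the 36-vCPU instance has 12.5% more vCPUs than the 32-vCPU one.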


Figure 1: Chart comparing ClickHouse, Druid and Rockset runtimes on SSB. The configuration of m5.8xlarge is 32 vCPUs and 128 GiB of memory. c5.9xlarge is 36 vCPUs and 72 GiB of memory.


Figure 2: Graph showing ClickHouse, Druid and Rockset runtimes on SSB queries.

You can dig further into the configuration and performance enhancements in the Rockset Performance Evaluation on the Star Schema Benchmark whitepaper. This paper provides an overview of the benchmark data and queries, describes the configuration for running the benchmark and discusses the results from the evaluation.

Real-Time Data in the Real World

Car companies measure, optimize and publish how fast they can go from 0-60 mph, but you as the customer test-drive and evaluate a car based on that and a plethora of other dimensions. Similarly, as you choose your real-time solution, here are the technical considerations and the different dimensions to compare Rockset, Apache Druid and ClickHouse on.

Starting from first principles, here are the five characteristics of real-time data that most analytical systems have fundamental problems coping with:

Massive, often bursty data streams. With clickstream or sensor data, the volume can be incredibly high — many terabytes of data per day — as well as incredibly unpredictable, scaling up and down rapidly.
Change data capture streams. It is now possible to continuously capture changes as they happen in an operational database like MongoDB or Amazon DynamoDB. The problem? Most analytics databases, including Apache Druid and ClickHouse, are immutable, meaning that data can’t easily be updated or rewritten. That makes it very difficult for them to stay synced in real time with the OLTP database.
Out-of-order event streams. With real-time streams, data can arrive out of order in time or be re-sent, resulting in duplicates.
Deeply-nested JSON and dynamic schemas. Real-time data streams typically arrive raw and semi-structured, say in the form of a JSON document, with many levels of nesting. Moreover, new fields and columns of data are constantly appearing.
Destination: data apps and microservices. Real-time data streams typically power analytical or data applications. This is an important shift, because developers are now end users, and they tend to iterate and experiment fast, while demanding more flexibility than what was expected of first-generation analytical databases like Apache Druid.
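To make the change data capture and out-of-order points more concrete, here is a minimal sketch of what applying such a stream to a mutable store involves. It is a generic illustration, not how Rockset, Druid or ClickHouse are implemented: the `ChangeEvent` shape, the per-key `seq` version number and the `MutableStore` class are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC event shape; real connectors define their own."""
    key: str                              # primary key of the source row
    seq: int                              # increasing version number per key
    op: str                               # "upsert" or "delete"
    doc: Optional[Dict[str, Any]] = None  # new row contents for upserts


class MutableStore:
    """Toy in-memory store that applies changes in place."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._last_seq: Dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> bool:
        """Apply one event; return False if it was stale or a duplicate."""
        if event.op not in ("upsert", "delete"):
            raise ValueError(f"unknown op: {event.op}")
        if event.seq <= self._last_seq.get(event.key, -1):
            return False  # late or re-sent event, already superseded
        self._last_seq[event.key] = event.seq
        if event.op == "delete":
            self._docs.pop(event.key, None)
        else:
            self._docs[event.key] = dict(event.doc or {})
        return True

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._docs.get(key)


store = MutableStore()
events = [
    ChangeEvent("user-1", seq=2, op="upsert", doc={"score": 20}),
    ChangeEvent("user-1", seq=1, op="upsert", doc={"score": 10}),  # arrives late
    ChangeEvent("user-1", seq=2, op="upsert", doc={"score": 20}),  # duplicate
    ChangeEvent("user-2", seq=1, op="upsert", doc={"score": 5}),
    ChangeEvent("user-2", seq=2, op="delete"),
]
applied = [store.apply(event) for event in events]
```

In this sketch the late and duplicate events for `user-1` are ignored and `user-2` ends up deleted. An insert-only system cannot do this in place and instead has to reconcile multiple versions of a row at query time or through background rewrites.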

Comparing Rockset, Apache Druid and ClickHouse

Given the technical characteristics of real-time data in the real world, here are the useful dimensions on which to compare Rockset, Apache Druid and ClickHouse. Apache Pinot is not included in this comparison table, but it is in a similar category to the other databases: an open-source system with horizontal scaling that was designed during the on-premises era. All competitor comparisons are derived from their documentation as of today.

	Rockset	Apache Druid	ClickHouse
*Setup*
Initial setup	Create cloud account, start ingesting data	Plan capacity, provision and configure nodes on-prem or in cloud	Plan capacity, provision and configure nodes on-prem or in cloud
*Ingesting data*
Ingesting nested JSON	Ingest nested JSON without flattening	Flatten nested JSON	Supports nested JSON, but JSON is commonly flattened
Ingesting CDC streams	Mutable database handles updates, inserts and deletes in place	Insert only	Mostly insert only, with asynchronous updates implemented as ALTER TABLE UPDATE statements
Schema design and partitioning	Ingest data as is with no predefined schema	Schema specified on ingest, partitioning and sorting of data needed to tune performance	Schema specified on table creation
*Transforming data*
Ingest transformations	SQL-based ingest transformations including dbt support	Use ingestion specs for limited ingest filtering	Use materialized views to transform data between tables
Type of ingest rollups	SQL-based rollups with aggregations on any field	Use ingestion specs for specific time-based rollups	Use materialized views to transform data between tables
*Querying data*
Query language	SQL	Druid native language and a parser for SQL-like queries	SQL
Support for JOINs	Supports JOINs	Only broadcast JOINs, with high performance overhead; data is typically denormalized to avoid JOINs	Supports JOINs
*Scaling*
Scaling compute	Independently scale compute in the cloud	Configure and tune multi-node clusters, add nodes for more compute	Configure and tune multi-node clusters, add nodes for more compute
Scaling storage	Independently scale storage in the cloud	Configure and tune multi-node clusters, add nodes for more storage	Configure and tune multi-node clusters, add nodes for more storage
Total cost of ownership	Managed service optimized for cloud efficiency and developer productivity	Requires Apache Druid expert for performance engineering and cost control	Requires ClickHouse expert for performance engineering and cost control

Raw price-performance is definitely important, so we will continue to publish performance results. But in this day and age, cloud efficiency and developer productivity are equally important. Cloud efficiency means never having to overprovision compute or storage, instead scaling them independently based on actual consumption. Real-world data is messy and complex, and Rockset saves users considerable time and effort by eliminating the need to flatten data prior to ingestion. We also ensure users don’t have to denormalize data with a JOIN pattern in mind, because even if these patterns were known in advance, denormalization is costly in terms of user effort and speed of iteration. By indexing every field, we eliminate the need for complex data modeling. And with standard SQL, we aim to truly democratize access to real-time insights. The other area where Rockset shines is that it is built to handle both time-series data streams and CDC streams with updates, inserts and deletes, making it possible to stay in real-time sync with databases like DynamoDB, MongoDB, PostgreSQL and MySQL without any reindexing overhead.
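For readers who have not had to do it, flattening nested JSON before ingestion typically means a preprocessing step along these lines. This is a minimal sketch with a hypothetical `flatten` helper and a made-up sample event; it is not taken from any of the databases' tooling.

```python
from typing import Any, Dict


def flatten(doc: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted column names.

    Lists and empty dicts are left as-is here; a real pipeline would also
    have to decide how to handle arrays (explode, index, or serialize them).
    """
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty nested objects, extending the prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Made-up clickstream event with two levels of nesting.
event = {
    "user": {"id": 42, "geo": {"country": "US", "city": "Austin"}},
    "action": "click",
    "tags": ["promo", "mobile"],
}
flat_event = flatten(event)
```

Here `flat_event` has the columns `user.id`, `user.geo.country`, `user.geo.city`, `action` and `tags`. Every new nested field that appears in the stream changes the set of output columns, which is the maintenance burden that ingesting nested JSON as is avoids.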

In the words of our customer: “Rockset is pure magic. We chose Rockset over Druid, because it requires no planning whatsoever in terms of indexes or scaling. In one hour, we were up and running, serving complex OLAP queries for our live leaderboards and dashboards at very high queries per second. As we grow in traffic, we can just ‘turn a knob’ and Rockset scales with us.”

We’re focused on accelerating our customers’ time to market: “Rockset shrank our 6-month long roadmap into one afternoon,” said one customer. No wonder Imply has embarked on Project Shapeshift in an attempt to get closer to Rockset’s cloud efficiency. However, lifting and shifting datacenter-era tech into the cloud is not an easy endeavor, and we wish them good luck. For a project whose backers claim to care about real-world use cases more than performance, Apache Druid is surprisingly lacking in functionality that actually matters in the real world of real-time data: ease of deployment, ease of use, mutability and ease of scaling. Rockset will continue to innovate to make real-time analytics in the cloud more efficient for users, with a focus on actual customer use cases. Price-performance does matter. Rockset will continue to publish regular benchmarking results, and rest assured we will do our utmost not to misrepresent ourselves or our competitors in this process. Most importantly, we will not mislead our customers. In the meantime, we invite you to test drive Rockset for yourself and experience real-time analytics at cloud scale.
