Serverless Real-time Indexing: A Low Ops Alternative to Elasticsearch

Both Rockset and Elasticsearch are queryable datastores that can store data and serve queries. Both index data and use those indexes to serve queries. Both systems are document-sharded. But that is where the similarities end. Rockset is a serverless real-time indexing database built to exploit cloud elasticity with minimal ops, while Elasticsearch requires specialized expertise and effort to manage the ELK stack.

Ben Hagan, a former solutions architect at Elastic, and Shruti Bhat go over some of the ops considerations for deploying and managing Elasticsearch clusters at scale, and compare and contrast them with Rockset’s serverless ops in the cloud:

  • Collecting real-time events, managing change data capture and denormalization: With Elasticsearch, you manage separate ingestion and input source pipelines to denormalize data. In contrast, Rockset supports click-and-connect integrations for continuously indexing data from MongoDB, DynamoDB, Kafka, S3, and more. It has native support for JOINs, so there is no need to denormalize your data (see the sketch after this list).

  • Configuring clusters and managing node types: When deploying Elasticsearch, it is important to configure master nodes, data nodes, ingest nodes, coordinating nodes, and alerting nodes in your cluster and optimize them for the use case and requirements. In contrast, Rockset is a modern serverless system that is highly optimized for fast queries out of the box.

  • Scaling writes, sharding and re-indexing: Elasticsearch uses a primary-backup model for replication, so each replica re-indexes the data locally again. As your data size grows, you will typically increase the shard count and re-index your data in Elasticsearch. In contrast, Rockset uses RocksDB remote compaction and micro-sharding to eliminate re-indexing overhead.

  • Scaling reads and isolating workloads: Elastic Cloud offers different types of nodes, each with fixed compute-to-memory ratios, such as IO-optimized and storage-optimized nodes, and moving between these requires a data migration. In contrast, Rockset separates compute from storage to allow seamless scaling of reads: you simply increase the compute allocation, in the form of fully isolated virtual compute, for each workload.

  • Managing data durability and performance: Elasticsearch assumes a shared-nothing storage architecture where data durability is guaranteed via replication among data nodes, and you manually configure the resiliency of new writes. Rockset uses the cloud’s storage model, with automatic S3-backed durable storage configured out of the box.
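To make the JOIN point above concrete, here is a minimal sketch of querying across two continuously synced collections with standard SQL over Rockset's REST query endpoint. The collection names, fields, and API server are illustrative assumptions, not details from this talk:

    # A hedged sketch: one SQL JOIN across collections synced from two
    # different sources, with no denormalization step. Collection names,
    # fields, and the API server below are illustrative assumptions.
    import requests

    API_KEY = "..."  # your Rockset API key
    API_SERVER = "https://api.rs2.usw2.rockset.com"  # region-specific; may differ

    SQL = """
    SELECT o.order_id, o.amount, u.name, u.plan
    FROM   commons.orders o                       -- synced continuously from DynamoDB
    JOIN   commons.users  u ON o.user_id = u.id   -- synced continuously from MongoDB
    WHERE  o._event_time > CURRENT_TIMESTAMP() - INTERVAL 1 DAY
    """

    resp = requests.post(
        f"{API_SERVER}/v1/orgs/self/queries",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json={"sql": {"query": SQL}},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]:
        print(row)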


Speakers

Ben Hagan - Ben is a Solutions Architect specializing in real-time, big data and distributed systems, with over 20 years of industry experience. Prior to Rockset, Ben was a Principal Solutions Architect at Elastic, working with Bay Area customers. Before Elastic, Ben built and led the Sales Engineering team at the real-time social data startup DataSift.
Shruti Bhat - Shruti is SVP Product at Rockset. Prior to Rockset, Shruti led Product Management for Oracle Cloud, with a focus on AI, IoT and Blockchain. Previously, Shruti was VP Marketing at Ravello Systems, where she drove the start-up's rapid growth from pre-launch to hundreds of customers and a successful acquisition. Prior to that, she was responsible for launching VMware's vSAN and has led engineering teams at HP and IBM.

Show Notes

Shruti Bhat:

Welcome to the session. Thank you so much for making the time to join us today. I hope everybody is staying safe and healthy in these weird times we live in. Today we're going to be talking about real-time indexing and how you can use it as a serverless low ops alternative to Elasticsearch. My name is Shruti Bhat. I am head of product here at Rockset, and I have with me my co-presenter, Ben Hagan. Ben.

Ben Hagan:

Hi everyone. Yes. I'm a solutions architect here at Rockset, and before that I was a solutions architect at Elastic. So my history over many years has been in big data analytics and scaling distributed systems.

Shruti Bhat:

Awesome. A quick look at the agenda today. Now, most of you are very familiar with search indexing, and you're familiar with Elasticsearch. Today we're going to compare and contrast search indexing with real-time converged indexing. We'll talk about what converged indexing is, how it's similar, how it's different, and how the architecture is set up. And then we'll also go over some of the details of how it's different in terms of operations. When you talk about serverless systems, cloud native systems, there's a huge advantage that we have in the cloud. And we really want to spend some time talking about what that looks like in terms of initial setup and day-two operations. Since Ben has spent many hours helping Elasticsearch customers size their environments, we're lucky to have him lend his expertise here.

Shruti Bhat:

Let's get started. Search indexing, as you all know, has been around for a while. As we looked at where search indexing started, with its roots in text search, and then all the different use cases it's been applied to over time, we set some design goals for Rockset and for designing converged indexing a little differently. One of our primary goals here is: how do we get better scaling in the cloud? The second one is more flexibility, especially now, in the last few years, with how data has changed, how the shape of the data coming from many different places tends to be completely different, and how it's being used for very different types of applications. How do we give you more schema and query flexibility? And the last one is around low ops.

Shruti Bhat:

So given these design goals, just setting the context here: as far as speed and scale are concerned, we're looking at new data being queryable in about two seconds at P95, even if you have millions of writes per second coming in. At the same time, we also want to make sure that queries return in milliseconds, even on terabytes of data. Now, of course, this is possible today with Elasticsearch, right? Elastic is used at very high scale. The challenge, though, is that managing it at that scale becomes very difficult. So when we talk about better scaling, what we mean is: how do we enable this type of scaling in the cloud while making it very easy? The second piece is flexibility. We heard feedback loud and clear that you want to be able to do a lot more complex queries.

Shruti Bhat:

You want to be able to do, for example, standard SQL queries, including JOINs, on whatever your data is, wherever it's coming from. It could be nested JSON coming from MongoDB. It could be Avro coming from Kafka. It could be Parquet coming from S3, or structured data coming from other places. So how can you run any type of complex query on this without having to denormalize your data? That's one of the design goals. And the last one is where we'll spend a lot more time in today's session, which is low ops. When you build a cloud native system, you can enable serverless cloud scaling. And the vectors we're optimizing for are both hardware efficiency and human efficiency in the cloud. For example, memory is very expensive in the cloud, and managing clusters and scaling up and down is painful when you have a lot of bursty workloads. So how can we handle all of that in a simpler way in the cloud?

Shruti Bhat:

These are the three design goals we're going to talk about, and we're going to show you how we achieve them today. But let's take a deep dive into what the difference between the two indexing technologies really is. Search indexing, as you all know, is inverted ... Elasticsearch has an inverted index, and it also has doc value storage, built using Apache Lucene. Lucene has been around for a while. It's open source, and we're all intimately familiar with it. It was originally built for text search and log analytics, and this is something it really shines at. It also means that you have to denormalize your data as you put it in, and you get very fast search and aggregation queries. Converged indexing, though, you can think of as the next generation of indexing. What we do here is combine the search index with the others. So we're combining the inverted index with a row-based index and a column store.

Shruti Bhat:

And then all of this is built on top of a key-value abstraction, not Lucene. This is built on top of RocksDB. What does that lend itself really well to? Because of the kind of flexibility and scale that it gives you, it lends itself really well to real-time analytics and real-time applications. And the advantage here is that if you're doing real-time analytics, you don't need to denormalize your data. And you basically get really fast search, aggregation, and time-based queries because, by default, you now have a time index built here, geo queries because you have a geo index, and your JOINs are also possible. Not only are they possible, they're really fast. So that's, in a nutshell, the difference between search and converged indexing.

Shruti Bhat:

A quick look at what converged indexing looks like under the hood, right? We talked about having your columnar, inverted, and row index in the same system. If you were to conceptually visualize it, you can think of it as the document that comes in being shredded and mapped to many keys and values. And when it's stored as many keys and values, it internally looks like this: you have your column store, you have your row store, and you have your inverted index. And then we store all of this on RocksDB. RocksDB is an embedded key-value store. In fact, ours is the team that built it. If you're not familiar with RocksDB, I'll give you a one-second overview. Our team built RocksDB back at Facebook and open sourced it. Today you will find RocksDB used in Apache Kafka, in Flink, in CockroachDB. Many of the modern, cloud-scale distributed systems use RocksDB.
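To make that "shredded into keys and values" idea concrete, here is a toy sketch, purely illustrative and not Rockset's actual key encoding, of how one document fans out into key-value entries that serve a row index, a column store, and an inverted index at the same time:

    # Toy illustration only; not Rockset's real storage format.
    def shred(doc_id, doc):
        """Fan one document out into key-value entries for three indexes."""
        entries = []
        for field, value in doc.items():
            entries.append((("R", doc_id, field), value))        # row store: doc -> values
            entries.append((("C", field, doc_id), value))        # column store: field -> values
            entries.append((("S", field, value, doc_id), None))  # inverted index: value -> docs
        return entries

    for key, value in shred("doc0", {"name": "igor", "zip": "94107"}):
        print(key, "->", value)

    # Because every entry is an independent key, updating one field means
    # overwriting a handful of keys; the rest of the document is untouched,
    # which is why per-field mutability falls out of this design.

In a real key-value store like RocksDB, those tuple keys would be encoded as byte strings so that all the entries for a given index and field sort next to each other.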

Shruti Bhat:

We also use RocksDB under the hood, and it's a very different representation than what is done in Elasticsearch. One of the big differences here is that because you have these three different types of indexes, we can now have a SQL optimizer that decides in real-time which is the best index to use, and then returns your queries really fast by picking the right index and optimizing your query in real-time. And because this is a key-value store, the other advantage you have is that each and every field is mutable. What does this mutability give you as you scale? It basically means you don't ever have to worry about re-indexing. If you're using, for example, database change streams, you don't have to worry about what happens when you have a lot of upserts coming in, when you have updates, deletes, and inserts in your database coming through change data capture. You don't have to worry about how that's handled in your index. So every individual field being mutable is very powerful as you start scaling your system, as you have massive scale indexes.

Shruti Bhat:

Ben is going to talk about this in more detail as we go through the rest of the presentation. But now that we've talked about converged indexing and search indexing, let's zoom out for a second and say, okay, now we have these two types of indexes, right? You either have search indexing built on Lucene or you have converged indexing built on RocksDB. What else is the difference between Elasticsearch and Rockset? There's another big difference, which is the query language: in the case of Rockset, it's standard SQL, including JOINs. In the case of Elasticsearch, I know you're all familiar with the query DSL. The other thing is that Elastic has a tightly integrated ELK stack, so you would use Kibana for visualization. And of course, Elastic is available on-prem, or it can be cloud hosted.

Shruti Bhat:

The difference on the Rockset side is that, because we support standard SQL, we aren't tied to any one particular visualization tool. We support all the standard SQL tools and SQL visualization tools, like Tableau, Grafana, Redash. You pick your favorite SQL tool. And the other difference is that we have a lot of developers using Rockset to build applications. So we obsess about the developer experience. We obsess about developer workflows and how we integrate into CI/CD. We have things like query Lambdas, which, again, we'll talk about as we do the demo, but you can think of a query Lambda as a saved, named SQL query that you can just trigger and use as an API. All of these concepts are slightly different. We will compare and contrast the approach to managing Elasticsearch or managing Rockset, and see how that impacts your workflows. And by the end of the session, hopefully you'll have a very clear picture of how your world would look different if you were to use Elastic or Rockset.

Shruti Bhat:

All right. Some of the use cases that we see on the Rockset side, right? It's really interesting. When you bring in event streams, when you bring in user behavior data, sensor data, device metrics, again, the kicker here is that you have many different formats, right? You have nested JSON, Avro, Parquet, XML, geo and structured data. How can you search and analyze this type of data in real-time? And what would you use it for? The most common use cases we see are ad optimization, A/B testing, fraud detection, real-time 360. And oftentimes these look like live dashboards. However, it doesn't stop there. When you start doing this in the cloud, you also see more interesting real-time applications coming in. Things like logistics, fleet tracking, gaming leaderboards, personalization, real-time recommendations. This is why, when you think about these types of applications, we also spend a lot of time optimizing not just for live dashboards, but also for developer workflows, and actually making it a core part of your app dev life cycle.

Shruti Bhat:

So I think the big difference is that if you're only using Elasticsearch for text search, or only for log analytics, it probably serves you really well, because that's the core of Elastic, that's what it was built for. And over the last 10 years or so, it's become a fantastic text search engine. But as we're all familiar, over time, Elastic has been used for more and more use cases. And if you're one of those people who's using Elasticsearch for all of the other things, right? Not just text search, but also for real-time analytics across event streams and user behavior, and sensor data, and device metrics, or if you're offloading data from your primary database into Elastic for aggregations. Well, that's where, as you'll see in the rest of the presentation, the advantages that converged indexing brings in will be really interesting for you. With that, let me hand it over to Ben so he can show you a demo of what this actually looks like.

Ben Hagan:

Thanks, Shruti. What I'm going to do is just take you through a brief demonstration of what Rockset actually looks like. For those of you who haven't seen it, just an end-to-end use case, to bring you up to speed on how it works. Hopefully you can see my browser dashboard here. As we've already described, Rockset is a complete, fully managed SaaS environment. So there are no downloads, no installs, no plugins. You simply sign up, create yourself an account, and you're good to go. Now if I click on this collections tab on the left here, what you'll see is this. Collections are effectively analogous to tables or indexes, depending on what system you're using, but analogous to a table in a relational system. Now, much like Elastic, Rockset is effectively an indexing system. And we very much focus on allowing people to get data into the platform easily. As you'll see, each collection has a source associated with it here.

Ben Hagan:

These are integrations into upstream data sources, be it DynamoDB, or S3, or things like Kafka. Most of these are click-and-connect integrations. Let me click into one of these. Let's have a look. If I go into this Kafka feed here, which we'll take a bit of a deep dive on, it gives you some interesting insights as to how Rockset actually works. So first of all, we've got a real-time Kafka stream coming in; it's actually Twitter data, just out of curiosity. And on the left hand side here, you'll see all of the fields that are coming in as part of those documents. You can see we've got about 1,200 fields per document. So it's pretty much standard Twitter data, but this is all deeply nested JSON data in this case.

Ben Hagan:

You can see that Rockset, like Elastic, automatically type maps this data, and it handles messy data well. So you get all the advantages you want from a NoSQL system, where I can change my schema, I can throw in different types, maybe the same value coming in as a string or as an integer. That shouldn't cause any problems or break any inbound ingestion streams. And that's exactly the case in Rockset. So we really see a huge amount of benefit where people don't have to build ingestion pipelines, where you're not having to do the acrobatics of formatting your data and bringing in other sources. It just ingests the data as it is, in whatever format. On the right hand side, this is where we start to see some of the real benefits, above and beyond just a standard NoSQL environment, in that, as Shruti mentioned, what we're doing under the covers is slicing up every key and value from every document coming into the platform, and storing those individually.

Ben Hagan:

But what we actually give you back, as a consumer or a user of this data, is this tabular representation. So it's really cool that you can come in here and look at this nested JSON data, but in a tabular representation. And you can see, I've got these deeply nested objects and arrays. From a usability perspective, it's then really simple to come in and start querying that data, which we'll get onto in a moment. The other key thing here: if I go ahead and create a new collection ... Let's click on create collection. I'm creating a new one. It basically asks, where's my upstream data coming from? Is it going to be from a change data capture stream, or just the API? So on and so forth. If I pick Kafka, because that's what we're looking at today, and I go to the Kafka connect integration here, you can see a number of properties that we can set on a table or on a collection. If I just jump into my Kafka topic, you can see that, much the same way as you can in Elastic, you can actually do all sorts of manipulations on that data as it comes into the platform.

Ben Hagan:

So I can actually run custom mappings and fully fledged SQL on that data, as it comes in, to do things like typecasting. Or maybe I want to do some look-ups in other tables and bring that data in. I can drop fields or documents. So there's all sorts of data manipulation you can do at ingest time, just using native SQL. And then one of the other key parts of a collection is the ability to set a retention policy on it. We'll touch on this in more detail a bit later, but I can drop these documents automatically after a certain period of time: hours, days, weeks, months, whatever you need. So that allows you to, on a per collection basis, just automatically expire data that you no longer want to query. Maybe it's no longer of primary importance, and I don't need to keep it in that hot tier on this instance. Then at the bottom here, we've actually got what instance type we're running on. We're going to come onto this in a lot more detail, so I won't spend too much time here.
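For a rough sense of how this might look programmatically, here is a hedged sketch of creating a collection with an ingest-time SQL transformation and a retention policy via the REST API. The endpoint, payload fields, and SQL are assumptions drawn from Rockset's REST conventions, not an exact contract, so check the Rockset API docs for the real schema:

    # Hedged sketch: endpoint, payload fields, and SQL are illustrative
    # assumptions; verify against the Rockset API documentation.
    import requests

    API_KEY = "..."
    API_SERVER = "https://api.rs2.usw2.rockset.com"  # region-specific

    payload = {
        "name": "tweets",
        "retention_secs": 7 * 24 * 3600,  # auto-expire documents after 7 days
        # Ingest transformation: plain SQL over the incoming stream, used here
        # to typecast one field and drop documents we don't want to keep.
        "field_mapping_query": {
            "sql": """
                SELECT TRY_CAST(i.id AS string) AS id, i.text, i.entities
                FROM   _input i
                WHERE  i.lang = 'en'
            """
        },
    }

    resp = requests.post(
        f"{API_SERVER}/v1/orgs/self/ws/commons/collections",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json=payload,
    )
    resp.raise_for_status()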

Ben Hagan:

Once your collection is created, the data starts flowing in, and the platform keeps it in synchronization. So Rockset will keep synchronized with your upstream data source without you doing any manipulation, any config, any setup. It will do that for you. Let's dig into this collection a bit further. I'm going to hit this query button on the right hand side here. This is where you're effectively dropped into your full SQL interface. And going back to that relational schema representation, you can see here, I can just start querying that data as you would expect with SQL. I can browse across all the different fields of my different collections here, which is nice when doing things like JOINs. You can easily drop these fields in. I've got an example here, on that Twitter data. Three examples, actually. Let's run this first one here.

Ben Hagan:

What you'll see is that this is a simple select, and it's looking at that specific field in the Twitter firehose collection. You'll see it's pulled out this entities field, which is an array of objects in this case. Pretty simple stuff, as I said, looking at the last day. On my second query here, if I run this one, what we're doing is using this UNNEST command, which just blows out the ticker symbol, specifically, that was in those deeply nested arrays of objects, into its own columnar representation, effectively. And using that, we can easily then run JOINs on this nested JSON data. The final query here, what I'm actually doing is a JOIN on this ticker symbol here. And of course, we're looking at our live Kafka data that's coming into the platform in real-time. If I scroll down, you'll just see I'm using a standard SQL JOIN to join to another collection, which happens to be coming from S3 in this case. It could be any other data source.
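A hedged reconstruction of what those second and third queries might look like: UNNEST flattens the nested array of ticker symbols, and a standard SQL JOIN pulls in the reference data synced from S3. The collection and field names here are guesses for illustration, not the exact demo SQL:

    # Illustrative reconstruction; collection and field names are assumptions.
    UNNEST_JOIN_SQL = """
    SELECT sym.text            AS ticker,
           co.company_name,
           co.industry,
           COUNT(*)            AS mentions           -- live count from Kafka
    FROM   "twitter-firehose" t,
           UNNEST(t.entities.symbols) AS sym         -- blow out the nested array
    JOIN   "tickers" co ON sym.text = co.ticker      -- reference data from S3
    WHERE  t._event_time > CURRENT_TIMESTAMP() - INTERVAL 1 DAY
    GROUP BY sym.text, co.company_name, co.industry
    ORDER BY mentions DESC
    """
    # Run it like any other SQL, e.g. via the query endpoint shown earlier
    # or straight from the console's query editor.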

Ben Hagan:

I'm actually using the company name and industry, joining on the ticker symbol, coming from a different data source. And then also, I've got my live count coming through Kafka on the right-hand side here. So you're getting that real-time, leaderboard-style result here, but you're also matching that up, joining with another data source, which is cool. And then the final piece of this: typically, developers really don't want to be in the displays, the consoles we're looking at here. Typically, they're in their own IDEs. They're thinking about source control around the SQL, how it gets integrated with CI/CD, and how it gets versioned. That's really where our query Lambdas come into play. Let's remove these. Let's just have one query here. I'm going to hit this create query Lambda button. And what this does is, if I give this a creative name, it effectively wraps the SQL up, gives it a version, and makes it addressable via HTTP.

Ben Hagan:

So we almost give it a little microservice endpoint where you can trigger the execution of the SQL query, just by HTTP. And of course, anything that can speak HTTP can now integrate with your SQL. And we can see the definition here. I can also add parameters. So if I want to change things like user IDs or names, or any other SQL parameter, I can pass those in as HTTP parameters and place variables in here as well. As I mentioned, this is all versioned, so that when you're deploying into production, you can just update your versions as you need to.
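From the application side, triggering a query Lambda can look something like this sketch. The workspace, Lambda name, version, and parameter are made up for illustration, and the URL shape should be verified against the Rockset docs; the parameter binds to a :symbol placeholder assumed to exist in the saved SQL:

    # Hedged sketch: URL shape, names, and the parameter are illustrative.
    import requests

    API_KEY = "..."
    API_SERVER = "https://api.rs2.usw2.rockset.com"

    resp = requests.post(
        f"{API_SERVER}/v1/orgs/self/ws/commons/lambdas/top_tickers/versions/1",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        # Parameters bind to :name placeholders in the saved, versioned SQL.
        json={"parameters": [{"name": "symbol", "type": "string", "value": "AAPL"}]},
    )
    resp.raise_for_status()
    print(resp.json()["results"])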

Ben Hagan:

That's really how developers take a SQL query and then embed it into their real-time applications. There's actually a huge amount of tooling and support that sits behind this. That's definitely for another day, but if anyone's interested, there are plenty of resources on YouTube and, of course, in the Rockset documentation as well. Okay, I'm going to leave the demo there. Let's drop back to the slides. Want to touch on, briefly-

Shruti Bhat:

As you come off the demo, just a quick reminder to everyone: we are taking questions on chat. So if you want to type in your questions on chat while Ben is talking us through the rest of it, that'll be great. We will take them as and when they come in. Back to you, Ben.

Ben Hagan:

Perfect. Thanks, Shruti. I just wanted to touch briefly on the underlying architecture of Rockset. And of course, I'm guessing most people on this call are interested in what that looks like and how it works. So what you'll see here is we have this architecture that we refer to as ALT, Aggregator Leaf Tailer. What it effectively does is separate out ingest compute from storage compute from query compute. And that's really been the fundamental premise of how Rockset was developed from the start. These things fluctuate, and they obviously have different impacts on cost, but the key thing here is that everything surrounded by the pink line is managed by Rockset, by the platform. So these things can independently scale up and down, based on your workloads.

Ben Hagan:

And then, as we'll come onto in a moment, we can actually apply different compute resources to these, to allow you to get the exact performance-to-cost ratio that you require. Just at a really high level, tailers are responsible for ingesting the data. So as you just saw, we've got data coming from change data capture streams or Kafka, or whatever it may be. They keep the data synchronized so that it's always being ingested. And they will scale up accordingly, automatically, to meet those requirements. The leaf nodes are the actual storage nodes, and that's where we talk about hot and cold storage. That's where the underlying RocksDB system is used. That's all abstracted away, so it's not something you have to interface with, but that's what's happening under the covers. And then finally, when I'm querying this data, there's actually a two-level aggregator query architecture whereby the queries are distributed and run in parallel. That allows us to scale up that compute and actually increase the performance of those queries, as you would expect, in parallel.

Ben Hagan:

There are three sections to this, really: getting set up, getting started, and then how do we scale? How do we move into those bigger infrastructures and bigger clusters? In the Elastic world, looking specifically at the Elastic Cloud, you can pick and choose between the types of compute you require, whether you have a memory intensive application or a CPU intensive application, or it's more generalist, a bit of both. And the instance types are actually defined by mapping storage-to-memory ratios that are optimal for the type of instance that you're picking. You do need to think about capacity planning here, but it's very easy just to scale this stuff up, as we'll talk about in a bit more detail. How much disk space do I need? And then the other thing to think about is the different node types that come into effect as you scale up your cluster.

Ben Hagan:

Do I need dedicated master nodes or dedicated coordinating nodes? Those things get introduced as you scale up your platform or your cluster design. On the Rockset side, as we just talked about, one of the key things is that the storage auto-scales. So you could create an account, literally, while we're on this call now, and start throwing in tens of gigabytes, hundreds of gigabytes, terabytes, whatever it may be, and there's no impact from that. There's nothing you have to understand or tune or change to scale up the storage side. That's all taken care of for you. The way we actually manage the compute, as in how much resource is actually allocated to my virtual instance, is based on a number of CPUs and, similar to Elastic, a fixed ratio for the amount of memory that's allocated as well. We'll get into that in a few moments so you can get a clearer picture of what that looks like.

Ben Hagan:

And then finally, one of the other key considerations here is cloud durability. So one of the things that we haven't talked about in depth is the fact that when data comes into the platform, it gets moved onto that hot storage tier automatically, but it's also being backed up under the covers into an S3 storage environment. So you have this built-in durability for resilience, a core part of the infrastructure. One of the very recent innovations that we have, which we're just releasing at the moment, is the ability to deploy Rockset inside of your VPC. This is a really nice feature set in that you can keep the data in your own VPC, it never leaves your environment, yet the management of the platform, that SaaS aspect, is actually handled by Rockset externally. So it's a really good hybrid approach: the data stays in your own environment, with full control, security, things of that nature, and still without the operational burden of having to manage such an infrastructure on your own. So we're pretty excited about that release.

Ben Hagan:

Secondly, how things look when we start to think about scaling ingest. If we look at the Elasticsearch side of things, obviously a lot of this is use case specific, and that's really not where we want to go. On the Elastic side, if you're looking at metrics, observability, logging, there's a vast array of integrations and technologies, like Beats modules and Logstash, which makes that really nice, and a huge ecosystem. Things to think about on the ops side are: how am I going to query that data? The reason that matters is you need to make sure that you are denormalizing your data accordingly. So what data goes into what index actually matters, because that makes you think about, okay, if I'm writing a real-time application, what type of metrics do I need to run? What fields do I need to be able to query? And what happens if that changes later down the line? So things to think about: how do I denormalize that data? And, as with NoSQL systems and Rockset alike, what do those type mappings look like?

Ben Hagan:

I talked briefly about those on the Rockset side of things, but they are key considerations when you're thinking about ingest. And again, on the use case specific side, we've talked a little about these connectors and I've shown you a demonstration. So out-of-the-box, click-and-connect connectors are really the focus of Rockset, to make it simple to bring that data into the platform. On the consideration side of things, it's really focused on, as you saw, not needing to denormalize your data. You can literally throw it into the platform in whatever format and still get that relational representation of the data, and then optimize using just SQL. So you can be picking and choosing across all these different data sources, which may have different retention policies, to service your queries in real-time, as you need. Same as Elastic, you need to think about your mappings. Rockset will apply mappings at ingest time, and you can go in and override those if you need to.

Ben Hagan:

When we start moving into scaling, so scaling writes, what are the implications of that? And how's that done? On the Elastic side of things, there's the classic stuff like thinking about disk space, and there's all sorts of alerting and monitoring to look after these things for you. You may want to look at dedicated ingest nodes when your ingest workload gets significant. And I know a lot of people who are on this call, who are Elastic users on-premise, will be very familiar with that. It gives you the ability to split out your node types. And then at some point you want to start thinking about, when you're getting to decent scale, how your shards are managed. So from an ingest perspective, do you have enough primary shards? Are these spread evenly across your nodes so you're not getting hot nodes under significant load? And you may need to tweak and tune that, re-indexing data around to meet that change in shard strategy, if needed.

Ben Hagan:

On the Rockset side, the way we do this under the covers is we have this concept of microshards, and it's a very small, fragmented approach. What that means is that we don't have to move those indexes around wholesale. They naturally resettle themselves across a different number of nodes, automatically. So you never have any operational overhead of supporting that. And the other thing we briefly touched on is that all these indexes are fully mutable. Literally every key-value is mutable. So for upserts and updates at very high throughput, that's something that's a very good fit for Rockset. The other part of this, and we'll dig into this in a few moments as well, is efficient use of compute and storage. Because the compute and the storage are scaled independently, the hardware actually hugs your usage graph. So you're never leaving compute on the table that isn't being used, or storage on the table that isn't being used.
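To give a feel for the microsharding point, here is a conceptual sketch, not Rockset's internals: keys hash into a large, fixed number of small microshards, and changing the node count only reassigns whole microshards to nodes rather than re-indexing documents:

    # Conceptual sketch only; not Rockset's actual sharding implementation.
    import hashlib

    NUM_MICROSHARDS = 1024  # many more microshards than nodes

    def microshard(key: str) -> int:
        """Hash a key into one of the fixed microshards."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_MICROSHARDS

    def assignment(num_nodes: int) -> dict:
        # Map each microshard to a node; growing the cluster only moves
        # whole microshards, never individual documents.
        return {ms: ms % num_nodes for ms in range(NUM_MICROSHARDS)}

    before, after = assignment(4), assignment(8)
    moved = sum(1 for ms in range(NUM_MICROSHARDS) if before[ms] != after[ms])
    print(f"{moved}/{NUM_MICROSHARDS} microshards move when scaling 4 -> 8 nodes")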

Ben Hagan:

That goes for scaling up and scaling down. So I don't have to go and allocate myself a few terabytes worth of storage that I'm not using. It's taken care of for you. For efficiencies on the indexing side, we use RocksDB remote compaction, and that allows us to get really good efficiencies from storing these multiple indexes. And then a feature we've got coming soon is the ability to actually query your data on cold storage. That's not with us at the moment, but it's something that's coming in the future, where we'll actually be able to drop the data onto S3 and run live queries there, which will give you really nice efficiencies. That was the ingest side of things. On the read side of things, we deal with many customers who have a very high QPS (queries per second) requirement, where they're running real-time applications at scale. And you need to be able to support this high concurrency with complex queries. Now on the Elasticsearch side of things, you can simply increase the number of nodes or your read replicas.

Ben Hagan:

So again, you can scale this up to meet those requirements. Again, you might want to start thinking about what your shards and number of replicas actually look like under the covers. You can tweak and tune this stuff, and as it says here, you potentially have more overhead by having more shards, but effectively, if you have a small number of those, it may be faster. So as with a lot of distributed systems, a good way to test this out is actually to do it with your own data. Ingest the real data, and then run your own queries to get that solid, real performance benchmark. But on the Rockset side of things, as we mentioned already, we're decoupling compute from storage, and as I've already mentioned, the microshards are rebalanced when you change this compute. So on the right hand side there, you can see a little screenshot. Create an account, have a play, and you'll see this. You can just click a button to literally scale up and scale down, and there's no downtime. It's a very fast operation because of that microsharding implementation.

Ben Hagan:

We automatically rebalance this. There's no degradation to performance. There's no degradation to storage capacity, or anything of that nature. So it really is trivial to scale your compute up and down. And then just in the interest of time, I know we're almost at the end here, let's jump onto this slide. Here I was working with a customer, and there's actually a full blog article on this, as you can see at the bottom here. It was a similar use case where we were scaling a high number of queries per second coming into the platform. This was a real-time application, a web app and a mobile app. The query, specifically, that we were looking at here was actually really expensive for any system to perform. And what they did here was, we did this basic analysis and said, "How fast does this query run?" Along the X axis here, you can see the number of days of historical data that they were querying.

Ben Hagan:

They were going back 30, 60, and 90 days, depending on what that user interaction was, and they were able, as we've just talked about, to shift the virtual compute up and down to meet the performance requirements and the throughput requirements. So that's just an example of real metrics on a real query for a customer. Final section here, looking at index life cycle management. What do we do with data as it ages? It becomes potentially less relevant, or maybe it's not so important. In the logging side of the world, that's a really key one, where typically the last seven days' worth of logs are really important to me, but for security and compliance, you want to age this stuff out and store it. And Elastic supports the ILM set of features, where you can migrate data to different types of hardware, depending on the priority and importance of that data. On the cloud front, there's a pre-configured hot/warm instance type, which will take care of that for you.

Ben Hagan:

On the Rockset side of things, we effectively just manage the compute allocated to that data. In the warm environment, you still have hot storage, but with less compute. So you're just allocating fewer resources to that data; your queries are going to be slower, and obviously you get a better storage-to-price efficiency. And on the cold storage side, which I mentioned is coming soon, that's even less compute, actually querying on S3 directly. So that's really going to be the ability to store massive amounts of data, potentially outside of the platform, but still have the ability to query it, which is going to be super powerful. Okay. With that, I'll hand back over to Shruti.

Shruti Bhat:

Thanks, Ben. That was a really interesting set of slides there. As we go into the final section here, just a reminder: please use the chat window to ask questions. We will take questions in real-time here. The design goals we talked about in the beginning, right? We wanted to design a system that gave you a better experience at scale. Now obviously, Elasticsearch is used at massive scale in a lot of companies, and it does perform at scale. However, when you're operating at that scale, trying to get those low latency queries with high velocity ingest and massive data volumes, the biggest challenge is around operational efficiency, right? How do you operate at that scale? How do you keep your costs low? How do you make sure that you are not wasting resources? Because it can get really expensive if you're over-provisioning resources at that scale.

Shruti Bhat:

Our whole premise is that we believe there is a better scaling experience possible in the cloud. That's what we're shooting for, so that you get the performance you need, but you also get it at the price that matters. The second big design goal that we talked about was: hey, in the last few years, we've seen real-time data has changed. We've seen that you have data coming in many shapes, you have deeply nested data, you have Avro, you have Parquet, XML. And not only has the shape of the data changed, but the types of applications that need to query this data have been changing too, right? So how do you get that flexibility when you don't know ahead of time what the shape of your data is, or what types of queries you need to run? How can we give you the flexibility to just keep going really fast? Our approach there is to give you standard SQL, including JOINs, on semi-structured data. You don't need to denormalize anything.

Shruti Bhat:

The final part is what Ben spent a lot of time talking about. How can we give you serverless auto-scaling, right? Decoupling compute and storage is something that, interestingly enough, Snowflake has talked a lot about in the warehouse market, right? When Snowflake first did this in the cloud, it was really interesting because they were able to get groundbreaking price-performance efficiencies. It's very hard to do this in the real-time world, right? And this is what Rockset is doing in the real-time analytics space. We've been able to decouple compute and storage in the cloud, and still give you that low latency, real-time experience. And the whole point of it is that because we can do this in the cloud, we can make it completely serverless. You don't have to, you shouldn't have to, manage indexes and shards, and clusters, and node types, right? You had to do that in the data center world. But now in the cloud, is it really necessary to manage your shards? Right?

Shruti Bhat:

So we take away all of that, still give you a lot of visibility and control into what's happening, but automate the grunt work of managing it. That's our whole goal. As a result of doing all this, our eventual goal here is: how can we get you faster time to market? Interestingly enough, one of our customers, just yesterday, was sharing that they had a particular workload that started failing in production. Very quickly, they decided to switch that workload over to Rockset. And within an hour, they were up and running. In one hour flat, they were up and running in production. That sounds a little crazy, but you can do that in the cloud. That's important, right? So how can you get faster time to market when your developers are running really fast? Say you have a roadmap requirement: you have to support this new feature. How can you get up and running really fast? That's important.

Shruti Bhat:

And then on the TCO side, because of all of this cloud efficiency, we have seen up to 50% lower TCO. Of course, it all depends on your use case, your scale, and how you're operating it, but that's the goal, isn't it? To get to a much better cost model because of the efficiency you can get in the cloud. I'll stop here and switch over to questions. Also, in the meantime, Ben, if you want to switch to the last slide: we want to open it up for Q&A and tell you about our community channel, where we will hang out for a few more minutes after the session. And we'll continue to take questions over the Slack community. In the meantime, Julie, do you have any questions coming in?

Julie:

I do. I do have a couple questions coming in. One is, are there any plans to support Protobuf data, from the schema and query flexibility perspective? Ben, do you want to take this one?

Ben Hagan:

Yeah. I don't know if it's a roadmap item off the top of my head. We can find out, but there are a number of ways ... I assume the question is about ingesting Protobuf directly into Rockset as a native integration. At the moment, no, but I don't know about the roadmap, whether that's something we're looking at. There are obviously multiple ways you can architect to support that, but not a native integration today.

Julie:

Okay. And then the next question is, does Rockset support transactional OLTP workloads? And how does it compare to something like CockroachDB?

Ben Hagan:

Rockset is not a transactional system. The transactional systems are really good at doing what they do, right? Running transactions. We effectively pull in the data that you want to query, or offload that read workload into Rockset. And that's really those change data capture streams that we mentioned. We see customers all the time that are running things like Postgres and MySQL, and SQL Server. And they get to a point where the data ingest is really starting to affect their query performance, or they've just, in general, met the limits of what they can do with those types of systems. That's a really good use case for what we call transactional offloading, whereby they take the reads, the complex queries, and actually run those on Rockset, and keep the transactional side of things running on platforms like Postgres.

Ben Hagan:

And if you're powering or ingesting data into those systems, it's very easy to then push it straight into Rockset as well. Or, as we've already mentioned, you can take a change data capture stream from that transactional system and directly push it into Rockset. So you get the best of both worlds. The transactional system does what it does best, and then you can offload all your reads, and scale that up to cloud levels in Rockset.

Shruti Bhat:

If I can add to that, one way to think about it is, in the real world, imagine Amazon.com, right? Your shopping cart would absolutely be on your transactional database, right? That's where you need very high consistency, and you absolutely want your transactions to support your shopping cart. And then everything else, like product recommendations, personalization, inventory from third-party sellers, all of that will live in a different system. And you can imagine something like Rockset powering that. So we do not support transactions. That is really the trade-off here. We are an indexing database. We do not support transactions.

Julie:

Great. Keep the questions coming. Here's another one. What are the main differences between Lucene and RocksDB?

Ben Hagan:

That's a big question. But just at a high level, RocksDB is effectively a key-value store. As we use it within Rockset, we actually slice up all of that data and store it in a set format, which is key-values in this case. On the Lucene side of things, effectively, you're creating an inverted index. That's what it's doing under the covers. And the way it actually takes data out of memory and pushes it down to disk is very different. They're very different systems. And typically, Elastic is the best example of Lucene: fast textual search lookups, whereas RocksDB's use cases outside of Rockset are actually much broader. It's a general key-value store. Pretty high level, I'm afraid, but they're quite different in those respects.
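For intuition, here is a toy contrast along the lines Ben describes, purely illustrative: a Lucene-style inverted index maps terms to the documents containing them, while RocksDB is a general key-value store that can host that layout among many others:

    # Toy contrast, purely illustrative.
    docs = {1: "rockset indexes data", 2: "lucene indexes text"}

    # Lucene-style inverted index: term -> ids of documents containing it.
    inverted = {}
    for doc_id, text in docs.items():
        for term in text.split():
            inverted.setdefault(term, set()).add(doc_id)
    print(inverted["indexes"])  # {1, 2}

    # The same information as flat key-value pairs, the kind of layout a
    # general KV store like RocksDB can hold alongside row and column entries.
    kv = {("S", term, doc_id): None for term, ids in inverted.items() for doc_id in ids}
    print(sorted(kv))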

Julie:

Great. Those are all the questions that we're going to take for now. As Shruti did mention, we will be hanging out in the Rockset community. So please join us there, and we'll be taking any additional questions through that forum.

Shruti Bhat:

Great. Thanks, everyone. Thank you for joining us, and we'll see you over in the community channel.

Julie:

Thanks.

Ben Hagan:

Thanks, everyone.

