Real-Time Analytics at Speed and Scale: When Managing Elasticsearch Gets Too Hard

Indexing is commonly used to improve query performance when application speed is critical, and some of the most used large-scale systems today, like Google Search and Facebook News Feed, are built on indexing. When developers look to implement indexing, Elasticsearch is a well-known solution, primarily geared towards text search and logging use cases. However, Elasticsearch is complex to operate at scale.

In this tech talk, we discuss the requirements for real-time analytics and how indexing can be used in its implementation. We will compare and contrast indexing in Elasticsearch to Rockset’s cloud-based Converged Index and examine how these characteristics may impact how you build your applications.

  • Operations at scale - Elasticsearch requires expertise and effort to deploy, configure and manage on an ongoing basis. Aside from standing up the initial cluster, scaling Elasticsearch involves the complexities of managing hardware, sharding and reindexing. Rockset is a serverless system, offered as a fully managed service, so all operations—scaling, index management, upgrades—are handled transparently to users.
  • Data flexibility - Elasticsearch is optimized for search use cases but is less suitable for analytics across multiple data sets as it does not support joins. One alternative is to denormalize the data at ingest, but this does not scale beyond simple cases. Rockset ingests data without requiring a pre-defined schema and builds multiple indexes—search, column-based and row-based—on the data. Rockset allows developers to build their applications using full SQL functionality, including joins.
  • Real-time ingest - Using Elasticsearch for real-time analytics requires building and maintaining ingestion pipelines to sync data from operational databases. As Elasticsearch documents are immutable, each updated document has to be reindexed in full, consuming additional compute and I/O and impacting performance. Rockset, in contrast, has built-in connectors to common data sources. All documents are mutable and can be updated at the field level, allowing for efficient syncing with operational databases.

Speakers

Kevin Leong is Director of Product Marketing at Rockset, where he works closely with Rockset's product team and partners to help users realize the value of real-time analytics. He has been around data and analytics for the last decade, holding various product management roles at SAP, VMware, and MarkLogic.

Show Notes

Julie:

Good morning, and thanks for joining our talk today on Real-Time Analytics at Speed and Scale: When Managing Elasticsearch Gets Too Hard. I'm joined today by Kevin Leong, Rockset's Director of Product Marketing. Kevin works closely with product and partners on realizing real-time analytics, and he's also held previous product management roles at SAP, VMware and MarkLogic. For this talk, all participants will be muted, but we'd love to make the session interactive, so please feel free to drop questions into the chat and we'll answer them as they come in or at the end of the session. For now, I'm going to go ahead and turn it over to Kevin.

Kevin Leong:

Great, thanks for the intro, and welcome everybody to our talk today on Real-Time Analytics at Speed and Scale. So just to give folks an overview of where we'll be going today, I'll talk a bit about real-time analytics and the role indexing plays in it. Then we'll look at two different types of indexing, search indexing and converged indexing, across different dimensions such as operations, data flexibility and real-time ingest. And we'll look to close this out with a demo and take some Q&A following that. So with that, let's dive right in on real-time analytics and indexing.

Kevin Leong:

So really, why is real-time analytics gaining importance? Here we have a chart taken from some IDC research that shows the growth of real-time data over time, forecasted out to 2025. But really, more interesting than the magnitude of real-time data is the proportion of all data that real-time data will represent. So real-time data has been steadily increasing over time as a percentage of all data created, and it's expected to be greater than 25% in 2025. And this is the result of a whole lot more devices and machines throwing off a lot of real-time data. From the same report, another measure of real-time data might be the number of interactions a person has a day, the number of digital data engagements, as it was put in the report.

Kevin Leong:

And so we are seeing almost an exponential trend here; we are expecting a dramatic increase in the volume of digital data engagements, whether that's on an app, in a store or on a connected device. In 2025 it's estimated to be right around one interaction every 18 seconds per person, and that even includes the entire 24 hours, right? Even when you should be sleeping. So with people interacting digitally across devices and machines, there's just a lot of real-time data that's estimated to be growing in the next few years. So where does that leave us? Right? So let's define what we're talking about today when we talk about real-time analytics and applications.

Kevin Leong:

So what types of use cases are we discussing when we talk about real-time analytics? For sure, we talked about the real-time data that's being generated today; it's coming mainly from event streams or frequently updated data from a database. It's fast-moving data and there tends to be a lot of it. Right? And what are some of the analytics and applications that these real-time data are powering? We have things like ad optimization, A/B testing and fraud detection. Obviously, these are things you want to do in real time; you don't have the luxury of catching bad actors minutes or even several seconds after the fact. We have customers doing real-time Customer 360s, where you want to know everything that's going on with each individual customer, like the latest interactions.

Kevin Leong:

We also have users doing things like Entity 360s, where you want to know everything about an entity such as a company in order to make investment decisions and get investment insights. And then on the real-time application side, we have logistics use cases around fleet management, around distribution and delivery. Think companies like Uber, who have to track where cars and drivers and riders are at all times. We have things like gaming leaderboards that need to reflect scores and players in real time, and then real-time personalization and recommendations. Again, obviously, these require real-time information, because you want to be able to give customized experiences while the visitor is on your website.

Kevin Leong:

So these are the types of real-time analytics and applications that we're talking about in the context of this tech talk. What we're a little less concerned about in today's discussion are use cases like text search and logging; these are excellent use cases for Elasticsearch. But for the types of use cases mentioned on this slide, you will tend to see that they require complex analytics that go beyond just text and usually involve multiple data sets and having to join them together. And it's for these types of real-time analytics and app use cases that we'll discuss Elasticsearch, indexing and alternatives.

Kevin Leong:

So just to give you some examples of some of the uses and customers we've been seeing in the space: in the area of construction logistics, we have the company Command Alkon, and what they do is handle right around 80% of the concrete deliveries in the US. They're tracking millions of tickets for ordering and delivering concrete every day, and what's really important is the ability to do real-time search and analytics on all these job tickets. What Rockset allows them to do is implement a serverless stack for their software product that they in turn offer to their clients. So that's one example in the logistics space.

Kevin Leong:

And then another example coming out of gaming is EGoGames out of Europe, an esports platform for mobile games. What they're doing is they have multiple data sets: transactional data in DynamoDB, which they're trying to merge and combine with customer acquisition and retention data from the vendors they're sourcing it from, doing analysis on the joined data sets. They do some of these things for internal leaderboards, they do this to improve customer experience in the matchmaking process, and they do this for fraud detection as well. In all these situations, real-time analytics is really key, and so what Rockset allows them to do is query across all these data sources within seconds of the data being produced.

Kevin Leong:

What all these real-time analytics use cases have in common is really the requirement for both speed and scale. More specifically, speed in the form of being able to ingest and query the data that's being produced in real time, and also being able to have good query performance for low-latency queries. And then the ability to scale to support large volumes of data, into tens, even hundreds of terabytes. What we see is that traditional approaches are designed either for speed or for scale, typically not both, right? So you have OLTP systems that are great for speed, not so great when it comes to scaling out, and you have data warehouses which are great for large scale, not so great when it comes to getting real-time data in and being able to query at speed. So what if we use indexing techniques to address the speed and scale requirements of real-time analytics, to provide these real-time queries on large-scale data?

Kevin Leong:

If we look at well-known examples of products that achieve both speed and scale, Facebook News Feed and Google Search come to mind. These systems rely on indexing. So we thought, what if we could take the same concept and apply it to real-time analytics? And so that's the path that we went down at Rockset, to deliver a simple way for developers to build real-time analytics by way of indexing. And it's possibly the only way to achieve the speed and scale that many of these modern applications we've been describing require. Now, as we talk about indexing, Elasticsearch certainly comes to mind as one of the notable examples; Elasticsearch is all about indexing as well. So what I want to do is explore two types of indexing, search indexing used by Elasticsearch and converged indexing used by Rockset, and see how their designs relate to where they are best employed.

Kevin Leong:

So let's look at an overview of search indexing and converged indexing. Search indexing as popularized by Elasticsearch, it's based on an inverted index and with also the option for doc value storage in Elasticsearch which is built upon Apache Lucene, built primarily for text search and logging use cases. And with Rockset what we've come up with, what we call the conversion index, which is combined row-based index, column-store index and inverted index built on top of RocksDB, which is a key value store. And it was built with real-time analytics and applications in mind, some of the things that we talked about earlier. What this means for search indexing, the way you would use it is that you would have to typically denormalize data if you need to bring in multiple data sets and their search and aggregation queries are fast, that's what the index accelerates.

Kevin Leong:

On the converged index side, we've taken an approach where there's no need to denormalize your data, and we accelerate multiple types of queries: search, aggregation, time-based and geo queries, and also joins. We'll look at this in more detail in later sections of the talk. So what is converged indexing under the hood? Like I mentioned, columnar, inverted and row indexes are built in the same system, on top of RocksDB's key-value store abstraction. So if you take this example, we have two documents here; what we actually do is strip them down into key-value pairs, right? So each document will map to many key-value entries. You will see that two of these key-value entries pertain to the row store, two in this example are written into our column-store index, and another two into our inverted index. What's important to note is that all indexes are written atomically in the converged index and every individual field is mutable.
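
To make the shredding idea concrete, here is a minimal sketch of how a document could be broken into key-value entries that feed a row store, a column store and an inverted index at the same time. The key layouts are purely illustrative assumptions for this sketch, not Rockset's actual on-disk format.

```python
# Illustrative sketch only: shredding documents into key-value entries
# for three index types at once. Key layouts are hypothetical.

def shred(doc_id, doc):
    """Emit (key, value) pairs for row, column and inverted indexes."""
    entries = []
    for field, value in doc.items():
        # Row store: fetch a whole document by its id.
        entries.append((("row", doc_id, field), value))
        # Column store: scan all values of one field quickly.
        entries.append((("col", field, doc_id), value))
        # Inverted index: find all documents containing a value.
        entries.append((("inv", field, value), doc_id))
    return entries

docs = {1: {"name": "Igor", "interests": "databases"},
        2: {"name": "Dhruba", "interests": "databases"}}

kv = [e for doc_id, doc in docs.items() for e in shred(doc_id, doc)]

# A selective lookup would use the inverted index entries ...
matches = [v for k, v in kv if k == ("inv", "interests", "databases")]
# ... while a scan of one field would use the column-store entries.
names = [v for k, v in kv if k[0] == "col" and k[1] == "name"]
```

Because every field becomes its own set of keys, a write touches all three index representations atomically, and a single field can later be rewritten without touching the rest of the document.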

Kevin Leong:

Now, you might be thinking, how do we end up using the converged index, practically speaking? What Rockset does is pick between the different indexes depending on the query type, right? So the value of the converged index comes from accelerating query performance for multiple types of queries, whether you have a highly selective query on one side or a large scan on the other, and what happens is that the optimizer picks the most appropriate index to use depending on the situation. So with that, let's look in more detail at the different types of indexing that we described earlier, search indexing with Elasticsearch and converged indexing with Rockset, and take a deeper look at how these approaches compare across different dimensions, one being operations.

Kevin Leong:

So in terms of day-one operations, in terms of setting up an Elasticsearch or Rockset deployment: with Elasticsearch, you do have a self-managed option, right? You can manage it on-prem, or you can manage it on cloud infrastructure, setting up your own EC2 instances, for example, or there's the PaaS offering with Elastic Cloud as well. So multiple options there. On the Rockset side, it is built for the cloud, right? It is cloud only, and we've done this to take advantage of cloud elasticity. We recently also announced the option to deploy inside a customer's VPC, so that if you need additional security or compliance, the data won't leave your VPC, with the control plane being managed by Rockset. So those are the deployment options that you would have on the Elasticsearch side and the Rockset side.

Kevin Leong:

And then what happens with Elasticsearch is that it typically requires considerable planning upfront when standing up a cluster; you'll have to think about things like capacity, hardware types, node types and replication. That's completely different on the Rockset side, right? With Rockset, compute is decoupled from storage, such that compute size is not tied to storage size, so there are no hardware types to manage on that front. And then on the durability side, Rockset relies on cloud storage, with S3 being the destination for automatic backups. So those are some of the ways in which Rockset was designed to run in the cloud. In terms of day-two operations, now that you've set up the cluster, what do you need to do on the Rockset side and on the Elasticsearch side? You'll find that with Elasticsearch, there'll be things you'll have to do around shard management to spread the indexing load evenly across all nodes. You'll want to pick the optimal number of shards, and you may need to do things like re-indexing, which may be needed as sharding or replication configurations change or your source data changes.

Kevin Leong:

On the Rockset side, you will find that all these things are fully managed. Because the indexes are fully mutable, there is no need for re-indexing, and sharding is handled in the background by Rockset as a fully managed service. What we do is we have thousands of micro-shards that are auto-rebalanced by Rockset on the backend. So there's much less operational workload as well, and that's on a steady-state basis. And then as you think about capacity, as your data size grows, as the number of users grows, as your queries grow, you might need to do things like manage disk space on the Elasticsearch side to allow sufficient headroom for growth and periodic fluctuations in your data.

Kevin Leong:

And when you think about scaling, you will have to add nodes to your Elasticsearch cluster to deal with additional storage or compute requirements, and you might think about things like dedicated ingest nodes or read replicas to speed up writes or reads respectively. So these are all things you'll have to think about; you'll have to experiment and test to find the right balance across nodes, shards and replication on the Elasticsearch side. Whereas with Rockset, the compute and the storage, and actually, on the compute side, even the ingest compute and the query compute, can be scaled independently, so that users can scale efficiently and minimize wasted resources in the cluster. So you might ask, how does Rockset achieve that?

Kevin Leong:

This cloud scalability, this independent scalability of compute and storage: what we have adopted is a cloud-native architecture called the ALT, or Aggregator-Leaf-Tailer, architecture that's in use at companies like Facebook and LinkedIn. You will see here that there are basically three tiers: we have Tailers that do the ingestion, we have Leaves that store the data, and then we have Aggregators that handle the queries. Because we have three tiers that can be independently scaled, you can scale your ingest compute if you have a lot of ingestion at a particular point in time, you can scale your Leaves independently if you are storage-heavy, and you can scale your Aggregators independently if you have a lot of query compute but maybe not a whole lot of storage. So this is the way we've achieved the separation of ingest compute, query compute and storage in this cloud architecture. And behind the scenes, as mentioned, we're using RocksDB for persistence and for elasticity of storage as well.

Kevin Leong:

Actually, we're using RocksDB-Cloud, which is a cloud version of RocksDB that was modified to get this elasticity. So we see some of this come to fruition in the way we can scale really quickly to meet query or ingestion requirements. In this chart, we're looking at query performance, query response time, and what you'll see is a roughly linear improvement in query performance as we scale the compute from four to eight to 16 vCPUs for this particular customer example. So because of Rockset's easy, instant scalability, users can handle increased load as needed by simply switching in more compute resources. There's no need to provision nodes to add more compute. There's no need to rebalance data to make sure everything is well balanced for optimal performance. This is all done behind the scenes, so the operational burden with Rockset is really minimal, given the cloud deployment model that Rockset uses. So, that's on the operations side.

Julie:

We had a question a little while back, and I thought this might be a good time to put it in. Somebody had asked: typically, what kind of scale are we talking about from a data volume perspective? Are we talking about gigabytes, terabytes or petabytes?

Kevin Leong:

Yeah, we have customers across the spectrum, right? So one thing about this decoupling of compute and storage is that you can have fairly complex applications and analytics operating on potentially gigabytes of data. They're very compute-intensive, compute-heavy, and that's the way some applications are, but that's on the low end. Certainly, because storage can scale independently of compute, you can have a whole lot of data stored in Rockset, so we'll see tens, hundreds of terabytes. Typically, again, these are applications, so the data tends to be data that is in use. We won't typically see petabytes, because people usually use data lakes or some other archival storage for that, and they don't need it to be operated on in real time by an application. So yeah, we'll see from gigabytes to tens or hundreds of terabytes. Great, keep the questions coming.

Julie:

They will. Thank you.

Kevin Leong:

Okay, so the next section we'll talk about is data flexibility. So, how is data used and queried in Elasticsearch compared to Rockset? As mentioned previously, Elasticsearch is primarily search indexing based on Lucene for text and logging use cases, whereas Rockset leverages the converged index, which accelerates multiple types of complex analytics. So more general-purpose, real-time analytics, if you will. In terms of using the data, Elasticsearch has its own query DSL for writing queries and some basic support for non-native SQL. On the other hand, Rockset speaks full SQL, including JOINs, and supports the SQL ecosystem and visualization tools like Tableau, Grafana and so on.

Kevin Leong:

One thing that we'll talk about later is also the query lambdas feature that Rockset has, which is a way to execute SQL queries in Rockset by simply hitting a REST endpoint. So with this in mind, we're talking about data flexibility, which is how easy it is for you to use the data sets that you have. As we mentioned earlier, because there's no support for JOINs in Elasticsearch, what are some ways that developers and users get around that? One is to do JOINs on the application side, which typically introduces additional complexity and requires writing and maintaining additional code at the application level to implement the join operations. So we see some people do that. And then very commonly, you can denormalize the data on the way into Elasticsearch, and oftentimes it's something like a Spark job which takes multiple data sets, does the join and then writes the denormalized data to Elasticsearch.

Kevin Leong:

What we see here is that this obviously requires additional engineering effort to write the code to do the join and to denormalize. It can be on the order of a week to write this code, and then you have to maintain it over time, right? One of the downsides of this approach is that it requires additional storage, because we are now duplicating data into flattened documents. We've spoken to some users who report that this could be on the order of 100x amplification after the data is denormalized. And it also increases data latency, right? When we talk about real-time analytics, it's important to be able to query the data as quickly as possible. However, if it has to go through, let's say, a Spark job, it can be tens of minutes before that denormalized data is queryable.
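
A tiny, hypothetical example makes the storage-amplification point concrete: when orders are flattened against a customer table at ingest time, every order document carries a full copy of its customer's fields, and a change to the customer means rewriting every flattened document. The table and field names below are invented for illustration.

```python
# Hypothetical data: one customer, several orders referencing it.
customers = {"c1": {"name": "Acme", "region": "US-West", "tier": "gold"}}
orders = [{"order_id": i, "customer_id": "c1", "amount": 10 * i}
          for i in range(1, 5)]

# Ingest-time denormalization: each flattened document duplicates the
# full customer record alongside the order fields.
flattened = [{**o, **customers[o["customer_id"]]} for o in orders]

# If the customer's tier changes, every flattened document is now stale
# and must be rewritten and re-indexed.
stale = sum(1 for d in flattened if d["tier"] == "gold")
```

With a query-time join, the customer record is stored once and the update touches a single document; with denormalization, the duplication factor grows with the number of referencing documents, which is where reports of large amplification come from.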

Kevin Leong:

I think the upshot is that denormalizing data may not be a great idea if your data changes frequently, because you will have to update all the documents whenever a subset of the data changes, for instance. The indexing operation will also definitely take longer with flattened data sets, since more data is being indexed. So again, it takes more time, takes more effort, takes more maintenance if data changes frequently, and can cause performance issues, either with data latency or in the cluster as well. So whether you're thinking about doing application-side joins or denormalization, both aren't great options if you need to combine multiple data sets. It's often easier and better to do that in SQL and join at query time for real-time analytics situations.

Kevin Leong:

So with that in mind, Rockset is really designed for developer productivity, right? The increased data flexibility in the way you can use Rockset makes developers more productive overall. As we talked about earlier, there are no data pipelines for ingest-time joins or denormalization that you have to build and maintain. Something that's often underrated is the ability to use familiar SQL, and this is quite separate from comparing the functionality of query languages like Elasticsearch's query DSL to SQL. Aside from functionality, a lot of our customers report that many more people are able to interact with data because they're able to use SQL on it. It saves time when writing queries, in terms of complexity and in terms of understanding the correctness of the query you're writing, because it's a language people are familiar with.

Kevin Leong:

And then one thing I mentioned we'd talk about is query lambdas, which are named, parameterized SQL queries that are stored in Rockset and executed from a dedicated REST endpoint. With this feature, developers can write SQL queries that can be shared, organize their queries by versions and tags, collaborate efficiently and effectively, and also avoid having SQL embedded in application code. So this has been very valuable for a lot of the developer users we've been talking to. So let's move on to the ingest piece, right? How do you ingest data, and how quickly can you query it, in the search indexing case and in the converged indexing case?
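
As a rough sketch of what calling a query lambda from application code might look like: the code below only builds the request. The URL pattern, hostname and parameter payload are assumptions for illustration; the exact endpoint shape and regional hostname should be taken from Rockset's API documentation, and the workspace, lambda name and parameter here are invented.

```python
import json

API_KEY = "YOUR_API_KEY"  # placeholder credential

def build_lambda_request(workspace, lambda_name, tag, params):
    """Build the URL, headers and JSON body for executing a query lambda.

    Illustrative only: path layout and hostname are assumptions, not
    a verified Rockset endpoint.
    """
    url = (f"https://api.rockset.com/v1/orgs/self"
           f"/ws/{workspace}/lambdas/{lambda_name}/tags/{tag}")
    body = {"parameters": [{"name": k, "value": v}
                           for k, v in params.items()]}
    headers = {"Authorization": f"ApiKey {API_KEY}",
               "Content-Type": "application/json"}
    return url, headers, json.dumps(body)

url, headers, body = build_lambda_request(
    "commons", "top_symbols", "latest", {"lookback_hours": "24"})
# An application would then POST this with any HTTP client, e.g.:
#   requests.post(url, headers=headers, data=body)
```

The point of the pattern is that the SQL itself lives in Rockset, versioned and tagged, so the application only holds an endpoint and parameters, not query text.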

Kevin Leong:

So with Elasticsearch, if you need to do a continuous sync of your data, that's achieved through Beats agents or Logstash, and you'll often have to configure and manage them, or possibly write an ingestion pipeline, whether for the denormalization we talked about earlier or when an integration doesn't exist. In the Rockset situation, most of the data that we see comes from places like operational databases, streaming platforms or data lakes, and there are managed built-in connectors for these common sources, right? Mongo, Dynamo, Kafka, Kinesis, S3 and so on. It's simply click and connect, so there's minimal effort required from the user. Data can be ingested and queried as-is, so there's no denormalization or transformation required when you're using Rockset to query your data.

Kevin Leong:

One thing to note, and it's a bit of a difference in the indexing technique, is that with Elasticsearch, documents are immutable, which means that any update to a document or a field in a document will require a new doc to be indexed and the old version to be deleted. So it will trigger a re-index of the entire document. With Rockset, all documents are mutable and can be updated at the field level, so updates only result in a re-index of the affected fields. What this means is that with Elasticsearch, you'll need additional compute and I/O to re-index even the unchanged fields and to write entire documents upon update, which can be inefficient for real-time analytics scenarios that have constant updates, right? Especially if you're hooking up an operational database to your indexing layer, you might see constant updates coming in, and it might be inefficient to re-index entire documents whenever updates come in.
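
Continuing the earlier key-value picture, here is a small sketch of why per-field keys make updates cheap: changing one field rewrites only the entries derived from that field, whereas an immutable-document model rewrites entries for every field. The key layout and the dict standing in for the key-value store are, again, illustrative assumptions.

```python
# Sketch with a hypothetical key layout; the dict stands in for a
# key-value store (RocksDB, in Rockset's case).
store = {}

def write_field(doc_id, field, value):
    # One field produces one row-store entry and one column-store entry.
    store[("row", doc_id, field)] = value
    store[("col", field, doc_id)] = value

# Initial document with three fields: six key-value writes in total.
for f, v in {"name": "Acme", "tier": "gold", "region": "US-West"}.items():
    write_field("c1", f, v)

# Immutable-document model: an update rewrites every field's entries.
writes_for_full_reindex = 2 * 3
# Mutable-field model: only the changed field's entries are rewritten.
write_field("c1", "tier", "platinum")
writes_for_field_update = 2
```

For a document with many fields and a workload of constant single-field updates, the gap between those two write counts is exactly the extra compute and I/O described above.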

Kevin Leong:

So one of the design principles for Rockset is really data latency. As I talked about earlier, data latency is the time from when the data is produced to the time that it's queryable, the time that you can ask questions of it. Rockset is designed to minimize data latency even at higher ingestion rates. And the data latency time also includes any pipelines in the middle that you need for denormalization, and the time required to index or re-index a document. So in some of the tests that we've run, Rockset can maintain roughly a one-second data latency; that's the time from when the data is produced to when it's queryable. So one second at steady state while writing on the order of a billion documents a day. That's around 12 megabytes per second, based on events of size 1 KB, with a data latency of about one second. So pretty fast, right? You produce real-time data and you're able to query it within a second, and that's the way we've optimized and designed Rockset.
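
As a sanity check on those numbers, a billion events per day at roughly 1 KB each works out to on the order of 12 megabytes per second of sustained ingest:

```python
# Back-of-the-envelope check of the quoted ingest rate.
events_per_day = 1_000_000_000
event_size_bytes = 1_000            # ~1 KB per event
seconds_per_day = 24 * 60 * 60      # 86,400

events_per_second = events_per_day / seconds_per_day      # ~11,574
mb_per_second = events_per_second * event_size_bytes / 1_000_000
# ~11.6 MB/s, i.e. roughly 12 megabytes per second sustained
```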

Kevin Leong:

With that, we've talked about operations, we've talked about data flexibility, and I've talked about the ability to do real-time ingest and how Elasticsearch and search indexing compare to Rockset and converged indexing. I thought, for those of you who may not have seen Rockset in action, we could take some time to walk through a simple demo. So what I have here is the Rockset console. If you're a Rockset user and you go to console.rockset.com, this is what you'll see on the overview page. Oftentimes, when you want to start a real-time analytics project, what you want to do is create a collection using a data source. So let's say we have data in Apache Kafka, real-time event streams coming in; what we can do is create an integration with your Kafka cluster, and this has already been done in this case. Once we've done that, we can create a collection from, let's say, a Kafka topic, right?

Kevin Leong:

So you can give your collection a name and specify a topic. Let's say we have Twitter data flowing through Kafka that we want to bring into Rockset, so you specify that Kafka topic. And if the integration is up and running, it will show you a preview of the data that it will bring into the collection once you click Create. Once you've examined the preview and it looks satisfactory, looks realistic, right? It's got hashtags, it's got URLs in the tweets. You can do things like add some simple mappings to transform your data, your custom field mappings, or drop fields as required on the way in, and you can set a retention time for your data if you want to keep it for 30, 60 or 90 days, for instance. And then you go ahead and click Create on your collection. We've done that already, so we do have a collection created on the Twitter Kafka topic that's been ingesting all this data over time and retaining it for, I think, a week, and we can take a look at what's in this collection.

Kevin Leong:

So I can do a simple SQL query, right? The data is coming in and flowing in continuously, and I don't have to do any transformations or cleaning up of the data at all; I can simply run a SQL query, a select star, to see what it looks like. It might be more instructive to look at the raw JSON data coming in here; you can see that this is what Twitter data looks like. This is what a Twitter feed looks like: it is nested JSON, and it's got quite a lot of fields. The fields we're mostly interested in are under entities and symbols, right? Under entities, what Twitter gives you are things like the hashtags, URLs and mentions of people embedded in the tweet. Sometimes they also embed stock symbols, right? They'll do a dollar sign with the stock symbol. In this case, this particular tweet doesn't have any stock symbols, but that's what we'll be pulling out of tweets in subsequent queries.

Kevin Leong:

So what we've seen so far is a simple select star that returns data in a tabular format, and then in subsequent queries, I can do things like an unnest, right? Because there might be multiple symbols in a particular tweet, I want to unnest them. I want to check for situations where there are symbols in the tweet, and I want to pull all of them out for tweets that happened in the last day, for instance. So that's the query I'll run, and you will see the tweets; yes, these tweets have symbols, this one has Apple, this one has Facebook, for instance, and this query returns all the tweets in the last day that have symbols, in a table here. And then what I want to do next is some aggregations on the data coming back from the symbols in the tweets; I want to count up the number of times each symbol was mentioned over the last day.

Kevin Leong:

So if I run that query, I will find that Tesla, unsurprisingly, was mentioned most often, followed by Apple and some very common stocks, Amazon, some of the big ones up there. And if we get more complex with our SQL query, we can also try to join the stock ticker symbols that we've pulled out of Twitter with other data we have sitting in another data set, which tells us what TSLA actually is or what AAPL actually is. I have a file that has these stock symbols and tells us the company name and the industry it's in, and so we're joining the Twitter data with data from this file. And if we run that, we have the same stock ticker symbols in the top positions, and we've also matched each one up with the company name, so in cases where I may not know what GILD might be, it will tell me it's Gilead Sciences and it's in the biotech space. So that's great.
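
The unnest-then-aggregate-then-join flow from the demo can be imitated in miniature with SQLite's JSON functions; this is an analogy in a different SQL dialect, not Rockset's SQL, and the tweet and ticker data below are made up.

```python
import sqlite3
import json

# Tiny analogue of the demo: unnest the symbols array in each tweet,
# count mentions per symbol, and join against a ticker lookup table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (body TEXT);   -- raw JSON documents
    CREATE TABLE tickers (symbol TEXT, company TEXT, industry TEXT);
""")
tweets = [{"text": "to the moon", "symbols": ["TSLA", "AAPL"]},
          {"text": "ev rally", "symbols": ["TSLA"]}]
conn.executemany("INSERT INTO tweets VALUES (?)",
                 [(json.dumps(t),) for t in tweets])
conn.executemany("INSERT INTO tickers VALUES (?, ?, ?)",
                 [("TSLA", "Tesla", "Automotive"),
                  ("AAPL", "Apple", "Consumer Electronics")])

# json_each unnests the array so each symbol becomes its own row.
rows = conn.execute("""
    SELECT s.value AS symbol, k.company, COUNT(*) AS mentions
    FROM tweets t, json_each(t.body, '$.symbols') s
    JOIN tickers k ON k.symbol = s.value
    GROUP BY s.value, k.company
    ORDER BY mentions DESC
""").fetchall()
# rows -> [('TSLA', 'Tesla', 2), ('AAPL', 'Apple', 1)]
```

The shape of the result mirrors the demo: the most-mentioned symbol first, each enriched with company metadata from the joined table rather than from a denormalized document.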

Kevin Leong:

What if I want to productize this query, right? This query is great, I want to display it in a dashboard, and I want to call an API in order to do that. If I look under the Develop tab here, I can create a Query Lambda out of this query, which I've done, and that gives me an endpoint that I can simply copy into my code and hit whenever I want this data coming back to me. And if I want to change the query in any way, I can update the Query Lambda without changing my application code; it will just return the results from the updated query. So those are some of the ways in which Rockset makes developers' lives easier: with Query Lambdas, and with the ability to run SQL and joins on even complex nested data coming from a real-time stream, from Kafka in this case. So that's an overview of Rockset. Certainly feel free to try it out for yourself. Let me just click over here.

Kevin Leong:

You can get started for free, right? There's no credit card required to get started with our trial version, and you get $300 in free credits to try Rockset with your queries and your data set. So go ahead and try it out for yourself, and I'll conclude this session by talking about some of the things we've designed for with Rockset and converged indexing. We talked about speed at scale: low latency even at the scale of a billion messages a day, with new data queryable in seconds. Queries also return in milliseconds, even on terabytes of data, because we're using indexing to accelerate all kinds of different queries. Data flexibility is something else we designed for: the ability to run SQL on semi-structured data, and the ability to do joins directly on the data sets without having to denormalize or do joins on the application side.

Kevin Leong:

And then Rockset being designed and built for the cloud, with no ops, is definitely a big win for a lot of our users. You can scale easily, and it's optimized both for hardware efficiency and for the human cost of operating a cluster, scaling, managing and so on. Very little human intervention is required with Rockset. So that's what I had for today's webinar. If you have any questions, please send them in via the Q&A tool in your GoToWebinar interface, and Julie, you can pick them up and send them to me.

Julie:

I will do that. Yes. Please continue to type in your questions. I have one question to start, which is: how do you configure the Converged Index, given it indexes data in three different ways?

Kevin Leong:

Yeah, we actually get that question quite a bit, and the general thought here is that when you have an index, you need to set it up, configure it and manage it. One of the beauties of the way we've done it at Rockset is that you don't have to do any of that. When we say that we build multiple indexes, in this case the column index, the row index and the inverted index, we do that automatically on every field of all the data that you ingest into Rockset. So there's no need for any user configuration or management; all of that is done behind the scenes, automatically, and every query has access to the multiple indexes that are automatically built on all the data.

Julie:

And, Kevin, what types of data can be indexed using the converged index?

Kevin Leong:

Yeah, there are two different ways to answer that question. One of the things I spoke about is the built-in support for common data sources, right? We see a lot of users with real-time data in operational databases, especially NoSQL databases like DynamoDB and MongoDB. We see real-time data coming in from Kafka and Kinesis. Sometimes there's a lot of data sitting in data lakes, not as real-time as some of the other sources, but you may want to combine that data with your real-time data to get, let's say, a Customer 360, for example. So those are the common data sources we have connectors for, among others. And then in terms of the data formats that can be indexed from these sources: JSON is obviously very popular, and then XML, Parquet, and Avro coming from Kafka, for instance. So semi-structured data coming from these common data sources.

Julie:

And does building multiple indexes result in using more storage?

Kevin Leong:

Yeah. I'll get to that question in a bit, but let me just add something to the previous question, which is: yes, we have connectors to these common data sources, but a lot of customers also use our REST API. They can do streaming ingest via the REST API. Let's say we don't have a connector to a data source, or it's something pretty custom, they can simply use the write API to ingest the data as well. And then to your question, Julie, which was: doesn't indexing result in using a lot more storage? I think what we see here is a trade-off, right? With indexing you get better query performance, which is what we're optimizing for with real-time analytics and real-time applications, and it will result in more storage being used.

Kevin Leong:

But storage is relatively cheap compared to, let's say, the compute you would need to deliver good query performance without indexing, if that were even possible. Storage is also relatively cheap compared to the human cost of having to write and maintain data pipelines just to get your data into a shape where it can be queried, and queried in a performant way. So the trade-off here is that we're using storage, mostly S3 storage, which is relatively cheap, to store our indexes and deliver good performance, and a lot of users see that as a very positive trade-off.

Julie:

Great. Thanks, Kevin. I appreciate you taking the time today for the talk and we are going to go ahead and close it out. If you do have any other questions, please feel free to follow up and email us. Kevin's email is Kevin@rockset.com.

Kevin Leong:

That's right, it's on the slide. Thanks all for joining us today.

Julie:

Take care, Bye.
