Mutable Data in Rockset

January 19, 2024

,
Register for
Index Conference

Hear talks on search and AI from engineers at Netflix, DoorDash, Uber and more.

Data mutability is the ability of a database to support mutations (updates and deletes) to the data that’s stored inside it. It’s a critical feature, especially in real-time analytics where data constantly changes and you need to present the latest version of that data to your customers and end users. Data can arrive late, it can be out of order, it can be incomplete or you might have a scenario where you need to enrich and extend your datasets with additional information for them to be complete. In either case, the ability to change your data is very important.

real-time-mutations

Rockset is fully mutable

Rockset is a fully mutable database. It supports frequent updates and deletes on document level, and is also very efficient at performing partial updates, when only a few attributes (even those deeply nested ones) in your documents have changed. You can read more about mutability in real-time analytics and how Rockset solves this here.

Being fully mutable means that common problems, like late arriving data, duplicated or incomplete data can be handled gracefully and at scale within Rockset.

There are three different ways how you can mutate data in Rockset:

  1. You can mutate data at ingest time through SQL ingest transformations, which act as a simple ETL (Extract-Transform-Load) framework. When you connect your data sources to Rockset, you can use SQL to manipulate data in-flight and filter it, add derived columns, remove columns, mask or manipulate personal information by using SQL functions, and so on. Transformations can be done on data source level and on collection level and this is a great way to put some scrutiny to your incoming datasets and do schema enforcement when needed. Read more about this feature and see some examples here.
  2. You can update and delete your data through dedicated REST API endpoints. This is a great approach if you prefer programmatic access or if you have a custom process that feeds data into Rockset.
  3. You can update and delete your data by executing SQL queries, as you normally would with a SQL-compatible database. This is well suited for manipulating data on single documents but also on sets of documents (or even on whole collections).

In this blog, we’ll go through a set of very practical steps and examples on how to perform mutations in Rockset via SQL queries.

Using SQL to manipulate your data in Rockset

There are two important concepts to understand around mutability in Rockset:

  1. Every document that is ingested gets an _id attribute assigned to it. This attributes acts as a primary key that uniquely identifies a document within a collection. You can have Rockset generate this attribute automatically at ingestion, or you can supply it yourself, either directly in your data source or by using an SQL ingest transformation. Read more about the _id field here.
  2. Updates and deletes in Rockset are treated similarly to a CDC (Change Data Capture) pipeline. This means that you don’t execute a direct update or delete command; instead, you insert a record with an instruction to update or delete a particular set of documents. This is done with the insert into select statement and the _op field. For example, instead of writing delete from my_collection where id = '123', you would write this: insert into my_collection select '123' as _id, 'DELETE' as _op. You can read more about the _op field here.

Now that you have a high level understanding of how this works, let’s dive into concrete examples of mutating data in Rockset via SQL.

Examples of data mutations in SQL

Let’s imagine an e-commerce data model where we have a user collection with the following attributes (not all shown for simplicity):

  • _id
  • name
  • surname
  • email
  • date_last_login
  • country

We also have an order collection:

  • _id
  • user_id (reference to the user)
  • order_date
  • total_amount

We’ll use this data model in our examples.

Scenario 1 - Update documents

In our first scenario, we want to update a specific user’s e-mail. Traditionally, we would do this:

update user 
set email = 'new_email@company.com' 
where _id = '123';

This is how you would do it in Rockset:

insert into user 
select 
    '123' as _id, 
    'UPDATE' as _op, 
    'new_email@company.com' as email;

This will update the top-level attribute email with the new e-mail for the user 123. There are other _op commands that can be used as well - like UPSERT if you want to insert the document in case it doesn’t exist, or REPLACE to replace the full document (with all attributes, including nested attributes), REPSERT, etc.

You can also do more complex things here, like perform a join, include a where clause, and so on.

Scenario 2 - Delete documents

In this scenario, user 123 is off-boarding from our platform and so we need to delete his record from the collection.

Traditionally, we would do this:

delete from user
where _id = '123';

In Rockset, we will do this:

insert into user
select 
    '123' as _id, 
    'DELETE' as _op;

Again, we can do more complex queries here and include joins and filters. In case we need to delete more users, we could do something like this, thanks to native array support in Rockset:

insert into user
select 
    _id, 
    'DELETE' as _op
from
    unnest(['123', '234', '345'] as _id);

If we wanted to delete all records from the collection (similar to a TRUNCATE command), we could do this:

insert into user
select 
    _id, 
    'DELETE' as _op
from
    user;

Scenario 3 - Add a new attribute to a collection

In our third scenario, we want to add a new attribute to our user collection. We’ll add a fullname attribute as a combination of name and surname.

Traditionally, we would need to do an alter table add column and then either include a function to calculate the new field value, or first default it to null or empty string, and then do an update statement to populate it.

In Rockset, we can do this:

insert into user
select
    _id,
    'UPDATE' as _op, 
    concat(name, ' ', surname) as fullname
from 
    user;

Scenario 4 - Remove an attribute from a collection

In our fourth scenario, we want to remove the email attribute from our user collection.

Again, traditionally this would be an alter table remove column command, and in Rockset, we will do the following, leveraging the REPSERT operation which replaces the whole document:

insert into user
select
    * 
    except(email), --we are removing the email atttribute
    'REPSERT' as _op
from 
    user;

Scenario 5 - Create a materialized view

In this example, we want to create a new collection that will act as a materialized view. This new collection will be an order summary where we track the full amount and last order date on country level.

First, we will create a new order_summary collection - this can be done via the Create Collection API or in the console, by choosing the Write API data source.

Then, we can populate our new collection like this:

insert into order_summary
with
    orders_country as (
        select
            u.country,
            o.total_amount,
            o.order_date
        from
            user u inner join order o on u._id = o.user_id
)
select
    oc.country as _id, --we are tracking orders on country level so this is our primary key
    sum(oc.total_amount) as full_amount,
    max(oc.order_date) as last_order_date
from
    orders_country oc
group by
    oc.country;

Because we explicitly set _id field, we can support future mutations to this new collection, and this approach can be easily automated by saving your SQL query as a query lambda, and then creating a schedule to run the query periodically. That way, we can have our materialized view refresh periodically, for example every minute. See this blog post for more ideas on how to do this.

Conclusion

As you can see throughout the examples in this blog, Rockset is a real-time analytics database that is fully mutable. You can use SQL ingest transformations as a simple data transformation framework over your incoming data, REST endpoints to update and delete your documents, or SQL queries to perform mutations at the document and collection level as you would in a traditional relational database. You can change full documents or just relevant attributes, even when they are deeply nested.

We hope the examples in the blog are useful - now go ahead and mutate some data!